Spark SQL Project Application

DSJ_ kohler

Category: Big Data Daily Notes. Tags: spark, data visualization, big data, data analysis

Article link: https://blog.csdn.net/qq_38705144/article/details/113574167

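Requirement:

Compute the top 3 popular products in each area

1. There are three tables: a user behavior table, a city table, and a product table.
2. Output columns: area, product name, click count, city remark. (Compute the three most-clicked products in each area, and annotate each product with its click distribution across the main cities; when a product is clicked in more than two cities, the remaining cities are shown as "other".)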

Table 1: city table (city_id, city_name, area)

1	北京	华北
2	上海	华东
3	深圳	华南
4	广州	华南
5	武汉	华中
6	南京	华东
7	天津	华北
8	成都	西南
9	哈尔滨	东北
10	大连	东北
11	沈阳	东北
12	西安	西北
13	长沙	华中
14	重庆	西南
15	济南	华东
16	石家庄	华北
17	银川	西北
18	杭州	华东
19	保定	华北
20	福州	华南
21	贵阳	西南
22	青岛	华东
23	苏州	华东
24	郑州	华北
25	无锡	华东
26	厦门	华南
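
To make the layout of the city table concrete, here is a minimal sketch in plain Scala (no Spark needed) that parses tab-separated lines of this shape and groups city names by area. The names `CityTableSketch`, `City`, and `parse` are made up for illustration and do not appear in the original project.

```scala
object CityTableSketch {
  // Hypothetical row type; the field names follow the columns city_id, city_name, area.
  final case class City(cityId: Int, cityName: String, area: String)

  // Parses one tab-separated line such as "1\t北京\t华北".
  def parse(line: String): City = {
    val Array(id, name, area) = line.split("\t")
    City(id.trim.toInt, name, area)
  }

  def main(args: Array[String]): Unit = {
    // Three rows copied from the city table above.
    val lines = Seq("1\t北京\t华北", "7\t天津\t华北", "2\t上海\t华东")
    // Group the city names by area, which is the mapping the later join relies on.
    val byArea = lines.map(parse).groupBy(_.area).map { case (a, cs) => a -> cs.map(_.cityName) }
    println(byArea("华北"))
  }
}
```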

Table 2: user table
Table 3: product table

import org.apache.spark.sql.{SaveMode, SparkSession}

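/**
  * @ClassName: Hotgoods
  * @Description: Compute the top 3 popular products in each area.
  *              1. There are three tables: a user behavior table, a city table, and a product table.
  *              2. Output columns: area, product name, click count, city remark (compute the three most-clicked products in each area and annotate each product with its click distribution across the main cities; cities beyond the top two are shown as "other").
  * @Author: kele
  * @Date: 2021/2/2 16:30
  **/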
object Hotgoods {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("goods").master("local[4]").getOrCreate()

    import org.apache.spark.sql.functions._
    spark.udf.register("ACity",udaf(new AnalyzeCity))

    //1. Load the user behavior data, keep only the rows that record a click, and register them as a temp view
    spark.read.option("sep","\t")
      .option("inferSchema","true")
      .csv("E:\\data\\user_visit_action.txt")
      .toDF("date","user_id","session_id","page_id","action_time","search_keyword","click_category_id","click_product_id","order_category_ids","order_product_ids","pay_category_ids","pay_product_ids","city_id")
      .filter("click_category_id !=-1 ")
      .createOrReplaceTempView("user_info")

    //2. Load the product data
    spark.read.option("sep","\t")
      .option("inferSchema","true")
      .csv("E:\\data\\product_info.txt")
      .toDF("product_id","product_name","extend_info")
      .createOrReplaceTempView("product_info")

    //3. Load the city and area data
    spark.read.option("sep","\t")
      .option("inferSchema","true")
      .csv("E:\\data\\city_info.txt")
      .toDF("city_id","city_name","area")
      .createOrReplaceTempView("city_info")

    spark.sql(
      """
        |select c.area area,b.product_name product_name,c.city_name city_name
        |from user_info as a join product_info as b
        |on a.click_product_id = b.product_id
        |join city_info as c
        |on a.city_id=c.city_id
      """.stripMargin).createOrReplaceTempView("InintForm")

    // After grouping, no built-in function reports the cities with their click counts, so a custom UDAF is used
    spark.sql(
      """
        |select area,product_name,count(1) num,ACity(city_name) cityinfo
        |from InintForm
        |group by area,product_name
      """.stripMargin).createOrReplaceTempView("InintForm2")

    spark.sql(
      """
        |select t1.area,t1.product_name,t1.num,t1.cityinfo from(
        |select area,product_name,num,cityinfo,rank() over(partition by area order by num desc) rk
        |from InintForm2)t1
        |where t1.rk<=3
      """.stripMargin).repartition(1).write.mode(SaveMode.Overwrite).option("header","true").csv("E:/result")

  }
}
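
One detail of the last query is worth illustrating: `rank()` gives tied rows the same rank, so an area can return more than three products when click counts tie at the boundary, whereas `row_number()` would cap the result at three. The sketch below reproduces both behaviors with plain Scala collections. The `Row` type, the function names, and the sample products are hypothetical, and breaking ties by product name is an assumption made here only to keep the output deterministic.

```scala
object RankSketch {
  // Hypothetical row type for illustration; mirrors the columns area, product_name, num.
  final case class Row(area: String, productName: String, num: Long)

  // rank() semantics: a row's rank is 1 plus the number of rows in its area with a
  // strictly larger num, so tied rows share a rank and a tie at rank 3 keeps all of them.
  def top3ByRank(rows: Seq[Row]): Seq[Row] =
    rows.groupBy(_.area).values.flatMap { group =>
      group
        .filter(r => 1 + group.count(_.num > r.num) <= 3)
        .sortBy(r => (-r.num, r.productName))
    }.toSeq

  // row_number() semantics: every row gets a distinct position, so at most three survive.
  // Ties are broken by product name here (an assumption, not something the SQL specifies).
  def top3ByRowNumber(rows: Seq[Row]): Seq[Row] =
    rows.groupBy(_.area).values.flatMap { group =>
      group.sortBy(r => (-r.num, r.productName)).take(3)
    }.toSeq

  def main(args: Array[String]): Unit = {
    // Made-up sample data: p3 and p4 tie for third place.
    val rows = Seq(Row("华北", "p1", 10), Row("华北", "p2", 8), Row("华北", "p3", 5), Row("华北", "p4", 5))
    println(top3ByRank(rows).size)
    println(top3ByRowNumber(rows).size)
  }
}
```

Whether ties should produce extra rows depends on the requirement; the original query keeps them.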

Custom UDAF function

Reports the cities and their click counts for each group

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

import scala.collection.mutable

/**
  * @ClassName: StaticCity
  * @Description:
  * @Author: kele
  * @Date: 2021/2/2 18:32
  **/

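/** Type of the intermediate buffer.
  * The buffer holds two values: 1. the total click count (used as the denominator)
  *                              2. each city's name with its click count (used as the numerator)
  */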

case class bufferValue(var count:Int,var city_info:mutable.Map[String,Int])

class AnalyzeCity extends Aggregator[String,bufferValue,String]{

  /**
    * Initial value of the buffer
    * @return an empty bufferValue
    */
  override def zero: bufferValue = bufferValue(0,mutable.Map[String,Int]())

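  /**
    * Aggregation within a single task:
    *   increments the total count and the click count of the given city
    * @param buffer the buffer accumulated so far
    * @param city   the city name of the current row
    * @return the updated buffer
    */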
  override def reduce(buffer: bufferValue, city: String): bufferValue = {

    /**
      * If the city is already in the map, increment its count; otherwise add it with a count of 1
      */
    if(buffer.city_info.contains(city)){

      val city_num = buffer.city_info.get(city).get+1

      buffer.city_info.put(city,city_num)

    }else{
      buffer.city_info.put(city,1)
    }

    buffer.count = buffer.count +1

    buffer
  }

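  /**
    * Merges the buffers produced by different partitions
    * @param b1 the first buffer
    * @param b2 the second buffer
    * @return b1, updated with the combined totals
    */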
  override def merge(b1: bufferValue, b2: bufferValue): bufferValue = {

    val buffer = b1.city_info.toList:::b2.city_info.toList

    val buff = buffer.groupBy(_._1).map(x=>{
      val num = x._2.map(_._2).sum
      (x._1,num)
    })

    b1.count = b1.count + b2.count

    b1.city_info = mutable.Map[String,Int]().++=(buff)

    b1

  }

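  /**
    * Builds the final result
    * Note the output format: the two cities with the largest share, followed by the remainder as "other"
    * @param reduction the fully merged buffer
    * @return the city remark string
    */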
  override def finish(reduction: bufferValue): String = {

    val take2 = reduction.city_info.map(x=>{
      val percent = x._2.toDouble/reduction.count*100
      (x._1,percent)
    }).toList.sortBy(_._2).reverse.take(2)

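    // Clamp at 0 so floating-point rounding cannot produce a small negative remainder
    val other = math.max(0.0, 100 - take2.map(_._2).sum)

    // The f interpolator replaces formatted, which is deprecated since Scala 2.13
    val first2 = take2.map(x => f"${x._1}:${x._2}%.3f%%")

    // Format the remainder with three decimals too, instead of printing a raw Double
    val otherText = f"other:$other%.3f%%"

    (first2 :+ otherText).mkString(",")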

  }

  override def bufferEncoder: Encoder[bufferValue] = Encoders.product

  override def outputEncoder: Encoder[String] = Encoders.STRING
}
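
The aggregation logic can be checked without starting a Spark session. The following is a minimal sketch in plain Scala that builds the same kind of remark string from a list of city names; `CityRemarkSketch` and `cityRemark` are hypothetical names introduced here, not part of the original code. It sorts by click count with the city name as a tie-breaker, and uses `Locale.ROOT` so the decimal separator does not depend on the machine's locale; both are choices of this sketch.

```scala
import java.util.Locale

object CityRemarkSketch {
  // Hypothetical helper: mirrors what reduce and finish compute together,
  // but over an in-memory list of city names instead of a Spark buffer.
  def cityRemark(cities: Seq[String]): String = {
    val total = cities.size
    // Click count per city, like the map kept in the aggregation buffer.
    val counts: Map[String, Int] = cities.groupBy(identity).map { case (c, xs) => c -> xs.size }
    // Keep the two cities with the most clicks and convert their counts to percentages.
    val top2 = counts.toList
      .sortBy { case (city, n) => (-n, city) }
      .take(2)
      .map { case (city, n) => (city, n.toDouble / total * 100) }
    // Everything outside the top two is reported as "other".
    val other = math.max(0.0, 100 - top2.map(_._2).sum)
    val parts = top2.map { case (city, pct) => "%s:%.3f%%".formatLocal(Locale.ROOT, city, pct) }
    (parts :+ "other:%.3f%%".formatLocal(Locale.ROOT, other)).mkString(",")
  }

  def main(args: Array[String]): Unit = {
    // Made-up clicks: three in 北京, two in 天津, one in 保定.
    println(cityRemark(Seq("北京", "北京", "北京", "天津", "天津", "保定")))
    // prints 北京:50.000%,天津:33.333%,other:16.667%
  }
}
```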

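Result: the job writes a single CSV file with a header row to E:/result, with the columns area, product_name, num, and cityinfo.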
