I. Environment Requirements
Spark running in IDEA can connect to an external Hive on a virtual machine; see the separate write-up "IDEA中Spark连接外置hive详细步骤" (detailed steps for connecting Spark in IDEA to an external Hive).
Spark 3.0.0; Hadoop 3.2.1; Hive 3.1.2
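For reference, a minimal sketch of what the IDEA project might pull in for this setup (a hypothetical build.sbt fragment; the artifact versions simply mirror the list above and should be checked against your own cluster):

// Hypothetical build.sbt dependencies for running the example below from IDEA.
// spark-hive is what enableHiveSupport() relies on.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0",
  "org.apache.spark" %% "spark-sql"  % "3.0.0",
  "org.apache.spark" %% "spark-hive" % "3.0.0"
)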
II. Data Preparation
One user-behavior table, one city table, and one product table.
- User-behavior table user_visit_action:
Records four kinds of user behavior: search, click, order, and payment. Data rules:
➢ Fields in each line of the data file are separated by underscores
➢ Each line represents one user action, and that action is exactly one of the four types
➢ If the clicked category ID and product ID are -1, the record is not a click record
Key fields for this requirement:
click_product_id, city_id
- City table city_info:
Contains the city and its region
- Product table product_info:
Contains the product name and the channel (self-operated or third-party)
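As a quick illustration of the -1 convention and the two key fields above (a hypothetical probe against the user_visit_action table created in section IV; not part of the final solution):

// Count only genuine click events: non-click rows carry click_product_id = -1
spark.sql(
  """
    |select count(*) as click_cnt
    |from user_visit_action
    |where click_product_id > -1
    |""".stripMargin
).show()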
III. Requirement
Requirement: Top 3 hot products per region.
"Hot" here is measured by click count: for each region, compute the three most-clicked products, and annotate each product with its click distribution across the major cities; beyond the top two cities, everything is folded into 其他 (other). Example output:

| Region | Product name | Clicks | City remark |
| --- | --- | --- | --- |
| 华北 | 商品A | 100000 | 北京 21.2%, 天津 13.2%, 其他 65.6% |
| 华北 | 商品P | 80200 | 北京 63.0%, 太原 10%, 其他 27.0% |
| 华北 | 商品M | 40000 | 北京 63.0%, 太原 10%, 其他 27.0% |
| 东北 | 商品J | 92000 | 大连 28%, 辽宁 17.0%, 其他 55.0% |
IV. Implementation
1. Create the three tables:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
System.setProperty("HADOOP_USER_NAME", "root")
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("sparkSQL")
val spark = SparkSession
  .builder()
  .enableHiveSupport()
  .config(sparkConf)
  .config("dfs.client.use.datanode.hostname", "true")
  .config("dfs.replication", "2")
  .getOrCreate()
spark.sql("use hive")
// Prepare the data: create the Hive tables and load the local data files
spark.sql(
"""
|CREATE TABLE `user_visit_action`(
| `date` string,
| `user_id` bigint,
| `session_id` string,
| `page_id` bigint,
| `action_time` string,
| `search_keyword` string,
| `click_category_id` bigint,
| `click_product_id` bigint,
| `order_category_ids` string,
| `order_product_ids` string,
| `pay_category_ids` string,
| `pay_product_ids` string,
| `city_id` bigint)
|row format delimited fields terminated by '\t'
""".stripMargin)
spark.sql(
"""
|load data local inpath './datas/user_visit_action.txt' into table hive.user_visit_action
""".stripMargin)
spark.sql(
"""
|CREATE TABLE `product_info`(
| `product_id` bigint,
| `product_name` string,
| `extend_info` string)
|row format delimited fields terminated by '\t'
""".stripMargin)
spark.sql(
"""
|load data local inpath './datas/product_info.txt' into table hive.product_info
""".stripMargin)
spark.sql(
"""
|CREATE TABLE `city_info`(
| `city_id` bigint,
| `city_name` string,
| `area` string)
|row format delimited fields terminated by '\t'
""".stripMargin)
spark.sql(
"""
|load data local inpath './datas/city_info.txt' into table hive.city_info
""".stripMargin)
spark.sql("""select * from city_info""").show
spark.close()
}
While running this, the job failed with the following error:
ERROR Hive: Failed to move: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hive/warehouse/hdu.db/user_visit_action/user_visit_action.txt could only be written to 0 of the 1 minReplication nodes. There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
The message suggests a DataNode problem, but the DataNode was in fact up and running. The eventual explanation and fix:
The NameNode only stores file-system metadata (directories and file names) and is reachable from my machine over the public network, so creating directories works fine. When an upload actually needs to write blocks to a DataNode, however, the NameNode and DataNode communicate over the cluster's internal network, so the NameNode hands the client the DataNode's private IP. That address is unreachable from my machine and is never translated to a public IP. The fix is to make the HDFS client ask for hostnames instead; with a hostname-to-public-IP mapping (for example in the local hosts file), the DataNode becomes reachable and the error goes away.
Because configuration set in code has the highest priority, the simplest fix is to add the settings directly when building the SparkSession:
.config("dfs.client.use.datanode.hostname", "true")
.config("dfs.replication", "2")
2. Implement the requirement
The city-remark text is the fiddly part; it can be built either with a UDF/UDAF (a sketch follows below) or in pure SQL, which is the approach taken here.
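For completeness, a rough sketch of what the UDF/UDAF route could look like with a Spark 3 Aggregator; the class name, buffer shape and formatting below are illustrative assumptions, not the code this article actually uses (the pure-SQL version follows):

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders, functions}

// Buffer: total clicks plus per-city click counts for one (area, product) group
case class CityRemarkBuffer(total: Long, cityMap: Map[String, Long])

// IN = city_name of one click row, OUT = remark text such as "北京 21.2%,天津 13.2%,其他 65.6%"
class CityRemarkUDAF extends Aggregator[String, CityRemarkBuffer, String] {
  def zero: CityRemarkBuffer = CityRemarkBuffer(0L, Map.empty)

  def reduce(buf: CityRemarkBuffer, city: String): CityRemarkBuffer =
    CityRemarkBuffer(buf.total + 1, buf.cityMap + (city -> (buf.cityMap.getOrElse(city, 0L) + 1L)))

  def merge(b1: CityRemarkBuffer, b2: CityRemarkBuffer): CityRemarkBuffer =
    CityRemarkBuffer(
      b1.total + b2.total,
      b2.cityMap.foldLeft(b1.cityMap) { case (m, (city, cnt)) => m + (city -> (m.getOrElse(city, 0L) + cnt)) }
    )

  def finish(buf: CityRemarkBuffer): String = {
    val top2  = buf.cityMap.toList.sortBy(-_._2).take(2)
    val named = top2.map { case (city, cnt) => f"$city ${cnt * 100.0 / buf.total}%.1f%%" }
    val rest =
      if (buf.cityMap.size > 2) List(f"其他 ${(buf.total - top2.map(_._2).sum) * 100.0 / buf.total}%.1f%%")
      else Nil
    (named ++ rest).mkString(",")
  }

  def bufferEncoder: Encoder[CityRemarkBuffer] = Encoders.product[CityRemarkBuffer]
  def outputEncoder: Encoder[String] = Encoders.STRING
}

// Usage idea: spark.udf.register("city_remark", functions.udaf(new CityRemarkUDAF))
// then aggregate with: select area, product_name, count(*) cli_cnt, city_remark(city_name) info ... group by area, product_name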
// Step 1: per-(area, product) click counts, keep the top 3 products per area (temp view d1)
spark.sql(
  """
    |select *
    |from (
    |  select
    |    t2.*,
    |    row_number() over(partition by t2.area order by t2.cli_cnt desc) rank
    |  from (
    |    select
    |      t1.area,
    |      t1.product_name,
    |      count(*) as cli_cnt
    |    from (
    |      select
    |        c.area,
    |        c.city_name,
    |        p.product_name,
    |        u.click_product_id
    |      from city_info c
    |      left join user_visit_action u on c.city_id = u.city_id
    |      left join product_info p on u.click_product_id = p.product_id
    |    ) t1
    |    where click_product_id >= 0
    |    group by t1.area, t1.product_name
    |  ) t2
    |) t3
    |where rank <= 3
    |""".stripMargin
).createOrReplaceTempView("d1")
// City remark: join products with cities and take each product's top-2 cities by click count.
// The remark text itself is built with concat; for the multi-row aggregation, group_concat is not
// available in Spark SQL, so collect_set/collect_list plus concat_ws is used instead.
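// Side note (standalone illustration, not part of this pipeline): collect_set gathers a group's
// strings into an array and concat_ws joins them with a separator, which is how group_concat
// is emulated here, e.g.
//   spark.sql("select area, concat_ws(',', collect_set(city_name)) as cities from city_info group by area").show()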
spark.sql(
  """
    |select t6.product_name, t6.area, concat_ws(',', collect_set(info1)) as info
    |from (
    |  select
    |    *,
    |    concat(city, round((case when rank1 = 3 then total - sum_ + cnt else cnt end) * 100 / total, 1), '%') info1
    |  from (
    |    select
    |      *,
    |      (case when rank1 = 3 then '其他' else city_name end) city,
    |      sum(cnt) over (partition by product_name, area order by rank1 asc) sum_
    |    from (
    |      select
    |        t3.*,
    |        row_number() over(partition by t3.product_name order by cnt desc) rank1
    |      from (
    |        select
    |          t2.*,
    |          sum(cnt) over(partition by area, product_name) total
    |        from (
    |          select
    |            t1.city_name,
    |            t1.product_name,
    |            t1.area,
    |            count(*) as cnt
    |          from (
    |            select
    |              c.area,
    |              c.city_name,
    |              p.product_name,
    |              u.click_product_id
    |            from city_info c
    |            left join user_visit_action u on c.city_id = u.city_id
    |            left join product_info p on u.click_product_id = p.product_id
    |          ) t1
    |          where click_product_id >= 0
    |          group by product_name, area, city_name
    |        ) t2
    |      ) t3
    |      join d1 on d1.product_name = t3.product_name and d1.area = t3.area
    |    ) t4
    |    where t4.rank1 <= 3
    |  ) t5 order by area, product_name, rank1 asc
    |) t6
    |group by area, product_name
    |order by area, product_name asc
    |""".stripMargin
).createOrReplaceTempView("d2")
spark.sql(
"""
|select
| d1.area,
| d1.product_name,
| d1.cli_cnt,
| d2.info
|from
| d1
| left join d2
| on d1.product_name = d2.product_name and d1.area = d2.area
| order by area,rank asc
|""".stripMargin
).show
Result: