I. Environment Requirements
Spark running in IDEA can connect to an external Hive on a virtual machine; see the separate write-up "IDEA中Spark连接外置hive详细步骤" (detailed steps for connecting Spark in IDEA to an external Hive).
Spark 3.0.0; Hadoop 3.2.1; Hive 3.1.2
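For reference, a minimal sketch of what the IDEA project might pull in for this setup (a hypothetical build.sbt fragment; the artifact versions simply mirror the list above and should be checked against your own cluster):

// Hypothetical build.sbt dependencies for running the example below from IDEA.
// spark-hive is what enableHiveSupport() relies on.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0",
  "org.apache.spark" %% "spark-sql"  % "3.0.0",
  "org.apache.spark" %% "spark-hive" % "3.0.0"
)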
II. Data Preparation
One user-behavior table, one city table, and one product table.
- User-behavior table user_visit_action:
Records four kinds of user behavior: search, click, order, and payment. Data rules:
➢ Fields in each line of the data file are separated by underscores
➢ Each line represents one user action, and that action is exactly one of the four types
➢ If the clicked category ID and product ID are -1, the record is not a click record
Key fields for this requirement:
click_product_id, city_id
- City table city_info:
Contains the city and its region
- Product table product_info:
Contains the product name and the channel (self-operated or third-party)
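As a quick illustration of the -1 convention and the two key fields above (a hypothetical probe against the user_visit_action table created in section IV; not part of the final solution):

// Count only genuine click events: non-click rows carry click_product_id = -1
spark.sql(
  """
    |select count(*) as click_cnt
    |from user_visit_action
    |where click_product_id > -1
    |""".stripMargin
).show()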
III. Requirement
Requirement: Top 3 hot products per region.
"Hot" here is measured by click count: for each region, compute the three most-clicked products, and annotate each product with its click distribution across the major cities; beyond the top two cities, everything is folded into 其他 (other). Example output:

| Region | Product name | Clicks | City remark |
| --- | --- | --- | --- |
| 华北 | 商品A | 100000 | 北京 21.2%, 天津 13.2%, 其他 65.6% |
| 华北 | 商品P | 80200 | 北京 63.0%, 太原 10%, 其他 27.0% |
| 华北 | 商品M | 40000 | 北京 63.0%, 太原 10%, 其他 27.0% |
| 东北 | 商品J | 92000 | 大连 28%, 辽宁 17.0%, 其他 55.0% |
IV. Implementation
1. Create the three tables:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
System.setProperty("HADOOP_USER_NAME", "root")
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("sparkSQL")
val spark = SparkSession
  .builder()
  .enableHiveSupport()
  .config(sparkConf)
  .config("dfs.client.use.datanode.hostname", "true")
  .config("dfs.replication", "2")
  .getOrCreate()
spark.sql("use hive")
// Prepare the data: create the Hive tables and load the local data files
spark.sql(
"""
|CREATE TABLE `user_visit_action`(
| `date` string,
| `user_id` bigint,
| `session_id` string,
| `page_id` bigint,
| `action_time` string,
| `search_keyword` string,
| `click_category_id` bigint,
| `click_product_id` bigint,
| `order_category_ids` string,
| `order_product_ids` string,
| `pay_category_ids` string,
| `pay_product_ids` string,
| `city_id` bigint)
|row format delimited fields terminated by '\t'
""".stripMargin)
spark.sql(
"""
|load data local inpath './datas/user_visit_action.txt' into table hive.user_visit_action
""".stripMargin)
spark.sql(
"""
|CREATE TABLE `product_info`(
| `product_id` bigint,
| `product_name` string,
| `extend_info` string)
|row format delimited fields terminated by '\t'
""".stripMargin)
spark.sql(
"""
|load data local inpath './datas/product_info.txt' into table hive.product_info
""".stripMargin)
spark.sql(
"""
|CREATE TABLE `city_info`(
| `city_id` bigint,
| `city_name` string,
| `area` string)
|row format delimited fields terminated by '\t'
""".stripMargin)
spark.sql(
"""
|load data local inpath './datas/city_info.txt' into table hive.city_info
""".stripMargin)
spark.sql("""select * from city_info""").show
spark.close()
}
While running this, the job failed with the following error:
ERROR Hive: Failed to move: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hive/warehouse/hdu.db/user_visit_action/user_visit_action.txt could only be written to 0 of the 1 minReplication nodes. There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
The message suggests a DataNode problem, but the DataNode was in fact up and running. The eventual explanation and fix:
The NameNode only stores file-system metadata (directories and file names) and is reachable from my machine over the public network, so creating directories works fine. When an upload actually needs to write blocks to a DataNode, however, the NameNode and DataNode communicate over the cluster's internal network, so the NameNode hands the client the DataNode's private IP. That address is unreachable from my machine and is never translated to a public IP. The fix is to make the HDFS client ask for hostnames instead; with a hostname-to-public-IP mapping (for example in the local hosts file), the DataNode becomes reachable and the error goes away.
Because configuration set in code has the highest priority, the simplest fix is to add the settings directly when building the SparkSession:
.config("dfs.client.use.datanode.hostname", "true")
.config("dfs.replication", "2")
2. Implement the requirement
The city-remark text is the fiddly part; it can be built either with a UDF/UDAF (a sketch follows below) or in pure SQL, which is the approach taken here.
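For completeness, a rough sketch of what the UDF/UDAF route could look like with a Spark 3 Aggregator; the class name, buffer shape and formatting below are illustrative assumptions, not the code this article actually uses (the pure-SQL version follows):

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders, functions}

// Buffer: total clicks plus per-city click counts for one (area, product) group
case class CityRemarkBuffer(total: Long, cityMap: Map[String, Long])

// IN = city_name of one click row, OUT = remark text such as "北京 21.2%,天津 13.2%,其他 65.6%"
class CityRemarkUDAF extends Aggregator[String, CityRemarkBuffer, String] {
  def zero: CityRemarkBuffer = CityRemarkBuffer(0L, Map.empty)

  def reduce(buf: CityRemarkBuffer, city: String): CityRemarkBuffer =
    CityRemarkBuffer(buf.total + 1, buf.cityMap + (city -> (buf.cityMap.getOrElse(city, 0L) + 1L)))

  def merge(b1: CityRemarkBuffer, b2: CityRemarkBuffer): CityRemarkBuffer =
    CityRemarkBuffer(
      b1.total + b2.total,
      b2.cityMap.foldLeft(b1.cityMap) { case (m, (city, cnt)) => m + (city -> (m.getOrElse(city, 0L) + cnt)) }
    )

  def finish(buf: CityRemarkBuffer): String = {
    val top2  = buf.cityMap.toList.sortBy(-_._2).take(2)
    val named = top2.map { case (city, cnt) => f"$city ${cnt * 100.0 / buf.total}%.1f%%" }
    val rest =
      if (buf.cityMap.size > 2) List(f"其他 ${(buf.total - top2.map(_._2).sum) * 100.0 / buf.total}%.1f%%")
      else Nil
    (named ++ rest).mkString(",")
  }

  def bufferEncoder: Encoder[CityRemarkBuffer] = Encoders.product[CityRemarkBuffer]
  def outputEncoder: Encoder[String] = Encoders.STRING
}

// Usage idea: spark.udf.register("city_remark", functions.udaf(new CityRemarkUDAF))
// then aggregate with: select area, product_name, count(*) cli_cnt, city_remark(city_name) info ... group by area, product_name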
// Step 1: per-(area, product) click counts, keep the top 3 products per area (temp view d1)
spark.sql(
  """
    |select *
    |from (
    |  select
    |    t2.*,
    |    row_number() over(partition by t2.area order by t2.cli_cnt desc) rank
    |  from (
    |    select
    |      t1.area,
    |      t1.product_name,
    |      count(*) as cli_cnt
    |    from (
    |      select
    |        c.area,
    |        c.city_name,
    |        p.product_name,
    |        u.click_product_id
    |      from city_info c
    |      left join user_visit_action u on c.city_id = u.city_id
    |      left join product_info p on u.click_product_id = p.product_id
    |    ) t1
    |    where click_product_id >= 0
    |    group by t1.area, t1.product_name
    |  ) t2
    |) t3
    |where rank <= 3
    |""".stripMargin
).createOrReplaceTempView("d1")
// City remark: join products with cities and take each product's top-2 cities by click count.
// The remark text itself is built with concat; for the multi-row aggregation, group_concat is not
// available in Spark SQL, so collect_set/collect_list plus concat_ws is used instead.
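// Side note (standalone illustration, not part of this pipeline): collect_set gathers a group's
// strings into an array and concat_ws joins them with a separator, which is how group_concat
// is emulated here, e.g.
//   spark.sql("select area, concat_ws(',', collect_set(city_name)) as cities from city_info group by area").show()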
spark.sql(
  """
    |select t6.product_name, t6.area, concat_ws(',', collect_set(info1)) as info
    |from (
    |  select
    |    *,
    |    concat(city, round((case when rank1 = 3 then total - sum_ + cnt else cnt end) * 100 / total, 1), '%') info1
    |  from (
    |    select
    |      *,
    |      (case when rank1 = 3 then '其他' else city_name end) city,
    |      sum(cnt) over (partition by product_name, area order by rank1 asc) sum_
    |    from (
    |      select
    |        t3.*,
    |        row_number() over(partition by t3.product_name order by cnt desc) rank1
    |      from (
    |        select
    |          t2.*,
    |          sum(cnt) over(partition by area, product_name) total
    |        from (
    |          select
    |            t1.city_name,
    |            t1.product_name,
    |            t1.area,
    |            count(*) as cnt
    |          from (
    |            select
    |              c.area,
    |              c.city_name,
    |              p.product_name,
    |              u.click_product_id
    |            from city_info c
    |            left join user_visit_action u on c.city_id = u.city_id
    |            left join product_info p on u.click_product_id = p.product_id
    |          ) t1
    |          where click_product_id >= 0
    |          group by product_name, area, city_name
    |        ) t2
    |      ) t3
    |      join d1 on d1.product_name = t3.product_name and d1.area = t3.area
    |    ) t4
    |    where t4.rank1 <= 3
    |  ) t5 order by area, product_name, rank1 asc
    |) t6
    |group by area, product_name
    |order by area, product_name asc
    |""".stripMargin
).createOrReplaceTempView("d2")
spark.sql(
"""
|select
| d1.area,
| d1.product_name,
| d1.cli_cnt,
| d2.info
|from
| d1
| left join d2
| on d1.product_name = d2.product_name and d1.area = d2.area
| order by area,rank asc
|""".stripMargin
).show
Result: