2021.1.19 Class Test 3

Article link: https://blog.csdn.net/m0_48758256/article/details/112856994

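Practical Skills Exam
Note: the exam paper must be handed in when the exam ends; results are void if it is not returned.
I. Environment requirements
A Hadoop + Hive + Spark + HBase development environment.
II. Submission requirements
1. You must submit the source code or the corresponding analysis statements; without them the task scores no points.
2. For tasks that produce analysis results, submit a screenshot of the results together with the code.
III. Data description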
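UserBehavior is a Taobao user-behavior data set provided by Alibaba. It contains every action (click, purchase, add-to-cart, favorite) of about 5458 randomly selected users who were active between 2017-09-11 and 2017-12-03. Each line records one user action and consists of a user ID, item ID, item category ID, behavior type, and timestamp, separated by commas. The columns are described below: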
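Column          Name              Description
user_id         User ID           Integer; the serialized user ID
item_id         Item ID           Integer; the serialized item ID
category_id     Category ID       Integer; the serialized ID of the item's category
behavior_type   Behavior type     String enum, one of 'pv', 'buy', 'cart', 'fav'
time            Timestamp         Timestamp at which the action occurred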
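There are four behavior types:
Behavior type   Description
pv              Item detail page view, equivalent to a click
buy             Item purchase
cart            Item added to the shopping cart
fav             Item added to favorites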
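
The record layout described above can be sketched as a small Scala parser. This is only an illustration: the names UserBehavior and UserBehaviorParser and the decision to reject unknown behavior types are assumptions, not part of the exam or the data set documentation.

```scala
// Hypothetical record type for one line of UserBehavior.csv.
// Field order follows the column description above.
final case class UserBehavior(
    userId: Long,
    itemId: Long,
    categoryId: Long,
    behaviorType: String,
    timestamp: Long // Unix epoch seconds, as in the sample rows
)

object UserBehaviorParser {
  private val KnownTypes = Set("pv", "buy", "cart", "fav")

  /** Parses one comma-separated line; returns None for malformed input. */
  def parse(line: String): Option[UserBehavior] =
    line.split(",", -1).map(_.trim) match {
      case Array(u, i, c, b, t) if KnownTypes(b) =>
        scala.util.Try(UserBehavior(u.toLong, i.toLong, c.toLong, b, t.toLong)).toOption
      case _ => None
    }
}

// First sample row of the data set
println(UserBehaviorParser.parse("1,2268318,2520377,pv,1511544070"))
// prints Some(UserBehavior(1,2268318,2520377,pv,1511544070))
```

Returning an Option keeps malformed lines from aborting a job; in an RDD pipeline the same function could be used with flatMap to drop them.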
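IV. Functional requirements
1. Data preparation (10 points)
Create the directory /data/userbehavior in HDFS and upload UserBehavior.csv to it. (5 points)
Use an HDFS command to count the number of lines in the file. (5 points)
2. Data cleaning (40 points)
1) Create the database exam in Hive. (5 points)
2) In the exam database, create the external table userbehavior and map the HDFS data to it. (5 points)
3) In HBase, create the namespace exam, and in it create the table userbehavior with one column family, info. (5 points)
4) In Hive, create the external table userbehavior_hbase mapped to the HBase table (5 points), and load the data into HBase. (5 points)
5) In the exam database, create the managed partitioned table userbehavior_partitioned (partitioned by date). Query the userbehavior table, format the timestamp as "year-month-day hour:minute:second", and insert the data into userbehavior_partitioned. (15 points)
3. User behavior analysis (20 points)
Using Spark, load UserBehavior.csv from HDFS and complete the following analyses with RDDs.
Compute the UV value (the total number of users who visited Taobao). (10 points)
Count the total number of click, favorite, add-to-cart, and purchase actions. (10 points)
4. Finding valuable users (30 points)
Use SparkSQL to compute each user's most recent purchase time. Taking 2017-12-03 as the current date and a one-month window, compute the number of days since each user's most recent purchase (0-30 days) and divide it into 5 tiers: 0-6, 7-12, 13-18, 19-24, and 25-30 days, scored 4 down to 0. (15 points)
Use SparkSQL to compute each user's purchase frequency. Taking 2017-12-03 as the current date and a one-month window, count each user's purchases; the counts range from 1 to 161 and are divided into 5 tiers: 1-32, 33-64, 65-96, 97-128, and 129-161, scored 0 up to 4. (15 points)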

Sample data: UserBehavior.csv

1,2268318,2520377,pv,1511544070
1,2333346,2520771,pv,1511561733
1,2576651,149192,pv,1511572885
1,3830808,4181361,pv,1511593493
1,4365585,2520377,pv,1511596146
1,4606018,2735466,pv,1511616481
1,230380,411153,pv,1511644942
1,3827899,2920476,pv,1511713473
1,3745169,2891509,pv,1511725471
1,1531036,2920476,pv,1511733732
1,2266567,4145813,pv,1511741471
1,2951368,1080785,pv,1511750828
1,3108797,2355072,pv,1511758881
1,1338525,149192,pv,1511773214
1,2286574,2465336,pv,1511797167
1,5002615,2520377,pv,1511839385
1,2734026,4145813,pv,1511842184
1,5002615,2520377,pv,1511844273
1,3239041,2355072,pv,1511855664
1,4615417,4145813,pv,1511870864

1. Data preparation (10 points)
Create the directory /data/userbehavior in HDFS and upload UserBehavior.csv to it. (5 points)
Use an HDFS command to count the number of lines in the file. (5 points)

hdfs dfsadmin -safemode leave
hdfs dfs -mkdir -p /data/userbehavior
hdfs dfs -put UserBehavior.csv  /data/userbehavior
hdfs dfs -cat  /data/userbehavior/UserBehavior.csv | wc -l
21/01/19 13:43:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
561294

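2. Data cleaning (40 points)
1) Create the database exam in Hive. (5 points)

The database, like the HBase namespace later on, is named exam202010 here rather than exam.

hive> create database exam202010;
hive> use exam202010;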

2) In the exam database, create the external table userbehavior and map the HDFS data to it. (5 points)

hive> create external table userbehavior(
    > user_id int,
    > item_id int,
    > category_id int,
    > behavior_type string,
    > time bigint
    > )
    > row format delimited 
    > fields terminated by ','
    > stored as textfile
    > location '/data/userbehavior/'
    > ;
OK

3) In HBase, create the namespace exam, and in it create the table userbehavior with one column family, info. (5 points)

hbase(main):002:0> create_namespace 'exam202010'
hbase(main):004:0> create 'exam202010:userbehavior','info'

4) In Hive, create the external table userbehavior_hbase mapped to the HBase table (5 points), and load the data into HBase. (5 points)

hive> create external table userbehavior_hbase(
    > user_id int,
    > item_id int,
    > category_id int,
    > behavior_type string,
    > time bigint
    > )
    > stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > with serdeproperties ('hbase.columns.mapping'=':key,info:item_id,info:category_id,info:behavior_type,info:time')
    > tblproperties('hbase.table.name'='exam202010:userbehavior');
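
One consequence of this mapping is worth spelling out: the first Hive column, user_id, becomes the HBase row key, and HBase keeps a single row per row key. Every event of the same user is therefore written to the same row, and a scan shows only the latest cell values per user. The sketch below models this with plain Scala maps; it is not HBase client code, and the composite key user_id_time_item_id is a hypothetical alternative, not something the exam asks for.

```scala
// Models an HBase table as a map from row key to the latest cell values.
// Rows are the first three sample rows:
// (user_id, item_id, category_id, behavior_type, time)
val rows = Seq(
  ("1", "2268318", "2520377", "pv", "1511544070"),
  ("1", "2333346", "2520771", "pv", "1511561733"),
  ("1", "2576651", "149192", "pv", "1511572885")
)

// Row key = user_id, as in the mapping above: each write replaces the previous one.
val byUserId = rows.foldLeft(Map.empty[String, (String, String, String, String)]) {
  case (table, (user, item, cat, behavior, time)) =>
    table.updated(user, (item, cat, behavior, time))
}

// Hypothetical composite row key user_id_time_item_id: one row per event.
val byCompositeKey = rows.foldLeft(Map.empty[String, (String, String)]) {
  case (table, (user, item, cat, behavior, time)) =>
    table.updated(s"${user}_${time}_${item}", (cat, behavior))
}

println(byUserId.size)       // 1
println(byCompositeKey.size) // 3
```

With the mapping used in this walkthrough, the HBase table should thus be expected to hold about one row per user rather than one row per line of the CSV file.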
 
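hive> insert into userbehavior_hbase select user_id,item_id,category_id,behavior_type,time from userbehavior;

# Check the loaded data in the HBase shell
hbase(main):008:0> scan 'exam202010:userbehavior'

5) In the exam database, create the managed partitioned table userbehavior_partitioned (partitioned by date). Query the userbehavior table, format the timestamp as "year-month-day hour:minute:second", and insert the data into userbehavior_partitioned. (15 points)
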
# Create the partitioned table
hive> create table userbehavior_partitioned(
     user_id int,
     item_id int,
     category_id int,
     behavior_type string,
     time string
     )
     partitioned by (dt string) stored as orc;
# Enable dynamic partitioning
hive> set hive.exec.dynamic.partition=true ;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
# Insert the data (yyyy is the calendar year; YYYY would be the week-based year)
hive> insert into userbehavior_partitioned partition(dt) 
     select user_id,item_id,category_id,behavior_type, from_unixtime(time,'yyyy-MM-dd HH:mm:ss') time,
     substring(from_unixtime(time,'yyyy-MM-dd HH:mm:ss'),1,10) dt 
 from userbehavior;
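
The timestamp conversion in this step can be checked outside Hive with java.time. The sketch below is an approximation under one loud assumption: it renders times in Asia/Shanghai, whereas from_unixtime uses the time zone of the Hive session or server. The pattern letters also matter: in Java date patterns, `yyyy` is the calendar year, while `YYYY` is the week-based year and can differ from it around New Year. The helper name formatEvent is made up for this example.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Assumption: timestamps are rendered in China Standard Time.
val zone = ZoneId.of("Asia/Shanghai")
val timeFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(zone)
val dateFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(zone)

/** Returns (formatted time, partition value dt) for a Unix timestamp in seconds. */
def formatEvent(epochSeconds: Long): (String, String) = {
  val instant = Instant.ofEpochSecond(epochSeconds)
  (timeFormat.format(instant), dateFormat.format(instant))
}

// Timestamp of the first sample row
println(formatEvent(1511544070L)) // (2017-11-25 01:21:10,2017-11-25)
```

The second element corresponds to the dt partition value, which the Hive statement derives by taking the first 10 characters of the formatted time.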

3. User behavior analysis (20 points)
Compute the UV value (the total number of users who visited Taobao). (10 points)

scala> val userbehavior = sc.textFile("hdfs://lijia1:9000/data/userbehavior/UserBehavior.csv")
scala> userbehavior.map(x=>x.split(",")).map(x=>(x(0))).distinct.count
res2: Long = 5458

Count the total number of click, favorite, add-to-cart, and purchase actions. (10 points)

scala> userbehavior.map(x=>x.split(",")).map(x=>(x(3),1)).reduceByKey(_+_).collect.foreach(println)
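
The logic of the two RDD jobs can be tried without a cluster on ordinary Scala collections. The sketch below is an analogue, not Spark code, and the six input lines are invented for illustration; they are not taken from the data set.

```scala
// Hypothetical lines in the UserBehavior.csv layout (not real data).
val lines = Seq(
  "1,100,10,pv,1511544070",
  "1,101,10,cart,1511544080",
  "2,100,10,pv,1511544090",
  "2,100,10,buy,1511544100",
  "3,102,11,fav,1511544110",
  "3,103,11,pv,1511544120"
)

val fields = lines.map(_.split(","))

// UV: number of distinct user_id values (column 0)
val uv = fields.map(_(0)).distinct.size

// Events per behavior_type (column 3). groupBy followed by size plays the
// role of map(x => (x(3), 1)).reduceByKey(_ + _) on an RDD.
val perType: Map[String, Int] =
  fields.groupBy(_(3)).map { case (behavior, rows) => behavior -> rows.size }

println(uv) // 3
perType.toSeq.sorted.foreach(println)
// (buy,1)
// (cart,1)
// (fav,1)
// (pv,3)
```

On the cluster, reduceByKey is the better fit because it combines counts within each partition before shuffling, whereas a groupBy would move every record.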

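4. Finding valuable users (30 points)
Use SparkSQL to compute each user's most recent purchase time. Taking 2017-12-03 as the current date and a one-month window, compute the number of days since each user's most recent purchase (0-30 days) and divide it into 5 tiers: 0-6, 7-12, 13-18, 19-24, and 25-30 days, scored 4 down to 0. (15 points)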

hive> with temptb as (select user_id,DATEDIFF('2017-12-03',MAX(time)) as ltime from userbehavior_partitioned where dt between '2017-11-03' and '2017-12-03' and behavior_type='buy' group by user_id)
select user_id,
(case when ltime between 0 and 6 then 4 
when ltime between 7 and 12 then 3
when ltime between 13 and 18 then 2 
when ltime between 19 and 24 then 1
when ltime between 25 and 30 then 0 
else null end) level 
from temptb;

scala> val sqlString="""
      with temptb as (select user_id,DATEDIFF('2017-12-03',MAX(time)) as ltime from exam202010.userbehavior_partitioned where dt between '2017-11-03' and '2017-12-03' and behavior_type='buy' group by user_id)
      select user_id,
      (case when ltime between 0 and 6 then 4 
      when ltime between 7 and 12 then 3
      when ltime between 13 and 18 then 2 
      when ltime between 19 and 24 then 1
      when ltime between 25 and 30 then 0 
      else null end) level 
      from temptb
      """
scala> spark.sql(sqlString).show
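
The CASE expression in the query maps the number of days since the last purchase to a score. As a cross-check of the tier boundaries, the same rule can be written as a small Scala function. This is a minimal sketch; the name recencyScore and the choice to return None outside the 0-30 day window (where the SQL yields null) are assumptions made for the example.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Reference date fixed by the exam statement
val today = LocalDate.parse("2017-12-03")

/** Recency score: 0-6 days -> 4, 7-12 -> 3, 13-18 -> 2, 19-24 -> 1, 25-30 -> 0. */
def recencyScore(lastPurchase: LocalDate): Option[Int] = {
  val days = ChronoUnit.DAYS.between(lastPurchase, today)
  days match {
    case d if d < 0 || d > 30 => None
    case d if d <= 6          => Some(4)
    case d if d <= 12         => Some(3)
    case d if d <= 18         => Some(2)
    case d if d <= 24         => Some(1)
    case _                    => Some(0)
  }
}

println(recencyScore(LocalDate.parse("2017-12-01"))) // Some(4)
println(recencyScore(LocalDate.parse("2017-11-20"))) // Some(2)
println(recencyScore(LocalDate.parse("2017-11-03"))) // Some(0)
```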

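Use SparkSQL to compute each user's purchase frequency. Taking 2017-12-03 as the current date and a one-month window, count each user's purchases; the counts range from 1 to 161 and are divided into 5 tiers: 1-32, 33-64, 65-96, 97-128, and 129-161, scored 0 up to 4. (15 points)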

hive> with temptb as (select user_id,count(item_id) buynum from exam202010.userbehavior_partitioned where dt between '2017-11-03' and '2017-12-03'  and behavior_type='buy'  group by user_id)
select user_id,
(case when buynum between 1 and 32 then 0 
when buynum between 33 and 64 then 1
when buynum between 65 and 96 then 2 
when buynum between 97 and 128 then 3
when buynum between 129 and 161 then 4 
else null end) level 
from temptb;

scala> val sqlString2 = """
     | with temptb as (select user_id,count(item_id) buynum from exam202010.userbehavior_partitioned where dt between '2017-11-03' and '2017-12-03'  and behavior_type='buy' group by user_id)
     | select user_id,
     | (case when buynum between 1 and 32 then 0 
     | when buynum between 33 and 64 then 1
     | when buynum between 65 and 96 then 2 
     | when buynum between 97 and 128 then 3
     | when buynum between 129 and 161 then 4 
     | else null end) level 
     | from temptb
     | """
scala> spark.sql(sqlString2).show
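
The frequency tiers are 32 counts wide, except the last one, which also absorbs the count 161. That makes it possible to compute the score arithmetically instead of with a CASE chain. The function below is a sketch of that idea; the name frequencyScore and the None result for counts outside 1-161 (where the SQL yields null) are assumptions made for the example.

```scala
/** Frequency score: 1-32 -> 0, 33-64 -> 1, 65-96 -> 2, 97-128 -> 3, 129-161 -> 4. */
def frequencyScore(purchases: Long): Option[Int] =
  if (purchases < 1 || purchases > 161) None
  // (n - 1) / 32 yields 0..5; the cap at 4 folds the count 161 into the top tier
  else Some(math.min(((purchases - 1) / 32).toInt, 4))

println(frequencyScore(1))   // Some(0)
println(frequencyScore(33))  // Some(1)
println(frequencyScore(128)) // Some(3)
println(frequencyScore(161)) // Some(4)
println(frequencyScore(0))   // None
```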