Requirement analysis: find the top 3 most popular products in each area.
Suppose MySQL holds two tables: city_info (city information) and product_info (product information).
city_info:
city_id: city ID
city_name: city name
area: region (East China, North China, South China, Central China, and so on)
mysql> select * from city_info;
+---------+-----------+------+
| city_id | city_name | area |
+---------+-----------+------+
|       1 | BEIJING   | NC   |
|       2 | SHANGHAI  | EC   |
|       3 | NANJING   | EC   |
|       4 | GUANGZHOU | SC   |
|       5 | SANYA     | SC   |
|       6 | WUHAN     | CC   |
|       7 | CHANGSHA  | CC   |
|       8 | XIAN      | NW   |
|       9 | CHENGDU   | SW   |
|      10 | HAERBIN   | NE   |
+---------+-----------+------+
10 rows in set (0.00 sec)
product_info:
product_id: product ID
product_name: product name
extend_info: product status, stored as a JSON string
mysql> select * from product_info;
+------------+--------------+----------------------+
| product_id | product_name | extend_info          |
+------------+--------------+----------------------+
|          1 | product1     | {"product_status":1} |
|          2 | product2     | {"product_status":1} |
|          3 | product3     | {"product_status":1} |
|          4 | product4     | {"product_status":1} |
|          5 | product5     | {"product_status":1} |
|          6 | product6     | {"product_status":1} |
|          7 | product7     | {"product_status":1} |
|          8 | product8     | {"product_status":1} |
|          9 | product9     | {"product_status":0} |
|         10 | product10    | {"product_status":1} |
|         11 | product11    | {"product_status":0} |
|         12 | product12    | {"product_status":0} |
|         13 | product13    | {"product_status":0} |
|         14 | product14    | {"product_status":0} |
|         15 | product15    | {"product_status":1} |
|         16 | product16    | {"product_status":0} |
|         17 | product17    | {"product_status":1} |
|         18 | product18    | {"product_status":0} |
|         19 | product19    | {"product_status":1} |
|         20 | product20    | {"product_status":1} |
|         21 | product21    | {"product_status":0} |
|         22 | product22    | {"product_status":0} |
|         23 | product23    | {"product_status":0} |
|         24 | product24    | {"product_status":0} |
|         25 | product25    | {"product_status":1} |
|         26 | product26    | {"product_status":1} |
|         27 | product27    | {"product_status":0} |
|         28 | product28    | {"product_status":1} |
|         29 | product29    | {"product_status":0} |
|         30 | product30    | {"product_status":0} |
|         31 | product31    | {"product_status":0} |
|         32 | product32    | {"product_status":0} |
|         33 | product33    | {"product_status":1} |
|         34 | product34    | {"product_status":1} |
|         35 | product35    | {"product_status":0} |
|         36 | product36    | {"product_status":0} |
|         37 | product37    | {"product_status":1} |
|         38 | product38    | {"product_status":0} |
|         39 | product39    | {"product_status":0} |
|         40 | product40    | {"product_status":1} |
|         41 | product41    | {"product_status":1} |
|         42 | product42    | {"product_status":1} |
|         43 | product43    | {"product_status":1} |
|         44 | product44    | {"product_status":0} |
|         45 | product45    | {"product_status":1} |
|         46 | product46    | {"product_status":1} |
|         47 | product47    | {"product_status":0} |
|         48 | product48    | {"product_status":1} |
|         49 | product49    | {"product_status":1} |
|         50 | product50    | {"product_status":1} |
|         51 | product51    | {"product_status":1} |
|         52 | product52    | {"product_status":0} |
|         53 | product53    | {"product_status":0} |
|         54 | product54    | {"product_status":1} |
|         55 | product55    | {"product_status":0} |
|         56 | product56    | {"product_status":0} |
|         57 | product57    | {"product_status":1} |
|         58 | product58    | {"product_status":1} |
|         59 | product59    | {"product_status":1} |
|         60 | product60    | {"product_status":1} |
|         61 | product61    | {"product_status":0} |
|         62 | product62    | {"product_status":1} |
|         63 | product63    | {"product_status":1} |
|         64 | product64    | {"product_status":0} |
|         65 | product65    | {"product_status":0} |
|         66 | product66    | {"product_status":1} |
|         67 | product67    | {"product_status":1} |
|         68 | product68    | {"product_status":0} |
|         69 | product69    | {"product_status":1} |
|         70 | product70    | {"product_status":0} |
|         71 | product71    | {"product_status":0} |
|         72 | product72    | {"product_status":0} |
|         73 | product73    | {"product_status":1} |
|         74 | product74    | {"product_status":0} |
|         75 | product75    | {"product_status":1} |
|         76 | product76    | {"product_status":0} |
|         77 | product77    | {"product_status":0} |
|         78 | product78    | {"product_status":1} |
|         79 | product79    | {"product_status":0} |
|         80 | product80    | {"product_status":0} |
|         81 | product81    | {"product_status":0} |
|         82 | product82    | {"product_status":1} |
|         83 | product83    | {"product_status":1} |
|         84 | product84    | {"product_status":1} |
|         85 | product85    | {"product_status":0} |
|         86 | product86    | {"product_status":1} |
|         87 | product87    | {"product_status":1} |
|         88 | product88    | {"product_status":1} |
|         89 | product89    | {"product_status":1} |
|         90 | product90    | {"product_status":1} |
|         91 | product91    | {"product_status":1} |
|         92 | product92    | {"product_status":0} |
|         93 | product93    | {"product_status":0} |
|         94 | product94    | {"product_status":1} |
|         95 | product95    | {"product_status":0} |
|         96 | product96    | {"product_status":0} |
|         97 | product97    | {"product_status":1} |
|         98 | product98    | {"product_status":1} |
|         99 | product99    | {"product_status":0} |
|        100 | product100   | {"product_status":1} |
+------------+--------------+----------------------+
100 rows in set (0.00 sec)
Hive holds the user click logs we collected, in a user_click table.
user_click:
user_id: user ID
session_id: session ID
action_time: time of the click
city_id: city ID
product_id: product ID
[hadoop@hadoop001 data]$ hdfs dfs -tail /data/user_click.txt
19/07/21 10:38:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
9:23,9,22
53,770c325ace1240d88387fa19931acb72,2016-05-05 10:17:16,9,83
53,770c325ace1240d88387fa19931acb72,2016-05-05 10:25:42,9,54
53,770c325ace1240d88387fa19931acb72,2016-05-05 10:48:33,9,28
53,770c325ace1240d88387fa19931acb72,2016-05-05 10:07:19,9,72
53,770c325ace1240d88387fa19931acb72,2016-05-05 10:00:26,9,87
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:35:01,9,95
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:55:19,9,13
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:34:25,9,24
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:26:50,9,69
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:49:43,9,37
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:55:46,9,96
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:24:13,9,67
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:06:52,9,29
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:02:52,9,92
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:19:27,9,12
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:10:07,9,18
## Create an external, partitioned Hive table, pointing LOCATION at our data directory. Because we are simulating daily user click logs, the partition column is the date.
hive (hwzhdb)> create external table user_click(
> user_id int,
> session_id string,
> action_time string,
> city_id int,
> product_id int
> ) partitioned by (date string)
> row format delimited fields terminated by ','
> location 'hdfs://hadoop001:9000/data';
OK
Time taken: 0.331 seconds
## Load the data into a partition. All of our records are from 2016-05-05, so that is the partition value.
hive (hwzhdb)> load data inpath 'hdfs://hadoop001:9000/data/user_click.txt' into table user_click partition(date='2016-05-05');
Loading data to table hwzhdb.user_click partition (date=2016-05-05)
Partition hwzhdb.user_click{date=2016-05-05} stats: [numFiles=1, numRows=0, totalSize=725264, rawDataSize=0]
OK
Time taken: 1.402 seconds
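## Note: LOAD DATA INPATH moves user_click.txt into the partition directory
## (/data/date=2016-05-05/). An alternative sketch, assuming you first move the
## file under a partition-style path yourself with hdfs dfs -mv, is to simply
## register the partition:
alter table user_click add if not exists partition (date='2016-05-05')
location 'hdfs://hadoop001:9000/data/date=2016-05-05';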
hive (hwzhdb)> select * from user_click limit 5;
OK
user_click.user_id  user_click.session_id             user_click.action_time  user_click.city_id  user_click.product_id  user_click.date
95                  2bf501a7637549c89cf55342331b15db  2016-05-05 21:01:56     1                   72                     2016-05-05
95                  2bf501a7637549c89cf55342331b15db  2016-05-05 21:52:26     1                   68                     2016-05-05
95                  2bf501a7637549c89cf55342331b15db  2016-05-05 21:17:03     1                   40                     2016-05-05
95                  2bf501a7637549c89cf55342331b15db  2016-05-05 21:32:07     1                   21                     2016-05-05
95                  2bf501a7637549c89cf55342331b15db  2016-05-05 21:26:06     1                   63                     2016-05-05
Time taken: 0.209 seconds, Fetched: 5 row(s)
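## A quick way to confirm the partition was registered:
show partitions user_click;
## expected output: date=2016-05-05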
!!! Analysis: to get the top 3 per area, we need to join city_info, product_info, and user_click, group by city_info.area together with product_info's product_id and product_name, count the clicks per group, and then use row_number() to keep the top 3 rows within each area.
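## Before worrying about where the tables live, it helps to see the shape of
## the query we are aiming for. A minimal sketch, assuming all three tables
## were queryable in one place (Hive evaluates window functions after GROUP BY,
## so the aggregate can feed row_number() directly):
select area, product_id, product_name, sumClick, rank
from (
    select ci.area, pi.product_id, pi.product_name,
           count(1) as sumClick,
           row_number() over (partition by ci.area
                              order by count(1) desc) as rank
    from user_click uc
    join city_info ci on uc.city_id = ci.city_id
    join product_info pi on uc.product_id = pi.product_id
    group by ci.area, pi.product_id, pi.product_name
) t
where rank <= 3;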
But a Hive query cannot join directly against tables that live in MySQL, so we first import the MySQL data into Hive with Sqoop.
Create city_info and product_info tables in Hive that match the MySQL schemas. (Sqoop's --create-hive-table flag can create the table automatically, but the generated layout may not be what we want, so we create the tables ourselves beforehand.)
hive (hwzhdb)> create table city_info(
> city_id int,
> city_name string,
> area string
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 0.528 seconds
hive (hwzhdb)> create table product_info(
> product_id int,
> product_name string,
> extend_info string
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 0.12 seconds
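## extend_info is a JSON string. This demo never parses it, but once the data
## is imported, Hive's built-in get_json_object can extract the status if it
## is ever needed; a small sketch:
select product_id,
       get_json_object(extend_info, '$.product_status') as product_status
from product_info
limit 5;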
## Then import the MySQL table data into the Hive tables with Sqoop.
## Flag notes:
##   --hive-import: the switch that tells Sqoop to load the data into Hive
##   --hive-overwrite: overwrite the Hive table's data on every run
##   --delete-target-dir: delete the HDFS target directory first if it already exists
##   --fields-terminated-by: the Hive-side delimiter; it must match the one the table was created with
##   --split-by: the column used to split the input into -m chunks; Sqoop splits on the primary key by default, so this is needed when the MySQL table has no primary key
##   -m: number of map tasks used for the import (the default is 4)
city_info:
sqoop import \
--connect jdbc:mysql://localhost:3306/hwzdb \
--username root \
--password 123456 \
--table city_info \
--hive-database hwzhdb \
--hive-table city_info \
--hive-import \
--hive-overwrite \
--delete-target-dir \
--fields-terminated-by '\t' \
--split-by city_id \
-m 2
product_info:
sqoop import \
--connect jdbc:mysql://localhost:3306/hwzdb \
--username root \
--password 123456 \
--table product_info \
--hive-database hwzhdb \
--hive-table product_info \
--hive-import \
--hive-overwrite \
--delete-target-dir \
--fields-terminated-by '\t' \
--split-by product_id \
-m 2
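## After both imports it is worth sanity-checking the row counts against MySQL
## (10 and 100 rows respectively); a delimiter or split-by problem shows up
## here immediately:
select count(*) from city_info;    -- expect 10
select count(*) from product_info; -- expect 100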
Now join user_click with city_info and product_info in Hive, compute the total clicks per area and product, and then use a window function to pick the top 3 products in each area.
## Create a staging table holding the total clicks per area and product
hive (hwzhdb)> create table tmp_area_product_name_sumclick as
> select ci.area, pi.product_id, pi.product_name, count(1) sumClick from user_click uc
> join city_info ci on uc.city_id = ci.city_id
> join product_info pi on pi.product_id = uc.product_id
> group by ci.area, pi.product_id, pi.product_name;
... (MapReduce job output omitted) ...
## Then use a window function to pick the top 3 products in each area
hive (hwzhdb)> create table area_product_click_top3 as
> select '2016-05-05' as day, area, product_id, product_name, sumClick, rank from
> (select area, product_name, product_id, sumClick, row_number() over(partition by area order by sumClick desc) as rank from tmp_area_product_name_sumclick) tmp
> where rank <= 3;
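## Note: row_number() breaks ties arbitrarily (in the result below, CC has two
## products with 39 clicks, ranked 1 and 2). If tied products should share a
## rank, swap in rank() or dense_rank(); a sketch of the inner query:
select area, product_id, product_name, sumClick,
       dense_rank() over (partition by area order by sumClick desc) as rank
from tmp_area_product_name_sumclick;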
## Check the final result:
hive (hwzhdb)> select * from area_product_click_top3;
OK
area_product_click_top3.day area_product_click_top3.area area_product_click_top3.product_id area_product_click_top3.product_name area_product_click_top3.sumclick area_product_click_top3.rank
2016-05-05  CC     26  product26   39  1
2016-05-05  CC      7  product7    39  2
2016-05-05  CC     70  product70   38  3
2016-05-05  EC      4  product4    40  1
2016-05-05  EC     96  product96   32  2
2016-05-05  EC     99  product99   31  3
2016-05-05  NC     40  product40   16  1
2016-05-05  NC      9  product9    16  2
2016-05-05  NC      5  product5    13  3
2016-05-05  NW     56  product56   20  1
2016-05-05  NW     67  product67   20  2
2016-05-05  NW     48  product48   19  3
2016-05-05  SC     38  product38   35  1
2016-05-05  SC     98  product98   34  2
2016-05-05  SC     33  product33   34  3
2016-05-05  SW     16  product16   20  1
2016-05-05  SW     95  product95   19  2
2016-05-05  SW     60  product60   19  3
Time taken: 0.077 seconds, Fetched: 18 row(s)
!!! The final metrics are meant to be displayed on a frontend page, so they need to be exported to MySQL. First create the MySQL table, then push the data over with sqoop export.
## Create the MySQL table (note: RANK became a reserved word in MySQL 8.0.2, so on MySQL 8+ the rank column below would need backticks)
drop table if exists area_product_click_top3;
create table area_product_click_top3(
day varchar(20) comment 'this metric is recomputed regularly, so a date column distinguishes the daily runs',
area varchar(20),
product_id int,
product_name varchar(30),
sumclick int,
rank int
);
## Export the Hive table's HDFS files to MySQL with Sqoop. The delimiter passed
## to Sqoop must match the files on disk; see the note after the command.
sqoop export \
--connect jdbc:mysql://localhost:3306/hwzdb \
--username root \
--password 123456 \
--table area_product_click_top3 \
--columns day,area,product_id,product_name,sumClick,rank \
--mapreduce-job-name top3Tomysql \
--export-dir /user/hive/warehouse/hwzhdb.db/area_product_click_top3 \
--fields-terminated-by '\t' \
-m 2
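## If the export mis-parses rows, the usual culprit is a delimiter mismatch: a
## table created with CREATE TABLE AS SELECT and no ROW FORMAT clause stores
## its files with Hive's default \001 delimiter, and sqoop export's documented
## input-parsing flag is --input-fields-terminated-by. A hedged variant of the
## command under that assumption:
sqoop export \
--connect jdbc:mysql://localhost:3306/hwzdb \
--username root \
--password 123456 \
--table area_product_click_top3 \
--columns day,area,product_id,product_name,sumClick,rank \
--export-dir /user/hive/warehouse/hwzhdb.db/area_product_click_top3 \
--input-fields-terminated-by '\001' \
-m 2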
## Check the final metric results in MySQL:
mysql> select * from area_product_click_top3;
+------------+------+------------+--------------+----------+------+
| day        | area | product_id | product_name | sumclick | rank |
+------------+------+------------+--------------+----------+------+
| 2016-05-05 | CC   |         26 | product26    |       39 |    1 |
| 2016-05-05 | CC   |          7 | product7     |       39 |    2 |
| 2016-05-05 | CC   |         70 | product70    |       38 |    3 |
| 2016-05-05 | EC   |          4 | product4     |       40 |    1 |
| 2016-05-05 | EC   |         96 | product96    |       32 |    2 |
| 2016-05-05 | EC   |         99 | product99    |       31 |    3 |
| 2016-05-05 | NC   |         40 | product40    |       16 |    1 |
| 2016-05-05 | NC   |          9 | product9     |       16 |    2 |
| 2016-05-05 | NC   |          5 | product5     |       13 |    3 |
| 2016-05-05 | NW   |         56 | product56    |       20 |    1 |
| 2016-05-05 | NW   |         67 | product67    |       20 |    2 |
| 2016-05-05 | NW   |         48 | product48    |       19 |    3 |
| 2016-05-05 | SC   |         38 | product38    |       35 |    1 |
| 2016-05-05 | SC   |         98 | product98    |       34 |    2 |
| 2016-05-05 | SC   |         33 | product33    |       34 |    3 |
| 2016-05-05 | SW   |         16 | product16    |       20 |    1 |
| 2016-05-05 | SW   |         95 | product95    |       19 |    2 |
| 2016-05-05 | SW   |         60 | product60    |       19 |    3 |
+------------+------+------------+--------------+----------+------+
18 rows in set (0.00 sec)
Attention:
We hardcoded the date when building area_product_click_top3 in Hive. In production that is not acceptable; the job should obtain the run date and pass it in as a parameter, as sketched below.
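## A minimal sketch of parameterizing the date. The script, the choice of
## "yesterday", and the hivevar name are assumptions, not from the original:
#!/bin/bash
# Compute the date of the partition to process (yesterday; GNU date syntax).
DO_DATE=$(date -d '1 day ago' +%Y-%m-%d)

hive -hivevar do_date="${DO_DATE}" -e "
insert into table area_product_click_top3
select '\${hivevar:do_date}', area, product_id, product_name, sumClick, rank
from (
  select area, product_name, product_id, sumClick,
         row_number() over (partition by area order by sumClick desc) as rank
  from tmp_area_product_name_sumclick
) tmp
where rank <= 3;
"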