Requirement analysis: find the top 3 most popular products in each area.
Suppose MySQL holds two tables: city_info (city information) and product_info (product information).
city_info:
city_id: city ID
city_name: city name
area: region (East China, North China, South China, Central China, and so on)
mysql> select * from city_info;
+---------+-----------+------+
| city_id | city_name | area |
+---------+-----------+------+
|       1 | BEIJING   | NC   |
|       2 | SHANGHAI  | EC   |
|       3 | NANJING   | EC   |
|       4 | GUANGZHOU | SC   |
|       5 | SANYA     | SC   |
|       6 | WUHAN     | CC   |
|       7 | CHANGSHA  | CC   |
|       8 | XIAN      | NW   |
|       9 | CHENGDU   | SW   |
|      10 | HAERBIN   | NE   |
+---------+-----------+------+
10 rows in set (0.00 sec)
product_info:
product_id: product ID
product_name: product name
extend_info: product status, stored as a JSON string
mysql> select * from product_info;
+------------+--------------+----------------------+
| product_id | product_name | extend_info          |
+------------+--------------+----------------------+
|          1 | product1     | {"product_status":1} |
|          2 | product2     | {"product_status":1} |
|          3 | product3     | {"product_status":1} |
|          4 | product4     | {"product_status":1} |
|          5 | product5     | {"product_status":1} |
|          6 | product6     | {"product_status":1} |
|          7 | product7     | {"product_status":1} |
|          8 | product8     | {"product_status":1} |
|          9 | product9     | {"product_status":0} |
|         10 | product10    | {"product_status":1} |
|         11 | product11    | {"product_status":0} |
|         12 | product12    | {"product_status":0} |
|         13 | product13    | {"product_status":0} |
|         14 | product14    | {"product_status":0} |
|         15 | product15    | {"product_status":1} |
|         16 | product16    | {"product_status":0} |
|         17 | product17    | {"product_status":1} |
|         18 | product18    | {"product_status":0} |
|         19 | product19    | {"product_status":1} |
|         20 | product20    | {"product_status":1} |
|         21 | product21    | {"product_status":0} |
|         22 | product22    | {"product_status":0} |
|         23 | product23    | {"product_status":0} |
|         24 | product24    | {"product_status":0} |
|         25 | product25    | {"product_status":1} |
|         26 | product26    | {"product_status":1} |
|         27 | product27    | {"product_status":0} |
|         28 | product28    | {"product_status":1} |
|         29 | product29    | {"product_status":0} |
|         30 | product30    | {"product_status":0} |
|         31 | product31    | {"product_status":0} |
|         32 | product32    | {"product_status":0} |
|         33 | product33    | {"product_status":1} |
|         34 | product34    | {"product_status":1} |
|         35 | product35    | {"product_status":0} |
|         36 | product36    | {"product_status":0} |
|         37 | product37    | {"product_status":1} |
|         38 | product38    | {"product_status":0} |
|         39 | product39    | {"product_status":0} |
|         40 | product40    | {"product_status":1} |
|         41 | product41    | {"product_status":1} |
|         42 | product42    | {"product_status":1} |
|         43 | product43    | {"product_status":1} |
|         44 | product44    | {"product_status":0} |
|         45 | product45    | {"product_status":1} |
|         46 | product46    | {"product_status":1} |
|         47 | product47    | {"product_status":0} |
|         48 | product48    | {"product_status":1} |
|         49 | product49    | {"product_status":1} |
|         50 | product50    | {"product_status":1} |
|         51 | product51    | {"product_status":1} |
|         52 | product52    | {"product_status":0} |
|         53 | product53    | {"product_status":0} |
|         54 | product54    | {"product_status":1} |
|         55 | product55    | {"product_status":0} |
|         56 | product56    | {"product_status":0} |
|         57 | product57    | {"product_status":1} |
|         58 | product58    | {"product_status":1} |
|         59 | product59    | {"product_status":1} |
|         60 | product60    | {"product_status":1} |
|         61 | product61    | {"product_status":0} |
|         62 | product62    | {"product_status":1} |
|         63 | product63    | {"product_status":1} |
|         64 | product64    | {"product_status":0} |
|         65 | product65    | {"product_status":0} |
|         66 | product66    | {"product_status":1} |
|         67 | product67    | {"product_status":1} |
|         68 | product68    | {"product_status":0} |
|         69 | product69    | {"product_status":1} |
|         70 | product70    | {"product_status":0} |
|         71 | product71    | {"product_status":0} |
|         72 | product72    | {"product_status":0} |
|         73 | product73    | {"product_status":1} |
|         74 | product74    | {"product_status":0} |
|         75 | product75    | {"product_status":1} |
|         76 | product76    | {"product_status":0} |
|         77 | product77    | {"product_status":0} |
|         78 | product78    | {"product_status":1} |
|         79 | product79    | {"product_status":0} |
|         80 | product80    | {"product_status":0} |
|         81 | product81    | {"product_status":0} |
|         82 | product82    | {"product_status":1} |
|         83 | product83    | {"product_status":1} |
|         84 | product84    | {"product_status":1} |
|         85 | product85    | {"product_status":0} |
|         86 | product86    | {"product_status":1} |
|         87 | product87    | {"product_status":1} |
|         88 | product88    | {"product_status":1} |
|         89 | product89    | {"product_status":1} |
|         90 | product90    | {"product_status":1} |
|         91 | product91    | {"product_status":1} |
|         92 | product92    | {"product_status":0} |
|         93 | product93    | {"product_status":0} |
|         94 | product94    | {"product_status":1} |
|         95 | product95    | {"product_status":0} |
|         96 | product96    | {"product_status":0} |
|         97 | product97    | {"product_status":1} |
|         98 | product98    | {"product_status":1} |
|         99 | product99    | {"product_status":0} |
|        100 | product100   | {"product_status":1} |
+------------+--------------+----------------------+
100 rows in set (0.00 sec)
Hive holds the user click logs we collected, in a user_click table.
user_click:
user_id: user ID
session_id: session ID
action_time: time of the click
city_id: city ID
product_id: product ID
[hadoop@hadoop001 data]$ hdfs dfs -tail /data/user_click.txt
19/07/21 10:38:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
9:23,9,22
53,770c325ace1240d88387fa19931acb72,2016-05-05 10:17:16,9,83
53,770c325ace1240d88387fa19931acb72,2016-05-05 10:25:42,9,54
53,770c325ace1240d88387fa19931acb72,2016-05-05 10:48:33,9,28
53,770c325ace1240d88387fa19931acb72,2016-05-05 10:07:19,9,72
53,770c325ace1240d88387fa19931acb72,2016-05-05 10:00:26,9,87
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:35:01,9,95
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:55:19,9,13
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:34:25,9,24
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:26:50,9,69
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:49:43,9,37
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:55:46,9,96
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:24:13,9,67
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:06:52,9,29
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:02:52,9,92
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:19:27,9,12
53,561e94c2fa99467d956a26d55aab3cb6,2016-05-05 3:10:07,9,18
## Create an external, partitioned Hive table, pointing LOCATION at our data directory. Because we are simulating daily user click logs, the partition column is the date.
hive (hwzhdb)> create external table user_click(
> user_id int,
> session_id string,
> action_time string,
> city_id int,
> product_id int
> ) partitioned by (date string)
> row format delimited fields terminated by ','
> location 'hdfs://hadoop001:9000/data';
OK
Time taken: 0.331 seconds
## Load the data into a partition. All of our records are from 2016-05-05, so that is the partition value.
hive (hwzhdb)> load data inpath 'hdfs://hadoop001:9000/data/user_click.txt' into table user_click partition(date='2016-05-05');
Loading data to table hwzhdb.user_click partition (date=2016-05-05)
Partition hwzhdb.user_click{date=2016-05-05} stats: [numFiles=1, numRows=0, totalSize=725264, rawDataSize=0]
OK
Time taken: 1.402 seconds
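## Note: LOAD DATA INPATH moves user_click.txt into the partition directory
## (/data/date=2016-05-05/). An alternative sketch, assuming you first move the
## file under a partition-style path yourself with hdfs dfs -mv, is to simply
## register the partition:
alter table user_click add if not exists partition (date='2016-05-05')
location 'hdfs://hadoop001:9000/data/date=2016-05-05';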
hive (hwzhdb)> select * from user_click limit 5;
OK
user_click.user_id  user_click.session_id             user_click.action_time  user_click.city_id  user_click.product_id  user_click.date
95                  2bf501a7637549c89cf55342331b15db  2016-05-05 21:01:56     1                   72                     2016-05-05
95                  2bf501a7637549c89cf55342331b15db  2016-05-05 21:52:26     1                   68                     2016-05-05
95                  2bf501a7637549c89cf55342331b15db  2016-05-05 21:17:03     1                   40                     2016-05-05
95                  2bf501a7637549c89cf55342331b15db  2016-05-05 21:32:07     1                   21                     2016-05-05
95                  2bf501a7637549c89cf55342331b15db  2016-05-05 21:26:06     1                   63                     2016-05-05
Time taken: 0.209 seconds, Fetched: 5 row(s)
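## A quick way to confirm the partition was registered:
show partitions user_click;
## expected output: date=2016-05-05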
!!! Analysis: to get the top 3 per area, we need to join city_info, product_info, and user_click, group by city_info.area together with product_info's product_id and product_name, count the clicks per group, and then use row_number() to keep the top 3 rows within each area.
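## Before worrying about where the tables live, it helps to see the shape of
## the query we are aiming for. A minimal sketch, assuming all three tables
## were queryable in one place (Hive evaluates window functions after GROUP BY,
## so the aggregate can feed row_number() directly):
select area, product_id, product_name, sumClick, rank
from (
    select ci.area, pi.product_id, pi.product_name,
           count(1) as sumClick,
           row_number() over (partition by ci.area
                              order by count(1) desc) as rank
    from user_click uc
    join city_info ci on uc.city_id = ci.city_id
    join product_info pi on uc.product_id = pi.product_id
    group by ci.area, pi.product_id, pi.product_name
) t
where rank <= 3;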
But a Hive query cannot join directly against tables that live in MySQL, so we first import the MySQL data into Hive with Sqoop.
Create city_info and product_info tables in Hive that match the MySQL schemas. (Sqoop's --create-hive-table flag can create the table automatically, but the generated layout may not be what we want, so we create the tables ourselves beforehand.)
hive (hwzhdb)> create table city_info(
> city_id int,
> city_name string,
> area string
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 0.528 seconds
hive (hwzhdb)> create table product_info(
> product_id int,
> product_name string,
> extend_info string
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 0.12 seconds
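## extend_info is a JSON string. This demo never parses it, but once the data
## is imported, Hive's built-in get_json_object can extract the status if it
## is ever needed; a small sketch:
select product_id,
       get_json_object(extend_info, '$.product_status') as product_status
from product_info
limit 5;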
## Then import the MySQL table data into the Hive tables with Sqoop.
## Flag notes:
##   --hive-import: the switch that tells Sqoop to load the data into Hive
##   --hive-overwrite: overwrite the Hive table's data on every run
##   --delete-target-dir: delete the HDFS target directory first if it already exists
##   --fields-terminated-by: the Hive-side delimiter; it must match the one the table was created with
##   --split-by: the column used to split the input into -m chunks; Sqoop splits on the primary key by default, so this is needed when the MySQL table has no primary key
##   -m: number of map tasks used for the import (the default is 4)
city_info:
sqoop import \
--connect jdbc:mysql://localhost:3306/hwzdb \
--username root \
--password 123456 \
--table city_info \
--hive-database hwzhdb \
--hive-table city_info \
--hive-import \
--hive-overwrite \
--delete-target-dir \
--fields-terminated-by '\t' \
--split-by city_id \
-m 2
product_info:
sqoop import \
--connect jdbc:mysql://localhost:3306/hwzdb \
--username root \
--password 123456 \
--table product_info \
--hive-database hwzhdb \
--hive-table product_info \
--hive-import \
--hive-overwrite \
--delete-target-dir \
--fields-terminated-by '\t' \
--split-by product_id \
-m 2
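## After both imports it is worth sanity-checking the row counts against MySQL
## (10 and 100 rows respectively); a delimiter or split-by problem shows up
## here immediately:
select count(*) from city_info;    -- expect 10
select count(*) from product_info; -- expect 100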
Now join user_click with city_info and product_info in Hive, compute the total clicks per area and product, and then use a window function to pick the top 3 products in each area.
## Create a staging table holding the total clicks per area and product
hive (hwzhdb)> create table tmp_area_product_name_sumclick as
> select ci.area, pi.product_id, pi.product_name, count(1) sumClick from user_click uc
> join city_info ci on uc.city_id = ci.city_id
> join product_info pi on pi.product_id = uc.product_id
> group by ci.area, pi.product_id, pi.product_name;
... (MapReduce job output omitted) ...
## Then use a window function to pick the top 3 products in each area
hive (hwzhdb)> create table area_product_click_top3 as
> select '2016-05-05' as day, area, product_id, product_name, sumClick, rank from
> (select area, product_name, product_id, sumClick, row_number() over(partition by area order by sumClick desc) as rank from tmp_area_product_name_sumclick) tmp
> where rank <= 3;
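## Note: row_number() breaks ties arbitrarily (in the result below, CC has two
## products with 39 clicks, ranked 1 and 2). If tied products should share a
## rank, swap in rank() or dense_rank(); a sketch of the inner query:
select area, product_id, product_name, sumClick,
       dense_rank() over (partition by area order by sumClick desc) as rank
from tmp_area_product_name_sumclick;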
## Check the final result:
hive (hwzhdb)> select * from area_product_click_top3;
OK
area_product_click_top3.day area_product_click_top3.area area_product_click_top3.product_id area_product_click_top3.product_name area_product_click_top3.sumclick area_product_click_top3.rank
2016-05-05  CC     26  product26   39  1
2016-05-05  CC      7  product7    39  2
2016-05-05  CC     70  product70   38  3
2016-05-05  EC      4  product4    40  1
2016-05-05  EC     96  product96   32  2
2016-05-05  EC     99  product99   31  3
2016-05-05  NC     40  product40   16  1
2016-05-05  NC      9  product9    16  2
2016-05-05  NC      5  product5    13  3
2016-05-05  NW     56  product56   20  1
2016-05-05  NW     67  product67   20  2
2016-05-05  NW     48  product48   19  3
2016-05-05  SC     38  product38   35  1
2016-05-05  SC     98  product98   34  2
2016-05-05  SC     33  product33   34  3
2016-05-05  SW     16  product16   20  1
2016-05-05  SW     95  product95   19  2
2016-05-05  SW     60  product60   19  3
Time taken: 0.077 seconds, Fetched: 18 row(s)
!!! The final metrics are meant to be displayed on a frontend page, so they need to be exported to MySQL. First create the MySQL table, then push the data over with sqoop export.
## Create the MySQL table (note: RANK became a reserved word in MySQL 8.0.2, so on MySQL 8+ the rank column below would need backticks)
drop table if exists area_product_click_top3;
create table area_product_click_top3(
day varchar(20) comment 'this metric is recomputed regularly, so a date column distinguishes the daily runs',
area varchar(20),
product_id int,
product_name varchar(30),
sumclick int,
rank int
);
## Export the Hive table's HDFS files to MySQL with Sqoop. The delimiter passed
## to Sqoop must match the files on disk; see the note after the command.
sqoop export \
--connect jdbc:mysql://localhost:3306/hwzdb \
--username root \
--password 123456 \
--table area_product_click_top3 \
--columns day,area,product_id,product_name,sumClick,rank \
--mapreduce-job-name top3Tomysql \
--export-dir /user/hive/warehouse/hwzhdb.db/area_product_click_top3 \
--fields-terminated-by '\t' \
-m 2
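## If the export mis-parses rows, the usual culprit is a delimiter mismatch: a
## table created with CREATE TABLE AS SELECT and no ROW FORMAT clause stores
## its files with Hive's default \001 delimiter, and sqoop export's documented
## input-parsing flag is --input-fields-terminated-by. A hedged variant of the
## command under that assumption:
sqoop export \
--connect jdbc:mysql://localhost:3306/hwzdb \
--username root \
--password 123456 \
--table area_product_click_top3 \
--columns day,area,product_id,product_name,sumClick,rank \
--export-dir /user/hive/warehouse/hwzhdb.db/area_product_click_top3 \
--input-fields-terminated-by '\001' \
-m 2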
## Check the final metric results in MySQL:
mysql> select * from area_product_click_top3;
+------------+------+------------+--------------+----------+------+
| day        | area | product_id | product_name | sumclick | rank |
+------------+------+------------+--------------+----------+------+
| 2016-05-05 | CC   |         26 | product26    |       39 |    1 |
| 2016-05-05 | CC   |          7 | product7     |       39 |    2 |
| 2016-05-05 | CC   |         70 | product70    |       38 |    3 |
| 2016-05-05 | EC   |          4 | product4     |       40 |    1 |
| 2016-05-05 | EC   |         96 | product96    |       32 |    2 |
| 2016-05-05 | EC   |         99 | product99    |       31 |    3 |
| 2016-05-05 | NC   |         40 | product40    |       16 |    1 |
| 2016-05-05 | NC   |          9 | product9     |       16 |    2 |
| 2016-05-05 | NC   |          5 | product5     |       13 |    3 |
| 2016-05-05 | NW   |         56 | product56    |       20 |    1 |
| 2016-05-05 | NW   |         67 | product67    |       20 |    2 |
| 2016-05-05 | NW   |         48 | product48    |       19 |    3 |
| 2016-05-05 | SC   |         38 | product38    |       35 |    1 |
| 2016-05-05 | SC   |         98 | product98    |       34 |    2 |
| 2016-05-05 | SC   |         33 | product33    |       34 |    3 |
| 2016-05-05 | SW   |         16 | product16    |       20 |    1 |
| 2016-05-05 | SW   |         95 | product95    |       19 |    2 |
| 2016-05-05 | SW   |         60 | product60    |       19 |    3 |
+------------+------+------------+--------------+----------+------+
18 rows in set (0.00 sec)
Attention:
We hardcoded the date when building area_product_click_top3 in Hive. In production that is not acceptable; the job should obtain the run date and pass it in as a parameter, as sketched below.
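## A minimal sketch of parameterizing the date. The script, the choice of
## "yesterday", and the hivevar name are assumptions, not from the original:
#!/bin/bash
# Compute the date of the partition to process (yesterday; GNU date syntax).
DO_DATE=$(date -d '1 day ago' +%Y-%m-%d)

hive -hivevar do_date="${DO_DATE}" -e "
insert into table area_product_click_top3
select '\${hivevar:do_date}', area, product_id, product_name, sumClick, rank
from (
  select area, product_name, product_id, sumClick,
         row_number() over (partition by area order by sumClick desc) as rank
  from tmp_area_product_name_sumclick
) tmp
where rank <= 3;
"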