Today I ran into a problem: given one day's partition of a Hive table, find the top 5 user_ids by number of distinct videos watched.
Problem statement:
Given a user viewing log (date, user_id, video_id), find the top 5 user_ids by number of distinct videos watched on 2020.06.03. (Assume a large dataset.)
First, create the table and load some sample data:
CREATE TABLE IF NOT EXISTS user_video_log (
    user_id  BIGINT,
    video_id BIGINT
) PARTITIONED BY (
    pt STRING
);
INSERT OVERWRITE TABLE user_video_log PARTITION( pt = '20200603')
VALUES
(1,1)
,(1,1)
,(1,2)
,(1,3)
,(1,4)
,(1,5)
,(1,6)
,(2,1)
,(2,2)
,(2,2)
,(2,3)
,(2,4)
,(2,5)
,(3,1)
,(4,1)
,(5,1)
,(6,1)
,(7,1)
,(7,1)
,(7,1)
,(7,2)
,(7,2)
,(7,3)
,(7,4)
,(7,4)
;
Note
There are two pitfalls here.
1. COUNT(DISTINCT)
Why the optimization works
First, why is GROUP BY followed by COUNT more efficient than a direct COUNT(DISTINCT ...) on a large dataset?

COUNT(DISTINCT ...) packs the relevant columns into a single key that is shipped to one reducer, i.e. count(DISTINCT KEY._col0:0._col0, KEY._col0:0._col1, KEY._col0:0._col2). That one reducer then has to perform the full sort and deduplication by itself.

With GROUP BY first and COUNT afterwards, GROUP BY distributes the distinct keys across multiple reducers, and deduplication is completed during the GROUP BY step. The data is never concentrated on a single reducer, so deduplication benefits from the parallelism of the cluster. The subsequent COUNT step then simply counts the keys that survived the GROUP BY.

So on a large dataset, GROUP BY followed by COUNT is more efficient than COUNT(DISTINCT).
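The difference can be sketched with a toy Python model (not Hive internals; the hash partitioning and the reducer count are illustrative assumptions):

```python
from collections import defaultdict

rows = [(1, 1), (1, 1), (1, 2), (2, 1), (2, 2), (2, 2), (3, 1)]
NUM_REDUCERS = 3

def count_distinct_per_user(rows):
    """COUNT(DISTINCT video_id): all rows for a user land on one reducer,
    which must hold and deduplicate that user's entire video set itself."""
    seen = defaultdict(set)
    for user_id, video_id in rows:
        seen[user_id].add(video_id)            # dedup concentrated per key
    return {u: len(v) for u, v in seen.items()}

def group_by_then_count(rows):
    """GROUP BY user_id, video_id first: the composite key spreads rows
    across reducers, each deduplicating only its own shard; a second pass
    then merely counts the surviving (user_id, video_id) pairs."""
    shards = defaultdict(set)
    for user_id, video_id in rows:
        shard = hash((user_id, video_id)) % NUM_REDUCERS
        shards[shard].add((user_id, video_id))  # distributed dedup
    counts = defaultdict(int)
    for shard in shards.values():
        for user_id, _ in shard:
            counts[user_id] += 1                # cheap final count
    return dict(counts)

print(count_distinct_per_user(rows))   # both strategies agree
print(group_by_then_count(rows))
```

Both functions return the same counts; the point is where the deduplication work happens.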
2. Global sorting with ORDER BY
ORDER BY performs a global sort, which funnels all the data into a single reduce task!
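A back-of-the-envelope comparison (illustrative numbers only) of how many rows the final single task must handle under each strategy:

```python
# Global ORDER BY ships every row to one reducer; SORT BY ... LIMIT k
# ships only k candidate rows per reducer to the final merge stage.
ROWS = 1_000_000_000     # total rows (assumed)
REDUCERS = 5
K = 5                    # LIMIT 5

order_by_final = ROWS            # everything lands on one task
sort_by_final = REDUCERS * K     # 25 rows, regardless of ROWS

print(order_by_final, sort_by_final)
```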
The problematic query
SELECT user_id, differ_video
FROM
(
SELECT user_id, COUNT(DISTINCT video_id) AS differ_video
FROM
user_video_log
WHERE pt = '20200603'
GROUP BY user_id
) tmp
ORDER BY differ_video DESC
LIMIT 5;
Execution plan
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: user_video_log |
| filterExpr: (pt = '20200603') (type: boolean) |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), video_id (type: bigint) |
| outputColumnNames: user_id, video_id |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(DISTINCT video_id) |
| keys: user_id (type: bigint), video_id (type: bigint) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint), _col1 (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(DISTINCT KEY._col1:0._col0) |
| keys: KEY._col0 (type: bigint) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 12 Data size: 192 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col1 (type: bigint) |
| sort order: - |
| Statistics: Num rows: 12 Data size: 192 Basic stats: COMPLETE Column stats: NONE |
| TopN Hash Memory Usage: 0.1 |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 12 Data size: 192 Basic stats: COMPLETE Column stats: NONE |
| Limit |
| Number of rows: 5 |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: 5 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
Output
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 16.48 sec HDFS Read: 30865 HDFS Write: 613 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.44 sec HDFS Read: 7131 HDFS Write: 167 SUCCESS
INFO : Total MapReduce CPU Time Spent: 20 seconds 920 msec
INFO : Completed executing command(queryId=hive_20200603213400_312b6a86-0b0f-446d-8dfe-f3c7672d9845); Time taken: 78.274 seconds
INFO : OK
+----------+---------------+
| user_id | differ_video |
+----------+---------------+
| 1 | 6 |
| 2 | 5 |
| 7 | 4 |
| 6 | 1 |
| 4 | 1 |
+----------+---------------+
Note: even though 5 reduce tasks were configured, the global sort still ran on a single task (Stage-2: Reduce: 1).
The optimized queries
use data_warehouse_test;
set mapred.reduce.tasks=5;
SELECT user_id, video_id
FROM
user_video_log
WHERE pt = '20200603'
SORT BY user_id
;
EXPLAIN
SELECT user_id, differ_video
FROM
(
SELECT user_id, COUNT(1) AS differ_video
FROM
(
SELECT user_id, video_id
FROM
(
SELECT *
FROM user_video_log
WHERE pt = '20200603'
) tmp
GROUP BY user_id, video_id
) tmp2
GROUP BY user_id
) tmp3
SORT BY differ_video DESC
LIMIT 5
;
SELECT user_id, differ_video
FROM
(
SELECT user_id, COUNT(1) AS differ_video
FROM
(
SELECT user_id, video_id
FROM
(
SELECT *
FROM user_video_log
WHERE pt = '20200603'
) tmp
GROUP BY user_id, video_id
) tmp2
GROUP BY user_id
) tmp3
SORT BY differ_video DESC
LIMIT 5
;
This run has three parts:
part1: verify that SORT BY output is ordered within segments
part2: the execution plan for the top-5 job
part3: run the top-5 job
part1 output
+----------+-----------+
| user_id | video_id |
+----------+-----------+
| 2 | 5 |
| 3 | 1 |
| 1 | 6 |
| 2 | 4 |
| 7 | 1 |
| 1 | 1 |
| 1 | 4 |
| 4 | 1 |
| 7 | 1 |
| 1 | 1 |
| 1 | 3 |
| 2 | 2 |
| 2 | 2 |
| 5 | 1 |
| 7 | 1 |
| 7 | 2 |
| 1 | 5 |
| 1 | 2 |
| 2 | 3 |
| 2 | 1 |
| 6 | 1 |
| 7 | 3 |
| 7 | 2 |
| 7 | 4 |
| 7 | 4 |
+----------+-----------+
The output is ordered within each of the 5 segments (one per reducer), though not globally.
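The SORT BY behavior can be sketched as follows (a toy model, not Hive internals: the partitioning function is simplified to user_id % 5, and each "reducer" sorts only its own slice):

```python
# Rows are partitioned by key, each reducer sorts its own slice, and the
# final output is just the concatenation of the slices.
rows = [(u, v) for u in range(1, 8) for v in range(1, 4)]
NUM_REDUCERS = 5

partitions = [[] for _ in range(NUM_REDUCERS)]
for row in rows:
    partitions[row[0] % NUM_REDUCERS].append(row)

output = []
for part in partitions:
    part.sort()          # per-reducer sort only
    output.extend(part)

# Ordered within each segment, but not globally:
assert all(part == sorted(part) for part in partitions)
assert output != sorted(output)
```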
part2 output
The job is now split into four stages:
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-3 depends on stages: Stage-2 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: user_video_log |
| filterExpr: (pt = '20200603') (type: boolean) |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), video_id (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: bigint), _col1 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint), _col1 (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| keys: KEY._col0 (type: bigint), KEY._col1 (type: bigint) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 12 Data size: 192 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 12 Data size: 192 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(1) |
| keys: _col0 (type: bigint) |
| mode: complete |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 96 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col1 (type: bigint) |
| sort order: - |
| Statistics: Num rows: 6 Data size: 96 Basic stats: COMPLETE Column stats: NONE |
| TopN Hash Memory Usage: 0.1 |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 96 Basic stats: COMPLETE Column stats: NONE |
| Limit |
| Number of rows: 5 |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col1 (type: bigint) |
| sort order: - |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| TopN Hash Memory Usage: 0.1 |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| Limit |
| Number of rows: 5 |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: 5 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
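The Stage-2/Stage-3 pattern above (each reducer emits its local top 5 by count, then one final reducer merges the small candidate set) can be sketched in Python; the counts mirror this example's data, but the user_id % 5 partitioning is an assumption, not what Hive actually hashes:

```python
import heapq

user_counts = {1: 6, 2: 5, 3: 1, 4: 1, 5: 1, 6: 1, 7: 4}
NUM_REDUCERS = 5
TOP_N = 5

# Stage-2: local top-N per reducer; only candidates travel on.
shards = [[] for _ in range(NUM_REDUCERS)]
for user_id, cnt in user_counts.items():
    shards[user_id % NUM_REDUCERS].append((cnt, user_id))
candidates = []
for shard in shards:
    candidates.extend(heapq.nlargest(TOP_N, shard))

# Stage-3: a single reducer merges at most NUM_REDUCERS * TOP_N rows.
top5 = heapq.nlargest(TOP_N, candidates)
print(top5)   # leaders: users 1, 2, 7, then ties at count 1
```

The final single-reducer stage is cheap because it sees at most 25 candidate rows, not the whole dataset.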
part3 results
Output from two runs:
INFO : Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 18.41 sec HDFS Read: 31097 HDFS Write: 613 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 5 Cumulative CPU: 11.06 sec HDFS Read: 14785 HDFS Write: 575 SUCCESS
INFO : Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 3.68 sec HDFS Read: 6923 HDFS Write: 167 SUCCESS
INFO : Total MapReduce CPU Time Spent: 33 seconds 150 msec
INFO : Completed executing command(queryId=hive_20200603211837_67bfdd47-82e1-4dd6-81a8-456e80663079); Time taken: 127.029 seconds
INFO : OK
+----------+---------------+
| user_id | differ_video |
+----------+---------------+
| 1 | 6 |
| 2 | 5 |
| 7 | 4 |
| 6 | 1 |
| 4 | 1 |
+----------+---------------+
5 rows selected (127.212 seconds)
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 16.01 sec HDFS Read: 31097 HDFS Write: 613 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 5 Cumulative CPU: 10.6 sec HDFS Read: 14785 HDFS Write: 575 SUCCESS
INFO : Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 4.45 sec HDFS Read: 6923 HDFS Write: 167 SUCCESS
INFO : Total MapReduce CPU Time Spent: 31 seconds 60 msec
INFO : Completed executing command(queryId=hive_20200603213610_f0088785-9727-48ee-81bd-7bbcf531b486); Time taken: 122.529 seconds
INFO : OK
+----------+---------------+
| user_id | differ_video |
+----------+---------------+
| 1 | 6 |
| 2 | 5 |
| 7 | 4 |
| 6 | 1 |
| 4 | 1 |
+----------+---------------+