Today I ran into a problem: given one day's partition of a Hive table, find the top 5 user_ids by number of distinct videos watched.
Problem statement:
Given a user viewing log (date, user_id, video_id), find the top 5 user_ids by number of distinct videos watched on 2020.06.03. (Assume a large dataset.)
First, create the table and load some sample data:
CREATE TABLE IF NOT EXISTS user_video_log (
    user_id  BIGINT,
    video_id BIGINT
) PARTITIONED BY (
    pt STRING
);
INSERT OVERWRITE TABLE user_video_log PARTITION( pt = '20200603')
VALUES
(1,1)
,(1,1)
,(1,2)
,(1,3)
,(1,4)
,(1,5)
,(1,6)
,(2,1)
,(2,2)
,(2,2)
,(2,3)
,(2,4)
,(2,5)
,(3,1)
,(4,1)
,(5,1)
,(6,1)
,(7,1)
,(7,1)
,(7,1)
,(7,2)
,(7,2)
,(7,3)
,(7,4)
,(7,4)
;
Note
There are two pitfalls here.
1. COUNT(DISTINCT)
Why the optimization works
First, why is GROUP BY followed by COUNT more efficient than a direct COUNT(DISTINCT ...) on a large dataset?

COUNT(DISTINCT ...) packs the relevant columns into a single key that is shipped to one reducer, i.e. count(DISTINCT KEY._col0:0._col0, KEY._col0:0._col1, KEY._col0:0._col2). That one reducer then has to perform the full sort and deduplication by itself.

With GROUP BY first and COUNT afterwards, GROUP BY distributes the distinct keys across multiple reducers, and deduplication is completed during the GROUP BY step. The data is never concentrated on a single reducer, so deduplication benefits from the parallelism of the cluster. The subsequent COUNT step then simply counts the keys that survived the GROUP BY.

So on a large dataset, GROUP BY followed by COUNT is more efficient than COUNT(DISTINCT).
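The difference can be sketched with a toy Python model (not Hive internals; the hash partitioning and the reducer count are illustrative assumptions):

```python
from collections import defaultdict

rows = [(1, 1), (1, 1), (1, 2), (2, 1), (2, 2), (2, 2), (3, 1)]
NUM_REDUCERS = 3

def count_distinct_per_user(rows):
    """COUNT(DISTINCT video_id): all rows for a user land on one reducer,
    which must hold and deduplicate that user's entire video set itself."""
    seen = defaultdict(set)
    for user_id, video_id in rows:
        seen[user_id].add(video_id)            # dedup concentrated per key
    return {u: len(v) for u, v in seen.items()}

def group_by_then_count(rows):
    """GROUP BY user_id, video_id first: the composite key spreads rows
    across reducers, each deduplicating only its own shard; a second pass
    then merely counts the surviving (user_id, video_id) pairs."""
    shards = defaultdict(set)
    for user_id, video_id in rows:
        shard = hash((user_id, video_id)) % NUM_REDUCERS
        shards[shard].add((user_id, video_id))  # distributed dedup
    counts = defaultdict(int)
    for shard in shards.values():
        for user_id, _ in shard:
            counts[user_id] += 1                # cheap final count
    return dict(counts)

print(count_distinct_per_user(rows))   # both strategies agree
print(group_by_then_count(rows))
```

Both functions return the same counts; the point is where the deduplication work happens.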
2. Global sorting with ORDER BY
ORDER BY performs a global sort, which funnels all the data into a single reduce task!
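A back-of-the-envelope comparison (illustrative numbers only) of how many rows the final single task must handle under each strategy:

```python
# Global ORDER BY ships every row to one reducer; SORT BY ... LIMIT k
# ships only k candidate rows per reducer to the final merge stage.
ROWS = 1_000_000_000     # total rows (assumed)
REDUCERS = 5
K = 5                    # LIMIT 5

order_by_final = ROWS            # everything lands on one task
sort_by_final = REDUCERS * K     # 25 rows, regardless of ROWS

print(order_by_final, sort_by_final)
```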
The problematic query
SELECT user_id, differ_video
FROM
(
SELECT user_id, COUNT(DISTINCT video_id) AS differ_video
FROM
user_video_log
WHERE pt = '20200603'
GROUP BY user_id
) tmp
ORDER BY differ_video DESC
LIMIT 5;
Execution plan
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: user_video_log |
| filterExpr: (pt = '20200603') (type: boolean) |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), video_id (type: bigint) |
| outputColumnNames: user_id, video_id |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(DISTINCT video_id) |
| keys: user_id (type: bigint), video_id (type: bigint) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint), _col1 (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(DISTINCT KEY._col1:0._col0) |
| keys: KEY._col0 (type: bigint) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 12 Data size: 192 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col1 (type: bigint) |
| sort order: - |
| Statistics: Num rows: 12 Data size: 192 Basic stats: COMPLETE Column stats: NONE |
| TopN Hash Memory Usage: 0.1 |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 12 Data size: 192 Basic stats: COMPLETE Column stats: NONE |
| Limit |
| Number of rows: 5 |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: 5 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
Output
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 16.48 sec HDFS Read: 30865 HDFS Write: 613 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.44 sec HDFS Read: 7131 HDFS Write: 167 SUCCESS
INFO : Total MapReduce CPU Time Spent: 20 seconds 920 msec
INFO : Completed executing command(queryId=hive_20200603213400_312b6a86-0b0f-446d-8dfe-f3c7672d9845); Time taken: 78.274 seconds
INFO : OK
+----------+---------------+
| user_id | differ_video |
+----------+---------------+
| 1 | 6 |
| 2 | 5 |
| 7 | 4 |
| 6 | 1 |
| 4 | 1 |
+----------+---------------+
Note: even though 5 reduce tasks were configured, the global sort still ran on a single task (Stage-2: Reduce: 1).
The optimized queries
use data_warehouse_test;
set mapred.reduce.tasks=5;
SELECT user_id, video_id
FROM
user_video_log
WHERE pt = '20200603'
SORT BY user_id
;
EXPLAIN
SELECT user_id, differ_video
FROM
(
SELECT user_id, COUNT(1) AS differ_video
FROM
(
SELECT user_id, video_id
FROM
(
SELECT *
FROM user_video_log
WHERE pt = '20200603'
) tmp
GROUP BY user_id, video_id
) tmp2
GROUP BY user_id
) tmp3
SORT BY differ_video DESC
LIMIT 5
;
SELECT user_id, differ_video
FROM
(
SELECT user_id, COUNT(1) AS differ_video
FROM
(
SELECT user_id, video_id
FROM
(
SELECT *
FROM user_video_log
WHERE pt = '20200603'
) tmp
GROUP BY user_id, video_id
) tmp2
GROUP BY user_id
) tmp3
SORT BY differ_video DESC
LIMIT 5
;
This run has three parts:
part1: verify that SORT BY output is ordered within segments
part2: the execution plan for the top-5 job
part3: run the top-5 job
part1 output
+----------+-----------+
| user_id | video_id |
+----------+-----------+
| 2 | 5 |
| 3 | 1 |
| 1 | 6 |
| 2 | 4 |
| 7 | 1 |
| 1 | 1 |
| 1 | 4 |
| 4 | 1 |
| 7 | 1 |
| 1 | 1 |
| 1 | 3 |
| 2 | 2 |
| 2 | 2 |
| 5 | 1 |
| 7 | 1 |
| 7 | 2 |
| 1 | 5 |
| 1 | 2 |
| 2 | 3 |
| 2 | 1 |
| 6 | 1 |
| 7 | 3 |
| 7 | 2 |
| 7 | 4 |
| 7 | 4 |
+----------+-----------+
The output is ordered within each of the 5 segments (one per reducer), though not globally.
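The SORT BY behavior can be sketched as follows (a toy model, not Hive internals: the partitioning function is simplified to user_id % 5, and each "reducer" sorts only its own slice):

```python
# Rows are partitioned by key, each reducer sorts its own slice, and the
# final output is just the concatenation of the slices.
rows = [(u, v) for u in range(1, 8) for v in range(1, 4)]
NUM_REDUCERS = 5

partitions = [[] for _ in range(NUM_REDUCERS)]
for row in rows:
    partitions[row[0] % NUM_REDUCERS].append(row)

output = []
for part in partitions:
    part.sort()          # per-reducer sort only
    output.extend(part)

# Ordered within each segment, but not globally:
assert all(part == sorted(part) for part in partitions)
assert output != sorted(output)
```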
part2 output
The job is now split into four stages:
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-3 depends on stages: Stage-2 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: user_video_log |
| filterExpr: (pt = '20200603') (type: boolean) |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), video_id (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: bigint), _col1 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint), _col1 (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 25 Data size: 400 Basic stats: COMPLETE Column stats: NONE |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| keys: KEY._col0 (type: bigint), KEY._col1 (type: bigint) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 12 Data size: 192 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 12 Data size: 192 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(1) |
| keys: _col0 (type: bigint) |
| mode: complete |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 96 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col1 (type: bigint) |
| sort order: - |
| Statistics: Num rows: 6 Data size: 96 Basic stats: COMPLETE Column stats: NONE |
| TopN Hash Memory Usage: 0.1 |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 96 Basic stats: COMPLETE Column stats: NONE |
| Limit |
| Number of rows: 5 |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col1 (type: bigint) |
| sort order: - |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| TopN Hash Memory Usage: 0.1 |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Select Operator |
| expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| Limit |
| Number of rows: 5 |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 5 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: 5 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
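The Stage-2/Stage-3 pattern above (each reducer emits its local top 5 by count, then one final reducer merges the small candidate set) can be sketched in Python; the counts mirror this example's data, but the user_id % 5 partitioning is an assumption, not what Hive actually hashes:

```python
import heapq

user_counts = {1: 6, 2: 5, 3: 1, 4: 1, 5: 1, 6: 1, 7: 4}
NUM_REDUCERS = 5
TOP_N = 5

# Stage-2: local top-N per reducer; only candidates travel on.
shards = [[] for _ in range(NUM_REDUCERS)]
for user_id, cnt in user_counts.items():
    shards[user_id % NUM_REDUCERS].append((cnt, user_id))
candidates = []
for shard in shards:
    candidates.extend(heapq.nlargest(TOP_N, shard))

# Stage-3: a single reducer merges at most NUM_REDUCERS * TOP_N rows.
top5 = heapq.nlargest(TOP_N, candidates)
print(top5)   # leaders: users 1, 2, 7, then ties at count 1
```

The final single-reducer stage is cheap because it sees at most 25 candidate rows, not the whole dataset.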
part3 results
Output from two runs:
INFO : Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 18.41 sec HDFS Read: 31097 HDFS Write: 613 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 5 Cumulative CPU: 11.06 sec HDFS Read: 14785 HDFS Write: 575 SUCCESS
INFO : Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 3.68 sec HDFS Read: 6923 HDFS Write: 167 SUCCESS
INFO : Total MapReduce CPU Time Spent: 33 seconds 150 msec
INFO : Completed executing command(queryId=hive_20200603211837_67bfdd47-82e1-4dd6-81a8-456e80663079); Time taken: 127.029 seconds
INFO : OK
+----------+---------------+
| user_id | differ_video |
+----------+---------------+
| 1 | 6 |
| 2 | 5 |
| 7 | 4 |
| 6 | 1 |
| 4 | 1 |
+----------+---------------+
5 rows selected (127.212 seconds)
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 16.01 sec HDFS Read: 31097 HDFS Write: 613 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 5 Cumulative CPU: 10.6 sec HDFS Read: 14785 HDFS Write: 575 SUCCESS
INFO : Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 4.45 sec HDFS Read: 6923 HDFS Write: 167 SUCCESS
INFO : Total MapReduce CPU Time Spent: 31 seconds 60 msec
INFO : Completed executing command(queryId=hive_20200603213610_f0088785-9727-48ee-81bd-7bbcf531b486); Time taken: 122.529 seconds
INFO : OK
+----------+---------------+
| user_id | differ_video |
+----------+---------------+
| 1 | 6 |
| 2 | 5 |
| 7 | 4 |
| 6 | 1 |
| 4 | 1 |
+----------+---------------+