Hive Optimization Guide, Scenario 5: Sorting

Optimization outline: https://blog.csdn.net/u010003835/article/details/105334641

Test table and test data

+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `datacube_salary_org`(                |
|   `company_name` string COMMENT 'company name',    |
|   `dep_name` string COMMENT 'department name',     |
|   `user_id` bigint COMMENT 'user id',              |
|   `user_name` string COMMENT 'user name',          |
|   `salary` decimal(10,2) COMMENT 'salary',         |
|   `create_time` date COMMENT 'create time',        |
|   `update_time` date COMMENT 'update time')        |
| PARTITIONED BY (                                   |
|   `pt` string COMMENT 'partition date')            |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| WITH SERDEPROPERTIES (                             |
|   'field.delim'=',',                               |
|   'serialization.format'=',')                      |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/data_warehouse_test.db/datacube_salary_org' |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1586310488')            |
+----------------------------------------------------+
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name  | datacube_salary_org.dep_name  | datacube_salary_org.user_id  | datacube_salary_org.user_name  | datacube_salary_org.salary  | datacube_salary_org.create_time  | datacube_salary_org.update_time  | datacube_salary_org.pt  |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh                              | engineer                      | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| s.zh                              | engineer                      | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200405                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200405                |
| s.zh                              | engineer                      | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| s.zh                              | engineer                      | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200406                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200406                |
| s.zh                              | enginer                       | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| s.zh                              | enginer                       | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200407                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200407                |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+

Scenario 5: Sorting

1) Use ORDER BY and SORT BY appropriately, and choose between them based on the workload

2) Use LIMIT to bound the sorted output

 

 

1) Use ORDER BY and SORT BY appropriately, and choose between them based on the workload

 

   Avoid ORDER BY whenever possible: it sorts globally, which forces all of the data through a single reducer and makes that reducer a bottleneck on large datasets.

   SORT BY, by contrast, sorts only locally within each reducer.

 

If you are not yet familiar with SORT BY, see my earlier article:

https://blog.csdn.net/u010003835/article/details/80938339
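Outside Hive, the contrast is easy to simulate. Here is a minimal Python sketch (not Hive code; the modulo partitioning below is a hypothetical stand-in for however Hive routes rows to reducers): ORDER BY is one global sort, SORT BY is an independent sort per reducer partition.

```python
# Simulation of ORDER BY vs SORT BY semantics.
# The partitioning is a stand-in: Hive actually distributes rows to
# reducers by hash (or randomly, absent DISTRIBUTE BY).

rows = [(1, "szh"), (4, "pip"), (2, "zyq"), (7, "xsz"),
        (3, "gkm"), (6, "zxxc"), (5, "kip")]

# ORDER BY: a single reducer sorts everything -> one globally sorted list.
order_by = sorted(rows)

# SORT BY with 3 reducers: rows are split across reducers and each sorts
# only its own slice -> locally sorted runs, no global order.
num_reducers = 3
partitions = [[] for _ in range(num_reducers)]
for row in rows:
    partitions[hash(row[0]) % num_reducers].append(row)
sort_by = [sorted(p) for p in partitions]

print(order_by)   # globally ordered
print(sort_by)    # each sub-list is ordered; their concatenation is not
```

Concatenating the SORT BY partitions generally does not give a globally sorted result, which is exactly what the experiment below demonstrates.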

 

As you can see, SORT BY and ORDER BY are not the same thing.

Let's sort the test data with each and compare the results:

 

 

SORT BY

First, set the number of reducers. Because the dataset is small, Hive would otherwise default to a single reducer; raising the count lets us observe how rows are distributed (on newer Hive/Hadoop the property is mapreduce.job.reduces, for which mapred.reduce.tasks is a deprecated alias):

set mapred.reduce.tasks=5;

 

The SORT BY query:

SELECT user_id, user_name, salary  
FROM datacube_salary_org 
SORT BY user_id;

EXPLAIN output for SORT BY:

INFO  : Starting task [Stage-2:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200414042647_8a05e5e1-5eff-4fe7-b3e1-10e84b19c6c0); Time taken: 0.003 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: user_id (type: bigint), user_name (type: string), salary (type: decimal(10,2)) |
|               outputColumnNames: _col0, _col1, _col2 |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: _col0 (type: bigint) |
|                 sort order: +                      |
|                 Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: _col1 (type: string), _col2 (type: decimal(10,2)) |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: decimal(10,2)) |
|           outputColumnNames: _col0, _col1, _col2   |
|           Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
39 rows selected (0.153 seconds)

Execution output:

INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
INFO  : 2020-04-14 04:28:56,154 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-14 04:29:03,348 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.02 sec
INFO  : 2020-04-14 04:29:08,477 Stage-1 map = 100%,  reduce = 20%, Cumulative CPU 3.77 sec
INFO  : 2020-04-14 04:29:12,587 Stage-1 map = 100%,  reduce = 40%, Cumulative CPU 5.53 sec
INFO  : 2020-04-14 04:29:16,693 Stage-1 map = 100%,  reduce = 60%, Cumulative CPU 7.04 sec
INFO  : 2020-04-14 04:29:20,804 Stage-1 map = 100%,  reduce = 80%, Cumulative CPU 8.76 sec
INFO  : 2020-04-14 04:29:25,005 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 10.99 sec
INFO  : MapReduce Total cumulative CPU time: 10 seconds 990 msec
INFO  : Ended Job = job_1586835454940_0014
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 5   Cumulative CPU: 10.99 sec   HDFS Read: 28509 HDFS Write: 1002 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 10 seconds 990 msec
INFO  : Completed executing command(queryId=hive_20200414042847_8447505f-581e-4204-804c-f63d97dcd553); Time taken: 39.228 seconds
INFO  : OK
+----------+------------+-----------+
| user_id  | user_name  |  salary   |
+----------+------------+-----------+
| 6        | zxxc       | 13000.00  |
| 7        | xsz        | 8600.00   |
| 5        | kip        | 24500.00  |
| 6        | zxxc       | 13000.00  |
| 7        | xsz        | 8600.00   |
| 1        | szh        | 28000.00  |
| 1        | szh        | 28000.00  |
| 5        | kip        | 24500.00  |
| 5        | kip        | 24500.00  |
| 2        | zyq        | 26000.00  |
| 2        | zyq        | 26000.00  |
| 2        | zyq        | 26000.00  |
| 3        | gkm        | 20000.00  |
| 4        | pip        | 13400.00  |
| 4        | pip        | 13400.00  |
| 1        | szh        | 28000.00  |
| 3        | gkm        | 20000.00  |
| 3        | gkm        | 20000.00  |
| 4        | pip        | 13400.00  |
| 6        | zxxc       | 13000.00  |
| 7        | xsz        | 8600.00   |
+----------+------------+-----------+
21 rows selected (39.402 seconds)

The output consists of 5 independently ascending sub-results, confirming that the rows were distributed across 5 reducers:

| 6        | zxxc       | 13000.00  |
| 7        | xsz        | 8600.00   |

| 5        | kip        | 24500.00  |
| 6        | zxxc       | 13000.00  |
| 7        | xsz        | 8600.00   |

| 1        | szh        | 28000.00  |
| 1        | szh        | 28000.00  |
| 5        | kip        | 24500.00  |
| 5        | kip        | 24500.00  |

| 2        | zyq        | 26000.00  |
| 2        | zyq        | 26000.00  |
| 2        | zyq        | 26000.00  |
| 3        | gkm        | 20000.00  |
| 4        | pip        | 13400.00  |
| 4        | pip        | 13400.00  |

| 1        | szh        | 28000.00  |
| 3        | gkm        | 20000.00  |
| 3        | gkm        | 20000.00  |
| 4        | pip        | 13400.00  |
| 6        | zxxc       | 13000.00  |
| 7        | xsz        | 8600.00   |
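This decomposition can be checked mechanically. Below is a small Python sketch (not Hive code) that greedily splits a sequence into maximal non-descending runs; applied to the user_id column above it finds exactly 5 runs, matching the 5 reducers. (In general a greedy split only gives a lower bound, since two adjacent reducer outputs could happen to merge into one run.)

```python
def ascending_runs(seq):
    """Greedily split seq into maximal non-descending runs."""
    runs = []
    for x in seq:
        if runs and runs[-1][-1] <= x:
            runs[-1].append(x)   # extends the current run
        else:
            runs.append([x])     # order broke: a new run starts
    return runs

# The user_id column of the SORT BY output above, in output order.
user_ids = [6, 7, 5, 6, 7, 1, 1, 5, 5, 2, 2, 2, 3, 4, 4, 1, 3, 3, 4, 6, 7]
print(len(ascending_runs(user_ids)))  # -> 5
```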

 

ORDER BY

 

Again, set the number of reducers (as we are about to see, this has no effect with ORDER BY):

set mapred.reduce.tasks=5;

The ORDER BY query:

SELECT user_id, user_name, salary  
FROM datacube_salary_org 
ORDER BY user_id;

EXPLAIN output for ORDER BY:

INFO  : Starting task [Stage-2:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200414044711_439184d4-6d74-46d6-8b79-1853b21090f2); Time taken: 0.006 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: user_id (type: bigint), user_name (type: string), salary (type: decimal(10,2)) |
|               outputColumnNames: _col0, _col1, _col2 |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: _col0 (type: bigint) |
|                 sort order: +                      |
|                 Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: _col1 (type: string), _col2 (type: decimal(10,2)) |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: decimal(10,2)) |
|           outputColumnNames: _col0, _col1, _col2   |
|           Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
39 rows selected (0.143 seconds)

Execution output:

INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2020-04-14 04:56:06,003 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-14 04:56:13,192 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.31 sec
INFO  : 2020-04-14 04:56:18,323 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.99 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 990 msec
INFO  : Ended Job = job_1586835454940_0015
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.99 sec   HDFS Read: 11305 HDFS Write: 654 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 3 seconds 990 msec
INFO  : Completed executing command(queryId=hive_20200414045557_ba114e3d-fe27-4268-89a6-ecb424340503); Time taken: 21.548 seconds
INFO  : OK
+----------+------------+-----------+
| user_id  | user_name  |  salary   |
+----------+------------+-----------+
| 1        | szh        | 28000.00  |
| 1        | szh        | 28000.00  |
| 1        | szh        | 28000.00  |
| 2        | zyq        | 26000.00  |
| 2        | zyq        | 26000.00  |
| 2        | zyq        | 26000.00  |
| 3        | gkm        | 20000.00  |
| 3        | gkm        | 20000.00  |
| 3        | gkm        | 20000.00  |
| 4        | pip        | 13400.00  |
| 4        | pip        | 13400.00  |
| 4        | pip        | 13400.00  |
| 5        | kip        | 24500.00  |
| 5        | kip        | 24500.00  |
| 5        | kip        | 24500.00  |
| 6        | zxxc       | 13000.00  |
| 6        | zxxc       | 13000.00  |
| 6        | zxxc       | 13000.00  |
| 7        | xsz        | 8600.00   |
| 7        | xsz        | 8600.00   |
| 7        | xsz        | 8600.00   |
+----------+------------+-----------+
21 rows selected (21.682 seconds)

 

As you can see, the two EXPLAIN plans show no difference.

   The real difference is in the reducer count: ORDER BY strictly forces the job to 1 reducer, because it must sort globally, so adjusting the reducer count there is meaningless.

 

 

2) Use LIMIT to bound the sorted output

  We all know that ORDER BY + LIMIT N returns the top N rows.

However, SORT BY + LIMIT N can also return the top N rows, and it does so more efficiently on large datasets.

Let's fetch the first 5 rows ordered by user_id.

The ORDER BY version:

set mapred.reduce.tasks=5;

SELECT user_id, user_name, salary  
FROM datacube_salary_org 
ORDER BY user_id
LIMIT 5;

Execution output:

INFO  : Starting Job = job_1586835454940_0018, Tracking URL = http://cdh-manager:8088/proxy/application_1586835454940_0018/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop/bin/hadoop job  -kill job_1586835454940_0018
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2020-04-14 05:21:17,010 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-14 05:21:24,194 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.48 sec
INFO  : 2020-04-14 05:21:30,379 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.22 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 220 msec
INFO  : Ended Job = job_1586835454940_0018
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.22 sec   HDFS Read: 11463 HDFS Write: 222 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 3 seconds 220 msec
INFO  : Completed executing command(queryId=hive_20200414052108_f4b9e94b-ade0-4756-9175-1139a5975b7f); Time taken: 22.635 seconds
INFO  : OK
+----------+------------+-----------+
| user_id  | user_name  |  salary   |
+----------+------------+-----------+
| 1        | szh        | 28000.00  |
| 1        | szh        | 28000.00  |
| 1        | szh        | 28000.00  |
| 2        | zyq        | 26000.00  |
| 2        | zyq        | 26000.00  |
+----------+------------+-----------+
5 rows selected (22.751 seconds)

Note that there is only one reducer.

 

The SORT BY version:

set mapred.reduce.tasks=5;

SELECT user_id, user_name, salary  
FROM datacube_salary_org 
SORT BY user_id
LIMIT 5;

Execution output:

INFO  : Starting Job = job_1586835454940_0020, Tracking URL = http://cdh-manager:8088/proxy/application_1586835454940_0020/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop/bin/hadoop job  -kill job_1586835454940_0020
INFO  : Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
INFO  : 2020-04-14 05:23:52,360 Stage-2 map = 0%,  reduce = 0%
INFO  : 2020-04-14 05:23:58,508 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.28 sec
INFO  : 2020-04-14 05:24:04,679 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 4.09 sec
INFO  : MapReduce Total cumulative CPU time: 4 seconds 90 msec
INFO  : Ended Job = job_1586835454940_0020
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 5   Cumulative CPU: 10.08 sec   HDFS Read: 24849 HDFS Write: 610 SUCCESS
INFO  : Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 4.09 sec   HDFS Read: 7846 HDFS Write: 222 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 14 seconds 170 msec
INFO  : Completed executing command(queryId=hive_20200414052303_4228dcc5-a45f-4fe1-93e8-20be71cf84d3); Time taken: 62.358 seconds
INFO  : OK
+----------+------------+-----------+
| user_id  | user_name  |  salary   |
+----------+------------+-----------+
| 1        | szh        | 28000.00  |
| 1        | szh        | 28000.00  |
| 1        | szh        | 28000.00  |
| 2        | zyq        | 26000.00  |
| 2        | zyq        | 26000.00  |
+----------+------------+-----------+

The first job (the SORT BY) ran with 5 reducers, while the second job (for the LIMIT) ran with 1 reducer.

 

 

Now let's look at the execution plans.

ORDER BY + LIMIT 

INFO  : Starting task [Stage-2:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200414052614_9ded2fc7-ad41-4f6d-9c3d-8a46f529bc7d); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: user_id (type: bigint), user_name (type: string), salary (type: decimal(10,2)) |
|               outputColumnNames: _col0, _col1, _col2 |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: _col0 (type: bigint) |
|                 sort order: +                      |
|                 Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                 TopN Hash Memory Usage: 0.1        |
|                 value expressions: _col1 (type: string), _col2 (type: decimal(10,2)) |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: decimal(10,2)) |
|           outputColumnNames: _col0, _col1, _col2   |
|           Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|           Limit                                    |
|             Number of rows: 5                      |
|             Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: 5                                     |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
43 rows selected (0.142 seconds)

 

SORT BY + LIMIT

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200414052914_89ed6d4a-e498-4372-bdc1-6aa8ceb43ddd); Time taken: 0.004 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: user_id (type: bigint), user_name (type: string), salary (type: decimal(10,2)) |
|               outputColumnNames: _col0, _col1, _col2 |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: _col0 (type: bigint) |
|                 sort order: +                      |
|                 Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                 TopN Hash Memory Usage: 0.1        |
|                 value expressions: _col1 (type: string), _col2 (type: decimal(10,2)) |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: decimal(10,2)) |
|           outputColumnNames: _col0, _col1, _col2   |
|           Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|           Limit                                    |
|             Number of rows: 5                      |
|             Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               key expressions: _col0 (type: bigint) |
|               sort order: +                        |
|               Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
|               TopN Hash Memory Usage: 0.1          |
|               value expressions: _col1 (type: string), _col2 (type: decimal(10,2)) |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: decimal(10,2)) |
|           outputColumnNames: _col0, _col1, _col2   |
|           Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
|           Limit                                    |
|             Number of rows: 5                      |
|             Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               Statistics: Num rows: 5 Data size: 240 Basic stats: COMPLETE Column stats: NONE |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: 5                                     |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
69 rows selected (0.143 seconds)

The stages run Stage-1 -> Stage-2 -> Stage-0.

In Stage-1, each reducer keeps only its local top 5; in Stage-2, a single reducer merges those per-reducer candidates and takes the global top 5.

 

 

Summary

    With SORT BY + LIMIT N, each reducer only needs to retain its local top N records; a final pass over the candidates from all M reducers then selects the global top N.
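The two-stage plan can be sketched in plain Python (a simulation of the idea, not Hive internals; heapq.nsmallest stands in for whatever bounded top-N structure Hive's reducers use):

```python
import heapq

def two_stage_top_n(partitions, n):
    """Simulate SORT BY + LIMIT n: per-reducer top-n, then a final merge."""
    # Stage 1: each reducer independently keeps only its n smallest rows.
    stage1 = [heapq.nsmallest(n, part) for part in partitions]
    # Stage 2: one reducer merges the surviving candidates
    # (at most n * len(partitions) rows) and takes the global top n.
    return heapq.nsmallest(n, [row for part in stage1 for row in part])

# Rows spread across 5 hypothetical reducers, keyed by user_id.
partitions = [
    [(6, "zxxc"), (7, "xsz")],
    [(5, "kip"), (6, "zxxc"), (7, "xsz")],
    [(1, "szh"), (1, "szh"), (5, "kip"), (5, "kip")],
    [(2, "zyq"), (2, "zyq"), (2, "zyq"), (3, "gkm"), (4, "pip"), (4, "pip")],
    [(1, "szh"), (3, "gkm"), (3, "gkm"), (4, "pip"), (6, "zxxc"), (7, "xsz")],
]
# -> [(1, 'szh'), (1, 'szh'), (1, 'szh'), (2, 'zyq'), (2, 'zyq')],
# matching the LIMIT 5 result above.
print(two_stage_top_n(partitions, 5))
```

Stage 2 always sees at most n rows per reducer, which is why the second job stays cheap regardless of the input size.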

Tips

1) The SORT BY + LIMIT N approach launches one more job than ORDER BY, so weigh the extra job's overhead against the data volume when choosing it.

2) Each reducer's top-N selection is reportedly done with a bubble sort; the "TopN Hash Memory Usage" lines in the EXPLAIN output above suggest a bounded top-N hash structure is involved.
