Hive Optimization Guide, Scenario 8: Data Pruning

 

Hive optimization series outline (in Chinese):

https://blog.csdn.net/u010003835/article/details/105334641

 

Disable optimization parameters

The test data set is small, so Hive would automatically convert the joins below into map joins; to keep the comparison meaningful, disable automatic map-join conversion:

set hive.auto.convert.join=false;

Hive's predicate pushdown is enabled by default; turn it off as well, so that filter placement in each plan is determined purely by how the query is written:

set hive.optimize.ppd=false;
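
After the experiments, both settings can be restored in the same session (true is the default for both in recent Hive releases):

set hive.auto.convert.join=true;

set hive.optimize.ppd=true;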

 

Test data

use data_warehouse_test;
 
 
CREATE TABLE IF NOT EXISTS datacube_salary_org (
 company_name STRING COMMENT 'company name'
 ,dep_name STRING COMMENT 'department name'
 ,user_id BIGINT COMMENT 'user id'
 ,user_name STRING COMMENT 'user name'
 ,salary DECIMAL(10,2) COMMENT 'salary'
 ,create_time DATE COMMENT 'creation time'
 ,update_time DATE COMMENT 'update time'
)
PARTITIONED BY(
 pt STRING COMMENT 'data partition'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;
 
 
CREATE TABLE IF NOT EXISTS datacube_salary_basic_aggr(
 company_name STRING COMMENT 'company name'
 ,dep_name STRING COMMENT 'department name'
 ,user_id BIGINT COMMENT 'user id'
 ,salary DECIMAL(10,2) COMMENT 'salary'
)
STORED AS ORC
;
 
 
CREATE TABLE IF NOT EXISTS datacube_salary_dep_aggr(
 company_name STRING COMMENT 'company name'
 ,dep_name STRING COMMENT 'department name'
 ,total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;
 
 
CREATE TABLE IF NOT EXISTS datacube_salary_company_aggr(
 company_name STRING COMMENT 'company name'
 ,total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;
 
 
CREATE TABLE IF NOT EXISTS datacube_salary_total_aggr(
 total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;

 

The data in the tables used below:

Data in datacube_salary_org:

+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name  | datacube_salary_org.dep_name  | datacube_salary_org.user_id  | datacube_salary_org.user_name  | datacube_salary_org.salary  | datacube_salary_org.create_time  | datacube_salary_org.update_time  | datacube_salary_org.pt  |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh                              | engineer                      | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| s.zh                              | engineer                      | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200405                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200405                |
| s.zh                              | engineer                      | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| s.zh                              | engineer                      | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200406                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200406                |
| s.zh                              | enginer                       | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| s.zh                              | enginer                       | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200407                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200407                |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
21 rows selected (0.153 seconds)

Data in datacube_salary_dep_aggr:

+----------------------------------------+------------------------------------+----------------------------------------+
| datacube_salary_dep_aggr.company_name  | datacube_salary_dep_aggr.dep_name  | datacube_salary_dep_aggr.total_salary  |
+----------------------------------------+------------------------------------+----------------------------------------+
| s.zh                                   | enginer                            | 54000.00                               |
| s.zh                                   | tester                             | 20000.00                               |
| x.qx                                   | finance                            | 50900.00                               |
| x.qx                                   | kiccp                              | 8600.00                                |
+----------------------------------------+------------------------------------+----------------------------------------+
4 rows selected (0.262 seconds)

 

 

 

 

 

Scenario 8: Data Pruning

 

1) Record pruning

   i. Exploit partitioned and bucketed tables

  ii. Apply filter conditions so that invalid records are dropped in the map phase

2) Column pruning

 i. Exclude columns that are not needed by the computation

 ii. Use a columnar storage format

 

 

Scenario 8: Data Pruning

 

1) Record pruning

   i. Exploit partitioned and bucketed tables

Building and using bucketed tables (in Chinese): https://blog.csdn.net/qq_26803795/article/details/105205702

Partitioned tables and bucketed tables are optimization strategies in their own right.

A partitioned table suits a column that organizes the data in a natural way, e.g. daily logs: each day's volume is similar, so the data can be laid out one partition per day.

A bucketed table suits a column that cannot form a reasonable partition, such as a user id column, and it also makes sampling convenient.

The two can be combined; there is no conflict between them.

 

Partitioned table: filtering on the partition column prunes data very cheaply, because each partition is stored as a directory on HDFS (a filter on the partition column can dramatically reduce the data scanned).

Bucketed table: the bucketing column's value is hashed and taken modulo the number of buckets to decide which bucket a record is written to, so later filters on that column also gain a lot (no full-table scan is needed).
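
A concrete sketch combining both techniques (the table user_log_demo and its columns are invented for illustration; they are not part of the test data):

-- A daily-partitioned, user-bucketed log table.
-- A record lands in bucket hash(user_id) % 8.
-- (On older Hive versions, set hive.enforce.bucketing=true before inserting.)
CREATE TABLE IF NOT EXISTS user_log_demo (
 user_id BIGINT COMMENT 'user id'
 ,action STRING COMMENT 'action name'
)
PARTITIONED BY(
 dt STRING COMMENT 'day, e.g. 20200406'
)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC
;

-- Partition pruning: only the dt=20200406 directory on HDFS is scanned.
SELECT user_id, action
FROM user_log_demo
WHERE dt = '20200406';

-- Bucket sampling: reads roughly 1 of the 8 buckets instead of the whole table.
SELECT user_id, action
FROM user_log_demo TABLESAMPLE(BUCKET 1 OUT OF 8 ON user_id)
WHERE dt = '20200406';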

 

 

  ii. Apply filter conditions so that invalid records are dropped in the map phase

As above, the data set is small, so automatic map-join conversion stays disabled:

set hive.auto.convert.join=false;

Predicate pushdown also stays disabled, so the filter placement in each plan reflects how the query was written:

set hive.optimize.ppd=false;

 

 

 

This is easy to see with an example: the three statements below produce identical results, even though each writes the salary filter in a different place.

SELECT 
 a.user_id
 ,b.user_id
FROM datacube_salary_org  AS a
JOIN datacube_salary_org  AS b
 ON a.user_id = b.user_id
 AND a.salary >= 10000.0
 AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;
SELECT
 tmp_a.user_id
 ,tmp_b.user_id
FROM
(
SELECT 
 *
FROM datacube_salary_org  AS a
WHERE a.salary >= 10000.0
) tmp_a
JOIN 
(
SELECT 
 *
FROM datacube_salary_org  AS b
WHERE b.salary >= 10000.0
) tmp_b
ON tmp_a.user_id = tmp_b.user_id
ORDER BY tmp_a.user_id, tmp_b.user_id
;
SELECT 
 a.user_id
 ,b.user_id
FROM datacube_salary_org  AS a
JOIN datacube_salary_org  AS b
 ON a.user_id = b.user_id
WHERE 
 a.salary >= 10000.0
 AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;

 

Result (identical for all three):

+------------+------------+
| a.user_id  | b.user_id  |
+------------+------------+
| 1          | 1          |
| 1          | 1          |
| 1          | 1          |
| 1          | 1          |
| 1          | 1          |
| 1          | 1          |
| 1          | 1          |
| 1          | 1          |
| 1          | 1          |
| 2          | 2          |
| 2          | 2          |
| 2          | 2          |
| 2          | 2          |
| 2          | 2          |
| 2          | 2          |
| 2          | 2          |
| 2          | 2          |
| 2          | 2          |
| 3          | 3          |
| 3          | 3          |
| 3          | 3          |
| 3          | 3          |
| 3          | 3          |
| 3          | 3          |
| 3          | 3          |
| 3          | 3          |
| 3          | 3          |
| 4          | 4          |
| 4          | 4          |
| 4          | 4          |
| 4          | 4          |
| 4          | 4          |
| 4          | 4          |
| 4          | 4          |
| 4          | 4          |
| 4          | 4          |
| 5          | 5          |
| 5          | 5          |
| 5          | 5          |
| 5          | 5          |
| 5          | 5          |
| 5          | 5          |
| 5          | 5          |
| 5          | 5          |
| 5          | 5          |
| 6          | 6          |
| 6          | 6          |
| 6          | 6          |
| 6          | 6          |
| 6          | 6          |
| 6          | 6          |
| 6          | 6          |
| 6          | 6          |
| 6          | 6          |
+------------+------------+
54 rows selected (51.263 seconds)

(Six of the seven users have salary >= 10000, and each appears in 3 partitions, so every qualifying user_id matches itself 3 × 3 = 9 times: 6 × 9 = 54 rows.)

 

Now compare the execution plans of the three statements using EXPLAIN.

First form:

EXPLAIN 
SELECT 
 a.user_id
 ,b.user_id
FROM datacube_salary_org  AS a
JOIN datacube_salary_org  AS b
 ON a.user_id = b.user_id
 AND a.salary >= 10000.0
 AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: ((salary >= 10000.0) and user_id is not null) (type: boolean) |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: ((salary >= 10000.0) and user_id is not null) (type: boolean) |
|               Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: ((salary >= 10000.0) and user_id is not null) (type: boolean) |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: ((salary >= 10000.0) and user_id is not null) (type: boolean) |
|               Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Inner Join 0 to 1                   |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 user_id (type: bigint)               |
|           outputColumnNames: _col2, _col13         |
|           Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: _col2 (type: bigint), _col13 (type: bigint) |
|             outputColumnNames: _col0, _col1        |
|             Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               key expressions: _col0 (type: bigint), _col1 (type: bigint) |
|               sort order: ++                       |
|               Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: bigint), KEY.reducesinkkey1 (type: bigint) |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
80 rows selected (0.165 seconds)

 

==============================

 

Second form:

EXPLAIN
SELECT
 tmp_a.user_id
 ,tmp_b.user_id
FROM
(
SELECT 
 *
FROM datacube_salary_org  AS a
WHERE a.salary >= 10000.0
) tmp_a
JOIN 
(
SELECT 
 *
FROM datacube_salary_org  AS b
WHERE b.salary >= 10000.0
) tmp_b
ON tmp_a.user_id = tmp_b.user_id
ORDER BY tmp_a.user_id, tmp_b.user_id
;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (salary >= 10000.0) (type: boolean) |
|               Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint) |
|                 outputColumnNames: _col2           |
|                 Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|                 Filter Operator                    |
|                   predicate: _col2 is not null (type: boolean) |
|                   Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col2 (type: bigint) |
|                     sort order: +                  |
|                     Map-reduce partition columns: _col2 (type: bigint) |
|                     Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: b                               |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (salary >= 10000.0) (type: boolean) |
|               Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint) |
|                 outputColumnNames: _col2           |
|                 Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|                 Filter Operator                    |
|                   predicate: _col2 is not null (type: boolean) |
|                   Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col2 (type: bigint) |
|                     sort order: +                  |
|                     Map-reduce partition columns: _col2 (type: bigint) |
|                     Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Inner Join 0 to 1                   |
|           keys:                                    |
|             0 _col2 (type: bigint)                 |
|             1 _col2 (type: bigint)                 |
|           outputColumnNames: _col2, _col10         |
|           Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: _col2 (type: bigint), _col10 (type: bigint) |
|             outputColumnNames: _col0, _col1        |
|             Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               key expressions: _col0 (type: bigint), _col1 (type: bigint) |
|               sort order: ++                       |
|               Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: bigint), KEY.reducesinkkey1 (type: bigint) |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
92 rows selected (0.166 seconds)

 

==============================

 

Third form:

EXPLAIN 
SELECT 
 a.user_id
 ,b.user_id
FROM datacube_salary_org  AS a
JOIN datacube_salary_org  AS b
 ON a.user_id = b.user_id
WHERE 
 a.salary >= 10000.0
 AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: salary (type: decimal(10,2)) |
|           TableScan                                |
|             alias: b                               |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: salary (type: decimal(10,2)) |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Inner Join 0 to 1                   |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 user_id (type: bigint)               |
|           outputColumnNames: _col2, _col4, _col13, _col15 |
|           Statistics: Num rows: 7 Data size: 374 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: ((_col4 >= 10000.0) and (_col15 >= 10000.0)) (type: boolean) |
|             Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: _col2 (type: bigint), _col13 (type: bigint) |
|               outputColumnNames: _col0, _col1      |
|               Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               key expressions: _col0 (type: bigint), _col1 (type: bigint) |
|               sort order: ++                       |
|               Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: KEY.reducesinkkey0 (type: bigint), KEY.reducesinkkey1 (type: bigint) |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
83 rows selected (0.193 seconds)

 

The first and second plans are essentially identical: the salary filter runs in a map-side Filter Operator before the join. In the third plan the filter has been moved after the JOIN, into the Reduce Operator Tree, so every row reaches the join first; on a large data set this is bound to be slower:

|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Inner Join 0 to 1                   |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 user_id (type: bigint)               |
|           outputColumnNames: _col2, _col4, _col13, _col15 |
|           Statistics: Num rows: 7 Data size: 374 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: ((_col4 >= 10000.0) and (_col15 >= 10000.0)) (type: boolean) |
|             Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: _col2 (type: bigint), _col13 (type: bigint) |
|               outputColumnNames: _col0, _col1      |
|               Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |

 

 

 

Also note something about the second form:

SELECT
 tmp_a.user_id
 ,tmp_b.user_id
FROM
(
SELECT 
 *
FROM datacube_salary_org  AS a
WHERE a.salary >= 10000.0
) tmp_a
JOIN 
(
SELECT 
 *
FROM datacube_salary_org  AS b
WHERE b.salary >= 10000.0
) tmp_b
ON tmp_a.user_id = tmp_b.user_id
ORDER BY tmp_a.user_id, tmp_b.user_id
;

Note: although the subqueries use SELECT *, the optimizer prunes the columns for us; in the plan above, the Select Operator reads only user_id (expressions: user_id (type: bigint)).

 

 

 

2) Column pruning

 i. Exclude columns that are not needed by the computation

This one is straightforward: list only the required columns in the SELECT clause instead of using SELECT *.

Below is a comparison of the two approaches:

 

Selecting only two columns

EXPLAIN
SELECT 
 a.user_id
 ,a.company_name
FROM datacube_salary_org  AS a
WHERE 
 a.pt = '20200406'
;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 7 Data size: 984 Basic stats: COMPLETE Column stats: PARTIAL |
|             Filter Operator                        |
|               predicate: (pt = '20200406') (type: boolean) |
|               Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL |
|               Select Operator                      |
|                 expressions: user_id (type: bigint), company_name (type: string) |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 7 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL |
|                 File Output Operator               |
|                   compressed: false                |
|                   Statistics: Num rows: 7 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
32 rows selected (0.133 seconds)

 

========================================================

 

Selecting all columns

EXPLAIN
SELECT 
 a.*
FROM datacube_salary_org  AS a
WHERE 
 a.pt = '20200406'
;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 7 Data size: 984 Basic stats: COMPLETE Column stats: PARTIAL |
|             Filter Operator                        |
|               predicate: (pt = '20200406') (type: boolean) |
|               Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL |
|               Select Operator                      |
|                 expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string), salary (type: decimal(10,2)), create_time (type: date), update_time (type: date), '20200406' (type: string) |
|                 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 |
|                 Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL |
|                 File Output Operator               |
|                   compressed: false                |
|                   Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
32 rows selected (0.117 seconds)

Comparing the two plans: with an explicit column list, the Select Operator emits only user_id and company_name, while SELECT * carries all eight columns through the plan (compare the Data size statistics), so every downstream operator moves less data in the first case.

 

 

 

 ii. Use columnar storage

Storing the table in a columnar format also lets Hive read only the columns a query touches, which makes column selection fast at the storage level.

Columnar formats currently supported by Hive: PARQUET, RCFILE (not recommended), and ORC.
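
For example, an existing TEXTFILE table can be rewritten into ORC with a CREATE TABLE AS SELECT (a sketch; datacube_salary_org_orc is a made-up target name, and since Hive's CTAS does not create partitioned tables, pt becomes an ordinary column here):

CREATE TABLE datacube_salary_org_orc
STORED AS ORC
AS
SELECT
 company_name
 ,dep_name
 ,user_id
 ,user_name
 ,salary
 ,create_time
 ,update_time
 ,pt
FROM datacube_salary_org
;

Afterwards, a query that touches only user_id and company_name reads just those column streams from the ORC files, instead of whole rows as with TEXTFILE.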

 

Reference (in Chinese):

https://blog.csdn.net/u010003835/article/details/88236132

 

 

 
