Hive optimization. Outline at:
https://blog.csdn.net/u010003835/article/details/105334641
Disabling optimization parameters
Because the test data set is small, Hive would otherwise convert the join into a map join, so disable automatic map-join conversion:
set hive.auto.convert.join=false;
Also, Hive's predicate pushdown is enabled by default and needs to be turned off for this test:
set hive.optimize.ppd=false;
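As a sanity check, running `set` with just the property name prints its current value, so you can confirm the override took effect (a sketch for the Hive CLI / Beeline):

```sql
-- Print the current value, then override it for this session.
set hive.auto.convert.join;        -- prints the current value
set hive.auto.convert.join=false;
set hive.optimize.ppd;
set hive.optimize.ppd=false;
-- Both settings are session-scoped; they reset on reconnect.
```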
Test data
use data_warehouse_test;
CREATE TABLE IF NOT EXISTS datacube_salary_org (
company_name STRING COMMENT 'company name'
,dep_name STRING COMMENT 'department name'
,user_id BIGINT COMMENT 'user id'
,user_name STRING COMMENT 'user name'
,salary DECIMAL(10,2) COMMENT 'salary'
,create_time DATE COMMENT 'creation time'
,update_time DATE COMMENT 'update time'
)
PARTITIONED BY(
pt STRING COMMENT 'data partition'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;
CREATE TABLE IF NOT EXISTS datacube_salary_basic_aggr(
company_name STRING COMMENT 'company name'
,dep_name STRING COMMENT 'department name'
,user_id BIGINT COMMENT 'user id'
,salary DECIMAL(10,2) COMMENT 'salary'
)
STORED AS ORC
;
CREATE TABLE IF NOT EXISTS datacube_salary_dep_aggr(
company_name STRING COMMENT 'company name'
,dep_name STRING COMMENT 'department name'
,total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;
CREATE TABLE IF NOT EXISTS datacube_salary_company_aggr(
company_name STRING COMMENT 'company name'
,total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;
CREATE TABLE IF NOT EXISTS datacube_salary_total_aggr(
total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;
Data in the tables used:
Data in datacube_salary_org:
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name | datacube_salary_org.dep_name | datacube_salary_org.user_id | datacube_salary_org.user_name | datacube_salary_org.salary | datacube_salary_org.create_time | datacube_salary_org.update_time | datacube_salary_org.pt |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200405 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200406 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | enginer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| s.zh | enginer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200407 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200407 |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
21 rows selected (0.153 seconds)
Data in datacube_salary_dep_aggr:
+----------------------------------------+------------------------------------+----------------------------------------+
| datacube_salary_dep_aggr.company_name | datacube_salary_dep_aggr.dep_name | datacube_salary_dep_aggr.total_salary |
+----------------------------------------+------------------------------------+----------------------------------------+
| s.zh | enginer | 54000.00 |
| s.zh | tester | 20000.00 |
| x.qx | finance | 50900.00 |
| x.qx | kiccp | 8600.00 |
+----------------------------------------+------------------------------------+----------------------------------------+
4 rows selected (0.262 seconds)
Scenario 8: data pruning
1) Row pruning
i. Exploit the strengths of partitioned and bucketed tables
ii. Use filter conditions so that invalid records are dropped in the map stage
2) Column pruning
i. Drop columns that are invalid or outside the scope of the computation
ii. Use columnar storage
Scenario 8: data pruning
1) Row pruning
i. Exploit the strengths of partitioned and bucketed tables
How to build and use bucketed tables: https://blog.csdn.net/qq_26803795/article/details/105205702
Partitioned and bucketed tables are themselves optimization strategies.
A partitioned table suits columns that organize the data in a natural way, e.g. daily logs: each day's data is roughly comparable, so it can be organized by partition.
A bucketed table suits data for which no reasonable partitioning scheme exists; it also makes sampling convenient, e.g. on a user-id column.
The two can of course be combined; there is no conflict.
Partitioned table: filtering on the specified partitions prunes the data easily, because each partition is stored as a directory on HDFS (filtering on the partition column can dramatically improve efficiency).
Bucketed table: the bucket column's value is hashed and taken modulo the number of buckets to decide which bucket each record lands in, so later filtering also has a big advantage (no full-table scan).
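The two ideas can be combined in one DDL. A hypothetical sketch (the table `salary_part_bucket` and its layout are illustrative, not one of the test tables above):

```sql
-- Hypothetical example: partitioning plus bucketing.
-- The partition prunes whole HDFS directories; within each partition,
-- records are split into 4 files by hash(user_id) % 4.
CREATE TABLE IF NOT EXISTS salary_part_bucket (
 user_id BIGINT COMMENT 'user id (bucket column)'
,salary DECIMAL(10,2) COMMENT 'salary'
)
PARTITIONED BY (pt STRING COMMENT 'data partition')
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC
;

-- Partition pruning: only the pt=20200406 directory is scanned.
SELECT * FROM salary_part_bucket WHERE pt = '20200406';

-- Bucket sampling: reads roughly one of the four buckets.
SELECT * FROM salary_part_bucket TABLESAMPLE (BUCKET 1 OUT OF 4 ON user_id);
```

Note that on older Hive versions, `set hive.enforce.bucketing=true;` may be needed before inserting so that writes actually respect the bucket layout.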
ii. Use filter conditions so that invalid records are dropped in the map stage
As before, the data set is small, so disable automatic map join:
set hive.auto.convert.join=false;
And since Hive's predicate pushdown is enabled by default, turn it off as well:
set hive.optimize.ppd=false;
This one is fairly easy to understand.
For example, the following three statements all return the same result.
SELECT
a.user_id
,b.user_id
FROM datacube_salary_org AS a
JOIN datacube_salary_org AS b
ON a.user_id = b.user_id
AND a.salary >= 10000.0
AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;
SELECT
tmp_a.user_id
,tmp_b.user_id
FROM
(
SELECT
*
FROM datacube_salary_org AS a
WHERE a.salary >= 10000.0
) tmp_a
JOIN
(
SELECT
*
FROM datacube_salary_org AS b
WHERE b.salary >= 10000.0
) tmp_b
ON tmp_a.user_id = tmp_b.user_id
ORDER BY tmp_a.user_id, tmp_b.user_id
;
SELECT
a.user_id
,b.user_id
FROM datacube_salary_org AS a
JOIN datacube_salary_org AS b
ON a.user_id = b.user_id
WHERE
a.salary >= 10000.0
AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;
Result:
+------------+------------+
| a.user_id | b.user_id |
+------------+------------+
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 2 | 2 |
| 2 | 2 |
| 2 | 2 |
| 2 | 2 |
| 2 | 2 |
| 2 | 2 |
| 2 | 2 |
| 3 | 3 |
| 3 | 3 |
| 3 | 3 |
| 3 | 3 |
| 3 | 3 |
| 3 | 3 |
| 3 | 3 |
| 3 | 3 |
| 3 | 3 |
| 4 | 4 |
| 4 | 4 |
| 4 | 4 |
| 4 | 4 |
| 4 | 4 |
| 4 | 4 |
| 4 | 4 |
| 4 | 4 |
| 4 | 4 |
| 5 | 5 |
| 5 | 5 |
| 5 | 5 |
| 5 | 5 |
| 5 | 5 |
| 5 | 5 |
| 5 | 5 |
| 5 | 5 |
| 5 | 5 |
| 6 | 6 |
| 6 | 6 |
| 6 | 6 |
| 6 | 6 |
| 6 | 6 |
| 6 | 6 |
| 6 | 6 |
| 6 | 6 |
| 6 | 6 |
+------------+------------+
54 rows selected (51.263 seconds)
Let's compare the execution plans of the three statements with EXPLAIN.
EXPLAIN
SELECT
a.user_id
,b.user_id
FROM datacube_salary_org AS a
JOIN datacube_salary_org AS b
ON a.user_id = b.user_id
AND a.salary >= 10000.0
AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: ((salary >= 10000.0) and user_id is not null) (type: boolean) |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: ((salary >= 10000.0) and user_id is not null) (type: boolean) |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| TableScan |
| alias: b |
| filterExpr: ((salary >= 10000.0) and user_id is not null) (type: boolean) |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: ((salary >= 10000.0) and user_id is not null) (type: boolean) |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| outputColumnNames: _col2, _col13 |
| Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col2 (type: bigint), _col13 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint), _col1 (type: bigint) |
| sort order: ++ |
| Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Select Operator |
| expressions: KEY.reducesinkkey0 (type: bigint), KEY.reducesinkkey1 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
80 rows selected (0.165 seconds)
==============================
The second form
EXPLAIN
SELECT
tmp_a.user_id
,tmp_b.user_id
FROM
(
SELECT
*
FROM datacube_salary_org AS a
WHERE a.salary >= 10000.0
) tmp_a
JOIN
(
SELECT
*
FROM datacube_salary_org AS b
WHERE b.salary >= 10000.0
) tmp_b
ON tmp_a.user_id = tmp_b.user_id
ORDER BY tmp_a.user_id, tmp_b.user_id
;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (salary >= 10000.0) (type: boolean) |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint) |
| outputColumnNames: _col2 |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: _col2 is not null (type: boolean) |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col2 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col2 (type: bigint) |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| TableScan |
| alias: b |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (salary >= 10000.0) (type: boolean) |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint) |
| outputColumnNames: _col2 |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: _col2 is not null (type: boolean) |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col2 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col2 (type: bigint) |
| Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 _col2 (type: bigint) |
| 1 _col2 (type: bigint) |
| outputColumnNames: _col2, _col10 |
| Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col2 (type: bigint), _col10 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint), _col1 (type: bigint) |
| sort order: ++ |
| Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Select Operator |
| expressions: KEY.reducesinkkey0 (type: bigint), KEY.reducesinkkey1 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
92 rows selected (0.166 seconds)
==============================
The third form
EXPLAIN
SELECT
a.user_id
,b.user_id
FROM datacube_salary_org AS a
JOIN datacube_salary_org AS b
ON a.user_id = b.user_id
WHERE
a.salary >= 10000.0
AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| value expressions: salary (type: decimal(10,2)) |
| TableScan |
| alias: b |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| value expressions: salary (type: decimal(10,2)) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| outputColumnNames: _col2, _col4, _col13, _col15 |
| Statistics: Num rows: 7 Data size: 374 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: ((_col4 >= 10000.0) and (_col15 >= 10000.0)) (type: boolean) |
| Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col2 (type: bigint), _col13 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint), _col1 (type: bigint) |
| sort order: ++ |
| Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Select Operator |
| expressions: KEY.reducesinkkey0 (type: bigint), KEY.reducesinkkey1 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
83 rows selected (0.193 seconds)
As you can see, the execution plans of the first and second forms are essentially identical: the salary filters run at the table scans. In the third plan, because predicate pushdown is disabled, the WHERE conditions are applied only after the JOIN in the reduce phase, which is bound to be slow on large data sets:
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| outputColumnNames: _col2, _col4, _col13, _col15 |
| Statistics: Num rows: 7 Data size: 374 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: ((_col4 >= 10000.0) and (_col15 >= 10000.0)) (type: boolean) |
| Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col2 (type: bigint), _col13 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
Also, note the second form:
SELECT
tmp_a.user_id
,tmp_b.user_id
FROM
(
SELECT
*
FROM datacube_salary_org AS a
WHERE a.salary >= 10000.0
) tmp_a
JOIN
(
SELECT
*
FROM datacube_salary_org AS b
WHERE b.salary >= 10000.0
) tmp_b
ON tmp_a.user_id = tmp_b.user_id
ORDER BY tmp_a.user_id, tmp_b.user_id
;
Note: although the subqueries use SELECT *, the optimizer prunes the projection for us and in practice reads only user_id.
2) Column pruning
i. Drop columns that are invalid or outside the scope of the computation
This is easy to understand: in the SELECT, pick only the columns you need instead of using SELECT *.
Below, let's see how the two forms differ.
Selecting only two columns
EXPLAIN
SELECT
a.user_id
,a.company_name
FROM datacube_salary_org AS a
WHERE
a.pt = '20200406'
;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 7 Data size: 984 Basic stats: COMPLETE Column stats: PARTIAL |
| Filter Operator |
| predicate: (pt = '20200406') (type: boolean) |
| Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL |
| Select Operator |
| expressions: user_id (type: bigint), company_name (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 7 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 7 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
32 rows selected (0.133 seconds)
========================================================
Selecting all columns
EXPLAIN
SELECT
a.*
FROM datacube_salary_org AS a
WHERE
a.pt = '20200406'
;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 7 Data size: 984 Basic stats: COMPLETE Column stats: PARTIAL |
| Filter Operator |
| predicate: (pt = '20200406') (type: boolean) |
| Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL |
| Select Operator |
| expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string), salary (type: decimal(10,2)), create_time (type: date), update_time (type: date), '20200406' (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 |
| Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
32 rows selected (0.117 seconds)
ii. Use columnar storage
Using a columnar format as the table's storage format also helps us filter columns quickly, since only the requested columns are read from disk.
Columnar formats currently supported by Hive: PARQUET, RCFILE (not recommended), ORC.
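For instance, an existing text-format table can be copied into a columnar table. A sketch reusing the `datacube_salary_dep_aggr` schema from above (the `_orc` table name is made up for illustration):

```sql
-- Hypothetical sketch: store the aggregate in a columnar format.
-- With ORC (or PARQUET), a query that reads only company_name and
-- total_salary never touches the bytes of the other columns.
CREATE TABLE IF NOT EXISTS datacube_salary_dep_aggr_orc (
 company_name STRING COMMENT 'company name'
,dep_name STRING COMMENT 'department name'
,total_salary DECIMAL(10,2) COMMENT 'total salary'
)
STORED AS ORC
;

INSERT OVERWRITE TABLE datacube_salary_dep_aggr_orc
SELECT company_name, dep_name, total_salary
FROM datacube_salary_dep_aggr
;
```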
Reference:
https://blog.csdn.net/u010003835/article/details/88236132