大纲地址 : https://blog.csdn.net/u010003835/article/details/105334641
测试表以及测试数据
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `datacube_salary_org`( |
| `company_name` string COMMENT '????', |
| `dep_name` string COMMENT '????', |
| `user_id` bigint COMMENT '??id', |
| `user_name` string COMMENT '????', |
| `salary` decimal(10,2) COMMENT '??', |
| `create_time` date COMMENT '????', |
| `update_time` date COMMENT '????') |
| PARTITIONED BY ( |
| `pt` string COMMENT '????') |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| WITH SERDEPROPERTIES ( |
| 'field.delim'=',', |
| 'serialization.format'=',') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://cdh-manager:8020/user/hive/warehouse/data_warehouse_test.db/datacube_salary_org' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1586310488') |
+----------------------------------------------------+
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name | datacube_salary_org.dep_name | datacube_salary_org.user_id | datacube_salary_org.user_name | datacube_salary_org.salary | datacube_salary_org.create_time | datacube_salary_org.update_time | datacube_salary_org.pt |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200405 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200406 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | enginer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| s.zh | enginer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200407 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200407 |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
场景三.合理进行并行控制
合理的使用并行化参数控制
并行化参数有以下几个:
开启任务并行执行
set hive.exec.parallel=true;
允许并行任务的最大线程数
set hive.exec.parallel.thread.number=8;
这个只是开启并行化建议,具体最终能不能并行,还是依赖于之间有无依赖关系。
有依赖关系,既使开启了优化参数也是不行的。
如何查看每个阶段间是否有依赖关系,我们可以通过 EXPLAIN 去查看
起作用的情况 ,针对于以下情况(但不限于)
1) UNION ALL 语句 :
因为 UNION ALL 的子句之间是没有依赖关系,所以开启并行参数,可以让这些STAGE 并行工作。
例如:如下SQL
SELECT
'a' AS type
,COUNT(1) AS num
FROM datacube_salary_basic_aggr AS a
UNION ALL
SELECT
'b' AS type
,COUNT(1) AS num
FROM datacube_salary_company_aggr AS b
UNION ALL
SELECT
'c' AS type
,COUNT(1) AS num
FROM datacube_salary_dep_aggr AS c
UNION ALL
SELECT
'd' AS type
,COUNT(1) AS num
FROM datacube_salary_total_aggr AS d
;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1, Stage-3, Stage-4, Stage-5 |
| Stage-3 is a root stage |
| Stage-4 is a root stage |
| Stage-5 is a root stage |
| Stage-0 depends on stages: Stage-2 |
| |
可以看到 stage1, stage3, stage4, stage5 之间并无依赖。它们之间是并行的。
2) JOIN 语句:
当多张表 JOIN 是相同条件的时候,会转化为一个STAGE
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON a.user_id = c.user_id
;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| value expressions: mobile (type: string), sex (type: bigint) |
| TableScan |
| alias: b |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| value expressions: user_name (type: string) |
| TableScan |
| alias: c |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| value expressions: type (type: string) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| Left Outer Join0 to 2 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| 2 user_id (type: bigint) |
| outputColumnNames: _col0, _col1, _col2, _col8, _col14 |
| Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
筛选下重要的部分:
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| Left Outer Join0 to 2 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| 2 user_id (type: bigint) |