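- The dependencies between the different stages of the plan (how many MapReduce stages the SQL statement is split into, and how those stages depend on each other)
- The description of each of the stages (a detailed description of what happens inside each stage)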
Take the following query as an example:

    explain select * from
    (
      select id
      from test_table_aaa
      where dt='20210202'
      and id is not null
      group by id
    ) a
    join
    (
      select id, min(event_time) as min_event_time
      from test_table_bbb
      where id is not null
      group by id
    ) b
    on (a.id = b.id)
    limit 100
The first part of the output lists the stage dependencies:

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-6 depends on stages: Stage-1, Stage-3 , consists of Stage-7, Stage-8, Stage-2
      Stage-7 has a backup stage: Stage-2
      Stage-4 depends on stages: Stage-7
      Stage-8 has a backup stage: Stage-2
      Stage-5 depends on stages: Stage-8
      Stage-2
      Stage-3 is a root stage
      Stage-0 depends on stages: Stage-4, Stage-5, Stage-2

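Here Stage-1 and Stage-3 are root stages; a root stage is a starting point of the DAG. By default Hive executes only one stage at a time, but if parallel execution is enabled, stages that have no dependencies on each other can run at the same time, which is one way to improve Hive SQL performance.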
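
As a minimal sketch of enabling the parallel execution mentioned above, the session settings below could be applied before running the query. The value 8 is only illustrative (it is also Hive's default), and whether this helps depends on how much spare capacity the cluster has.

```sql
-- Allow stages with no dependencies on each other
-- (e.g. Stage-1 and Stage-3 in the plan above) to run concurrently.
SET hive.exec.parallel=true;

-- Upper bound on how many stages may run at the same time.
-- 8 is an illustrative value, not a recommendation.
SET hive.exec.parallel.thread.number=8;
```

These are session-level settings, so they only affect queries submitted in the same session.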
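"Stage-6 depends on stages: Stage-1, Stage-3" describes the execution order between stages: Stage-6 can only run after Stage-1 and Stage-3. "consists of" means that Stage-6 is made up of several parts (here Stage-7, Stage-8, and Stage-2).
"Stage-7 has a backup stage: Stage-2" is something I do not fully understand yet. My reading is that if Stage-7 cannot be executed, the backup Stage-2 is run instead.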
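
One way to probe the backup-stage behaviour is to re-run EXPLAIN with automatic map-join conversion turned off. This assumes that the conditional Stage-6 and its backup stages come from Hive's map-join conversion, which the plan output itself does not state. In the sketch below, the aliases `a` and `b` and the `ON` clause are illustrative reconstructions, not taken from the original query text.

```sql
-- Assumption: the conditional Stage-6 (Stage-7 / Stage-8 with backup Stage-2)
-- is produced by automatic map-join conversion. Turning it off should then
-- leave only the common (reduce-side) join in the plan.
SET hive.auto.convert.join=false;

EXPLAIN
SELECT *
FROM (
  SELECT id
  FROM test_table_aaa
  WHERE dt = '20210202'
    AND id IS NOT NULL
  GROUP BY id
) a   -- alias is hypothetical
JOIN (
  SELECT id, min(event_time) AS min_event_time
  FROM test_table_bbb
  WHERE id IS NOT NULL
  GROUP BY id
) b   -- alias is hypothetical
ON (a.id = b.id)
LIMIT 100;
```

If the "has a backup stage" lines disappear from STAGE DEPENDENCIES under this setting, that would support the reading that Stage-2 is the fallback for the map-join stages.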
From YARN we can see that the actual execution order of the whole SQL statement was Stage-1 -> Stage-3 -> Stage-2.

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          // Operations of the Map phase
          Map Operator Tree:
              // Scan of the Hive table test_table_aaa
              TableScan
                alias: test_table_aaa
                // Row count and data size at this point (if the row count is not in the metastore, Hive estimates it, so it is not necessarily accurate)
                Statistics: Num rows: 1513882 Data size: 3565192110 Basic stats: COMPLETE Column stats: NONE
                // Filters the data set; corresponds to the where clause
                Filter Operator
                  // Predicate used for filtering
                  predicate: uaid is not null (type: boolean)
                  Statistics: Num rows: 756941 Data size: 1782596055 Basic stats: COMPLETE Column stats: NONE
                  // Group-by aggregation over the filtered result set
                  Group By Operator
                    // Aggregate function used, here min()
                    aggregations: min(install_time_selected_timezone)
                    // Group by the uaid column
                    keys: uaid (type: string)
                    mode: hash
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 756941 Data size: 1782596055 Basic stats: COMPLETE Column stats: NONE
                    // Output of the Map side
                    Reduce Output Operator
                      key expressions: _col0 (type: string)
                      // Whether the output is sorted: + means ascending, - means descending, one symbol per column
                      sort order: +
                      // Columns used to partition Map output across reducers
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 756941 Data size: 1782596055 Basic stats: COMPLETE Column stats: NONE
                      value expressions: _col1 (type: string)
          // Operations of the Reduce phase (not every SQL statement has a Reduce phase)
          Reduce Operator Tree:
            Group By Operator
              aggregations: min(VALUE._col0)
              keys: KEY._col0 (type: string)
              // Final merge of the partial results produced on the Map side
              mode: mergepartial
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 378470 Data size: 891296850 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                // The output files are compressed
                compressed: true
                // Input and output file formats, and the SerDe used to read the data
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format:
                    serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

      Stage: Stage-2
        Map Reduce
          Map Operator Tree:
              // This is no longer a table scan; the input is the output of a previous MapReduce job
              TableScan
                Reduce Output Operator
                  // Both the Map and Reduce phases emit key-value pairs; key expressions and value expressions name the columns used as the key and the value of the Map output
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 378470 Data size: 891296850 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: string)
              TableScan
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 135055067 Data size: 178049618343 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: string)
          Reduce Operator Tree:
            // Join operation
            Join Operator
              // 0 and 1 stand for the two data sets being joined; the join type is inner join
              condition map:
                   Inner Join 0 to 1
              // Join columns of the two data sets
              keys:
                0 _col0 (type: string)
                1 _col0 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 148560576 Data size: 195854584422 Basic stats: COMPLETE Column stats: NONE
              // Corresponds to limit 100 in the SQL
              Limit
                Number of rows: 100
                Statistics: Num rows: 100 Data size: 131800 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: true
                  Statistics: Num rows: 100 Data size: 131800 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format:
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

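
Every Statistics line in the plan reports "Column stats: NONE", and the row counts may be estimates. A possible way to give Hive better numbers is to collect statistics before running EXPLAIN again. The sketch below assumes that test_table_aaa is partitioned by dt, which the where clause suggests but the plan does not confirm.

```sql
-- Basic statistics (row count, data size) for the partition the query reads.
-- Assumption: dt is a partition column of test_table_aaa.
ANALYZE TABLE test_table_aaa PARTITION (dt='20210202') COMPUTE STATISTICS;

-- Column-level statistics, which the plan reports as NONE.
ANALYZE TABLE test_table_aaa PARTITION (dt='20210202') COMPUTE STATISTICS FOR COLUMNS;
```

Both statements scan the partition, so on a large table they can take a while to finish.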
Reference: Hive执行计划分析 (Hive execution plan analysis)