HiveQL是一种声明式语言,用户会提交声明式的查询,而Hive会将其转换成MapReduce job。
使用EXPLAIN可以帮助我们学习Hive是如何将查询转换成MapReduce任务的。在查询语句前面加上explain关键字,可以看到查询计划和其它一些信息。这个查询本身是不会执行的。
一个Hive任务会包含一个或者多个stage(阶段),不同的stage会存在依赖关系。越复杂的查询会引入越多的stage,同样耗时也会越多。
一个stage可以是一个MapReduce任务,也可以是一个抽象阶段,或者一个合并阶段,还可以是一个limit阶段,以及Hive需要的其它某个任务的一个阶段。默认情况下,Hive会一次只执行一个stage,不过也可以并行执行。
Stage-1包含了大部分处理过程,而且还会触发一个MapReduce job。
map阶段--Map Operator Tree:TableScan以这个表作为输入,然后会产生一个字段amt_case的输出。Group By Operator会应用到sum(amt_case),然后产生一个输出字段_col0(临时字段)。
reduce阶段--Reduce Operator Tree:相同的Group By Operator,对_col0字段进行sum操作。File Output Operator说明输出结果是文本格式,是基于字符串的输出格式:HiveIgnoreKeyTextOutputFormat
因为这个语句没有limit,因此Stage-0阶段没有任何操作。
hive> explain select sum(amt_case) from t_orderinfo;
OK
STAGE DEPENDENCIES: --------------------显示各个阶段的依赖关系
Stage-1 is a root stage ---------------------阶段一
Stage-0 depends on stages: Stage-1 ---------------------结束阶段,依赖阶段一
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree: ---------------------Map阶段
TableScan
alias: t_orderinfo ---------------------扫描的表
Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: amt_case (type: bigint)
outputColumnNames: amt_case
Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(amt_case)
mode: hash
outputColumnNames: _col0 ---------------------临时字段
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Reduce Operator Tree: ---------------------Reduce阶段
Group By Operator
aggregations: sum(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1 ---------------------job没有limit语句,stage-0是没有任何操作的阶段
Processor Tree:
ListSink
图1
hive> explain select count(1) from t_orderinfo;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage ---------------------阶段一
Stage-0 depends on stages: Stage-1 ---------------------结束阶段,依赖阶段一
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree: ---------------------Map阶段
TableScan
alias: t_orderinfo ---------------------扫描的表
Statistics: Num rows: 8 Data size: 1362 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
Statistics: Num rows: 8 Data size: 1362 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: count(1) ---------------------函数
mode: hash
outputColumnNames: _col0 ---------------------临时字段
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: bigint)
Reduce Operator Tree: ---------------------Reduce阶段
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1 ---------------------job没有limit语句,stage-0是没有任何操作的阶段
Processor Tree:
ListSink