摘要:无论是在面试的时候,还是需要进一步了解Hive背后的理论知识和工作原理,第一步都需要了解Explain字段,它能详细的表示出相应的SQL语句与对应的MapReduce任务。
我们都知道,Hive的SQL语言是一门声明式的语言,客户端在提交声明式查询后,Hive会将HQL转换成MapReduce任务,Explain使我们更详细了解底层一些实现细节,会让人们更高效的使用Hive。
用法
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
Explain 会将HiveSQL转换成Stage的序列,主要有三部分组成:
- 抽象语法树(需要加上EXTENDED关键字才会显示)
- 执行计划不同Stage之间的依赖关系
- Stage描述:对应的操作算子和对应的数据
例子
下面我们通过一个例子来仔细看看Explain到底做了什么:
hive> explain extended select sum(last_sat_date) from app.dim_week_date group bydt limit 1;
# 解析语法树
OK
ABSTRACT SYNTAX TREE:
TOK_QUERY
TOK_FROM
TOK_TABREF
TOK_TABNAME #查询的表名
app
dim_week_date
TOK_INSERT #查询的结果会在客户端展示,但在展示之前,会将结果存在一个临时文件中
TOK_DESTINATION
TOK_DIR
TOK_TMP_FILE
TOK_SELECT
TOK_SELEXPR
TOK_FUNCTION
SUM
TOK_TABLE_OR_COL
last_sat_date
TOK_GROUPBY
TOK_TABLE_OR_COL
dt
TOK_LIMIT
1
# 执行计划中不同环节之间的依赖关系依赖关系
STAGE DEPENDENCIES:
Stage-1 is a root stage #Stage-1是根环节
Stage-0 depends on stages: Stage-1 #Stage-0依赖Stage-0
#
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan #要查询的表
alias: dim_week_date
Statistics: Num rows: 4 Data size: 418 Basic stats: COMPLETE Column stats: NONE
Select Operator #select算子操作
expressions: dt (type: string), last_sat_date (type: string) #Select对应的数据类型
outputColumnNames: dt, last_sat_date
Statistics: Num rows: 4 Data size: 418 Basic stats: COMPLETE Column stats: NONE
Group By Operator #Group By 算子,会应用到sum
aggregations: sum(last_sat_date) #聚合函数
keys: dt (type: string)
mode: hash
outputColumnNames: _col0, _col1 #输出的临时字段名
Statistics: Num rows: 4 Data size: 418 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 4 Data size: 418 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: double)
Reduce Operator Tree: #Reduce阶段
Group By Operator #Group By 算子
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 2 Data size: 209 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col1 (type: double)
outputColumnNames: _col0
Statistics: Num rows: 2 Data size: 209 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 1
Statistics: Num rows: 1 Data size: 104 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 104 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: 1
Processor Tree:
ListSink
Time taken: 1.101 seconds, Fetched: 55 row(s)
总结
利用Explain查看Hive执行任务过程并不总是有用的,但是当某个任务执行特别复杂或者执行效率特别低效时,可以利用Explain来查看调优过程对SQL执行底层的影响,这样可以对HiveSQL的调优起到一定的作用。