This article introduces the basic usage of Hive execution plans.
I. Basic Syntax
explain
Shows the basic execution plan;
explain dependency
Shows the inputs of the query (tables and partitions);
explain authorization
Shows the permissions involved in the SQL operation;
explain vectorization
Shows the vectorization description of the SQL, supported since Hive 2.3.0;
explain cbo
Outputs the plan generated by the Calcite optimizer, supported since Hive 4.0.0;
explain ast
Outputs the abstract syntax tree of the query. AST was removed in Hive 2.1.0 due to a bug (dumping the AST could cause OOM errors) and is brought back in 4.0.0;
explain extended
Shows extended information of the execution plan;
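The variants above all share one general form; a sketch (the bracketed keywords are mutually exclusive options, and availability depends on the Hive version):

```sql
-- EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION | VECTORIZATION | CBO | AST] query;
explain extended
select * from test_user where age >= 20;
```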
II. Basic Usage of explain
1. explain case 1: ordinary functions (or none)
A select-from-where query.
The SQL is as follows:
explain
select
nvl(region,'unknown') region,
age*2 age,
concat('prefix_',name) name
from test_user where age >=20;
The output is as follows:
# stage dependencies
STAGE DEPENDENCIES:
Stage-0 is a root stage
# detailed stage information
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
# table scan
TableScan
# table name
alias: test_user
# statistics for this stage: row count and data size
Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
# filtering in this stage
Filter Operator
# filter predicate
predicate: (age >= 20) (type: boolean)
# statistics after filtering
Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
# column projection
Select Operator
expressions: NVL(region,'unknown') (type: string), (age * 2) (type: int), concat('prefix_', name) (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
ListSink
The execution plan has several layers; read it from the outside in.
The outermost layer consists of two parts:
STAGE DEPENDENCIES: the dependencies between stages
STAGE PLANS: detailed information for each stage
As shown above, most of the information lives inside STAGE PLANS, which itself is mainly split into two parts:
Map Operator Tree: the execution plan tree on the map side
Reduce Operator Tree: the execution plan tree on the reduce side
Other keywords are explained in the comments inside the plan output.
This case is the simplest select-from-where query. The plan contains only a map-side fetch stage; if the data volume is kept under control, the query can be executed locally.
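Whether such a simple query runs as a local fetch task is governed by the hive.fetch.task.conversion setting; a hedged sketch of how to compare the two behaviors (table name test_user is from the example above):

```sql
-- 'more' allows simple select/filter/limit queries to be fetched
-- directly without launching a job
set hive.fetch.task.conversion=more;
explain select nvl(region,'unknown') region from test_user where age >= 20;

-- with 'none', even this simple query is compiled into a full stage
set hive.fetch.task.conversion=none;
explain select nvl(region,'unknown') region from test_user where age >= 20;
```

Comparing the two explain outputs shows the plan switching between a plain Fetch Operator and a regular job stage.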
2. explain case 2: aggregate functions
A select-agg_func-from-where-groupby query.
The SQL is as follows:
explain select age,count(*) from test_user where age>20 group by age;
The output is as follows:
# dependencies between stages
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
# detailed stage information
STAGE PLANS:
Stage: Stage-1
Map Reduce
# execution plan tree on the map side
Map Operator Tree:
# table scan of alias: test_user
TableScan
alias: test_user
# statistics for this stage: row count and data size
Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
# filtering the table scanned above
Filter Operator
# filter predicate
# implicit filter conditions that do not appear in the SQL show up here, e.g. null filtering on join keys
predicate: (age > 20) (type: boolean)
# statistics after filtering
Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
# map-side aggregation is enabled, so data is pre-aggregated on the map side
# check the setting with:
# set hive.map.aggr;
# group by on the result
Group By Operator
# aggregation function
aggregations: count()
# group-by key columns
keys: age (type: int)
# aggregation mode
# hash: hash-based aggregation (hash partitioned);
# partial: partial aggregation;
# final: final aggregation
mode: hash
outputColumnNames: _col0, _col1
# statistics after aggregation
Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
# the map-side output, itself an operator
Reduce Output Operator
# the key of the key-value pairs output by the map stage
key expressions: _col0 (type: int)
# ascending sort
sort order: +
# the column(s) used to partition the map output
Map-reduce partition columns: _col0 (type: int)
# statistics of the output
Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
# the value of the key-value pairs output by the map stage
value expressions: _col1 (type: bigint)
# execution plan tree on the reduce side
Reduce Operator Tree:
# group by operation
Group By Operator
# aggregation function
aggregations: count(VALUE._col0)
# group-by key columns
keys: KEY._col0 (type: int)
# mergepartial: merge the partial map-side aggregates
# (complete would mean all aggregation is done on the reduce side)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
File Output Operator
# whether the output is compressed
compressed: false
Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
# information about the output table/format
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
# -1 means no limit on the number of rows
limit: -1
Processor Tree:
ListSink
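The map-side pre-aggregation seen above is controlled by hive.map.aggr, which the plan comments already reference. A hedged sketch of how to compare the two plans (disabling it should remove the hash-mode Group By Operator from the Map Operator Tree, leaving all aggregation to the reduce side):

```sql
-- disable map-side aggregation and re-run explain
set hive.map.aggr=false;
explain select age, count(*) from test_user where age > 20 group by age;

-- restore the default
set hive.map.aggr=true;
```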
3. explain case 3: window functions
The SQL is as follows:
explain
select id,row_number() over(partition by region order by age) rown
from test_user;
The output is as follows:
# stage dependencies
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
# detailed stage information
STAGE PLANS:
Stage: Stage-1
Map Reduce
# map-stage operator tree
Map Operator Tree:
# table scan
TableScan
# table name
alias: test_user
Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
# output to the reduce stage
Reduce Output Operator
# the key of the key-value pairs sent to the reduce stage
# note that both region and age from the over() clause are included
key expressions: region (type: string), age (type: int)
# two plus signs: both key columns are sorted ascending
sort order: ++
# only the region column is used to partition the map output
Map-reduce partition columns: region (type: string)
Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
# the value of the key-value pairs sent to the reduce stage
# the id column is simply carried along as the value
value expressions: id (type: int)
# vectorized execution
Execution mode: vectorized
# reduce-stage operator tree
Reduce Operator Tree:
# column projection
Select Operator
# VALUE._col0 is the value output by the map stage
# KEY.reducesinkkey0 is the first map output key, i.e. region
# KEY.reducesinkkey1 is the second map output key, i.e. age
expressions: VALUE._col0 (type: int), KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: int)
outputColumnNames: _col0, _col1, _col3
Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
# Partitioned Table Function, i.e. window functions
PTF Operator
Function definitions:
Input definition
input alias: ptf_0
output shape: _col0: int, _col1: string, _col3: int
type: WINDOWING
Windowing table definition
input alias: ptf_1
name: windowingtablefunction
# order-by column
order by: _col3 ASC NULLS FIRST
# window partition column
partition by: _col1
raw input shape:
window functions:
window function definition
alias: row_number_window_0
# window function name
name: row_number
# the Java implementation of row_number
window function: GenericUDAFRowNumberEvaluator
# the window frame of the function
window frame: PRECEDING(MAX)~FOLLOWING(MAX)
isPivotResult: true
Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
# column projection after the window function
Select Operator
expressions: _col0 (type: int), row_number_window_0 (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
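The "Execution mode: vectorized" line in the plan above appears because vectorized execution is enabled (explain vectorization from Part I gives more detail). A hedged sketch of how to toggle it and compare the plans; note that older Hive versions only vectorize over ORC input:

```sql
-- disable vectorization and re-run explain; the
-- "Execution mode: vectorized" line should disappear
set hive.vectorized.execution.enabled=false;
explain
select id,row_number() over(partition by region order by age) rown
from test_user;

-- restore the default
set hive.vectorized.execution.enabled=true;
```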
4. explain case 4: join (non-mapjoin)
The SQL is as follows:
set hive.auto.convert.join=false;
explain
select a.id from test_partition1 a
inner join test_partition2 b
on a.id=b.id;
The output is as follows:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
# map-stage plan
Map Operator Tree:
# the first table scan, i.e. the left table
TableScan
alias: a
Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
# data filtering
Filter Operator
# implicit filter condition
predicate: id is not null (type: boolean)
Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
# column projection
Select Operator
expressions: id (type: int)
outputColumnNames: _col0
Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
# the second table scan, i.e. the right table
TableScan
alias: b
Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
# data filtering
Filter Operator
# implicit filter condition
predicate: id is not null (type: boolean)
Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int)
outputColumnNames: _col0
Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
# reduce-stage plan
Reduce Operator Tree:
# join operation
# in this case the join happens in the reduce stage
Join Operator
condition map:
# inner join
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0
Statistics: Num rows: 5 Data size: 533 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 5 Data size: 533 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
5. explain case 5: join (mapjoin)
The SQL is as follows:
set hive.auto.convert.join=true;
explain
select a.id from test_partition1 a
inner join test_partition2 b
on a.id=b.id;
The output is as follows:
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
# local work
Map Reduce Local Work
Alias -> Map Local Tables:
$hdt$_1:b
# fetch the data
Fetch Operator
limit: -1
# map-side local operator tree
Alias -> Map Local Operator Tree:
$hdt$_1:b
# table scan
TableScan
alias: b
# statistics
Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
# implicit filter condition
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int)
outputColumnNames: _col0
Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
# after projection, the data is written into a hash table
HashTable Sink Operator
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Stage: Stage-3
Map Reduce
# this stage has only a map phase
# with a mapjoin, the data is fully processed on the map side
# no shuffle or reduce is needed
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int)
outputColumnNames: _col0
Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
# mapjoin operation
Map Join Operator
# inner join
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0
Statistics: Num rows: 5 Data size: 533 Basic stats: COMPLETE Column stats: NONE
# data output
File Output Operator
compressed: false
Statistics: Num rows: 5 Data size: 533 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
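Beyond hive.auto.convert.join, the automatic conversion to a mapjoin also depends on the size of the smaller table; a hedged sketch (the threshold setting hive.mapjoin.smalltable.filesize defaults to roughly 25 MB in most versions, check your own installation):

```sql
-- show the current threshold below which a table is considered
-- small enough to be loaded into the hash table
set hive.mapjoin.smalltable.filesize;

-- raise the threshold, then re-run explain to see whether the
-- join is still converted
set hive.mapjoin.smalltable.filesize=50000000;
```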
III. Basic Usage of explain dependency
explain dependency is mainly used to analyze the data sources of a SQL query.
With two partitioned tables and some test data prepared, the examples below are two join queries whose join conditions include non-equality predicates, once in the ON clause and once in the WHERE clause:
explain dependency
select a.id from test_partition1 a
inner join test_partition2 b
on a.id=b.id and a.part=b.part and a.part>=1 and a.part <=2;
explain dependency
select a.id from test_partition1 a
inner join test_partition2 b
on a.id=b.id and a.part=b.part
where a.part>=1 and a.part <=2;
Both statements produce the same result, shown below. Other articles I have seen claim the two queries yield different results, but when I actually ran them on Hive 2.1.1 the results were identical.
{
"input_tables": [
{
"tablename": "test_mall@test_partition1",
"tabletype": "MANAGED_TABLE"
},
{
"tablename": "test_mall@test_partition2",
"tabletype": "MANAGED_TABLE"
}
],
"input_partitions": [
{
"partitionName": "test_mall@test_partition1@part=1"
},
{
"partitionName": "test_mall@test_partition1@part=2"
},
{
"partitionName": "test_mall@test_partition2@part=1"
},
{
"partitionName": "test_mall@test_partition2@part=2"
}
]
}
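Since the output is plain JSON, it can also be consumed programmatically, e.g. to check which partitions a query will actually scan. A minimal sketch in Python; parse_dependency is a hypothetical helper, and the JSON string is the example output above:

```python
import json

def parse_dependency(output: str):
    """Parse the JSON printed by `explain dependency` and return the
    input tables and the partitions the query reads."""
    dep = json.loads(output)
    tables = [t["tablename"] for t in dep.get("input_tables", [])]
    partitions = [p["partitionName"] for p in dep.get("input_partitions", [])]
    return tables, partitions

# The example output from above, e.g. captured via
# `hive -e "explain dependency <query>"`:
output = """
{
  "input_tables": [
    {"tablename": "test_mall@test_partition1", "tabletype": "MANAGED_TABLE"},
    {"tablename": "test_mall@test_partition2", "tabletype": "MANAGED_TABLE"}
  ],
  "input_partitions": [
    {"partitionName": "test_mall@test_partition1@part=1"},
    {"partitionName": "test_mall@test_partition1@part=2"},
    {"partitionName": "test_mall@test_partition2@part=1"},
    {"partitionName": "test_mall@test_partition2@part=2"}
  ]
}
"""
tables, partitions = parse_dependency(output)
print(tables)      # tables read by the query
print(partitions)  # partitions actually scanned
```

A check like this can catch queries that accidentally scan every partition of a large table before they are run in production.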