0 Data Preparation
create table follow
(
user_id int,
follower_id int
)row format delimited
fields terminated by '\t';
insert into follow values (1,2);
insert into follow values (1,4);
insert into follow values (1,5);
create table music_likes
(
user_id int,
music_id int
)row format delimited
fields terminated by '\t';
insert into music_likes values (1,20);
insert into music_likes values (1,30);
insert into music_likes values (1,40);
insert into music_likes values (2,10);
insert into music_likes values (2,20);
insert into music_likes values (2,30);
insert into music_likes values (4,10);
insert into music_likes values (4,20);
insert into music_likes values (4,30);
insert into music_likes values (4,60);
create table music
(
id int,
music_name char(10)
)row format delimited
fields terminated by '\t';
insert into music values (10,'a');
insert into music values (20,'b');
insert into music values (30,'c');
insert into music values (40,'d');
insert into music values (50,'e');
insert into music values (60,'f');
1 Execution Plan Analysis and Verification
Run the following SQL:
explain select count(t0.user_id) as cnt
,sum(t1.music_id) as sum_f
from follow t0
left join music_likes t1
on t0.user_id=t1.user_id
where t0.follower_id > 2
group by t0.follower_id
having cnt > 2
order by sum_f
limit 1
The execution plan is shown below. Note that a plain `explain` produces the physical execution plan.
STAGE DEPENDENCIES:
Stage-6 is a root stage
Stage-2 depends on stages: Stage-6
Stage-3 depends on stages: Stage-2
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-6
Map Reduce Local Work
Alias -> Map Local Tables:
t1
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
t1
TableScan
alias: t1
Statistics: Num rows: 10 Data size: 40 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 user_id (type: int)
1 user_id (type: int)
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
alias: t0
filterExpr: (follower_id > 2) (type: boolean)
Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (follower_id > 2) (type: boolean)
Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 user_id (type: int)
1 user_id (type: int)
outputColumnNames: _col0, _col1, _col6
Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col1 (type: int), _col0 (type: int), _col6 (type: int)
outputColumnNames: _col1, _col0, _col6
Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(_col0), sum(_col6)
keys: _col1 (type: int)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint), _col2 (type: bigint)
Local Work:
Map Reduce Local Work
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0), sum(VALUE._col1)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col1 (type: bigint), _col2 (type: bigint)
outputColumnNames: _col1, _col2
Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (_col1 > 2) (type: boolean)
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col1 (type: bigint), _col2 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col1 (type: bigint)
sort order: +
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
TopN Hash Memory Usage: 0.04
value expressions: _col0 (type: bigint)
Reduce Operator Tree:
Select Operator
expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey0 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 1
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: 1
Processor Tree:
ListSink
Time taken: 0.351 seconds, Fetched: 116 row(s)
Understanding the execution plan:
Let's break the output down, starting from the outermost level. It has two major parts:
- STAGE DEPENDENCIES: the dependencies between the stages
- STAGE PLANS: the execution plan of each stage (the physical plan)
Inside STAGE PLANS there is a Map Reduce stage, and an MR plan splits into two parts:
- Map Operator Tree: the map-side operator tree
- Reduce Operator Tree: the reduce-side operator tree
These two trees contain the operators for this SQL statement:
- The first map-side operation is always loading the table, i.e. TableScan. Common attributes:
  - alias: the table name or alias
  - Statistics: table statistics, including row count, data size, etc.
- Select Operator: projection. Common attributes:
  - expressions: the required column names and types
  - outputColumnNames: the output column names
  - Statistics: statistics, including row count, data size, etc.
- Group By Operator: grouped aggregation. Common attributes:
  - aggregations: the aggregate functions used
  - mode: the aggregation mode; hash means hash-based aggregation (hash partitioning), partial means partial aggregation, final means final aggregation
  - keys: the grouping columns; absent if there is no GROUP BY
  - outputColumnNames: the output columns after aggregation
  - Statistics: row count, data size, etc. after aggregation
- Reduce Output Operator: output to the reduce phase. Common attributes:
  - sort order: empty means no sorting; + means ascending; - means descending; +- means two sort columns, the first ascending and the second descending
- Filter Operator: filtering. Common attributes:
  - predicate: the filter condition; e.g. where id >= 1 in the SQL shows up here as (id >= 1)
- Map Join Operator: the join. Common attributes:
  - condition map: the join type, e.g. Inner Join 0 to 1, Left Outer Join 0 to 2
  - keys: the join key columns
  - outputColumnNames: the columns output after the join
  - Statistics: row count, data size, etc. produced by the join
- File Output Operator: writes the output files. Common attributes:
  - compressed: whether the output is compressed
  - table: table info, including input/output file formats, serde, etc.
- Fetch Operator: the client-side fetch. Common attributes:
  - limit: -1 means no row limit; any other value is the limit
(1) The first part shows the dependencies between stages.
We can see that Stage-6 is the root, Stage-2 depends on Stage-6, Stage-3 depends on Stage-2, and so on.
(2) Stage-6
This stage mainly scans table t1 and sinks it into the hash table used by the map join (the HashTable Sink Operator).
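The local hash-table build, and the probe the later map join performs against it, can be sketched in Python. This is a conceptual simplification, not Hive's actual code; `map_join_probe` is a hypothetical helper name, and the data is the sample `music_likes` table from section 0.

```python
# Stage-6 in miniature: load the small table t1 into an in-memory hash table
# keyed on the join key (user_id). This is what HashTable Sink Operator does.
music_likes = [(1, 20), (1, 30), (1, 40), (2, 10), (2, 20), (2, 30),
               (4, 10), (4, 20), (4, 30), (4, 60)]

hash_table = {}
for user_id, music_id in music_likes:
    hash_table.setdefault(user_id, []).append(music_id)

def map_join_probe(row):
    """Left-outer-join one follow row (user_id, follower_id) against the hash table."""
    user_id, follower_id = row
    # For a left outer join, an unmatched key still emits one row with NULLs.
    matches = hash_table.get(user_id, [None])
    return [(user_id, follower_id, m) for m in matches]
```

Each map task then streams the big table (t0) through `map_join_probe`, which is why no shuffle is needed for the join itself.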
(3) Stage-2
Map phase:
This phase actually does quite a lot of work. First it scans t0. Because Hive optimizes the SQL with predicate pushdown, the WHERE filter runs before the join; the map join is then executed against the hash table built in Stage-6. The map join output carries the GROUP BY key along with the other columns, and a Select Operator projects the columns the next operator needs. Whenever the GROUP BY key is not identical to the SELECT columns, this extra Select step appears so that the GROUP BY key is emitted too; in other words, this Select exists to prepare data for the GROUP BY. Next comes the map-side GROUP BY: Hive enables map-side pre-aggregation by default, so a partial aggregation happens on the map side first. This GROUP BY runs in hash mode, i.e. it is a partial aggregation, and the rows are also sorted ascending on the GROUP BY key before being shuffled.
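The map-side pre-aggregation can be sketched as follows. This is an assumption-level simplification, not a Hive API; `map_side_group_by` is a made-up name, and each input row is (group_key, user_id, music_id) as produced by the join.

```python
def map_side_group_by(rows):
    """Hash-mode partial aggregation: per-key (count, sum) partials kept in a dict.

    rows: iterable of (follower_id, user_id, music_id) tuples.
    Returns {follower_id: (partial_count, partial_sum)}.
    """
    partials = {}
    for key, user_id, music_id in rows:
        cnt, s = partials.get(key, (0, 0))
        # count(t0.user_id) skips NULLs; sum() ignores NULL contributions.
        partials[key] = (cnt + (user_id is not None), s + (music_id or 0))
    return partials
```

Only these compact partials are shuffled to the reducer, which is the point of the pre-aggregation.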
Reduce phase:
The reduce-side GROUP BY runs in mergepartial mode: it merges the grouped, sorted partial aggregates and produces the final aggregation result. After the GROUP BY there is another Select Operator, which prepares data for the HAVING step; the HAVING filter is then applied on that Select's result set, producing the HAVING output. This shows that, contrary to the common online claim that SELECT runs after HAVING, here it runs before HAVING, which is exactly why HAVING in HiveQL can reference a SELECT alias. That is the value of reading execution plans! In fact, in HiveQL the Select operator does not appear at one fixed position; each step's result simply prepares the effective data for the next step.
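The merge-plus-HAVING step can be sketched like this. Again a simplification with hypothetical names: `partials_list` stands for the per-mapper partials arriving at the reducer, and the filter corresponds to `predicate: (_col1 > 2)` in the plan.

```python
def reduce_merge_and_having(partials_list):
    """mergepartial: combine per-mapper (count, sum) partials per key,
    then apply the HAVING filter (cnt > 2) on the merged aggregates."""
    merged = {}
    for partials in partials_list:
        for key, (cnt, s) in partials.items():
            mc, ms = merged.get(key, (0, 0))
            merged[key] = (mc + cnt, ms + s)
    # HAVING cnt > 2 filters on the finished aggregate, which is why
    # the SELECT alias cnt is usable here.
    return {k: v for k, v in merged.items() if v[0] > 2}
```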
(4) Stage-3
This stage implements the ORDER BY in the SQL: the rows are sorted on the specified key, and the final columns are then projected.
(5) Stage-0
This stage mainly executes the LIMIT: a Fetch Operator returns at most that many rows to the client.
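Because ORDER BY is combined with LIMIT 1, the plan hints at a top-N optimization ("TopN Hash Memory Usage: 0.04" in Stage-3): each task only needs to keep its n smallest sort keys. A sketch under that assumption (`top_n_by_key` is a made-up helper, not a Hive API):

```python
import heapq

def top_n_by_key(rows, n, key_index=1):
    """Keep only the n rows with the smallest sort key (sum_f here),
    instead of fully sorting everything before applying LIMIT."""
    return heapq.nsmallest(n, rows, key=lambda r: r[key_index])
```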
Now let's tweak the SQL and see how the execution plan differs:
explain select count(1) as cnt
,sum(t0.follower_id) as sum_f
from follow t0
left join music_likes t1
on t0.user_id=t1.user_id
where t0.follower_id > 2
group by t0.follower_id
having cnt > 2
order by sum_f
limit 1
The execution plan:
STAGE DEPENDENCIES:
Stage-6 is a root stage
Stage-2 depends on stages: Stage-6
Stage-3 depends on stages: Stage-2
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-6
Map Reduce Local Work
Alias -> Map Local Tables:
t1
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
t1
TableScan
alias: t1
Statistics: Num rows: 10 Data size: 40 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 user_id (type: int)
1 user_id (type: int)
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
alias: t0
filterExpr: (follower_id > 2) (type: boolean)
Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (follower_id > 2) (type: boolean)
Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 user_id (type: int)
1 user_id (type: int)
outputColumnNames: _col1
Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(1), sum(_col1)
keys: _col1 (type: int)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint), _col2 (type: bigint)
Local Work:
Map Reduce Local Work
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0), sum(VALUE._col1)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col1 (type: bigint), _col2 (type: bigint)
outputColumnNames: _col1, _col2
Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (_col1 > 2) (type: boolean)
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col1 (type: bigint), _col2 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col1 (type: bigint)
sort order: +
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
TopN Hash Memory Usage: 0.04
value expressions: _col0 (type: bigint)
Reduce Operator Tree:
Select Operator
expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey0 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 1
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: 1
Processor Tree:
ListSink
The two plans are basically the same; the main difference is:
In Stage-2 there is no Select Operator after the map join. Why? Because in this SQL the columns needed by SELECT and the GROUP BY key are identical, so no extra projection is required.
2 Summary
From the analysis above, the execution order in HiveQL is roughly:
from -> where (predicate pushdown) -> join -> on -> select (present whenever the SELECT columns differ from the GROUP BY key) -> group by -> select (prepares data for HAVING, which is why HAVING can use SELECT aliases) -> having -> select (projects the filtered result set) -> distinct -> select -> order by -> select -> limit
Note in particular that Select operators appear both before and after the GROUP BY, and that the WHERE predicate is pushed down; these two points are the special ones.
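The deduced order can be sketched as a plain Python pipeline over the sample tables from section 0. This only illustrates the operator order, not Hive's internals; `run_pipeline` and all variable names are hypothetical.

```python
def run_pipeline(follow, music_likes):
    # FROM + WHERE (predicate pushdown): filter t0 before the join
    t0 = [(u, f) for u, f in follow if f > 2]
    # JOIN ... ON user_id (left outer)
    likes = {}
    for u, m in music_likes:
        likes.setdefault(u, []).append(m)
    joined = [(f, m) for u, f in t0 for m in likes.get(u, [None])]
    # GROUP BY follower_id with count/sum
    # (count(t0.user_id): user_id is never NULL on the left side here)
    groups = {}
    for f, m in joined:
        c, s = groups.get(f, (0, 0))
        groups[f] = (c + 1, s + (m or 0))
    # HAVING cnt > 2 -> ORDER BY sum_f -> LIMIT 1
    kept = [(c, s) for c, s in groups.values() if c > 2]
    return sorted(kept, key=lambda r: r[1])[:1]
```

Running it on the data from section 0 reproduces the single (cnt, sum_f) row the SQL returns.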
Still to investigate: the execution plan of HiveQL containing window functions.