Do You Really Understand HiveSQL? This Is What the Real HiveSQL Execution Order Looks Like


Contents

0 Data Preparation

1 Execution Plan Analysis and Verification

2 Summary


0 Data Preparation

create table follow
(
  user_id int,
  follower_id int
)row format delimited
fields terminated by '\t';
 
insert into follow values (1,2);
insert into follow values (1,4);
insert into follow values (1,5);
 
create table music_likes
(
  user_id int,
  music_id int
)row format delimited
fields terminated by '\t';
 
insert into music_likes values (1,20);
insert into music_likes values (1,30);
insert into music_likes values (1,40);
insert into music_likes values (2,10);
insert into music_likes values (2,20);
insert into music_likes values (2,30);
insert into music_likes values (4,10);
insert into music_likes values (4,20);
insert into music_likes values (4,30);
insert into music_likes values (4,60);
 
create table music
(
  id int,
  music_name char(10)
)row format delimited
fields terminated by '\t';
 
insert into music values (10,'a');
insert into music values (20,'b');
insert into music values (30,'c');
insert into music values (40,'d');
insert into music values (50,'e');
insert into music values (60,'f');

1 Execution Plan Analysis and Verification

Run the following SQL:


explain select count(t0.user_id) as cnt
      ,sum(t1.music_id) as sum_f
from follow t0
left join music_likes t1
on t0.user_id=t1.user_id
where t0.follower_id > 2
group by t0.follower_id
having cnt > 2
order by sum_f
limit 1

The execution plan is shown below. A plain explain generates the physical execution plan.

STAGE DEPENDENCIES:
  Stage-6 is a root stage
  Stage-2 depends on stages: Stage-6
  Stage-3 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-6
    Map Reduce Local Work
      Alias -> Map Local Tables:
        t1 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        t1 
          TableScan
            alias: t1
            Statistics: Num rows: 10 Data size: 40 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 user_id (type: int)
                1 user_id (type: int)

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t0
            filterExpr: (follower_id > 2) (type: boolean)
            Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (follower_id > 2) (type: boolean)
              Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 user_id (type: int)
                  1 user_id (type: int)
                outputColumnNames: _col0, _col1, _col6
                Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col1 (type: int), _col0 (type: int), _col6 (type: int)
                  outputColumnNames: _col1, _col0, _col6
                  Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
                  Group By Operator
                    aggregations: count(_col0), sum(_col6)
                    keys: _col1 (type: int)
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: int)
                      sort order: +
                      Map-reduce partition columns: _col0 (type: int)
                      Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
                      value expressions: _col1 (type: bigint), _col2 (type: bigint)
      Local Work:
        Map Reduce Local Work
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0), sum(VALUE._col1)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col1 (type: bigint), _col2 (type: bigint)
            outputColumnNames: _col1, _col2
            Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (_col1 > 2) (type: boolean)
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col1 (type: bigint), _col2 (type: bigint)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col1 (type: bigint)
              sort order: +
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
              TopN Hash Memory Usage: 0.04
              value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey0 (type: bigint)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 1
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 1
      Processor Tree:
        ListSink

Time taken: 0.351 seconds, Fetched: 116 row(s)

Understanding the execution plan:

Let's break the output down, starting at the outermost level, which has two major parts:

  1. STAGE DEPENDENCIES: the dependencies between the stages
  2. STAGE PLANS: the execution plan of each stage (the physical plan)

Within a stage plan there is a Map Reduce job, and an MR plan splits into two parts:

  1. Map Operator Tree: the map-side plan tree
  2. Reduce Operator Tree: the reduce-side plan tree

These two plan trees contain the operators for this SQL statement:

  1. The first map-side operation is always loading the table, i.e. TableScan. Common attributes:
    • alias: the table alias
    • Statistics: table statistics, including row count, data size, etc.
  2. Select Operator: projection. Common attributes:
    • expressions: the selected fields and their types
    • outputColumnNames: the output column names
    • Statistics: statistics, including row count, data size, etc.
  3. Group By Operator: grouping and aggregation. Common attributes:
    • aggregations: the aggregate functions
    • mode: aggregation mode; hash means hash-based (map-side partial) aggregation, i.e. hash partitioning; partial means partial aggregation; final means final aggregation
    • keys: the group-by fields; absent when there is no grouping
    • outputColumnNames: columns output after aggregation
    • Statistics: row count, data size, etc. after grouping and aggregation
  4. Reduce Output Operator: output to the reduce phase. Common attributes:
    • sort order: empty means no sorting; + means ascending, - means descending; +- means two sort columns, the first ascending and the second descending
  5. Filter Operator: filtering. Common attributes:
    • predicate: the filter condition; e.g. where id >= 1 in the SQL shows here as (id >= 1)
  6. Map Join Operator: the join. Common attributes:
    • condition map: the join type, e.g. Inner Join 0 to 1, Left Outer Join 0 to 1
    • keys: the join key fields
    • outputColumnNames: fields output after the join
    • Statistics: row count, size, etc. of the data produced by the join
  7. File Output Operator: file output. Common attributes:
    • compressed: whether the output is compressed
    • table: table information, including the input/output file formats, serialization, etc.
  8. Fetch Operator: client-side data fetch. Common attributes:
    • limit: -1 means no row limit; any other value is the row limit

(1) First, the stage dependencies

Stage-6 is the root; Stage-2 depends on Stage-6, Stage-3 depends on Stage-2, and so on.

(2) Stage-6

This stage scans the small table t1 and, via the HashTable Sink Operator, builds the in-memory hash table on the join key that the map join will use.
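The mechanism behind the HashTable Sink and the subsequent Map Join can be pictured in plain Python. This is only a simulation of the idea on the sample data, not Hive internals:

```python
from collections import defaultdict

# Small table t1 (music_likes): rows of (user_id, music_id)
music_likes = [(1, 20), (1, 30), (1, 40), (2, 10), (2, 20),
               (2, 30), (4, 10), (4, 20), (4, 30), (4, 60)]

# Stage-6: build an in-memory hash table on the join key,
# which is what the HashTable Sink Operator prepares for the map join
hash_table = defaultdict(list)
for user_id, music_id in music_likes:
    hash_table[user_id].append(music_id)

# Stage-2 map side: stream the big table t0 (follow) and probe the hash table
follow = [(1, 2), (1, 4), (1, 5)]          # (user_id, follower_id)
joined = []
for user_id, follower_id in follow:
    # a missing key yields a single None row, preserving LEFT JOIN semantics
    for music_id in hash_table.get(user_id) or [None]:
        joined.append((user_id, follower_id, music_id))

print(len(joined))  # 9: each of the 3 follow rows matches 3 music_likes rows
```

Because only the small table is materialized in memory and the big table is streamed, no shuffle is needed for the join itself.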

(3) Stage-2

Map phase:

This stage actually does quite a lot of work. First it scans t0; since Hive optimizes the SQL with predicate pushdown, the where filter is applied first. Then the map join executes, and its output columns include the key used by the group by in the next step; the Select Operator then picks out the candidate data for that step. Whenever the group-by key differs from the select fields, this extra Select step appears so that the group-by key is output alongside the other columns as input to the group by; in other words, this Select exists to prepare data for the group by. Next, the map-side group by runs: Hive enables map-side pre-aggregation by default, so a partial, hash-based aggregation happens first (mode: hash), and the rows are also sorted in ascending order on the group-by key.

Reduce phase:

On the reduce side, the group by uses a merge aggregation (mode: mergepartial): the sorted, grouped data is merged to produce the final aggregates. After the group by there is another Select Operator, which prepares the data for the having step; the having filter then runs on that Select's result set and outputs the having result. This shows that, contrary to the common claim online that select runs after having, in HiveSQL it runs before having, which is exactly why having can reference a select alias in HiveSQL. That is the beauty of reading execution plans! In fact, in HiveSQL select does not appear at any single fixed step; each step's Select simply prepares the effective data the next step needs.
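The two-phase aggregation just described (mode: hash on the map side, mode: mergepartial on the reduce side) can be sketched as follows. The split of rows across "mappers" here is an arbitrary assumption for illustration:

```python
from collections import defaultdict

# Joined-and-filtered rows after the map join: (follower_id, user_id, music_id).
# follower_id > 2 leaves follow rows (1,4) and (1,5), each matching music 20/30/40.
map_splits = [
    [(4, 1, 20), (4, 1, 30), (5, 1, 20)],   # rows seen by "mapper" 1 (assumed split)
    [(4, 1, 40), (5, 1, 30), (5, 1, 40)],   # rows seen by "mapper" 2 (assumed split)
]

# Map side, mode: hash -- each mapper partially aggregates its own rows
partials = []
for split in map_splits:
    acc = defaultdict(lambda: [0, 0])        # group key -> [count, sum]
    for follower_id, user_id, music_id in split:
        acc[follower_id][0] += 1             # count(user_id)
        acc[follower_id][1] += music_id      # sum(music_id)
    partials.append(dict(acc))

# Reduce side, mode: mergepartial -- merge the partials shuffled by key
merged = defaultdict(lambda: [0, 0])
for partial in partials:
    for key, (cnt, s) in partial.items():
        merged[key][0] += cnt
        merged[key][1] += s

result = {k: tuple(v) for k, v in sorted(merged.items())}
print(result)   # {4: (3, 90), 5: (3, 90)}
```

Pre-aggregating on the map side shrinks the data shuffled to the reducers, which is why Hive enables it by default.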

(4) Stage-3

This stage performs the order by in the SQL, sorting on the specified key and then selecting the final result.

(5) Stage-0

This stage performs the limit.

Now let's vary the SQL and see how the execution plan differs:

explain select count(1) as cnt
      ,sum(t0.follower_id) as sum_f
from follow t0
left join music_likes t1
on t0.user_id=t1.user_id
where t0.follower_id > 2
group by t0.follower_id
having cnt > 2
order by sum_f
limit 1

The execution plan is as follows:

STAGE DEPENDENCIES:
  Stage-6 is a root stage
  Stage-2 depends on stages: Stage-6
  Stage-3 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-6
    Map Reduce Local Work
      Alias -> Map Local Tables:
        t1 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        t1 
          TableScan
            alias: t1
            Statistics: Num rows: 10 Data size: 40 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 user_id (type: int)
                1 user_id (type: int)

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t0
            filterExpr: (follower_id > 2) (type: boolean)
            Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (follower_id > 2) (type: boolean)
              Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 user_id (type: int)
                  1 user_id (type: int)
                outputColumnNames: _col1
                Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count(1), sum(_col1)
                  keys: _col1 (type: int)
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: int)
                    sort order: +
                    Map-reduce partition columns: _col0 (type: int)
                    Statistics: Num rows: 11 Data size: 44 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col1 (type: bigint), _col2 (type: bigint)
      Local Work:
        Map Reduce Local Work
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0), sum(VALUE._col1)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col1 (type: bigint), _col2 (type: bigint)
            outputColumnNames: _col1, _col2
            Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (_col1 > 2) (type: boolean)
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col1 (type: bigint), _col2 (type: bigint)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col1 (type: bigint)
              sort order: +
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
              TopN Hash Memory Usage: 0.04
              value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey0 (type: bigint)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 1
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 1
      Processor Tree:
        ListSink

The two plans are almost identical; the main difference is this:

In Stage-2 there is no Select Operator after the map join. Why? Because in this SQL the select fields are consistent with the group-by fields.

2 Summary

From the study above, the execution order of HiveSQL is roughly:

from -> where (predicate pushdown) -> join -> on -> select (appears whenever the select fields differ from the group-by key) -> group by -> select (prepares data for having, which is why having can use select aliases) -> having -> select (the filtered result set) -> distinct -> select -> order by -> select -> limit

Two points in the order above deserve special attention: select appears both before and after group by, and the where predicate is pushed down. These are the two unusual cases.
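The order above can be replayed in plain Python on the sample data from section 0. This is a logical simulation of the plan, not of Hive's physical execution:

```python
# Sample data from section 0
follow = [(1, 2), (1, 4), (1, 5)]                            # (user_id, follower_id)
music_likes = [(1, 20), (1, 30), (1, 40), (2, 10), (2, 20),
               (2, 30), (4, 10), (4, 20), (4, 30), (4, 60)]  # (user_id, music_id)

# where (pushed down before the join): t0.follower_id > 2
t0 = [row for row in follow if row[1] > 2]

# left join ... on t0.user_id = t1.user_id
joined = []
for user_id, follower_id in t0:
    matches = [m for u, m in music_likes if u == user_id] or [None]
    for music_id in matches:
        joined.append((follower_id, user_id, music_id))

# group by t0.follower_id: count(t0.user_id), sum(t1.music_id)
groups = {}
for follower_id, user_id, music_id in joined:
    cnt, s = groups.get(follower_id, (0, 0))
    groups[follower_id] = (cnt + (user_id is not None), s + (music_id or 0))

# having cnt > 2, then order by sum_f, then limit 1
rows = [v for v in groups.values() if v[0] > 2]
rows.sort(key=lambda r: r[1])
print(rows[:1])  # [(3, 90)]
```

Each step consumes exactly what the previous step produced, mirroring how each Select Operator in the plan prepares data for the next operator.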

Still to study: the execution plan of HiveSQL when window functions are involved.

Follow 石榴姐's WeChat official account "我的SQL呀" so you don't miss future posts.

 
