hive谓词与cbo的一些奇葩事

谓词下推 Predicate Pushdown(PPD)

     就是在不影响结果的情况下,尽量将过滤条件提前执行。谓词下推后,过滤条件在map端执行,减少了map端的输出,降低了数据在集群上传输的量,节约了集群的资源,也提升了任务的性能。

  • Preserved Row table(在join中返回全部数据)
    • left join 左边的部分
    • right join 右边的部分
    • full join 左边和右边的
  • Null Supplying table(有空数据的join)
  • During Join predicate(在join中谓词下推, on条件后面)
  • After Join predicate(在join后谓词下推,where条件)

规则:

During Join predicates cannot be pushed past Preserved Row tables.

在join中返回全部数据的,不会在条件在on后面的被下推执行

After Join predicates cannot be pushed past Null Supplying tables.

join后有空数据的表如果join后有where条件不能被下推

得到以下结论:

1、对于Join(Inner Join)、Full outer Join,条件写在on后面,还是where后面,性能上面没有区别;
2、对于Left outer Join ,右侧的表写在on后面、左侧的表写在where后面,性能上有提高;
3、对于Right outer Join,左侧的表写在on后面、右侧的表写在where后面,性能上有提高;

 

hive.cbo.enable :

在Hive 1.1.0之后,这个feature是默认开启的,它可以自动优化HQL中多个JOIN的顺序,并选择合适的JOIN算法。

例子:

-- set hive.auto.convert.join=false;

-- set hive.cbo.enable=false;
-- explain

explain 
select count(case when b.userid is null then a.userid end)
from
    increasecredit_202011 a
    full outer join
    (select *
    from dw_pay_activity
    

    ) b
    	
on a.userid=b.userid
where b.visit_date=20201029
Explain
STAGE DEPENDENCIES:
  Stage-5 is a root stage
  Stage-2 depends on stages: Stage-5
  Stage-0 depends on stages: Stage-2
STAGE PLANS:
  Stage: Stage-5
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_0:$hdt$_0:a 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_0:$hdt$_0:a 
          TableScan
            alias: a
            Statistics: Num rows: 6657 Data size: 53262 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: userid (type: bigint)
              outputColumnNames: _col0
              Statistics: Num rows: 6657 Data size: 53262 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col0 (type: bigint)
                  1 _col0 (type: bigint)
  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dw_pay_activity
            Statistics: Num rows: 123722 Data size: 989778 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: userid (type: bigint)
              outputColumnNames: _col0
              Statistics: Num rows: 123722 Data size: 989778 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Right Outer Join0 to 1
                keys:
                  0 _col0 (type: bigint)
                  1 _col0 (type: bigint)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 136094 Data size: 1088755 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: CASE WHEN (_col1 is null) THEN (_col0) ELSE (null) END (type: bigint)
                  outputColumnNames: _col0
                  Statistics: Num rows: 136094 Data size: 1088755 Basic stats: COMPLETE Column stats: NONE
                  Group By Operator
                    aggregations: count(_col0)
                    mode: hash
                    outputColumnNames: _col0
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      sort order: 
                      Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                      value expressions: _col0 (type: bigint)
      Local Work:
        Map Reduce Local Work
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: true
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

在这个执行计划中,系统自动优化成right outer join .... 按照上面的原理去推测,outerjoin是不会被下推执行的,但是会被cbo给优化掉。因为条件是b. a: b leftjoin.

 

 

参考链接:https://mp.weixin.qq.com/s/ZwoAjkMlvAOoOI8Z-QNFsQ

                 https://blog.csdn.net/strongyoung88/article/details/81156271

               https://blog.csdn.net/strangerzz/article/details/78992482

               https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值