对谓词下推的一点看法

6 篇文章 0 订阅

谓词下推

1. 谓词下推概念

谓词下推原本是一个关系型数据库中的词语,优化关系 SQL 查询的一项基本技术是,将外层查询块的 WHERE 子句中的谓词移入所包含的较低层查询块(例如视图),从而能够提早进行数据过滤以及有可能更好地利用索引。

2. Hive谓词下推(Predicate pushdown):

Hive谓词下推这个词是从关系型数据库借鉴来的,即使对Hive对来说相当于谓词上推。

谓词下推的基本思想:尽可能早的处理表达式(expressions),默认产生的执行计划在看到数据的地方添加过滤器filter。

2.1 hive文档中的解释

在这里插入图片描述

  • Preserved Row table(保留表)

在outer join中需要返回所有数据的表叫做保留表,也就是说在left outer join中,左表需要返回所有数据,则左表是保留表;right outer join中右表则是保留表;在full outer join中左表和右表都要返回所有数据,则左右表都是保留表。

  • Null Supplying table(空表)

在outer join中对于没有匹配到的行需要用null来填充的表称为Null Supplying table。在left outer join中,左表的数据全返回,对于左表在右表中无法匹配的数据的相应列用null表示,则此时右表是Null Supplying table,相应的如果是right outer join的话,左表是Null Supplying table。但是在full outer join中左表和右表都是Null Supplying table,因为左表和右表都会用null来填充无法匹配的数据。

  • During Join predicate(Join中的谓词)

Join中的谓词是指 Join On语句中的谓词。如:R1 join R2 on R1.x = 5 the predicate R1.x = 5是Join中的谓词

  • After Join predicate(Join之后的谓词)

where语句中的谓词称之为Join之后的谓词

3. 测试
3.1 建表
create table test1(id int,openid string) PARTITIONED BY ( day string ) STORED AS ORC;
create table test2(id int,openid string)  PARTITIONED BY ( day string ) STORED AS ORC;
insert into table test1 partition (day='20190521') values(1,'张三');
insert into table test1 partition (day='20190521') values(2,'李四'); 
insert into table test1 partition (day='20190521') values(3,'王五');
insert into table test1 partition (day='20190521') values(1,'钱七');


insert into table test2 partition (day='20190521') values(1,'张三');
insert into table test2 partition (day='20190521') values(3,'赵六');
3.2 TESTING
select 
  count(distinct case when b.openid is null then a.openid end) as n1,
    count(distinct case when a.openid is null then b.openid end) as n2 
from test2 a full join   test1 b on a.openid = b.openid  
where a.day = '20190521' and b.day = '20190521'
select 
    count(distinct case when b.openid is null then a.openid end) as n1,
    count(distinct case when a.openid is null then b.openid end) as n2 
from (
    select * from test1 where day = '20190521'
) a full join (
    select * from test2  where day = '20190521'
)   b 
on a.openid = b.openid  
;
3.3 case1: left outer join 操作
select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id and t1.openid='钱七'  
and t2.openid='张三';
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_1:t2 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:t2 
          TableScan
            alias: t2
            filterExpr: (openid = '张三') (type: boolean)
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: (openid = '张三') (type: boolean)
              Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: PARTIAL
              Select Operator
                expressions: id (type: int), '张三' (type: string), day (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 270 Basic stats: COMPLETE Column stats: PARTIAL
                HashTable Sink Operator
                  filter predicates:
                    0 {(_col1 = '钱七')}
                    1 
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Select Operator
              expressions: id (type: int), openid (type: string), day (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 4 Data size: 736 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                filter predicates:
                  0 {(_col1 = '钱七')}
                  1 
                keys:
                  0 _col0 (type: int)
                  1 _col0 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                Statistics: Num rows: 4 Data size: 2248 Basic stats: COMPLETE Column stats: PARTIAL
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 4 Data size: 2248 Basic stats: COMPLETE Column stats: PARTIAL
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id where t1.openid='钱七'  
and t2.openid='张三';
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        t2 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        t2 
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            HashTable Sink Operator
              keys:
                0 id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            filterExpr: (openid = '钱七') (type: boolean)
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: (openid = '钱七') (type: boolean)
              Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col2, _col6, _col7, _col8
                Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: PARTIAL
                Filter Operator
                  predicate: (_col7 = '张三') (type: boolean)
                  Statistics: Num rows: 1 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
                  Select Operator
                    expressions: _col0 (type: int), '钱七' (type: string), _col2 (type: string), _col6 (type: int), '张三' (type: string), _col8 (type: string)
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                    Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

在这里插入图片描述
在这里插入图片描述

3.4 case2 : full join操作
select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id and t1.openid='钱七'  
and t2.openid='张三';
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          filter predicates:
            0 {(VALUE._col0 = '钱七')}
            1 {(VALUE._col0 = '张三')}
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
          Statistics: Num rows: 6 Data size: 3456 Basic stats: COMPLETE Column stats: PARTIAL
          Select Operator
            expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col6 (type: int), _col7 (type: string), _col8 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 6 Data size: 2208 Basic stats: COMPLETE Column stats: PARTIAL
            File Output Operator
              compressed: false
              Statistics: Num rows: 6 Data size: 2208 Basic stats: COMPLETE Column stats: PARTIAL
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id where t1.openid='钱七'  
and t2.openid='张三';
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
          Statistics: Num rows: 6 Data size: 2280 Basic stats: COMPLETE Column stats: PARTIAL
          Filter Operator
            predicate: ((_col1 = '钱七') and (_col7 = '张三')) (type: boolean)
            Statistics: Num rows: 1 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
            Select Operator
              expressions: _col0 (type: int), '钱七' (type: string), _col2 (type: string), _col6 (type: int), '张三' (type: string), _col8 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

在这里插入图片描述

在这里插入图片描述

4. 总结

谓词下推规则

在这里插入图片描述

5.求教

这种规则在hive2.x版本以后,就不是很准确了,hive2.x对CBO做了优化,CBO也对谓词下推规则产生了一些影响。

因此在hive2.1.1中影响谓词下推规则的,主要有两方面

  • Hive逻辑执行计划层面的优化
  • CBO(Cost based Optimizer)
6. 参考文献

[1] https://blog.csdn.net/strongyoung88/article/details/81156271

[2] https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior

[3] https://blog.csdn.net/baichoufei90/article/details/85264100

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值