对谓词下推的一点看法

最新推荐文章于 2023-03-13 17:20:47 发布

七月流火_2567

最新推荐文章于 2023-03-13 17:20:47 发布

阅读量1.4k

点赞数

分类专栏： hadoop SQL 文章标签：大数据 hive sql

本文链接：https://blog.csdn.net/zhy_2117/article/details/106680694

版权

hadoop 同时被 2 个专栏收录

12 篇文章 0 订阅

订阅专栏

SQL

6 篇文章 0 订阅

订阅专栏

谓词下推

1. 谓词下推概念

谓词下推原本是一个关系型数据库中的词语，优化关系 SQL 查询的一项基本技术是，将外层查询块的 WHERE 子句中的谓词移入所包含的较低层查询块（例如视图），从而能够提早进行数据过滤以及有可能更好地利用索引。

2. Hive谓词下推(Predicate pushdown):

Hive谓词下推这个词是从关系型数据库借鉴来的，即使对Hive对来说相当于谓词上推。

谓词下推的基本思想：尽可能早的处理表达式(expressions)，默认产生的执行计划在看到数据的地方添加过滤器filter。

2.1 hive文档中的解释

在这里插入图片描述

Preserved Row table（保留表）

在outer join中需要返回所有数据的表叫做保留表，也就是说在left outer join中，左表需要返回所有数据，则左表是保留表；right outer join中右表则是保留表；在full outer join中左表和右表都要返回所有数据，则左右表都是保留表。

Null Supplying table（空表）

在outer join中对于没有匹配到的行需要用null来填充的表称为Null Supplying table。在left outer join中，左表的数据全返回，对于左表在右表中无法匹配的数据的相应列用null表示，则此时右表是Null Supplying table，相应的如果是right outer join的话，左表是Null Supplying table。但是在full outer join中左表和右表都是Null Supplying table，因为左表和右表都会用null来填充无法匹配的数据。

During Join predicate（Join中的谓词）

Join中的谓词是指 Join On语句中的谓词。如：R1 join R2 on R1.x = 5 the predicate R1.x = 5是Join中的谓词

After Join predicate（Join之后的谓词）

where语句中的谓词称之为Join之后的谓词

3. 测试

3.1 建表

create table test1(id int,openid string) PARTITIONED BY ( day string ) STORED AS ORC;
create table test2(id int,openid string)  PARTITIONED BY ( day string ) STORED AS ORC;

insert into table test1 partition (day='20190521') values(1,'张三');
insert into table test1 partition (day='20190521') values(2,'李四'); 
insert into table test1 partition (day='20190521') values(3,'王五');
insert into table test1 partition (day='20190521') values(1,'钱七');


insert into table test2 partition (day='20190521') values(1,'张三');
insert into table test2 partition (day='20190521') values(3,'赵六');

3.2 TESTING

select 
  count(distinct case when b.openid is null then a.openid end) as n1,
    count(distinct case when a.openid is null then b.openid end) as n2 
from test2 a full join   test1 b on a.openid = b.openid  
where a.day = '20190521' and b.day = '20190521'

select 
    count(distinct case when b.openid is null then a.openid end) as n1,
    count(distinct case when a.openid is null then b.openid end) as n2 
from (
    select * from test1 where day = '20190521'
) a full join (
    select * from test2  where day = '20190521'
)   b 
on a.openid = b.openid  
;

3.3 case1: left outer join 操作

select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id and t1.openid='钱七'  
and t2.openid='张三';

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_1:t2 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:t2 
          TableScan
            alias: t2
            filterExpr: (openid = '张三') (type: boolean)
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: (openid = '张三') (type: boolean)
              Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: PARTIAL
              Select Operator
                expressions: id (type: int), '张三' (type: string), day (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 270 Basic stats: COMPLETE Column stats: PARTIAL
                HashTable Sink Operator
                  filter predicates:
                    0 {(_col1 = '钱七')}
                    1 
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Select Operator
              expressions: id (type: int), openid (type: string), day (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 4 Data size: 736 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                filter predicates:
                  0 {(_col1 = '钱七')}
                  1 
                keys:
                  0 _col0 (type: int)
                  1 _col0 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                Statistics: Num rows: 4 Data size: 2248 Basic stats: COMPLETE Column stats: PARTIAL
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 4 Data size: 2248 Basic stats: COMPLETE Column stats: PARTIAL
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id where t1.openid='钱七'  
and t2.openid='张三';

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        t2 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        t2 
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            HashTable Sink Operator
              keys:
                0 id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            filterExpr: (openid = '钱七') (type: boolean)
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: (openid = '钱七') (type: boolean)
              Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col2, _col6, _col7, _col8
                Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: PARTIAL
                Filter Operator
                  predicate: (_col7 = '张三') (type: boolean)
                  Statistics: Num rows: 1 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
                  Select Operator
                    expressions: _col0 (type: int), '钱七' (type: string), _col2 (type: string), _col6 (type: int), '张三' (type: string), _col8 (type: string)
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                    Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

在这里插入图片描述

3.4 case2 : full join操作

select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id and t1.openid='钱七'  
and t2.openid='张三';

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          filter predicates:
            0 {(VALUE._col0 = '钱七')}
            1 {(VALUE._col0 = '张三')}
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
          Statistics: Num rows: 6 Data size: 3456 Basic stats: COMPLETE Column stats: PARTIAL
          Select Operator
            expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col6 (type: int), _col7 (type: string), _col8 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 6 Data size: 2208 Basic stats: COMPLETE Column stats: PARTIAL
            File Output Operator
              compressed: false
              Statistics: Num rows: 6 Data size: 2208 Basic stats: COMPLETE Column stats: PARTIAL
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id where t1.openid='钱七'  
and t2.openid='张三';

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
          Statistics: Num rows: 6 Data size: 2280 Basic stats: COMPLETE Column stats: PARTIAL
          Filter Operator
            predicate: ((_col1 = '钱七') and (_col7 = '张三')) (type: boolean)
            Statistics: Num rows: 1 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
            Select Operator
              expressions: _col0 (type: int), '钱七' (type: string), _col2 (type: string), _col6 (type: int), '张三' (type: string), _col8 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

在这里插入图片描述

4. 总结

谓词下推规则

在这里插入图片描述

5.求教

这种规则在hive2.x版本以后，就不是很准确了，hive2.x对CBO做了优化，CBO也对谓词下推规则产生了一些影响。

因此在hive2.1.1中影响谓词下推规则的，主要有两方面

Hive逻辑执行计划层面的优化
CBO（Cost based Optimizer）

6. 参考文献

[1] https://blog.csdn.net/strongyoung88/article/details/81156271

[2] https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior

[3] https://blog.csdn.net/baichoufei90/article/details/85264100

七月流火_2567

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
对谓词下推的一点看法

谓词下推 1. 谓词下推概念谓词下推原本是一个关系型数据库中的词语，优化关系 SQL 查询的一项基本技术是，将外层查询块的 WHERE 子句中的谓词移入所包含的较低层查询块（例如视图），从而能够提早进行数据过滤以及有可能更好地利用索引。 2. Hive谓词下推(Predicate pushdown):Hive谓词下推这个词是从关系型数据库借鉴来的，即使对Hive对来说相当于谓词上推。谓词下推的基本思想：尽可能早的处理表达式(expressions)，默认产生的执行计...
复制链接

扫一扫