谓词下推
1. 谓词下推概念
谓词下推原本是一个关系型数据库中的词语,优化关系 SQL 查询的一项基本技术是,将外层查询块的 WHERE 子句中的谓词移入所包含的较低层查询块(例如视图),从而能够提早进行数据过滤以及有可能更好地利用索引。
2. Hive谓词下推(Predicate pushdown):
Hive谓词下推这个词是从关系型数据库借鉴来的,即使对Hive对来说相当于谓词上推。
谓词下推的基本思想:尽可能早的处理表达式(expressions),默认产生的执行计划在看到数据的地方添加过滤器filter。
2.1 hive文档中的解释
- Preserved Row table(保留表)
在outer join中需要返回所有数据的表叫做保留表,也就是说在left outer join中,左表需要返回所有数据,则左表是保留表;right outer join中右表则是保留表;在full outer join中左表和右表都要返回所有数据,则左右表都是保留表。
- Null Supplying table(空表)
在outer join中对于没有匹配到的行需要用null来填充的表称为Null Supplying table。在left outer join中,左表的数据全返回,对于左表在右表中无法匹配的数据的相应列用null表示,则此时右表是Null Supplying table,相应的如果是right outer join的话,左表是Null Supplying table。但是在full outer join中左表和右表都是Null Supplying table,因为左表和右表都会用null来填充无法匹配的数据。
- During Join predicate(Join中的谓词)
Join中的谓词是指 Join On语句中的谓词。如:R1 join R2 on R1.x = 5 the predicate R1.x = 5是Join中的谓词
- After Join predicate(Join之后的谓词)
where语句中的谓词称之为Join之后的谓词
3. 测试
3.1 建表
create table test1(id int,openid string) PARTITIONED BY ( day string ) STORED AS ORC;
create table test2(id int,openid string) PARTITIONED BY ( day string ) STORED AS ORC;
insert into table test1 partition (day='20190521') values(1,'张三');
insert into table test1 partition (day='20190521') values(2,'李四');
insert into table test1 partition (day='20190521') values(3,'王五');
insert into table test1 partition (day='20190521') values(1,'钱七');
insert into table test2 partition (day='20190521') values(1,'张三');
insert into table test2 partition (day='20190521') values(3,'赵六');
3.2 TESTING
select
count(distinct case when b.openid is null then a.openid end) as n1,
count(distinct case when a.openid is null then b.openid end) as n2
from test2 a full join test1 b on a.openid = b.openid
where a.day = '20190521' and b.day = '20190521'
select
count(distinct case when b.openid is null then a.openid end) as n1,
count(distinct case when a.openid is null then b.openid end) as n2
from (
select * from test1 where day = '20190521'
) a full join (
select * from test2 where day = '20190521'
) b
on a.openid = b.openid
;
3.3 case1: left outer join 操作
select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id and t1.openid='钱七'
and t2.openid='张三';
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
$hdt$_1:t2
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
$hdt$_1:t2
TableScan
alias: t2
filterExpr: (openid = '张三') (type: boolean)
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
predicate: (openid = '张三') (type: boolean)
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: id (type: int), '张三' (type: string), day (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 270 Basic stats: COMPLETE Column stats: PARTIAL
HashTable Sink Operator
filter predicates:
0 {(_col1 = '钱七')}
1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: id (type: int), openid (type: string), day (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 4 Data size: 736 Basic stats: COMPLETE Column stats: PARTIAL
Map Join Operator
condition map:
Left Outer Join0 to 1
filter predicates:
0 {(_col1 = '钱七')}
1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 4 Data size: 2248 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 2248 Basic stats: COMPLETE Column stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id where t1.openid='钱七'
and t2.openid='张三';
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
t2
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
t2
TableScan
alias: t2
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
HashTable Sink Operator
keys:
0 id (type: int)
1 id (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: t1
filterExpr: (openid = '钱七') (type: boolean)
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
predicate: (openid = '钱七') (type: boolean)
Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 id (type: int)
1 id (type: int)
outputColumnNames: _col0, _col2, _col6, _col7, _col8
Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
predicate: (_col7 = '张三') (type: boolean)
Statistics: Num rows: 1 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: _col0 (type: int), '钱七' (type: string), _col2 (type: string), _col6 (type: int), '张三' (type: string), _col8 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
3.4 case2 : full join操作
select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id and t1.openid='钱七'
and t2.openid='张三';
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: openid (type: string), day (type: string)
TableScan
alias: t2
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: openid (type: string), day (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Outer Join 0 to 1
filter predicates:
0 {(VALUE._col0 = '钱七')}
1 {(VALUE._col0 = '张三')}
keys:
0 id (type: int)
1 id (type: int)
outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
Statistics: Num rows: 6 Data size: 3456 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col6 (type: int), _col7 (type: string), _col8 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 6 Data size: 2208 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 6 Data size: 2208 Basic stats: COMPLETE Column stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id where t1.openid='钱七'
and t2.openid='张三';
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: openid (type: string), day (type: string)
TableScan
alias: t2
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: openid (type: string), day (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Outer Join 0 to 1
keys:
0 id (type: int)
1 id (type: int)
outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
Statistics: Num rows: 6 Data size: 2280 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
predicate: ((_col1 = '钱七') and (_col7 = '张三')) (type: boolean)
Statistics: Num rows: 1 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: _col0 (type: int), '钱七' (type: string), _col2 (type: string), _col6 (type: int), '张三' (type: string), _col8 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
在这里插入图片描述
4. 总结
谓词下推规则
5.求教
这种规则在hive2.x版本以后,就不是很准确了,hive2.x对CBO做了优化,CBO也对谓词下推规则产生了一些影响。
因此在hive2.1.1中影响谓词下推规则的,主要有两方面
- Hive逻辑执行计划层面的优化
- CBO(Cost based Optimizer)
6. 参考文献
[1] https://blog.csdn.net/strongyoung88/article/details/81156271
[2] https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior
[3] https://blog.csdn.net/baichoufei90/article/details/85264100