谓词下推 Predicate Pushdown(PPD):
就是在不影响结果的情况下,尽量将过滤条件提前执行。谓词下推后,过滤条件在map端执行,减少了map端的输出,降低了数据在集群上传输的量,节约了集群的资源,也提升了任务的性能。
- Preserved Row table(在join中返回全部数据)
- left join 左边的部分
- right join 右边的部分
- full join 左边和右边的
- Null Supplying table(有空数据的join)
- During Join predicate(在join中谓词下推, on条件后面)
- After Join predicate(在join后谓词下推,where条件)
规则:
During Join predicates cannot be pushed past Preserved Row tables.
在join中返回全部数据的,不会在条件在on后面的被下推执行
After Join predicates cannot be pushed past Null Supplying tables.
join后有空数据的表如果join后有where条件不能被下推
得到以下结论:
1、对于Join(Inner Join)、Full outer Join,条件写在on后面,还是where后面,性能上面没有区别;
2、对于Left outer Join ,右侧的表写在on后面、左侧的表写在where后面,性能上有提高;
3、对于Right outer Join,左侧的表写在on后面、右侧的表写在where后面,性能上有提高;
hive.cbo.enable :
在Hive 1.1.0之后,这个feature是默认开启的,它可以自动优化HQL中多个JOIN的顺序,并选择合适的JOIN算法。
例子:
-- set hive.auto.convert.join=false;
-- set hive.cbo.enable=false;
-- explain
explain
select count(case when b.userid is null then a.userid end)
from
increasecredit_202011 a
full outer join
(select *
from dw_pay_activity
) b
on a.userid=b.userid
where b.visit_date=20201029
Explain
STAGE DEPENDENCIES:
Stage-5 is a root stage
Stage-2 depends on stages: Stage-5
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-5
Map Reduce Local Work
Alias -> Map Local Tables:
$hdt$_0:$hdt$_0:a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
$hdt$_0:$hdt$_0:a
TableScan
alias: a
Statistics: Num rows: 6657 Data size: 53262 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: userid (type: bigint)
outputColumnNames: _col0
Statistics: Num rows: 6657 Data size: 53262 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 _col0 (type: bigint)
1 _col0 (type: bigint)
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
alias: dw_pay_activity
Statistics: Num rows: 123722 Data size: 989778 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: userid (type: bigint)
outputColumnNames: _col0
Statistics: Num rows: 123722 Data size: 989778 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Right Outer Join0 to 1
keys:
0 _col0 (type: bigint)
1 _col0 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 136094 Data size: 1088755 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: CASE WHEN (_col1 is null) THEN (_col0) ELSE (null) END (type: bigint)
outputColumnNames: _col0
Statistics: Num rows: 136094 Data size: 1088755 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(_col0)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Local Work:
Map Reduce Local Work
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
在这个执行计划中,系统自动优化成right outer join .... 按照上面的原理去推测,outerjoin是不会被下推执行的,但是会被cbo给优化掉。因为条件是b. a: b leftjoin.
参考链接:https://mp.weixin.qq.com/s/ZwoAjkMlvAOoOI8Z-QNFsQ
https://blog.csdn.net/strongyoung88/article/details/81156271
https://blog.csdn.net/strangerzz/article/details/78992482
https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior
2717

被折叠的 条评论
为什么被折叠?



