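Predicate Pushdown Rules in Hive

The concept of predicate pushdown

Predicate Pushdown (PPD): in short, filter conditions are applied as early as possible, provided the result is unchanged. Once a predicate is pushed down, the filter runs on the map side, which reduces map output, cuts the amount of data moved across the cluster, saves cluster resources, and improves job performance.

PPD configuration

PPD control parameter: hive.optimize.ppd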

  • Default Value: true
  • Added In: Hive 0.4.0
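
As a minimal sketch of how the setting might be checked and toggled within a session: `set` with a property name and no value prints the current value, and `explain` prints the query plan so the position of the filter can be inspected. The tables E and D are the ones used in the experiments in this article. Where the filter appears in the plan depends on the Hive version and execution engine, so the comments describe an expectation, not guaranteed output.

```sql
-- Print the current value of the PPD switch
set hive.optimize.ppd;

-- Enable predicate pushdown for this session (true is already the default)
set hive.optimize.ppd=true;

-- Inspect the plan: when the predicate is pushed, a Filter Operator is
-- expected directly under the TableScan of E rather than after the join
explain
select ename, dept_name
from E join D on E.dept_id = D.dept_id
where E.eid = 'HZ001';

-- For comparison, disable PPD and repeat the explain statement above
set hive.optimize.ppd=false;
```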

Definitions

  • Preserved Row table

The table in an Outer Join that must return all rows.
For left outer joins this is the Left table, for right outer joins it is the Right table, and for full outer joins both tables are Preserved Row tables.

  • Null Supplying table

This is the table that has nulls filled in for its columns in unmatched rows.
In the non-full outer join case, this is the other table in the Join. For full outer joins both tables are also Null Supplying tables.

  • During Join predicate

A predicate that is in the JOIN ON clause.
For example, in ‘R1 join R2 on R1.x = 5’ the predicate ‘R1.x = 5’ is a During Join predicate.

  • After Join predicate

A predicate that is in the WHERE clause.

PPD rules:

Stated logically, the rules are:

  • During Join predicates cannot be pushed past Preserved Row tables.
  • After Join predicates cannot be pushed past Null Supplying tables.

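As a table:

|                 | Preserved Row tables | Null Supplying tables |
|-----------------|----------------------|-----------------------|
| Join Predicate  | Case J1: Not Pushed  | Case J2: Pushed       |
| Where Predicate | Case W1: Pushed      | Case W2: Not Pushed   |

Pushed: the predicate is pushed down, i.e. the query is optimized.
Not Pushed: the predicate is not pushed down, i.e. the query is not optimized.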
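
To make the two "Pushed" cases concrete, the sketch below pairs each pushable query with a hand-written equivalent in which the filter is applied in a subquery before the join. It only illustrates what pushing means logically; it does not describe how Hive rewrites the plan internally. The schemas are an assumption (E with eid, ename, dept_id; D with dept_id, dept_name), and the subquery aliases fe and fd are made up for the example.

```sql
-- Case W1: WHERE predicate on the Preserved Row table (E) of a left outer join
select ename, dept_name
from E left outer join D on E.dept_id = D.dept_id
where E.eid = 'HZ001';

-- Logically equivalent: filter E first, then join
select ename, dept_name
from (select * from E where eid = 'HZ001') fe
left outer join D on fe.dept_id = D.dept_id;

-- Case J2: ON predicate on the Null Supplying table (D)
select ename, dept_name
from E left outer join D
  on (E.dept_id = D.dept_id and D.dept_id = 'D001');

-- Logically equivalent: filter D first, then join
select ename, dept_name
from E left outer join (select * from D where dept_id = 'D001') fd
  on E.dept_id = fd.dept_id;
```

The "Not Pushed" cases have no such rewrite. In Case J1, filtering E before the join would drop E rows that a left outer join must keep. In Case W2, filtering D before the join would turn rows that the WHERE clause removes into null-padded rows that survive. For the same reason, moving a predicate between ON and WHERE in an outer join generally changes the result set.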

Experiments
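
The article does not show how the tables were created. A minimal setup for reproducing the experiments might look like the sketch below; the column types and sample rows are invented for illustration, and only the column names come from the queries in this article.

```sql
-- Hypothetical schemas; only the column names appear in the article
create table E (eid string, ename string, dept_id string);
create table D (dept_id string, dept_name string);

-- Made-up rows (insert ... values requires Hive 0.14 or later)
insert into table E values
  ('HZ001', 'Alice', 'D001'),
  ('HZ002', 'Bob',   'D002'),
  ('HZ003', 'Carol', 'D009');
insert into table D values
  ('D001', 'Engineering'),
  ('D002', 'Sales'),
  ('D003', 'Finance');

-- With tables this small Hive may choose a map join; turning automatic
-- conversion off keeps a shuffle join so map side and reduce side are distinct
set hive.auto.convert.join=false;

-- Check one case: see whether the filter on eid appears before or after the join
explain
select ename, dept_name
from E left outer join D on (E.dept_id = D.dept_id and E.eid = 'HZ001');
```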

Results as a list:

| Pushed or Not | SQL |
|---------------|-----|
| Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
| Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
| Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
| Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where D.dept_id='D001'; |
| Not Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
| Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
| Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
| Not Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where D.dept_id='D001'; |
| Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
| Not Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
| Not Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
| Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where D.dept_id='D001'; |
| Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
| Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
| Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
| Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where D.dept_id='D001'; |

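Results as a table:

| Join type         | Table       | Join Predicate | Where Predicate |
|-------------------|-------------|----------------|-----------------|
| Join (inner join) | Left Table  | Pushed         | Pushed          |
| Join (inner join) | Right Table | Pushed         | Pushed          |
| Left Outer Join   | Left Table  | Not Pushed     | Pushed          |
| Left Outer Join   | Right Table | Pushed         | Not Pushed      |
| Right Outer Join  | Left Table  | Pushed         | Not Pushed      |
| Right Outer Join  | Right Table | Not Pushed     | Pushed          |
| Full Outer Join   | Left Table  | Not Pushed     | Not Pushed      |
| Full Outer Join   | Right Table | Not Pushed     | Not Pushed      |

This table is in effect the PPD rules table above, applied to each join type.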

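Conclusions

1. For Join (Inner Join) and Full Outer Join, it makes no performance difference whether a condition is written after ON or after WHERE: with an inner join both forms are pushed, and with a full outer join neither is;
2. For Left Outer Join, performance improves when conditions on the right table are written after ON and conditions on the left table are written after WHERE;
3. For Right Outer Join, performance improves when conditions on the left table are written after ON and conditions on the right table are written after WHERE;
4. When the conditions are spread across both tables, conclusions 2 and 3 apply to each predicate independently, as follows:

| SQL | When the filter is applied |
|-----|----------------------------|
| select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001' and D.dept_id = 'D001'); | dept_id is filtered on the map side, eid on the reduce side |
| select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id = 'D001') where E.eid='HZ001'; | dept_id and eid are both filtered on the map side |
| select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') where D.dept_id = 'D001'; | dept_id and eid are both filtered on the reduce side |
| select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id ) where E.eid='HZ001' and D.dept_id = 'D001'; | dept_id is filtered on the reduce side, eid on the map side |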

Note: if an expression contains a non-deterministic function, none of the predicates in that expression are pushed. For example:

select a.* 
from a join b on a.id = b.id
where a.ds = '2019-10-09' and a.create_time = unix_timestamp();

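Because unix_timestamp is a non-deterministic function whose value cannot be known at compile time, the whole expression is not pushed, which means ds='2019-10-09' is not filtered early either. Other non-deterministic functions, such as rand(), behave the same way.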
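
One common way around this is to separate the deterministic filter from the non-deterministic one by hand. The sketch below applies the ds filter inside a subquery, so it no longer depends on the optimizer pushing it, and leaves only the unix_timestamp() comparison in the outer WHERE clause. The alias fa is made up for the example, and whether this helps in practice depends on the Hive version, so it is worth confirming with explain.

```sql
-- Filter on ds explicitly before the join; only the non-deterministic
-- comparison remains in the outer WHERE clause
select fa.*
from (select * from a where ds = '2019-10-09') fa
join b on fa.id = b.id
where fa.create_time = unix_timestamp();
```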

References:
[1] https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior
