谓词下推
在不影响结果的情况下,尽量将过滤条件提前执行。谓词下推后,过滤条件在map端执行,减少了map端的输出,降低了数据在集群上传输的量,节约了集群的资源,也提升了任务的性能。
概念引入
Preserved Row table | 保留表 | 在outer join中需要返回所有数据的表叫做保留表,即在left outer join中,左表是保留表;right outer join中右表则是保留表;在full outer join中左表和右表都要返回所有数据,则左右表都是保留表 |
Null Supplying table | 空表 | 在outer join中对于没有匹配到的行需要用null来填充的表称为空表。在left outer join中,左表的数据全返回,对于左表在右表中无法匹配的数据的相应列用null表示,则此时右表是空表,相应的如果是right outer join的话,左表是空表。但是在full outer join中左表和右表都是Null Supplying table,因为左表和右表都会用null来填充无法匹配的数据。 |
During Join predicate | Join中的谓词 | Join中的谓词是指Join On语句中的谓词。如:a join b on a.type=1 那么a.type=1是Join中的谓词 |
After Join predicate | Join之后的谓词 | where语句中的谓词称之为Join之后的谓词 |
规则的逻辑描述如下:
在 join关联情况下,过滤条件无论在join中还是where中谓词下推都生效;
在full join关联情况下,过滤条件无论在join中还是where中谓词下推都不生效。
保留表的谓词写在join中不能下推
During Join predicates cannot be pushed past Preserved Row tables.
空表的谓词写在join之后不能下推
After Join predicates cannot be pushed past Null Supplying tables.
四种关联情况下谓词是否下推
Pushed or Not | SQL |
Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where D.dept_id='D001'; |
Not Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
Not Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where D.dept_id='D001'; |
Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
Not Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
Not Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where D.dept_id='D001'; |
Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where D.dept_id='D001'; |
join(inner join) | left outer join | right outer join | full outer join | |||||
left table | right table | left table | right table | left table | right table | left table | right table | |
join | Pushed | Pushed | Not Pushed | Pushed | Pushed | Not Pushed | Not Pushed | Not Pushed |
where | Pushed | Pushed | Pushed | Not Pushed | Not Pushed | Pushed | Not Pushed | Not Pushed |
注意!!如果在表达式中含有不确定函数,整个表达式的谓词将不会被下推,其中unix_timestamp()是未知的,在编译时无法得知,此外还有rand等不确定函数使谓词下推无效
select *
from a join b
on a.id = b.id
where a.create_time = unix_timestamp();
参考 https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior