Hive及Spark中Join中过滤下推优化分析

前言: 在Spark SQL中有一条PushPredicateThroughJoin优化规则,其原理是参考Hive中的Join规则完成的,具体参考本文的规则1/2。

Definitions

  • Preserved Row table: The table in an Outer Join that must return all rows. For left outer joins this is the Left table, for right outer joins it is the Right table, and for full outer joins both tables are Preserved Row tables.

  • Null Supplying table: This is the table that has nulls filled in for its columns in unmatched rows. In the non-full outer join case, this is the other table in the Join. For full outer joins both tables are also Null Supplying tables.

  • During Join predicate: A predicate that is in the JOIN ON clause. For example, in ‘R1 join R2 on R1.x = 5’ the predicate ‘R1.x = 5’ is a During Join predicate.

  • After Join predicate: A predicate that is in the WHERE clause.

Predicate Pushdown Rules

The logic can be summarized by these two rules:

  • During Join predicates cannot be pushed past Preserved Row tables.
  • After Join predicates cannot be pushed past Null Supplying tables.

第一条法则: join条件过滤不能下推到保留行表中。

比如以下选择,left join中左表s1为保留行表,所以on条件(join过滤条件)不能下推到s1中
select s1.key, s2.key from src s1 left join src s2 on s1.key > '2';

而s2表不是保留行,所以s2.key>2条件可以下推到s2表中:
select s1.key, s2.key from src s1 left join src s2 on s2.key > '2';

第二条法则:where条件过滤不能下推到NULL补充表。

比如以下选择left join的右表s2为NULL补充表所以,s1.key>2 where条件可以下推到s1:

select s1.key, s2.key from src s1 left join src s2 where s1.key > '2';

而以下选择由于s2未NULL补充表所以s2.key>2过滤条件不能下推

select s1.key, s2.key from src s1 left join src s2 where s2.key > '2';

引用网页:https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior#OuterJoinBehavior-Examples

展开阅读全文

没有更多推荐了,返回首页