Hive及Spark中Join中过滤下推优化分析

前言: 在Spark SQL中有一条PushPredicateThroughJoin优化规则,其原理是参考Hive中的Join规则完成的,具体参考本文的规则1/2。

Definitions

  • Preserved Row table: The table in an Outer Join that must return all rows. For left outer joins this is the Left table, for right outer joins it is the Right table, and for full outer joins both tables are Preserved Row tables.

  • Null Supplying table: This is the table that has nulls filled in for its columns in unmatched rows. In the non-full outer join case, this is the other table in the Join. For full outer joins both tables are also Null Supplying tables.

  • During Join predicate: A predicate that is in the JOIN ON clause. For example, in ‘R1 join R2 on R1.x = 5’ the predicate ‘R1.x = 5’ is a During Join predicate.

  • After Join predicate: A predicate that is in the WHERE clause.

Predicate Pushdown Rules

The logic can be summarized by these two rules:

  • During Join predicates cannot be pushed past Preserved Row tables.
  • After Join predicates cannot be pushed past Null Supplying tables.

第一条法则: join条件过滤不能下推到保留行表中。

比如以下选择,left join中左表s1为保留行表,所以on条件(join过滤条件)不能下推到s1中
select s1.key, s2.key from src s1 left join src s2 on s1.key > '2';

而s2表不是保留行,所以s2.key>2条件可以下推到s2表中:
select s1.key, s2.key from src s1 left join src s2 on s2.key > '2';

第二条法则:where条件过滤不能下推到NULL补充表。

比如以下选择left join的右表s2为NULL补充表所以,s1.key>2 where条件可以下推到s1:

select s1.key, s2.key from src s1 left join src s2 where s1.key > '2';

而以下选择由于s2未NULL补充表所以s2.key>2过滤条件不能下推

select s1.key, s2.key from src s1 left join src s2 where s2.key > '2';

引用网页:https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior#OuterJoinBehavior-Examples

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值