Hive及Spark中Join中过滤下推优化分析

最新推荐文章于 2023-10-22 22:42:37 发布

Rilakkuma

最新推荐文章于 2023-10-22 22:42:37 发布

阅读量871

点赞数

分类专栏： spark 文章标签： Spark

本文链接：https://blog.csdn.net/CRISPY_RICE/article/details/79360377

版权

8 篇文章 0 订阅

订阅专栏

前言：在Spark SQL中有一条PushPredicateThroughJoin优化规则，其原理是参考Hive中的Join规则完成的，具体参考本文的规则1/2。

Preserved Row table: The table in an Outer Join that must return all rows. For left outer joins this is the Left table, for right outer joins it is the Right table, and for full outer joins both tables are Preserved Row tables.
Null Supplying table: This is the table that has nulls filled in for its columns in unmatched rows. In the non-full outer join case, this is the other table in the Join. For full outer joins both tables are also Null Supplying tables.
During Join predicate: A predicate that is in the JOIN ON clause. For example, in ‘R1 join R2 on R1.x = 5’ the predicate ‘R1.x = 5’ is a During Join predicate.
After Join predicate: A predicate that is in the WHERE clause.

The logic can be summarized by these two rules:

第一条法则： join条件过滤不能下推到保留行表中。

比如以下选择，left join中左表s1为保留行表，所以on条件（join过滤条件）不能下推到s1中
select s1.key, s2.key from src s1 left join src s2 on s1.key > '2';

而s2表不是保留行，所以s2.key>2条件可以下推到s2表中：
select s1.key, s2.key from src s1 left join src s2 on s2.key > '2';

第二条法则：where条件过滤不能下推到NULL补充表。

比如以下选择left join的右表s2为NULL补充表所以，s1.key>2 where条件可以下推到s1：

select s1.key, s2.key from src s1 left join src s2 where s1.key > '2';

而以下选择由于s2未NULL补充表所以s2.key>2过滤条件不能下推

select s1.key, s2.key from src s1 left join src s2 where s2.key > '2';

引用网页：https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior#OuterJoinBehavior-Examples

关注

专栏目录