Spark's Five Join Strategies

How the Spark source code describes the join strategies

/**
* Select the proper physical plan for join based on join strategy hints, the availability of
* equi-join keys and the sizes of joining relations. Below are the existing join strategies,
* their characteristics and their limitations.
*
* - Broadcast hash join (BHJ):
*     Only supported for equi-joins, while the join keys do not need to be sortable.
*     Supported for all join types except full outer joins.
*     BHJ usually performs faster than the other join algorithms when the broadcast side is
*     small. However, broadcasting tables is a network-intensive operation and it could cause
*     OOM or perform badly in some cases, especially when the build/broadcast side is big.
*
* - Shuffle hash join:
*     Only supported for equi-joins, while the join keys do not need to be sortable.
*     Supported for all join types except full outer joins.
*
* - Shuffle sort merge join (SMJ):
*     Only supported for equi-joins and the join keys have to be sortable.
*     Supported for all join types.
*
* - Broadcast nested loop join (BNLJ):
*     Supports both equi-joins and non-equi-joins.
*     Supports all the join types, but the implementation is optimized for:
*       1) broadcasting the left side in a right outer join;
*       2) broadcasting the right side in a left outer, left semi, left anti or existence join;
*       3) broadcasting either side in an inner-like join.
*     For other cases, we need to scan the data multiple times, which can be rather slow.
*
* - Shuffle-and-replicate nested loop join (a.k.a. cartesian product join):
*     Supports both equi-joins and non-equi-joins.
*     Supports only inner like joins.
*/
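
According to the comment above, the two nested-loop strategies are the only ones that accept non-equi join conditions. A minimal single-process sketch of why that works: every pair of rows is compared, so any predicate can serve as the join condition. The function name `nested_loop_join` and the dict-per-row layout are illustrative assumptions, not Spark APIs.

```python
def nested_loop_join(left_rows, right_rows, condition):
    """Inner join on an arbitrary predicate by comparing every pair of rows."""
    # Cost is len(left_rows) * len(right_rows) comparisons, which is why
    # these strategies are a last resort for large inputs.
    return [
        {**left, **right}
        for left in left_rows
        for right in right_rows
        if condition(left, right)
    ]


# Non-equi example: match each event to the time range that contains it.
events = [{"ts": 5}, {"ts": 15}, {"ts": 25}]
ranges = [{"lo": 0, "hi": 10, "name": "a"}, {"lo": 10, "hi": 20, "name": "b"}]
matched = nested_loop_join(events, ranges, lambda e, r: r["lo"] <= e["ts"] < r["hi"])
```

The event with `ts` 25 falls in no range, so an inner join drops it.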

The five join strategies are:

  1. Broadcast hash join (BHJ)
  2. Shuffle hash join
  3. Shuffle sort merge join (SMJ)
  4. Broadcast nested loop join (BNLJ)
  5. Shuffle-and-replicate nested loop join (cartesian product join)

Broadcast hash join

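Also known as a map-side join. It supports only equi-joins and is typically used to join a fact table with a dimension table: the dimension table is usually small, so broadcasting it improves efficiency. Note: before a broadcast join runs, Spark first sends the data from the executors to the driver, and the driver then broadcasts it back to the executors. If the data to broadcast is large, the driver can run out of memory (OOM).
Parameter: spark.sql.autoBroadcastJoinThreshold, default 10MB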
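
The idea behind a broadcast hash join can be sketched in plain Python. This is only a single-process illustration, not Spark's implementation: the function name `broadcast_hash_join`, the list-of-partitions layout, and the dict-per-row format are all assumptions made for the example.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner equi-join: build one hash table from the small side and probe
    it from every partition of the large side."""
    # Build side. In Spark this table would be collected to the driver and
    # then broadcast to every executor.
    build = defaultdict(list)
    for row in small_rows:
        build[row[key]].append(row)

    joined = []
    # Probe side. Each partition is processed where it already lives, so the
    # large table is never shuffled.
    for partition in large_partitions:
        for row in partition:
            for match in build.get(row[key], []):
                joined.append({**match, **row})
    return joined


fact = [
    [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}],  # partition 0
    [{"id": 1, "amt": 30}, {"id": 3, "amt": 40}],  # partition 1
]
dim = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
result = broadcast_hash_join(fact, dim, "id")
```

The whole small side has to fit in memory on every worker, which is the reason the strategy is gated by a size threshold.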

Shuffle hash join

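Supports only equi-joins; the join keys do not need to be sortable.
Parameter: spark.sql.join.preferSortMergeJoin, default true (without a hint, shuffle hash join is picked only when this is set to false)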
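
A rough single-process sketch of the shuffle hash join idea: both sides are partitioned by a hash of the join key so that matching keys land in the same partition, and each partition pair is then joined with a local hash table. The names `shuffle` and `shuffle_hash_join`, the default of 4 partitions, and the choice of the right side as the build side are assumptions for illustration only.

```python
from collections import defaultdict


def shuffle(rows, key, num_partitions):
    """Route each row to a partition by hashing its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def shuffle_hash_join(left_rows, right_rows, key, num_partitions=4):
    """Inner equi-join; the right side is assumed to be the smaller build side."""
    left_parts = shuffle(left_rows, key, num_partitions)
    right_parts = shuffle(right_rows, key, num_partitions)

    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        # Only this partition's slice of the build side is hashed, so no
        # single worker needs the whole table in memory.
        build = defaultdict(list)
        for row in right_part:
            build[row[key]].append(row)
        for row in left_part:
            for match in build.get(row[key], []):
                joined.append({**row, **match})
    return joined


left = [{"k": i % 5, "v": i} for i in range(20)]
right = [{"k": k, "w": k * 100} for k in range(3)]
result = shuffle_hash_join(left, right, "k")
```

No sorting happens anywhere, which matches the statement that the join keys only need to be hashable, not sortable.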

Shuffle sort merge join

This is Spark's default join mechanism. It supports only equi-joins, and the join keys must be sortable.
Parameter: spark.sql.join.preferSortMergeJoin, default true
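
The merge phase can be sketched as below. This simplified version works on one pair of in-memory lists and leaves out the shuffle that Spark performs first to co-locate matching keys; the function name `sort_merge_join` and the dict-per-row layout are assumptions for the example.

```python
def sort_merge_join(left_rows, right_rows, key):
    """Inner equi-join: sort both sides on the key, then merge with two cursors."""
    left = sorted(left_rows, key=lambda r: r[key])
    right = sorted(right_rows, key=lambda r: r[key])

    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            while i < len(left) and left[i][key] == lk:
                for match in right[j:j_end]:
                    joined.append({**left[i], **match})
                i += 1
            j = j_end
    return joined


left = [{"k": 3, "a": "x"}, {"k": 1, "a": "y"}, {"k": 3, "a": "z"}, {"k": 2, "a": "q"}]
right = [{"k": 3, "b": 1}, {"k": 3, "b": 2}, {"k": 1, "b": 3}, {"k": 4, "b": 4}]
result = sort_merge_join(left, right, "k")
```

The `<` and `>` comparisons on the key are the reason this strategy requires sortable join keys, while neither side has to fit in a hash table.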

So how does Spark choose a join strategy?

The source code describes it as follows:

If it is an equi-join, we first look at the join hints w.r.t. the following order:
  1. broadcast hint: pick broadcast hash join if the join type is supported. If both sides
     have the broadcast hints, choose the smaller side (based on stats) to broadcast.
  2. sort merge hint: pick sort merge join if join keys are sortable.
  3. shuffle hash hint: We pick shuffle hash join if the join type is supported. If both
     sides have the shuffle hash hints, choose the smaller side (based on stats) as the
     build side.
  4. shuffle replicate NL hint: pick cartesian product if join type is inner like.

If there is no hint or the hints are not applicable, we follow these rules one by one:
  1. Pick broadcast hash join if one side is small enough to broadcast, and the join type
     is supported. If both sides are small, choose the smaller side (based on stats)
     to broadcast.
  2. Pick shuffle hash join if one side is small enough to build local hash map, and is
     much smaller than the other side, and `spark.sql.join.preferSortMergeJoin` is false.
  3. Pick sort merge join if the join keys are sortable.
  4. Pick cartesian product if join type is inner like.
  5. Pick broadcast nested loop join as the final solution. It may OOM but we don't have
     other choice.
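
The five no-hint rules for equi-joins can be read as a simple decision function. The sketch below is a simplified paraphrase of the quoted comment, not Spark's actual code. The function name, the `local_hash_map_limit` and `much_smaller_factor` parameters and their default values are hypothetical stand-ins for "small enough to build local hash map" and "much smaller than the other side". It also ignores the build-side restrictions that apply to outer joins.

```python
MB = 1024 * 1024
INNER_LIKE = {"inner", "cross"}


def pick_equi_join_strategy(join_type, left_bytes, right_bytes,
                            keys_sortable=True,
                            broadcast_threshold=10 * MB,
                            prefer_sort_merge_join=True,
                            local_hash_map_limit=2000 * MB,  # hypothetical value
                            much_smaller_factor=3):          # hypothetical value
    small, large = sorted((left_bytes, right_bytes))
    # Per the quoted comment, both hash joins support every join type
    # except full outer joins.
    hash_join_supported = join_type != "full_outer"

    # Rule 1: one side is small enough to broadcast.
    if hash_join_supported and small <= broadcast_threshold:
        return "broadcast_hash_join"
    # Rule 2: a local hash map is feasible, one side is much smaller, and
    # sort merge join is not preferred.
    if (hash_join_supported and not prefer_sort_merge_join
            and small <= local_hash_map_limit
            and small * much_smaller_factor <= large):
        return "shuffle_hash_join"
    # Rule 3: sortable keys.
    if keys_sortable:
        return "sort_merge_join"
    # Rule 4: inner-like joins can fall back to a cartesian product.
    if join_type in INNER_LIKE:
        return "cartesian_product"
    # Rule 5: the final fallback.
    return "broadcast_nested_loop_join"


choice = pick_equi_join_strategy("inner", 50 * MB, 500 * MB)
```

With the defaults, a 50 MB side is too large to broadcast and `prefer_sort_merge_join` is true, so the example falls through to sort merge join.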