Join strategies as described in the Spark source code
/**
* Select the proper physical plan for join based on join strategy hints, the availability of
* equi-join keys and the sizes of joining relations. Below are the existing join strategies,
* their characteristics and their limitations.
*
* - Broadcast hash join (BHJ):
* Only supported for equi-joins, while the join keys do not need to be sortable.
* Supported for all join types except full outer joins.
* BHJ usually performs faster than the other join algorithms when the broadcast side is
* small. However, broadcasting tables is a network-intensive operation and it could cause
* OOM or perform badly in some cases, especially when the build/broadcast side is big.
*
* - Shuffle hash join:
* Only supported for equi-joins, while the join keys do not need to be sortable.
* Supported for all join types except full outer joins.
*
* - Shuffle sort merge join (SMJ):
* Only supported for equi-joins and the join keys have to be sortable.
* Supported for all join types.
*
* - Broadcast nested loop join (BNLJ):
* Supports both equi-joins and non-equi-joins.
* Supports all the join types, but the implementation is optimized for:
* 1) broadcasting the left side in a right outer join;
* 2) broadcasting the right side in a left outer, left semi, left anti or existence join;
* 3) broadcasting either side in an inner-like join.
* For other cases, we need to scan the data multiple times, which can be rather slow.
*
* - Shuffle-and-replicate nested loop join (a.k.a. cartesian product join):
* Supports both equi-joins and non-equi-joins.
* Supports only inner like joins.
*/
The five join strategies are:
- Broadcast hash join (BHJ)
- Shuffle hash join
- Shuffle sort merge join (SMJ)
- Broadcast nested loop join (BNLJ)
- Shuffle-and-replicate nested loop join (a.k.a. cartesian product join)
Broadcast hash join
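Also known as a map-side join. It supports only equi-joins and is typically used to join a fact table with a dimension table. The dimension table is usually small, so broadcasting it lets every executor join locally without shuffling the large side. Note: before a broadcast join, Spark first sends the broadcast side's data from the executors to the driver, and the driver then broadcasts it back to all executors. If the data to broadcast is large, the driver can run out of memory (OOM).
Parameter: spark.sql.autoBroadcastJoinThreshold, default 10 MB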
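The build-then-probe idea behind a broadcast hash join can be sketched in plain Python. This is a minimal single-process illustration for an inner join, not Spark's implementation: the function name and the sample rows are made up for the example, and the "broadcast" step is only simulated by building one in-memory hash table from the small side.

```python
from collections import defaultdict

def broadcast_hash_join(fact_rows, dim_rows, key):
    """Inner equi-join: hash the small side, then probe it with the large side."""
    # Build phase: in Spark this table would be shipped to every executor.
    build = defaultdict(list)
    for row in dim_rows:
        build[row[key]].append(row)
    # Probe phase: stream the large side; no shuffle and no sort is needed.
    joined = []
    for row in fact_rows:
        for match in build.get(row[key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical sample data: a larger fact table and a small dimension table.
orders = [
    {"order_id": 1, "product_id": 10},
    {"order_id": 2, "product_id": 20},
    {"order_id": 3, "product_id": 99},  # no matching product, dropped by the inner join
]
products = [
    {"product_id": 10, "name": "pen"},
    {"product_id": 20, "name": "ink"},
]
result = broadcast_hash_join(orders, products, "product_id")
```

Because the keys are only hashed and compared for equality, they never need to be sortable, which matches the description in the source comment.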
Shuffle hash join
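Supports only equi-joins; the join keys do not need to be sortable.
Parameter: spark.sql.join.preferSortMergeJoin, default true. Without a hint, shuffle hash join is considered only when this is set to false.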
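A shuffle hash join differs from the broadcast variant in that both sides are first partitioned by the join key, and a hash table is then built per partition. The sketch below simulates this in a single process for an inner join; the helper names, the partition count, and the choice of the right side as the build side are assumptions made for illustration.

```python
from collections import defaultdict

def shuffle(rows, key, num_partitions):
    """Route each row to a partition by the hash of its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

def shuffle_hash_join(left, right, key, num_partitions=4):
    """Inner equi-join; `right` is the build side in every partition."""
    joined = []
    left_parts = shuffle(left, key, num_partitions)
    right_parts = shuffle(right, key, num_partitions)
    # Rows with equal keys land in the same partition index on both sides.
    for left_part, right_part in zip(left_parts, right_parts):
        table = defaultdict(list)
        for row in right_part:      # build a local hash map per partition
            table[row[key]].append(row)
        for row in left_part:       # probe with the other side
            for match in table.get(row[key], []):
                joined.append({**row, **match})
    return joined

# Hypothetical sample data.
left_rows = [{"k": 1, "l": "a"}, {"k": 2, "l": "b"}, {"k": 5, "l": "c"}]
right_rows = [{"k": 1, "r": "x"}, {"k": 5, "r": "y"}, {"k": 7, "r": "z"}]
pairs = sorted(shuffle_hash_join(left_rows, right_rows, "k"), key=lambda r: r["k"])
```

Each partition only needs a hash map for its own slice of the build side, which is why the build side has to be small enough per partition rather than small enough to broadcast whole.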
Shuffle sort merge join
This is the join mechanism Spark prefers by default. It supports only equi-joins, and the join keys must be sortable.
Parameter: spark.sql.join.preferSortMergeJoin, default true
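The merge step that gives sort merge join its name can be sketched as follows. This is a simplified single-process inner join, assuming both inputs fit in memory; in Spark the sorting happens per partition after a shuffle. The function name and sample rows are illustrative only.

```python
def sort_merge_join(left, right, key):
    """Inner equi-join: sort both sides on the key, then merge them in one pass."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row carrying this key with the whole run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    joined.append({**left[i], **r})
                i += 1
            j = j_end
    return joined

# Hypothetical sample data with duplicate keys on both sides.
left_rows = [{"k": 2, "l": "b"}, {"k": 1, "l": "a"}, {"k": 2, "l": "c"}]
right_rows = [{"k": 2, "r": "x"}, {"k": 3, "r": "y"}, {"k": 2, "r": "z"}]
merged = sort_merge_join(left_rows, right_rows, "k")
```

The `<` and `>` comparisons on the keys are the reason this strategy requires sortable join keys, while the two hash-based strategies do not.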
So how does Spark choose a join strategy?
The source code describes it as follows:
If it is an equi-join, we first look at the join hints w.r.t. the following order:
1. broadcast hint: pick broadcast hash join if the join type is supported. If both sides
have the broadcast hints, choose the smaller side (based on stats) to broadcast.
2. sort merge hint: pick sort merge join if join keys are sortable.
3. shuffle hash hint: We pick shuffle hash join if the join type is supported. If both
sides have the shuffle hash hints, choose the smaller side (based on stats) as the
build side.
4. shuffle replicate NL hint: pick cartesian product if join type is inner like.
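The hint precedence above can be modelled as a small decision function. This is a simplified sketch rather than Spark's code: the function name and the join-type strings are hypothetical, the hint names follow Spark SQL's `BROADCAST`, `MERGE`, `SHUFFLE_HASH` and `SHUFFLE_REPLICATE_NL` hints, and "join type is supported" is reduced to "not a full outer join", ignoring which side carries the hint.

```python
def pick_by_hint(hints, join_type, keys_sortable):
    """Return the strategy selected by hints for an equi-join, or None if none applies."""
    # Simplification: hash joins are treated as supported for everything but full outer.
    hash_join_ok = join_type != "full_outer"
    inner_like = join_type in ("inner", "cross")
    # The checks run in the documented priority order, not in the order hints were given.
    if "BROADCAST" in hints and hash_join_ok:
        return "broadcast hash join"
    if "MERGE" in hints and keys_sortable:
        return "sort merge join"
    if "SHUFFLE_HASH" in hints and hash_join_ok:
        return "shuffle hash join"
    if "SHUFFLE_REPLICATE_NL" in hints and inner_like:
        return "cartesian product"
    return None  # fall through to the size-based rules

# A broadcast hint wins over a merge hint when both are present.
choice = pick_by_hint({"MERGE", "BROADCAST"}, "inner", keys_sortable=True)
```

Returning `None` corresponds to the case where the hints are not applicable and the size-based rules take over.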
If there is no hint or the hints are not applicable, we follow these rules one by one:
1. Pick broadcast hash join if one side is small enough to broadcast, and the join type
is supported. If both sides are small, choose the smaller side (based on stats)
to broadcast.
2. Pick shuffle hash join if one side is small enough to build local hash map, and is
much smaller than the other side, and `spark.sql.join.preferSortMergeJoin` is false.
3. Pick sort merge join if the join keys are sortable.
4. Pick cartesian product if join type is inner like.
5. Pick broadcast nested loop join as the final solution. It may OOM but we don't have
other choice.
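The five fallback rules can be sketched the same way. Everything below is an approximation under stated assumptions: the function and parameter names are hypothetical, sizes are in bytes, the defaults mirror `spark.sql.autoBroadcastJoinThreshold` (10 MB) and `spark.sql.shuffle.partitions` (200), and the tests for "small enough to build local hash map" and "much smaller" (a factor of 3) are assumed stand-ins for Spark's internal checks.

```python
MB = 1024 * 1024

def pick_without_hint(left_size, right_size, join_type, keys_sortable,
                      broadcast_threshold=10 * MB,
                      shuffle_partitions=200,
                      prefer_sort_merge=True):
    """Apply the size-based rules for an equi-join, in order."""
    hash_join_ok = join_type != "full_outer"      # simplification of "join type is supported"
    inner_like = join_type in ("inner", "cross")
    small, large = sorted((left_size, right_size))
    # Rule 1: one side is small enough to broadcast.
    if hash_join_ok and small <= broadcast_threshold:
        return "broadcast hash join"
    # Rule 2 (assumed thresholds): the small side fits a per-partition hash map
    # and is at least 3x smaller than the other side.
    fits_local_map = small < broadcast_threshold * shuffle_partitions
    much_smaller = small * 3 <= large
    if hash_join_ok and not prefer_sort_merge and fits_local_map and much_smaller:
        return "shuffle hash join"
    # Rule 3: sortable keys allow sort merge join.
    if keys_sortable:
        return "sort merge join"
    # Rule 4: inner-like joins can fall back to a cartesian product.
    if inner_like:
        return "cartesian product"
    # Rule 5: last resort.
    return "broadcast nested loop join"

GB = 1024 * MB
default_choice = pick_without_hint(1 * GB, 10 * GB, "inner", keys_sortable=True)
```

With the default `prefer_sort_merge=True`, rule 2 never fires, so two large tables with sortable keys end up in a sort merge join.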