倾斜join优化 Skewed Join

 

翻译:倾斜join优化设计 https://issues.apache.org/jira/browse/SPARK-29544

1.背景

数据倾斜是数据表在集群中分区之间分布不均匀导致的。数据倾斜会严重降低查询性能,特别实在join的场景下。在集群中,大表之间join需要shuffling并且数据倾斜会导致任务极端不平衡的运行。有三个主要的方法处理skew join:

1.增加任务的并行数,使用spark.sql.shuffle.partition参数,让数据分布均匀

2.使用广播hash join threshold 方法,从归并排序join转变成更快的广播hash join ,另外消除shuffle带来的skew join。

3.增加前缀到数据倾斜的keys上让相同的key 变得不一样,调整成适当的数据分布。

以上三个方式与很多的局限性,所有都需要人工处理。

2.goals

新的自适应框架基于运行时统计处理skew join 问题。

3.idea思想

3.1 判断倾斜分区的条件

通过运行时map统计表数据量大小和行数来决定是否有倾斜join分区。成为倾斜分区的条件是 data size>数据量大小在中位数分区的data size *F(倾斜参数) 并且  data size >S (预设置的倾斜参数data size)  或者  行数> 行数在中位数的分区*F 并且 行数>R (行数阈值)。

3.2 优化:

如上图表A join 表B,且表B的partition 0 分区是数据倾斜的。当表A的partition 0分区和表B的partition 0分区join时,假设b表的partition 0分区读数据需要Map0, Map1…Map8的容量。“OptimizeSkewedJoin” 规则将拆分A表的partition 0分区复制三份  和 创建三个任务单独的读取b表的partition 0分区 (Map0 to Map 2, Map3 to Map5, Map6 to Map8) 的三个分区 相互 join ,最后表A的partition 0分区 join 表B的partition 0分区的结果是上述三个子join的union。这个方法会以读取N次表A的0分区的代价增加额外的开销,但是处理skewjoin可能会超过这种开销。


4.实现:

对每个分区统计行数 Collect the row count statistics for each partition:

和统计每个分区的数据大小类似,首先需要计算行数信息并封装到map结构中。然后在executors中从map结构中得到行数信息,计算行数信息可能会影响性能。

 

4.1 优化倾斜分区:

1.基于“Skewed Partition Condition”倾斜分区判断条件描述,检查是否有分区发生倾斜。

2.假设倾斜分区是mapper的数量是N个。将等效的划分N个mapper成M份,M是min (S,min(F,M)).S是预设置可变化的可以代替划分成M份的参数,F是max(skewedPatitionSize / medianSize of all partitions, skededPartitionRowCount / medianRowCount of all partitions)。然后创建一个新的 “SkewedShuffledRowRDD”对每个split分割都有相关的索引范围。最后将union所有M个子join作为join的结果。

3.在创建一个新的“SkewedShuffledRowRDD”之后,还需要定义新的API读取从处理倾斜mapper之后输出的map。

 

Skewed Join Optimization Design Doc.docx

97.99KB

原文:

Skewed Join Optimization Design Doc

Background

Data skew is a condition in which a table’s data is unevenly distributed among partitions in the cluster. Data skew can severely downgrade performance of queries, especially those with joins. Joins between big tables require shuffling data and the skew can lead to an extreme imbalance of work in the cluster. And currently there are mainly 3 approaches to handle skew join:

1.     Increase the parallelism number of “spark.sql.shuffle.partitions” to make the data distribution more evenly;

2.     Increase the broadcast hash join threshold to change the sort merge join to broadcast hash join as far as possible and then eliminate the skew join case brought by shuffle;

3.     Add prefix to the skewed keys to make the same key be different and then adjust the data distribution.

All the above 3 methods have lots of limitations and all need manual processing.

Goals

This goal is to add an optimization to auto handling skew join based on the runtime statistics with the new adaptive execution framework of SPARK-23128.

Idea

Skewed Partition Condition:

The skewed join partition is determined by the data size and row counts of the runtime map statistics. The partition is considered skewed if its size is > median partition size * F and > S or its row count > median partition row count * F and > R. F is the skew factor, S and R are the pre-configured skew size and row count threshold.

Optimization:

 

 

This above picture shows that table A join with table B and the partition 0 of table B is skewed. When partition 0 of table A join with partition 0 of table B, we assume the partition 0 of table B need reading the output from Map0, Map1…Map8. Here the “OptimizeSkewedJoin” rule will dive the partition 0 of table B into 3 splits and then create 3 tasks separately reading the partial partition 0 of table B (Map0 to Map 2, Map3 to Map5, Map6 to Map8) to join with partition 0 of table B.  Finally we will union the above 3 subjoin as the result of partition 0 of table A join partition 0 of table B. This approach will introduce the additional cost by reading N times about the partition 0 of table A. However the benefit of handing the skew join may be more than the cost.

Implement

Collect the row count statistics for each partition:

Similar with the approach of collecting data size for each partition, we need collect the row count info and wrap it in MapStatus. Then we can get the row count info from the returned MapStatus in executors side. Here collecting the row count info may affect the performance. The following is the updated files list.

 

Optimize Skewed Partition

1.    Check whether the partition is skewed based on the described in “Skewed Partition Condition”;

2.    Assuming the mappers number of the skewed partition is N. We will equally divide the N mappers to M splits. Here M is calculated by Math.min(S, Math.min(F, M)). S is a per-configured variable to represent the number of splits. F is the calculated by Math.max(skewedPatitionSize / medianSize of all partitions, skededPartitionRowCount / medianRowCount of all partitions). And then we created a new “SkewedShuffledRowRDD” with related rang map index for every split.  At last we will union all the M sub-join as the final join result.

3.    After created new “SkewedShuffledRowRDD”, we also need define the new reader API to read the map output from specified mappers.

def getReaderForSpecifiedMapper[K, C](

      handle: ShuffleHandle,

      startPartition: Int,

      endPartition: Int,

                                   startMapId:Int,

                                   endMapId:Int,

      context: TaskContext,

      metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]

 

 

 

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
判断两张大表是否存在空键的方法是通过使用Hive中的IS NULL或IS NOT NULL语句来筛选出空键的记录。具体步骤如下: 1. 使用IS NULL语句筛选出空键的记录: ```sql SELECT * FROM table_name WHERE key_column IS NULL; ``` 2. 使用IS NOT NULL语句筛选出非空键的记录: ```sql SELECT * FROM table_name WHERE key_column IS NOT NULL;``` SMBjoin和skew join都是Hive中的join方式,但它们有一些不同之处。 SMBjoin(Sort-Merge-Bucket Join)是一种优化join算法,它利用了Hive中的bucketing机制。当两个表都进行了bucketing,并且bucket数量相同且顺序一致时,可以使用SMBjoin来加速join操作。 Skew join是一种处理数据倾斜join算法。当一个表中的某个键的值非常大,而其他键的值相对较小时,会导致join操作变得非常慢。Skew join通过将数据倾斜的键单独处理,将其拆分成多个bucket,然后进行join操作,从而提高性能。 Inner join、left join等是join操作的类型,而mapjoin和reducejoin是Hive中具体的join算法。 在Hive中,可以通过设置参数来启用mapjoin。具体步骤如下: 1. 设置hive.auto.convert.join参数为true,表示开启自动转换join操作。 ```sql SET hive.auto.convert.join=true; ``` 2. 设置hive.mapjoin.smalltable.filesize参数,指定小表的大小阈值。当小表的大小小于该阈值时,会使用mapjoin算法。 ```sql SET hive.mapjoin.smalltable.filesize=100000000; -- 设置小表的大小阈值为100MB ``` 请注意,启用mapjoin需要满足以下条件: - 小表的大小不能超过hive.mapjoin.smalltable.filesize参数指定的阈值。 - 小表和大表都需要进行了bucketing,并且bucket数量相同且顺序一致。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值