Spark Tuning: Adaptive Query Execution (AQE) in Spark SQL 3.0, with Parameter Explanations

高达一号

Last modified: 2023-08-24 18:25:48

First published: 2020-12-19 17:23:48

Original link: https://blog.csdn.net/jiangshouzhuang/article/details/104453937

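Reference article: Spark SQL Adaptive Execution Optimization Engine (DataFlow范式's blog, CSDN)

This article covers the adaptive execution engine in Spark SQL (Spark Adaptive Execution).

The parameter settings are based on the community edition: Performance Tuning - Spark 3.4.0 Documentation

In earlier articles I introduced Flink SQL. The Flink community is actively iterating on Flink SQL features and performance; in particular, the Flink 1.10.0 release strengthened streaming SQL and also brought mature batch processing. Even so, Spark SQL still has the edge in SQL feature completeness and production track record. Which of the two is faster for batch SQL is something I still need to test myself.

On very large clusters and massive datasets, however, Spark SQL still runs into stability and performance challenges. To address them, the Spark community introduced an adaptive execution engine that adjusts task parallelism, optimizes join strategies, and handles data skew dynamically at runtime, using runtime statistics to choose a better execution plan. This write-up follows Haifeng Chen's talk "Spark Adaptive Execution Unleash the Power of Spark SQL" and adds notes from practice.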

Preface: Challenges in Spark SQL

Let us first look at some challenges Spark SQL meets in production, and use them to see which problems adaptive query execution in Spark 3 can solve.

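Challenge 1: Parallelism

In day-to-day Spark SQL development, the number of shuffle partitions is set through spark.sql.shuffle.partitions, whose default is 200. In other words, the shuffle partition count has to be tuned by hand to get reasonable performance.

A single value cannot be optimal for every job, because the amount of data each task handles and the shuffle strategy differ from one job to the next.

Too many or too few shuffle partitions both cause problems:

Too many partitions

Many small tasks have to be scheduled, which adds task-scheduling and resource-scheduling overhead. If the stage writes its output to storage, it also issues many small I/O operations and leaves a large number of small files on HDFS.

Too few partitions

Each task may process a large amount of data, so processing is slow, the cluster's parallelism is underused, and tasks can even run out of memory (OOM).

Without adaptive execution, the shuffle partition count cannot adjust to each job. It has to be tuned job by job over several rounds, and because data volume often grows day by day, a value tuned earlier may stop fitting later runs.

Ideally, Spark would set a suitable shuffle partition count during execution, based on the actual data volume.

To summarize the parallelism challenge:

- Data size changes over time and is hard to estimate accurately
- A single partition setting cannot give the best performance for every query and stage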
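
To make the point concrete, here is a small arithmetic sketch of what a fixed partition count means as the data grows. The three data volumes are hypothetical; only the default of 200 comes from the text.

```python
def avg_partition_mb(total_mb: float, num_partitions: int = 200) -> float:
    """Average shuffle partition size when the partition count is fixed."""
    return total_mb / num_partitions


# The same job at three hypothetical shuffle volumes: 10 GB, 100 GB, 1000 GB.
for total_gb in (10, 100, 1000):
    size = avg_partition_mb(total_gb * 1024)
    print(f"{total_gb:>5} GB shuffled -> {size:.1f} MB per partition")
```

With 200 partitions, the average partition grows from about 51 MB to about 5 GB across these volumes, which is why a value tuned for one day's data can be far off later.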

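Challenge 2: Choosing a Join Strategy

Spark supports three main join strategies to get good performance across different data sizes:

- Sort Merge Join
- Shuffle Hash Join
- Broadcast Hash Join

These strategies bring challenges of their own:

- The join strategy is chosen from static information, such as table sizes known at planning time
- In complex queries, the sizes of intermediate results vary widely and are hard to estimate

As a result, a job often does not run with the most efficient join strategy.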

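Challenge 3: Data Skew

Data skew means that one partition holds far more data than the others, so its task runs far longer than the rest and drags down the whole SQL query.

MapReduce, Spark, and Flink all suffer from data skew. It comes up frequently in real workloads (for example with join and group by) and is the usual culprit when a job hangs at 99% progress.

Data skew has many causes, for example:

- The source table itself contains skewed data
- Intermediate operations (such as an outer join) can produce skewed data

The problems with data skew, in short:

- It usually cannot be predicted in advance
- A single task drags down the whole job
- It can cause OOM

Common ways to handle data skew in Spark SQL:

1. Increase the number of shuffle partitions

Raising the shuffle partition count spreads the data of an oversized partition across several partitions.

2. Salt the skewed key

More shuffle partitions do not help when the skew comes from a single key with a huge number of rows, because all rows with that key still land in the same partition. Salting the skewed key scatters its rows, after which the shuffle partitions can spread them out.

3. Use Broadcast Hash Join

In some cases a Sort Merge Join can be turned into a Broadcast Hash Join, which avoids the shuffle and the skew it causes. For example, if one of the two joined tables is small, the join can be optimized into a Broadcast Hash Join.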
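
The salting idea in item 2 can be sketched in plain Python. This is a toy model, not Spark code: `partition_of` stands in for the shuffle partitioner (here a CRC32 hash, an assumption made for determinism), and the key names, row counts, and salt count are made up for illustration.

```python
import zlib
from collections import Counter
from typing import List


def partition_of(key: str, num_partitions: int) -> int:
    """Toy stand-in for a hash partitioner (CRC32 keeps the result deterministic)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys: List[str], num_partitions: int) -> List[int]:
    """Number of rows that land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salted(keys: List[str], num_salts: int) -> List[str]:
    """Append a round-robin salt suffix so one hot key becomes num_salts keys."""
    return [f"{k}_{i % num_salts}" for i, k in enumerate(keys)]


# 9000 rows share the hot key "hot"; 1000 rows have distinct keys.
keys = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]

plain = partition_sizes(keys, 8)
with_salt = partition_sizes(salted(keys, 8), 8)

print("largest partition without salt:", max(plain))
print("largest partition with salt:   ", max(with_salt))
```

Without salt, all 9000 hot rows fall into one partition no matter how many partitions exist. In a real join the other side also has to be expanded with every salt value so that matching rows still meet, which is part of why salting is a per-job fix.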

However, each of these is a per-job tuning technique; none of them solves every data skew problem.

----------------------------------------------------------------------------------------------------------------------------

Spark Adaptive Query Execution

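Based on the official documentation for 3.4.0: Performance Tuning - Spark 3.4.0 Documentation

Spark SQL Execution Overview

Briefly: the Parser turns the SQL statement into a syntax tree called the Unresolved Logical Plan. The Analyzer resolves it against the table information in the Catalog to produce the Logical Plan. The Optimizer then applies its optimization rules to produce the Optimized Logical Plan. The Planner converts the optimized logical plan into a Physical Plan according to predefined mapping rules. Finally, the physical plan is executed as RDD computations on the Spark cluster, and the result is returned to the user.

The Idea Behind Adaptive Query Execution

Building on the community's work, Intel's big data team created the Adaptive Execution project. They redesigned Adaptive Execution into a more flexible adaptive framework that targets the main performance problems.

The idea of the Adaptive Execution project is:

When the map tasks of a stage finish at runtime, use the map output size information to adjust parallelism, join strategy, and skew handling accordingly.

The Adaptive Execution Framework

- When an Adaptive Stage executes, it eagerly executes all of its child Adaptive Stages
- Once all child Adaptive Stages have finished, it has all of their map output sizes available for optimization decisions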
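
The two rules above amount to a post-order walk over the stage tree. The following is a minimal sketch under that reading; `AdaptiveStage`, `execute`, and the stage names are hypothetical and do not correspond to Spark classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AdaptiveStage:
    """Hypothetical stage: a name, its map output sizes in MB, and child stages."""
    name: str
    map_output_mb: List[int]
    children: List["AdaptiveStage"] = field(default_factory=list)


def execute(stage: AdaptiveStage, order: List[str]) -> Dict[str, List[int]]:
    """Run all children first, then the stage itself; return the sizes seen so far."""
    stats: Dict[str, List[int]] = {}
    for child in stage.children:
        stats.update(execute(child, order))
    # Every child's map output size is known here, so this is the point
    # where parallelism, join strategy, or skew handling could be adjusted.
    order.append(stage.name)
    stats[stage.name] = stage.map_output_mb
    return stats


scan_a = AdaptiveStage("scan_a", [20, 30, 10])
scan_b = AdaptiveStage("scan_b", [5, 3])
join = AdaptiveStage("join", [40], children=[scan_a, scan_b])

order: List[str] = []
stats = execute(join, order)
print(order)
```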

Coalescing Post Shuffle Partitions

From the official documentation:

This feature coalesces the post shuffle partitions based on the map output statistics when both spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled configurations are true. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via spark.sql.adaptive.coalescePartitions.initialPartitionNum configuration.

Using the map output sizes, parallelism can be adjusted at runtime.

Suppose the initial number of shuffle partitions is 8 and, after the map stage finishes, partitions 1 to 8 hold 20 MB, 30 MB, 10 MB, 20 MB, 35 MB, 45 MB, 10 MB, and 70 MB. With a target input size of 64 MB per reducer, only 4 reducers are used at runtime: the first handles partitions 1-3 (60 MB), the second partitions 4-5 (55 MB), the third partitions 6-7 (55 MB), and the fourth partition 8 (70 MB). The job runs 4 tasks instead of 8.

Normally one task processes one partition. After this optimization a single task can process several partitions, which keeps the work per task relatively balanced and avoids a large number of tiny partitions.
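
The worked example above can be reproduced with a simple greedy pass over contiguous partitions. This is a simplified sketch for illustration, not Spark's actual coalescing rule; `coalesce_partitions` is a hypothetical helper.

```python
from typing import List


def coalesce_partitions(sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """Greedily group contiguous partitions (1-based ids) up to the target size."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for partition_id, size in enumerate(sizes_mb, start=1):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(partition_id)
        current_size += size
    if current:
        groups.append(current)
    return groups


sizes = [20, 30, 10, 20, 35, 45, 10, 70]
groups = coalesce_partitions(sizes, target_mb=64)
print(groups)  # [[1, 2, 3], [4, 5], [6, 7], [8]]
```

Partition 8 exceeds the target on its own, so it stays as a single group; coalescing only merges, it never splits.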

To enable Adaptive Execution (it is on by default since Spark 3.2.0):

spark.sql.adaptive.enabled = true

Configuration:

spark.sql.adaptive.coalescePartitions.enabled
- Default: true (since 3.0.0)
- Description: When true and spark.sql.adaptive.enabled is true, Spark will coalesce contiguous shuffle partitions according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid too many small tasks.
- Explanation: Switch for dynamically coalescing partitions. It defaults to true, but only takes effect when spark.sql.adaptive.enabled is true.

spark.sql.adaptive.advisoryPartitionSizeInBytes
- Default: 64 MB (since 3.0.0)
- Description: The advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partition.
- Explanation: The desired size (data volume) of each partition after coalescing or splitting.

spark.sql.adaptive.coalescePartitions.parallelismFirst
- Default: true (since 3.2.0)
- Description: When true, Spark ignores the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes (default 64MB) when coalescing contiguous shuffle partitions, and only respect the minimum partition size specified by spark.sql.adaptive.coalescePartitions.minPartitionSize (default 1MB), to maximize the parallelism. This is to avoid performance regression when enabling adaptive query execution. It's recommended to set this config to false and respect the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes.
- Explanation: With the default of true, coalescing favors parallelism: the advisory size is ignored and only the minimum partition size is enforced. The documentation recommends setting this to false so that the advisory size is respected.

spark.sql.adaptive.coalescePartitions.minPartitionSize
- Default: 1MB (since 3.2.0)
- Description: The minimum size of shuffle partitions after coalescing. Its value can be at most 20% of spark.sql.adaptive.advisoryPartitionSizeInBytes. This is useful when the target size is ignored during partition coalescing, which is the default case.
- Explanation: Lower bound on the size of a coalesced partition. It matters mainly when parallelismFirst is true and the advisory size is ignored.

spark.sql.adaptive.coalescePartitions.initialPartitionNum
- Default: (none) (since 3.0.0)
- Description: The initial number of shuffle partitions before coalescing. If not set, it equals to spark.sql.shuffle.partitions. This configuration only has an effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.
- Explanation: The initial number of shuffle partitions when the job starts. It can be set fairly large; if unset, it equals spark.sql.shuffle.partitions (default 200).
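
As a hedged illustration, the coalescing settings above can be collected in one place and rendered as `--conf key=value` arguments for spark-submit. The values below, in particular the initial partition number of 1000, are illustrative assumptions rather than recommendations, and `to_submit_args` is a hypothetical helper.

```python
from typing import Dict, List

# Illustrative values only; the keys are the configuration names listed above.
coalesce_conf: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Respect the advisory size instead of maximizing parallelism.
    "spark.sql.adaptive.coalescePartitions.parallelismFirst": "false",
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "64MB",
    "spark.sql.adaptive.coalescePartitions.minPartitionSize": "1MB",
    # Assumed value: start with many partitions and let AQE coalesce them.
    "spark.sql.adaptive.coalescePartitions.initialPartitionNum": "1000",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render settings as spark-submit '--conf key=value' arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(coalesce_conf)))
```

Inside a running PySpark session the same pairs could be applied one at a time with `spark.conf.set(key, value)`.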

----------------------------------------------------------------------------------------------------

Splitting Skewed Shuffle Partitions (since 3.2.0)

spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled
- Default: true (since 3.2.0)
- Description: When true and spark.sql.adaptive.enabled is true, Spark will optimize the skewed shuffle partitions in RebalancePartitions and split them to smaller ones according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid data skew.

spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor
- Default: 0.2 (since 3.3.0)
- Description: A partition will be merged during splitting if its size is small than this factor multiply spark.sql.adaptive.advisoryPartitionSizeInBytes.
- Explanation: With the defaults, a piece smaller than 0.2 × 64 MB = 12.8 MB is merged during splitting.
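
How the two settings interact can be sketched as follows. This is a simplified model that splits by bytes and folds a too-small tail into the previous piece; real Spark splits along map output boundaries, so treat `split_with_merge` as a hypothetical illustration of the threshold only.

```python
from typing import List


def split_with_merge(size_mb: float,
                     advisory_mb: float = 64.0,
                     small_factor: float = 0.2) -> List[float]:
    """Split a partition into advisory-sized pieces, merging a too-small tail."""
    pieces: List[float] = []
    remaining = size_mb
    while remaining > 0:
        piece = min(advisory_mb, remaining)
        pieces.append(piece)
        remaining -= piece
    # A tail below small_factor * advisory size is merged into its neighbour.
    if len(pieces) > 1 and pieces[-1] < small_factor * advisory_mb:
        tail = pieces.pop()
        pieces[-1] += tail
    return pieces


print(split_with_merge(150.0))  # tail of 22 MB is kept (22 >= 12.8)
print(split_with_merge(135.0))  # tail of 7 MB is merged (7 < 12.8)
```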

-----------------------------------------------------------------------------------------------------------

Join Strategy Optimization

Using the map output sizes, the join strategy can be adjusted at runtime.

-------------------------------------------------------------------------------------------

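Converting sort-merge join to broadcast join

After the shuffle write, compare the output sizes of the two stages. If one side turns out to be smaller than the broadcast threshold, the join can be converted to a Broadcast Hash Join, so the execution plan is adjusted dynamically.

Once the Sort Merge Join has been converted to a Broadcast Hash Join, neither side needs to be sorted, and the shuffle files that were already written can be read locally instead of being fetched over the network (when spark.sql.adaptive.localShuffleReader.enabled is true). This saves network I/O and improves performance considerably.

From the official documentation: AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the adaptive broadcast hash join threshold. This is not as efficient as planning a broadcast hash join in the first place, but it’s better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic(if spark.sql.adaptive.localShuffleReader.enabled is true)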
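
The difference between the static and the runtime decision can be sketched as below. `choose_join` is a hypothetical helper, the table sizes are made up, and the 10 MB threshold is assumed here as the usual default of spark.sql.autoBroadcastJoinThreshold.

```python
MB = 1024 * 1024


def choose_join(left_bytes: int, right_bytes: int,
                threshold_bytes: int = 10 * MB) -> str:
    """Pick a join strategy from the (estimated or measured) size of each side."""
    if threshold_bytes < 0:
        # A threshold of -1 disables broadcasting.
        return "SortMergeJoin"
    if min(left_bytes, right_bytes) < threshold_bytes:
        return "BroadcastHashJoin"
    return "SortMergeJoin"


# Planning time: the filtered side is estimated at 500 MB.
planned = choose_join(2048 * MB, 500 * MB)
# Runtime: the shuffle statistics show the filter left only 8 MB.
replanned = choose_join(2048 * MB, 8 * MB)

print(planned, "->", replanned)
```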

Configuration:

spark.sql.adaptive.autoBroadcastJoinThreshold
- Default: (none) (since 3.2.0)
- Description: Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1, broadcasting can be disabled. The default value is the same as spark.sql.autoBroadcastJoinThreshold. Note that, this config is used only in adaptive framework.

spark.sql.adaptive.localShuffleReader.enabled
- Default: true (since 3.0.0)
- Description: When true and spark.sql.adaptive.enabled is true, Spark tries to use local shuffle reader to read the shuffle data when the shuffle partitioning is not needed, for example, after converting sort-merge join to broadcast-hash join.

-------------------------------------------------------------------------------------------

Converting sort-merge join to shuffled hash join

From the official documentation: AQE converts sort-merge join to shuffled hash join when all post shuffle partitions are smaller than a threshold, the max threshold can see the config spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.

Configuration:

spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold
- Default: 0 (since 3.2.0)
- Description: Configures the maximum size in bytes per partition that can be allowed to build local hash map. If this value is not smaller than spark.sql.adaptive.advisoryPartitionSizeInBytes and all the partition sizes are not larger than this config, join selection prefers to use shuffled hash join instead of sort merge join regardless of the value of spark.sql.join.preferSortMergeJoin.
- Explanation: With the default of 0 the first condition can never hold, so this conversion stays off until the threshold is raised to at least the advisory partition size.
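
The condition in the description can be written as a small predicate. This is a direct transcription of the two stated conditions and nothing more; `prefers_shuffled_hash_join` is a hypothetical name and the partition sizes are made up.

```python
from typing import List

MB = 1024 * 1024


def prefers_shuffled_hash_join(partition_sizes: List[int],
                               local_map_threshold: int,
                               advisory_size: int = 64 * MB) -> bool:
    """True when both documented conditions for shuffled hash join hold."""
    return (local_map_threshold >= advisory_size
            and all(size <= local_map_threshold for size in partition_sizes))


sizes = [30 * MB, 50 * MB, 90 * MB]

print(prefers_shuffled_hash_join(sizes, local_map_threshold=0))         # default
print(prefers_shuffled_hash_join(sizes, local_map_threshold=128 * MB))
print(prefers_shuffled_hash_join(sizes, local_map_threshold=80 * MB))
```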

---------------------------------------------------------------------------------------------------

Skewed Data Handling (applies only to joins originally planned as Sort Merge Join)

- Many small partitions can be handled by coalescing: one task processes the data of several partitions.
- A very large partition is processed by several tasks.

Data skew can severely downgrade the performance of join queries. This feature dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks. It takes effect when both spark.sql.adaptive.enabled and spark.sql.adaptive.skewJoin.enabled configurations are enabled.

With automatic skew handling enabled, Spark detects skewed partitions during execution, processes each of them with several tasks, and then combines the results of those tasks.
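
Why splitting one side and replicating the other gives the same answer can be checked with a toy join in plain Python. The helpers and the data are hypothetical; each chunk stands for one task.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int]


def hash_join(left: List[Row], right: List[Row]) -> List[Tuple[str, int, int]]:
    """Plain inner join of two lists of (key, value) rows."""
    table: Dict[str, List[int]] = {}
    for key, value in right:
        table.setdefault(key, []).append(value)
    return [(k, lv, rv) for k, lv in left for rv in table.get(k, [])]


def skew_join(left: List[Row], right: List[Row],
              num_splits: int) -> List[Tuple[str, int, int]]:
    """Split the skewed left partition into chunks; each chunk joins a full copy of right."""
    result: List[Tuple[str, int, int]] = []
    for i in range(num_splits):
        chunk = left[i::num_splits]             # one chunk per task
        result.extend(hash_join(chunk, right))  # right side is replicated
    return result


left = [("hot", i) for i in range(1000)] + [("cold", -1)]
right = [("hot", 7), ("hot", 8), ("cold", 9)]

assert sorted(skew_join(left, right, 4)) == sorted(hash_join(left, right))
print(len(skew_join(left, right, 4)), "joined rows")
```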

To enable:

spark.sql.adaptive.skewJoin.enabled=true
- Default: true
- Description: When true and spark.sql.adaptive.enabled is true, Spark dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed partitions.
- Explanation: Enables AQE skew join handling; spark.sql.adaptive.enabled must also be set to true.

Other configuration parameters:

spark.sql.adaptive.skewJoin.skewedPartitionFactor
- Default: 5.0 (since 3.0.0)
- Description: A partition is considered as skewed if its size is larger than this factor multiplying the median partition size and also larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes
- Explanation: The skew factor. A partition is treated as skewed if its size exceeds this factor times the median partition size and also exceeds spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes. The default is 5.

spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes
- Default: 256MB (since 3.0.0)
- Description: A partition is considered as skewed if its size in bytes is larger than this threshold and also larger than spark.sql.adaptive.skewJoin.skewedPartitionFactor multiplying the median partition size. Ideally, this config should be set larger than spark.sql.adaptive.advisoryPartitionSizeInBytes.
- Explanation: The per-partition size threshold, 256 MB by default. It should be larger than spark.sql.adaptive.advisoryPartitionSizeInBytes.

spark.sql.adaptive.forceOptimizeSkewedJoin
- Default: false (since 3.3.0)
- Description: When true, force enable OptimizeSkewedJoin, which is an adaptive rule to optimize skewed joins to avoid straggler tasks, even if it introduces extra shuffle.
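
The skew test described by skewedPartitionFactor and skewedPartitionThresholdInBytes can be sketched like this. The partition sizes are made up, and `estimated_splits` is only a rough size-based estimate that assumes pieces of about the advisory size; Spark's actual split count depends on the map output boundaries.

```python
import math
from statistics import median
from typing import List

MB = 1024 * 1024


def find_skewed(sizes: List[int], factor: float = 5.0,
                threshold: int = 256 * MB) -> List[int]:
    """Indexes of partitions larger than factor * median AND larger than threshold."""
    med = median(sizes)
    return [i for i, s in enumerate(sizes) if s > factor * med and s > threshold]


def estimated_splits(size: int, advisory: int = 64 * MB) -> int:
    """Rough number of advisory-sized pieces a skewed partition would need."""
    return math.ceil(size / advisory)


sizes = [40 * MB, 50 * MB, 60 * MB, 45 * MB, 1200 * MB]
skewed = find_skewed(sizes)
print("skewed partitions:", skewed)
print("estimated splits:", [estimated_splits(sizes[i]) for i in skewed])

# 300 MB is above the 256 MB threshold but not above 5 x the 100 MB median.
print(find_skewed([100 * MB, 100 * MB, 100 * MB, 300 * MB]))
```

Both conditions must hold, so a partition that is large in absolute terms is left alone when the other partitions are of similar size.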

Other Parameters

spark.sql.adaptive.optimizer.excludedRules
- Default: (none) (since 3.1.0)
- Description: Configures a list of rules to be disabled in the adaptive optimizer, in which the rules are specified by their rule names and separated by comma. The optimizer will log the rules that have indeed been excluded.

spark.sql.adaptive.customCostEvaluatorClass
- Default: (none) (since 3.2.0)
- Description: The custom cost evaluator class to be used for adaptive execution. If not being set, Spark will use its own SimpleCostEvaluator by default.

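Performance Gains

TPC-DS 100TB

Most queries ran 10% to 50% faster, some improved by more than 50%, and a few by over 200%. In addition, some queries that failed or could not complete without Adaptive Execution all passed with it.

Results shared by Baidu

50% to 200% improvement, mostly from converting sort merge joins to broadcast hash joins.

Results shared by Alibaba

On TPC-DS 1TB, overall performance improved 1.38x, with a maximum speedup of 3x.

Other companies in China

For queries with severe skew caused by outer joins, speedups of more than 10x were reported.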

References

Spark Adaptive Execution Unleash the Power of Spark SQL - Haifeng Chen (Intel)
https://github.com/Intel-bigdata/spark-adaptive
https://issues.apache.org/jira/browse/SPARK-23128
