Fixing the Increase in Small Files After Upgrading from Spark 2.4 to Spark 3.2

lilyjoke

Categories: Spark, Big Data. Tags: spark

Article link: https://blog.csdn.net/lilyjoke/article/details/124242376

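All content is original. This post only records a troubleshooting analysis and contains no business information.

Problem description

A data warehouse colleague reported an issue: after upgrading to Spark 3.2, with the same set of SQL statements and the same configuration, the number of small files written by the final table insert went from a handful to 500. Had distribute by rand() stopped working? :D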

[Screenshot: Spark 3.2]

[Screenshot: Spark 2.4]

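The problem is real (although it has nothing to do with distribute by rand()). In the final insert operation, Spark 2.4 reduced the 500 partitions coming from the upstream job to just a few, whereas Spark 3.2 keeps all 500. That calls for a closer look.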

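Diagnosis and fix

1. distribute by rand()

The business team may not have fully understood how this is used. distribute by rand() is mainly for scattering data, not for limiting the number of partitions. Scattering the data requires a repartition (a shuffle), and it is the repartitioning logic that runs during that shuffle that changed the number of files written by the final insert.

To be clear about what distribute by rand() means: rand() is a function that returns a random value in [0, 1). With distribute by rand(), every row takes the value returned by rand() as an extra column, and the data is repartitioned on that column, which scatters the rows.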
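To make the point concrete, here is a minimal sketch in plain Python (not Spark). It assumes a simplified stand-in for hash partitioning (`hash(key) % num_partitions`), and the function name `distribute_by_rand` and the row and partition counts are made up for illustration. It shows that a random key spreads rows across the partitions but has no say in how many partitions there are; that number comes from the shuffle itself.

```python
import random
from collections import Counter


def distribute_by_rand(num_rows, num_partitions, seed=0):
    """Assign each row to a partition using a random key in [0.0, 1.0).

    Simplified stand-in for a shuffle: Spark uses its own hash function,
    here we just use Python's built-in hash of the float key.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    rows_per_partition = Counter()
    for _ in range(num_rows):
        key = rng.random()  # plays the role of rand()
        rows_per_partition[hash(key) % num_partitions] += 1
    return rows_per_partition


# Hypothetical sizes: 100,000 rows shuffled into 500 partitions.
counts = distribute_by_rand(100_000, 500)
print("non-empty partitions:", len(counts))
print("smallest / largest partition:", min(counts.values()), max(counts.values()))
```

The random key only decides which partition a row lands in. Whether those partitions are later merged into fewer, larger ones is decided elsewhere, which is where the rest of this analysis goes.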

[Screenshot: physical plan of the same SQL on Spark 3.2]

[Screenshot: physical plan of the same SQL on Spark 2.4]

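As the plans show, both the partitioning expression and the upper bound on the number of partitions are identical, so this is not the cause.

2. Follow the evidence: read the logs and the source code

Log context of the final job on Spark 3.2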

04-16 10:10:51 [INFO][org.apache.spark.sql.execution.adaptive.ShufflePartitionsUtil][Driver] - For shuffle(2), advisory target size: 268435456, actual target size 2624526, minimum partition size: 1048576
04-16 10:10:52 [INFO][org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator][Driver] - Code generated in 42.326706 ms
04-16 10:10:52 [INFO][org.apache.spark.SparkContext][Driver] - Starting job: sql at SparkSqlJob.java:78
04-16 10:10:52 [INFO][org.apache.spark.scheduler.DAGScheduler][dag-scheduler-event-loop] - Got job 3 (sql at SparkSqlJob.java:78) with 500 output partitions

Log context of the final job on Spark 2.4

04-13 02:20:50 [INFO][org.apache.spark.sql.execution.exchange.ExchangeCoordinator][Driver] - advisoryTargetPostShuffleInputSize: 268435456, targetPostShuffleInputSize 268435456. 
...
04-13 02:20:50 [INFO][org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter][Driver] - File Output Committer Algorithm version is 2
04-13 02:20:50 [INFO][org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter][Driver] - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
04-13 02:20:50 [INFO][org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol][Driver] - Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
04-13 02:20:50 [INFO][org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator][Driver] - Code generated in 11.517121 ms
04-13 02:20:50 [INFO][org.apache.spark.SparkContext][Driver] - Starting job: sql at SparkSqlJob.java:72
04-13 02:20:50 [INFO][org.apache.spark.scheduler.DAGScheduler][dag-scheduler-event-loop] - Got job 9 (sql at SparkSqlJob.java:72) with 3 output partitions

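The difference is the partition target size. We configured 256 MB (268435456 bytes), and Spark 2.4 uses exactly that value, but the size that actually takes effect on Spark 3.2 is only about 2.5 MB (2624526 bytes). That is the root cause.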
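As a quick sanity check on those numbers, the following sketch parses the Spark 3.2 log message quoted above and converts the three sizes to MiB. The helper names `parse_sizes` and `to_mib` are my own, and the regular expression is written only against the one message format shown in this post.

```python
import re

# The ShufflePartitionsUtil message from the Spark 3.2 driver log above.
LOG_LINE = (
    "For shuffle(2), advisory target size: 268435456, "
    "actual target size 2624526, minimum partition size: 1048576"
)


def parse_sizes(line):
    """Extract the three byte counts from the log message."""
    pattern = (
        r"advisory target size: (\d+), actual target size (\d+), "
        r"minimum partition size: (\d+)"
    )
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not a ShufflePartitionsUtil target-size message")
    advisory, actual, minimum = (int(group) for group in match.groups())
    return {"advisory": advisory, "actual": actual, "minimum": minimum}


def to_mib(num_bytes):
    """Convert bytes to MiB (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


sizes = parse_sizes(LOG_LINE)
for name, value in sizes.items():
    print(f"{name}: {value} bytes = {to_mib(value):.2f} MiB")
```

The advisory size is the configured 256 MiB, while the effective target is roughly a hundredth of that.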

Partition coalescing logic in Spark 2.4

org.apache.spark.sql.execution.exchange.ExchangeCoordinator

    // If minNumPostShufflePartitions is defined, it is possible that we need to use a
    // value less than advisoryTargetPostShuffleInputSize as the target input size of
    // a post shuffle task.
    val totalPostShuffleInputSize = mapOutputStatistics.map(_.bytesByPartitionId.sum).sum
    // The max at here is to make sure that when we have an empty table, we
    // only have a single post-shuffle partition.
    // There is no particular reason that we pick 16. We just need a number to
    // prevent maxPostShuffleInputSize from being set to 0.
    val maxPostShuffleInputSize =
      math.max(math.ceil(totalPostShuffleInputSize / minNumPostShufflePartitions.toDouble).toLong, 16)
    val targetPostShuffleInputSize =
      math.min(maxPostShuffleInputSize, advisoryTargetPostShuffleInputSize)
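The arithmetic in this excerpt can be restated as a small Python model. This is only a sketch of the lines quoted above, not the Spark implementation: the function name `target_size_spark24` is mine, and the total shuffle size and the minimum partition count used in the example are hypothetical, since the log does not show them.

```python
import math


def target_size_spark24(total_bytes, min_num_partitions, advisory_bytes):
    """Model of the quoted Spark 2.4 excerpt.

    The target is capped so that at least `min_num_partitions` partitions
    remain, with a floor of 16 bytes so an empty shuffle never yields 0.
    """
    max_size = max(math.ceil(total_bytes / min_num_partitions), 16)
    return min(max_size, advisory_bytes)


ADVISORY = 268435456   # 256 MiB, the configured value seen in the log
TOTAL = 700_000_000    # hypothetical total shuffle size in bytes

# Hypothetical minimum of 1 partition: the advisory size wins.
print(target_size_spark24(TOTAL, 1, ADVISORY))
# Empty shuffle: the floor of 16 bytes applies.
print(target_size_spark24(0, 1, ADVISORY))
```

With a small minimum partition count the result is simply the advisory size, which is consistent with the Spark 2.4 log line where both values are 268435456.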

Partition coalescing logic in Spark 3.2

org.apache.spark.sql.execution.adaptive.ShufflePartitionsUtil

    // If `minNumPartitions` is very large, it is possible that we need to use a value less than
    // `advisoryTargetSize` as the target size of a coalesced task.
    val totalPostShuffleInputSize = mapOutputStatistics.flatMap(_.map(_.bytesByPartitionId.sum)).sum
    val maxTargetSize = math.ceil(totalPostShuffleInputSize / minNumPartitions.toDouble).toLong
    // It's meaningless to make target size smaller than minPartitionSize.
    val targetSize = maxTargetSize.min(advisoryTargetSize).max(minPartitionSize) // this is the key line

    val shuffleIds = mapOutputStatistics.flatMap(_.map(_.shuffleId)).mkString(", ")
    logInfo(s"For shuffle($shuffleIds), advisory target size: $advisoryTargetSize, " +
      s"actual target size $targetSize, minimum partition size: $minPartitionSize")

    // If `inputPartitionSpecs` are all empty, it means skew join optimization is not applied.
    if (inputPartitionSpecs.forall(_.isEmpty)) {
      coalescePartitionsWithoutSkew(
        mapOutputStatistics, targetSize, minPartitionSize)
    } else {
      coalescePartitionsWithSkew(
        mapOutputStatistics, inputPartitionSpecs, targetSize, minPartitionSize)
    }
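The key line, `maxTargetSize.min(advisoryTargetSize).max(minPartitionSize)`, is easier to reason about with numbers plugged in. The sketch below models just that expression in Python. The function name `target_size_spark32` is mine, and the total shuffle size and the minimum partition counts are hypothetical values chosen so the arithmetic is easy to follow; only the advisory and minimum partition sizes come from the log.

```python
import math


def target_size_spark32(total_bytes, min_num_partitions,
                        advisory_bytes, min_partition_bytes):
    """Model of the quoted Spark 3.2 excerpt (target size only)."""
    max_target = math.ceil(total_bytes / min_num_partitions)
    # First cap by the advisory size, then never go below the minimum size.
    return max(min(max_target, advisory_bytes), min_partition_bytes)


ADVISORY = 268435456       # 256 MiB, from the log
MIN_PARTITION = 1048576    # 1 MiB, from the log
TOTAL = 1_000_000_000      # hypothetical total shuffle size in bytes

# A large minimum partition count drags the target far below the advisory size.
print(target_size_spark32(TOTAL, 400, ADVISORY, MIN_PARTITION))
# A minimum partition count of 1 lets the advisory size take effect.
print(target_size_spark32(TOTAL, 1, ADVISORY, MIN_PARTITION))
# An extreme minimum partition count is stopped by the minimum partition size.
print(target_size_spark32(TOTAL, 10_000, ADVISORY, MIN_PARTITION))
```

The first case has the same shape as the production log: the advisory size is 256 MiB, yet the effective target ends up at a few megabytes because the shuffle total is divided by a large minimum partition count.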

org.apache.spark.sql.execution.adaptive.CoalesceShufflePartitions

    // Ideally, this rule should simply coalesce partitions w.r.t. the target size specified by
    // ADVISORY_PARTITION_SIZE_IN_BYTES (default 64MB). To avoid perf regression in AQE, this
    // rule by default tries to maximize the parallelism and set the target size to
    // `total shuffle size / Spark default parallelism`. In case the `Spark default parallelism`
    // is too big, this rule also respect the minimum partition size specified by
    // COALESCE_PARTITIONS_MIN_PARTITION_SIZE (default 1MB).
    // For history reason, this rule also need to support the config
    // COALESCE_PARTITIONS_MIN_PARTITION_NUM. We should remove this config in the future.
    val minNumPartitions = conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM).getOrElse {
      if (conf.getConf(SQLConf.COALESCE_PARTITIONS_PARALLELISM_FIRST)) {
        // We fall back to Spark default parallelism if the minimum number of coalesced partitions
        // is not set, so to avoid perf regressions compared to no coalescing.
        session.sparkContext.defaultParallelism
      } else {
        // If we don't need to maximize the parallelism, we set `minPartitionNum` to 1, so that
        // the specified advisory partition size will be respected.
        1
      }
    }
    val advisoryTargetSize = conf.getConf(SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES)
    val minPartitionSize = if (Utils.isTesting) {
      // In the tests, we usually set the target size to a very small value that is even smaller
      // than the default value of the min partition size. Here we also adjust the min partition
      // size to be not larger than 20% of the target size, so that the tests don't need to set
      // both configs all the time to check the coalescing behavior.
      conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_SIZE).min(advisoryTargetSize / 5) // this is the key point
    } else {
      conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_SIZE)
    }
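The way `minNumPartitions` is chosen in this excerpt can be summarized in a few lines of Python. This is a sketch of the quoted branching only. The function and parameter names are mine: `min_partition_num` stands for COALESCE_PARTITIONS_MIN_PARTITION_NUM (spark.sql.adaptive.coalescePartitions.minPartitionNum), `parallelism_first` for COALESCE_PARTITIONS_PARALLELISM_FIRST (spark.sql.adaptive.coalescePartitions.parallelismFirst), and the default parallelism of 400 is a hypothetical value.

```python
def min_num_partitions(min_partition_num, parallelism_first, default_parallelism):
    """Model of how the quoted rule picks the minimum partition count.

    min_partition_num: explicit setting, or None when it is not configured.
    parallelism_first: whether the rule should maximize parallelism.
    default_parallelism: stands in for sparkContext.defaultParallelism.
    """
    if min_partition_num is not None:
        return min_partition_num      # an explicit setting always wins
    if parallelism_first:
        return default_parallelism    # fall back to the default parallelism
    return 1                          # lets the advisory size be respected


print(min_num_partitions(None, True, 400))   # parallelism first
print(min_num_partitions(None, False, 400))  # advisory size respected
print(min_num_partitions(8, True, 400))      # explicit minimum
```

When nothing is configured explicitly and parallelism comes first, the total shuffle size is divided by the default parallelism, which is how the target size can fall well below the advisory size.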

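Put simply, the partition size in Spark 3.2 is determined jointly by the two settings below (other settings also play a part). To avoid the I/O problems caused by too many small files, we try to set the upper limit on each partition's size equal to the block size.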

spark.sql.adaptive.advisoryPartitionSizeInBytes
spark.sql.adaptive.coalescePartitions.minPartitionSize
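To see how these settings translate into file counts, here is a rough simulation in Python. It is a simplified model under stated assumptions, not Spark's algorithm: `coalesce_count` is a plain greedy merge of adjacent partitions and ignores Spark's extra handling of small and skewed partitions, and the 500 shuffle partitions of 2,000,000 bytes each, the default parallelism of 400, and the 32 MiB minimum partition size are all hypothetical.

```python
import math


def target_size(total_bytes, min_num_partitions, advisory_bytes, min_partition_bytes):
    """Target size as in the quoted ShufflePartitionsUtil excerpt."""
    max_target = math.ceil(total_bytes / min_num_partitions)
    return max(min(max_target, advisory_bytes), min_partition_bytes)


def coalesce_count(partition_sizes, target_bytes):
    """Greedy model: merge adjacent partitions while the sum stays within target."""
    groups, current = 0, 0
    for size in partition_sizes:
        if current > 0 and current + size > target_bytes:
            groups += 1   # close the current group and start a new one
            current = 0
        current += size
    return groups + (1 if current > 0 else 0)


MIB = 1024 * 1024
sizes = [2_000_000] * 500   # hypothetical: 500 shuffle partitions of ~2 MB
total = sum(sizes)

# A: minimum partition count 400, minimum partition size 1 MiB.
a = coalesce_count(sizes, target_size(total, 400, 256 * MIB, 1 * MIB))
# B: minimum partition count 1, so the 256 MiB advisory size is respected.
b = coalesce_count(sizes, target_size(total, 1, 256 * MIB, 1 * MIB))
# C: minimum partition count 400, but a larger minimum partition size of 32 MiB.
c = coalesce_count(sizes, target_size(total, 400, 256 * MIB, 32 * MIB))
print(a, b, c)
```

In this model, case A keeps all 500 partitions, while cases B and C merge them into far fewer outputs. Which knob to turn in a real job depends on the cluster and the data, so treat the figures as an illustration of the mechanism only.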

Problem solved.
