Fixing the increase in small files after upgrading from Spark 2.4 to Spark 3.2

Entirely original content; this is just a record of the problem analysis and contains no business information.

Problem description

A colleague from the data warehouse team filed an issue: after upgrading to Spark 3.2, the same set of SQL and the same configuration now writes 500 small files to the final table instead of just a few, and they asked whether distribute by rand() had stopped working... :D

Spark 3.2 (screenshot)

Spark 2.4 (screenshot)

First, the problem does exist (although it has nothing to do with distribute by rand()...). In the final insert, the partition count used to drop from 500 in the upstream job down to just a few, but now it stays at 500. Let's dig deeper.

Root cause and fix

1. distribute by rand()

The business-side colleagues probably still misunderstand what this clause does. Its purpose is to scatter the data, not to limit the number of partitions. At most one can say that scattering the data requires a repartition, and the repartitioning logic along the way is what ends up changing the number of files written by the final insert.

To be precise about what distribute by rand() means: rand() is a function that returns a random value between 0 and 1, and distribute by rand() attaches that value to every row as an extra field and then repartitions the data by it, which is how the rows get scattered.
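As a minimal sketch of these semantics (the table names src and tgt, the object name, and the session setup below are illustrative assumptions, not the actual job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object DistributeByRandSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("distribute-by-rand-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // SQL form: every row is tagged with a rand() value and shuffled by it,
    // so the data is spread evenly across the shuffle partitions.
    spark.sql(
      """INSERT OVERWRITE TABLE tgt
        |SELECT * FROM src DISTRIBUTE BY rand()""".stripMargin)

    // Roughly equivalent DataFrame form: repartition by a random column.
    // Neither form pins the number of output files; that is decided later
    // by the shuffle / AQE partition-coalescing logic discussed below.
    spark.table("src").repartition(rand()).write.insertInto("tgt")

    spark.stop()
  }
}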

Physical plan of the same SQL on Spark 3.2 (screenshot)

Physical plan of the same SQL on Spark 2.4 (screenshot)

As the plans show, both the partitioning expression and the partition limit are identical, so this is not the cause.

2. Evidence first: go straight to the logs and the source code.

Context log of the last job on Spark 3.2

04-16 10:10:51 [INFO][org.apache.spark.sql.execution.adaptive.ShufflePartitionsUtil][Driver] - For shuffle(2), advisory target size: 268435456, actual target size 2624526, minimum partition size: 1048576
04-16 10:10:52 [INFO][org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator][Driver] - Code generated in 42.326706 ms
04-16 10:10:52 [INFO][org.apache.spark.SparkContext][Driver] - Starting job: sql at SparkSqlJob.java:78
04-16 10:10:52 [INFO][org.apache.spark.scheduler.DAGScheduler][dag-scheduler-event-loop] - Got job 3 (sql at SparkSqlJob.java:78) with 500 output partitions

Context log of the last job on Spark 2.4

04-13 02:20:50 [INFO][org.apache.spark.sql.execution.exchange.ExchangeCoordinator][Driver] - advisoryTargetPostShuffleInputSize: 268435456, targetPostShuffleInputSize 268435456. 
...
04-13 02:20:50 [INFO][org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter][Driver] - File Output Committer Algorithm version is 2
04-13 02:20:50 [INFO][org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter][Driver] - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
04-13 02:20:50 [INFO][org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol][Driver] - Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
04-13 02:20:50 [INFO][org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator][Driver] - Code generated in 11.517121 ms
04-13 02:20:50 [INFO][org.apache.spark.SparkContext][Driver] - Starting job: sql at SparkSqlJob.java:72
04-13 02:20:50 [INFO][org.apache.spark.scheduler.DAGScheduler][dag-scheduler-event-loop] - Got job 9 (sql at SparkSqlJob.java:72) with 3 output partitions

Notice the partition target size: we configured 256 MB, but the value that actually takes effect on Spark 3.2 is only about 2.5 MB. This is the root cause.

Spark 2.4 partition-coalescing code

org.apache.spark.sql.execution.exchange.ExchangeCoordinator

// If minNumPostShufflePartitions is defined, it is possible that we need to use a
// value less than advisoryTargetPostShuffleInputSize as the target input size of
// a post shuffle task.
val totalPostShuffleInputSize = mapOutputStatistics.map(_.bytesByPartitionId.sum).sum
// The max at here is to make sure that when we have an empty table, we
// only have a single post-shuffle partition.
// There is no particular reason that we pick 16. We just need a number to
// prevent maxPostShuffleInputSize from being set to 0.
val maxPostShuffleInputSize =
  math.max(math.ceil(totalPostShuffleInputSize / minNumPostShufflePartitions.toDouble).toLong, 16)
val targetPostShuffleInputSize =
  math.min(maxPostShuffleInputSize, advisoryTargetPostShuffleInputSize)
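To make that concrete, here is a standalone sketch of the 2.4 formula (the object name and the sample inputs in main() are assumptions for illustration, not measurements from the real job):

object Spark24TargetSizeSketch {
  def targetPostShuffleInputSize(
      totalPostShuffleInputSize: Long,          // sum of all map-output bytes of the shuffle
      minNumPostShufflePartitions: Int,         // the 2.x minimum-post-shuffle-partitions setting
      advisoryTargetPostShuffleInputSize: Long  // the 2.x advisory target size (256 MB in our jobs)
  ): Long = {
    val maxPostShuffleInputSize =
      math.max(math.ceil(totalPostShuffleInputSize / minNumPostShufflePartitions.toDouble).toLong, 16)
    math.min(maxPostShuffleInputSize, advisoryTargetPostShuffleInputSize)
  }

  def main(args: Array[String]): Unit = {
    // With an assumed ~1.3 GB of shuffle data and a small minimum partition count,
    // the target stays at the advisory 268435456 (256 MB), matching the 2.4 log above,
    // so the 500 map-side partitions are coalesced down to just a few.
    println(targetPostShuffleInputSize(1312263000L, 1, 268435456L))
  }
}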

Spark 3.2 partition-coalescing code

org.apache.spark.sql.execution.adaptive.ShufflePartitionsUtil

// If `minNumPartitions` is very large, it is possible that we need to use a value less than
// `advisoryTargetSize` as the target size of a coalesced task.
val totalPostShuffleInputSize = mapOutputStatistics.flatMap(_.map(_.bytesByPartitionId.sum)).sum
val maxTargetSize = math.ceil(totalPostShuffleInputSize / minNumPartitions.toDouble).toLong
// It's meaningless to make target size smaller than minPartitionSize.
val targetSize = maxTargetSize.min(advisoryTargetSize).max(minPartitionSize) // this is the key line

val shuffleIds = mapOutputStatistics.flatMap(_.map(_.shuffleId)).mkString(", ")
logInfo(s"For shuffle($shuffleIds), advisory target size: $advisoryTargetSize, " +
  s"actual target size $targetSize, minimum partition size: $minPartitionSize")

// If `inputPartitionSpecs` are all empty, it means skew join optimization is not applied.
if (inputPartitionSpecs.forall(_.isEmpty)) {
  coalescePartitionsWithoutSkew(
    mapOutputStatistics, targetSize, minPartitionSize)
} else {
  coalescePartitionsWithSkew(
    mapOutputStatistics, inputPartitionSpecs, targetSize, minPartitionSize)
}

org.apache.spark.sql.execution.adaptive.CoalesceShufflePartitions

// Ideally, this rule should simply coalesce partitions w.r.t. the target size specified by
// ADVISORY_PARTITION_SIZE_IN_BYTES (default 64MB). To avoid perf regression in AQE, this
// rule by default tries to maximize the parallelism and set the target size to
// `total shuffle size / Spark default parallelism`. In case the `Spark default parallelism`
// is too big, this rule also respect the minimum partition size specified by
// COALESCE_PARTITIONS_MIN_PARTITION_SIZE (default 1MB).
// For history reason, this rule also need to support the config
// COALESCE_PARTITIONS_MIN_PARTITION_NUM. We should remove this config in the future.
val minNumPartitions = conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM).getOrElse {
  if (conf.getConf(SQLConf.COALESCE_PARTITIONS_PARALLELISM_FIRST)) {
    // We fall back to Spark default parallelism if the minimum number of coalesced partitions
    // is not set, so to avoid perf regressions compared to no coalescing.
    session.sparkContext.defaultParallelism
  } else {
    // If we don't need to maximize the parallelism, we set `minPartitionNum` to 1, so that
    // the specified advisory partition size will be respected.
    1
  }
}
val advisoryTargetSize = conf.getConf(SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES)
val minPartitionSize = if (Utils.isTesting) {
  // In the tests, we usually set the target size to a very small value that is even smaller
  // than the default value of the min partition size. Here we also adjust the min partition
  // size to be not larger than 20% of the target size, so that the tests don't need to set
  // both configs all the time to check the coalescing behavior.
  conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_SIZE).min(advisoryTargetSize / 5) // this is the key
} else {
  conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_SIZE)
}
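Plugging the logged numbers into the 3.2 formula shows where the ~2.5 MB comes from. The sketch below is a standalone illustration, not the real Spark code path: advisoryTargetSize and minPartitionSize are taken from the log, while the total shuffle size and the default parallelism are assumed values chosen so that the arithmetic reproduces the "actual target size 2624526" line.

object Spark32TargetSizeSketch {
  def main(args: Array[String]): Unit = {
    val advisoryTargetSize = 268435456L  // spark.sql.adaptive.advisoryPartitionSizeInBytes (256 MB, from the log)
    val minPartitionSize   = 1048576L    // spark.sql.adaptive.coalescePartitions.minPartitionSize (1 MB, from the log)
    val totalShuffleBytes  = 1312263000L // assumed total map output, ~1.3 GB
    val minNumPartitions   = 500         // assumed spark.default.parallelism (parallelismFirst is true by default)

    val maxTargetSize = math.ceil(totalShuffleBytes / minNumPartitions.toDouble).toLong
    val targetSize    = maxTargetSize.min(advisoryTargetSize).max(minPartitionSize)

    // Prints 2624526 (~2.5 MB): with such a small target, the shuffle cannot be
    // coalesced below ~500 partitions, hence ~500 small output files.
    println(targetSize)
  }
}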

Simply put, the coalesced partition size in Spark 3.2 is jointly determined by the two configs below (among others). To avoid the IO problems caused by too many small files, we set the per-partition size target to the HDFS block size, as shown in the sketch after the config list.

spark.sql.adaptive.advisoryPartitionSizeInBytes
spark.sql.adaptive.coalescePartitions.minPartitionSize
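A sketch of how those settings can be applied (the object name, session setup, and exact values are examples for our environment, not universal recommendations). Based on the CoalesceShufflePartitions excerpt above, disabling parallelism-first coalescing is an alternative way to make the advisory size be respected:

import org.apache.spark.sql.SparkSession

object CoalesceConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-partitions-config-sketch")
      .getOrCreate()

    // Target size for coalesced shuffle partitions; align it with the HDFS block size (256 MB here).
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256m")

    // Lower bound for a coalesced partition, so the effective target can never shrink to ~1 MB.
    spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionSize", "256m")

    // Alternative knob from the code above: stop maximizing parallelism so that
    // minNumPartitions falls back to 1 and the advisory size is respected.
    spark.conf.set("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
  }
}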

Problem solved.
