All content is original. This is only a record of a problem analysis and contains no business information.
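Problem description
A data warehouse colleague reported an issue: after upgrading to Spark 3.2, with the same set of SQL statements and the same configuration, why did the number of small files written to the final table go from a handful to 500? Has distribute by rand() stopped working... :D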
Spark 3.2
Spark 2.4
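First, the problem is real (although it has nothing to do with distribute by rand()...). In the final insert operation, the 500 partitions coming from the upstream job used to be reduced to just a few, whereas now all 500 remain. Let's analyze further.
Diagnosis and fix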
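1. distribute by rand()
The business colleague may not have fully understood how this is used: it is mainly for scattering data, not for limiting the number of partitions. Scattering the data requires a repartition, and it is the repartitioning logic applied along the way that changed the number of files in the final insert.
To be precise about what distribute by rand() means: rand() is a function that returns a random value in [0, 1). With distribute by rand(), each row takes the value returned by rand() as an extra field, and the data is repartitioned on that field, which scatters the rows.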
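The scattering effect can be illustrated without a cluster. The following is a minimal sketch, not Spark's implementation: it uses `Double.hashCode` as a stand-in for Spark's real hash function, and the object name, row count, partition count, and seed are all hypothetical. It shows that a random key only decides which partition a row lands in. The number of partitions is an input and does not depend on the key.

```scala
import scala.util.Random

object DistributeByRandSketch {
  // Assign each row to a partition the way a hash repartition on a random key would.
  // Double.hashCode is only a stand-in for Spark's real hash function (assumption).
  def partitionCounts(numRows: Int, numPartitions: Int, seed: Long): Map[Int, Int] = {
    val rng = new Random(seed)
    (0 until numRows)
      .map { _ =>
        val key = rng.nextDouble() // like rand(): uniform in [0, 1)
        Math.floorMod(java.lang.Double.hashCode(key), numPartitions)
      }
      .groupBy(identity)
      .map { case (partition, rows) => partition -> rows.size }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sizes: 100000 rows scattered over 500 shuffle partitions.
    val counts = partitionCounts(numRows = 100000, numPartitions = 500, seed = 42L)
    println(s"non-empty partitions: ${counts.size}, largest: ${counts.values.max}")
  }
}
```

However the rows are spread, the partition count stays at whatever the shuffle was configured with. Any reduction to "a few" partitions has to come from a later coalescing step.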
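Physical execution plan of the same set of SQL on Spark 3.2
Physical execution plan of the same set of SQL on Spark 2.4
As you can see, both the partitioning field and the upper limit on the number of partitions are the same, so this is not the problem.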
2. Let the evidence speak: go straight to the logs and the source code.
Context log of the last job on Spark 3.2
04-16 10:10:51 [INFO][org.apache.spark.sql.execution.adaptive.ShufflePartitionsUtil][Driver] - For shuffle(2), advisory target size: 268435456, actual target size 2624526, minimum partition size: 1048576
04-16 10:10:52 [INFO][org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator][Driver] - Code generated in 42.326706 ms
04-16 10:10:52 [INFO][org.apache.spark.SparkContext][Driver] - Starting job: sql at SparkSqlJob.java:78
04-16 10:10:52 [INFO][org.apache.spark.scheduler.DAGScheduler][dag-scheduler-event-loop] - Got job 3 (sql at SparkSqlJob.java:78) with 500 output partitions
Context log of the last job on Spark 2.4
04-13 02:20:50 [INFO][org.apache.spark.sql.execution.exchange.ExchangeCoordinator][Driver] - advisoryTargetPostShuffleInputSize: 268435456, targetPostShuffleInputSize 268435456.
...
04-13 02:20:50 [INFO][org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter][Driver] - File Output Committer Algorithm version is 2
04-13 02:20:50 [INFO][org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter][Driver] - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
04-13 02:20:50 [INFO][org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol][Driver] - Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
04-13 02:20:50 [INFO][org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator][Driver] - Code generated in 11.517121 ms
04-13 02:20:50 [INFO][org.apache.spark.SparkContext][Driver] - Starting job: sql at SparkSqlJob.java:72
04-13 02:20:50 [INFO][org.apache.spark.scheduler.DAGScheduler][dag-scheduler-event-loop] - Got job 9 (sql at SparkSqlJob.java:72) with 3 output partitions
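The partition target size stands out: we configured 256 MB (268435456 bytes), but the value that actually takes effect on Spark 3.2 is about 2.5 MB (2624526 bytes). This is the cause of the problem.
Spark 2.4 repartitioning code logic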
org.apache.spark.sql.execution.exchange.ExchangeCoordinator
// If minNumPostShufflePartitions is defined, it is possible that we need to use a
// value less than advisoryTargetPostShuffleInputSize as the target input size of
// a post shuffle task.
val totalPostShuffleInputSize = mapOutputStatistics.map(_.bytesByPartitionId.sum).sum
// The max at here is to make sure that when we have an empty table, we
// only have a single post-shuffle partition.
// There is no particular reason that we pick 16. We just need a number to
// prevent maxPostShuffleInputSize from being set to 0.
val maxPostShuffleInputSize =
  math.max(math.ceil(totalPostShuffleInputSize / minNumPostShufflePartitions.toDouble).toLong, 16)
val targetPostShuffleInputSize =
  math.min(maxPostShuffleInputSize, advisoryTargetPostShuffleInputSize)
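The expression above can be tried out in isolation. The following is a minimal sketch that copies only the arithmetic of the quoted excerpt. The object name, parameter names, and the 1 GiB input size are hypothetical. Only the 268435456-byte advisory size comes from the log.

```scala
object Spark24TargetSizeSketch {
  // Same arithmetic as the quoted ExchangeCoordinator excerpt:
  // target = min(max(ceil(total / minNumPartitions), 16), advisory)
  def targetSize(totalBytes: Long, minNumPartitions: Int, advisoryBytes: Long): Long = {
    val maxSize = math.max(math.ceil(totalBytes / minNumPartitions.toDouble).toLong, 16L)
    math.min(maxSize, advisoryBytes)
  }

  def main(args: Array[String]): Unit = {
    val advisory = 268435456L    // 256 MB, as shown in the Spark 2.4 log
    val total = 1024L * 1024 * 1024 // hypothetical 1 GiB of shuffle data
    // With a minimum of 1 partition, the advisory size is what takes effect.
    println(targetSize(total, minNumPartitions = 1, advisoryBytes = advisory))
  }
}
```

When the minimum partition count is small, the target collapses to the advisory size. That is consistent with the 2.4 log line, where both values are 268435456.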
Spark 3.2 repartitioning code logic
org.apache.spark.sql.execution.adaptive.ShufflePartitionsUtil
// If `minNumPartitions` is very large, it is possible that we need to use a value less than
// `advisoryTargetSize` as the target size of a coalesced task.
val totalPostShuffleInputSize = mapOutputStatistics.flatMap(_.map(_.bytesByPartitionId.sum)).sum
val maxTargetSize = math.ceil(totalPostShuffleInputSize / minNumPartitions.toDouble).toLong
// It's meaningless to make target size smaller than minPartitionSize.
val targetSize = maxTargetSize.min(advisoryTargetSize).max(minPartitionSize) // the key is here
val shuffleIds = mapOutputStatistics.flatMap(_.map(_.shuffleId)).mkString(", ")
logInfo(s"For shuffle($shuffleIds), advisory target size: $advisoryTargetSize, " +
  s"actual target size $targetSize, minimum partition size: $minPartitionSize")
// If `inputPartitionSpecs` are all empty, it means skew join optimization is not applied.
if (inputPartitionSpecs.forall(_.isEmpty)) {
  coalescePartitionsWithoutSkew(
    mapOutputStatistics, targetSize, minPartitionSize)
} else {
  coalescePartitionsWithSkew(
    mapOutputStatistics, inputPartitionSpecs, targetSize, minPartitionSize)
}
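To see how the target size can shrink so far below the advisory size, the key expression can be evaluated on its own. The following is a minimal sketch that reproduces only the arithmetic of the quoted `targetSize` line. The object name, parameter names, the 1 GiB and 100 MiB input sizes, and the minimum partition count of 500 are hypothetical. The advisory size and minimum partition size are the values printed in the Spark 3.2 log.

```scala
object Spark32TargetSizeSketch {
  // Same arithmetic as the quoted ShufflePartitionsUtil excerpt:
  // target = max(min(ceil(total / minNumPartitions), advisory), minPartitionSize)
  def targetSize(
      totalBytes: Long,
      minNumPartitions: Int,
      advisoryBytes: Long,
      minPartitionBytes: Long): Long = {
    val maxTargetSize = math.ceil(totalBytes / minNumPartitions.toDouble).toLong
    maxTargetSize.min(advisoryBytes).max(minPartitionBytes)
  }

  def main(args: Array[String]): Unit = {
    val advisory = 268435456L     // from the log: advisory target size
    val minPartition = 1048576L   // from the log: minimum partition size
    val total = 1024L * 1024 * 1024 // hypothetical 1 GiB of shuffle data

    // A large minimum partition count pulls the target down to about 2 MB.
    println(targetSize(total, 500, advisory, minPartition))
    // A minimum partition count of 1 lets the advisory size take effect.
    println(targetSize(total, 1, advisory, minPartition))
    // With little data, the minimum partition size becomes the floor.
    println(targetSize(100L * 1024 * 1024, 500, advisory, minPartition))
  }
}
```

Under these assumed inputs the three calls give 2147484, 268435456, and 1048576 bytes. The advisory size is only an upper bound, and the minimum partition count can override it.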
org.apache.spark.sql.execution.adaptive.CoalesceShufflePartitions
// Ideally, this rule should simply coalesce partitions w.r.t. the target size specified by
// ADVISORY_PARTITION_SIZE_IN_BYTES (default 64MB). To avoid perf regression in AQE, this
// rule by default tries to maximize the parallelism and set the target size to
// `total shuffle size / Spark default parallelism`. In case the `Spark default parallelism`
// is too big, this rule also respect the minimum partition size specified by
// COALESCE_PARTITIONS_MIN_PARTITION_SIZE (default 1MB).
// For history reason, this rule also need to support the config
// COALESCE_PARTITIONS_MIN_PARTITION_NUM. We should remove this config in the future.
val minNumPartitions = conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM).getOrElse {
  if (conf.getConf(SQLConf.COALESCE_PARTITIONS_PARALLELISM_FIRST)) {
    // We fall back to Spark default parallelism if the minimum number of coalesced partitions
    // is not set, so to avoid perf regressions compared to no coalescing.
    session.sparkContext.defaultParallelism
  } else {
    // If we don't need to maximize the parallelism, we set `minPartitionNum` to 1, so that
    // the specified advisory partition size will be respected.
    1
  }
}
val advisoryTargetSize = conf.getConf(SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES)
val minPartitionSize = if (Utils.isTesting) {
  // In the tests, we usually set the target size to a very small value that is even smaller
  // than the default value of the min partition size. Here we also adjust the min partition
  // size to be not larger than 20% of the target size, so that the tests don't need to set
  // both configs all the time to check the coalescing behavior.
  conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_SIZE).min(advisoryTargetSize / 5) // this is the key point
} else {
  conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_SIZE)
}
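The way `minNumPartitions` is chosen in the excerpt above can be restated as a small pure function. The following is a minimal sketch: the object name, parameter names, and the default parallelism of 500 are hypothetical and stand in for the config lookups and `session.sparkContext.defaultParallelism`.

```scala
object MinNumPartitionsSketch {
  // Mirrors the getOrElse logic of the quoted CoalesceShufflePartitions excerpt:
  // an explicit minimum wins; otherwise parallelism-first falls back to the
  // default parallelism, and parallelism-first = false falls back to 1.
  def resolve(explicitMinNum: Option[Int], parallelismFirst: Boolean, defaultParallelism: Int): Int =
    explicitMinNum.getOrElse(if (parallelismFirst) defaultParallelism else 1)

  def main(args: Array[String]): Unit = {
    // Hypothetical cluster with a default parallelism of 500.
    println(resolve(None, parallelismFirst = true, defaultParallelism = 500))  // falls back to 500
    println(resolve(None, parallelismFirst = false, defaultParallelism = 500)) // falls back to 1
    println(resolve(Some(10), parallelismFirst = true, defaultParallelism = 500)) // explicit value wins
  }
}
```

With parallelism-first on and no explicit minimum, the divisor used for the target size is the default parallelism. That is how a large cluster can end up with a target far below the advisory size.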
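Put simply, in Spark 3.2 the partition size is determined jointly by the two settings below (among others). To avoid the I/O problems caused by too many small files, we try to set the upper limit of each partition's size equal to the block size.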
spark.sql.adaptive.advisoryPartitionSizeInBytes
spark.sql.adaptive.coalescePartitions.minPartitionSize
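The post does not state which values were finally applied, so the following is only a sketch of candidate overrides that would be consistent with the quoted source. The config keys are real Spark 3.2 settings. The values, the object name, and the helper that renders them as SET statements are assumptions for illustration. The advisory value is the 268435456 bytes seen in the logs. Switching off parallelism-first is one option suggested by the quoted code comments, not something the post confirms. Raising `spark.sql.adaptive.coalescePartitions.minPartitionSize` would be another way to raise the floor.

```scala
object CoalesceConfigSketch {
  // Candidate overrides (assumptions, not the values the author is known to have used).
  val overrides: Seq[(String, String)] = Seq(
    "spark.sql.adaptive.advisoryPartitionSizeInBytes" -> "268435456",
    "spark.sql.adaptive.coalescePartitions.parallelismFirst" -> "false"
  )

  // Render settings as SQL SET statements that could be placed before the job's SQL.
  def asSetStatements(settings: Seq[(String, String)]): Seq[String] =
    settings.map { case (key, value) => s"SET $key=$value;" }

  def main(args: Array[String]): Unit =
    asSetStatements(overrides).foreach(println)
}
```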
Problem solved.