Understanding shuffle in Spark, with an introduction to repartition and coalesce and their use cases

Article link: https://blog.csdn.net/yu0_zhang0/article/details/80454517

This article looks at the shuffle mechanism in Spark: how it works, how it affects performance, and how the repartition and coalesce operations relate to it.

1 The shuffle operation

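Official description
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark's mechanism for redistributing data so that it is grouped differently across partitions. This typically involves copying data across executors and machines, which makes the shuffle a complex and costly operation.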

2 Background

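To understand what happens during a shuffle, consider the reduceByKey operation. reduceByKey generates a new RDD in which all values for a single key are combined into a tuple: the key, and the result of applying the reduce function to all values associated with that key. The challenge is that the values for a single key do not necessarily reside in the same partition, or even on the same node, yet they must be co-located to compute the result. Each task works on exactly one partition, so for a reduceByKey Spark has to read from every partition, find the values that belong to each key, and pull those values together across partitions to compute the final result for each key. This is what is called the shuffle.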
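The steps described above can be sketched in plain Scala without Spark. This is only a simplified model under stated assumptions: `ShuffleModel`, `Partition`, and `targetPartition` are hypothetical names, partitions are ordinary in-memory `Seq`s, and the "network transfer" is just a filter over the map-side results.

```scala
object ShuffleModel {
  // Each Seq stands in for one RDD partition of (key, value) records.
  type Partition = Seq[(String, Int)]

  // Non-negative modulo of the key's hash code, the usual hash-partitioning idea.
  def targetPartition(key: String, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Reduce all values that share a key with the function f.
  private def combine(records: Seq[(String, Int)], f: (Int, Int) => Int): Map[String, Int] =
    records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def reduceByKey(input: Seq[Partition], numPartitions: Int)(f: (Int, Int) => Int): Seq[Map[String, Int]] = {
    // Map side: pre-combine values per key inside each input partition.
    val mapSide = input.map(part => combine(part, f))
    // Shuffle: every output partition pulls its keys from every input partition,
    // then merges the partial results on the reduce side.
    (0 until numPartitions).map { p =>
      val fetched = mapSide.flatMap(_.toSeq.filter { case (k, _) => targetPartition(k, numPartitions) == p })
      combine(fetched, f)
    }
  }
}
```

Calling `ShuffleModel.reduceByKey(Seq(Seq("a" -> 1, "b" -> 2, "a" -> 3), Seq("b" -> 4, "c" -> 5)), 2)(_ + _)` places each key in exactly one output partition, although its values started out spread over both input partitions.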

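Although the set of elements in each partition after a shuffle is deterministic, and so is the ordering of the partitions themselves, the ordering of the elements within a partition is not. If you need predictably ordered data after a shuffle, you can use:

mapPartitions to sort each partition, for example with .sorted
repartitionAndSortWithinPartitions to sort partitions efficiently while repartitioning at the same time
sortBy to produce a globally ordered RDD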

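Operations that can cause a shuffle include repartition operations such as repartition and coalesce (coalesce only when called with shuffle = true), 'ByKey operations (except for counting) such as groupByKey and reduceByKey, and join operations such as cogroup and join.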

3 Performance impact

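During a shuffle, records with the same key on each node are first written to local disk files, and other nodes then pull the records for that key from those files over the network. When all records for the same key are pulled to one node for aggregation, that node may also end up handling so much data that the shuffle incurs heavy disk read/write I/O on top of the network transfer. Disk I/O, network transfer, and data serialization are the main reasons shuffle performs poorly.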


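Certain shuffle operations consume a significant amount of heap memory, because they use in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey build these structures on the map side, and 'ByKey operations build them on the reduce side. Each time a record is written into the in-memory structure, Spark checks whether a threshold has been reached. If it has, Spark tries to spill the contents of the structure to disk and then clears it, which adds disk I/O overhead and extra garbage collection.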
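The check-then-spill behaviour described above can be illustrated with a small plain-Scala model. This is not Spark's implementation: `SpillingBuffer` and `maxKeysInMemory` are hypothetical names, the threshold here counts keys rather than estimating bytes, and the "spill files" are just in-memory snapshots.

```scala
import scala.collection.mutable

// Hypothetical model of an aggregation buffer that spills once a threshold is reached.
final class SpillingBuffer(maxKeysInMemory: Int) {
  private val inMemory = mutable.Map.empty[String, Int]
  // Stand-in for spill files on disk.
  private val spilled = mutable.ArrayBuffer.empty[Map[String, Int]]

  def insert(key: String, value: Int): Unit = {
    inMemory.update(key, inMemory.getOrElse(key, 0) + value)
    // After every record, check whether the buffer has reached the threshold.
    if (inMemory.size >= maxKeysInMemory) {
      spilled += inMemory.toMap // "write" the current contents out
      inMemory.clear()          // then release the in-memory structure
    }
  }

  def spillCount: Int = spilled.size

  // Merge all spilled runs with whatever is still in memory.
  def result: Map[String, Int] =
    (spilled.toSeq :+ inMemory.toMap).flatten
      .groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2).sum }
}
```

The final merge has to read every spilled run back, which is where the extra disk I/O comes from in the real system.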

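Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and have been garbage collected. This is done so that the shuffle files do not need to be re-created if the lineage is recomputed. If the application retains references to these RDDs, or if GC does not kick in frequently, garbage collection may happen only after a long time. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.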

See the "Shuffle Behavior" section of the Spark Configuration Guide: http://spark.apache.org/docs/latest/configuration.html

4 repartition and coalesce: introduction and use cases

coalesce

  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]
      // ... (the excerpt stops here; the rest of the method is not shown)

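From the source we can see that the coalesce operator takes two main parameters: the number of partitions, and a shuffle flag that defaults to false. There is also an optional partitionCoalescer.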

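coalesce returns a new RDD reduced to numPartitions partitions. This results in a narrow dependency: going from 1000 partitions to 100, for example, involves no shuffle, because each of the 100 new partitions simply claims 10 of the current ones. The reverse does not work by default. If you request more partitions than the RDD currently has (say from 10 to 100) with shuffle = false, the RDD keeps its current number of partitions; only with shuffle = true can coalesce increase the partition count, and that does incur a shuffle. If you coalesce drastically, for example down to a single partition, the computation may run on fewer nodes than you would like (in other words, with too little parallelism). To avoid this, pass true as the second parameter, shuffle. This adds a shuffle step to the repartitioning, but it means the upstream partitions are still computed in parallel.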
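The narrow-dependency case can be pictured with a plain-Scala sketch that only computes which parent partitions each new partition claims. It is a simplification: `CoalesceModel.narrowCoalesce` is a hypothetical helper that groups contiguous partition indices, whereas the real coalescer may also take data locality into account.

```scala
object CoalesceModel {
  // For each output partition, the indices of the parent partitions it claims.
  def narrowCoalesce(currentPartitions: Int, requested: Int): Seq[Seq[Int]] = {
    require(requested > 0, s"Number of partitions ($requested) must be positive.")
    // Without a shuffle the partition count can only go down, never up.
    val target = math.min(currentPartitions, requested)
    (0 until target).map { i =>
      // Split the parent indices into `target` contiguous, near-equal ranges.
      val start = (i.toLong * currentPartitions / target).toInt
      val end = ((i + 1).toLong * currentPartitions / target).toInt
      start until end
    }
  }
}
```

`CoalesceModel.narrowCoalesce(1000, 100)` yields 100 groups of 10 parent partitions each, while `CoalesceModel.narrowCoalesce(10, 100)` still yields 10 groups, matching the doc comment's statement that a larger request leaves the partition count unchanged when shuffle = false.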

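Use case: merging small files. For example, suppose an RDD goes through several filter operations, so that a partition that started with 100 records ends up with only 10 after the final filter. With 100 such partitions, writing the result inevitably produces many small files, so at this point we append a coalesce at the end to merge them.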

repartition

  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
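Since repartition simply calls coalesce(numPartitions, shuffle = true), it goes through the distributePartition branch shown in the coalesce excerpt. The following plain-Scala sketch mirrors that function to show why the output comes out evenly sized. `RepartitionModel`, `targetPartition`, and `outputSizes` are hypothetical names, and the modulo step is only a stand-in for the hash partitioner that the source comment mentions.

```scala
import scala.util.Random

object RepartitionModel {
  // Mirrors distributePartition from the excerpt: a seeded random start, then consecutive keys.
  def distributePartition[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Stand-in for the hash partitioner: an Int key hashes to itself, then non-negative modulo.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // How many records of one input partition land in each output partition.
  def outputSizes[T](index: Int, items: Seq[T], numPartitions: Int): Seq[Int] = {
    val targets = distributePartition(index, items.iterator, numPartitions)
      .map { case (key, _) => targetPartition(key, numPartitions) }
      .toList
    (0 until numPartitions).map(p => targets.count(_ == p))
  }
}
```

Because the keys are consecutive integers, the records of any single input partition are dealt out round-robin, so the output partition sizes differ by at most one per input partition, however skewed the input was.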