Spark Series: How Spark Shuffle Works

Latest recommended article published on 2023-09-11 18:02:04

fseast

Category: Spark. Tags: spark, big data

Article link: https://blog.csdn.net/fseast/article/details/117369315

I. Background


1.1 Lineage

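RDDs support only coarse-grained transformations, that is, a single operation applied to a large number of records. Spark records the lineage of each RDD (the series of transformations that created it) so that lost partitions can be recovered. The lineage holds the RDD's metadata and its transformation steps; when some partitions of the RDD are lost, Spark uses this information to recompute and restore exactly those partitions.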

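RDD has a method called toDebugString, which prints the RDD's lineage together with details such as the parallelism (partition count) at each step.
The dependencies method shows which RDDs the current RDD depends on. A dependency is either wide or narrow.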
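
To make the recovery idea concrete, here is a minimal sketch in plain Scala that does not use Spark. The names `LineageSketch`, `Source` and `Mapped` are hypothetical and exist only for illustration; the point is that a node which remembers its parent and its transformation can rebuild any single partition on demand.

```scala
object LineageSketch {
  // A node in a toy lineage graph. compute() rebuilds one partition from scratch.
  sealed trait Node {
    def compute(partition: Int): Seq[Int]
    def lineage: String
  }

  // The original data, split into partitions.
  final case class Source(partitions: Vector[Seq[Int]]) extends Node {
    def compute(partition: Int): Seq[Int] = partitions(partition)
    def lineage: String = "Source"
  }

  // A coarse-grained transformation: the same function applied to every record.
  final case class Mapped(parent: Node, name: String, f: Int => Int) extends Node {
    // Recomputing partition i only needs partition i of the parent.
    def compute(partition: Int): Seq[Int] = parent.compute(partition).map(f)
    def lineage: String = s"$name <- ${parent.lineage}"
  }
}
```

If partition 1 of the last node is lost, calling `compute(1)` on it replays only the recorded steps for that partition, which is the behaviour the lineage description above relies on.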

1.2 Narrow Dependencies

In one sentence: a dependency is narrow when each partition of the parent RDD is used by at most one partition of the child RDD.

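More precisely, with a narrow dependency each child partition either depends on exactly one partition of one parent RDD (as with map and filter), or depends on a subset of parent partitions that is known at design time (as with coalesce: when coalesce reduces the partition count without a shuffle, it is also a narrow dependency).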

A narrow transformation can therefore run on any single partition independently, without any information from other partitions.


1.3 Wide Dependencies

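A dependency is wide when multiple partitions of the child RDD depend on the same partition of the parent RDD. This causes a shuffle, so a wide dependency is also called a shuffle dependency.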
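
The two definitions can be checked mechanically. The sketch below is plain Scala with hypothetical names (`DependencyKinds`, `oneToOne`, `coalesced`, `allToAll`); it models a dependency as a map from each parent partition to the set of child partitions that read it, and applies the "at most one child per parent partition" rule.

```scala
object DependencyKinds {
  // parent partition id -> ids of the child partitions that read it
  type Dep = Map[Int, Set[Int]]

  // Narrow: every parent partition is read by at most one child partition.
  def isNarrow(dep: Dep): Boolean = dep.values.forall(_.size <= 1)

  // map/filter style: child partition i reads parent partition i.
  def oneToOne(numPartitions: Int): Dep =
    (0 until numPartitions).map(p => p -> Set(p)).toMap

  // coalesce without shuffle: several parents feed one child, still one child each.
  def coalesced(numParents: Int, parentsPerChild: Int): Dep =
    (0 until numParents).map(p => p -> Set(p / parentsPerChild)).toMap

  // shuffle style: every child partition may read every parent partition.
  def allToAll(numParents: Int, numChildren: Int): Dep =
    (0 until numParents).map(p => p -> (0 until numChildren).toSet).toMap
}
```

Under this model `oneToOne` and `coalesced` are narrow, while `allToAll` is wide as soon as there is more than one child partition.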


II. How Spark Shuffle Works

2.1 ShuffleManager

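In current versions ShuffleManager has a single implementation, SortShuffleManager. Earlier versions also had a hash-based shuffle, which was removed in version 2.0, mainly because it generated too many files.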

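Let n be the number of upstream (map-side) tasks, m the number of downstream (reduce-side) tasks, and k the number of Executors. The unoptimized HashShuffleManager produces n*m files for the reduce tasks to fetch. The optimized HashShuffleManager adds a consolidation step and produces k*m files (this figure assumes one core per Executor; with several cores, each concurrently running task slot keeps its own group of m files). If reduce-side parallelism is high, the total number of files is still large.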
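
A quick way to get a feel for these numbers is to compute them. The sketch below is plain Scala; the object and method names are hypothetical. The sort-based count of two files per map task (one data file plus one index file) follows the description of SortShuffleWriter later in this article.

```scala
object ShuffleFileCounts {
  // Unoptimized hash shuffle: every map task writes one file per reduce task.
  def hashShuffle(mapTasks: Long, reduceTasks: Long): Long =
    mapTasks * reduceTasks

  // Consolidated hash shuffle: one group of files per concurrently running task slot.
  def consolidatedHashShuffle(executors: Long, coresPerExecutor: Long, reduceTasks: Long): Long =
    executors * coresPerExecutor * reduceTasks

  // Sort shuffle: one data file and one index file per map task.
  def sortShuffle(mapTasks: Long): Long = 2 * mapTasks
}
```

For example, with 1000 map tasks, 200 reduce tasks and 10 single-core Executors, the three strategies give 200000, 2000 and 2000 files respectively; the sort-based count does not depend on the number of reduce tasks at all.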

SortShuffleManager does not suffer from the reduce side having too many files to fetch, but it has to sort, which costs some performance. For the hash shuffle process and its problems, see the linked reference.

2.2 ShuffleWriter

ShuffleWriter is an abstract class that covers the operations involved in writing map-side output to disk.

ShuffleWriter has three subclasses: BypassMergeSortShuffleWriter, UnsafeShuffleWriter and SortShuffleWriter. Each handles the map-side write slightly differently.

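BypassMergeSortShuffleWriter and UnsafeShuffleWriter are used only when certain conditions are met. The choice is determined by the ShuffleHandle, and the exact conditions are explained in the source code section below. The rest of this explanation focuses on SortShuffleWriter.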

2.2.1 Differences Between BypassMergeSortShuffleWriter and SortShuffleWriter

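BypassMergeSortShuffleWriter is often referred to as the bypass mechanism.
It does not sort. Each map task creates one temporary disk file per downstream task (partition), derives the target partition of each record from the hash of its key, and writes the record to the corresponding temporary file. At the end, just like SortShuffleWriter, it merges the temporary files into one data file plus one index file. BypassMergeSortShuffleWriter is used when both of the following hold: the shuffle has no map-side pre-aggregation (groupByKey is an example of such an operator), and the number of downstream partitions is less than or equal to spark.shuffle.sort.bypassMergeThreshold (default 200).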
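
The bypass write path can be sketched in a few lines of plain Scala. This is a simplified model, not Spark's implementation: in-memory vectors stand in for the temporary files, the index holds record offsets instead of byte offsets, and `BypassSketch` and `partitionOf` are hypothetical names.

```scala
object BypassSketch {
  // Returns (data, index): data is the concatenation of the per-partition "files",
  // index(p) until index(p + 1) is the slice of data that belongs to partition p.
  def write(
      records: Seq[(String, Int)],
      numPartitions: Int,
      partitionOf: String => Int): (Vector[(String, Int)], Vector[Int]) = {
    // One append-only "temporary file" per downstream partition.
    val files = Vector.fill(numPartitions)(Vector.newBuilder[(String, Int)])
    // Route every record to its partition file; no sorting happens anywhere.
    records.foreach { case (k, v) => files(partitionOf(k)) += ((k, v)) }
    val parts = files.map(_.result())
    // Concatenate the files in partition order and remember where each one starts.
    val index = parts.map(_.length).scanLeft(0)(_ + _)
    (parts.flatten, index)
  }
}
```

Within each partition the records keep their arrival order, which is exactly the sorting work that the bypass mechanism avoids.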

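In summary, the differences are: (1) the disk write mechanism is different; (2) the bypass mechanism does not sort. The main benefit of enabling it is that the shuffle write does not need to sort the data, which saves that part of the cost.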

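This gives a tuning opportunity. In jobs that need neither aggregation nor sorting, we can adjust spark.shuffle.sort.bypassMergeThreshold: for a shuffle without map-side pre-aggregation, when the number of reduce-side partitions is less than or equal to this value, the shuffle avoids sorting altogether.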
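
The selection rules can be condensed into one function. The sketch below is plain Scala with hypothetical names; it mirrors the conditions described in this article (including the serializer relocation requirement and the 16777216 partition limit covered in the source code section), and it returns the writer name as a string instead of constructing anything.

```scala
object WriterSelectionSketch {
  // Upper bound on partitions for the serialized (Unsafe) path, as quoted in step 4.
  val MaxSerializedPartitions: Int = 16777216

  def chooseWriter(
      mapSideCombine: Boolean,
      numPartitions: Int,
      serializerSupportsRelocation: Boolean,
      bypassMergeThreshold: Int = 200): String = {
    if (!mapSideCombine && numPartitions <= bypassMergeThreshold) {
      "BypassMergeSortShuffleWriter"
    } else if (serializerSupportsRelocation && !mapSideCombine &&
        numPartitions <= MaxSerializedPartitions) {
      "UnsafeShuffleWriter"
    } else {
      "SortShuffleWriter"
    }
  }
}
```

Raising `bypassMergeThreshold` moves more shuffles without map-side aggregation into the first branch, which is the tuning lever discussed above.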

2.3 Spark Shuffle


Spark Shuffle overview:

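Map side:
A partition's data is not written to disk record by record. It is first written into an in-memory data structure, of which there are two kinds: PartitionedPairBuffer and PartitionedAppendOnlyMap.
PartitionedPairBuffer is used when there is no map-side pre-aggregation (for example groupByKey).
PartitionedAppendOnlyMap is used when there is map-side pre-aggregation (for example reduceByKey); its values can be accumulated and updated in place. Operators with pre-aggregation are recommended precisely because they reduce the amount of data spilled to disk and sent over the network during the shuffle.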
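
The effect of pre-aggregation on data volume can be seen in a small plain-Scala sketch. The names are hypothetical: `bufferOnly` plays the role of the pair buffer and `combine` the role of the updatable map.

```scala
object MapSideCombineSketch {
  // No pre-aggregation: every input record is kept and later shuffled.
  def bufferOnly(records: Seq[(String, Int)]): Vector[(String, Int)] =
    records.toVector

  // Pre-aggregation: values with the same key are merged as they arrive,
  // so at most one record per key leaves the map task.
  def combine(records: Seq[(String, Int)])(merge: (Int, Int) => Int): Map[String, Int] =
    records.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).fold(v)(old => merge(old, v)))
    }
}
```

For the input `Seq("a" -> 1, "b" -> 2, "a" -> 3)`, the buffer keeps three records while the combining map keeps two, and the gap widens as keys repeat more often.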

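Once the in-memory data reaches a certain threshold, it is spilled to disk. Before each spill the data is sorted, first by partition ID and then by key, so every temporary file is internally ordered. (Strictly speaking, the key comparison applies only when map-side aggregation or a key ordering is defined; otherwise records are ordered by partition ID alone.) The write goes through an in-memory buffer, a BufferedOutputStream. This can produce many temporary files.

At the end, the temporary files and the data still in memory are merge-sorted and combined into one set of files, consisting of an index file and a data file. The data file is ordered by partition ID and then by key, and the index file records the offset of each partition within the data file.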
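
The layout of the final data and index files can be sketched as follows. This is a simplified plain-Scala model with hypothetical names; it assumes a natural key ordering is in effect, and it stores record offsets in the index where the real index file stores byte offsets.

```scala
object SortShuffleSketch {
  // data: records tagged with their partition id, ordered by (partition, key)
  // index: index(p) until index(p + 1) is the slice of data for partition p
  final case class MapOutput(data: Vector[(Int, String, Int)], index: Vector[Int])

  def write(
      records: Seq[(String, Int)],
      numPartitions: Int,
      partitionOf: String => Int): MapOutput = {
    val tagged = records.map { case (k, v) => (partitionOf(k), k, v) }
    // Sort by partition id first, then by key within a partition.
    val sorted = tagged.sortBy { case (p, k, _) => (p, k) }.toVector
    val counts = (0 until numPartitions).map(p => sorted.count(_._1 == p))
    MapOutput(sorted, counts.scanLeft(0)(_ + _).toVector)
  }

  // What a reduce task does with the index: read only its own slice.
  def readPartition(out: MapOutput, partition: Int): Vector[(String, Int)] =
    out.data.slice(out.index(partition), out.index(partition + 1))
      .map { case (_, k, v) => (k, v) }
}
```

Because the data is contiguous per partition, a reduce task needs only two adjacent index entries to locate everything it has to fetch from this map task.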

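Reduce side:
The reduce side mainly reads the files written by the map side. Fetching starts only after all ShuffleMapTasks of the parent stage have finished.
Because Spark does not require shuffled data to be globally sorted, a reduce task does not have to wait until all of its data has arrived; it runs the reduce operation while it fetches.
The network cost of distributing shuffle data grows with the product of the number of map tasks and reduce tasks: each reduce task may fetch a block from every map task, so n map tasks and m reduce tasks give up to n*m blocks. When both grow linearly, the cost grows quadratically.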
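
Processing while fetching can be sketched with a lazy iterator of blocks. The code is plain Scala with hypothetical names; each element of `blocks` stands for the slice that one map task produced for this reduce task, and summation stands in for the reduce function.

```scala
import scala.collection.mutable

object ReduceSideSketch {
  // Consumes blocks one at a time and folds them into the running result,
  // so no block has to wait for the others before it is processed.
  def reduce(blocks: Iterator[Seq[(String, Int)]]): Map[String, Int] = {
    val acc = mutable.HashMap.empty[String, Int]
    for (block <- blocks; (k, v) <- block) {
      acc.update(k, acc.getOrElse(k, 0) + v)
    }
    acc.toMap
  }
}
```

Since the iterator is consumed lazily, a block can be aggregated as soon as it arrives, which works only because no global ordering across blocks is required.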

(Figure: how the map stage and the reduce stage appear in the Spark UI.)

2.4 Shuffle-Related Parameters

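spark.shuffle.file.buffer: map side. Sets the buffer size of the BufferedOutputStream used by the shuffle write task. Data is written into this buffer before it goes to the disk file, and the buffer is flushed to disk only when it is full. Default: 32k.

spark.reducer.maxSizeInFlight: reduce side. Sets the buffer size of the shuffle read task. This buffer determines how much data can be fetched at a time; increasing it moderately reduces the number of network I/O operations. Default: 48m.

spark.shuffle.sort.bypassMergeThreshold: related to the conditions for using BypassMergeSortShuffleWriter. When the number of reduce-side partitions is less than or equal to this value, operators without pre-aggregation avoid sorting during the shuffle. Default: 200.

spark.shuffle.io.maxRetries: when a shuffle read task fetches its data from the node that ran the shuffle write task, a fetch that fails because of a network error or a GC pause is retried automatically. This parameter is the maximum number of retries. If the fetch still fails after that many attempts, the job may fail. It applies to Netty only (current versions use Netty in place of Akka, so this restriction can be ignored). Default: 3.

spark.shuffle.io.retryWait: works together with spark.shuffle.io.maxRetries and sets the wait between retries. The maximum total wait is maxRetries * retryWait. Default: 5s.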
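
For reference, the defaults listed above can be written down as plain key/value pairs. The sketch below is dependency-free Scala with hypothetical names; in a real job such pairs are normally supplied through `--conf` or a `SparkConf`, which is outside what this sketch shows.

```scala
object ShuffleSettings {
  // Default values as listed in this section.
  val defaults: Map[String, String] = Map(
    "spark.shuffle.file.buffer" -> "32k",
    "spark.reducer.maxSizeInFlight" -> "48m",
    "spark.shuffle.sort.bypassMergeThreshold" -> "200",
    "spark.shuffle.io.maxRetries" -> "3",
    "spark.shuffle.io.retryWait" -> "5s"
  )

  // Upper bound on the time spent waiting between fetch retries.
  def maxRetryDelaySeconds(maxRetries: Int, retryWaitSeconds: Int): Int =
    maxRetries * retryWaitSeconds
}
```

With the defaults, a failing fetch waits at most 3 * 5 = 15 seconds in total before the task gives up.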

For more shuffle parameters, see the linked page.

III. Source Code

Everyone explains and understands things differently, so if some points above are still unclear, reading the code may help.

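The Spark version used here is 2.4.5; other versions differ slightly. This section looks mainly at the map phase, and among the ShuffleWriter subclasses mainly at SortShuffleWriter.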

Step 1: The shuffle-related code in DAGScheduler.scala:

This is the entry point for reading the shuffle source. The first branch (ShuffleMapStage) is the map phase, and the second branch (ResultStage) is the reduce phase.

    val tasks: Seq[Task[_]] = try {
      val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
      stage match {
        // Write phase (map side)
        case stage: ShuffleMapStage =>
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = partitions(id)
            stage.pendingPartitions += id
            // Open ShuffleMapTask and read its runTask method (continue at step 2)
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId)
          }
		
        // Read phase (reduce side)
        case stage: ResultStage =>
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = partitions(p)
            val locs = taskIdToLocations(id)
            // ResultTask reads the data written by the map side
            new ResultTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, id, properties, serializedTaskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
          }
      }
    } catch {
      case NonFatal(e) =>
        abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }

Step 2: The runTask method in ShuffleMapTask.scala:

  override def runTask(context: TaskContext): MapStatus = {
    // Deserialize the RDD using the broadcast variable.
    val threadMXBean = ManagementFactory.getThreadMXBean
    val deserializeStartTime = System.currentTimeMillis()
    val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime
    } else 0L
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
    _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
    _executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
    } else 0L

    var writer: ShuffleWriter[Any, Any] = null
    try {
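      // (This may differ between versions; in 3.0.0 obtaining the manager appears to have
      // been moved into ShuffleWriteProcessor.scala.)
      // ShuffleManager now has a single implementation, SortShuffleManager; early versions also had a hash-based one.
      val manager = SparkEnv.get.shuffleManager
      // getWriter returns an instance of a ShuffleWriter subclass, and each subclass writes its
      // output slightly differently. The three subclasses are UnsafeShuffleWriter,
      // BypassMergeSortShuffleWriter and SortShuffleWriter.
      // Which subclass is returned depends on dep.shuffleHandle; follow shuffleHandle to step 3.
      // The result has already been stated, but getWriter is still worth reading; follow it to step 5
      // (in the IDE: put the cursor on getWriter, press Ctrl+Alt+B, then choose SortShuffleManager).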
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
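      // The previous line obtained the writer that matches the shuffleHandle.
      // We will not go through the write method of every writer; only SortShuffleWriter's is covered here.
      // Put the cursor on write, press Ctrl+Alt+B and choose SortShuffleWriter.write (continue at step 6).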
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      writer.stop(success = true).get
    } catch {
      case e: Exception =>
        try {
          if (writer != null) {
            writer.stop(success = false)
          }
        } catch {
          case e: Exception =>
            log.debug("Could not stop writer", e)
        }
        throw e
    }
  }

Step 3: Dependency.scala:

  // Follow registerShuffle (Ctrl+Alt+B, then choose SortShuffleManager); continue at step 4
  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

Step 4: The registerShuffle method in SortShuffleManager.scala:

  /**
   * Obtains a [[ShuffleHandle]] to pass to tasks.
   */
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
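    // You can open shouldBypassMergeSort (whether the merge sort can be bypassed) to read the
    // conditions; here we only state the result.
    // shouldBypassMergeSort returns true when both of the following hold:
    // 1. there is no map-side pre-aggregation (groupByKey, for example, has none)
    // 2. the number of downstream partitions is <= spark.shuffle.sort.bypassMergeThreshold (default 200)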
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
      // need map-side aggregation, then write numPartitions files directly and just concatenate
      // them at the end. This avoids doing serialization and deserialization twice to merge
      // together the spilled files, which would happen with the normal code path. The downside is
      // having multiple files open at a time and thus more memory allocated to buffers.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
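    // You can open canUseSerializedShuffle (whether the serialized shuffle can be used) to read the
    // conditions; here we only state the result.
    // canUseSerializedShuffle returns true when all three of the following hold:
    // 1. the serializer supports relocation of serialized objects (serialized records can be reordered
    //    without being deserialized); the default Java serializer does not, Kryo does
    // 2. there is no map-side pre-aggregation
    // 3. the number of downstream partitions is <= 16777216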
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // Otherwise, buffer map outputs in a deserialized form:
      // This branch is taken when neither of the two checks above passes
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }

Step 5: The getWriter method in SortShuffleManager.scala:

  /** Get a writer for a given partition. Called on executors by map tasks. */
  override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V] = {
    numMapsForShuffle.putIfAbsent(
      handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
    val env = SparkEnv.get
    // The handle decides which ShuffleWriter subclass is returned. The conditions that produce
    // each kind of handle are explained in step 4 and are not repeated here.
    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        new UnsafeShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          context.taskMemoryManager(),
          unsafeShuffleHandle,
          mapId,
          context,
          env.conf)
      case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
        new BypassMergeSortShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          bypassMergeSortHandle,
          mapId,
          context,
          env.conf)
      case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
        new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
    }
  }

Step 6: The write method in SortShuffleWriter.scala:

  /** Write a bunch of records to this task's output */
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    // Create the sorter depending on whether there is map-side pre-aggregation
    sorter = if (dep.mapSideCombine) {
      // With map-side pre-aggregation, the second argument (the aggregator) is passed in
      new ExternalSorter[K, V, C](
        context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
    } else {
      // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
      // care whether the keys get sorted in each partition; that will be done on the reduce side
      // if the operation being run is sortByKey.
      // Without map-side pre-aggregation, the second argument (the aggregator) is None
      new ExternalSorter[K, V, V](
        context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
    }
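    // Follow insertAll (continue at step 7).
    // It inserts the records into the in-memory collection; when memory runs short, the collection
    // is sorted and spilled to a temporary file on disk.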
    sorter.insertAll(records)

    // Don't bother including the time to open the merged output file in the shuffle write time,
    // because it just opens a single file, so is typically too fast to measure accurately
    // (see SPARK-3570).
    val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
    val tmp = Utils.tempFileWith(output)
    try {
      val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
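      // Many temporary files may have been spilled earlier, so writePartitionedFile merges the spilled
      // files and the in-memory data into a single file. It merge-sorts them first: by partition, and
      // within a partition by key where a key comparator applies.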
      val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
      // writeIndexFileAndCommit produces the final index file and data file
      shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
      mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
      }
    }
  }

Step 7: The insertAll method in ExternalSorter.scala:

  // PartitionedAppendOnlyMap holds records when there is map-side pre-aggregation; its values can be accumulated and updated.
  @volatile private var map = new PartitionedAppendOnlyMap[K, C]
  // PartitionedPairBuffer holds records when there is no map-side pre-aggregation
  @volatile private var buffer = new PartitionedPairBuffer[K, C]
  
  def insertAll(records: Iterator[Product2[K, V]]): Unit = {
    // TODO: stop combining if we find that the reduction factor isn't high
    // A boolean that says whether there is pre-aggregation. It is determined by the second
    // constructor argument passed when the sorter was created in step 6.
    val shouldCombine = aggregator.isDefined

    // Taken when there is pre-aggregation
    if (shouldCombine) {
      // Combine values in-memory first using our AppendOnlyMap
      val mergeValue = aggregator.get.mergeValue
      val createCombiner = aggregator.get.createCombiner
      var kv: Product2[K, V] = null
      val update = (hadValue: Boolean, oldValue: C) => {
        if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
      }
      while (records.hasNext) {
        addElementsRead()
        kv = records.next()
        // Update the combined value for this (partition, key)
        map.changeValue((getPartition(kv._1), kv._1), update)
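        // Spill to a temporary file if needed, which is why many temporary files can appear.
        // The argument says whether the PartitionedAppendOnlyMap is the structure in use.
        // Data is sorted before it is spilled: by partition, then by key within a partition.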
        maybeSpillCollection(usingMap = true)
      }
    } else {
      // Stick values into our buffer
      while (records.hasNext) {
        addElementsRead()
        val kv = records.next()
        buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
        maybeSpillCollection(usingMap = false)
      }
    }
  }



  /**
   * Spill the current in-memory collection to disk if needed.
   *
   * @param usingMap whether we're using a map or buffer as our current in-memory collection
   */
  private def maybeSpillCollection(usingMap: Boolean): Unit = {
    var estimatedSize = 0L
    if (usingMap) {
      estimatedSize = map.estimateSize()
      if (maybeSpill(map, estimatedSize)) {
        map = new PartitionedAppendOnlyMap[K, C]
      }
    } else {
      estimatedSize = buffer.estimateSize()
      if (maybeSpill(buffer, estimatedSize)) {
        buffer = new PartitionedPairBuffer[K, C]
      }
    }

    if (estimatedSize > _peakMemoryUsedBytes) {
      _peakMemoryUsedBytes = estimatedSize
    }
  }

  /**
   * Spill our in-memory collection to a sorted file that we can merge later.
   * We add this file into `spilledFiles` to find it later.
   *
   * @param collection whichever collection we're using (map or buffer)
   */
  override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
    val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
    // Write the temporary file to disk
    val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
    spills += spillFile
  }
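
The spill-and-merge cycle from steps 6 and 7 can be condensed into a small plain-Scala sketch. This is not Spark's implementation: the names are hypothetical, a record count stands in for the memory estimate, in-memory vectors stand in for spill files, and a pairwise merge replaces the multi-way merge a real sorter would use.

```scala
object SpillSketch {
  type Rec = (Int, String) // (partition id, key)

  // Merges two sorted runs into one sorted run.
  def mergeSorted(a: Vector[Rec], b: Vector[Rec]): Vector[Rec] = {
    val ord = implicitly[Ordering[Rec]]
    val out = Vector.newBuilder[Rec]
    var i = 0
    var j = 0
    while (i < a.length && j < b.length) {
      if (ord.lteq(a(i), b(j))) { out += a(i); i += 1 } else { out += b(j); j += 1 }
    }
    out ++= a.drop(i)
    out ++= b.drop(j)
    out.result()
  }

  // Returns the fully merged output and the number of spills that happened.
  def sortWithSpills(records: Iterator[Rec], maxInMemory: Int): (Vector[Rec], Int) = {
    var buffer = Vector.empty[Rec]
    var spills = Vector.empty[Vector[Rec]]
    for (r <- records) {
      buffer = buffer :+ r
      if (buffer.size >= maxInMemory) {
        spills = spills :+ buffer.sorted // sort, then "spill" the run
        buffer = Vector.empty            // start a fresh in-memory collection
      }
    }
    // Merge every spilled run with whatever is still in memory.
    val runs = spills :+ buffer.sorted
    (runs.foldLeft(Vector.empty[Rec])(mergeSorted), spills.size)
  }
}
```

Each spilled run is sorted on its own, so the final pass only has to merge runs, which mirrors the role writePartitionedFile plays after insertAll.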