Spark's Shuffle Flow Framework and Source Code Walkthrough (A Labor of Craftsmanship) (2)

Article link: https://blog.csdn.net/fengshaungme/article/details/84566999

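In the post "Spark's Shuffle Flow Framework and Source Code Walkthrough (A Labor of Craftsmanship) (1)", we covered in detail how Spark Shuffle evolved and which data structures the shuffle relies on. All of that lays the groundwork for the detailed shuffle flow and the source walkthrough that follow. This post covers the framework of BypassMergeSortShuffleWriter and walks through its source code. The code shown is based on Spark 2.3.2.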

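1. Spark Shuffle execution flow

The figure above shows the Spark Shuffle execution flow. A shuffle is generally split into two parts: ShuffleWrite and ShuffleRead. Depending on the scenario, ShuffleWrite has three implementations: BypassMergeSortShuffleWriter, UnsafeShuffleWriter, and SortShuffleWriter. The three differ considerably, and which one is used depends on the workload and on the configured parameters. This post covers only BypassMergeSortShuffleWriter; the other two are covered in later posts.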

2. How the Spark ShuffleManager obtains a Writer

The shuffle entry point is runTask in ShuffleMapTask. Here is the code of runTask():



override def runTask(context: TaskContext): MapStatus = {
    // Deserialize the RDD using the broadcast variable.
    val threadMXBean = ManagementFactory.getThreadMXBean
    val deserializeStartTime = System.currentTimeMillis()
    val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime
    } else 0L
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
    _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
    _executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
    } else 0L

    var writer: ShuffleWriter[Any, Any] = null
    try {
      // Get the ShuffleManager; in 2.3.2 the only implementation is SortShuffleManager.
      val manager = SparkEnv.get.shuffleManager
      // Get the Writer that matches the ShuffleHandle; the handle was registered with the ShuffleManager.
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
      // Call the actual write method.
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      writer.stop(success = true).get
    } catch {
      case e: Exception =>
        try {
          if (writer != null) {
            writer.stop(success = false)
          }
        } catch {
          case e: Exception =>
            log.debug("Could not stop writer", e)
        }
        throw e
    }
  }

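In runTask(), the ShuffleManager is obtained first, and then the Writer is obtained from the ShuffleHandle that was registered with that ShuffleManager. Let's look at how the ShuffleHandle gets registered with the ShuffleManager: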

override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) { // the dependency satisfies shouldBypassMergeSort: create a BypassMergeSortShuffleHandle
      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
      // need map-side aggregation, then write numPartitions files directly and just concatenate
      // them at the end. This avoids doing serialization and deserialization twice to merge
      // together the spilled files, which would happen with the normal code path. The downside is
      // having multiple files open at a time and thus more memory allocated to buffers.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) { // canUseSerializedShuffle(dependency) holds: create a SerializedShuffleHandle
      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // Otherwise, buffer map outputs in a deserialized form:
      new BaseShuffleHandle(shuffleId, numMaps, dependency) // create a BaseShuffleHandle
    }
  }
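
The branch order in registerShuffle can be sketched in plain Java. This is only a model under stated assumptions: `HandleChooser`, `HandleKind`, and the boolean parameters are hypothetical names of mine, not Spark API, and the conditions for the serialized path (a serializer that supports relocation, no aggregator, and a partition count within 2^24) reflect my reading of canUseSerializedShuffle in the 2.3.x source, which this post does not show.

```java
// Hypothetical model of the branch order in registerShuffle; none of these names are Spark API.
final class HandleChooser {
    enum HandleKind { BYPASS_MERGE_SORT, SERIALIZED, BASE }

    // Assumption: upper bound on partitions for the serialized path (2^24).
    static final int MAX_SERIALIZED_PARTITIONS = 1 << 24;

    static HandleKind choose(boolean canBypassMergeSort,
                             boolean serializerSupportsRelocation,
                             boolean hasAggregator,
                             int numPartitions) {
        // The bypass check comes first, so it wins even when the serialized path would also qualify.
        if (canBypassMergeSort) {
            return HandleKind.BYPASS_MERGE_SORT;
        }
        // Assumed conditions for the serialized (UnsafeShuffleWriter) path.
        boolean canSerialize = serializerSupportsRelocation
                && !hasAggregator
                && numPartitions <= MAX_SERIALIZED_PARTITIONS;
        // Everything else falls back to the base handle (SortShuffleWriter).
        return canSerialize ? HandleKind.SERIALIZED : HandleKind.BASE;
    }
}
```

The point of the sketch is the priority: the bypass path is tested before the serialized path, and the base handle is the fallback for everything else.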

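registerShuffle above registers the ShuffleHandle, and the ShuffleManager uses the registered handle to decide which Writer to use. Obtaining the Writer works as follows: getWriter creates the Writer that corresponds to the registered ShuffleHandle.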

override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V] = {
    numMapsForShuffle.putIfAbsent(
      handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
    val env = SparkEnv.get
    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        new UnsafeShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          context.taskMemoryManager(),
          unsafeShuffleHandle,
          mapId,
          context,
          env.conf)
      case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
        new BypassMergeSortShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          bypassMergeSortHandle,
          mapId,
          context,
          env.conf)
      case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
        new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
    }
  }

That is how the ShuffleManager obtains the different Writers. Once the Writer has been obtained, it starts writing the data output by the ShuffleMapTask to disk.
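
One detail of getWriter worth spelling out is the order of the match cases. getWriter casts every handle to BaseShuffleHandle to read numMaps, so both specialised handles are subtypes of it, and the BaseShuffleHandle case has to come last or it would catch everything. A minimal Java sketch of that dispatch follows; the three handle classes and `writerFor` are hypothetical stand-ins, and the method returns writer names as strings rather than real writers.

```java
// Hypothetical stand-ins for the handle hierarchy; not Spark classes.
final class WriterDispatch {
    static class BaseHandle {}
    static class SerializedHandle extends BaseHandle {}
    static class BypassHandle extends BaseHandle {}

    // Returns the name of the writer that getWriter would create for this handle.
    static String writerFor(BaseHandle handle) {
        // Most specific types first: both subclasses are also BaseHandle instances.
        if (handle instanceof SerializedHandle) {
            return "UnsafeShuffleWriter";
        }
        if (handle instanceof BypassHandle) {
            return "BypassMergeSortShuffleWriter";
        }
        // Fallback for a plain base handle.
        return "SortShuffleWriter";
    }
}
```

If the base case were tested first, every handle would be routed to SortShuffleWriter, which is why the fallback sits at the bottom.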

3. BypassMergeSortShuffleWriter execution architecture

4. BypassMergeSortShuffleWriter source code walkthrough

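Once the BypassMergeSortShuffleWriter has been obtained as described above, the code executes as follows:

The conditions for using BypassMergeSortShuffleWriter are listed below. This shuffle mode resembles HashShuffle, except that all the data a ShuffleMapTask outputs is finally merged into a single file.

no Aggregator is specified: there is no aggregation on the map side
no Ordering is specified: there is no sorting on the map side
the number of partitions is less than spark.shuffle.sort.bypassMergeThreshold: the number of reduce partitions must stay below this configured value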
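
The conditions above can be turned into a small predicate. This is a sketch, not the Spark method: `BypassCheck` and `shouldBypass` are hypothetical names. Two details are my own reading of the 2.3.x source and should be treated as assumptions: the threshold defaults to 200, and the comparison appears to be inclusive (`<=`), slightly looser than the "less than" wording. The sketch models only map-side combine and the partition count; ordering is not modelled.

```java
// Hypothetical sketch of the bypass decision; not the Spark implementation.
final class BypassCheck {
    // Assumption: default value of spark.shuffle.sort.bypassMergeThreshold.
    static final int DEFAULT_THRESHOLD = 200;

    static boolean shouldBypass(boolean mapSideCombine, int numPartitions, int threshold) {
        // Map-side aggregation needs the sort-based path, so bypass is ruled out.
        if (mapSideCombine) {
            return false;
        }
        // Assumption: the boundary value itself still qualifies for bypass.
        return numPartitions <= threshold;
    }
}
```

The threshold exists because the writer keeps one open file and one buffer per reduce partition, so the cost grows with the partition count.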

First, the write method in BypassMergeSortShuffleWriter.java is executed.

@Override
  public void write(Iterator<Product2<K, V>> records) throws IOException {
    assert (partitionWriters == null);
    // If the ShuffleMapTask produced no records, commit an index file with all-zero lengths and set a matching MapStatus.
    if (!records.hasNext()) {
      partitionLengths = new long[numPartitions];
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, null);
      mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
      return;
    }
    // Create a serializer instance.
    final SerializerInstance serInstance = serializer.newInstance();
    final long openStartTime = System.nanoTime();
    // One DiskBlockObjectWriter per partition; numPartitions is the number of reduce tasks.
    partitionWriters = new DiskBlockObjectWriter[numPartitions];
    // One FileSegment per partition; each records a file, an offset, and a length.
    partitionWriterSegments = new FileSegment[numPartitions];
    for (int i = 0; i < numPartitions; i++) {
      // Create a temporary file: one per partition.
      final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
        blockManager.diskBlockManager().createTempShuffleBlock();
      final File file = tempShuffleBlockIdPlusFile._2();
      final BlockId blockId = tempShuffleBlockIdPlusFile._1();
      // Create a disk writer for each temporary file, used to write records into that file.
      partitionWriters[i] =
        blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, and can take a long time in aggregate when we open many files, so should be
    // included in the shuffle write time.
    writeMetrics.incWriteTime(System.nanoTime() - openStartTime);

    while (records.hasNext()) {
      final Product2<K, V> record = records.next();
      final K key = record._1();
      // Partition each record by key and write it to the file for that partition.
      partitionWriters[partitioner.getPartition(key)].write(key, record._2());
    }
    // Commit every partition writer so all records reach its file, and keep the FileSegment each one returns.
    for (int i = 0; i < numPartitions; i++) { 
      final DiskBlockObjectWriter writer = partitionWriters[i];
      partitionWriterSegments[i] = writer.commitAndGet();
      writer.close();
    }
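    // Resolve the final output data file.
    File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
    // Create a temporary file alongside the final output file.
    File tmp = Utils.tempFileWith(output);
    try {
      // Concatenate all per-partition temporary files into that single file.
      partitionLengths = writePartitionedFile(tmp);
      // Write the index file, which records each partition's offset so that reducers can later
      // locate their own portion of the data directly when fetching.
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);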
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
      }
    }
    // Finally, record this block manager's id and the partition lengths in the MapStatus.
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
  }

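The write method is the top-level call. It first creates one temporary file per partition. As the ShuffleMapTask outputs records, the partitioner maps each key to a partition and the record is written straight into the temporary file created for that partition. Once all records have been written, each writer is committed and returns a FileSegment describing its file. A single final output file is then created by concatenating all the temporary files, and the Index file is written from the per-partition lengths. Finally, a MapStatus is built from the block manager's id and the partition lengths, and it is this MapStatus that gets registered with the MapOutputTrackerMaster.

Now look at writePartitionedFile: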
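
The first half of write, routing records into one file per partition, can be sketched with nothing but the JDK. This is an illustration under simplifying assumptions: records are plain string pairs written as text lines instead of serialized objects, the hash-modulo partitioner stands in for whatever Partitioner the job uses, and `BypassPartitionSketch`, `partitionOf`, and `writePartitions` are hypothetical names.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of the per-partition write phase; not Spark code.
final class BypassPartitionSketch {
    // Stand-in partitioner: non-negative modulo of the key's hash code.
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Writes each {key, value} pair as "key=value\n" into the temp file of its partition.
    // Returns the temp files indexed by partition id.
    static Path[] writePartitions(List<String[]> records, int numPartitions, Path dir)
            throws IOException {
        Path[] files = new Path[numPartitions];
        Writer[] writers = new Writer[numPartitions];
        try {
            // Open one temp file and one writer per partition up front.
            for (int i = 0; i < numPartitions; i++) {
                files[i] = Files.createTempFile(dir, "temp_shuffle_", ".part");
                writers[i] = Files.newBufferedWriter(files[i], StandardCharsets.UTF_8);
            }
            // Route every record to the writer that owns its partition; no sorting, no buffering of records.
            for (String[] kv : records) {
                writers[partitionOf(kv[0], numPartitions)].write(kv[0] + "=" + kv[1] + "\n");
            }
        } finally {
            // Closing flushes the buffered data to disk.
            for (Writer w : writers) {
                if (w != null) {
                    w.close();
                }
            }
        }
        return files;
    }
}
```

All the writers stay open for the whole pass, which is the trade-off noted in the registerShuffle comment: no sorting or spilling, but as many open files and buffers as there are partitions.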

/**
   * Concatenate all of the per-partition files into a single combined file.
   *
   * @return array of lengths, in bytes, of each partition of the file (used by map output tracker).
   */
  private long[] writePartitionedFile(File outputFile) throws IOException {
    // Track location of the partition starts in the output file
    final long[] lengths = new long[numPartitions];
    if (partitionWriters == null) {
      // We were passed an empty iterator
      return lengths;
    }
    // Open an output stream on the combined file (append mode).
    final FileOutputStream out = new FileOutputStream(outputFile, true);
    final long writeStartTime = System.nanoTime();
    boolean threwException = true;
    try {
      // Iterate over the FileSegment array created above and take the file behind each segment.
      for (int i = 0; i < numPartitions; i++) {
        final File file = partitionWriterSegments[i].file();
        if (file.exists()) {
          // Open an input stream on that per-partition file.
          final FileInputStream in = new FileInputStream(file);
          boolean copyThrewException = true;
          try {
            lengths[i] = Utils.copyStream(in, out, false, transferToEnabled);
            copyThrewException = false;
          } finally {
            Closeables.close(in, copyThrewException);
          }
          if (!file.delete()) {
            logger.error("Unable to delete file for partition {}", i);
          }
        }
      }
      threwException = false;
    } finally {
      Closeables.close(out, threwException);
      writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
    }
    // Reset partitionWriters.
    partitionWriters = null;
    // Return the length in bytes of each partition within the combined file.
    return lengths;
  }

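writePartitionedFile concatenates the per-partition temporary files on disk into one large file and returns an array holding the length of each partition, not the length of the whole file. After the merge, an index file still has to be created, so that each ReduceTask can later use an offset to fetch just the part of that single file that belongs to its partition.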
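
The concatenation step can be sketched with `FileChannel.transferTo`, which is the kind of channel-to-channel copy the transferToEnabled flag suggests. This is a sketch under assumptions rather than a copy of Utils.copyStream: `ConcatSketch` and `concatenate` are hypothetical names, and the output is opened with truncation instead of append mode to keep the example simple.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of merging per-partition files into one data file; not Spark code.
final class ConcatSketch {
    // Appends parts[0], parts[1], ... to output in order and returns each part's length in bytes.
    static long[] concatenate(Path[] parts, Path output) throws IOException {
        long[] lengths = new long[parts.length];
        try (FileChannel out = FileChannel.open(output,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < parts.length; i++) {
                if (!Files.exists(parts[i])) {
                    continue; // a missing file leaves this partition's length at 0
                }
                try (FileChannel in = FileChannel.open(parts[i], StandardOpenOption.READ)) {
                    long size = in.size();
                    long copied = 0;
                    // transferTo may move fewer bytes than requested, so loop until done.
                    while (copied < size) {
                        copied += in.transferTo(copied, size - copied, out);
                    }
                    lengths[i] = size;
                }
                // The per-partition file is no longer needed once its bytes are in the output.
                Files.delete(parts[i]);
            }
        }
        return lengths;
    }
}
```

Because the parts are appended in partition order, the lengths array alone is enough to reconstruct where each partition starts in the combined file.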

Now look at how the index file is created:

/**
   * Write an index file with the offsets of each block, plus a final offset at the end for the
   * end of the output file. This will be used by getBlockData to figure out where each block
   * begins and ends.
   *
   * It will commit the data and index file as an atomic operation, use the existing ones, or
   * replace them with new ones.
   *
   * Note: the `lengths` will be updated to match the existing index file if use the existing ones.
   */
  def writeIndexFileAndCommit(
      shuffleId: Int,
      mapId: Int,
      lengths: Array[Long],
      dataTmp: File): Unit = {
    // Resolve the index file for this map output.
    val indexFile = getIndexFile(shuffleId, mapId)
    // Create a temporary file alongside the index file.
    val indexTmp = Utils.tempFileWith(indexFile)
    try {
      val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))
      Utils.tryWithSafeFinally {
        // We take in lengths of each block, need to convert it to offsets.
        var offset = 0L
        // Write the starting offset 0, then add each partition length in turn to get the end offset of every block.
        out.writeLong(offset)
        for (length <- lengths) {
          offset += length
          out.writeLong(offset)
        }
      } {
        out.close()
      }

      val dataFile = getDataFile(shuffleId, mapId)
      // There is only one IndexShuffleBlockResolver per executor, this synchronization make sure
      // the following check and rename are atomic.
      synchronized {
        val existingLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length)
        if (existingLengths != null) {
          // Another attempt for the same task has already written our map outputs successfully,
          // so just use the existing partition lengths and delete our temporary map outputs.
          System.arraycopy(existingLengths, 0, lengths, 0, lengths.length)
          if (dataTmp != null && dataTmp.exists()) {
            dataTmp.delete()
          }
          indexTmp.delete()
        } else {
          // This is the first successful attempt in writing the map outputs for this task,
          // so override any existing index and data files with the ones we wrote.
          if (indexFile.exists()) {
            indexFile.delete()
          }
          if (dataFile.exists()) {
            dataFile.delete()
          }
          if (!indexTmp.renameTo(indexFile)) {
            throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
          }
          if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
            throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
          }
        }
      }
    } finally {
      if (indexTmp.exists() && !indexTmp.delete()) {
        logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")
      }
    }
  }

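writeIndexFileAndCommit records where each partition's data sits inside the final combined file, so that each reduce partition can use this information to fetch its own data. Finally, the block manager's id and the partition lengths are wrapped in a MapStatus, which is then registered with the MapOutputTrackerMaster. At that point the BypassMergeSortShuffleWriter is done.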
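
The index layout is easy to check by hand: for N partitions the file holds N + 1 longs, starting at 0, where entry i is the start of partition i and entry i + 1 is its end. A minimal Java sketch of writing that layout and reading one partition's byte range back follows. `IndexSketch` and its methods are hypothetical names, and the index is kept in a byte array rather than a file to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical sketch of the index layout; not Spark code.
final class IndexSketch {
    // Converts per-partition lengths into cumulative offsets: [0, l0, l0 + l1, ...].
    static long[] toOffsets(long[] lengths) {
        long[] offsets = new long[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            offsets[i + 1] = offsets[i] + lengths[i];
        }
        return offsets;
    }

    // Serializes the offsets as consecutive 8-byte big-endian longs, as DataOutputStream does.
    static byte[] encodeIndex(long[] lengths) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (long offset : toOffsets(lengths)) {
                out.writeLong(offset);
            }
        }
        return bytes.toByteArray();
    }

    // Returns {start, end} of the given partition inside the data file.
    static long[] rangeOf(byte[] index, int partition) {
        // ByteBuffer is big-endian by default, which matches writeLong.
        ByteBuffer buf = ByteBuffer.wrap(index);
        long start = buf.getLong(partition * 8);
        long end = buf.getLong((partition + 1) * 8);
        return new long[] {start, end};
    }
}
```

For lengths {2, 0, 3} the offsets are {0, 2, 2, 5}: partition 1 is empty because its start and end coincide, and partition 2 occupies bytes 2 up to 5 of the data file.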