Overview
SortShuffleManager uses UnsafeShuffleWriter when all of the following conditions are met, and otherwise falls back to SortShuffleWriter:
- The serializer supports relocation. "Supports relocation" means the serializer can reorder already-serialized objects, and the result is the same as sorting the records first and then serializing them. KryoSerializer supports relocation; Spark's default, JavaSerializer, does not. The serializer is configured via the spark.serializer parameter (see the configuration sketch at the end of this overview);
- No map-side aggregation is needed, i.e. no aggregator is defined;
- The number of partitions does not exceed the threshold (2^24);
UnsafeShuffleWriter serializes each record and inserts it into the sorter, sorts the serialized records, writes them to disk as spill files once sorting completes, and finally merges the spill files into a single output file. During the merge it chooses the most suitable merge strategy based on the number of spill files and the IO compression codec.
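Since the default JavaSerializer does not support relocation, a job that wants to take the UnsafeShuffleWriter path typically switches to Kryo. A minimal sketch, assuming a plain SparkConf-based setup (spark.serializer and the KryoSerializer class name are standard Spark settings; the app name is arbitrary):

import org.apache.spark.SparkConf

// Switch the serializer to Kryo so that supportsRelocationOfSerializedObjects
// returns true and the serialized (unsafe) shuffle path becomes eligible.
val conf = new SparkConf()
  .setAppName("unsafe-shuffle-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")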
Source Code Analysis
ShuffleMapTask obtains the ShuffleManager
When ShuffleMapTask runs a task via its runTask() method, it obtains the ShuffleManager from SparkEnv.
The runTask() method of ShuffleMapTask is as follows:
override def runTask(context: TaskContext): MapStatus = {
// Deserialize the RDD using the broadcast variable.
val deserializeStartTime = System.currentTimeMillis()
val ser = SparkEnv.get.closureSerializer.newInstance()
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
_executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
metrics = Some(context.taskMetrics)
var writer: ShuffleWriter[Any, Any] = null
try {
//Obtain the ShuffleManager from SparkEnv
val manager = SparkEnv.get.shuffleManager
//Get a writer from the ShuffleManager for this shuffle handle
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
//Write the partition's records via the writer
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
return writer.stop(success = true).get
} catch {
case e: Exception =>
try {
if (writer != null) {
writer.stop(success = false)
}
} catch {
case e: Exception =>
log.debug("Could not stop writer", e)
}
throw e
}
}
SortShuffleManager obtains the writer
SortShuffleManager chooses a ShuffleHandle according to which conditions are met; each ShuffleHandle maps to a shuffle writer as follows:
| ShuffleHandle | ShuffleWriter |
| --- | --- |
| BypassMergeSortShuffleHandle | BypassMergeSortShuffleWriter |
| SerializedShuffleHandle | UnsafeShuffleWriter |
| BaseShuffleHandle | SortShuffleWriter |
The registerShuffle method
SortShuffleManager.scala
/**
* Obtains a [[ShuffleHandle]] to pass to tasks.
*/
override def registerShuffle[K, V, C](
shuffleId: Int,
numMaps: Int,
dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
// If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
// need map-side aggregation, then write numPartitions files directly and just concatenate
// them at the end. This avoids doing serialization and deserialization twice to merge
// together the spilled files, which would happen with the normal code path. The downside is
// having multiple files open at a time and thus more memory allocated to buffers.
new BypassMergeSortShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
// Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
new SerializedShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else {
// Otherwise, buffer map outputs in a deserialized form:
new BaseShuffleHandle(shuffleId, numMaps, dependency)
}
}
The canUseSerializedShuffle method
This method checks whether the conditions for using UnsafeShuffleWriter are met:
- The serializer supports relocation;
- No map-side aggregation is needed, i.e. no aggregator is defined;
- The number of partitions does not exceed the threshold (2^24; see the note after the code listing);
SortShuffleManager.scala
/**
* Helper method for determining whether a shuffle should use an optimized serialized shuffle
* path or whether it should fall back to the original path that operates on deserialized objects.
*/
def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
val shufId = dependency.shuffleId
val numPartitions = dependency.partitioner.numPartitions
//Check whether the serializer supports relocation
if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
s"${dependency.serializer.getClass.getName}, does not support object relocation")
false
//Check whether a map-side aggregator is defined
} else if (dependency.aggregator.isDefined) {
log.debug(
s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined")
false
//Check whether the number of partitions exceeds the threshold
} else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
false
} else {
log.debug(s"Can use serialized shuffle for shuffle $shufId")
true
}
}
}
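The 2^24 threshold (MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE = 16777216) comes from how the serialized shuffle packs the partition id into a 64-bit record pointer: PackedRecordPointer reserves 24 bits for the partition id. A rough sketch of that packing idea, assuming the documented 24/13/27-bit layout (the helper below is illustrative, not the actual Spark class):

// Illustrative sketch of the PackedRecordPointer layout (not the actual class):
// [24 bits partition id][13 bits memory page number][27 bits offset in page]
object PackedPointerSketch {
  val MaximumPartitionId: Int = (1 << 24) - 1 // 16777215, hence the 2^24 partition limit

  def pack(partitionId: Int, pageNumber: Int, offsetInPage: Long): Long = {
    require(partitionId <= MaximumPartitionId, "too many shuffle output partitions")
    (partitionId.toLong << 40) | (pageNumber.toLong << 27) | (offsetInPage & ((1L << 27) - 1))
  }
}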
The getWriter method
If the conditions are met, the handle is a SerializedShuffleHandle and an UnsafeShuffleWriter is created to write the data.
SortShuffleManager.scala
/** Get a writer for a given partition. Called on executors by map tasks. */
override def getWriter[K, V](
handle: ShuffleHandle,
mapId: Int,
context: TaskContext): ShuffleWriter[K, V] = {
numMapsForShuffle.putIfAbsent(
handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
val env = SparkEnv.get
handle match {
case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
//Create an UnsafeShuffleWriter
new UnsafeShuffleWriter(
env.blockManager,
shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
context.taskMemoryManager(),
unsafeShuffleHandle,
mapId,
context,
env.conf)
case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
new BypassMergeSortShuffleWriter(
env.blockManager,
shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
bypassMergeSortHandle,
mapId,
context,
env.conf)
case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
}
}
UnsafeShuffleWriter
The write method
1. Partition and serialize each record, then insert it into the sorter.
2. Sort the records; when sorting finishes, write them to disk as spill files, then merge the spill files into a single output file.
@Override
public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
// Keep track of success so we know if we encountered an exception
// We do this rather than a standard try/catch/re-throw to handle
// generic throwables.
boolean success = false;
try {
//Partition and serialize each record, then insert it into the sorter
while (records.hasNext()) {
insertRecordIntoSorter(records.next());
}
//Sort the records, write them to disk as spill files once sorting finishes, then merge the spill files into a single output file
closeAndWriteOutput();
success = true;
} finally {
if (sorter != null) {
try {
sorter.cleanupResources();
} catch (Exception e) {
// Only throw this error if we won't be masking another
// error.
if (success) {
throw e;
} else {
logger.error("In addition to a failure during writing, we failed during " +
"cleanup.", e);
}
}
}
}
}
The insertRecordIntoSorter method
This method partitions each record, serializes it, and inserts it into the sorter. It works as follows:
- Call partitioner.getPartition on the record's key to determine which partition the record goes to and obtain that partition's partitionId;
- Serialize the record's key and value into the buf field underlying the ByteArrayOutputStream;
- Insert the serialized record into the ShuffleExternalSorter;
When partitioning a record, assuming a HashPartitioner is used, getPartition takes the key's hashCode modulo numPartitions to determine the record's partition, as sketched below.
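A minimal sketch of that behavior, simplified from what HashPartitioner does (the real implementation delegates to Utils.nonNegativeMod):

// Simplified sketch of HashPartitioner.getPartition: hash the key and take a
// non-negative modulo so the result is a valid partition id in [0, numPartitions).
def getPartition(key: Any, numPartitions: Int): Int = key match {
  case null => 0
  case _ =>
    val rawMod = key.hashCode % numPartitions
    rawMod + (if (rawMod < 0) numPartitions else 0)
}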
When a record is serialized, assuming JavaSerializer is used, the streams are layered (bottom-up) as follows:
MyByteArrayOutputStream -> ObjectOutputStream -> JavaSerializationStream
Here, MyByteArrayOutputStream is a subclass of ByteArrayOutputStream that directly exposes the buf[] field;
ObjectOutputStream is the serialization stream from java.io; its writeObject method writes an object into the stream;
JavaSerializationStream is a wrapper around ObjectOutputStream.
In the end, the serialized record is stored in the buf field of the ByteArrayOutputStream.
In the code below, serBuffer is an instance of MyByteArrayOutputStream; calling getBuf returns the underlying ByteArrayOutputStream's buf field (a sketch of this class's idea follows the code listing).
serOutputStream is an instance of SerializationStream; depending on SparkConf it is initialized as either a JavaSerializationStream or a KryoSerializationStream. The Kryo serialization process is not covered here.
Note: JavaSerializer does not support relocation, so JavaSerializationStream could never actually be used on this path; it is only mentioned as an example.
@VisibleForTesting
void insertRecordIntoSorter(Product2<K, V> record) throws IOException {
assert(sorter != null);
final K key = record._1();
//getPartition determines which partition the record's key falls into and returns that partitionId
final int partitionId = partitioner.getPartition(key);
serBuffer.reset();
//Serialize the record's key into serBuffer's underlying buf field
serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
//Serialize the record's value into serBuffer's underlying buf field
serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
serOutputStream.flush();
final int serializedRecordSize = serBuffer.size();
assert (serializedRecordSize > 0);
//Insert the serialized record into the sorter
sorter.insertRecord(
serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
}
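The only reason for MyByteArrayOutputStream above is to expose ByteArrayOutputStream's protected buf array so the serialized bytes can be handed to sorter.insertRecord without an extra copy. A minimal Scala sketch of the same idea (the real class is a small inner Java class of UnsafeShuffleWriter):

import java.io.ByteArrayOutputStream

// Sketch: expose the protected buf field so callers can read the serialized
// bytes in place instead of copying them via toByteArray().
class ExposedByteArrayOutputStream(size: Int) extends ByteArrayOutputStream(size) {
  def getBuf: Array[Byte] = buf
}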
The closeAndWriteOutput method
Sorts the records, writes them to disk as spill files once sorting finishes, and then merges the spill files into a single output file.
The method works as follows:
1. Sort the in-memory records, write them to disk as spill files once sorting finishes, and return the metadata of these spill files as a SpillInfo[];
2. Construct the final output file; its name is (with reduceId fixed at 0): "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId;
3. Append a UUID to the output file name to mark it as being written; the file is renamed when writing finishes;
4. Merge the spill files into one output file, choosing the most suitable merge strategy based on the number of spill files and the IO compression codec;
5. Write each partition's offset into an index file so the reduce side can fetch its data (see the offset sketch after the code listing);
@VisibleForTesting
void closeAndWriteOutput() throws IOException {
assert(sorter != null);
updatePeakMemoryUsed();
serBuffer = null;
serOutputStream = null;
//Sort the in-memory records, write them to disk as spill files, and return the spill files' metadata as SpillInfo[]
final SpillInfo[] spills = sorter.closeAndGetSpills();
sorter = null;
final long[] partitionLengths;
//Construct the final output file; its name (with reduceId fixed at 0) is:
//"shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
final File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
//Append a UUID to the output file name to mark it as being written; it is renamed when writing finishes
final File tmp = Utils.tempFileWith(output);
try {
try {
//Merge the spill files into one output file, choosing the most suitable merge strategy based on the number of spill files and the IO compression codec
partitionLengths = mergeSpills(spills, tmp);
} finally {
for (SpillInfo spill : spills) {
if (spill.file.exists() && ! spill.file.delete()) {
logger.error("Error while deleting spill file {}", spill.file.getPath());
}
}
}
//Write each partition's offset into the index file so the reduce side can fetch its data
shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
} finally {
if (tmp.exists() && !tmp.delete()) {
logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
}
}
mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}
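As noted in step 5, writeIndexFileAndCommit records where each partition starts inside the single data file. A hedged sketch of the idea (not the actual IndexShuffleBlockResolver code): the index file holds numPartitions + 1 cumulative offsets, and reducer i fetches the byte range [offsets(i), offsets(i + 1)) from the data file.

// Sketch: turn per-partition lengths into cumulative offsets, as an index file does.
// For partitionLengths = [10, 0, 25] the offsets are [0, 10, 10, 35]:
// reducer 0 reads bytes [0, 10), reducer 1 reads nothing, reducer 2 reads [10, 35).
def toOffsets(partitionLengths: Array[Long]): Array[Long] =
  partitionLengths.scanLeft(0L)(_ + _)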
The mergeSpills method
Merges the spill files into a single output file, choosing the most suitable merge strategy based on the number of spill files and the IO compression codec.
When there are multiple spill files, the merge strategy is chosen as follows:
1. Read from SparkConf whether fast merge is enabled and whether it is supported; if both hold, take the fast merge path, otherwise take the slow merge path.
2. On the fast merge path, if transferTo is enabled and encryption is not required, use the transferTo-based fast merge; otherwise use the fileStream-based fast merge.
/**
* Merge zero or more spill files together, choosing the fastest merging strategy based on the
* number of spills and the IO compression codec.
*
* @return the partition lengths in the merged file.
*/
private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
//Read from SparkConf whether shuffle compression is enabled
final boolean compressionEnabled = sparkConf.getBoolean("spark.shuffle.compress", true);
//Create the CompressionCodec configured in SparkConf
final CompressionCodec compressionCodec = CompressionCodec$.MODULE$.createCodec(sparkConf);
//Read from SparkConf whether fast merge is enabled
final boolean fastMergeEnabled =
sparkConf.getBoolean("spark.shuffle.unsafe.fastMergeEnabled", true);
final boolean fastMergeIsSupported = !compressionEnabled ||
CompressionCodec$.MODULE$.supportsConcatenationOfSerializedStreams(compressionCodec);
final boolean encryptionEnabled = blockManager.serializerManager().encryptionEnabled();
try {
if (spills.length == 0) {
new FileOutputStream(outputFile).close(); // Create an empty file
return new long[partitioner.numPartitions()];
} else if (spills.length == 1) {
// Here, we don't need to perform any metrics updates because the bytes written to this
// output file would have already been counted as shuffle bytes written.
Files.move(spills[0].file, outputFile);
return spills[0].partitionLengths;
} else {
final long[] partitionLengths;
// There are multiple spills to merge, so none of these spill files' lengths were counted
// towards our shuffle write count or shuffle write time. If we use the slow merge path,
// then the final output file's size won't necessarily be equal to the sum of the spill
// files' sizes. To guard against this case, we look at the output file's actual size when
// computing shuffle bytes written.
//
// We allow the individual merge methods to report their own IO times since different merge
// strategies use different IO techniques. We count IO during merge towards the shuffle
// shuffle write time, which appears to be consistent with the "not bypassing merge-sort"
// branch in ExternalSorter.
if (fastMergeEnabled && fastMergeIsSupported) {
// Compression is disabled or we are using an IO compression codec that supports
// decompression of concatenated compressed streams, so we can perform a fast spill merge
// that doesn't need to interpret the spilled bytes.
if (transferToEnabled && !encryptionEnabled) {
logger.debug("Using transferTo-based fast merge");
partitionLengths = mergeSpillsWithTransferTo(spills, outputFile);
} else {
logger.debug("Using fileStream-based fast merge");
partitionLengths = mergeSpillsWithFileStream(spills, outputFile, null);
}
} else {
logger.debug("Using slow merge");
partitionLengths = mergeSpillsWithFileStream(spills, outputFile, compressionCodec);
}
// When closing an UnsafeShuffleExternalSorter that has already spilled once but also has
// in-memory records, we write out the in-memory records to a file but do not count that
// final write as bytes spilled (instead, it's accounted as shuffle write). The merge needs
// to be counted as shuffle write, but this will lead to double-counting of the final
// SpillInfo's bytes.
writeMetrics.decBytesWritten(spills[spills.length - 1].file.length());
writeMetrics.incBytesWritten(outputFile.length());
return partitionLengths;
}
} catch (IOException e) {
if (outputFile.exists() && !outputFile.delete()) {
logger.error("Unable to delete output file {}", outputFile.getPath());
}
throw e;
}
}
The mergeSpillsWithFileStream method
Merges spill files using Java FileStreams.
This merge path is noticeably slower than the NIO (transferTo) based merge, UnsafeShuffleWriter#mergeSpillsWithTransferTo(SpillInfo[], File), so it is mainly used in the following cases:
1. The IO compression codec does not support concatenation of compressed data;
2. Encryption is enabled;
3. The user has explicitly disabled transferTo. Linux kernel version 2.6.32 has a bug that shows up on the NIO path, in which case spark.file.transferTo should be set to false (see the sketch after this list).
4. Each partition within a spill file is small. In that case mergeSpillsWithFileStream is actually faster, because mergeSpillsWithTransferTo would perform many small disk IOs, which is inefficient. With many small IOs, using large buffers for the input and output files helps reduce the number of disk IOs and makes the merge faster.
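A minimal sketch of how item 3 would be applied, assuming a plain SparkConf-based setup (spark.file.transferTo is the flag read by UnsafeShuffleWriter, as its own error message below confirms; the app name is arbitrary):

import org.apache.spark.SparkConf

// Force the fileStream-based merge by disabling the NIO transferTo path,
// e.g. to work around the 2.6.32 kernel bug mentioned above.
val conf = new SparkConf()
  .setAppName("shuffle-merge-demo")
  .set("spark.file.transferTo", "false")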
The method works as follows:
1. Create an input stream for each spill file. The streams and their decoration order are:
NioBufferedFileInputStream -> LimitedInputStream -> CryptoInputStream -> compressedInputStream
Note: compressedInputStream is not a real class name; it is shorthand for whichever of ZstdInputStream, SnappyInputStream, LZFInputStream, or LZ4BlockInputStream is in use. The same applies to compressedOutputStream.
2. Create an output stream for the final output file (outputFile). The streams and their decoration order are:
FileOutputStream -> BufferedOutputStream -> CountingOutputStream -> TimeTrackingOutputStream -> CloseAndFlushShieldOutputStream -> CryptoOutputStream -> compressedOutputStream
3. Copy all bytes from the input streams to the output stream;
/**
* Merges spill files using Java FileStreams. This code path is typically slower than
* the NIO-based merge, {@link UnsafeShuffleWriter#mergeSpillsWithTransferTo(SpillInfo[],
* File)}, and it's mostly used in cases where the IO compression codec does not support
* concatenation of compressed data, when encryption is enabled, or when users have
* explicitly disabled use of {@code transferTo} in order to work around kernel bugs.
* This code path might also be faster in cases where individual partition size in a spill
* is small and UnsafeShuffleWriter#mergeSpillsWithTransferTo method performs many small
* disk ios which is inefficient. In those case, Using large buffers for input and output
* files helps reducing the number of disk ios, making the file merging faster.
*
* @param spills the spills to merge.
* @param outputFile the file to write the merged data to.
* @param compressionCodec the IO compression codec, or null if shuffle compression is disabled.
* @return the partition lengths in the merged file.
*/
private long[] mergeSpillsWithFileStream(
SpillInfo[] spills,
File outputFile,
@Nullable CompressionCodec compressionCodec) throws IOException {
assert (spills.length >= 2);
final int numPartitions = partitioner.numPartitions();
final long[] partitionLengths = new long[numPartitions];
final InputStream[] spillInputStreams = new InputStream[spills.length];
//Create the output stream for the final output file
final OutputStream bos = new BufferedOutputStream(
new FileOutputStream(outputFile),
outputBufferSizeInBytes);
// Use a counting output stream to avoid having to close the underlying file and ask
// the file system for its size after each partition is written.
final CountingOutputStream mergedFileOutputStream = new CountingOutputStream(bos);
boolean threwException = true;
try {
//Create an input stream for each spill file
for (int i = 0; i < spills.length; i++) {
spillInputStreams[i] = new NioBufferedFileInputStream(
spills[i].file,
inputBufferSizeInBytes);
}
//Outer loop: iterate over partitions
for (int partition = 0; partition < numPartitions; partition++) {
final long initialFileLength = mergedFileOutputStream.getByteCount();
// Shield the underlying output stream from close() and flush() calls, so that we can close
// the higher level streams to make sure all data is really flushed and internal state is
// cleaned.
OutputStream partitionOutput = new CloseAndFlushShieldOutputStream(
new TimeTrackingOutputStream(writeMetrics, mergedFileOutputStream));
partitionOutput = blockManager.serializerManager().wrapForEncryption(partitionOutput);
if (compressionCodec != null) {
partitionOutput = compressionCodec.compressedOutputStream(partitionOutput);
}
//Inner loop: iterate over spill files
for (int i = 0; i < spills.length; i++) {
final long partitionLengthInSpill = spills[i].partitionLengths[partition];
if (partitionLengthInSpill > 0) {
InputStream partitionInputStream = new LimitedInputStream(spillInputStreams[i],
partitionLengthInSpill, false);
try {
partitionInputStream = blockManager.serializerManager().wrapForEncryption(
partitionInputStream);
if (compressionCodec != null) {
partitionInputStream = compressionCodec.compressedInputStream(partitionInputStream);
}
//Copy all bytes from the input stream to the output stream
ByteStreams.copy(partitionInputStream, partitionOutput);
} finally {
partitionInputStream.close();
}
}
}
partitionOutput.flush();
partitionOutput.close();
partitionLengths[partition] = (mergedFileOutputStream.getByteCount() - initialFileLength);
}
threwException = false;
} finally {
// To avoid masking exceptions that caused us to prematurely enter the finally block, only
// throw exceptions during cleanup if threwException == false.
for (InputStream stream : spillInputStreams) {
Closeables.close(stream, threwException);
}
Closeables.close(mergedFileOutputStream, threwException);
}
return partitionLengths;
}
The mergeSpillsWithTransferTo method
Merges spill files by using NIO's transferTo to concatenate the spill partitions' bytes.
This is only safe when the IO compression codec and the serializer support concatenation of serialized streams.
The method works as follows:
1. Create an input stream for each spill file and obtain its FileChannel;
2. Create an output stream for the final output file (outputFile) and obtain its FileChannel;
3. Call transferTo on each input FileChannel to transfer its bytes to the output FileChannel;
/**
* Merges spill files by using NIO's transferTo to concatenate spill partitions' bytes.
* This is only safe when the IO compression codec and serializer support concatenation of
* serialized streams.
*
* @return the partition lengths in the merged file.
*/
private long[] mergeSpillsWithTransferTo(SpillInfo[] spills, File outputFile) throws IOException {
assert (spills.length >= 2);
final int numPartitions = partitioner.numPartitions();
final long[] partitionLengths = new long[numPartitions];
final FileChannel[] spillInputChannels = new FileChannel[spills.length];
final long[] spillInputChannelPositions = new long[spills.length];
FileChannel mergedFileOutputChannel = null;
boolean threwException = true;
try {
//Create an input stream for each spill file and obtain its channel
for (int i = 0; i < spills.length; i++) {
spillInputChannels[i] = new FileInputStream(spills[i].file).getChannel();
}
// This file needs to opened in append mode in order to work around a Linux kernel bug that
// affects transferTo; see SPARK-3948 for more details.
//Create the output stream for the final output file and obtain its channel
//The output file must be opened in append mode
mergedFileOutputChannel = new FileOutputStream(outputFile, true).getChannel();
long bytesWrittenToMergedFile = 0;
//Outer loop: iterate over partitions
for (int partition = 0; partition < numPartitions; partition++) {
//Inner loop: iterate over spill files
for (int i = 0; i < spills.length; i++) {
final long partitionLengthInSpill = spills[i].partitionLengths[partition];
final FileChannel spillInputChannel = spillInputChannels[i];
final long writeStartTime = System.nanoTime();
//Transfer the bytes from the input FileChannel to the output FileChannel via transferTo
Utils.copyFileStreamNIO(
spillInputChannel,
mergedFileOutputChannel,
spillInputChannelPositions[i],
partitionLengthInSpill);
spillInputChannelPositions[i] += partitionLengthInSpill;
writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
bytesWrittenToMergedFile += partitionLengthInSpill;
partitionLengths[partition] += partitionLengthInSpill;
}
}
// Check the position after transferTo loop to see if it is in the right position and raise an
// exception if it is incorrect. The position will not be increased to the expected length
// after calling transferTo in kernel version 2.6.32. This issue is described at
// https://bugs.openjdk.java.net/browse/JDK-7052359 and SPARK-3948.
if (mergedFileOutputChannel.position() != bytesWrittenToMergedFile) {
throw new IOException(
"Current position " + mergedFileOutputChannel.position() + " does not equal expected " +
"position " + bytesWrittenToMergedFile + " after transferTo. Please check your kernel" +
" version to see if it is 2.6.32, as there is a kernel bug which will lead to " +
"unexpected behavior when using transferTo. You can set spark.file.transferTo=false " +
"to disable this NIO feature."
);
}
threwException = false;
} finally {
// To avoid masking exceptions that caused us to prematurely enter the finally block, only
// throw exceptions during cleanup if threwException == false.
for (int i = 0; i < spills.length; i++) {
assert(spillInputChannelPositions[i] == spills[i].file.length());
Closeables.close(spillInputChannels[i], threwException);
}
Closeables.close(mergedFileOutputChannel, threwException);
}
return partitionLengths;
}