Spark Source Code Analysis: BypassMergeSortShuffleWriter

Overview

After Spark 1.6, hash-based shuffle was removed and only sort-based shuffle remains. There are now just three shuffle writers:

  1. BypassMergeSortShuffleWriter
  2. UnsafeShuffleWriter
  3. SortShuffleWriter

Among them, BypassMergeSortShuffleWriter implements a hash-style variant of the sort-based shuffle, similar to the now-removed HashShuffleWriter. This writer writes each incoming record to a separate file, one file per reduce partition, and then concatenates those files into a single output file. The output file is divided into regions by partition ID, and each reducer fetches the region that matches its partition ID.

When there are many reduce partitions, this shuffle writer is inefficient, because it keeps a serializer and a file stream open for every partition at the same time.
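To make that cost concrete, a rough back-of-the-envelope sketch (assuming the default spark.shuffle.file.buffer of 32 KB; the variable names are only for illustration):

// One open DiskBlockObjectWriter, and therefore one write buffer, per reduce partition.
val fileBufferKb        = 32                                  // assumed default spark.shuffle.file.buffer
val numReducePartitions = 200                                 // example partition count
val bufferKbPerMapTask  = fileBufferKb * numReducePartitions  // 6400 KB, roughly 6.25 MB per map task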

Therefore, SortShuffleManager selects this shuffle writer only when all of the following conditions hold (a short example follows the list):

  1. No ordering is specified, i.e. records are not sorted within a partition;
  2. No Aggregator is specified, i.e. no map-side aggregation is performed;
  3. The number of partitions does not exceed the threshold set by spark.shuffle.sort.bypassMergeThreshold.
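As a minimal sketch of these conditions (assuming sc is an existing SparkContext and the default threshold of 200): the groupByKey below satisfies all three conditions, while the reduceByKey enables map-side combine and therefore takes a different writer.

val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

// No map-side combine, no key ordering, only 10 reduce partitions:
// this shuffle should use BypassMergeSortShuffleWriter.
val grouped = pairs.groupByKey(10)

// Map-side combine is enabled here, so the bypass path is not taken.
val summed = pairs.reduceByKey(_ + _, 10)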

Source Code Analysis

ShuffleMapTask obtains the ShuffleManager

We can set spark.shuffle.manager in the configuration. If it is not set, the default is sort; tungsten-sort is also backed by SortShuffleManager.

val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
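As a hypothetical illustration, the same property can be set explicitly from application code ("sort" is already the default, so this is normally unnecessary):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-writer-demo")      // hypothetical application name
  .setMaster("local[2]")
  .set("spark.shuffle.manager", "sort")   // "tungsten-sort" would resolve to the same SortShuffleManager
val sc = new SparkContext(conf)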

 

SortShuffleManager obtains the writer

SortShuffleManager chooses a ShuffleHandle according to which conditions are met; each ShuffleHandle maps to a shuffle writer as follows:

  BypassMergeSortShuffleHandle  ->  BypassMergeSortShuffleWriter
  SerializedShuffleHandle       ->  UnsafeShuffleWriter
  BaseShuffleHandle             ->  SortShuffleWriter

SortShuffleManager.scala

/**
   * Obtains a [[ShuffleHandle]] to pass to tasks.
   */
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
      // need map-side aggregation, then write numPartitions files directly and just concatenate
      // them at the end. This avoids doing serialization and deserialization twice to merge
      // together the spilled files, which would happen with the normal code path. The downside is
      // having multiple files open at a time and thus more memory allocated to buffers.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // Otherwise, buffer map outputs in a deserialized form:
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }

The shouldBypassMergeSort method checks whether the conditions for using BypassMergeSortShuffleWriter are satisfied.

If the number of partitions does not exceed the threshold set by spark.shuffle.sort.bypassMergeThreshold and no map-side aggregation is needed, then numPartitions files are written directly and simply concatenated into one output file at the end. This avoids the extra round of serialization and deserialization that the normal code path performs when merging spilled files. The downside is that many files are open at once, so more memory is allocated to buffers.

SortShuffleWriter.scala

private[spark] object SortShuffleWriter {
  def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
    // We cannot bypass sorting if we need to do map-side aggregation.
    if (dep.mapSideCombine) {
      require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
      false
    } else {
      val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
      dep.partitioner.numPartitions <= bypassMergeThreshold
    }
  }
}
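The threshold itself is configurable. A hypothetical tuning example (400 is an arbitrary value), trading more open files and buffer memory per map task for a wider bypass path:

import org.apache.spark.SparkConf

// Stages with up to 400 reduce partitions (and no map-side combine) would then
// satisfy shouldBypassMergeSort.
val conf = new SparkConf().set("spark.shuffle.sort.bypassMergeThreshold", "400")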

 

File writing and merging in BypassMergeSortShuffleWriter

The write method

The method is implemented as follows:

1. Create a temporary file and a DiskBlockObjectWriter for each partition; the DiskBlockObjectWriter is used to write that temporary file.

2. Write each record with the writer of its partition. partitioner.getPartition() is called on the record's key to determine which partition it belongs to and to look up that partition's DiskBlockObjectWriter. With a HashPartitioner, for example, the key's hashCode is taken modulo numPartitions to pick the partition. The record is then written to that partition's temporary file through its DiskBlockObjectWriter (a small example follows this list).

3. Construct the final output file; its name is "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId, with reduceId fixed at 0.

4. Append a UUID to the output file name to mark that the file is still being written; the file is renamed once writing finishes.

5. Merge each partition's temporary file into that single output file.

6. Write each partition's offset into an index file so that the reduce side can fetch its data.
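As a small illustration of step 2 (assuming Spark is on the classpath), HashPartitioner.getPartition maps a key to a partition id by taking a non-negative modulo of the key's hashCode against numPartitions:

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4)
println(partitioner.getPartition("a"))   // some id in [0, 4)
println(partitioner.getPartition(42))    // keys that map to the same id end up in the same per-partition file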

BypassMergeSortShuffleWriter.java

@Override
  public void write(Iterator<Product2<K, V>> records) throws IOException {
    assert (partitionWriters == null);
    if (!records.hasNext()) {
      partitionLengths = new long[numPartitions];
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, null);
      mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
      return;
    }
    final SerializerInstance serInstance = serializer.newInstance();
    final long openStartTime = System.nanoTime();
    partitionWriters = new DiskBlockObjectWriter[numPartitions];
    partitionWriterSegments = new FileSegment[numPartitions];
    for (int i = 0; i < numPartitions; i++) {
      final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
        blockManager.diskBlockManager().createTempShuffleBlock();
      // Create a temporary file for each partition
      final File file = tempShuffleBlockIdPlusFile._2();
      final BlockId blockId = tempShuffleBlockIdPlusFile._1();
      // Create a DiskBlockObjectWriter for each partition, used to write its temporary file
      partitionWriters[i] =
        blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, and can take a long time in aggregate when we open many files, so should be
    // included in the shuffle write time.
    writeMetrics.incWriteTime(System.nanoTime() - openStartTime);
    
    // Write each record to the file of its target partition via that partition's writer
    while (records.hasNext()) {
      final Product2<K, V> record = records.next();
      final K key = record._1();     
      // getPartition determines which partition the key belongs to; the record is then
      // written with the DiskBlockObjectWriter of that partition
      partitionWriters[partitioner.getPartition(key)].write(key, record._2());
    }
 
    // Flush and commit each partition's writer; commitAndGet returns the FileSegment that was written
    for (int i = 0; i < numPartitions; i++) {
      final DiskBlockObjectWriter writer = partitionWriters[i];
      partitionWriterSegments[i] = writer.commitAndGet();
      writer.close();
    }
   
    // Construct the final output file; its name is (with reduceId fixed at 0):
    //   "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
    // The local directory that holds the file is chosen by hashing the file name.
    // When running on YARN, the local directories are created at startup from the
    // 'LOCAL_DIRS' configuration.
    File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
    // Append a UUID to the output file name to mark that it is being written; it is renamed once writing completes
    File tmp = Utils.tempFileWith(output);
    try {
      // Concatenate each partition's file into the single output file
      partitionLengths = writePartitionedFile(tmp);
      // Write each partition's offset to the index file so the reduce side can fetch its data
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
      }
    }
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
  }
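For intuition on step 6, the index file written by writeIndexFileAndCommit essentially stores the running totals of partitionLengths; reducer i later fetches the byte range [offsets(i), offsets(i + 1)) from the data file. A sketch with hypothetical lengths:

val partitionLengths = Array(120L, 0L, 340L, 75L)    // hypothetical per-partition byte counts
val offsets = partitionLengths.scanLeft(0L)(_ + _)   // Array(0, 120, 120, 460, 535)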

The writePartitionedFile method

This method concatenates the files of all partitions into a single combined output file.

The outputFile parameter is the final output file.

The method is implemented as follows (a standalone sketch of the concatenation follows the source below):

1. Create a file output stream for outputFile.

2. Open a file input stream for each partition's file.

3. Copy the entire contents of each input stream into the output stream, recording the number of bytes contributed by each partition.

/**
   * Concatenate all of the per-partition files into a single combined file.
   *
   * @return array of lengths, in bytes, of each partition of the file (used by map output tracker).
   */
  private long[] writePartitionedFile(File outputFile) throws IOException {
    // Track location of the partition starts in the output file
    final long[] lengths = new long[numPartitions];
    if (partitionWriters == null) {
      // We were passed an empty iterator
      return lengths;
    }
    // Open a file output stream to outputFile (in append mode)
    final FileOutputStream out = new FileOutputStream(outputFile, true);
    final long writeStartTime = System.nanoTime();
    boolean threwException = true;
    try {
      for (int i = 0; i < numPartitions; i++) {
        final File file = partitionWriterSegments[i].file();
        if (file.exists()) {
          // Open a file input stream for this partition's file
          final FileInputStream in = new FileInputStream(file);
          boolean copyThrewException = true;
          try {
            // Copy the entire contents of the input stream to the output stream
            lengths[i] = Utils.copyStream(in, out, false, transferToEnabled);
            copyThrewException = false;
          } finally {
            Closeables.close(in, copyThrewException);
          }
          if (!file.delete()) {
            logger.error("Unable to delete file for partition {}", i);
          }
        }
      }
      threwException = false;
    } finally {
      Closeables.close(out, threwException);
      writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
    }
    partitionWriters = null;
    return lengths;
  }
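For intuition, here is a minimal standalone sketch of the same concatenation idea, not the Spark implementation itself: it appends each per-partition file to one combined file and records how many bytes each partition contributed, using the NIO transferTo path that Utils.copyStream can take when transferToEnabled is set.

import java.io.{File, FileInputStream, FileOutputStream}

def concatenate(partFiles: Seq[File], output: File): Array[Long] = {
  val lengths = new Array[Long](partFiles.size)
  val out = new FileOutputStream(output, true)        // append to the combined file
  try {
    partFiles.zipWithIndex.foreach { case (f, i) =>
      if (f.exists()) {
        val in = new FileInputStream(f)
        try {
          // Copy the whole file and record its length for the index
          lengths(i) = in.getChannel.transferTo(0, f.length(), out.getChannel)
        } finally {
          in.close()
        }
      }
    }
  } finally {
    out.close()
  }
  lengths
}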

