Spark Source Code Analysis: BypassMergeSortShuffleWriter

Overview

After Spark 1.6, hash-based shuffle was removed and only sort-based shuffle remains. There are now just three shuffle writers:

  1. BypassMergeSortShuffleWriter
  2. UnsafeShuffleWriter
  3. SortShuffleWriter

Among them, BypassMergeSortShuffleWriter implements a hash-style variant of the sort-based shuffle, similar to the now-removed HashShuffleWriter. This writer writes each incoming record to a separate file, one file per reduce partition, and then concatenates those files into a single output file. The output file is divided into regions by partition ID, and each reducer fetches the region that matches its partition ID.

When there are many reduce partitions, this shuffle writer is inefficient, because it keeps a serializer and a file stream open for every partition at the same time.
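To make that cost concrete, a rough back-of-the-envelope sketch (assuming the default spark.shuffle.file.buffer of 32 KB; the variable names are only for illustration):

// One open DiskBlockObjectWriter, and therefore one write buffer, per reduce partition.
val fileBufferKb        = 32                                  // assumed default spark.shuffle.file.buffer
val numReducePartitions = 200                                 // example partition count
val bufferKbPerMapTask  = fileBufferKb * numReducePartitions  // 6400 KB, roughly 6.25 MB per map task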

Therefore, SortShuffleManager selects this shuffle writer only when all of the following conditions hold (a short example follows the list):

  1. No ordering is specified, i.e. records are not sorted within a partition;
  2. No Aggregator is specified, i.e. no map-side aggregation is performed;
  3. The number of partitions does not exceed the threshold set by spark.shuffle.sort.bypassMergeThreshold.
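As a minimal sketch of these conditions (assuming sc is an existing SparkContext and the default threshold of 200): the groupByKey below satisfies all three conditions, while the reduceByKey enables map-side combine and therefore takes a different writer.

val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

// No map-side combine, no key ordering, only 10 reduce partitions:
// this shuffle should use BypassMergeSortShuffleWriter.
val grouped = pairs.groupByKey(10)

// Map-side combine is enabled here, so the bypass path is not taken.
val summed = pairs.reduceByKey(_ + _, 10)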

Source Code Analysis

ShuffleMapTask obtains the ShuffleManager

We can set spark.shuffle.manager in the configuration. If it is not set, the default is sort; tungsten-sort is also backed by SortShuffleManager.

val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
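As a hypothetical illustration, the same property can be set explicitly from application code ("sort" is already the default, so this is normally unnecessary):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-writer-demo")      // hypothetical application name
  .setMaster("local[2]")
  .set("spark.shuffle.manager", "sort")   // "tungsten-sort" would resolve to the same SortShuffleManager
val sc = new SparkContext(conf)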

 

SortShuffleManager obtains the writer

SortShuffleManager chooses a ShuffleHandle according to which conditions are met; each ShuffleHandle maps to a shuffle writer as follows:

  BypassMergeSortShuffleHandle  ->  BypassMergeSortShuffleWriter
  SerializedShuffleHandle       ->  UnsafeShuffleWriter
  BaseShuffleHandle             ->  SortShuffleWriter

SortShuffleManager.scala

/**
   * Obtains a [[ShuffleHandle]] to pass to tasks.
   */
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
      // need map-side aggregation, then write numPartitions files directly and just concatenate
      // them at the end. This avoids doing serialization and deserialization twice to merge
      // together the spilled files, which would happen with the normal code path. The downside is
      // having multiple files open at a time and thus more memory allocated to buffers.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // Otherwise, buffer map outputs in a deserialized form:
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }

The shouldBypassMergeSort method checks whether the conditions for using BypassMergeSortShuffleWriter are satisfied.

If the number of partitions does not exceed the threshold set by spark.shuffle.sort.bypassMergeThreshold and no map-side aggregation is needed, then numPartitions files are written directly and simply concatenated into one output file at the end. This avoids the extra round of serialization and deserialization that the normal code path performs when merging spilled files. The downside is that many files are open at once, so more memory is allocated to buffers.

SortShuffleWriter.scala

private[spark] object SortShuffleWriter {
  def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
    // We cannot bypass sorting if we need to do map-side aggregation.
    if (dep.mapSideCombine) {
      require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
      false
    } else {
      val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
      dep.partitioner.numPartitions <= bypassMergeThreshold
    }
  }
}
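The threshold itself is configurable. A hypothetical tuning example (400 is an arbitrary value), trading more open files and buffer memory per map task for a wider bypass path:

import org.apache.spark.SparkConf

// Stages with up to 400 reduce partitions (and no map-side combine) would then
// satisfy shouldBypassMergeSort.
val conf = new SparkConf().set("spark.shuffle.sort.bypassMergeThreshold", "400")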

 

File writing and merging in BypassMergeSortShuffleWriter

The write method

The method is implemented as follows:

1. Create a temporary file and a DiskBlockObjectWriter for each partition; the DiskBlockObjectWriter is used to write that temporary file.

2. Write each record with the writer of its partition. partitioner.getPartition() is called on the record's key to determine which partition it belongs to and to look up that partition's DiskBlockObjectWriter. With a HashPartitioner, for example, the key's hashCode is taken modulo numPartitions to pick the partition. The record is then written to that partition's temporary file through its DiskBlockObjectWriter (a small example follows this list).

3. Construct the final output file; its name is "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId, with reduceId fixed at 0.

4. Append a UUID to the output file name to mark that the file is still being written; the file is renamed once writing finishes.

5. Merge each partition's temporary file into that single output file.

6. Write each partition's offset into an index file so that the reduce side can fetch its data.
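As a small illustration of step 2 (assuming Spark is on the classpath), HashPartitioner.getPartition maps a key to a partition id by taking a non-negative modulo of the key's hashCode against numPartitions:

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4)
println(partitioner.getPartition("a"))   // some id in [0, 4)
println(partitioner.getPartition(42))    // keys that map to the same id end up in the same per-partition file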

BypassMergeSortShuffleWriter.java

@Override
  public void write(Iterator<Product2<K, V>> records) throws IOException {
    assert (partitionWriters == null);
    if (!records.hasNext()) {
      partitionLengths = new long[numPartitions];
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, null);
      mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
      return;
    }
    final SerializerInstance serInstance = serializer.newInstance();
    final long openStartTime = System.nanoTime();
    partitionWriters = new DiskBlockObjectWriter[numPartitions];
    partitionWriterSegments = new FileSegment[numPartitions];
    for (int i = 0; i < numPartitions; i++) {
      final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
        blockManager.diskBlockManager().createTempShuffleBlock();
      // Create a temporary file for each partition
      final File file = tempShuffleBlockIdPlusFile._2();
      final BlockId blockId = tempShuffleBlockIdPlusFile._1();
      // Create a DiskBlockObjectWriter for each partition, used to write its temporary file
      partitionWriters[i] =
        blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, and can take a long time in aggregate when we open many files, so should be
    // included in the shuffle write time.
    writeMetrics.incWriteTime(System.nanoTime() - openStartTime);
    
    // Write each record to the file of its target partition via that partition's writer
    while (records.hasNext()) {
      final Product2<K, V> record = records.next();
      final K key = record._1();     
      // getPartition determines which partition the key belongs to; the record is then
      // written with the DiskBlockObjectWriter of that partition
      partitionWriters[partitioner.getPartition(key)].write(key, record._2());
    }
 
    // Flush and commit each partition's writer; commitAndGet returns the FileSegment that was written
    for (int i = 0; i < numPartitions; i++) {
      final DiskBlockObjectWriter writer = partitionWriters[i];
      partitionWriterSegments[i] = writer.commitAndGet();
      writer.close();
    }
   
    // Construct the final output file; its name is (with reduceId fixed at 0):
    //   "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
    // The local directory that holds the file is chosen by hashing the file name.
    // When running on YARN, the local directories are created at startup from the
    // 'LOCAL_DIRS' configuration.
    File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
    // Append a UUID to the output file name to mark that it is being written; it is renamed once writing completes
    File tmp = Utils.tempFileWith(output);
    try {
      // Concatenate each partition's file into the single output file
      partitionLengths = writePartitionedFile(tmp);
      // Write each partition's offset to the index file so the reduce side can fetch its data
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
      }
    }
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
  }
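For intuition on step 6, the index file written by writeIndexFileAndCommit essentially stores the running totals of partitionLengths; reducer i later fetches the byte range [offsets(i), offsets(i + 1)) from the data file. A sketch with hypothetical lengths:

val partitionLengths = Array(120L, 0L, 340L, 75L)    // hypothetical per-partition byte counts
val offsets = partitionLengths.scanLeft(0L)(_ + _)   // Array(0, 120, 120, 460, 535)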

The writePartitionedFile method

This method concatenates the files of all partitions into a single combined output file.

The outputFile parameter is the final output file.

The method is implemented as follows (a standalone sketch of the concatenation follows the source below):

1. Create a file output stream for outputFile.

2. Open a file input stream for each partition's file.

3. Copy the entire contents of each input stream into the output stream, recording the number of bytes contributed by each partition.

/**
   * Concatenate all of the per-partition files into a single combined file.
   *
   * @return array of lengths, in bytes, of each partition of the file (used by map output tracker).
   */
  private long[] writePartitionedFile(File outputFile) throws IOException {
    // Track location of the partition starts in the output file
    final long[] lengths = new long[numPartitions];
    if (partitionWriters == null) {
      // We were passed an empty iterator
      return lengths;
    }
    // Open a file output stream to outputFile (in append mode)
    final FileOutputStream out = new FileOutputStream(outputFile, true);
    final long writeStartTime = System.nanoTime();
    boolean threwException = true;
    try {
      for (int i = 0; i < numPartitions; i++) {
        final File file = partitionWriterSegments[i].file();
        if (file.exists()) {
          // Open a file input stream for this partition's file
          final FileInputStream in = new FileInputStream(file);
          boolean copyThrewException = true;
          try {
            // Copy the entire contents of the input stream to the output stream
            lengths[i] = Utils.copyStream(in, out, false, transferToEnabled);
            copyThrewException = false;
          } finally {
            Closeables.close(in, copyThrewException);
          }
          if (!file.delete()) {
            logger.error("Unable to delete file for partition {}", i);
          }
        }
      }
      threwException = false;
    } finally {
      Closeables.close(out, threwException);
      writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
    }
    partitionWriters = null;
    return lengths;
  }
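For intuition, here is a minimal standalone sketch of the same concatenation idea, not the Spark implementation itself: it appends each per-partition file to one combined file and records how many bytes each partition contributed, using the NIO transferTo path that Utils.copyStream can take when transferToEnabled is set.

import java.io.{File, FileInputStream, FileOutputStream}

def concatenate(partFiles: Seq[File], output: File): Array[Long] = {
  val lengths = new Array[Long](partFiles.size)
  val out = new FileOutputStream(output, true)        // append to the combined file
  try {
    partFiles.zipWithIndex.foreach { case (f, i) =>
      if (f.exists()) {
        val in = new FileInputStream(f)
        try {
          // Copy the whole file and record its length for the index
          lengths(i) = in.getChannel.transferTo(0, f.length(), out.getChannel)
        } finally {
          in.close()
        }
      }
    }
  } finally {
    out.close()
  }
  lengths
}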

