Source Code Analysis of the Spark Shuffle Writers

ZL小屁孩

Article link: https://blog.csdn.net/ZH519080/article/details/82782918

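The most fundamental optimization in Spark shuffle, and the problem it most urgently needs to solve, is reducing the number of files that the ShuffleWriter produces on the mapper side. Fewer small files on the mapper side bring these benefits:

Memory usage on the mapper side drops
Larger datasets can be processed without easily hitting a performance bottleneck
Reducers fetch data fewer times, which improves efficiency and reduces network overhead

SparkEnv shows the shuffle configuration property and the concrete ShuffleManager implementations that Spark's pluggable ShuffleManager framework already provides. Three shuffle modes are available: hash, sort, and tungsten-sort. The shuffle-related part of the SparkEnv source: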

// Short names that can be used to select the ShuffleManager
val shortShuffleMgrNames = Map(
  "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
  "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager",
  "tungsten-sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")
// Configuration property that selects the ShuffleManager; the default is sort mode, implemented by SortShuffleManager
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
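The lookup above is a map lookup with a fallback: a known short name resolves to its class, and anything else is treated as a fully qualified class name. A minimal Java sketch of the same idea follows. The class `ShuffleManagerNames` and its `resolve` method are hypothetical names for illustration, not part of Spark.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper that mirrors the short-name lookup in the SparkEnv excerpt.
final class ShuffleManagerNames {
    // Short names copied from the excerpt; "sort" and "tungsten-sort" share one class.
    static final Map<String, String> SHORT_NAMES = new HashMap<>();
    static {
        SHORT_NAMES.put("hash", "org.apache.spark.shuffle.hash.HashShuffleManager");
        SHORT_NAMES.put("sort", "org.apache.spark.shuffle.sort.SortShuffleManager");
        SHORT_NAMES.put("tungsten-sort", "org.apache.spark.shuffle.sort.SortShuffleManager");
    }

    // Returns the class for a known short name (case-insensitive); otherwise
    // returns the input unchanged, treating it as a fully qualified class name.
    static String resolve(String configured) {
        return SHORT_NAMES.getOrDefault(configured.toLowerCase(Locale.ROOT), configured);
    }
}
```

With this fallback, a custom implementation can be selected simply by putting its class name in the configuration value.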

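The advantage of tungsten-sort is that it lets Spark grow from a platform that handles only small and medium datasets into one that handles much larger datasets, which shows up as greater processing capacity.

sort and tungsten-sort are both implemented by "org.apache.spark.shuffle.sort.SortShuffleManager". The two mechanisms differ in which shuffle writer they use.

The classes behind sort (sort-based shuffle) are BypassMergeSortShuffleWriter and SortShuffleWriter. The class behind tungsten-sort is UnsafeShuffleWriter (serialized sorting).

When SparkEnv is instantiated on the Driver and on every Executor, a ShuffleManager is created. It manages shuffle block data and provides the cluster-wide read and write entry points (getWriter, getReader), covering both local reads and writes and reads of blocks from remote nodes.

The shuffle framework is best analyzed with ShuffleManager as the entry point. ShuffleManager specifies every component the shuffle uses: how a shuffle registers with the ShuffleManager to obtain a ShuffleHandle for reading and writing, how the ShuffleHandle is used to obtain the concrete read and write interfaces ShuffleReader and ShuffleWriter, and how to obtain ShuffleBlockResolver, the interface that resolves block data information. The source of several important components is analyzed below.

ShuffleManager is the pluggable interface of Spark shuffle. Every concrete implementation, built-in or custom, must implement the abstract methods of the ShuffleManager trait.

An Executor runs a stage by ultimately calling Task.run. Task is an abstract class whose abstract method is implemented by the subclasses ShuffleMapTask and ResultTask. ShuffleMapTask splits the elements of an RDD into multiple buckets according to the partitioner specified in the ShuffleDependency, and it obtains the writer that maps data to buckets through ShuffleManager's getWriter interface. ResultTask is a task that returns its output to the application's Driver. During its execution it ultimately calls the RDD's compute method on the data, and for an RDD with a ShuffleDependency, compute uses ShuffleManager's getReader interface to fetch the shuffle output of the previous stage as the input of the current task.

The main source of ShuffleMapTask's runTask method: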

override def runTask(context: TaskContext): MapStatus = {
  // Get the ShuffleManager from SparkEnv, then take the shuffleHandle that the
  // ShuffleDependency received when it registered with the ShuffleManager. Use the
  // handle and the partition this task covers to get a ShuffleWriter, and call its
  // write method to write the data of the current partition.
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    // Compute the RDD partition and persist the result through the ShuffleWriter's write method
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    writer.stop(success = true).get
  } catch {
    // ...
  }
}

In ShuffleMapTask's runTask method, getWriter is an abstract method of the ShuffleManager trait, which SortShuffleManager and HashShuffleManager implement. Taking SortShuffleManager's getWriter as the example: before getWriter can be called, the shuffle must be registered through SortShuffleManager's registerShuffle method, which returns the handle used to perform the shuffle. The source of SortShuffleManager's registerShuffle method:

override def registerShuffle[K, V, C](shuffleId: Int, numMaps: Int, dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(SparkEnv.get.conf, dependency)) {
    // If the number of partitions is below the threshold set by
    // spark.shuffle.sort.bypassMergeThreshold and no map-side aggregation is needed,
    // write numPartitions files directly and concatenate them into one output file.
    // This avoids serializing and deserializing twice to merge spill files, but it
    // needs more memory for buffers because many files are open at once.
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // Otherwise, try to buffer map outputs in serialized form
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}

The handles BypassMergeSortShuffleHandle, SerializedShuffleHandle, and BaseShuffleHandle returned by SortShuffleManager's registerShuffle method correspond to the writers BypassMergeSortShuffleWriter, UnsafeShuffleWriter, and SortShuffleWriter created in SortShuffleManager's getWriter method:

Handle from SortShuffleManager.registerShuffle	Writer from SortShuffleManager.getWriter
BypassMergeSortShuffleHandle	BypassMergeSortShuffleWriter
SerializedShuffleHandle	UnsafeShuffleWriter
BaseShuffleHandle	SortShuffleWriter

Therefore, the shuffle handle must be obtained first; only then can the matching writer class be obtained.
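The decision order in registerShuffle can be summarized in a small sketch. Everything here is a hypothetical stand-in: `HandleKind` and `HandleChooser` are not Spark classes, the boolean parameters replace the real ShuffleDependency and serializer checks, and whether the comparisons are inclusive is an assumption on my part.

```java
// Hypothetical stand-ins for the three handle types in the table above.
enum HandleKind { BYPASS_MERGE_SORT, SERIALIZED, BASE }

final class HandleChooser {
    // 2^24 = 16777216, the partition limit quoted for the serialized mode.
    static final int MAX_SERIALIZED_PARTITIONS = 1 << 24;

    // Checks are made in the same order as registerShuffle: bypass first,
    // then serialized, and the deserialized base handle as the fallback.
    static HandleKind choose(boolean mapSideCombine,
                             boolean hasAggregator,
                             boolean serializerSupportsRelocation,
                             int numPartitions,
                             int bypassMergeThreshold) {
        if (!mapSideCombine && numPartitions <= bypassMergeThreshold) {
            return HandleKind.BYPASS_MERGE_SORT;
        }
        if (serializerSupportsRelocation && !hasAggregator
                && numPartitions <= MAX_SERIALIZED_PARTITIONS) {
            return HandleKind.SERIALIZED;
        }
        return HandleKind.BASE;
    }
}
```

Because the bypass check comes first, a shuffle with few partitions and no map-side aggregation never reaches the serialized path, even when its serializer would qualify.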

The source of SortShuffleManager's getWriter method:

override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext): ShuffleWriter[K, V] = {
  numMapsForShuffle.putIfAbsent(
    handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
  val env = SparkEnv.get
  handle match {
    case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
      new UnsafeShuffleWriter(
        env.blockManager,
        shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
        context.taskMemoryManager(),
        unsafeShuffleHandle,
        mapId,
        context,
        env.conf)
    case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
      new BypassMergeSortShuffleWriter(
        env.blockManager,
        shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
        bypassMergeSortHandle,
        mapId,
        context,
        env.conf)
    case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
      new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
  }
}

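Source analysis of BypassMergeSortShuffleWriter's write path

This class implements the hash-style path of sort-based shuffle. It creates one output file per reduce task, writes each incoming record to the file of its partition, and finally merges these per-partition files into a single output file.

The conditions for using BypassMergeSortShuffleWriter:

No ordering is specified. When an ordering is specified, the data inside each partition is sorted, so BypassMergeSortShuffleWriter avoids the sorting cost.
No aggregator is specified.
The number of partitions is below the value set by the spark.shuffle.sort.bypassMergeThreshold configuration property.

The source of BypassMergeSortShuffleWriter's write method: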

@Override
public void write(Iterator<Product2<K, V>> records) throws IOException {
  assert (partitionWriters == null);
  if (!records.hasNext()) {
    // No records: write an index file whose partition lengths are all zero, then return
    partitionLengths = new long[numPartitions];
    shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, null);
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
    return;
  }
  final SerializerInstance serInstance = serializer.newInstance();
  // Start timing the cost of opening the per-partition writers
  final long openStartTime = System.nanoTime();
  // One DiskBlockObjectWriter per partition
  partitionWriters = new DiskBlockObjectWriter[numPartitions];
  // In this mode numPartitions DiskBlockObjectWriters are open at the same time, each
  // with a buffer of fileBufferSize (32 KB), so the partition count should not be set
  // too high or the buffers cost too much memory.
  for (int i = 0; i < numPartitions; i++) {
    // createTempShuffleBlock yields the intermediate temporary file for each
    // partition and its corresponding BlockId
    final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
      blockManager.diskBlockManager().createTempShuffleBlock();
    final File file = tempShuffleBlockIdPlusFile._2();
    final BlockId blockId = tempShuffleBlockIdPlusFile._1();
    partitionWriters[i] =
      blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics).open();
  }
  // Creating the file and creating the disk writer both touch the disk and can take
  // a long time when many files are opened, so this time counts as shuffle write time
  writeMetrics.incShuffleWriteTime(System.nanoTime() - openStartTime);
  while (records.hasNext()) {
    final Product2<K, V> record = records.next();
    final K key = record._1();
    partitionWriters[partitioner.getPartition(key)].write(key, record._2());
  }
  for (DiskBlockObjectWriter writer : partitionWriters) {
    writer.commitAndClose();
  }
  // This single output file holds the data for all reduce partitions
  File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
  File tmp = Utils.tempFileWith(output);
  partitionLengths = writePartitionedFile(tmp);
  shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
  mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}

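Therefore, BypassMergeSortShuffleWriter's write method shows that a map task ends up producing two files: a data file and an index file. DiskBlockObjectWriter buffers records and writes them in 32 KB batches, so it does not use much memory. However, because no Aggregator can be specified, records are written as they are without being combined, so the network I/O cost of the later fetch can be high.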
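The two-phase write (one buffer per partition, then concatenation into a single data file with an index) can be illustrated in memory. This is only a sketch under assumptions: byte buffers stand in for temporary files, `partitionOf` is a hypothetical hash partitioner, and storing the index as cumulative offsets with a leading 0 reflects my understanding of the index file layout rather than anything shown in the excerpt.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class BypassConcatSketch {
    // Hypothetical partitioner: non-negative hash modulo the partition count.
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Phase 1: append each record to the buffer of its partition. Each buffer
    // stands in for one temporary per-partition file.
    static ByteArrayOutputStream[] writePerPartition(String[] keys, int numPartitions) {
        ByteArrayOutputStream[] writers = new ByteArrayOutputStream[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
            writers[i] = new ByteArrayOutputStream();
        }
        for (String key : keys) {
            byte[] bytes = (key + "\n").getBytes(StandardCharsets.UTF_8);
            writers[partitionOf(key, numPartitions)].write(bytes, 0, bytes.length);
        }
        return writers;
    }

    // Phase 2: concatenate the buffers in partition order into one data stream.
    // lengths[i] is the number of bytes that belong to partition i.
    static long[] concatenate(ByteArrayOutputStream[] writers, ByteArrayOutputStream data) {
        long[] lengths = new long[writers.length];
        for (int i = 0; i < writers.length; i++) {
            byte[] bytes = writers[i].toByteArray();
            data.write(bytes, 0, bytes.length);
            lengths[i] = bytes.length;
        }
        return lengths;
    }

    // Index: cumulative offsets, so partition i occupies [offsets[i], offsets[i + 1]).
    static long[] indexOffsets(long[] lengths) {
        long[] offsets = new long[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            offsets[i + 1] = offsets[i] + lengths[i];
        }
        return offsets;
    }
}
```

A reducer that wants partition i only needs the two neighboring offsets to read its byte range from the single data file, which is why one data file plus one index file per map task is enough.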

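Source analysis of SortShuffleWriter's write path

BypassMergeSortShuffleWriter is an optimization for the case where the reduce side has few partitions. When the dataset is very large, writing through BypassMergeSortShuffleWriter is no longer suitable and SortShuffleWriter is needed instead.

As with the other concrete ShuffleWriter subclasses, SortShuffleWriter writes its data in the write method. Its source: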

override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    // When map-side aggregation is needed, pass the Aggregator and the key ordering
    // to the external sorter (ExternalSorter)
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    // Without map-side aggregation, both the Aggregator and the key ordering passed to
    // ExternalSorter are None. Aggregation then happens on the reduce side when the data
    // is read, and whether each partition is sorted depends on whether an ordering is set.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)  // Put all the records to be written into the external sorter
  // As in BypassMergeSortShuffleWriter, get the output file name and the BlockId
  val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
  val tmp = Utils.tempFileWith(output)
  val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
  val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
  shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
}

The external sorter (ExternalSorter) extends Spillable, so once memory usage reaches a threshold it spills to disk, which limits memory pressure. After ExternalSorter's insertAll method processes each record (aggregated or not), it checks whether a spill is needed. There are many internal details; the decision itself is made in the maybeSpill method of the Spillable class. The call flow is ExternalSorter.insertAll -> ExternalSorter.maybeSpillCollection -> Spillable.maybeSpill. The source of Spillable's maybeSpill method:

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  var shouldSpill = false
  // Check only when the number of records read is a multiple of 32, so the check runs
  // once per small batch, and only when the current memory has reached the threshold
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    // Try to claim enough from the shuffle memory pool to double the current memory
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    // Request memory first, then check again, and only then decide whether to spill
    val granted =
      taskMemoryManager.acquireExecutionMemory(amountToRequest, MemoryMode.ON_HEAP, null)
    myMemoryThreshold += granted
    // If too little memory was granted to grow further, or the current memory still
    // reaches myMemoryThreshold, spill
    shouldSpill = currentMemory >= myMemoryThreshold
  }
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
  // Actually spill
  if (shouldSpill) {
    _spillCount += 1
    logSpillage(currentMemory)
    spill(collection)   // ExternalSorter's spill method overrides Spillable's spill method
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    releaseMemory()
  }
  shouldSpill
}
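The decision logic of maybeSpill can be isolated as a pure function, which makes the "request first, then decide" step easier to follow. This is a sketch, not Spark code: `SpillDecisionSketch` is a hypothetical name, and the `available` parameter stands in for whatever the memory manager is willing to grant.

```java
final class SpillDecisionSketch {
    // Outcome of one check: whether to spill, and the threshold after the grant.
    static final class Decision {
        final boolean shouldSpill;
        final long newThreshold;

        Decision(boolean shouldSpill, long newThreshold) {
            this.shouldSpill = shouldSpill;
            this.newThreshold = newThreshold;
        }
    }

    // available: memory the (hypothetical) memory manager can still grant.
    static Decision check(long elementsRead, long currentMemory, long threshold,
                          long available, long forceSpillThreshold) {
        boolean shouldSpill = false;
        long newThreshold = threshold;
        if (elementsRead % 32 == 0 && currentMemory >= threshold) {
            // Ask for enough to raise the threshold to twice the current memory.
            long requested = 2 * currentMemory - threshold;
            long granted = Math.min(requested, available);
            newThreshold = threshold + granted;
            // Spill only if the grant was too small to cover the current usage.
            shouldSpill = currentMemory >= newThreshold;
        }
        shouldSpill = shouldSpill || elementsRead > forceSpillThreshold;
        return new Decision(shouldSpill, newThreshold);
    }
}
```

When the full request is granted, the new threshold becomes exactly twice the current memory, so the collection can keep growing for a while before the next check can trigger a spill.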

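Tungsten-sort: analysis of UnsafeShuffleWriter

The tungsten-sort shuffle mechanism corresponds to the serialized sorting mode (SerializedShuffleHandle). The conditions for using the serialized UnsafeShuffleWriter class:

The shuffle specifies no aggregation or output ordering
The shuffle serializer supports relocation of serialized values
The shuffle produces no more than 16777216 (2^24) output partitions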
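The figure 16777216 is 2^24. As far as I understand it, the serialized sort keeps the partition id in 24 bits of a 64-bit packed record pointer, which would explain the limit. The sketch below shows that kind of packing in a simplified form; `PackedPointerSketch` is a hypothetical class, and treating the remaining 40 bits as one opaque address is a simplification of mine.

```java
final class PackedPointerSketch {
    static final int PARTITION_BITS = 24;
    static final int ADDRESS_BITS = 64 - PARTITION_BITS;            // 40
    static final int MAX_PARTITION_ID = (1 << PARTITION_BITS) - 1;  // 16777215
    static final long ADDRESS_MASK = (1L << ADDRESS_BITS) - 1;

    // Packs the partition id into the top 24 bits and the address into the low 40 bits.
    static long pack(int partitionId, long address) {
        if (partitionId < 0 || partitionId > MAX_PARTITION_ID) {
            throw new IllegalArgumentException("partition id does not fit in 24 bits");
        }
        return ((long) partitionId << ADDRESS_BITS) | (address & ADDRESS_MASK);
    }

    // Unsigned shift, so partition ids with the highest bit set decode correctly.
    static int partitionId(long packed) {
        return (int) (packed >>> ADDRESS_BITS);
    }

    static long address(long packed) {
        return packed & ADDRESS_MASK;
    }
}
```

Partition ids run from 0 to 16777215, which gives 16777216 distinct partitions; one more and the id no longer fits in the 24 bits.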

SortShuffleManager's getWriter method shows that the writer class UnsafeShuffleWriter uses the variable shuffleBlockResolver to resolve the mapping between logical blocks and physical blocks, and that this variable is an IndexShuffleBlockResolver, a resolver class different from the one used by hash-based shuffle. The source of UnsafeShuffleWriter's write method:

@Override
public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
  // ...
    while (records.hasNext()) { // Insert every record of the input into the external sorter
      insertRecordIntoSorter(records.next());
    }
    // Produces two outputs per map task: one data file and its index file
    closeAndWriteOutput();
    success = true;
  } finally {
    if (sorter != null) {
      try {
        sorter.cleanupResources();  // Release the external sorter's resources
        // ...
}

The source of UnsafeShuffleWriter's insertRecordIntoSorter method:

void insertRecordIntoSorter(Product2<K, V> record) throws IOException {
  // ...
  final K key = record._1();
  final int partitionId = partitioner.getPartition(key);
  // Reset the record buffer, a ByteArrayOutputStream with a capacity of 1 MB
  serBuffer.reset();
  // The serializer wraps serBuffer in a serialization stream; write the record into the buffer
  serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
  serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
  serOutputStream.flush();
  final int serializedRecordSize = serBuffer.size();
  assert (serializedRecordSize > 0);
  // Insert the record into the external sorter. serBuffer is backed by a byte array
  // whose data starts at offset Platform.BYTE_ARRAY_OFFSET
  sorter.insertRecord(
    serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
}

Next, the source of the closeAndWriteOutput method called from UnsafeShuffleWriter's write method:

void closeAndWriteOutput() throws IOException {
  assert(sorter != null);
  updatePeakMemoryUsed();
  serBuffer = null;  // Set to null so it can be garbage collected
  serOutputStream = null;
  final SpillInfo[] spills = sorter.closeAndGetSpills(); // Close the external sorter and get all spill information
  sorter = null;
  final long[] partitionLengths;
  // getDataFile asks the block resolver for the output file name
  final File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
  // writeIndexFileAndCommit asks the block resolver for the file name again
  final File tmp = Utils.tempFileWith(output);
  try {
    partitionLengths = mergeSpills(spills, tmp);
  } finally {
    // ...
  }
  // Write the partition lengths obtained from merging the spills to the index file,
  // and rename the temporary file to the real data file name
  shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
  mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}

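UnsafeShuffleWriter's closeAndWriteOutput method has three main steps:

Close the external sorter and obtain the spill information
Merge the intermediate spill files into the data file and return the length of each partition's data
Generate the index file for the data file from the per-partition lengths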
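The second step, merging spill files partition by partition, can be sketched in memory. This is an illustration under assumptions: `Spill` is a hypothetical stand-in for a spill file plus its partition lengths, and the sketch assumes that inside each spill the bytes are already laid out in partition order.

```java
import java.io.ByteArrayOutputStream;

final class SpillMergeSketch {
    // Hypothetical in-memory spill: raw bytes plus the byte length of each partition.
    static final class Spill {
        final byte[] bytes;
        final long[] partitionLengths;

        Spill(byte[] bytes, long[] partitionLengths) {
            this.bytes = bytes;
            this.partitionLengths = partitionLengths;
        }
    }

    // For every partition, append that partition's slice from each spill in turn.
    // Returns the per-partition lengths of the merged output.
    static long[] merge(Spill[] spills, int numPartitions, ByteArrayOutputStream out) {
        long[] mergedLengths = new long[numPartitions];
        int[] readPos = new int[spills.length]; // current read position in each spill
        for (int p = 0; p < numPartitions; p++) {
            for (int s = 0; s < spills.length; s++) {
                int len = (int) spills[s].partitionLengths[p];
                out.write(spills[s].bytes, readPos[s], len);
                readPos[s] += len;
                mergedLengths[p] += len;
            }
        }
        return mergedLengths;
    }
}
```

The returned lengths are what the third step needs: they describe where each partition starts and ends in the merged data file, so the index file can be written from them directly.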

closeAndWriteOutput calls the mergeSpills method. The source of UnsafeShuffleWriter's mergeSpills method:

private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
  // Compression settings
  final boolean compressionEnabled = sparkConf.getBoolean("spark.shuffle.compress", true);
  final CompressionCodec compressionCodec = CompressionCodec$.MODULE$.createCodec(sparkConf);
  // Whether the unsafe fast merge is enabled
  final boolean fastMergeEnabled =
    sparkConf.getBoolean("spark.shuffle.unsafe.fastMergeEnabled", true);
  // Fast merge is supported when compression is off or the codec supports
  // concatenation of serialized streams
  final boolean fastMergeIsSupported = !compressionEnabled ||
    CompressionCodec$.MODULE$.supportsConcatenationOfSerializedStreams(compressionCodec);
  try {
    if (spills.length == 0) {
      new FileOutputStream(outputFile).close(); // No spill files: create an empty file
      return new long[partitioner.numPartitions()];
    } else if (spills.length == 1) {
      // The single spill file has already been counted in the metrics, so no update is
      // needed; just rename the intermediate spill file to the target output data file
      Files.move(spills[0].file, outputFile);
      return spills[0].partitionLengths;
    } else {
      final long[] partitionLengths;
      // With several intermediate spill files, choose a merge strategy based on the conditions
      if (fastMergeEnabled && fastMergeIsSupported) {
        if (transferToEnabled) {
          // Merge the spill data through NIO; this is only safe when the compression
          // codec and the serializer support concatenation of serialized streams
          partitionLengths = mergeSpillsWithTransferTo(spills, outputFile);
        } else {
          // Merge using Java file streams
          partitionLengths = mergeSpillsWithFileStream(spills, outputFile, null);
        }
      } else {
        partitionLengths = mergeSpillsWithFileStream(spills, outputFile, compressionCodec);
      }
      // Update the shuffle write metrics: remove the last spill file's bytes and
      // add the size of the merged output file
      writeMetrics.decShuffleBytesWritten(spills[spills.length - 1].file.length());
      writeMetrics.incShuffleBytesWritten(outputFile.length());
      return partitionLengths;
    }
  } // ...
}
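The branching in mergeSpills reduces to a small decision function. The sketch below restates it with plain booleans; `MergeStrategy` and `MergeStrategyChooser` are hypothetical names, and the parameters stand in for the configuration values and codec check read in the excerpt.

```java
// Hypothetical labels for the five outcomes of the branching in mergeSpills.
enum MergeStrategy {
    EMPTY_FILE,              // no spills: create an empty output file
    RENAME_SINGLE_SPILL,     // one spill: move it to the output location
    TRANSFER_TO,             // fast merge through NIO
    FILE_STREAM_NO_CODEC,    // fast merge through file streams, no codec
    FILE_STREAM_WITH_CODEC   // slow merge: decompress and recompress
}

final class MergeStrategyChooser {
    static MergeStrategy choose(int numSpills,
                                boolean compressionEnabled,
                                boolean codecSupportsConcatenation,
                                boolean fastMergeEnabled,
                                boolean transferToEnabled) {
        if (numSpills == 0) {
            return MergeStrategy.EMPTY_FILE;
        }
        if (numSpills == 1) {
            return MergeStrategy.RENAME_SINGLE_SPILL;
        }
        // Same condition as fastMergeIsSupported in the excerpt.
        boolean fastMergeIsSupported = !compressionEnabled || codecSupportsConcatenation;
        if (fastMergeEnabled && fastMergeIsSupported) {
            return transferToEnabled
                ? MergeStrategy.TRANSFER_TO
                : MergeStrategy.FILE_STREAM_NO_CODEC;
        }
        return MergeStrategy.FILE_STREAM_WITH_CODEC;
    }
}
```

The slow path is only taken with two or more spills when fast merge is disabled, or when compression is on and the codec cannot concatenate serialized streams.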
