Spark Source Code: Shuffle Internals (1) - ShuffleWriter


ShuffleWriter

1. Overview

This analysis is based on the Spark 2.x codebase (Scala 2.11 build).

Shuffle in Spark is a large, self-contained framework; this article focuses on the role ShuffleWriter plays inside it and on how it works internally.

2. Registering the ShuffleHandle

2.1. When the handle is registered

  • shuffleHandle is one of the fields of the wide dependency ShuffleDependency;
  • when the wide-dependency object is instantiated, it registers a handle with the shuffleManager, and the returned handle initializes the shuffleHandle field;
  • registering with the shuffleManager instantiates a concrete ShuffleHandle object;
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  val shuffleId: Int = _rdd.context.newShuffleId()

  //向shuffleManager注册handle,并返回handle初始化shuffleHandle属性
  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)
}

2.2. Registering the shuffle with the ShuffleManager

Notes:

  • in Spark, SortShuffleManager is the only implementation of ShuffleManager, so shuffle registration ultimately happens inside SortShuffleManager;
  • when it happens:
    • when the ShuffleDependency object is instantiated and its shuffleHandle field is initialized, the wide dependency registers the shuffle with the shuffleManager;

Summary:

  • BypassMergeSortShuffleHandle:
    • applies when: there is no map-side aggregation and the number of partitions is at most 200
    • effect: writes numPartitions files directly and concatenates them at the end
    • advantage: avoids serializing and deserializing the data twice to merge spilled files
    • drawback: many files are open at the same time, so more memory goes to write buffers;
  • BaseShuffleHandle:
    • applies when: neither of the other two handles can be used;
    • effect: buffers map output in deserialized form
    • characteristic: supports map-side aggregation
  • SerializedShuffleHandle:
    • applies when:
      • the serializer supports relocation of serialized objects;
      • there is no map-side aggregation;
      • the number of partitions is at most 16,777,216;
    • effect: buffers map output in serialized form
private[spark] class SortShuffleManager(conf: SparkConf) extends ShuffleManager with Logging {
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    //不是map端聚合且分区数不高于200
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // 直接写入numPartitions文件,并在最后将它们连接起来
      //这避免了进行两次序列化和反序列化以合并溢出的文件,这在正常的代码路径中会发生。缺点是一次打开多个文件,从而为缓冲区分配更多内存。
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } 
    
    //序列化器支持对象迁移、非map端聚合、分区数不大于16777216
    else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // 以序列化的形式缓冲映射输出
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // 以反序列化的形式缓冲映射输出
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }
}

2.2.1. Choosing BypassMergeSortShuffleHandle

Requirements:

  • no map-side aggregation, and the number of partitions is at most 200
  • the partition-count threshold is set by spark.shuffle.sort.bypassMergeThreshold, which defaults to 200 (a tuning example follows the code below);
private[spark] object SortShuffleWriter {
  //不是map端聚合且分区数不高于200
  def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
    // map端聚合需要排序
    if (dep.mapSideCombine) {
      false
    } else {
      //spark.shuffle.sort.bypassMergeThreshold : 默认值200
      val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
      //分区数小于spark.shuffle.sort.bypassMergeThreshold或200
      dep.partitioner.numPartitions <= bypassMergeThreshold
    }
  }
}
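Since the bypass threshold is an ordinary configuration value, it can be raised or lowered per application. A minimal sketch, assuming a local run; the threshold of 400 and the job shapes are arbitrary examples rather than recommendations, and which handle gets registered simply follows the rules above:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object BypassThresholdExample {
  def main(args: Array[String]): Unit = {
    // Raise the bypass threshold from the default 200 to 400 (example value).
    val conf = new SparkConf()
      .setAppName("bypass-threshold-example")
      .setMaster("local[2]")
      .set("spark.shuffle.sort.bypassMergeThreshold", "400")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(1 to 10000).map(i => (i % 300, i))

    // groupByKey has no map-side combine, so with 300 (<= 400) reduce partitions
    // shouldBypassMergeSort() is true and a BypassMergeSortShuffleHandle is registered.
    pairs.groupByKey(new HashPartitioner(300)).count()

    // reduceByKey sets mapSideCombine = true, so the bypass path is never taken,
    // whatever the threshold; registration falls through to the other two handles.
    pairs.reduceByKey(_ + _, 300).count()

    sc.stop()
  }
}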

2.2.2. Choosing SerializedShuffleHandle

Requirements (all three must hold):

  • the serializer supports relocation of serialized objects;
  • no map-side aggregation;
  • the number of partitions is at most 16,777,216 (see the bit-arithmetic sketch after the code below);
private[spark] object SortShuffleManager extends Logging {

  //16777216
  val MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE =
    PackedRecordPointer.MAXIMUM_PARTITION_ID + 1

  //序列化器支持对象迁移、非map端聚合、分区数不大于16777216
  def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
    val shufId = dependency.shuffleId
    val numPartitions = dependency.partitioner.numPartitions
    
    //序列化器不支持对象迁移
    if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
        s"${dependency.serializer.getClass.getName}, does not support object relocation")
      false
    } 
      
    //map端聚合
    else if (dependency.mapSideCombine) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because we need to do " +
        s"map-side aggregation")
      false
    } 
    //分区数大于16777216
    else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
        s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
      false
    } else {
      log.debug(s"Can use serialized shuffle for shuffle $shufId")
      true
    }
  }
}
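The 16,777,216 ceiling comes from the PackedRecordPointer used on the serialized path: the partition id is squeezed into 24 bits of a 64-bit pointer, so the largest representable id is 2^24 - 1, and MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE is one more than that. A quick sketch of the arithmetic (the constant names below are local to the sketch):

object PartitionLimitSketch {
  // 24 bits of the 64-bit packed pointer are reserved for the partition id.
  val PartitionIdBits = 24

  val MaximumPartitionId: Int = (1 << PartitionIdBits) - 1          // 16777215
  val MaxPartitionsForSerializedMode: Int = MaximumPartitionId + 1  // 16777216

  def main(args: Array[String]): Unit =
    println(MaxPartitionsForSerializedMode) // matches the check in canUseSerializedShuffle
}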

3. Instantiating the ShuffleWriter

3.1. When it is instantiated

When it is created and used:

  • when an executor runs a shuffle task, the work is ultimately done by ShuffleMapTask.runTask();
  • inside ShuffleMapTask.runTask(), a ShuffleWriter is instantiated and ShuffleWriter.write() writes the data out to disk;

3.1.1. Executor#run - the executor runs the task

Notes:

  • when the Executor runs a shuffle task, the actual work is delegated to ShuffleMapTask.runTask();
private[spark] class Executor(
    executorId: String,
    executorHostname: String,
    env: SparkEnv,
    userClassPath: Seq[URL] = Nil,
    isLocal: Boolean = false,
    uncaughtExceptionHandler: UncaughtExceptionHandler = new SparkUncaughtExceptionHandler)
  extends Logging {
      
    //针对shuffle任务,实例化ShuffleMapTask对象
    @volatile var task: Task[Any] = _
      
    override def run(): Unit = {
      
      //---------其他代码---------

      try {
       
        //---------其他代码---------
        val value = Utils.tryWithSafeFinally {
          
          //针对shuffle任务,执行ShuffleMapTask#runTask函数
          val res = task.run(
            taskAttemptId = taskId,
            attemptNumber = taskDescription.attemptNumber,
            metricsSystem = env.metricsSystem)
          threwException = false
          res
        }

        //---------其他代码---------
      } catch {
        //---------其他代码---------
      } finally {
        runningTasks.remove(taskId)
      }
    }
  }
  • inside ShuffleMapTask.runTask(), a ShuffleWriter is instantiated and ShuffleWriter.write() writes the data to disk;
  • when the ShuffleWriter is created, the shuffleHandle held by the stage's dependency is passed along;
private[spark] class ShuffleMapTask(
    stageId: Int,
    stageAttemptId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient private var locs: Seq[TaskLocation],
    localProperties: Properties,
    serializedTaskMetrics: Array[Byte],
    jobId: Option[Int] = None,
    appId: Option[String] = None,
    appAttemptId: Option[String] = None,
    isBarrier: Boolean = false)
  extends Task[MapStatus](stageId, stageAttemptId, partition.index, localProperties,
    serializedTaskMetrics, jobId, appId, appAttemptId, isBarrier)
  with Logging {
      
  override def runTask(context: TaskContext): MapStatus = {
    //---------其他代码---------
      
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
    
    //---------其他代码---------

    var writer: ShuffleWriter[Any, Any] = null
    try {
      val manager = SparkEnv.get.shuffleManager
      
      //从shuffleManager中获取写入器:将依赖中维护的shuffleHandle传过去
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
        
      //通过写入器将数据落地磁盘
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
        
      //关闭写入器
      writer.stop(success = true).get
    } catch {
      //---------其他代码---------
    }
  }
}

3.1.2. Instantiating the concrete ShuffleWriter

Notes:

  • UnsafeShuffleWriter, BypassMergeSortShuffleWriter and SortShuffleWriter are the subclasses of ShuffleWriter;
  • SerializedShuffleHandle, BypassMergeSortShuffleHandle and BaseShuffleHandle are the subclasses of ShuffleHandle;
  • the concrete subtype of the ShuffleHandle decides which ShuffleWriter subclass is instantiated;
private[spark] class SortShuffleManager(conf: SparkConf) extends ShuffleManager with Logging {
  override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V] = {
    numMapsForShuffle.putIfAbsent(
      handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
    val env = SparkEnv.get
    
    //rdd依赖中维护的ShuffleHandle类型,实例化对应的writer
    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        new UnsafeShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          context.taskMemoryManager(),
          unsafeShuffleHandle,
          mapId,
          context,
          env.conf)
      case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
        new BypassMergeSortShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          bypassMergeSortHandle,
          mapId,
          context,
          env.conf)
      case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
        new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
    }
  }
}

4. BypassMergeSortShuffleWriter

4.1. Instantiation

The main fields:

  • the file buffer size used when writing data to disk, 32 KB by default;
    • configurable via spark.shuffle.file.buffer
  • whether to copy file data with NIO transferTo when the temporary files are merged: true by default;
  • a FileSegment array caches the handles of the per-partition temporary files;
  • a long array caches the number of bytes written for each partition;
final class BypassMergeSortShuffleWriter<K, V> extends ShuffleWriter<K, V> {

  private static final Logger logger = LoggerFactory.getLogger(BypassMergeSortShuffleWriter.class);

  //数据写入磁盘时的文件缓存,默认32kb,通过spark.shuffle.file.buffer参数指定
  private final int fileBufferSize;
  //合并临时文件时,是否通过NIO方式复制文件数据:默认true;通过spark.file.transferTo参数指定
  private final boolean transferToEnabled;
  //分区数
  private final int numPartitions;
  private final BlockManager blockManager;
  //分区器
  private final Partitioner partitioner;
  private final ShuffleWriteMetrics writeMetrics;
  //本次shuffle的唯一标识
  private final int shuffleId;
  private final int mapId;
  private final Serializer serializer;
  //创建和维护shuffle数据的逻辑块和物理文件位置的对应关系
  private final IndexShuffleBlockResolver shuffleBlockResolver;

  //每个分区的写出器
  private DiskBlockObjectWriter[] partitionWriters;
  
  //每个分区的临时文件
  private FileSegment[] partitionWriterSegments;
  
  //文件输出状态信息
  @Nullable private MapStatus mapStatus;
  
  //每个分区的数据量
  private long[] partitionLengths;

  private boolean stopping = false;
}

4.2. write

How it works:

  • the blockManager creates one temporary file per partition, and a disk writer is built on top of each temporary file;
  • the records are iterated and each record is appended to the output stream of the temporary file of its partition;
  • the streams are flushed so the buffered data actually lands in the on-disk temporary files;
  • all the per-partition temporary files are concatenated into one large data file, and a matching index file is generated;
  • the output status (MapStatus) is recorded.

Notes:

  • one call to write() produces as many temporary files as there are partitions (one per partition); they are merged into a single data file with a single index file;
  • the index file records the offset and length of each partition's data (a small sketch of the index layout follows the code below);
final class BypassMergeSortShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    
    public void write(Iterator<Product2<K, V>> records) throws IOException {
        assert this.partitionWriters == null;

        if (!records.hasNext()) {//空记录,生成一个空index文件
            this.partitionLengths = new long[this.numPartitions];
            this.shuffleBlockResolver.writeIndexFileAndCommit(this.shuffleId, this.mapId, this.partitionLengths, (File)null);
            this.mapStatus = MapStatus$.MODULE$.apply(this.blockManager.shuffleServerId(), this.partitionLengths);
        } else {
            //初始化序列化器
            SerializerInstance serInstance = this.serializer.newInstance();
            long openStartTime = System.nanoTime();
            //初始化分区写出器数组
            this.partitionWriters = new DiskBlockObjectWriter[this.numPartitions];
            //初始化分区对应临时文件数组
            this.partitionWriterSegments = new FileSegment[this.numPartitions];

            //通过blockManager获取block的临时文件,并以此构建分区的磁盘写出器
            //一个分区对应磁盘写出器和一个临时文件
            int i;
            for(i = 0; i < this.numPartitions; ++i) {
                //通过blockManager构建block的临时文件信息:blockId,临时文件
                Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile = this.blockManager.diskBlockManager().createTempShuffleBlock();
                File file = (File)tempShuffleBlockIdPlusFile._2();
                BlockId blockId = (BlockId)tempShuffleBlockIdPlusFile._1();
                //构建当前分区的磁盘写出器:绑定临时文件、文件输出流;
                this.partitionWriters[i] = this.blockManager.getDiskWriter(blockId, file, serInstance, this.fileBufferSize, this.writeMetrics);
            }

            this.writeMetrics.incWriteTime(System.nanoTime() - openStartTime);

            //将数据逐条添加到各分区对应临时文件的输出流中
            while(records.hasNext()) {
                Product2<K, V> record = (Product2)records.next();
                K key = record._1();
                //将数据添加到临时文件输出流中
                this.partitionWriters[this.partitioner.getPartition(key)].write(key, record._2());
            }

            //将各分区从文件输出流flush到临时文件中
            for(i = 0; i < this.numPartitions; ++i) {
                //获取当前分区的磁盘写出器
                DiskBlockObjectWriter writer = this.partitionWriters[i];
                //文件flush,返回记录的偏移量、长度、临时文件
                this.partitionWriterSegments[i] = writer.commitAndGet();
                writer.close();
            }

            //获取输出数据文件
            File output = this.shuffleBlockResolver.getDataFile(this.shuffleId, this.mapId);
            //创建输出数据文件的临时文件
            File tmp = Utils.tempFileWith(output);

            try {
                //临时文件合并产生数据文件,返回临时文件长度数组
                this.partitionLengths = this.writePartitionedFile(tmp);
                //生成数据文件对应的index文件
                this.shuffleBlockResolver.writeIndexFileAndCommit(this.shuffleId, this.mapId, this.partitionLengths, tmp);
            } finally {
                if (tmp.exists() && !tmp.delete()) {
                    logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
                }

            }

            //记录数据输出状态
            this.mapStatus = MapStatus$.MODULE$.apply(this.blockManager.shuffleServerId(), this.partitionLengths);
        }
    }
}
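writeIndexFileAndCommit turns the per-partition lengths into an index file of cumulative offsets (one leading 0 plus one entry per partition), so a reducer can read partition i from the single data file as the byte range [offset(i), offset(i+1)). A minimal sketch of that bookkeeping, using nothing but the partitionLengths array produced above (the helper names are hypothetical):

object IndexFileSketch {
  /** Cumulative offsets: numPartitions + 1 entries, so partition i spans offsets(i) until offsets(i + 1). */
  def offsets(partitionLengths: Array[Long]): Array[Long] =
    partitionLengths.scanLeft(0L)(_ + _)

  def main(args: Array[String]): Unit = {
    val lengths = Array(120L, 0L, 340L, 75L)   // bytes written for each of 4 partitions
    val offs = offsets(lengths)                // Array(0, 120, 120, 460, 535)
    val p = 2
    println(s"partition $p -> bytes [${offs(p)}, ${offs(p + 1)}) of the data file") // [120, 460)
  }
}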

4.2.1. writePartitionedFile - merging the temporary files

Notes:

  • the per-partition temporary files are copied into the data file one by one, in partition order;
  • by default the copy is done with NIO transferTo;
    • controlled by the spark.file.transferTo parameter;
final class BypassMergeSortShuffleWriter<K, V> extends ShuffleWriter<K, V> {
  private long[] writePartitionedFile(File outputFile) throws IOException {
    //初始化分区数据量数组
    final long[] lengths = new long[numPartitions];
    
    if (partitionWriters == null) {
      //没有分区写出器,代表分区写出器构造没有执行,不存在数据写出,不存在临时文件
      //返回空数组
      return lengths;
    }

    //构建数据文件输出流
    final FileOutputStream out = new FileOutputStream(outputFile, true);
    final long writeStartTime = System.nanoTime();
    boolean threwException = true;
    try {
      //遍历分区
      for (int i = 0; i < numPartitions; i++) {
        //取出分区临时文件
        final File file = partitionWriterSegments[i].file();
        if (file.exists()) {
          //构建临时文件输入流
          final FileInputStream in = new FileInputStream(file);
          boolean copyThrewException = true;
          try {
            //将分区临时文件数据复制到数据文件中
            //transferToEnabled默认为true:以NIO的方式复制
            lengths[i] = Utils.copyStream(in, out, false, transferToEnabled);
            copyThrewException = false;
          } finally {
            Closeables.close(in, copyThrewException);
          }
          if (!file.delete()) {
            logger.error("Unable to delete file for partition {}", i);
          }
        }
      }
      threwException = false;
    } finally {
      Closeables.close(out, threwException);
      writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
    }
    partitionWriters = null;
    return lengths;
  }
}

5. SortShuffleWriter

5.1. Instantiation

Instantiating a SortShuffleWriter is straightforward; it has few fields to initialize.

private[spark] class SortShuffleWriter[K, V, C](
    shuffleBlockResolver: IndexShuffleBlockResolver,
    handle: BaseShuffleHandle[K, V, C],
    mapId: Int,
    context: TaskContext)
  extends ShuffleWriter[K, V] with Logging {

  //依赖关系
  private val dep = handle.dependency

  private val blockManager = SparkEnv.get.blockManager

  //排序器
  private var sorter: ExternalSorter[K, V, _] = null

  private var stopping = false

  private var mapStatus: MapStatus = null

  private val writeMetrics = context.taskMetrics().shuffleWriteMetrics
    
}

5.2. write

How it works:

  • build an ExternalSorter;
  • feed all records into the sorter;
    • the data is buffered in memory in a PartitionedAppendOnlyMap or a PartitionedPairBuffer;
    • when the in-memory buffer hits the spill condition, its entire contents are spilled to disk into one temporary file;
  • write the sorter's data out to a single output file;
  • build the index file for that output file;
  • record the output status.

Additional notes:

  • within the output file, data is written in partition-id order;
  • if map-side aggregation is required, records are aggregated first and only then written out;
  • one call to SortShuffleWriter.write() produces one data file on disk plus one matching index file (see the file-name sketch after the code below);
private[spark] class SortShuffleWriter[K, V, C](
    shuffleBlockResolver: IndexShuffleBlockResolver,
    handle: BaseShuffleHandle[K, V, C],
    mapId: Int,
    context: TaskContext)
  extends ShuffleWriter[K, V] with Logging {
  
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    
    //构建排序器
    sorter = if (dep.mapSideCombine) {
      //map端聚合,需要定义排序器的聚合器、排序方式
      new ExternalSorter[K, V, C](
        context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
    } else {
      // 在这种情况下,我们既没有向排序器传递聚合器,也没有向排序器传递排序器,因为我们不关心键是否在每个分区中排序;如果正在运行的操作是sortByKey,则将在reduce端执行
      new ExternalSorter[K, V, V](
        context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
    }
      
    //数据添加到排序器:满足溢写条件情况下,缓存数据将会溢写到磁盘;否则,在内存中缓存
    sorter.insertAll(records)

    //获取输出数据文件
    val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
    //创建输出文件的临时文件
    val tmp = Utils.tempFileWith(output)
    try {
      //组装blockId
      val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
      //临时文件合并数据到数据文件:根据参数进行排序聚合
      val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
      //创建数据文件的index文件
      shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
      //记录数据输出状态
      mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
      }
    }
  }
}
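So each map task ends up with exactly one data file and one index file. As an illustration only (the real files live in sub-directories chosen by DiskBlockManager), the names follow the ShuffleDataBlockId / ShuffleIndexBlockId convention with the reduce id fixed at the NOOP value 0; a small sketch of what that looks like for shuffleId = 3 and two map tasks:

object ShuffleFileNameSketch {
  // Mirrors the shuffle_<shuffleId>_<mapId>_<reduceId>.data / .index naming convention.
  def dataFile(shuffleId: Int, mapId: Int): String  = s"shuffle_${shuffleId}_${mapId}_0.data"
  def indexFile(shuffleId: Int, mapId: Int): String = s"shuffle_${shuffleId}_${mapId}_0.index"

  def main(args: Array[String]): Unit = {
    for (mapId <- 0 to 1) {
      // e.g. shuffle_3_0_0.data + shuffle_3_0_0.index, shuffle_3_1_0.data + shuffle_3_1_0.index
      println(s"${dataFile(3, mapId)}  +  ${indexFile(3, mapId)}")
    }
  }
}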

5.2.1. ExternalSorter#insertAll - adding data to the ExternalSorter

In-memory buffering:

  • PartitionedAppendOnlyMap
    • used when map-side aggregation is required;
    • values are aggregated under the key (partitionId, key);
    • each entry is ((partitionId, key), aggregated value)
  • PartitionedPairBuffer
    • used when there is no map-side aggregation;
    • each entry is simply (key, value);

Steps:

  • iterate over the records and, depending on whether map-side aggregation is needed, buffer each one into the corresponding collection;
    • with map-side aggregation, records go into the PartitionedAppendOnlyMap, where values are merged under the key (partitionId, key);
    • without map-side aggregation, records are appended to the PartitionedPairBuffer as (key, value);
  • after every record, check whether the in-memory buffer needs to be spilled to disk.

Additional notes:

  • if no spill ever happens, all the data stays in the in-memory collection;
  • so at any point the records held by the ExternalSorter are in one of three states:
    • entirely in the in-memory collection;
    • entirely in spill files on disk, referenced from the sorter's spills field;
    • partly in memory and partly in spill files on disk;
private[spark] class ExternalSorter[K, V, C](
    context: TaskContext,
    aggregator: Option[Aggregator[K, V, C]] = None,
    partitioner: Option[Partitioner] = None,
    ordering: Option[Ordering[K]] = None,
    serializer: Serializer = SparkEnv.get.serializer)
  extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
  with Logging {
      
  @volatile private var map = new PartitionedAppendOnlyMap[K, C]
  @volatile private var buffer = new PartitionedPairBuffer[K, C]
      
  def insertAll(records: Iterator[Product2[K, V]]): Unit = {
    // 聚合器不为None,则需要map端聚合
    val shouldCombine = aggregator.isDefined

    //map端预聚合,数据以PartitionedAppendOnlyMap形式缓存
    if (shouldCombine) {
      // Combine values in-memory first using our AppendOnlyMap
      val mergeValue = aggregator.get.mergeValue
      val createCombiner = aggregator.get.createCombiner
      var kv: Product2[K, V] = null
      //数据聚合:已有数据,就聚合;没有数据就新建一个数据;
      val update = (hadValue: Boolean, oldValue: C) => {
        if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
      }
        
      //遍历记录数据
      while (records.hasNext) {
        //自上次溢写后,本次从记录数据中读取数据的数量+1
        addElementsRead()
        //从记录中读取一条数据
        kv = records.next()
        //数据根据(分区,key)进行聚合
        map.changeValue((getPartition(kv._1), kv._1), update)
        
        //判断缓存数据是否需要溢写到磁盘
        maybeSpillCollection(usingMap = true)
      }
    } else {
      // 非map端聚合,数据以PartitionedPairBuffer形式缓存
      while (records.hasNext) {
        addElementsRead()
        val kv = records.next()
        buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
        maybeSpillCollection(usingMap = false)
      }
    }
  }      
}
5.2.1.1. ExternalSorter#maybeSpillCollection - deciding whether to spill the in-memory buffer

Spill conditions (either one triggers a spill):

  • the number of elements read since the last spill is a multiple of 32, the estimated size of the collection has reached the current threshold, and, after trying to acquire more memory, the estimated size is still at least the (possibly expanded) threshold;
  • the number of elements read exceeds the force-spill threshold (Integer.MAX_VALUE, 0x7fffffff, by default).

Spilling:

  • the actual spill is implemented by ExternalSorter#spill;
  • each spill produces one temporary spill file;
  • all spill files produced for the data fed into the ExternalSorter are tracked in its spills field.

The collection's memory threshold:

  • initial threshold
    • set by spark.shuffle.spill.initialMemoryThreshold;
    • 5 MB by default;
  • threshold expansion
    • amount requested each time: 2 × current collection size − current threshold (the threshold grows by however much memory is actually granted);
    • when it happens: whenever the current collection size reaches the current threshold, on a multiple-of-32 element count (a worked numeric example follows the code below).

Additional note:

  • after the buffered data is spilled to disk, the in-memory collection is rebuilt from scratch;
private[spark] class ExternalSorter[K, V, C](
    context: TaskContext,
    aggregator: Option[Aggregator[K, V, C]] = None,
    partitioner: Option[Partitioner] = None,
    ordering: Option[Ordering[K]] = None,
    serializer: Serializer = SparkEnv.get.serializer)
  extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
  with Logging {
      
  @volatile private var map = new PartitionedAppendOnlyMap[K, C]
  @volatile private var buffer = new PartitionedPairBuffer[K, C]
      
  //到目前为止观察到的内存中数据结构的峰值大小,以字节为单位
  private var _peakMemoryUsedBytes: Long = 0L
  def peakMemoryUsedBytes: Long = _peakMemoryUsedBytes
      
  private def maybeSpillCollection(usingMap: Boolean): Unit = {
    var estimatedSize = 0L
    if (usingMap) {//map端聚合
      //估计集合的当前大小(以字节为单位)
      estimatedSize = map.estimateSize()
      //内存数据溢写到磁盘
      if (maybeSpill(map, estimatedSize)) {
        //发生数据溢写后,重新构建缓存集合
        map = new PartitionedAppendOnlyMap[K, C]
      }
    } else {
      //估计集合的当前大小(以字节为单位)
      estimatedSize = buffer.estimateSize()
      //内存数据溢写到磁盘
      if (maybeSpill(buffer, estimatedSize)) {
        //发生数据溢写后,重新构建缓存集合
        buffer = new PartitionedPairBuffer[K, C]
      }
    }

    //更新内存中数据结构的峰值大小
    if (estimatedSize > _peakMemoryUsedBytes) {
      _peakMemoryUsedBytes = estimatedSize
    }
  }     
}


private[spark] abstract class Spillable[C](taskMemoryManager: TaskMemoryManager)
  extends MemoryConsumer(taskMemoryManager) with Logging {
      
  // 自上次溢出后从输入中读取的元素数
  protected def elementsRead: Int = _elementsRead
  private[this] var _elementsRead = 0
      
  // 集合大小的初始阈值:默认5M
  private[this] val initialMemoryThreshold: Long =
    SparkEnv.get.conf.getLong("spark.shuffle.spill.initialMemoryThreshold", 5 * 1024 * 1024)
      
  //集合的字节大小阈值
  //为了避免大量的小溢出,初始化该值为数量级> 0
  @volatile private[this] var myMemoryThreshold = initialMemoryThreshold
      
  //溢写的总字节数
  @volatile private[this] var _memoryBytesSpilled = 0L
      
  //发生溢写的次数
  private[this] var _spillCount = 0
      
  //如果需要,将当前内存中的收集信息溢出到磁盘。试图在溢出之前获取更多内存
  protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
    var shouldSpill = false
    //读取数据量是32的倍数,且集合数据内存占用量不小于集合设置的阈值
    if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
      //集合内存阈值扩充:2倍当前使用量 - 原有阈值
      val amountToRequest = 2 * currentMemory - myMemoryThreshold
      val granted = acquireMemory(amountToRequest)
      myMemoryThreshold += granted
        
      // 如果申请到的不够,map或buffer预估占用内存量还是大于阈值,确定溢写
      shouldSpill = currentMemory >= myMemoryThreshold
    }
      
    //如果上面判定不需要溢写,但读取的记录总数比Integer.MAX_VALUE大,也还是得溢写
    //numElementsForceSpillThreshold:Integer.MAX_VALUE   0x7fffffff
    shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
    
      //溢写实施
    if (shouldSpill) {
      //溢写次数+1
      _spillCount += 1
      logSpillage(currentMemory)
        
      //溢写
      spill(collection)
        
      //清空数据读取量
      _elementsRead = 0
        
      //累加溢写总字节数
      _memoryBytesSpilled += currentMemory
        
      //释放内存
      releaseMemory()
    }
    
    //返回溢写判断结果
    shouldSpill
  }
}
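A worked example of the threshold arithmetic under the default settings (5 MB initial threshold); the numbers simply replay maybeSpill, assuming for illustration that the memory manager grants the full request:

object SpillThresholdSketch {
  def main(args: Array[String]): Unit = {
    val initialThreshold = 5L * 1024 * 1024          // spark.shuffle.spill.initialMemoryThreshold
    var myMemoryThreshold = initialThreshold

    // Suppose the multiple-of-32 check fires while the map is estimated at 6 MB.
    val currentMemory = 6L * 1024 * 1024
    val amountToRequest = 2 * currentMemory - myMemoryThreshold   // 7 MB requested
    val granted = amountToRequest                                 // assume it is fully granted
    myMemoryThreshold += granted                                  // threshold is now 12 MB

    // 6 MB < 12 MB, so no spill yet: the collection keeps growing in memory.
    val shouldSpill = currentMemory >= myMemoryThreshold
    println(s"threshold=${myMemoryThreshold / 1024 / 1024}MB shouldSpill=$shouldSpill")

    // Had the executor granted nothing, the threshold would have stayed at 5 MB and
    // 6 MB >= 5 MB would have forced the whole in-memory collection to spill.
  }
}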
5.2.1.2. ExternalSorter#spill - spilling the in-memory buffer to disk

Steps:

  • sort the collection's data with the sort comparator and obtain an iterator over the sorted data;
  • spill the sorted data to disk and return the spill file:
    • create a temporary file via the diskBlockManager;
    • create a disk writer for the temporary file via the blockManager;
    • iterate over the sorted data, appending records to the writer one by one while counting how many have been written;
    • every 10,000 records, flush the writer so the batch lands in the temporary file;
    • after the iteration, flush whatever is left (fewer than 10,000 records);
    • return the temporary file;
  • append the spill file to the list of spill files.

Additional note:

  • the collection is written to the disk file in batches; the batch size is set by spark.shuffle.spill.batchSize and defaults to 10,000;
private[spark] class ExternalSorter[K, V, C](
    context: TaskContext,
    aggregator: Option[Aggregator[K, V, C]] = None,
    partitioner: Option[Partitioner] = None,
    ordering: Option[Ordering[K]] = None,
    serializer: Serializer = SparkEnv.get.serializer)
  extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
  with Logging {
      
  private val spills = new ArrayBuffer[SpilledFile]
      
  private val serInstance = serializer.newInstance()

  //文件缓存大小,默认32K
  private val fileBufferSize = conf.getSizeAsKb("spark.shuffle.file.buffer", "32k").toInt * 1024
  
  //批处理记录数量,默认10000
  private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 10000)
     
  //记录每个分区的数据量
  val elementsPerPartition = new Array[Long](numPartitions)  
  
      
  //集合数据溢写到磁盘
  override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
     
    //对集合中的数据根据排序比较器进行排序,获取排序后数据迭代器
    //有排序比较器,对分区内数据根据key升序排序
    //没有排序比较器,根据分区进行升序排序
    val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
    
    //数据溢写到磁盘
    val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
      
    //添加到临时文件集合
    spills += spillFile
  }  
      
  //排序比较器:分区内对key进行升序排列    
  private val keyComparator: Comparator[K] = ordering.getOrElse(new Comparator[K] {
    override def compare(a: K, b: K): Int = {
      val h1 = if (a == null) 0 else a.hashCode()
      val h2 = if (b == null) 0 else b.hashCode()
      if (h1 < h2) -1 else if (h1 == h2) 0 else 1
    }
  })
  
  //获取排序比较器
  private def comparator: Option[Comparator[K]] = {
    if (ordering.isDefined || aggregator.isDefined) {
      Some(keyComparator)
    } else {
      None
    }
  }
  
  //数据溢写到磁盘
  private[this] def spillMemoryIteratorToDisk(inMemoryIterator: WritablePartitionedIterator)
      : SpilledFile = {
    // 因为溢写文件在shuffle过程中会被读取,因此它们的压缩不由spill相关参数控制
    // 创建一个临时块
    val (blockId, file) = diskBlockManager.createTempShuffleBlock()

    //记录每次溢写的数据量
    var objectsWritten: Long = 0
    val spillMetrics: ShuffleWriteMetrics = new ShuffleWriteMetrics
    
    //获取临时文件块磁盘写入器
    val writer: DiskBlockObjectWriter =
      blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, spillMetrics)

    // List of batch sizes (bytes) in the order they are written to disk
    val batchSizes = new ArrayBuffer[Long]

    // How many elements we have in each partition
    val elementsPerPartition = new Array[Long](numPartitions)

    //文件flush,返回记录的偏移量、长度、临时文件
    def flush(): Unit = {
      //将磁盘写出器序列化流中数据flush到临时文件中
      val segment = writer.commitAndGet()
      batchSizes += segment.length
      _diskBytesSpilled += segment.length
      objectsWritten = 0
    }

    var success = false
    try {
      //记录数据遍历
      while (inMemoryIterator.hasNext) {
        val partitionId = inMemoryIterator.nextPartition()
        require(partitionId >= 0 && partitionId < numPartitions,
          s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")
        
        //数据逐条添加到磁盘写出器序列化流中
        inMemoryIterator.writeNext(writer)
          
        //累加记录所附分区的数据量
        elementsPerPartition(partitionId) += 1
        
        //累加当次溢写的数据量
        objectsWritten += 1

        //每10000条数据落地到磁盘文件一次
        if (objectsWritten == serializerBatchSize) {
          flush()
        }
      }
      
      //遍历完毕,将剩余不足10000的数据的落地到磁盘
      if (objectsWritten > 0) {
        flush()
      } else {
        writer.revertPartialWritesAndClose()
      }
      success = true
    } finally {
      if (success) {
        writer.close()
      } else {
        // 写入失败:回滚部分写入并关闭写出器,删除临时文件
        writer.revertPartialWritesAndClose()
        if (file.exists()) {
          if (!file.delete()) {
            logWarning(s"Error deleting ${file}")
          }
        }
      }
    }

    //返回溢写文件
    SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition)
  }
}

5.2.2. ExternalSorter#writePartitionedFile - producing the final data file

Purpose:

  • merge the spill files and the in-memory data into one complete data file.

Details:

  • if no spill has happened
    • sort the in-memory data by partition id (and key, if required);
    • write the sorted data to the output file partition by partition (one flush per partition);
  • if spills have happened
    • merge the data in the on-disk spill files with the in-memory data, partition by partition;
    • write the merged data to the output file partition by partition (one flush per partition).
private[spark] class ExternalSorter[K, V, C](
    context: TaskContext,
    aggregator: Option[Aggregator[K, V, C]] = None,
    partitioner: Option[Partitioner] = None,
    ordering: Option[Ordering[K]] = None,
    serializer: Serializer = SparkEnv.get.serializer)
  extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
  with Logging {
      
  def writePartitionedFile(
      blockId: BlockId,
      outputFile: File): Array[Long] = {

    // Track location of each range in the output file
    val lengths = new Array[Long](numPartitions)
    
    //通过blockManager获取输出文件写入器
    val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
      context.taskMetrics().shuffleWriteMetrics)

    if (spills.isEmpty) {//没有发生溢写
      // 确定数据在内存中的缓存形式
      val collection = if (aggregator.isDefined) map else buffer
      //数据排序:根据分区id和key排序
      val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
      //排序后的数据遍历:将数据根据分区依次写入输出文件中
      while (it.hasNext) {
        val partitionId = it.nextPartition()
        //同一个分区的数据依次添加到写出器序列化流中:同一个分区的数据在一起
        while (it.hasNext && it.nextPartition() == partitionId) {
          it.writeNext(writer)
        }
        
        //序列化流中数据flush到文件,返回记录的偏移量、长度、临时文件
        val segment = writer.commitAndGet()
        
        //缓存分区与分区数据量
        lengths(partitionId) = segment.length
      }
    } else {//有发生数据溢写
      // 溢写文件和缓存数据合并,合并后再根据分区依次写入输出文件中
      for ((id, elements) <- this.partitionedIterator) {
        if (elements.hasNext) {
          //将分区中的数据依次添加到写出器序列化流中
          for (elem <- elements) {
            writer.write(elem._1, elem._2)
          }
            
          //序列化流中数据flush到文件,返回记录的偏移量、长度、临时文件
          val segment = writer.commitAndGet()
            
          //缓存分区与分区数据量
          lengths(id) = segment.length
        }
      }
    }

    writer.close()
    context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)
    context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)
    context.taskMetrics().incPeakExecutionMemory(peakMemoryUsedBytes)

    lengths
  }
}
5.2.2.1. ExternalSorter#partitionedIterator - merging on-disk and in-memory data

Details:

  • read both the on-disk spill files and the in-memory buffer;
  • iterate over the partitions, merging the spill-file data and the in-memory data belonging to the same partition;
  • apply aggregation / ordering to the merged stream:
    • if an aggregator is defined: perform partial aggregation across the spills using the aggregator's merge function, then sort by key if an ordering is defined;
    • if no aggregator but an ordering is defined: merge-sort the elements instead of combining them;
    • if neither is defined: return the merged iterators as-is;
  • return an iterator of (partition id, iterator over that partition's data).
private[spark] class ExternalSorter[K, V, C](
    context: TaskContext,
    aggregator: Option[Aggregator[K, V, C]] = None,
    partitioner: Option[Partitioner] = None,
    ordering: Option[Ordering[K]] = None,
    serializer: Serializer = SparkEnv.get.serializer)
  extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
  with Logging {
      
  def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {
    val usingMap = aggregator.isDefined
    val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
    if (spills.isEmpty) {//未发生溢写
      //未定义排序规则
      if (!ordering.isDefined) {
        // 只按分区ID排序,而不是key
        groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
      } else {//定义排序规则
        // 根据分区ID和key进行排序
        groupByPartition(destructiveIterator(
          collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
      }
    } else {//发生溢写
      //合并溢出的和内存中的数据
      merge(spills, destructiveIterator(
        collection.partitionedDestructiveSortedIterator(comparator)))
    }
  }
   
  //数据合并
  private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
      : Iterator[(Int, Iterator[Product2[K, C]])] = {
     
    //从磁盘读取溢写的临时文件
    val readers = spills.map(new SpillReader(_))
    //获取内存中缓存的数据
    val inMemBuffered = inMemory.buffered
          
    //遍历分区
    (0 until numPartitions).iterator.map { p =>
      //获取内存中当前分区数据  
      val inMemIterator = new IteratorForPartition(p, inMemBuffered)
      //将磁盘文件数据和内存缓存数据合流
      val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
      if (aggregator.isDefined) {//定义了聚合器
        // 跨分区执行部分聚合:根据聚合器定义聚合value,最后按照key排序
        (p, mergeWithAggregation(
          iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
      } else if (ordering.isDefined) {//没有聚合器,但是定义了排序器
        // 对元素进行排序,而不是合并它们
        (p, mergeSort(iterators, ordering.get))
      } else {//没有定义聚合器和排序器
        //返回合并后的结果
        (p, iterators.iterator.flatten)
      }
    }
  }
}

6. UnsafeShuffleWriter

6.1. Instantiation

Notes:

  • an UnsafeShuffleWriter is created directly through its constructor;
  • UnsafeShuffleWriter requires the stage's partition count to be at most 16,777,216;
  • the constructor builds the external sorter with an initial pointer-buffer size of 4096 entries;
  • it also allocates the serialization buffer, 1 MB in size.
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    private static final Logger logger = LoggerFactory.getLogger(UnsafeShuffleWriter.class);
    private static final ClassTag<Object> OBJECT_CLASS_TAG;
    @VisibleForTesting
    static final int DEFAULT_INITIAL_SORT_BUFFER_SIZE = 4096;
    static final int DEFAULT_INITIAL_SER_BUFFER_SIZE = 1048576;
    private final BlockManager blockManager;
    private final IndexShuffleBlockResolver shuffleBlockResolver;
    private final TaskMemoryManager memoryManager;
    private final SerializerInstance serializer;
    private final Partitioner partitioner;
    private final ShuffleWriteMetrics writeMetrics;
    private final int shuffleId;
    private final int mapId;
    private final TaskContext taskContext;
    private final SparkConf sparkConf;
    private final boolean transferToEnabled;
    private final int initialSortBufferSize;
    private final int inputBufferSizeInBytes;
    private final int outputBufferSizeInBytes;
    @Nullable
    private MapStatus mapStatus;
    @Nullable
    private ShuffleExternalSorter sorter;
    private long peakMemoryUsedBytes = 0L;
    private UnsafeShuffleWriter.MyByteArrayOutputStream serBuffer;
    private SerializationStream serOutputStream;
    private boolean stopping = false;

    public UnsafeShuffleWriter(BlockManager blockManager, IndexShuffleBlockResolver shuffleBlockResolver, TaskMemoryManager memoryManager, SerializedShuffleHandle<K, V> handle, int mapId, TaskContext taskContext, SparkConf sparkConf) throws IOException {
        
        int numPartitions = handle.dependency().partitioner().numPartitions();
        //分区数不能大于16777216
        if (numPartitions > SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE()) {
            throw new IllegalArgumentException("UnsafeShuffleWriter can only be used for shuffles with at most " + SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE() + " reduce partitions");
        } else {
            this.blockManager = blockManager;
            this.shuffleBlockResolver = shuffleBlockResolver;
            this.memoryManager = memoryManager;
            this.mapId = mapId;
            ShuffleDependency<K, V, V> dep = handle.dependency();
            this.shuffleId = dep.shuffleId();
            this.serializer = dep.serializer().newInstance();
            this.partitioner = dep.partitioner();
            this.writeMetrics = taskContext.taskMetrics().shuffleWriteMetrics();
            this.taskContext = taskContext;
            this.sparkConf = sparkConf;
            this.transferToEnabled = sparkConf.getBoolean("spark.file.transferTo", true);
            
            //排序器初始化缓存大小:4096
            this.initialSortBufferSize = sparkConf.getInt("spark.shuffle.sort.initialBufferSize", 4096);
            //输入缓存大小:默认32k
            this.inputBufferSizeInBytes = (int)(Long)sparkConf.get(package$.MODULE$.SHUFFLE_FILE_BUFFER_SIZE()) * 1024;
            //输出缓存大小:默认32k
            this.outputBufferSizeInBytes = (int)(Long)sparkConf.get(package$.MODULE$.SHUFFLE_UNSAFE_FILE_OUTPUT_BUFFER_SIZE()) * 1024;
            this.open();
        }
    }
    
    private void open() {
        assert this.sorter == null;

        //初始化排序器,指定缓存初始化大小4096
        this.sorter = new ShuffleExternalSorter(this.memoryManager, this.blockManager, this.taskContext, this.initialSortBufferSize, this.partitioner.numPartitions(), this.sparkConf, this.writeMetrics);
        
        //初始化系列化缓存,指定大小1M
        this.serBuffer = new UnsafeShuffleWriter.MyByteArrayOutputStream(1048576);
        
        //初始化序列化流
        this.serOutputStream = this.serializer.serializeStream(this.serBuffer);
    }
}

6.2. write

Notes:

  • UnsafeShuffleWriter has two overloaded write functions; both end up in write(scala.collection.Iterator<Product2<K, V>> records);
  • first, every record from the iterator is added to the sorter;
    • whenever the sorter hits its spill condition, the buffered data is spilled into a temporary file;
  • then the sorter's data is written out to one output file;
    • this produces one output file plus one matching index file;
  • finally, the sorter's resources are released.
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    
  public void write(Iterator<Product2<K, V>> records) throws IOException {
    //将java迭代器转为scala迭代器
    write(JavaConverters.asScalaIteratorConverter(records).asScala());
  }
  
  public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
    
    boolean success = false;
    try {
       
      //遍历记录,逐条将记录添加到排序器中
      while (records.hasNext()) {
        insertRecordIntoSorter(records.next());
      }
      //合并临时文件为一个整体文件,然后删除临时文件
      closeAndWriteOutput();
      success = true;
    } finally {
      if (sorter != null) {
        try {
          //释放资源
          sorter.cleanupResources();
        } catch (Exception e) {
          // Only throw this error if we won't be masking another
          // error.
          if (success) {
            throw e;
          } else {
            logger.error("In addition to a failure during writing, we failed during " +
                         "cleanup.", e);
          }
        }
      }
    }
  }
}

6.2.1. insertRecordIntoSorter - adding a record to the sorter

Notes:

  • first, the record is written into the serialization stream;
    • the stream buffers data in a MyByteArrayOutputStream, 1 MB by default;
    • the stream's counter tracks how many bytes are buffered;
  • then the serialized bytes are handed to the sorter.
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    
  private ShuffleExternalSorter sorter;
    
  //序列化缓存:默认大小1M
  private MyByteArrayOutputStream serBuffer;
  //序列化流
  private SerializationStream serOutputStream;
    
  void insertRecordIntoSorter(Product2<K, V> record) throws IOException {
    assert(sorter != null);
    final K key = record._1();
    final int partitionId = partitioner.getPartition(key);
      
    //重置序列化缓存计数器:重置为0
    serBuffer.reset();
      
    //将key、value写入序列化流中
    serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
    serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
    serOutputStream.flush();

    //确保序列化缓存中有数据
    final int serializedRecordSize = serBuffer.size();
    assert (serializedRecordSize > 0);

    //将序列化数据添加到排序器中
    sorter.insertRecord(
      serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
  }
}
6.2.1.1. ShuffleExternalSorter#insertRecord

Steps:

  • 1. if the buffered data has reached the spill condition, spill it to disk;
  • 2. if the in-memory sorter's pointer array needs to grow, expand it;
  • 3. acquire a new page for the data if necessary;
  • 4. compute the address at which the record will be stored;
  • 5. write the record length into the page;
  • 6. copy the record bytes into the page;
  • 7. store the record's address, together with its partition id, in the in-memory sorter.

Summary:

  • the record bytes live in the external sorter's list of pages (MemoryBlock chunks);
  • the record addresses and partition ids live in the in-memory sorter's pointer array;
  • spill condition: the number of records in the in-memory sorter exceeds the force-spill threshold (Integer.MAX_VALUE);
  • each spill produces one temporary file.
final class ShuffleExternalSorter extends MemoryConsumer {
    
  	public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId) throws IOException {
        
        assert this.inMemSorter != null;

        //内存排序器中数据量 >= Integer.MAX_VALUE
        if (this.inMemSorter.numRecords() >= this.numElementsForSpillThreshold) {
            logger.info("Spilling data because number of spilledRecords crossed the threshold " + this.numElementsForSpillThreshold);
            
            //数据溢写到磁盘
            this.spill();
        }

        //内存排序器中存储数据的缓存扩容
        this.growPointerArrayIfNecessary();
        
        //数据对齐,确定数据长度
        int uaoSize = UnsafeAlignedOffset.getUaoSize();
        int required = length + uaoSize;
        
        //申请新的page存储数据
        this.acquireNewPageIfNecessary(required);

        assert this.currentPage != null;

        //获取当前page中存储数据的对象
        Object base = this.currentPage.getBaseObject();
        
        //获取数据存储地址
        long recordAddress = this.taskMemoryManager.encodePageNumberAndOffset(this.currentPage, this.pageCursor);
        
        //将数据长度添加到page
        UnsafeAlignedOffset.putSize(base, this.pageCursor, length);
        
        // 调整下page指针的内存地址
        this.pageCursor += (long)uaoSize;
        
        // 再将数据复制到page中
        Platform.copyMemory(recordBase, recordOffset, base, this.pageCursor, (long)length);
        
        // 调整下page光标位置
        this.pageCursor += (long)length;
        
        // 将数据在page中的存储地址和分区id记录到内存排序器中
        this.inMemSorter.insertRecord(recordAddress, partitionId);
    }
}
6.2.1.1.1. ShuffleExternalSorter#spill - spilling the buffer to disk

Notes:

  • check whether the spill condition is met;
  • spill the data into a temporary file; one spill produces one temporary file;
    • sorting is done by ordering the record addresses in the in-memory sorter by partition id, ascending;
    • so within the spill file, records of the same partition are contiguous;
  • release the memory;
  • reset the in-memory sorter.
final class ShuffleExternalSorter extends MemoryConsumer {
    
    //从父类MemoryConsumer继承过来的
	public void spill() throws IOException {
        this.spill(9223372036854775807L, this);
    }
    
    public long spill(long size, MemoryConsumer trigger) throws IOException {
        //要求内存排序器缓存数据量 > 0
        if (trigger == this && this.inMemSorter != null && this.inMemSorter.numRecords() != 0) {
            logger.info("Thread {} spilling sort data of {} to disk ({} {} so far)", new Object[]{Thread.currentThread().getId(), Utils.bytesToString(this.getMemoryUsage()), this.spills.size(), this.spills.size() > 1 ? " times" : " time"});
            
            //排序并写入临时文件
            this.writeSortedFile(false);
            
            //释放内存资源
            long spillSize = this.freeMemory();
            
            //重置内存排序器
            this.inMemSorter.reset();
            
            //上报溢出切片文件大小
            this.taskContext.taskMetrics().incMemoryBytesSpilled(spillSize);
            
            return spillSize;
        } else {
            return 0L;
        }
    }
}
6.2.1.1.1.1. writeSortedFile - spilling the data into a temporary file

Notes:

  • the record addresses held by the in-memory sorter are sorted ascending by the partition id embedded in each address, using RadixSort;
    • the partition id occupies bytes 5-7 (three bytes) of each packed address;
    • three bytes are 24 bits, so up to 2^24 partition ids can be represented;
  • a temporary spill file is obtained from the blockManager;
  • a disk writer is obtained from the blockManager;
  • the sorted addresses are iterated, and the record behind each address is written to the temporary file:
    • the partition id is read out of the address;
    • when the partition changes, the previous partition's data is flushed to the file and its length recorded in the SpillInfo object;
    • the address is used to locate the page, in the external sorter's page list, that holds the record;
    • the record bytes are copied from the page into the writer's serialization stream:
      • each pass copies at most one write-buffer's worth of bytes from the page into the write buffer;
      • the write buffer is then written to the serialization stream;
      • the write-buffer size is controlled by spark.shuffle.spill.diskWriteBufferSize and defaults to 1 MB;
    • once the whole record has been copied, the writer is told that one full record has been written;
  • after the loop, the remaining data is flushed from the serialization stream into the temporary file;
  • the writer is closed;
  • the last partition's length is stored in the SpillInfo object, which is then appended to the list of spills;
  • write metrics are reported.

Summary:

  • sorting is performed on the record addresses cached by the in-memory sorter;
  • each sorted address is used to find the page, in the external sorter's page list, that stores the record;
  • records are written to the temporary file in the order of the sorted addresses;
  • because the addresses are sorted by partition id, records of the same partition end up contiguous in the spill file;
  • each record is first copied into the write buffer (1 MB by default) and then into the serialization stream; a partition's data is flushed to the file in one go.

final class ShuffleExternalSorter extends MemoryConsumer {
    
    private void writeSortedFile(boolean isLastFile) {
        ShuffleWriteMetrics writeMetricsToUse;
        
        if (isLastFile) {//非切分文件
            //上报指标
            writeMetricsToUse = this.writeMetrics;
        } else {//切分文件
            //构建指标统计器
            writeMetricsToUse = new ShuffleWriteMetrics();
        }

        //数据排序
        ShuffleSorterIterator sortedRecords = this.inMemSorter.getSortedIterator();
        
        //构建磁盘写出缓存:通过spark.shuffle.spill.diskWriteBufferSize控制大小
        byte[] writeBuffer = new byte[this.diskWriteBufferSize];
        
        //从blockManager获取临时溢写文件信息
        Tuple2<TempShuffleBlockId, File> spilledFileInfo = this.blockManager.diskBlockManager().createTempShuffleBlock();
        File file = (File)spilledFileInfo._2();
        TempShuffleBlockId blockId = (TempShuffleBlockId)spilledFileInfo._1();
        
        //构建溢写信息对象
        SpillInfo spillInfo = new SpillInfo(this.numPartitions, file, blockId);
        SerializerInstance ser = DummySerializerInstance.INSTANCE;
        
        //从blockManager中获取磁盘写出器
        DiskBlockObjectWriter writer = this.blockManager.getDiskWriter(blockId, file, ser, this.fileBufferSizeBytes, writeMetricsToUse);
        
        //初始化当前分区id
        int currentPartition = -1;
        
        int uaoSize = UnsafeAlignedOffset.getUaoSize();

        //遍历数据存储地址
        while(sortedRecords.hasNext()) {
            
            //将地址数据加载到packedRecordPointer中
            sortedRecords.loadNext();
            //从地址数据中获取对应分区id
            int partition = sortedRecords.packedRecordPointer.getPartitionId();

            assert partition >= currentPartition;

            //分区号不同:代表上一个分区数据已经全部添加到写出器序列化流中
            if (partition != currentPartition) {
                //currentPartition != -1:不是处理第一个分区的数据
                if (currentPartition != -1) {
                    //将上一个分区的数据从序列化流中flush到临时文件中
                    FileSegment fileSegment = writer.commitAndGet();
                    //将分区数据量缓存到溢写信息对象中
                    spillInfo.partitionLengths[currentPartition] = fileSegment.length();
                }

                //更新当前分区id
                currentPartition = partition;
            }

            //根据内存排序器缓存的数据地址找到外部排序器通过链表缓存的page
            long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
            Object recordPage = this.taskMemoryManager.getPage(recordPointer);
            
            //通过数据在page中的偏移量,计算数据长度
            long recordOffsetInPage = this.taskMemoryManager.getOffsetInPage(recordPointer);
            int dataRemaining = UnsafeAlignedOffset.getSize(recordPage, recordOffsetInPage);

            //遍历数据
            int toTransfer;
            for(long recordReadPosition = recordOffsetInPage + (long)uaoSize; dataRemaining > 0; dataRemaining -= toTransfer) {
                //计算每次处理的数据量:在磁盘写出缓存大小和数据大小间取小
                toTransfer = Math.min(this.diskWriteBufferSize, dataRemaining);
                
                //根据确定的处理数据量,从page中将数据复制到写出缓存器中
                Platform.copyMemory(recordPage, recordReadPosition, writeBuffer, (long)Platform.BYTE_ARRAY_OFFSET, (long)toTransfer);
                
                //将写出缓冲器中数据添加到写出器的序列化流中
                writer.write(writeBuffer, 0, toTransfer);
                
                //数据读取偏移量更新
                recordReadPosition += (long)toTransfer;
            }

            //通知写入器已经向序列化流中写入了一条数据
            writer.recordWritten();
        }

        //数据遍历结束,将数据从序列化流中flush到临时文件中
        FileSegment committedSegment = writer.commitAndGet();
        
        //关闭写出器
        writer.close();
        if (currentPartition != -1) {
            //将分区数据量缓存到溢写信息对象中
            spillInfo.partitionLengths[currentPartition] = committedSegment.length();
            //将本次溢写信息对象添加到溢写信息对象链表中
            this.spills.add(spillInfo);
        }

        if (!isLastFile) {
            //上报指标
            this.writeMetrics.incRecordsWritten(writeMetricsToUse.recordsWritten());
            this.taskContext.taskMetrics().incDiskBytesSpilled(writeMetricsToUse.bytesWritten());
        }

    }
}
6.2.1.1.1.2. Sorting the data before the spill

Notes:

  • only the record addresses cached in the in-memory sorter are sorted;
  • they are sorted by the partition id packed into each address.
final class ShuffleInMemorySorter {
    
    public ShuffleInMemorySorter.ShuffleSorterIterator getSortedIterator() {
        int offset = 0;
        
        //默认基数排序
        if (this.useRadixSort) {
            //分区id存储在longArray的5~7这3个字节中;最多可以存2的24次方个分区id;
            offset = RadixSort.sort(this.array, (long)this.pos, 5, 7, false, false);
        } else {
            //TimSort排序
            MemoryBlock unused = new MemoryBlock(this.array.getBaseObject(), this.array.getBaseOffset() + (long)this.pos * 8L, (this.array.size() - (long)this.pos) * 8L);
            LongArray buffer = new LongArray(unused);
            Sorter<PackedRecordPointer, LongArray> sorter = new Sorter(new ShuffleSortDataFormat(buffer));
            sorter.sort(this.array, 0, this.pos, SORT_COMPARATOR);
        }

        return new ShuffleInMemorySorter.ShuffleSorterIterator(this.pos, this.array, offset);
    }
}
6.2.1.1.2. growPointerArrayIfNecessary - growing the in-memory sorter's pointer array

Notes:

  • first, check whether the pointer array actually needs to grow;
  • if it does:
    • compute the current memory usage of the array;
    • allocate a new array twice as large;
    • check again whether the old array now has room (another task may have released memory in the meantime);
      • if it does, free the newly allocated array;
      • if it does not, move the pointers into the new array and make it the sorter's buffer.
final class ShuffleExternalSorter extends MemoryConsumer {
    
    private ShuffleInMemorySorter inMemSorter;
    
    private void growPointerArrayIfNecessary() throws IOException {
        assert this.inMemSorter != null;

        //判断是否达到扩容条件
        if (!this.inMemSorter.hasSpaceForAnotherRecord()) {
            //计算当前缓存容量
            long used = this.inMemSorter.getMemoryUsage();

            //扩容
            LongArray array;
            try {
                array = this.allocateArray(used / 8L * 2L);
            } catch (TooLargePageException var5) {
                this.spill();
                return;
            } catch (SparkOutOfMemoryError var6) {
                if (!this.inMemSorter.hasSpaceForAnotherRecord()) {
                    logger.error("Unable to grow the pointer array");
                    throw var6;
                }

                return;
            }

            //再次判断缓存是否够用:可能其他task释放了资源,从而缓存够用
            if (this.inMemSorter.hasSpaceForAnotherRecord()) {
                //这种情况下,释放刚申请的page资源
                this.freeArray(array);
            } else {
                //资源还是不够用,使用刚申请的page资源
                this.inMemSorter.expandPointerArray(array);
            }
        }

    }
}
6.2.1.1.2.1. Deciding whether the array needs to grow

Notes:

  • while the index of the most recently stored pointer is below the usable capacity, no growth is needed;
    • conversely, once that index reaches the usable capacity, the array must grow;
  • usable capacity (a quick arithmetic check follows the code below)
    • with radix sort (the default), half of the total capacity is usable;
    • on the non-radix-sort (TimSort) path, two thirds of the total capacity is usable;
  • the buffer itself is a LongArray.
final class ShuffleInMemorySorter {
    
    //缓存数据的数组
    private LongArray array;
    
    //记录数组中最新数据的索引
    private int pos = 0;
    
    //定义数组可使用容量
    private int usableCapacity = 0;
    
    ShuffleInMemorySorter(MemoryConsumer consumer, int initialSize, boolean useRadixSort) {
        //---------其他代码---------
        this.usableCapacity = this.getUsableCapacity();
    }
    
    //最新数据索引 < 可用容量:缓存不需要扩容
    public boolean hasSpaceForAnotherRecord() {
        return this.pos < this.usableCapacity;
    }
    
    //默认可用容量为总容量的一半;
    //如果使用非useRadixSort方案,可用容量为总容量的2/3
    private int getUsableCapacity() {
        return (int)((double)this.array.size() / (this.useRadixSort ? 2.0D : 1.5D));
    }

}
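With the default initial buffer of 4096 pointer slots and radix sort enabled, the usable capacity therefore works out to 2048 entries; the remaining slots are left free as working space for the sort. A quick check of the arithmetic:

object UsableCapacitySketch {
  def usableCapacity(arraySize: Long, useRadixSort: Boolean): Int =
    (arraySize / (if (useRadixSort) 2.0 else 1.5)).toInt

  def main(args: Array[String]): Unit = {
    println(usableCapacity(4096, useRadixSort = true))  // 2048 pointers before the first grow attempt
    println(usableCapacity(4096, useRadixSort = false)) // 2730
  }
}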
6.2.1.1.2.2. Computing the memory usage

Notes:

  • the memory usage is simply the array's size in entries times 8 bytes per entry.
final class ShuffleInMemorySorter {
    
    //返回数组page大小
    public long getMemoryUsage() {
        return this.array.size() * 8L;
    }
}

public final class LongArray {
    //返回数组page大小
    public long size() {
        return this.length;
    }
}
6.2.1.1.2.3. allocateArray - allocating the larger array

Notes:

  • a new page (MemoryBlock) is requested for the new capacity;
  • a new LongArray is built on top of that page.
public abstract class MemoryConsumer {
    
    public LongArray allocateArray(long size) {
        long required = size * 8L;
        
        //根据新的容量申请新的page(MemoryBlock)
        MemoryBlock page = this.taskMemoryManager.allocatePage(required, this);
        if (page == null || page.size() < required) {
            this.throwOom(page, required);
        }

        //更新内存使用量
        this.used += required;
        
        //根据page构建新的数组
        return new LongArray(page);
    }
}
6.2.1.1.2.4. expandPointerArray - moving the pointers

Notes:

  • first, assert that the new array is larger than the old one;
  • copy the data from the old array into the new one;
  • free the old array;
  • make the new array the in-memory sorter's buffer;
  • recompute the sorter's usable capacity.
final class ShuffleInMemorySorter {
    public void expandPointerArray(LongArray newArray) {
        
        //确保新缓存容量比原来缓存容量大
        assert newArray.size() > this.array.size();

        //数据从旧数组复制到新数组
        Platform.copyMemory(this.array.getBaseObject(), this.array.getBaseOffset(), newArray.getBaseObject(), newArray.getBaseOffset(), (long)this.pos * 8L);
       
        //释放旧数组资源
        this.consumer.freeArray(this.array);
        
        //新数组作为内存排序器缓存
        this.array = newArray;
        
        //更新内存排序器可用容量
        this.usableCapacity = this.getUsableCapacity();
    }
}
6.2.1.1.3. acquireNewPageIfNecessary - acquiring a new page for the data

A new page is needed when either:

  • the current page is null;
  • or the current page does not have enough room left.

Steps:

  • allocate a new page sized for the record and make it the current page;
  • reset the page cursor to the start of the new page;
  • append the new page to the list of allocated pages.
final class ShuffleExternalSorter extends MemoryConsumer {
    
    private void acquireNewPageIfNecessary(int required) {
        
        //当前page为null,或者当前page不够用
        if (this.currentPage == null || this.pageCursor + (long)required > this.currentPage.getBaseOffset() + this.currentPage.size()) {
            
            //根据数据长度构建新page作为当前page
            this.currentPage = this.allocatePage((long)required);
            
            //更新page光标位置
            this.pageCursor = this.currentPage.getBaseOffset();
            
            //新构建page加入page列表
            this.allocatedPages.add(this.currentPage);
        }

    }
}

6.2.2. closeAndWriteOutput - writing the sorter's data out to disk

Notes:

  • update the peak memory usage;
  • flush the buffered data to a temporary spill file on disk;
  • merge the spill files (there may be several: one per spill) into a single output file;
    • in the merged file the data is ordered by partition id, so each partition's data is contiguous;
  • create the index file for the output file;
    • it stores the data offsets of each partition;
    • the offsets appear in the same order as the partitions in the data file, one entry per partition;
  • record the shuffle status (MapStatus).
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    
  void closeAndWriteOutput() throws IOException {
    assert(sorter != null);
      
    // 更新内存使用峰值
    updatePeakMemoryUsed();
      
    serBuffer = null;
    serOutputStream = null;
      
    //将缓存中的数据落地到磁盘临时文件中
    final SpillInfo[] spills = sorter.closeAndGetSpills();
      
    sorter = null;
    final long[] partitionLengths;
      
    //获取输出数据文件
    final File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
    //构建输出数据文件的临时文件
    final File tmp = Utils.tempFileWith(output);
    try {
      try {
        //合并临时文件
        partitionLengths = mergeSpills(spills, tmp);
      } finally {
        for (SpillInfo spill : spills) {
          if (spill.file.exists() && ! spill.file.delete()) {
            logger.error("Error while deleting spill file {}", spill.file.getPath());
          }
        }
      }
        
      //创建输出文件对应的index文件
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
      }
    }
      
    //记录shuffle状态信息
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
  }
}
6.2.2.1. closeAndGetSpills - force-flushing the buffer to disk

Notes:

  • the records referenced by the in-memory sorter's addresses are written out to one last temporary file;
  • memory and the in-memory sorter are released;
  • the method returns the array of all SpillInfo objects;
    • each SpillInfo stores the amount of data written for every partition.
final class ShuffleExternalSorter extends MemoryConsumer {
    public SpillInfo[] closeAndGetSpills() throws IOException {
        if (this.inMemSorter != null) {
            //将缓存中的数据溢写到磁盘
            this.writeSortedFile(true);
            
            //释放内存资源
            this.freeMemory();
            
            //释放内存排序器
            this.inMemSorter.free();
            this.inMemSorter = null;
        }

        //返回溢写信息对象数组
        return (SpillInfo[])this.spills.toArray(new SpillInfo[this.spills.size()]);
    }
}
6.2.2.2. mergeSpills - merging the spill files

Notes:

  • no spill file: create an empty output file;
  • exactly one spill file: move it and rename it as the output file;
  • several spill files: merge them;
    • fast merge
      • requires fast merge to be both enabled and supported;
      • spark.file.transferTo decides whether the fast merge is transferTo-based (the default);
      • otherwise the fast merge is file-stream based;
    • slow merge
      • used whenever a fast merge is not possible;
  • returns the array of per-partition data lengths. A configuration sketch follows the code below.
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    
    private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
        //是否压缩:默认是,通过spark.shuffle.compress设置
        boolean compressionEnabled = this.sparkConf.getBoolean("spark.shuffle.compress", true);
        //压缩方式编码:默认LZ4CompressionCodec,通过spark.io.compression.codec设置
        CompressionCodec compressionCodec = org.apache.spark.io.CompressionCodec$.MODULE$.createCodec(this.sparkConf);
        
        //是否启用fast merge:默认是
        boolean fastMergeEnabled = this.sparkConf.getBoolean("spark.shuffle.unsafe.fastMergeEnabled", true);
        //是否支持 fast merge:
        //1、不启用压缩算法
        //2、或者SnappyCompressionCodec、LZFCompressionCodec、LZ4CompressionCodec、ZStdCompressionCodec这4种压缩算法之一
        boolean fastMergeIsSupported = !compressionEnabled || org.apache.spark.io.CompressionCodec..MODULE$.supportsConcatenationOfSerializedStreams(compressionCodec);
        
        //是否启用加密:默认否,通过 spark.io.encryption.enabled 参数来设置
        boolean encryptionEnabled = this.blockManager.serializerManager().encryptionEnabled();

        try {
            if (spills.length == 0) {//没有溢写分解
                 // 创建一个空文件
                (new FileOutputStream(outputFile)).close();
                //返回全是0的分区数据量数组
                return new long[this.partitioner.numPartitions()];
            } else if (spills.length == 1) {//一个溢写文件
                //临时文件迁移并重名为输出文件
                Files.move(spills[0].file, outputFile);
                //返回分区数据量数组
                return spills[0].partitionLengths;
            } else {//多个溢写文件
                long[] partitionLengths;
                //fast merge
                if (fastMergeEnabled && fastMergeIsSupported) {
                    //基于传输&&不加密方式快速合并:默认方式;
                    if (this.transferToEnabled && !encryptionEnabled) {
                        //基于传输的合并
                        logger.debug("Using transferTo-based fast merge");
                        partitionLengths = this.mergeSpillsWithTransferTo(spills, outputFile);
                    } else {
                        //基于文件流的合并
                        logger.debug("Using fileStream-based fast merge");
                        partitionLengths = this.mergeSpillsWithFileStream(spills, outputFile, (CompressionCodec)null);
                    }
                } else {
                    //慢合并
                    logger.debug("Using slow merge");
                    partitionLengths = this.mergeSpillsWithFileStream(spills, outputFile, compressionCodec);
                }

                //写出指标统计
                this.writeMetrics.decBytesWritten(spills[spills.length - 1].file.length());
                this.writeMetrics.incBytesWritten(outputFile.length());
                
                //返回分区数据量数组
                return partitionLengths;
            }
        } catch (IOException var9) {
            if (outputFile.exists() && !outputFile.delete()) {
                logger.error("Unable to delete output file {}", outputFile.getPath());
            }

            throw var9;
        }
    }
}
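
To make the transferTo-based fast merge concrete, here is a minimal, self-contained sketch (not Spark's mergeSpillsWithTransferTo itself) of concatenating per-partition segments from several spill files into one output file with NIO's zero-copy transferTo. The (file, partitionLengths) pairs here are an assumed stand-in for SpillInfo.

import java.io.{File, FileInputStream, FileOutputStream}

object TransferToMergeSketch {

  // spills: (spill file, per-partition byte lengths) pairs, standing in for SpillInfo
  def concatSpills(spills: Seq[(File, Array[Long])], numPartitions: Int, out: File): Array[Long] = {
    val partitionLengths = new Array[Long](numPartitions)
    val outChannel = new FileOutputStream(out, true).getChannel
    val inChannels = spills.map { case (f, _) => new FileInputStream(f).getChannel }
    val positions  = Array.fill(spills.length)(0L)
    try {
      // Walk the partitions in ascending order; for each partition, append that partition's
      // bytes from every spill file, so the output ends up grouped by partition id.
      for (p <- 0 until numPartitions; i <- spills.indices) {
        val len = spills(i)._2(p)
        var transferred = 0L
        while (transferred < len) {
          transferred += inChannels(i).transferTo(positions(i) + transferred, len - transferred, outChannel)
        }
        positions(i) += len
        partitionLengths(p) += len
      }
      partitionLengths
    } finally {
      inChannels.foreach(_.close())
      outChannel.close()
    }
  }
}

Because each spill file is itself ordered by partition id, this per-partition concatenation is what leaves the final data file sorted by partition id with each partition's bytes contiguous.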

6.3.Sorters

6.3.1.ShuffleExternalSorter - the external sorter

Notes:

  • ShuffleExternalSorter is a subclass of MemoryConsumer;
  • Constructing a ShuffleExternalSorter instance also invokes the MemoryConsumer constructor;
  • File buffer size: 32k by default;
  • Spill threshold: Integer.MAX_VALUE by default;
  • Disk write buffer: 1M by default;
  • It builds an in-memory sorter and keeps it inside the external sorter;
    • the in-memory cache starts with a capacity of 4096 entries;
    • sorting uses radix sort (useRadixSort) by default;

Summary:

  • The external sorter stores its cached data in a linked list;
    • each element of the list is a page, i.e. a MemoryBlock;
    • all records live inside these pages;
  • currentPage points at the most recently allocated page, the current page;
  • pageCursor is the offset within the current page where the next record will be written;
final class ShuffleExternalSorter extends MemoryConsumer {
    private static final Logger logger = LoggerFactory.getLogger(ShuffleExternalSorter.class);
    @VisibleForTesting
    static final int DISK_WRITE_BUFFER_SIZE = 1048576;
    private final int numPartitions;
    private final TaskMemoryManager taskMemoryManager;
    private final BlockManager blockManager;
    private final TaskContext taskContext;
    private final ShuffleWriteMetrics writeMetrics;
    // Spill threshold
    private final int numElementsForSpillThreshold;
    // File buffer size
    private final int fileBufferSizeBytes;
    // Disk write buffer size
    private final int diskWriteBufferSize;
    // Pages holding the data; at most 2^13 pages can be addressed
    private final LinkedList<MemoryBlock> allocatedPages = new LinkedList();
    // Metadata of the spill files produced so far
    private final LinkedList<SpillInfo> spills = new LinkedList();
    private long peakMemoryUsedBytes;
    @Nullable
    private ShuffleInMemorySorter inMemSorter;
    // The page currently being written to
    @Nullable
    private MemoryBlock currentPage = null;
    // Cursor (write offset) inside the current page
    private long pageCursor = -1L;

    // Constructor
    ShuffleExternalSorter(TaskMemoryManager memoryManager, BlockManager blockManager, TaskContext taskContext, int initialSize, int numPartitions, SparkConf conf, ShuffleWriteMetrics writeMetrics) {
        
        // Invoke the MemoryConsumer (superclass) constructor
        super(memoryManager, (long)((int)Math.min(134217728L, memoryManager.pageSizeBytes())), memoryManager.getTungstenMemoryMode());
        
        this.taskMemoryManager = memoryManager;
        this.blockManager = blockManager;
        this.taskContext = taskContext;
        
        // Number of reduce partitions
        this.numPartitions = numPartitions;
        // File buffer size: 32k by default
        this.fileBufferSizeBytes = (int)(Long)conf.get(package$.MODULE$.SHUFFLE_FILE_BUFFER_SIZE()) * 1024;
        
        // Spill threshold: Integer.MAX_VALUE by default
        this.numElementsForSpillThreshold = (Integer)conf.get(package$.MODULE$.SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD());
        this.writeMetrics = writeMetrics;
        
        // Build the in-memory sorter and keep it inside this sorter
        this.inMemSorter = new ShuffleInMemorySorter(this, initialSize, conf.getBoolean("spark.shuffle.sort.useRadixSort", true));
        
        // Record current memory usage
        this.peakMemoryUsedBytes = this.getMemoryUsage();
        // Disk write buffer size: 1M by default
        this.diskWriteBufferSize = (int)(Long)conf.get(package$.MODULE$.SHUFFLE_DISK_WRITE_BUFFER_SIZE());
    }
}
6.3.1.1.ShuffleInMemorySorter - the in-memory sorter

Notes:

  • Uses a LongArray as the in-memory cache for the record pointers;
  • The array's initial capacity is 4096 entries;
  • The usable capacity depends on the sort algorithm:
    • with radix sort (useRadixSort), half of the total capacity is usable;
    • without radix sort, 2/3 of the total capacity is usable;
  • Default comparator: ascending order by partition id;

Summary:

  • The in-memory sorter caches record addresses in a LongArray;
    • the LongArray itself is backed by a MemoryBlock;
final class ShuffleInMemorySorter {
    
    // Default sort order: ascending by partition id
    private static final ShuffleInMemorySorter.SortComparator SORT_COMPARATOR = new ShuffleInMemorySorter.SortComparator();
    private final MemoryConsumer consumer;
    private LongArray array;
    private final boolean useRadixSort;
    private int pos = 0;
    private int usableCapacity = 0;
    private final int initialSize;

    ShuffleInMemorySorter(MemoryConsumer consumer, int initialSize, boolean useRadixSort) {
        this.consumer = consumer;

        assert initialSize > 0;

        // Initial cache capacity: 4096
        this.initialSize = initialSize;
        // Radix sort is used by default
        this.useRadixSort = useRadixSort;
        // Allocate a LongArray of initialSize (4096) entries as the sorter's cache
        this.array = consumer.allocateArray((long)initialSize);
        // Usable capacity
        this.usableCapacity = this.getUsableCapacity();
    }
    
    // Inner class: the sort comparator
    private static final class SortComparator implements Comparator<PackedRecordPointer> {
        private SortComparator() {
        }

        // Compare by partition id, ascending
        public int compare(PackedRecordPointer left, PackedRecordPointer right) {
            int leftId = left.getPartitionId();
            int rightId = right.getPartitionId();
            return leftId < rightId ? -1 : (leftId > rightId ? 1 : 0);
        }
    }
}
6.3.1.1.1.Computing the usable capacity

Notes:

  • With radix sort, the usable capacity is half of the total capacity, because radix sort needs an equally sized scratch buffer;
  • Without radix sort, the usable capacity is 2/3 of the total capacity, because the TimSort path reserves a buffer of roughly half the used space (a worked example follows the code below);
final class ShuffleInMemorySorter {
    private static final ShuffleInMemorySorter.SortComparator SORT_COMPARATOR = new ShuffleInMemorySorter.SortComparator();
    private final MemoryConsumer consumer;
    private LongArray array;
    private final boolean useRadixSort;
    private int pos = 0;
    private int usableCapacity = 0;
    private final int initialSize;

    // With radix sort the usable capacity is half of the total; otherwise it is 2/3 of the total
    private int getUsableCapacity() {
        return (int)((double)this.array.size() / (this.useRadixSort ? 2.0D : 1.5D));
    }
}
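
A quick worked example of the formula above, assuming the default array size of 4096 entries:

object UsableCapacityExample extends App {
  val size = 4096
  println((size / 2.0).toInt)  // 2048 usable entries when radix sort is enabled
  println((size / 1.5).toInt)  // 2730 usable entries on the TimSort path
}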

6.4.Summary

Applicable when:

  • The serializer supports relocation of serialized objects;
  • There is no map-side aggregation;
  • The number of partitions does not exceed 16777216;

Write flow:

  • First, records from the iterator are added to the sorter one by one;
    • whenever the sorter hits a spill condition, the buffered records are spilled to a temporary file;
  • Then the sorter's remaining data is written out to a single output file;
    • this produces one output file plus one index file for that output file;
  • Finally the sorter's resources are released;

Sorters:

  • External sorter
    • caches the data itself
    • the cache is a linked list whose elements are pages (MemoryBlock);
  • In-memory sorter
    • caches the record addresses
      • bytes 5~7 of each packed record address store the partition id (see the sketch after this list);
      • so at most 2^24 (16777216) partition ids can be encoded;
    • the cache is a LongArray (itself backed by a MemoryBlock);
  • Sorting
    • the cached record addresses are sorted by partition id in ascending order, by default with radix sort;
  • Spilling
    • once the number of cached record addresses reaches Integer.MAX_VALUE, a spill is triggered;
    • each spill produces one temporary file;
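
The partition-id packing can be illustrated with a minimal sketch (simplified, not Spark's actual PackedRecordPointer): a 24-bit partition id is kept in the high bytes of each 64-bit pointer, and the comparator only needs to extract and compare those bits.

object PackedPointerSketch {
  val PartitionBits = 24                   // 2^24 = 16777216 distinct partition ids
  val AddressBits   = 64 - PartitionBits   // remaining 40 bits hold the record address

  def pack(partitionId: Int, address: Long): Long = {
    require(partitionId >= 0 && partitionId < (1 << PartitionBits))
    (partitionId.toLong << AddressBits) | (address & ((1L << AddressBits) - 1))
  }

  def partitionId(packed: Long): Int = (packed >>> AddressBits).toInt

  // The sorter only ever compares partition ids, just like SORT_COMPARATOR above
  def compare(left: Long, right: Long): Int =
    Integer.compare(partitionId(left), partitionId(right))
}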

Output:

  • Each write produces one data output file plus one index file
  • Data file
    • records are ordered by partition id, ascending;
    • each partition's data is stored contiguously;
  • Index file
    • stores the byte offset of every partition in the data file;
    • its entries correspond one-to-one, and in the same order, to the partitions in the data file;

7.Creating the index file for the data file

Notes:

  • The index file records each partition's byte offset;
  • Each offset recorded in the index file corresponds to one partition in the data file; both files keep the partitions in the same ascending order (a reading sketch follows the code below);
private[spark] class IndexShuffleBlockResolver(
    conf: SparkConf,
    _blockManager: BlockManager = null)
  extends ShuffleBlockResolver
  with Logging {
     
  def writeIndexFileAndCommit(
      shuffleId: Int,
      mapId: Int,
      lengths: Array[Long],
      dataTmp: File): Unit = {
      
    // Locate the index file and create a temporary file next to it
    val indexFile = getIndexFile(shuffleId, mapId)
    val indexTmp = Utils.tempFileWith(indexFile)
    try {
        
      // Locate the data file
      val dataFile = getDataFile(shuffleId, mapId)
        
      // There is only one IndexShuffleBlockResolver per executor; this synchronization
      // makes the check and the renames below atomic.
      synchronized {
          
        // Check whether a consistent index file and data file already exist
        val existingLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length)
        if (existingLengths != null) {
          System.arraycopy(existingLengths, 0, lengths, 0, lengths.length)
            
          // A matching index already exists, which means another attempt of this map task
          // has already committed its output; our attempt is redundant, so just delete
          // the temporary data file and return.
          if (dataTmp != null && dataTmp.exists()) {
            dataTmp.delete()
          }
        } else {
          // Open a data output stream on the temporary index file
          val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))
          Utils.tryWithSafeFinally {
            // We take in lengths of each block, need to convert it to offsets.
            var offset = 0L
            out.writeLong(offset)
            // Iterate over the per-partition lengths
            for (length <- lengths) {
              // Accumulate the offset of the next partition
              offset += length
              // Write each partition's end offset into the index file
              out.writeLong(offset)
            }
          } {
            out.close()
          }

          // Delete any stale index file
          if (indexFile.exists()) {
            indexFile.delete()
          }
            
          // Delete any stale data file
          if (dataFile.exists()) {
            dataFile.delete()
          }
            
          // Rename the temporary index file to the final index file
          if (!indexTmp.renameTo(indexFile)) {
            throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
          }
          if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
            throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
          }
        }
      }
    } finally {
      if (indexTmp.exists() && !indexTmp.delete()) {
        logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")
      }
    }
  }
}
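
Since the index file is just numPartitions + 1 longs, a reader can locate partition i's byte range in the data file by reading offsets i and i + 1. A minimal sketch of that lookup (an assumed helper for illustration, not Spark's reader code):

import java.io.{DataInputStream, File, FileInputStream}

object IndexLookupSketch {
  // Returns the (startOffset, endOffset) of partition reduceId inside the data file
  def partitionRange(indexFile: File, reduceId: Int): (Long, Long) = {
    val in = new DataInputStream(new FileInputStream(indexFile))
    try {
      in.skipBytes(reduceId * 8)   // skip the offsets of the earlier partitions
      val start = in.readLong()    // where this partition's data begins
      val end   = in.readLong()    // where the next partition's data begins
      (start, end)
    } finally {
      in.close()
    }
  }
}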

8.Summary

Overall flow:

  • When an RDD's dependencies are built, a wide dependency (ShuffleDependency) initializes its shuffleHandle property

    • at that point it registers the shuffle with the shuffleManager, which instantiates a different ShuffleHandle subtype depending on the situation
  • When an Executor runs a shuffle map task, the task's output data is written to disk through a ShuffleWriter

    • the ShuffleManager picks the ShuffleWriter implementation based on the concrete type of the dependency's shuffleHandle (see the sketch after this list)
    • the actual write to disk happens in ShuffleWriter.write()
    • each shuffle map task produces one data file plus one index file on disk
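
A minimal, self-contained sketch of that handle-to-writer dispatch (simplified stand-in types, not SortShuffleManager's real getWriter signature):

object WriterDispatchSketch {
  sealed trait Handle
  case object SerializedHandle      extends Handle  // stands in for SerializedShuffleHandle
  case object BypassMergeSortHandle extends Handle  // stands in for BypassMergeSortShuffleHandle
  case object BaseHandle            extends Handle  // stands in for BaseShuffleHandle

  // The registered handle's type decides which writer is instantiated
  def writerFor(handle: Handle): String = handle match {
    case SerializedHandle      => "UnsafeShuffleWriter"
    case BypassMergeSortHandle => "BypassMergeSortShuffleWriter"
    case BaseHandle            => "SortShuffleWriter"
  }
}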

Analysis of the ShuffleWriter types:

  • BypassMergeSortShuffleWriter
    • handle: BypassMergeSortShuffleHandle
    • applies when: no map-side aggregation and at most 200 partitions
    • behavior: writes numPartitions files directly and concatenates them at the end
    • advantage: avoids serializing and deserializing twice to merge spilled files
    • drawback: many files are open at once, so more memory is allocated for buffers;
  • SortShuffleWriter
    • handle: BaseShuffleHandle
    • applies when: neither of the other two applies;
    • behavior: buffers map output in deserialized form
    • features: supports map-side aggregation and sorting
  • UnsafeShuffleWriter
    • handle: SerializedShuffleHandle
    • applies when
      • the serializer supports relocation of serialized objects;
      • there is no map-side aggregation;
      • the number of partitions does not exceed 16777216;
    • behavior: buffers map output in serialized form; supports sorting

