Spark Source Code: Shuffle Internals (1) - ShuffleWriter


ShuffleWriter

1. Overview

This analysis is based on the Spark 2.x codebase (Scala 2.11 build).

Shuffle in Spark is a large, self-contained framework; this article focuses on the role ShuffleWriter plays inside it and on how it works internally.

2. Registering the ShuffleHandle

2.1. When the handle is registered

  • shuffleHandle is one of the fields of the wide dependency ShuffleDependency;
  • when the wide-dependency object is instantiated, it registers a handle with the shuffleManager, and the returned handle initializes the shuffleHandle field;
  • registering with the shuffleManager instantiates a concrete ShuffleHandle object;
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  val shuffleId: Int = _rdd.context.newShuffleId()

  //向shuffleManager注册handle,并返回handle初始化shuffleHandle属性
  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)
}

2.2. Registering the shuffle with the ShuffleManager

Notes:

  • in Spark, SortShuffleManager is the only implementation of ShuffleManager, so shuffle registration ultimately happens inside SortShuffleManager;
  • when it happens:
    • when the ShuffleDependency object is instantiated and its shuffleHandle field is initialized, the wide dependency registers the shuffle with the shuffleManager;

Summary:

  • BypassMergeSortShuffleHandle:
    • applies when: there is no map-side aggregation and the number of partitions is at most 200
    • effect: writes numPartitions files directly and concatenates them at the end
    • advantage: avoids serializing and deserializing the data twice to merge spilled files
    • drawback: many files are open at the same time, so more memory goes to write buffers;
  • BaseShuffleHandle:
    • applies when: neither of the other two handles can be used;
    • effect: buffers map output in deserialized form
    • characteristic: supports map-side aggregation
  • SerializedShuffleHandle:
    • applies when:
      • the serializer supports relocation of serialized objects;
      • there is no map-side aggregation;
      • the number of partitions is at most 16,777,216;
    • effect: buffers map output in serialized form
private[spark] class SortShuffleManager(conf: SparkConf) extends ShuffleManager with Logging {
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    //不是map端聚合且分区数不高于200
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // 直接写入numPartitions文件,并在最后将它们连接起来
      //这避免了进行两次序列化和反序列化以合并溢出的文件,这在正常的代码路径中会发生。缺点是一次打开多个文件,从而为缓冲区分配更多内存。
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } 
    
    //序列化器支持对象迁移、非map端聚合、分区数不大于16777216
    else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // 以序列化的形式缓冲映射输出
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // 以反序列化的形式缓冲映射输出
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }
}

2.2.1. Choosing BypassMergeSortShuffleHandle

Requirements:

  • no map-side aggregation, and the number of partitions is at most 200
  • the partition-count threshold is set by spark.shuffle.sort.bypassMergeThreshold, which defaults to 200 (a tuning example follows the code below);
private[spark] object SortShuffleWriter {
  //不是map端聚合且分区数不高于200
  def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
    // map端聚合需要排序
    if (dep.mapSideCombine) {
      false
    } else {
      //spark.shuffle.sort.bypassMergeThreshold : 默认值200
      val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
      //分区数小于spark.shuffle.sort.bypassMergeThreshold或200
      dep.partitioner.numPartitions <= bypassMergeThreshold
    }
  }
}
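Since the bypass threshold is an ordinary configuration value, it can be raised or lowered per application. A minimal sketch, assuming a local run; the threshold of 400 and the job shapes are arbitrary examples rather than recommendations, and which handle gets registered simply follows the rules above:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object BypassThresholdExample {
  def main(args: Array[String]): Unit = {
    // Raise the bypass threshold from the default 200 to 400 (example value).
    val conf = new SparkConf()
      .setAppName("bypass-threshold-example")
      .setMaster("local[2]")
      .set("spark.shuffle.sort.bypassMergeThreshold", "400")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(1 to 10000).map(i => (i % 300, i))

    // groupByKey has no map-side combine, so with 300 (<= 400) reduce partitions
    // shouldBypassMergeSort() is true and a BypassMergeSortShuffleHandle is registered.
    pairs.groupByKey(new HashPartitioner(300)).count()

    // reduceByKey sets mapSideCombine = true, so the bypass path is never taken,
    // whatever the threshold; registration falls through to the other two handles.
    pairs.reduceByKey(_ + _, 300).count()

    sc.stop()
  }
}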

2.2.2. Choosing SerializedShuffleHandle

Requirements (all three must hold):

  • the serializer supports relocation of serialized objects;
  • no map-side aggregation;
  • the number of partitions is at most 16,777,216 (see the bit-arithmetic sketch after the code below);
private[spark] object SortShuffleManager extends Logging {

  //16777216
  val MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE =
    PackedRecordPointer.MAXIMUM_PARTITION_ID + 1

  //序列化器支持对象迁移、非map端聚合、分区数不大于16777216
  def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
    val shufId = dependency.shuffleId
    val numPartitions = dependency.partitioner.numPartitions
    
    //序列化器不支持对象迁移
    if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
        s"${dependency.serializer.getClass.getName}, does not support object relocation")
      false
    } 
      
    //map端聚合
    else if (dependency.mapSideCombine) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because we need to do " +
        s"map-side aggregation")
      false
    } 
    //分区数大于16777216
    else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
        s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
      false
    } else {
      log.debug(s"Can use serialized shuffle for shuffle $shufId")
      true
    }
  }
}
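The 16,777,216 ceiling comes from the PackedRecordPointer used on the serialized path: the partition id is squeezed into 24 bits of a 64-bit pointer, so the largest representable id is 2^24 - 1, and MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE is one more than that. A quick sketch of the arithmetic (the constant names below are local to the sketch):

object PartitionLimitSketch {
  // 24 bits of the 64-bit packed pointer are reserved for the partition id.
  val PartitionIdBits = 24

  val MaximumPartitionId: Int = (1 << PartitionIdBits) - 1          // 16777215
  val MaxPartitionsForSerializedMode: Int = MaximumPartitionId + 1  // 16777216

  def main(args: Array[String]): Unit =
    println(MaxPartitionsForSerializedMode) // matches the check in canUseSerializedShuffle
}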

3. Instantiating the ShuffleWriter

3.1. When it is instantiated

When it is created and used:

  • when an executor runs a shuffle task, the work is ultimately done by ShuffleMapTask.runTask();
  • inside ShuffleMapTask.runTask(), a ShuffleWriter is instantiated and ShuffleWriter.write() writes the data out to disk;

3.1.1. Executor#run - the executor runs the task

Notes:

  • when the Executor runs a shuffle task, the actual work is delegated to ShuffleMapTask.runTask();
private[spark] class Executor(
    executorId: String,
    executorHostname: String,
    env: SparkEnv,
    userClassPath: Seq[URL] = Nil,
    isLocal: Boolean = false,
    uncaughtExceptionHandler: UncaughtExceptionHandler = new SparkUncaughtExceptionHandler)
  extends Logging {
      
    //针对shuffle任务,实例化ShuffleMapTask对象
    @volatile var task: Task[Any] = _
      
    override def run(): Unit = {
      
      //---------其他代码---------

      try {
       
        //---------其他代码---------
        val value = Utils.tryWithSafeFinally {
          
          //针对shuffle任务,执行ShuffleMapTask#runTask函数
          val res = task.run(
            taskAttemptId = taskId,
            attemptNumber = taskDescription.attemptNumber,
            metricsSystem = env.metricsSystem)
          threwException = false
          res
        }

        //---------其他代码---------
      } catch {
        //---------其他代码---------
      } finally {
        runningTasks.remove(taskId)
      }
    }
  }
  • inside ShuffleMapTask.runTask(), a ShuffleWriter is instantiated and ShuffleWriter.write() writes the data to disk;
  • when the ShuffleWriter is created, the shuffleHandle held by the stage's dependency is passed along;
private[spark] class ShuffleMapTask(
    stageId: Int,
    stageAttemptId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient private var locs: Seq[TaskLocation],
    localProperties: Properties,
    serializedTaskMetrics: Array[Byte],
    jobId: Option[Int] = None,
    appId: Option[String] = None,
    appAttemptId: Option[String] = None,
    isBarrier: Boolean = false)
  extends Task[MapStatus](stageId, stageAttemptId, partition.index, localProperties,
    serializedTaskMetrics, jobId, appId, appAttemptId, isBarrier)
  with Logging {
      
  override def runTask(context: TaskContext): MapStatus = {
    //---------其他代码---------
      
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
    
    //---------其他代码---------

    var writer: ShuffleWriter[Any, Any] = null
    try {
      val manager = SparkEnv.get.shuffleManager
      
      //从shuffleManager中获取写入器:将依赖中维护的shuffleHandle传过去
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
        
      //通过写入器将数据落地磁盘
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
        
      //关闭写入器
      writer.stop(success = true).get
    } catch {
      //---------其他代码---------
    }
  }
}

3.1.2. Instantiating the concrete ShuffleWriter

Notes:

  • UnsafeShuffleWriter, BypassMergeSortShuffleWriter and SortShuffleWriter are the subclasses of ShuffleWriter;
  • SerializedShuffleHandle, BypassMergeSortShuffleHandle and BaseShuffleHandle are the subclasses of ShuffleHandle;
  • the concrete subtype of the ShuffleHandle decides which ShuffleWriter subclass is instantiated;
private[spark] class SortShuffleManager(conf: SparkConf) extends ShuffleManager with Logging {
  override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V] = {
    numMapsForShuffle.putIfAbsent(
      handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
    val env = SparkEnv.get
    
    //rdd依赖中维护的ShuffleHandle类型,实例化对应的writer
    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        new UnsafeShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          context.taskMemoryManager(),
          unsafeShuffleHandle,
          mapId,
          context,
          env.conf)
      case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
        new BypassMergeSortShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          bypassMergeSortHandle,
          mapId,
          context,
          env.conf)
      case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
        new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
    }
  }
}

4. BypassMergeSortShuffleWriter

4.1. Instantiation

The main fields:

  • the file buffer size used when writing data to disk, 32 KB by default;
    • configurable via spark.shuffle.file.buffer
  • whether to copy file data with NIO transferTo when the temporary files are merged: true by default;
  • a FileSegment array caches the handles of the per-partition temporary files;
  • a long array caches the number of bytes written for each partition;
final class BypassMergeSortShuffleWriter<K, V> extends ShuffleWriter<K, V> {

  private static final Logger logger = LoggerFactory.getLogger(BypassMergeSortShuffleWriter.class);

  //数据写入磁盘时的文件缓存,默认32kb,通过spark.shuffle.file.buffer参数指定
  private final int fileBufferSize;
  //合并临时文件时,是否通过NIO方式复制文件数据:默认true;通过spark.file.transferTo参数指定
  private final boolean transferToEnabled;
  //分区数
  private final int numPartitions;
  private final BlockManager blockManager;
  //分区器
  private final Partitioner partitioner;
  private final ShuffleWriteMetrics writeMetrics;
  //本次shuffle的唯一标识
  private final int shuffleId;
  private final int mapId;
  private final Serializer serializer;
  //创建和维护shuffle数据的逻辑块和物理文件位置的对应关系
  private final IndexShuffleBlockResolver shuffleBlockResolver;

  //每个分区的写出器
  private DiskBlockObjectWriter[] partitionWriters;
  
  //每个分区的临时文件
  private FileSegment[] partitionWriterSegments;
  
  //文件输出状态信息
  @Nullable private MapStatus mapStatus;
  
  //每个分区的数据量
  private long[] partitionLengths;

  private boolean stopping = false;
}

4.2. write

How it works:

  • the blockManager creates one temporary file per partition, and a disk writer is built on top of each temporary file;
  • the records are iterated and each record is appended to the output stream of the temporary file of its partition;
  • the streams are flushed so the buffered data actually lands in the on-disk temporary files;
  • all the per-partition temporary files are concatenated into one large data file, and a matching index file is generated;
  • the output status (MapStatus) is recorded.

Notes:

  • one call to write() produces as many temporary files as there are partitions (one per partition); they are merged into a single data file with a single index file;
  • the index file records the offset and length of each partition's data (a small sketch of the index layout follows the code below);
final class BypassMergeSortShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    
    public void write(Iterator<Product2<K, V>> records) throws IOException {
        assert this.partitionWriters == null;

        if (!records.hasNext()) {//空记录,生成一个空index文件
            this.partitionLengths = new long[this.numPartitions];
            this.shuffleBlockResolver.writeIndexFileAndCommit(this.shuffleId, this.mapId, this.partitionLengths, (File)null);
            this.mapStatus = MapStatus$.MODULE$.apply(this.blockManager.shuffleServerId(), this.partitionLengths);
        } else {
            //初始化序列化器
            SerializerInstance serInstance = this.serializer.newInstance();
            long openStartTime = System.nanoTime();
            //初始化分区写出器数组
            this.partitionWriters = new DiskBlockObjectWriter[this.numPartitions];
            //初始化分区对应临时文件数组
            this.partitionWriterSegments = new FileSegment[this.numPartitions];

            //通过blockManager获取block的临时文件,并以此构建分区的磁盘写出器
            //一个分区对应磁盘写出器和一个临时文件
            int i;
            for(i = 0; i < this.numPartitions; ++i) {
                //通过blockManager构建block的临时文件信息:blockId,临时文件
                Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile = this.blockManager.diskBlockManager().createTempShuffleBlock();
                File file = (File)tempShuffleBlockIdPlusFile._2();
                BlockId blockId = (BlockId)tempShuffleBlockIdPlusFile._1();
                //构建当前分区的磁盘写出器:绑定临时文件、文件输出流;
                this.partitionWriters[i] = this.blockManager.getDiskWriter(blockId, file, serInstance, this.fileBufferSize, this.writeMetrics);
            }

            this.writeMetrics.incWriteTime(System.nanoTime() - openStartTime);

            //将数据逐条添加到各分区对应临时文件的输出流中
            while(records.hasNext()) {
                Product2<K, V> record = (Product2)records.next();
                K key = record._1();
                //将数据添加到临时文件输出流中
                this.partitionWriters[this.partitioner.getPartition(key)].write(key, record._2());
            }

            //将各分区从文件输出流flush到临时文件中
            for(i = 0; i < this.numPartitions; ++i) {
                //获取当前分区的磁盘写出器
                DiskBlockObjectWriter writer = this.partitionWriters[i];
                //文件flush,返回记录的偏移量、长度、临时文件
                this.partitionWriterSegments[i] = writer.commitAndGet();
                writer.close();
            }

            //获取输出数据文件
            File output = this.shuffleBlockResolver.getDataFile(this.shuffleId, this.mapId);
            //创建输出数据文件的临时文件
            File tmp = Utils.tempFileWith(output);

            try {
                //临时文件合并产生数据文件,返回临时文件长度数组
                this.partitionLengths = this.writePartitionedFile(tmp);
                //生成数据文件对应的index文件
                this.shuffleBlockResolver.writeIndexFileAndCommit(this.shuffleId, this.mapId, this.partitionLengths, tmp);
            } finally {
                if (tmp.exists() && !tmp.delete()) {
                    logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
                }

            }

            //记录数据输出状态
            this.mapStatus = MapStatus$.MODULE$.apply(this.blockManager.shuffleServerId(), this.partitionLengths);
        }
    }
}
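writeIndexFileAndCommit turns the per-partition lengths into an index file of cumulative offsets (one leading 0 plus one entry per partition), so a reducer can read partition i from the single data file as the byte range [offset(i), offset(i+1)). A minimal sketch of that bookkeeping, using nothing but the partitionLengths array produced above (the helper names are hypothetical):

object IndexFileSketch {
  /** Cumulative offsets: numPartitions + 1 entries, so partition i spans offsets(i) until offsets(i + 1). */
  def offsets(partitionLengths: Array[Long]): Array[Long] =
    partitionLengths.scanLeft(0L)(_ + _)

  def main(args: Array[String]): Unit = {
    val lengths = Array(120L, 0L, 340L, 75L)   // bytes written for each of 4 partitions
    val offs = offsets(lengths)                // Array(0, 120, 120, 460, 535)
    val p = 2
    println(s"partition $p -> bytes [${offs(p)}, ${offs(p + 1)}) of the data file") // [120, 460)
  }
}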

4.2.1. writePartitionedFile - merging the temporary files

Notes:

  • the per-partition temporary files are copied into the data file one by one, in partition order;
  • by default the copy is done with NIO transferTo;
    • controlled by the spark.file.transferTo parameter;
final class BypassMergeSortShuffleWriter<K, V> extends ShuffleWriter<K, V> {
  private long[] writePartitionedFile(File outputFile) throws IOException {
    //初始化分区数据量数组
    final long[] lengths = new long[numPartitions];
    
    if (partitionWriters == null) {
      //没有分区写出器,代表分区写出器构造没有执行,不存在数据写出,不存在临时文件
      //返回空数组
      return lengths;
    }

    //构建数据文件输出流
    final FileOutputStream out = new FileOutputStream(outputFile, true);
    final long writeStartTime = System.nanoTime();
    boolean threwException = true;
    try {
      //遍历分区
      for (int i = 0; i < numPartitions; i++) {
        //取出分区临时文件
        final File file = partitionWriterSegments[i].file();
        if (file.exists()) {
          //构建临时文件输入流
          final FileInputStream in = new FileInputStream(file);
          boolean copyThrewException = true;
          try {
            //将分区临时文件数据复制到数据文件中
            //transferToEnabled默认为true:以NIO的方式复制
            lengths[i] = Utils.copyStream(in, out, false, transferToEnabled);
            copyThrewException = false;
          } finally {
            Closeables.close(in, copyThrewException);
          }
          if (!file.delete()) {
            logger.error("Unable to delete file for partition {}", i);
          }
        }
      }
      threwException = false;
    } finally {
      Closeables.close(out, threwException);
      writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
    }
    partitionWriters = null;
    return lengths;
  }
}

5. SortShuffleWriter

5.1. Instantiation

Instantiating a SortShuffleWriter is straightforward; it has few fields to initialize.

private[spark] class SortShuffleWriter[K, V, C](
    shuffleBlockResolver: IndexShuffleBlockResolver,
    handle: BaseShuffleHandle[K, V, C],
    mapId: Int,
    context: TaskContext)
  extends ShuffleWriter[K, V] with Logging {

  //依赖关系
  private val dep = handle.dependency

  private val blockManager = SparkEnv.get.blockManager

  //排序器
  private var sorter: ExternalSorter[K, V, _] = null

  private var stopping = false

  private var mapStatus: MapStatus = null

  private val writeMetrics = context.taskMetrics().shuffleWriteMetrics
    
}

5.2. write

How it works:

  • build an ExternalSorter;
  • feed all records into the sorter;
    • the data is buffered in memory in a PartitionedAppendOnlyMap or a PartitionedPairBuffer;
    • when the in-memory buffer hits the spill condition, its entire contents are spilled to disk into one temporary file;
  • write the sorter's data out to a single output file;
  • build the index file for that output file;
  • record the output status.

Additional notes:

  • within the output file, data is written in partition-id order;
  • if map-side aggregation is required, records are aggregated first and only then written out;
  • one call to SortShuffleWriter.write() produces one data file on disk plus one matching index file (see the file-name sketch after the code below);
private[spark] class SortShuffleWriter[K, V, C](
    shuffleBlockResolver: IndexShuffleBlockResolver,
    handle: BaseShuffleHandle[K, V, C],
    mapId: Int,
    context: TaskContext)
  extends ShuffleWriter[K, V] with Logging {
  
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    
    //构建排序器
    sorter = if (dep.mapSideCombine) {
      //map端聚合,需要定义排序器的聚合器、排序方式
      new ExternalSorter[K, V, C](
        context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
    } else {
      // 在这种情况下,我们既没有向排序器传递聚合器,也没有向排序器传递排序器,因为我们不关心键是否在每个分区中排序;如果正在运行的操作是sortByKey,则将在reduce端执行
      new ExternalSorter[K, V, V](
        context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
    }
      
    //数据添加到排序器:满足溢写条件情况下,缓存数据将会溢写到磁盘;否则,在内存中缓存
    sorter.insertAll(records)

    //获取输出数据文件
    val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
    //创建输出文件的临时文件
    val tmp = Utils.tempFileWith(output)
    try {
      //组装blockId
      val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
      //临时文件合并数据到数据文件:根据参数进行排序聚合
      val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
      //创建数据文件的index文件
      shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
      //记录数据输出状态
      mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
      }
    }
  }
}
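So each map task ends up with exactly one data file and one index file. As an illustration only (the real files live in sub-directories chosen by DiskBlockManager), the names follow the ShuffleDataBlockId / ShuffleIndexBlockId convention with the reduce id fixed at the NOOP value 0; a small sketch of what that looks like for shuffleId = 3 and two map tasks:

object ShuffleFileNameSketch {
  // Mirrors the shuffle_<shuffleId>_<mapId>_<reduceId>.data / .index naming convention.
  def dataFile(shuffleId: Int, mapId: Int): String  = s"shuffle_${shuffleId}_${mapId}_0.data"
  def indexFile(shuffleId: Int, mapId: Int): String = s"shuffle_${shuffleId}_${mapId}_0.index"

  def main(args: Array[String]): Unit = {
    for (mapId <- 0 to 1) {
      // e.g. shuffle_3_0_0.data + shuffle_3_0_0.index, shuffle_3_1_0.data + shuffle_3_1_0.index
      println(s"${dataFile(3, mapId)}  +  ${indexFile(3, mapId)}")
    }
  }
}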

5.2.1. ExternalSorter#insertAll - adding data to the ExternalSorter

In-memory buffering:

  • PartitionedAppendOnlyMap
    • used when map-side aggregation is required;
    • values are aggregated under the key (partitionId, key);
    • each entry is ((partitionId, key), aggregated value)
  • PartitionedPairBuffer
    • used when there is no map-side aggregation;
    • each entry is simply (key, value);

Steps:

  • iterate over the records and, depending on whether map-side aggregation is needed, buffer each one into the corresponding collection;
    • with map-side aggregation, records go into the PartitionedAppendOnlyMap, where values are merged under the key (partitionId, key);
    • without map-side aggregation, records are appended to the PartitionedPairBuffer as (key, value);
  • after every record, check whether the in-memory buffer needs to be spilled to disk.

Additional notes:

  • if no spill ever happens, all the data stays in the in-memory collection;
  • so at any point the records held by the ExternalSorter are in one of three states:
    • entirely in the in-memory collection;
    • entirely in spill files on disk, referenced from the sorter's spills field;
    • partly in memory and partly in spill files on disk;
private[spark] class ExternalSorter[K, V, C](
    context: TaskContext,
    aggregator: Option[Aggregator[K, V, C]] = None,
    partitioner: Option[Partitioner] = None,
    ordering: Option[Ordering[K]] = None,
    serializer: Serializer = SparkEnv.get.serializer)
  extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
  with Logging {
      
  @volatile private var map = new PartitionedAppendOnlyMap[K, C]
  @volatile private var buffer = new PartitionedPairBuffer[K, C]
      
  def insertAll(records: Iterator[Product2[K, V]]): Unit = {
    // 聚合器不为None,则需要map端聚合
    val shouldCombine = aggregator.isDefined

    //map端预聚合,数据以PartitionedAppendOnlyMap形式缓存
    if (shouldCombine) {
      // Combine values in-memory first using our AppendOnlyMap
      val mergeValue = aggregator.get.mergeValue
      val createCombiner = aggregator.get.createCombiner
      var kv: Product2[K, V] = null
      //数据聚合:已有数据,就聚合;没有数据就新建一个数据;
      val update = (hadValue: Boolean, oldValue: C) => {
        if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
      }
        
      //遍历记录数据
      while (records.hasNext) {
        //自上次溢写后,本次从记录数据中读取数据的数量+1
        addElementsRead()
        //从记录中读取一条数据
        kv = records.next()
        //数据根据(分区,key)进行聚合
        map.changeValue((getPartition(kv._1), kv._1), update)
        
        //判断缓存数据是否需要溢写到磁盘
        maybeSpillCollection(usingMap = true)
      }
    } else {
      // 非map端聚合,数据以PartitionedPairBuffer形式缓存
      while (records.hasNext) {
        addElementsRead()
        val kv = records.next()
        buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
        maybeSpillCollection(usingMap = false)
      }
    }
  }      
}
5.2.1.1. ExternalSorter#maybeSpillCollection - deciding whether to spill the in-memory buffer

Spill conditions (either one triggers a spill):

  • the number of elements read since the last spill is a multiple of 32, the estimated size of the collection has reached the current threshold, and, after trying to acquire more memory, the estimated size is still at least the (possibly expanded) threshold;
  • the number of elements read exceeds the force-spill threshold (Integer.MAX_VALUE, 0x7fffffff, by default).

Spilling:

  • the actual spill is implemented by ExternalSorter#spill;
  • each spill produces one temporary spill file;
  • all spill files produced for the data fed into the ExternalSorter are tracked in its spills field.

The collection's memory threshold:

  • initial threshold
    • set by spark.shuffle.spill.initialMemoryThreshold;
    • 5 MB by default;
  • threshold expansion
    • amount requested each time: 2 × current collection size − current threshold (the threshold grows by however much memory is actually granted);
    • when it happens: whenever the current collection size reaches the current threshold, on a multiple-of-32 element count (a worked numeric example follows the code below).

Additional note:

  • after the buffered data is spilled to disk, the in-memory collection is rebuilt from scratch;
private[spark] class ExternalSorter[K, V, C](
    context: TaskContext,
    aggregator: Option[Aggregator[K, V, C]] = None,
    partitioner: Option[Partitioner] = None,
    ordering: Option[Ordering[K]] = None,
    serializer: Serializer = SparkEnv.get.serializer)
  extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
  with Logging {
      
  @volatile private var map = new PartitionedAppendOnlyMap[K, C]
  @volatile private var buffer = new PartitionedPairBuffer[K, C]
      
  //到目前为止观察到的内存中数据结构的峰值大小,以字节为单位
  private var _peakMemoryUsedBytes: Long = 0L
  def peakMemoryUsedBytes: Long = _peakMemoryUsedBytes
      
  private def maybeSpillCollection(usingMap: Boolean): Unit = {
    var estimatedSize = 0L
    if (usingMap) {//map端聚合
      //估计集合的当前大小(以字节为单位)
      estimatedSize = map.estimateSize()
      //内存数据溢写到磁盘
      if (maybeSpill(map, estimatedSize)) {
        //发生数据溢写后,重新构建缓存集合
        map = new PartitionedAppendOnlyMap[K, C]
      }
    } else {
      //估计集合的当前大小(以字节为单位)
      estimatedSize = buffer.estimateSize()
      //内存数据溢写到磁盘
      if (maybeSpill(buffer, estimatedSize)) {
        //发生数据溢写后,重新构建缓存集合
        buffer = new PartitionedPairBuffer[K, C]
      }
    }

    //更新内存中数据结构的峰值大小
    if (estimatedSize > _peakMemoryUsedBytes) {
      _peakMemoryUsedBytes = estimatedSize
    }
  }     
}


private[spark] abstract class Spillable[C](taskMemoryManager: TaskMemoryManager)
  extends MemoryConsumer(taskMemoryManager) with Logging {
      
  // 自上次溢出后从输入中读取的元素数
  protected def elementsRead: Int = _elementsRead
  private[this] var _elementsRead = 0
      
  // 集合大小的初始阈值:默认5M
  private[this] val initialMemoryThreshold: Long =
    SparkEnv.get.conf.getLong("spark.shuffle.spill.initialMemoryThreshold", 5 * 1024 * 1024)
      
  //集合的字节大小阈值
  //为了避免大量的小溢出,初始化该值为数量级> 0
  @volatile private[this] var myMemoryThreshold = initialMemoryThreshold
      
  //溢写的总字节数
  @volatile private[this] var _memoryBytesSpilled = 0L
      
  //发生溢写的次数
  private[this] var _spillCount = 0
      
  //如果需要,将当前内存中的收集信息溢出到磁盘。试图在溢出之前获取更多内存
  protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
    var shouldSpill = false
    //读取数据量是32的倍数,且集合数据内存占用量不小于集合设置的阈值
    if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
      //集合内存阈值扩充:2倍当前使用量 - 原有阈值
      val amountToRequest = 2 * currentMemory - myMemoryThreshold
      val granted = acquireMemory(amountToRequest)
      myMemoryThreshold += granted
        
      // 如果申请到的不够,map或buffer预估占用内存量还是大于阈值,确定溢写
      shouldSpill = currentMemory >= myMemoryThreshold
    }
      
    //如果上面判定不需要溢写,但读取的记录总数比Integer.MAX_VALUE大,也还是得溢写
    //numElementsForceSpillThreshold:Integer.MAX_VALUE   0x7fffffff
    shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
    
      //溢写实施
    if (shouldSpill) {
      //溢写次数+1
      _spillCount += 1
      logSpillage(currentMemory)
        
      //溢写
      spill(collection)
        
      //清空数据读取量
      _elementsRead = 0
        
      //累加溢写总字节数
      _memoryBytesSpilled += currentMemory
        
      //释放内存
      releaseMemory()
    }
    
    //返回溢写判断结果
    shouldSpill
  }
}
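A worked example of the threshold arithmetic under the default settings (5 MB initial threshold); the numbers simply replay maybeSpill, assuming for illustration that the memory manager grants the full request:

object SpillThresholdSketch {
  def main(args: Array[String]): Unit = {
    val initialThreshold = 5L * 1024 * 1024          // spark.shuffle.spill.initialMemoryThreshold
    var myMemoryThreshold = initialThreshold

    // Suppose the multiple-of-32 check fires while the map is estimated at 6 MB.
    val currentMemory = 6L * 1024 * 1024
    val amountToRequest = 2 * currentMemory - myMemoryThreshold   // 7 MB requested
    val granted = amountToRequest                                 // assume it is fully granted
    myMemoryThreshold += granted                                  // threshold is now 12 MB

    // 6 MB < 12 MB, so no spill yet: the collection keeps growing in memory.
    val shouldSpill = currentMemory >= myMemoryThreshold
    println(s"threshold=${myMemoryThreshold / 1024 / 1024}MB shouldSpill=$shouldSpill")

    // Had the executor granted nothing, the threshold would have stayed at 5 MB and
    // 6 MB >= 5 MB would have forced the whole in-memory collection to spill.
  }
}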
5.2.1.2. ExternalSorter#spill - spilling the in-memory buffer to disk

Steps:

  • sort the collection's data with the sort comparator and obtain an iterator over the sorted data;
  • spill the sorted data to disk and return the spill file:
    • create a temporary file via the diskBlockManager;
    • create a disk writer for the temporary file via the blockManager;
    • iterate over the sorted data, appending records to the writer one by one while counting how many have been written;
    • every 10,000 records, flush the writer so the batch lands in the temporary file;
    • after the iteration, flush whatever is left (fewer than 10,000 records);
    • return the temporary file;
  • append the spill file to the list of spill files.

Additional note:

  • the collection is written to the disk file in batches; the batch size is set by spark.shuffle.spill.batchSize and defaults to 10,000;
private[spark] class ExternalSorter[K, V, C](
    context: TaskContext,
    aggregator: Option[Aggregator[K, V, C]] = None,
    partitioner: Option[Partitioner] = None,
    ordering: Option[Ordering[K]] = None,
    serializer: Serializer = SparkEnv.get.serializer)
  extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
  with Logging {
      
  private val spills = new ArrayBuffer[SpilledFile]
      
  private val serInstance = serializer.newInstance()

  //文件缓存大小,默认32K
  private val fileBufferSize = conf.getSizeAsKb("spark.shuffle.file.buffer", "32k").toInt * 1024
  
  //批处理记录数量,默认10000
  private val serializerBatchSize = conf.getLong("spark.shuffle.spill.batchSize", 10000)
     
  //记录每个分区的数据量
  val elementsPerPartition = new Array[Long](numPartitions)  
  
      
  //集合数据溢写到磁盘
  override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
     
    //对集合中的数据根据排序比较器进行排序,获取排序后数据迭代器
    //有排序比较器,对分区内数据根据key升序排序
    //没有排序比较器,根据分区进行升序排序
    val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
    
    //数据溢写到磁盘
    val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
      
    //添加到临时文件集合
    spills += spillFile
  }  
      
  //排序比较器:分区内对key进行升序排列    
  private val keyComparator: Comparator[K] = ordering.getOrElse(new Comparator[K] {
    override def compare(a: K, b: K): Int = {
      val h1 = if (a == null) 0 else a.hashCode()
      val h2 = if (b == null) 0 else b.hashCode()
      if (h1 < h2) -1 else if (h1 == h2) 0 else 1
    }
  })
  
  //获取排序比较器
  private def comparator: Option[Comparator[K]] = {
    if (ordering.isDefined || aggregator.isDefined) {
      Some(keyComparator)
    } else {
      None
    }
  }
  
  //数据溢写到磁盘
  private[this] def spillMemoryIteratorToDisk(inMemoryIterator: WritablePartitionedIterator)
      : SpilledFile = {
    // 因为溢写文件在shuffle过程中会被读取,因此它们的压缩不由spill相关参数控制
    // 创建一个临时块
    val (blockId, file) = diskBlockManager.createTempShuffleBlock()

    //记录每次溢写的数据量
    var objectsWritten: Long = 0
    val spillMetrics: ShuffleWriteMetrics = new ShuffleWriteMetrics
    
    //获取临时文件块磁盘写入器
    val writer: DiskBlockObjectWriter =
      blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, spillMetrics)

    // List of batch sizes (bytes) in the order they are written to disk
    val batchSizes = new ArrayBuffer[Long]

    // How many elements we have in each partition
    val elementsPerPartition = new Array[Long](numPartitions)

    //文件flush,返回记录的偏移量、长度、临时文件
    def flush(): Unit = {
      //将磁盘写出器序列化流中数据flush到临时文件中
      val segment = writer.commitAndGet()
      batchSizes += segment.length
      _diskBytesSpilled += segment.length
      objectsWritten = 0
    }

    var success = false
    try {
      //记录数据遍历
      while (inMemoryIterator.hasNext) {
        val partitionId = inMemoryIterator.nextPartition()
        require(partitionId >= 0 && partitionId < numPartitions,
          s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")
        
        //数据逐条添加到磁盘写出器序列化流中
        inMemoryIterator.writeNext(writer)
          
        //累加记录所附分区的数据量
        elementsPerPartition(partitionId) += 1
        
        //累加当次溢写的数据量
        objectsWritten += 1

        //每10000条数据落地到磁盘文件一次
        if (objectsWritten == serializerBatchSize) {
          flush()
        }
      }
      
      //遍历完毕,将剩余不足10000的数据的落地到磁盘
      if (objectsWritten > 0) {
        flush()
      } else {
        writer.revertPartialWritesAndClose()
      }
      success = true
    } finally {
      if (success) {
        writer.close()
      } else {
        // 写入失败:回滚部分写入并关闭写出器,删除临时文件
        writer.revertPartialWritesAndClose()
        if (file.exists()) {
          if (!file.delete()) {
            logWarning(s"Error deleting ${file}")
          }
        }
      }
    }

    //返回溢写文件
    SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition)
  }
}

5.2.2. ExternalSorter#writePartitionedFile - producing the final data file

Purpose:

  • merge the spill files and the in-memory data into one complete data file.

Details:

  • if no spill has happened
    • sort the in-memory data by partition id (and key, if required);
    • write the sorted data to the output file partition by partition (one flush per partition);
  • if spills have happened
    • merge the data in the on-disk spill files with the in-memory data, partition by partition;
    • write the merged data to the output file partition by partition (one flush per partition).
private[spark] class ExternalSorter[K, V, C](
    context: TaskContext,
    aggregator: Option[Aggregator[K, V, C]] = None,
    partitioner: Option[Partitioner] = None,
    ordering: Option[Ordering[K]] = None,
    serializer: Serializer = SparkEnv.get.serializer)
  extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
  with Logging {
      
  def writePartitionedFile(
      blockId: BlockId,
      outputFile: File): Array[Long] = {

    // Track location of each range in the output file
    val lengths = new Array[Long](numPartitions)
    
    //通过blockManager获取输出文件写入器
    val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
      context.taskMetrics().shuffleWriteMetrics)

    if (spills.isEmpty) {//没有发生溢写
      // 确定数据在内存中的缓存形式
      val collection = if (aggregator.isDefined) map else buffer
      //数据排序:根据分区id和key排序
      val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
      //排序后的数据遍历:将数据根据分区依次写入输出文件中
      while (it.hasNext) {
        val partitionId = it.nextPartition()
        //同一个分区的数据依次添加到写出器序列化流中:同一个分区的数据在一起
        while (it.hasNext && it.nextPartition() == partitionId) {
          it.writeNext(writer)
        }
        
        //序列化流中数据flush到文件,返回记录的偏移量、长度、临时文件
        val segment = writer.commitAndGet()
        
        //缓存分区与分区数据量
        lengths(partitionId) = segment.length
      }
    } else {//有发生数据溢写
      // 溢写文件和缓存数据合并,合并后再根据分区依次写入输出文件中
      for ((id, elements) <- this.partitionedIterator) {
        if (elements.hasNext) {
          //将分区中的数据依次添加到写出器序列化流中
          for (elem <- elements) {
            writer.write(elem._1, elem._2)
          }
            
          //序列化流中数据flush到文件,返回记录的偏移量、长度、临时文件
          val segment = writer.commitAndGet()
            
          //缓存分区与分区数据量
          lengths(id) = segment.length
        }
      }
    }

    writer.close()
    context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)
    context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)
    context.taskMetrics().incPeakExecutionMemory(peakMemoryUsedBytes)

    lengths
  }
}
5.2.2.1. ExternalSorter#partitionedIterator - merging on-disk and in-memory data

Details:

  • read both the on-disk spill files and the in-memory buffer;
  • iterate over the partitions, merging the spill-file data and the in-memory data belonging to the same partition;
  • apply aggregation / ordering to the merged stream:
    • if an aggregator is defined: perform partial aggregation across the spills using the aggregator's merge function, then sort by key if an ordering is defined;
    • if no aggregator but an ordering is defined: merge-sort the elements instead of combining them;
    • if neither is defined: return the merged iterators as-is;
  • return an iterator of (partition id, iterator over that partition's data).
private[spark] class ExternalSorter[K, V, C](
    context: TaskContext,
    aggregator: Option[Aggregator[K, V, C]] = None,
    partitioner: Option[Partitioner] = None,
    ordering: Option[Ordering[K]] = None,
    serializer: Serializer = SparkEnv.get.serializer)
  extends Spillable[WritablePartitionedPairCollection[K, C]](context.taskMemoryManager())
  with Logging {
      
  def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {
    val usingMap = aggregator.isDefined
    val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
    if (spills.isEmpty) {//未发生溢写
      //未定义排序规则
      if (!ordering.isDefined) {
        // 只按分区ID排序,而不是key
        groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
      } else {//定义排序规则
        // 根据分区ID和key进行排序
        groupByPartition(destructiveIterator(
          collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
      }
    } else {//发生溢写
      //合并溢出的和内存中的数据
      merge(spills, destructiveIterator(
        collection.partitionedDestructiveSortedIterator(comparator)))
    }
  }
   
  //数据合并
  private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
      : Iterator[(Int, Iterator[Product2[K, C]])] = {
     
    //从磁盘读取溢写的临时文件
    val readers = spills.map(new SpillReader(_))
    //获取内存中缓存的数据
    val inMemBuffered = inMemory.buffered
          
    //遍历分区
    (0 until numPartitions).iterator.map { p =>
      //获取内存中当前分区数据  
      val inMemIterator = new IteratorForPartition(p, inMemBuffered)
      //将磁盘文件数据和内存缓存数据合流
      val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
      if (aggregator.isDefined) {//定义了聚合器
        // 跨分区执行部分聚合:根据聚合器定义聚合value,最后按照key排序
        (p, mergeWithAggregation(
          iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
      } else if (ordering.isDefined) {//没有聚合器,但是定义了排序器
        // 对元素进行排序,而不是合并它们
        (p, mergeSort(iterators, ordering.get))
      } else {//没有定义聚合器和排序器
        //返回合并后的结果
        (p, iterators.iterator.flatten)
      }
    }
  }
}

6. UnsafeShuffleWriter

6.1. Instantiation

Notes:

  • an UnsafeShuffleWriter is created directly through its constructor;
  • UnsafeShuffleWriter requires the stage's partition count to be at most 16,777,216;
  • the constructor builds the external sorter with an initial pointer-buffer size of 4096 entries;
  • it also allocates the serialization buffer, 1 MB in size.
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    private static final Logger logger = LoggerFactory.getLogger(UnsafeShuffleWriter.class);
    private static final ClassTag<Object> OBJECT_CLASS_TAG;
    @VisibleForTesting
    static final int DEFAULT_INITIAL_SORT_BUFFER_SIZE = 4096;
    static final int DEFAULT_INITIAL_SER_BUFFER_SIZE = 1048576;
    private final BlockManager blockManager;
    private final IndexShuffleBlockResolver shuffleBlockResolver;
    private final TaskMemoryManager memoryManager;
    private final SerializerInstance serializer;
    private final Partitioner partitioner;
    private final ShuffleWriteMetrics writeMetrics;
    private final int shuffleId;
    private final int mapId;
    private final TaskContext taskContext;
    private final SparkConf sparkConf;
    private final boolean transferToEnabled;
    private final int initialSortBufferSize;
    private final int inputBufferSizeInBytes;
    private final int outputBufferSizeInBytes;
    @Nullable
    private MapStatus mapStatus;
    @Nullable
    private ShuffleExternalSorter sorter;
    private long peakMemoryUsedBytes = 0L;
    private UnsafeShuffleWriter.MyByteArrayOutputStream serBuffer;
    private SerializationStream serOutputStream;
    private boolean stopping = false;

    public UnsafeShuffleWriter(BlockManager blockManager, IndexShuffleBlockResolver shuffleBlockResolver, TaskMemoryManager memoryManager, SerializedShuffleHandle<K, V> handle, int mapId, TaskContext taskContext, SparkConf sparkConf) throws IOException {
        
        int numPartitions = handle.dependency().partitioner().numPartitions();
        //分区数不能大于16777216
        if (numPartitions > SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE()) {
            throw new IllegalArgumentException("UnsafeShuffleWriter can only be used for shuffles with at most " + SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE() + " reduce partitions");
        } else {
            this.blockManager = blockManager;
            this.shuffleBlockResolver = shuffleBlockResolver;
            this.memoryManager = memoryManager;
            this.mapId = mapId;
            ShuffleDependency<K, V, V> dep = handle.dependency();
            this.shuffleId = dep.shuffleId();
            this.serializer = dep.serializer().newInstance();
            this.partitioner = dep.partitioner();
            this.writeMetrics = taskContext.taskMetrics().shuffleWriteMetrics();
            this.taskContext = taskContext;
            this.sparkConf = sparkConf;
            this.transferToEnabled = sparkConf.getBoolean("spark.file.transferTo", true);
            
            //排序器初始化缓存大小:4096
            this.initialSortBufferSize = sparkConf.getInt("spark.shuffle.sort.initialBufferSize", 4096);
            //输入缓存大小:默认32k
            this.inputBufferSizeInBytes = (int)(Long)sparkConf.get(package$.MODULE$.SHUFFLE_FILE_BUFFER_SIZE()) * 1024;
            //输出缓存大小:默认32k
            this.outputBufferSizeInBytes = (int)(Long)sparkConf.get(package$.MODULE$.SHUFFLE_UNSAFE_FILE_OUTPUT_BUFFER_SIZE()) * 1024;
            this.open();
        }
    }
    
    private void open() {
        assert this.sorter == null;

        //初始化排序器,指定缓存初始化大小4096
        this.sorter = new ShuffleExternalSorter(this.memoryManager, this.blockManager, this.taskContext, this.initialSortBufferSize, this.partitioner.numPartitions(), this.sparkConf, this.writeMetrics);
        
        //初始化系列化缓存,指定大小1M
        this.serBuffer = new UnsafeShuffleWriter.MyByteArrayOutputStream(1048576);
        
        //初始化序列化流
        this.serOutputStream = this.serializer.serializeStream(this.serBuffer);
    }
}

6.2. write

Notes:

  • UnsafeShuffleWriter has two overloaded write functions; both end up in write(scala.collection.Iterator<Product2<K, V>> records);
  • first, every record from the iterator is added to the sorter;
    • whenever the sorter hits its spill condition, the buffered data is spilled into a temporary file;
  • then the sorter's data is written out to one output file;
    • this produces one output file plus one matching index file;
  • finally, the sorter's resources are released.
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    
  public void write(Iterator<Product2<K, V>> records) throws IOException {
    //将java迭代器转为scala迭代器
    write(JavaConverters.asScalaIteratorConverter(records).asScala());
  }
  
  public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
    
    boolean success = false;
    try {
       
      //遍历记录,逐条将记录添加到排序器中
      while (records.hasNext()) {
        insertRecordIntoSorter(records.next());
      }
      //合并临时文件为一个整体文件,然后删除临时文件
      closeAndWriteOutput();
      success = true;
    } finally {
      if (sorter != null) {
        try {
          //释放资源
          sorter.cleanupResources();
        } catch (Exception e) {
          // Only throw this error if we won't be masking another
          // error.
          if (success) {
            throw e;
          } else {
            logger.error("In addition to a failure during writing, we failed during " +
                         "cleanup.", e);
          }
        }
      }
    }
  }
}

6.2.1. insertRecordIntoSorter - adding a record to the sorter

Notes:

  • first, the record is written into the serialization stream;
    • the stream buffers data in a MyByteArrayOutputStream, 1 MB by default;
    • the stream's counter tracks how many bytes are buffered;
  • then the serialized bytes are handed to the sorter.
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    
  private ShuffleExternalSorter sorter;
    
  //序列化缓存:默认大小1M
  private MyByteArrayOutputStream serBuffer;
  //序列化流
  private SerializationStream serOutputStream;
    
  void insertRecordIntoSorter(Product2<K, V> record) throws IOException {
    assert(sorter != null);
    final K key = record._1();
    final int partitionId = partitioner.getPartition(key);
      
    //重置序列化缓存计数器:重置为0
    serBuffer.reset();
      
    //将key、value写入序列化流中
    serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
    serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
    serOutputStream.flush();

    //确保序列化缓存中有数据
    final int serializedRecordSize = serBuffer.size();
    assert (serializedRecordSize > 0);

    //将序列化数据添加到排序器中
    sorter.insertRecord(
      serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
  }
}
6.2.1.1. ShuffleExternalSorter#insertRecord

Steps:

  • 1. if the buffered data has reached the spill condition, spill it to disk;
  • 2. if the in-memory sorter's pointer array needs to grow, expand it;
  • 3. acquire a new page for the data if necessary;
  • 4. compute the address at which the record will be stored;
  • 5. write the record length into the page;
  • 6. copy the record bytes into the page;
  • 7. store the record's address, together with its partition id, in the in-memory sorter.

Summary:

  • the record bytes live in the external sorter's list of pages (MemoryBlock chunks);
  • the record addresses and partition ids live in the in-memory sorter's pointer array;
  • spill condition: the number of records in the in-memory sorter exceeds the force-spill threshold (Integer.MAX_VALUE);
  • each spill produces one temporary file.
final class ShuffleExternalSorter extends MemoryConsumer {
    
  	public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId) throws IOException {
        
        assert this.inMemSorter != null;

        //内存排序器中数据量 >= Integer.MAX_VALUE
        if (this.inMemSorter.numRecords() >= this.numElementsForSpillThreshold) {
            logger.info("Spilling data because number of spilledRecords crossed the threshold " + this.numElementsForSpillThreshold);
            
            //数据溢写到磁盘
            this.spill();
        }

        //内存排序器中存储数据的缓存扩容
        this.growPointerArrayIfNecessary();
        
        //数据对齐,确定数据长度
        int uaoSize = UnsafeAlignedOffset.getUaoSize();
        int required = length + uaoSize;
        
        //申请新的page存储数据
        this.acquireNewPageIfNecessary(required);

        assert this.currentPage != null;

        //获取当前page中存储数据的对象
        Object base = this.currentPage.getBaseObject();
        
        //获取数据存储地址
        long recordAddress = this.taskMemoryManager.encodePageNumberAndOffset(this.currentPage, this.pageCursor);
        
        //将数据长度添加到page
        UnsafeAlignedOffset.putSize(base, this.pageCursor, length);
        
        // 调整下page指针的内存地址
        this.pageCursor += (long)uaoSize;
        
        // 再将数据复制到page中
        Platform.copyMemory(recordBase, recordOffset, base, this.pageCursor, (long)length);
        
        // 调整下page光标位置
        this.pageCursor += (long)length;
        
        // 将数据在page中的存储地址和分区id记录到内存排序器中
        this.inMemSorter.insertRecord(recordAddress, partitionId);
    }
}
6.2.1.1.1. ShuffleExternalSorter#spill - spilling the buffer to disk

Notes:

  • check whether the spill condition is met;
  • spill the data into a temporary file; one spill produces one temporary file;
    • sorting is done by ordering the record addresses in the in-memory sorter by partition id, ascending;
    • so within the spill file, records of the same partition are contiguous;
  • release the memory;
  • reset the in-memory sorter.
final class ShuffleExternalSorter extends MemoryConsumer {
    
    //从父类MemoryConsumer继承过来的
	public void spill() throws IOException {
        this.spill(9223372036854775807L, this);
    }
    
    public long spill(long size, MemoryConsumer trigger) throws IOException {
        //要求内存排序器缓存数据量 > 0
        if (trigger == this && this.inMemSorter != null && this.inMemSorter.numRecords() != 0) {
            logger.info("Thread {} spilling sort data of {} to disk ({} {} so far)", new Object[]{Thread.currentThread().getId(), Utils.bytesToString(this.getMemoryUsage()), this.spills.size(), this.spills.size() > 1 ? " times" : " time"});
            
            //排序并写入临时文件
            this.writeSortedFile(false);
            
            //释放内存资源
            long spillSize = this.freeMemory();
            
            //重置内存排序器
            this.inMemSorter.reset();
            
            //上报溢出切片文件大小
            this.taskContext.taskMetrics().incMemoryBytesSpilled(spillSize);
            
            return spillSize;
        } else {
            return 0L;
        }
    }
}
6.2.1.1.1.1. writeSortedFile - spilling the data into a temporary file

Notes:

  • the record addresses held by the in-memory sorter are sorted ascending by the partition id embedded in each address, using RadixSort;
    • the partition id occupies bytes 5-7 (three bytes) of each packed address;
    • three bytes are 24 bits, so up to 2^24 partition ids can be represented;
  • a temporary spill file is obtained from the blockManager;
  • a disk writer is obtained from the blockManager;
  • the sorted addresses are iterated, and the record behind each address is written to the temporary file:
    • the partition id is read out of the address;
    • when the partition changes, the previous partition's data is flushed to the file and its length recorded in the SpillInfo object;
    • the address is used to locate the page, in the external sorter's page list, that holds the record;
    • the record bytes are copied from the page into the writer's serialization stream:
      • each pass copies at most one write-buffer's worth of bytes from the page into the write buffer;
      • the write buffer is then written to the serialization stream;
      • the write-buffer size is controlled by spark.shuffle.spill.diskWriteBufferSize and defaults to 1 MB;
    • once the whole record has been copied, the writer is told that one full record has been written;
  • after the loop, the remaining data is flushed from the serialization stream into the temporary file;
  • the writer is closed;
  • the last partition's length is stored in the SpillInfo object, which is then appended to the list of spills;
  • write metrics are reported.

Summary:

  • sorting is performed on the record addresses cached by the in-memory sorter;
  • each sorted address is used to find the page, in the external sorter's page list, that stores the record;
  • records are written to the temporary file in the order of the sorted addresses;
  • because the addresses are sorted by partition id, records of the same partition end up contiguous in the spill file;
  • each record is first copied into the write buffer (1 MB by default) and then into the serialization stream; a partition's data is flushed to the file in one go.

final class ShuffleExternalSorter extends MemoryConsumer {
    
    private void writeSortedFile(boolean isLastFile) {
        ShuffleWriteMetrics writeMetricsToUse;
        
        if (isLastFile) {//非切分文件
            //上报指标
            writeMetricsToUse = this.writeMetrics;
        } else {//切分文件
            //构建指标统计器
            writeMetricsToUse = new ShuffleWriteMetrics();
        }

        //数据排序
        ShuffleSorterIterator sortedRecords = this.inMemSorter.getSortedIterator();
        
        //构建磁盘写出缓存:通过spark.shuffle.spill.diskWriteBufferSize控制大小
        byte[] writeBuffer = new byte[this.diskWriteBufferSize];
        
        //从blockManager获取临时溢写文件信息
        Tuple2<TempShuffleBlockId, File> spilledFileInfo = this.blockManager.diskBlockManager().createTempShuffleBlock();
        File file = (File)spilledFileInfo._2();
        TempShuffleBlockId blockId = (TempShuffleBlockId)spilledFileInfo._1();
        
        //构建溢写信息对象
        SpillInfo spillInfo = new SpillInfo(this.numPartitions, file, blockId);
        SerializerInstance ser = DummySerializerInstance.INSTANCE;
        
        //从blockManager中获取磁盘写出器
        DiskBlockObjectWriter writer = this.blockManager.getDiskWriter(blockId, file, ser, this.fileBufferSizeBytes, writeMetricsToUse);
        
        //初始化当前分区id
        int currentPartition = -1;
        
        int uaoSize = UnsafeAlignedOffset.getUaoSize();

        //遍历数据存储地址
        while(sortedRecords.hasNext()) {
            
            //将地址数据加载到packedRecordPointer中
            sortedRecords.loadNext();
            //从地址数据中获取对应分区id
            int partition = sortedRecords.packedRecordPointer.getPartitionId();

            assert partition >= currentPartition;

            //分区号不同:代表上一个分区数据已经全部添加到写出器序列化流中
            if (partition != currentPartition) {
                //currentPartition != -1:不是处理第一个分区的数据
                if (currentPartition != -1) {
                    //将上一个分区的数据从序列化流中flush到临时文件中
                    FileSegment fileSegment = writer.commitAndGet();
                    //将分区数据量缓存到溢写信息对象中
                    spillInfo.partitionLengths[currentPartition] = fileSegment.length();
                }

                //更新当前分区id
                currentPartition = partition;
            }

            //根据内存排序器缓存的数据地址找到外部排序器通过链表缓存的page
            long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
            Object recordPage = this.taskMemoryManager.getPage(recordPointer);
            
            //通过数据在page中的偏移量,计算数据长度
            long recordOffsetInPage = this.taskMemoryManager.getOffsetInPage(recordPointer);
            int dataRemaining = UnsafeAlignedOffset.getSize(recordPage, recordOffsetInPage);

            //遍历数据
            int toTransfer;
            for(long recordReadPosition = recordOffsetInPage + (long)uaoSize; dataRemaining > 0; dataRemaining -= toTransfer) {
                //计算每次处理的数据量:在磁盘写出缓存大小和数据大小间取小
                toTransfer = Math.min(this.diskWriteBufferSize, dataRemaining);
                
                //根据确定的处理数据量,从page中将数据复制到写出缓存器中
                Platform.copyMemory(recordPage, recordReadPosition, writeBuffer, (long)Platform.BYTE_ARRAY_OFFSET, (long)toTransfer);
                
                //将写出缓冲器中数据添加到写出器的序列化流中
                writer.write(writeBuffer, 0, toTransfer);
                
                //数据读取偏移量更新
                recordReadPosition += (long)toTransfer;
            }

            //通知写入器已经向序列化流中写入了一条数据
            writer.recordWritten();
        }

        //数据遍历结束,将数据从序列化流中flush到临时文件中
        FileSegment committedSegment = writer.commitAndGet();
        
        //关闭写出器
        writer.close();
        if (currentPartition != -1) {
            //将分区数据量缓存到溢写信息对象中
            spillInfo.partitionLengths[currentPartition] = committedSegment.length();
            //将本次溢写信息对象添加到溢写信息对象链表中
            this.spills.add(spillInfo);
        }

        if (!isLastFile) {
            //上报指标
            this.writeMetrics.incRecordsWritten(writeMetricsToUse.recordsWritten());
            this.taskContext.taskMetrics().incDiskBytesSpilled(writeMetricsToUse.bytesWritten());
        }

    }
}
6.2.1.1.1.2. Sorting the data before the spill

Notes:

  • only the record addresses cached in the in-memory sorter are sorted;
  • they are sorted by the partition id packed into each address.
final class ShuffleInMemorySorter {
    
    public ShuffleInMemorySorter.ShuffleSorterIterator getSortedIterator() {
        int offset = 0;
        
        //默认基数排序
        if (this.useRadixSort) {
            //分区id存储在longArray的5~7这3个字节中;最多可以存2的24次方个分区id;
            offset = RadixSort.sort(this.array, (long)this.pos, 5, 7, false, false);
        } else {
            //TimSort排序
            MemoryBlock unused = new MemoryBlock(this.array.getBaseObject(), this.array.getBaseOffset() + (long)this.pos * 8L, (this.array.size() - (long)this.pos) * 8L);
            LongArray buffer = new LongArray(unused);
            Sorter<PackedRecordPointer, LongArray> sorter = new Sorter(new ShuffleSortDataFormat(buffer));
            sorter.sort(this.array, 0, this.pos, SORT_COMPARATOR);
        }

        return new ShuffleInMemorySorter.ShuffleSorterIterator(this.pos, this.array, offset);
    }
}
6.2.1.1.2. growPointerArrayIfNecessary - growing the in-memory sorter's pointer array

Notes:

  • first, check whether the pointer array actually needs to grow;
  • if it does:
    • compute the current memory usage of the array;
    • allocate a new array twice as large;
    • check again whether the old array now has room (another task may have released memory in the meantime);
      • if it does, free the newly allocated array;
      • if it does not, move the pointers into the new array and make it the sorter's buffer.
final class ShuffleExternalSorter extends MemoryConsumer {
    
    private ShuffleInMemorySorter inMemSorter;
    
    private void growPointerArrayIfNecessary() throws IOException {
        assert this.inMemSorter != null;

        //判断是否达到扩容条件
        if (!this.inMemSorter.hasSpaceForAnotherRecord()) {
            //计算当前缓存容量
            long used = this.inMemSorter.getMemoryUsage();

            //扩容
            LongArray array;
            try {
                array = this.allocateArray(used / 8L * 2L);
            } catch (TooLargePageException var5) {
                this.spill();
                return;
            } catch (SparkOutOfMemoryError var6) {
                if (!this.inMemSorter.hasSpaceForAnotherRecord()) {
                    logger.error("Unable to grow the pointer array");
                    throw var6;
                }

                return;
            }

            //再次判断缓存是否够用:可能其他task释放了资源,从而缓存够用
            if (this.inMemSorter.hasSpaceForAnotherRecord()) {
                //这种情况下,释放刚申请的page资源
                this.freeArray(array);
            } else {
                //资源还是不够用,使用刚申请的page资源
                this.inMemSorter.expandPointerArray(array);
            }
        }

    }
}
6.2.1.1.2.1. Deciding whether the array needs to grow

Notes:

  • while the index of the most recently stored pointer is below the usable capacity, no growth is needed;
    • conversely, once that index reaches the usable capacity, the array must grow;
  • usable capacity (a quick arithmetic check follows the code below)
    • with radix sort (the default), half of the total capacity is usable;
    • on the non-radix-sort (TimSort) path, two thirds of the total capacity is usable;
  • the buffer itself is a LongArray.
final class ShuffleInMemorySorter {
    
    //缓存数据的数组
    private LongArray array;
    
    //记录数组中最新数据的索引
    private int pos = 0;
    
    //定义数组可使用容量
    private int usableCapacity = 0;
    
    ShuffleInMemorySorter(MemoryConsumer consumer, int initialSize, boolean useRadixSort) {
        //---------其他代码---------
        this.usableCapacity = this.getUsableCapacity();
    }
    
    //最新数据索引 < 可用容量:缓存不需要扩容
    public boolean hasSpaceForAnotherRecord() {
        return this.pos < this.usableCapacity;
    }
    
    //默认可用容量为总容量的一半;
    //如果使用非useRadixSort方案,可用容量为总容量的2/3
    private int getUsableCapacity() {
        return (int)((double)this.array.size() / (this.useRadixSort ? 2.0D : 1.5D));
    }

}
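With the default initial buffer of 4096 pointer slots and radix sort enabled, the usable capacity therefore works out to 2048 entries; the remaining slots are left free as working space for the sort. A quick check of the arithmetic:

object UsableCapacitySketch {
  def usableCapacity(arraySize: Long, useRadixSort: Boolean): Int =
    (arraySize / (if (useRadixSort) 2.0 else 1.5)).toInt

  def main(args: Array[String]): Unit = {
    println(usableCapacity(4096, useRadixSort = true))  // 2048 pointers before the first grow attempt
    println(usableCapacity(4096, useRadixSort = false)) // 2730
  }
}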
6.2.1.1.2.2. Computing the memory usage

Notes:

  • the memory usage is simply the array's size in entries times 8 bytes per entry.
final class ShuffleInMemorySorter {
    
    //返回数组page大小
    public long getMemoryUsage() {
        return this.array.size() * 8L;
    }
}

public final class LongArray {
    //返回数组page大小
    public long size() {
        return this.length;
    }
}
6.2.1.1.2.3. allocateArray - allocating the larger array

Notes:

  • a new page (MemoryBlock) is requested for the new capacity;
  • a new LongArray is built on top of that page.
public abstract class MemoryConsumer {
    
    public LongArray allocateArray(long size) {
        long required = size * 8L;
        
        //根据新的容量申请新的page(MemoryBlock)
        MemoryBlock page = this.taskMemoryManager.allocatePage(required, this);
        if (page == null || page.size() < required) {
            this.throwOom(page, required);
        }

        //更新内存使用量
        this.used += required;
        
        //根据page构建新的数组
        return new LongArray(page);
    }
}
6.2.1.1.2.4. expandPointerArray - moving the pointers

Notes:

  • first, assert that the new array is larger than the old one;
  • copy the data from the old array into the new one;
  • free the old array;
  • make the new array the in-memory sorter's buffer;
  • recompute the sorter's usable capacity.
final class ShuffleInMemorySorter {
    public void expandPointerArray(LongArray newArray) {
        
        //确保新缓存容量比原来缓存容量大
        assert newArray.size() > this.array.size();

        //数据从旧数组复制到新数组
        Platform.copyMemory(this.array.getBaseObject(), this.array.getBaseOffset(), newArray.getBaseObject(), newArray.getBaseOffset(), (long)this.pos * 8L);
       
        //释放旧数组资源
        this.consumer.freeArray(this.array);
        
        //新数组作为内存排序器缓存
        this.array = newArray;
        
        //更新内存排序器可用容量
        this.usableCapacity = this.getUsableCapacity();
    }
}
6.2.1.1.3. acquireNewPageIfNecessary - acquiring a new page for the data

A new page is needed when either:

  • the current page is null;
  • or the current page does not have enough room left.

Steps:

  • allocate a new page sized for the record and make it the current page;
  • reset the page cursor to the start of the new page;
  • append the new page to the list of allocated pages.
final class ShuffleExternalSorter extends MemoryConsumer {
    
    private void acquireNewPageIfNecessary(int required) {
        
        //当前page为null,或者当前page不够用
        if (this.currentPage == null || this.pageCursor + (long)required > this.currentPage.getBaseOffset() + this.currentPage.size()) {
            
            //根据数据长度构建新page作为当前page
            this.currentPage = this.allocatePage((long)required);
            
            //更新page光标位置
            this.pageCursor = this.currentPage.getBaseOffset();
            
            //新构建page加入page列表
            this.allocatedPages.add(this.currentPage);
        }

    }
}

6.2.2. closeAndWriteOutput - writing the sorter's data out to disk

Notes:

  • update the peak memory usage;
  • flush the buffered data to a temporary spill file on disk;
  • merge the spill files (there may be several: one per spill) into a single output file;
    • in the merged file the data is ordered by partition id, so each partition's data is contiguous;
  • create the index file for the output file;
    • it stores the data offsets of each partition;
    • the offsets appear in the same order as the partitions in the data file, one entry per partition;
  • record the shuffle status (MapStatus).
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    
  void closeAndWriteOutput() throws IOException {
    assert(sorter != null);
      
    // 更新内存使用峰值
    updatePeakMemoryUsed();
      
    serBuffer = null;
    serOutputStream = null;
      
    //将缓存中的数据落地到磁盘临时文件中
    final SpillInfo[] spills = sorter.closeAndGetSpills();
      
    sorter = null;
    final long[] partitionLengths;
      
    //获取输出数据文件
    final File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
    //构建输出数据文件的临时文件
    final File tmp = Utils.tempFileWith(output);
    try {
      try {
        //合并临时文件
        partitionLengths = mergeSpills(spills, tmp);
      } finally {
        for (SpillInfo spill : spills) {
          if (spill.file.exists() && ! spill.file.delete()) {
            logger.error("Error while deleting spill file {}", spill.file.getPath());
          }
        }
      }
        
      //创建输出文件对应的index文件
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
      }
    }
      
    //记录shuffle状态信息
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
  }
}
6.2.2.1. closeAndGetSpills - force-flushing the buffer to disk

Notes:

  • the records referenced by the in-memory sorter's addresses are written out to one last temporary file;
  • memory and the in-memory sorter are released;
  • the method returns the array of all SpillInfo objects;
    • each SpillInfo stores the amount of data written for every partition.
final class ShuffleExternalSorter extends MemoryConsumer {
    public SpillInfo[] closeAndGetSpills() throws IOException {
        if (this.inMemSorter != null) {
            //将缓存中的数据溢写到磁盘
            this.writeSortedFile(true);
            
            //释放内存资源
            this.freeMemory();
            
            //释放内存排序器
            this.inMemSorter.free();
            this.inMemSorter = null;
        }

        //返回溢写信息对象数组
        return (SpillInfo[])this.spills.toArray(new SpillInfo[this.spills.size()]);
    }
}
6.2.2.2. mergeSpills - merging the spill files

Notes:

  • no spill file: create an empty output file;
  • exactly one spill file: move it and rename it as the output file;
  • several spill files: merge them;
    • fast merge
      • requires fast merge to be both enabled and supported;
      • spark.file.transferTo decides whether the fast merge is transferTo-based (the default);
      • otherwise the fast merge is file-stream based;
    • slow merge
      • used whenever a fast merge is not possible;
  • returns the array of per-partition data lengths. A configuration sketch follows the code below.
public class UnsafeShuffleWriter<K, V> extends ShuffleWriter<K, V> {
    
    private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
        //是否压缩:默认是,通过spark.shuffle.compress设置
        boolean compressionEnabled = this.sparkConf.getBoolean("spark.shuffle.compress", true);
        //压缩方式编码:默认LZ4CompressionCodec,通过spark.io.compression.codec设置
        CompressionCodec compressionCodec = org.apache.spark.io.CompressionCodec$.MODULE$.createCodec(this.sparkConf);
        
        //是否启用fast merge:默认是
        boolean fastMergeEnabled = this.sparkConf.getBoolean("spark.shuffle.unsafe.fastMergeEnabled", true);
        //是否支持 fast merge:
        //1、不启用压缩算法
        //2、或者SnappyCompressionCodec、LZFCompressionCodec、LZ4CompressionCodec、ZStdCompressionCodec这4种压缩算法之一
        boolean fastMergeIsSupported = !compressionEnabled || org.apache.spark.io.CompressionCodec..MODULE$.supportsConcatenationOfSerializedStreams(compressionCodec);
        
        //是否启用加密:默认否,通过 spark.io.encryption.enabled 参数来设置
        boolean encryptionEnabled = this.blockManager.serializerManager().encryptionEnabled();

        try {
            if (spills.length == 0) {//没有溢写分解
                 // 创建一个空文件
                (new FileOutputStream(outputFile)).close();
                //返回全是0的分区数据量数组
                return new long[this.partitioner.numPartitions()];
            } else if (spills.length == 1) {//一个溢写文件
                //临时文件迁移并重名为输出文件
                Files.move(spills[0].file, outputFile);
                //返回分区数据量数组
                return spills[0].partitionLengths;
            } else {//多个溢写文件
                long[] partitionLengths;
                //fast merge
                if (fastMergeEnabled && fastMergeIsSupported) {
                    //基于传输&&不加密方式快速合并:默认方式;
                    if (this.transferToEnabled && !encryptionEnabled) {
                        //基于传输的合并
                        logger.debug("Using transferTo-based fast merge");
                        partitionLengths = this.mergeSpillsWithTransferTo(spills, outputFile);
                    } else {
                        //基于文件流的合并
                        logger.debug("Using fileStream-based fast merge");
                        partitionLengths = this.mergeSpillsWithFileStream(spills, outputFile, (CompressionCodec)null);
                    }
                } else {
                    //慢合并
                    logger.debug("Using slow merge");
                    partitionLengths = this.mergeSpillsWithFileStream(spills, outputFile, compressionCodec);
                }

                //写出指标统计
                this.writeMetrics.decBytesWritten(spills[spills.length - 1].file.length());
                this.writeMetrics.incBytesWritten(outputFile.length());
                
                //返回分区数据量数组
                return partitionLengths;
            }
        } catch (IOException var9) {
            if (outputFile.exists() && !outputFile.delete()) {
                logger.error("Unable to delete output file {}", outputFile.getPath());
            }

            throw var9;
        }
    }
}
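
To make the transferTo-based fast merge concrete, here is a minimal, self-contained sketch (not Spark's mergeSpillsWithTransferTo itself) of concatenating per-partition segments from several spill files into one output file with NIO's zero-copy transferTo. The (file, partitionLengths) pairs here are an assumed stand-in for SpillInfo.

import java.io.{File, FileInputStream, FileOutputStream}

object TransferToMergeSketch {

  // spills: (spill file, per-partition byte lengths) pairs, standing in for SpillInfo
  def concatSpills(spills: Seq[(File, Array[Long])], numPartitions: Int, out: File): Array[Long] = {
    val partitionLengths = new Array[Long](numPartitions)
    val outChannel = new FileOutputStream(out, true).getChannel
    val inChannels = spills.map { case (f, _) => new FileInputStream(f).getChannel }
    val positions  = Array.fill(spills.length)(0L)
    try {
      // Walk the partitions in ascending order; for each partition, append that partition's
      // bytes from every spill file, so the output ends up grouped by partition id.
      for (p <- 0 until numPartitions; i <- spills.indices) {
        val len = spills(i)._2(p)
        var transferred = 0L
        while (transferred < len) {
          transferred += inChannels(i).transferTo(positions(i) + transferred, len - transferred, outChannel)
        }
        positions(i) += len
        partitionLengths(p) += len
      }
      partitionLengths
    } finally {
      inChannels.foreach(_.close())
      outChannel.close()
    }
  }
}

Because each spill file is itself ordered by partition id, this per-partition concatenation is what leaves the final data file sorted by partition id with each partition's bytes contiguous.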

6.3.Sorters

6.3.1.ShuffleExternalSorter - the external sorter

Notes:

  • ShuffleExternalSorter is a subclass of MemoryConsumer;
  • Constructing a ShuffleExternalSorter instance also invokes the MemoryConsumer constructor;
  • File buffer size: 32k by default;
  • Spill threshold: Integer.MAX_VALUE by default;
  • Disk write buffer: 1M by default;
  • It builds an in-memory sorter and keeps it inside the external sorter;
    • the in-memory cache starts with a capacity of 4096 entries;
    • sorting uses radix sort (useRadixSort) by default;

Summary:

  • The external sorter stores its cached data in a linked list;
    • each element of the list is a page, i.e. a MemoryBlock;
    • all records live inside these pages;
  • currentPage points at the most recently allocated page, the current page;
  • pageCursor is the offset within the current page where the next record will be written;
final class ShuffleExternalSorter extends MemoryConsumer {
    private static final Logger logger = LoggerFactory.getLogger(ShuffleExternalSorter.class);
    @VisibleForTesting
    static final int DISK_WRITE_BUFFER_SIZE = 1048576;
    private final int numPartitions;
    private final TaskMemoryManager taskMemoryManager;
    private final BlockManager blockManager;
    private final TaskContext taskContext;
    private final ShuffleWriteMetrics writeMetrics;
    // Spill threshold
    private final int numElementsForSpillThreshold;
    // File buffer size
    private final int fileBufferSizeBytes;
    // Disk write buffer size
    private final int diskWriteBufferSize;
    // Pages holding the data; at most 2^13 pages can be addressed
    private final LinkedList<MemoryBlock> allocatedPages = new LinkedList();
    // Metadata of the spill files produced so far
    private final LinkedList<SpillInfo> spills = new LinkedList();
    private long peakMemoryUsedBytes;
    @Nullable
    private ShuffleInMemorySorter inMemSorter;
    // The page currently being written to
    @Nullable
    private MemoryBlock currentPage = null;
    // Cursor (write offset) inside the current page
    private long pageCursor = -1L;

    // Constructor
    ShuffleExternalSorter(TaskMemoryManager memoryManager, BlockManager blockManager, TaskContext taskContext, int initialSize, int numPartitions, SparkConf conf, ShuffleWriteMetrics writeMetrics) {
        
        // Invoke the MemoryConsumer (superclass) constructor
        super(memoryManager, (long)((int)Math.min(134217728L, memoryManager.pageSizeBytes())), memoryManager.getTungstenMemoryMode());
        
        this.taskMemoryManager = memoryManager;
        this.blockManager = blockManager;
        this.taskContext = taskContext;
        
        // Number of reduce partitions
        this.numPartitions = numPartitions;
        // File buffer size: 32k by default
        this.fileBufferSizeBytes = (int)(Long)conf.get(package$.MODULE$.SHUFFLE_FILE_BUFFER_SIZE()) * 1024;
        
        // Spill threshold: Integer.MAX_VALUE by default
        this.numElementsForSpillThreshold = (Integer)conf.get(package$.MODULE$.SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD());
        this.writeMetrics = writeMetrics;
        
        // Build the in-memory sorter and keep it inside this sorter
        this.inMemSorter = new ShuffleInMemorySorter(this, initialSize, conf.getBoolean("spark.shuffle.sort.useRadixSort", true));
        
        // Record current memory usage
        this.peakMemoryUsedBytes = this.getMemoryUsage();
        // Disk write buffer size: 1M by default
        this.diskWriteBufferSize = (int)(Long)conf.get(package$.MODULE$.SHUFFLE_DISK_WRITE_BUFFER_SIZE());
    }
}
6.3.1.1.ShuffleInMemorySorter - the in-memory sorter

Notes:

  • Uses a LongArray as the in-memory cache for the record pointers;
  • The array's initial capacity is 4096 entries;
  • The usable capacity depends on the sort algorithm:
    • with radix sort (useRadixSort), half of the total capacity is usable;
    • without radix sort, 2/3 of the total capacity is usable;
  • Default comparator: ascending order by partition id;

Summary:

  • The in-memory sorter caches record addresses in a LongArray;
    • the LongArray itself is backed by a MemoryBlock;
final class ShuffleInMemorySorter {
    
    // Default sort order: ascending by partition id
    private static final ShuffleInMemorySorter.SortComparator SORT_COMPARATOR = new ShuffleInMemorySorter.SortComparator();
    private final MemoryConsumer consumer;
    private LongArray array;
    private final boolean useRadixSort;
    private int pos = 0;
    private int usableCapacity = 0;
    private final int initialSize;

    ShuffleInMemorySorter(MemoryConsumer consumer, int initialSize, boolean useRadixSort) {
        this.consumer = consumer;

        assert initialSize > 0;

        // Initial cache capacity: 4096
        this.initialSize = initialSize;
        // Radix sort is used by default
        this.useRadixSort = useRadixSort;
        // Allocate a LongArray of initialSize (4096) entries as the sorter's cache
        this.array = consumer.allocateArray((long)initialSize);
        // Usable capacity
        this.usableCapacity = this.getUsableCapacity();
    }
    
    // Inner class: the sort comparator
    private static final class SortComparator implements Comparator<PackedRecordPointer> {
        private SortComparator() {
        }

        // Compare by partition id, ascending
        public int compare(PackedRecordPointer left, PackedRecordPointer right) {
            int leftId = left.getPartitionId();
            int rightId = right.getPartitionId();
            return leftId < rightId ? -1 : (leftId > rightId ? 1 : 0);
        }
    }
}
6.3.1.1.1.Computing the usable capacity

Notes:

  • With radix sort, the usable capacity is half of the total capacity, because radix sort needs an equally sized scratch buffer;
  • Without radix sort, the usable capacity is 2/3 of the total capacity, because the TimSort path reserves a buffer of roughly half the used space (a worked example follows the code below);
final class ShuffleInMemorySorter {
    private static final ShuffleInMemorySorter.SortComparator SORT_COMPARATOR = new ShuffleInMemorySorter.SortComparator();
    private final MemoryConsumer consumer;
    private LongArray array;
    private final boolean useRadixSort;
    private int pos = 0;
    private int usableCapacity = 0;
    private final int initialSize;

    // With radix sort the usable capacity is half of the total; otherwise it is 2/3 of the total
    private int getUsableCapacity() {
        return (int)((double)this.array.size() / (this.useRadixSort ? 2.0D : 1.5D));
    }
}
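
A quick worked example of the formula above, assuming the default array size of 4096 entries:

object UsableCapacityExample extends App {
  val size = 4096
  println((size / 2.0).toInt)  // 2048 usable entries when radix sort is enabled
  println((size / 1.5).toInt)  // 2730 usable entries on the TimSort path
}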

6.4.Summary

Applicable when:

  • The serializer supports relocation of serialized objects;
  • There is no map-side aggregation;
  • The number of partitions does not exceed 16777216;

Write flow:

  • First, records from the iterator are added to the sorter one by one;
    • whenever the sorter hits a spill condition, the buffered records are spilled to a temporary file;
  • Then the sorter's remaining data is written out to a single output file;
    • this produces one output file plus one index file for that output file;
  • Finally the sorter's resources are released;

Sorters:

  • External sorter
    • caches the data itself
    • the cache is a linked list whose elements are pages (MemoryBlock);
  • In-memory sorter
    • caches the record addresses
      • bytes 5~7 of each packed record address store the partition id (see the sketch after this list);
      • so at most 2^24 (16777216) partition ids can be encoded;
    • the cache is a LongArray (itself backed by a MemoryBlock);
  • Sorting
    • the cached record addresses are sorted by partition id in ascending order, by default with radix sort;
  • Spilling
    • once the number of cached record addresses reaches Integer.MAX_VALUE, a spill is triggered;
    • each spill produces one temporary file;
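
The partition-id packing can be illustrated with a minimal sketch (simplified, not Spark's actual PackedRecordPointer): a 24-bit partition id is kept in the high bytes of each 64-bit pointer, and the comparator only needs to extract and compare those bits.

object PackedPointerSketch {
  val PartitionBits = 24                   // 2^24 = 16777216 distinct partition ids
  val AddressBits   = 64 - PartitionBits   // remaining 40 bits hold the record address

  def pack(partitionId: Int, address: Long): Long = {
    require(partitionId >= 0 && partitionId < (1 << PartitionBits))
    (partitionId.toLong << AddressBits) | (address & ((1L << AddressBits) - 1))
  }

  def partitionId(packed: Long): Int = (packed >>> AddressBits).toInt

  // The sorter only ever compares partition ids, just like SORT_COMPARATOR above
  def compare(left: Long, right: Long): Int =
    Integer.compare(partitionId(left), partitionId(right))
}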

Output:

  • Each write produces one data output file plus one index file
  • Data file
    • records are ordered by partition id, ascending;
    • each partition's data is stored contiguously;
  • Index file
    • stores the byte offset of every partition in the data file;
    • its entries correspond one-to-one, and in the same order, to the partitions in the data file;

7.Creating the index file for the data file

Notes:

  • The index file records each partition's byte offset;
  • Each offset recorded in the index file corresponds to one partition in the data file; both files keep the partitions in the same ascending order (a reading sketch follows the code below);
private[spark] class IndexShuffleBlockResolver(
    conf: SparkConf,
    _blockManager: BlockManager = null)
  extends ShuffleBlockResolver
  with Logging {
     
  def writeIndexFileAndCommit(
      shuffleId: Int,
      mapId: Int,
      lengths: Array[Long],
      dataTmp: File): Unit = {
      
    // Locate the index file and create a temporary file next to it
    val indexFile = getIndexFile(shuffleId, mapId)
    val indexTmp = Utils.tempFileWith(indexFile)
    try {
        
      // Locate the data file
      val dataFile = getDataFile(shuffleId, mapId)
        
      // There is only one IndexShuffleBlockResolver per executor; this synchronization
      // makes the check and the renames below atomic.
      synchronized {
          
        // Check whether a consistent index file and data file already exist
        val existingLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length)
        if (existingLengths != null) {
          System.arraycopy(existingLengths, 0, lengths, 0, lengths.length)
            
          // A matching index already exists, which means another attempt of this map task
          // has already committed its output; our attempt is redundant, so just delete
          // the temporary data file and return.
          if (dataTmp != null && dataTmp.exists()) {
            dataTmp.delete()
          }
        } else {
          // Open a data output stream on the temporary index file
          val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))
          Utils.tryWithSafeFinally {
            // We take in lengths of each block, need to convert it to offsets.
            var offset = 0L
            out.writeLong(offset)
            // Iterate over the per-partition lengths
            for (length <- lengths) {
              // Accumulate the offset of the next partition
              offset += length
              // Write each partition's end offset into the index file
              out.writeLong(offset)
            }
          } {
            out.close()
          }

          // Delete any stale index file
          if (indexFile.exists()) {
            indexFile.delete()
          }
            
          // Delete any stale data file
          if (dataFile.exists()) {
            dataFile.delete()
          }
            
          // Rename the temporary index file to the final index file
          if (!indexTmp.renameTo(indexFile)) {
            throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
          }
          if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
            throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
          }
        }
      }
    } finally {
      if (indexTmp.exists() && !indexTmp.delete()) {
        logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")
      }
    }
  }
}
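
Since the index file is just numPartitions + 1 longs, a reader can locate partition i's byte range in the data file by reading offsets i and i + 1. A minimal sketch of that lookup (an assumed helper for illustration, not Spark's reader code):

import java.io.{DataInputStream, File, FileInputStream}

object IndexLookupSketch {
  // Returns the (startOffset, endOffset) of partition reduceId inside the data file
  def partitionRange(indexFile: File, reduceId: Int): (Long, Long) = {
    val in = new DataInputStream(new FileInputStream(indexFile))
    try {
      in.skipBytes(reduceId * 8)   // skip the offsets of the earlier partitions
      val start = in.readLong()    // where this partition's data begins
      val end   = in.readLong()    // where the next partition's data begins
      (start, end)
    } finally {
      in.close()
    }
  }
}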

8.Summary

Overall flow:

  • When an RDD's dependencies are built, a wide dependency (ShuffleDependency) initializes its shuffleHandle property

    • at that point it registers the shuffle with the shuffleManager, which instantiates a different ShuffleHandle subtype depending on the situation
  • When an Executor runs a shuffle map task, the task's output data is written to disk through a ShuffleWriter

    • the ShuffleManager picks the ShuffleWriter implementation based on the concrete type of the dependency's shuffleHandle (see the sketch after this list)
    • the actual write to disk happens in ShuffleWriter.write()
    • each shuffle map task produces one data file plus one index file on disk
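
A minimal, self-contained sketch of that handle-to-writer dispatch (simplified stand-in types, not SortShuffleManager's real getWriter signature):

object WriterDispatchSketch {
  sealed trait Handle
  case object SerializedHandle      extends Handle  // stands in for SerializedShuffleHandle
  case object BypassMergeSortHandle extends Handle  // stands in for BypassMergeSortShuffleHandle
  case object BaseHandle            extends Handle  // stands in for BaseShuffleHandle

  // The registered handle's type decides which writer is instantiated
  def writerFor(handle: Handle): String = handle match {
    case SerializedHandle      => "UnsafeShuffleWriter"
    case BypassMergeSortHandle => "BypassMergeSortShuffleWriter"
    case BaseHandle            => "SortShuffleWriter"
  }
}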

Analysis of the ShuffleWriter types:

  • BypassMergeSortShuffleWriter
    • handle: BypassMergeSortShuffleHandle
    • applies when: no map-side aggregation and at most 200 partitions
    • behavior: writes numPartitions files directly and concatenates them at the end
    • advantage: avoids serializing and deserializing twice to merge spilled files
    • drawback: many files are open at once, so more memory is allocated for buffers;
  • SortShuffleWriter
    • handle: BaseShuffleHandle
    • applies when: neither of the other two applies;
    • behavior: buffers map output in deserialized form
    • features: supports map-side aggregation and sorting
  • UnsafeShuffleWriter
    • handle: SerializedShuffleHandle
    • applies when
      • the serializer supports relocation of serialized objects;
      • there is no map-side aggregation;
      • the number of partitions does not exceed 16777216;
    • behavior: buffers map output in serialized form; supports sorting

