Spark Core: Shuffle Internals (Source Code Walkthrough)

weixin_42733117

Last modified 2024-05-31 11:02:16

Category: Spark. Tags: spark, ajax, big data

First published 2024-05-24 15:58:43

Article link: https://blog.csdn.net/weixin_42733117/article/details/139175187

1. The shuffle process, illustrated:

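[Figure: shuffle process diagram]
Explanation: during an RDD aggregation the data has to be written to disk. It cannot stay cached in memory the whole time while waiting for the upstream RDD to finish computing. That is what the figure above shows, and it is why shuffle data files end up on disk. It also points to the main direction for optimizing a shuffle: reduce the amount of data written to disk.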
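
As a rough illustration of "reduce the amount of data written to disk", the sketch below uses plain Scala collections (no Spark) to compare how many records one map task would hand to the shuffle with and without summing values per key first. The object and method names are invented for this example and are not part of Spark.

```scala
// Plain-Scala sketch (no Spark dependency); all names are hypothetical.
object MapSideCombineSketch {
  // Records produced by one map task: (word, 1) pairs.
  val mapOutput: Seq[(String, Int)] =
    Seq("a" -> 1, "b" -> 1, "a" -> 1, "a" -> 1, "b" -> 1, "c" -> 1)

  // Without pre-aggregation every record is written as-is.
  def withoutCombine(records: Seq[(String, Int)]): Seq[(String, Int)] = records

  // With pre-aggregation, values are summed per key before writing,
  // so at most one record per distinct key reaches the disk.
  def withCombine(records: Seq[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    println(s"records without combine: ${withoutCombine(mapOutput).size}")
    println(s"records with combine:    ${withCombine(mapOutput).size}")
  }
}
```

The fewer distinct keys a partition has relative to its record count, the larger the saving from combining before the write.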

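2. Now consider another case: the upstream stage has a single task (one CPU core, one partition), while the reduce stage has three partitions, that is, three tasks. What should the files written to disk look like?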

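Explanation: if only one file is written, the three tasks do not know where each of them should start reading. If three files are written, then as the number of tasks grows this turns into a small-files problem. The best approach is therefore to produce one data file plus one index file, and the source code walkthrough confirms that this is how Spark works.
With this layout, each task in the next stage can still fetch all the data for the keys it is responsible for.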
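
The "one data file plus one index file" idea can be sketched in memory as follows. This is a simplified model, not Spark's implementation: the byte array stands in for the data file, the array of cumulative offsets stands in for the index file, and the names `IndexedOutputSketch`, `write`, and `read` are assumptions made for illustration.

```scala
// In-memory sketch of "one data file + one index file"; names are hypothetical.
object IndexedOutputSketch {
  // Concatenate the per-partition byte segments into a single "data file"
  // and record cumulative offsets as the "index file".
  // offsets(i) is where partition i starts; offsets(i + 1) is where it ends.
  def write(segments: Seq[Array[Byte]]): (Array[Byte], Array[Long]) = {
    val data = Array.concat(segments: _*)
    val offsets = segments.scanLeft(0L)((acc, seg) => acc + seg.length).toArray
    (data, offsets)
  }

  // A reduce task for partition `reduceId` reads only its own slice.
  def read(data: Array[Byte], offsets: Array[Long], reduceId: Int): Array[Byte] =
    data.slice(offsets(reduceId).toInt, offsets(reduceId + 1).toInt)

  def main(args: Array[String]): Unit = {
    val segments = Seq("aa", "bbbb", "c").map(_.getBytes("UTF-8"))
    val (data, offsets) = write(segments)
    println(offsets.mkString(","))
    println(new String(read(data, offsets, 1), "UTF-8"))
  }
}
```

With three reduce partitions the index holds four offsets, so every reader can find its start and end position without scanning the data.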

3. This leads to a shuffle concept somewhat similar to the one in MapReduce: the upstream task/RDD writes its output to disk.

Confirming this in the source:
Read DAGScheduler.scala

val tasks: Seq[Task[_]] = try {
      val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
      stage match {
        case stage: ShuffleMapStage => // pattern match: this is a shuffle map stage
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = partitions(id)
            stage.pendingPartitions += id
            // ... so a ShuffleMapTask is created for each partition
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
          }

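Next, read ShuffleMapTask, which extends Task. Task defines run, which calls the abstract method runTask(); every subclass, including ShuffleMapTask, has to override it.
So ShuffleMapTask contains an override of runTask(), and that override is where the shuffle files get written.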

override def runTask(context: TaskContext): MapStatus = {
    // Deserialize the RDD using the broadcast variable.
    val threadMXBean = ManagementFactory.getThreadMXBean
    val deserializeStartTimeNs = System.nanoTime()
    val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) 
    // ... (elided)
    val rdd = rddAndDep._1
    val dep = rddAndDep._2
    // While we use the old shuffle fetch protocol, we use partitionId as mapId in the
    // ShuffleBlockId construction.
    val mapId = if (SparkEnv.get.conf.get(config.SHUFFLE_USE_OLD_FETCH_PROTOCOL)) {
      partitionId
    } else context.taskAttemptId()
    dep.shuffleWriterProcessor.write(rdd, dep, mapId, context, partition) // the write step
  }

Following the write call leads to ShuffleWriter, which is an abstract class. Its type hierarchy (Ctrl+H in the IDE) includes SortShuffleWriter, which contains:

    sorter.writePartitionedMapOutput(dep.shuffleId, mapId, mapOutputWriter)
    val partitionLengths = mapOutputWriter.commitAllPartitions()

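That is two methods, writePartitionedMapOutput and commitAllPartitions. Stepping into commitAllPartitions and looking at its implementations turns up one that operates on the local disk (LocalDiskShuffleMapOutputWriter).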

Inside that method there is a call to writeIndexFileAndCommit, which writes the index file and commits the map output:

  public long[] commitAllPartitions() throws IOException {
    if (outputFileChannel != null && outputFileChannel.position() != bytesWrittenToMergedFile) {
      // ... (elided; the Spark source throws an IOException here)
    }
    cleanUp();
    File resolvedTmp = outputTempFile != null && outputTempFile.isFile() ? outputTempFile : null;
    // Write the index file and commit the data file.
    blockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, resolvedTmp);
    return partitionLengths;
  }

Stepping into writeIndexFileAndCommit, both indexFile and dataFile show up:

def writeIndexFileAndCommit(
      shuffleId: Int,
      mapId: Long,
      lengths: Array[Long],
      dataTmp: File): Unit = {
          // ... (elided)
          if (indexFile.exists()) {
            indexFile.delete()
          }
          if (dataFile.exists()) {
            dataFile.delete()
          }
  }

Likewise, how does the next RDD/task read that data? Back to the source.

The same task-creation code in DAGScheduler.scala has:

val tasks: Seq[Task[_]] = try {
      stage match {
        // ... (ShuffleMapStage case shown earlier)
        case stage: ResultStage => // pattern match: for a result stage, create a ResultTask
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, id, properties, serializedTaskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
              stage.rdd.isBarrier())
          }
      }
    }

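ResultTask also has a runTask method, but there is no reader in it, only a call to rdd.iterator. In RDD.scala, iterator calls getOrCompute when the storage level is not StorageLevel.NONE; either way the computation ends up in computeOrReadCheckpoint: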

private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
  {
    if (isCheckpointedAndMaterialized) {
      firstParent[T].iterator(split, context)
    } else {
      compute(split, context)
    }
  }

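Because the RDD is not checkpointed, execution goes to compute. This is an abstract method on the abstract class RDD, and it has many implementations. Here the RDD is a ShuffledRDD, so the relevant override is in ShuffledRDD.scala. In other words, each kind of RDD provides its own compute method.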

override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    val metrics = context.taskMetrics().createTempShuffleReadMetrics()
    SparkEnv.get.shuffleManager.getReader(
      dep.shuffleHandle, split.index, split.index + 1, context, metrics)
      .read()  // based on the shuffle handle, look up the index and read the data
      .asInstanceOf[Iterator[(K, C)]]
  }
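
The dispatch from iterator to a type-specific compute can be modeled with a few toy classes. These are deliberately not Spark's classes: `MiniRDD` and its subclasses are hypothetical names, partitions are plain integers, and the "shuffled" variant simply filters the parent's partitions by key instead of reading shuffle files.

```scala
// Toy model of the iterator -> compute dispatch; these are not Spark's classes.
abstract class MiniRDD[T] {
  // Each concrete RDD type decides how one partition is produced.
  def compute(partition: Int): Iterator[T]

  // Shared entry point: callers use iterator, which delegates to compute.
  final def iterator(partition: Int): Iterator[T] = compute(partition)
}

// Produces data directly from an in-memory collection, one Seq per partition.
class MiniParallelRDD[T](parts: Seq[Seq[T]]) extends MiniRDD[T] {
  override def compute(partition: Int): Iterator[T] = parts(partition).iterator
}

// Transforms the parent's partition lazily, element by element.
class MiniMappedRDD[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
  override def compute(partition: Int): Iterator[U] = parent.iterator(partition).map(f)
}

// Stands in for a shuffled RDD: walks every parent partition and keeps the
// keys that belong to this reduce partition. A real shuffle would instead
// read the files written by the map stage.
class MiniShuffledRDD[V](parent: MiniRDD[(Int, V)], numMapParts: Int, numReduceParts: Int)
    extends MiniRDD[(Int, V)] {
  override def compute(partition: Int): Iterator[(Int, V)] =
    (0 until numMapParts).iterator
      .flatMap(p => parent.iterator(p))
      .filter { case (k, _) => Math.floorMod(k, numReduceParts) == partition }
}

object MiniRDDDemo {
  def main(args: Array[String]): Unit = {
    val base = new MiniParallelRDD(Seq(Seq(1 -> "a", 2 -> "b"), Seq(3 -> "c", 4 -> "d")))
    val shuffled = new MiniShuffledRDD(base, 2, 2)
    println(shuffled.iterator(0).toList)
    println(shuffled.iterator(1).toList)
  }
}
```

The caller only ever sees `iterator`; which `compute` runs depends on the concrete class, which is the same shape as RDD.iterator ending up in ShuffledRDD.compute.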

Hands-on: project exercise with the unoptimized HashShuffle of Spark 1.0

5. Shuffle write: which writer applies under which conditions (source code reading)

Source:

ShuffleWriterProcessor.scala
Obtains the shuffle manager and the writer

 def write(
      rdd: RDD[_],
      dep: ShuffleDependency[_, _, _],
      mapId: Long,
      context: TaskContext,
      partition: Partition): MapStatus = {
    var writer: ShuffleWriter[Any, Any] = null
    try {
      val manager = SparkEnv.get.shuffleManager
      writer = manager.getWriter[Any, Any](
        dep.shuffleHandle,
        // ... (elided)

Jump into manager.getWriter
getWriter matches on three kinds of shuffle handle:
SerializedShuffleHandle
BypassMergeSortShuffleHandle
BaseShuffleHandle

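override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Long,
      context: TaskContext,
      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V] = {
    // ... (elided)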
    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        new UnsafeShuffleWriter(
          env.blockManager,
          context.taskMemoryManager(),
          unsafeShuffleHandle,
          mapId,
          context,
          env.conf,
          metrics,
          shuffleExecutorComponents)
      case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
        new BypassMergeSortShuffleWriter(
          env.blockManager,
          bypassMergeSortHandle,
          mapId,
          env.conf,
          metrics,
          shuffleExecutorComponents)
      case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
        new SortShuffleWriter(
          shuffleBlockResolver, other, mapId, context, shuffleExecutorComponents)
    }
  }

5.1 SerializedShuffleHandle

Go back to ShuffleWriterProcessor.scala: shuffleHandle is the first argument passed to getWriter.
Jump into dep.shuffleHandle

 val manager = SparkEnv.get.shuffleManager
      writer = manager.getWriter[Any, Any](
        dep.shuffleHandle,
        mapId,
        context,
        createMetricsReporter(context))
      writer.write(
        rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      writer.stop(success = true).get
        // ... (elided)

The handle is created by registering the shuffle; jump into registerShuffle()

 val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, this)

The registerShuffle method:

else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
      new SerializedShuffleHandle[K, V](
        shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])

canUseSerializedShuffle decides whether the serialized shuffle path can be used:

def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
    val shufId = dependency.shuffleId
    val numPartitions = dependency.partitioner.numPartitions
    if (!dependency.serializer.supportsRelocationOfSerializedObjects) { // serializer must support relocating serialized objects
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
        s"${dependency.serializer.getClass.getName}, does not support object relocation")
      false
    } else if (dependency.mapSideCombine) {           // map-side combine (pre-aggregation) rules it out
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because we need to do " +
        s"map-side aggregation")
      false
    } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) { // more than 16777215 + 1 partitions rules it out
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
        s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
      false
    } else {
      log.debug(s"Can use serialized shuffle for shuffle $shufId")
      true
    }
  }

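Summary: 1. The serializer must support relocation of serialized objects (Kryo does; the default Java serializer does not).
2. Map-side combine (pre-aggregation) must not be in use. Operators such as reduceByKey enable it, whereas groupByKey does not combine on the map side.
3. The number of partitions must be no more than 16777216 (2^24).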

5.2 shouldBypassMergeSort: the bypass-merge-sort writer

SortShuffleManager.scala

 if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
      // need map-side aggregation, then write numPartitions files directly and just concatenate
      // them at the end. This avoids doing serialization and deserialization twice to merge
      // together the spilled files, which would happen with the normal code path. The downside is
      // having multiple files open at a time and thus more memory allocated to buffers.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    }

Jump into shouldBypassMergeSort:

  def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
    // We cannot bypass sorting if we need to do map-side aggregation.
    if (dep.mapSideCombine) {
      false
    } else {
      val bypassMergeThreshold: Int = conf.get(config.SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD)
      dep.partitioner.numPartitions <= bypassMergeThreshold
    }
  }
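
Putting the checks from 5.1 and 5.2 together, the choice of handle can be sketched as one small function. This is a simplified model under stated assumptions, not the Spark code: `Dep`, `Handle`, and `select` are hypothetical stand-ins, the order (bypass first, then serialized, then the base handle) follows the if / else-if chain in registerShuffle, and 200 is used as the threshold because it is the documented default of spark.shuffle.sort.bypassMergeThreshold.

```scala
// Simplified model of the decision order; the types below are hypothetical stand-ins.
object HandleSelectionSketch {
  // The three properties of a shuffle dependency that the checks look at.
  final case class Dep(
      numPartitions: Int,
      mapSideCombine: Boolean,
      serializerSupportsRelocation: Boolean)

  sealed trait Handle
  case object BypassMergeSort extends Handle // served by BypassMergeSortShuffleWriter
  case object Serialized extends Handle      // served by UnsafeShuffleWriter
  case object Base extends Handle            // served by SortShuffleWriter

  // 16777216, the partition limit discussed in section 5.1.
  val MaxSerializedPartitions: Int = 1 << 24

  def shouldBypassMergeSort(dep: Dep, bypassMergeThreshold: Int): Boolean =
    !dep.mapSideCombine && dep.numPartitions <= bypassMergeThreshold

  def canUseSerializedShuffle(dep: Dep): Boolean =
    dep.serializerSupportsRelocation &&
      !dep.mapSideCombine &&
      dep.numPartitions <= MaxSerializedPartitions

  // Bypass is tried first, then the serialized path, then the general fallback.
  def select(dep: Dep, bypassMergeThreshold: Int = 200): Handle =
    if (shouldBypassMergeSort(dep, bypassMergeThreshold)) BypassMergeSort
    else if (canUseSerializedShuffle(dep)) Serialized
    else Base

  def main(args: Array[String]): Unit = {
    println(select(Dep(numPartitions = 100, mapSideCombine = false, serializerSupportsRelocation = false)))
    println(select(Dep(numPartitions = 500, mapSideCombine = false, serializerSupportsRelocation = true)))
    println(select(Dep(numPartitions = 500, mapSideCombine = true, serializerSupportsRelocation = true)))
  }
}
```

Note that any dependency with map-side combine falls through to the base handle regardless of partition count, since both of the earlier checks reject it.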