Spark-Core [How Shuffle Works] - [Source Code]

1. The shuffle process, illustrated:

[Figure: diagram of the shuffle write process]
Explanation: during RDD aggregation the data has to be written to disk; it cannot simply sit in memory waiting for the upstream RDD to finish computing. That is what the process in the figure above shows, why data files land on disk, and why the main direction for optimizing shuffle is to reduce the amount of data written to disk.

2. Now consider another case: the upstream stage has only one CPU and one partition (one task), while the reduce stage has three partitions = three tasks. What should the files written to disk look like?

[Figure: one upstream task writing shuffle files for three downstream reduce tasks]

Explanation: if only one file is written, the three downstream tasks have no way of knowing where each of them should start reading; if one file per downstream task is written, the number of files explodes as tasks multiply and we get a small-files problem. The best approach, which the source code below confirms Spark takes, is to produce one data file plus one index file (a minimal sketch follows), so that every task of the next stage can still fetch all the data for the same key.
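
To make the one-data-file-plus-one-index-file idea concrete, here is a minimal sketch in plain JVM I/O (not Spark's real classes; writeMapOutput, dataPath and indexPath are illustrative names): the map side writes a single data file with records grouped by reduce partition, plus an index file of cumulative byte offsets, so that reducer p only has to read the byte range [offsets(p), offsets(p + 1)).

import java.io.{DataOutputStream, FileOutputStream}

// Sketch only: assumes each reduce partition's records have already been serialized
// into one byte array, and the arrays are in reduce-partition order.
def writeMapOutput(recordsByPartition: Array[Array[Byte]],
                   dataPath: String, indexPath: String): Unit = {
  val data = new FileOutputStream(dataPath)
  val index = new DataOutputStream(new FileOutputStream(indexPath))
  try {
    var offset = 0L
    index.writeLong(offset)                 // leading 0: start of partition 0
    recordsByPartition.foreach { bytes =>
      data.write(bytes)                     // partitions laid out back to back
      offset += bytes.length
      index.writeLong(offset)               // cumulative offset = end of this partition
    }
  } finally {
    data.close()
    index.close()
  }
}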

3. This brings us to a shuffle concept much like MapReduce's: the upstream task/RDD writes its output to disk.

Source-code confirmation:
Read DAGScheduler.scala

val tasks: Seq[Task[_]] = try {
      val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
      stage match {
        case stage: ShuffleMapStage => // pattern match: this is a shuffle stage
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = partitions(id)
            stage.pendingPartitions += id
            // so a new ShuffleMapTask is created
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
          }

Continue with ShuffleMapTask, which extends Task (where run() is defined). Task declares an abstract method runTask() that must be overridden by the subclass ShuffleMapTask,
so in ShuffleMapTask we find the runTask() override, and inside it the file-writing logic.
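
As a small aside, the relationship can be pictured with a tiny template-method sketch (made-up *Sketch classes, not Spark's real code): the parent class fixes the run() skeleton and delegates the actual work to an abstract runTask() that each concrete task overrides.

// Sketch only: how a run() skeleton drives the subclass's runTask().
abstract class TaskSketch[T] {
  final def run(): T = {
    // ... common setup (task context, metrics) would go here ...
    runTask()                  // template method: the subclass does the real work
  }
  protected def runTask(): T   // abstract: ShuffleMapTask / ResultTask override this
}

class ShuffleMapTaskSketch extends TaskSketch[Unit] {
  override protected def runTask(): Unit = {
    // here the shuffle output would be written to disk
  }
}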

override def runTask(context: TaskContext): MapStatus = {
    // Deserialize the RDD using the broadcast variable.
    val threadMXBean = ManagementFactory.getThreadMXBean
    val deserializeStartTimeNs = System.nanoTime()
    val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) 
    ........................
    val rdd = rddAndDep._1
    val dep = rddAndDep._2
    // While we use the old shuffle fetch protocol, we use partitionId as mapId in the
    // ShuffleBlockId construction.
    val mapId = if (SparkEnv.get.conf.get(config.SHUFFLE_USE_OLD_FETCH_PROTOCOL)) {
      partitionId
    } else context.taskAttemptId()
    dep.shuffleWriterProcessor.write(rdd, dep, mapId, context, partition) // the write step
  }

Step into the write method and it turns out to live on an abstract class. Ctrl+H (view the type hierarchy) shows an implementation, SortShuffleWriter, whose write contains:

    sorter.writePartitionedMapOutput(dep.shuffleId, mapId, mapOutputWriter)
    val partitionLengths = mapOutputWriter.commitAllPartitions()

Two methods: writePartitionedMapOutput and commitAllPartitions. Step into commitAllPartitions and, searching around, you find an operation on the local disk.
[Screenshot: the local-disk implementation of the map output writer]

In that method you can see a writeIndexFileAndCommit call, which writes the index file and commits the output:

  public long[] commitAllPartitions() throws IOException {
    if (outputFileChannel != null && outputFileChannel.position() != bytesWrittenToMergedFile) {
      // ... (sanity check elided: fails if the channel position does not match what was written)
    }
    cleanUp();
    File resolvedTmp = outputTempFile != null && outputTempFile.isFile() ? outputTempFile : null;
    // write the index file and commit the data file
    blockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, resolvedTmp);
    return partitionLengths;
  }

Step further into writeIndexFileAndCommit, and both the index file and the data file appear:

def writeIndexFileAndCommit(
      shuffleId: Int,
      mapId: Long,
      lengths: Array[Long],
      dataTmp: File): Unit = {
.....................................
          if (indexFile.exists()) {
            indexFile.delete()
          }
          if (dataFile.exists()) {
            dataFile.delete()
          }
  }
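
Putting the two files together on the read side: the index file is essentially a sequence of numPartitions + 1 longs, so a downstream task can compute its byte range and read only its own slice of the data file. A minimal sketch in plain JVM I/O (readPartition is an illustrative name, not Spark's IndexShuffleBlockResolver API):

import java.io.{DataInputStream, FileInputStream, RandomAccessFile}

// Sketch only: reducer `reduceId` reads bytes [offsets(reduceId), offsets(reduceId + 1)).
def readPartition(dataPath: String, indexPath: String, reduceId: Int): Array[Byte] = {
  val index = new DataInputStream(new FileInputStream(indexPath))
  try {
    index.skipBytes(reduceId * 8)           // each offset is an 8-byte long
    val start = index.readLong()
    val end = index.readLong()
    val buf = new Array[Byte]((end - start).toInt)
    val data = new RandomAccessFile(dataPath, "r")
    try {
      data.seek(start)                      // jump to this reducer's slice
      data.readFully(buf)
      buf
    } finally data.close()
  } finally index.close()
}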

Likewise, how does the downstream RDD/task perform the read? Back to the source.

Back in the same task-creation code in DAGScheduler there is:

val tasks: Seq[Task[_]] = try {
      ...................
      stage match {
        case stage: ResultStage => // pattern match: a result stage creates a ResultTask
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, id, properties, serializedTaskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
              stage.rdd.isBarrier())
          }
      }
    }

ResultTask likewise overrides runTask(), but no reader shows up in it, only a call to rdd.iterator. iterator goes through getOrCompute when a storage level is set, and either way ends up at computeOrReadCheckpoint in RDD.scala:

private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
  {
    if (isCheckpointedAndMaterialized) {
      firstParent[T].iterator(split, context)
    } else {
      compute(split, context)
    }
  }

Since this RDD has not been checkpointed, we go into compute, which is abstract on the RDD base class and has many implementations. Our RDD here is a ShuffledRDD, so in ShuffledRDD.scala we find its compute override; in fact, every RDD type provides its own compute.

override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    val metrics = context.taskMetrics().createTempShuffleReadMetrics()
    SparkEnv.get.shuffleManager.getReader(
      dep.shuffleHandle, split.index, split.index + 1, context, metrics)
      .read()  // the reader uses the shuffle handle to locate the index/data and read this partition
      .asInstanceOf[Iterator[(K, C)]]
  }
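
So yes, every RDD subclass supplies its own compute(): ShuffledRDD pulls data through a shuffle reader, HadoopRDD reads records from its input split, and so on. As a quick illustration, a hand-rolled RDD only needs getPartitions and compute (TinyRangeRDD and RangePartition below are made-up names, a sketch rather than anything in Spark):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Sketch only: a trivial RDD whose compute() just generates a range of Ints.
class RangePartition(override val index: Int) extends Partition

class TinyRangeRDD(sc: SparkContext, numPartitions: Int, perPartition: Int)
  extends RDD[Int](sc, Nil) {                      // Nil: no parent dependencies

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numPartitions)(i => new RangePartition(i))

  // Each RDD type decides what "computing a partition" means; here it is a simple range.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    Iterator.range(p.index * perPartition, (p.index + 1) * perPartition)
  }
}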

4. Hands-on (project practice): the unoptimized HashShuffle from Spark 1.0.

5. The shuffle write path: which Writer is used, and under what conditions [source-code walkthrough]

Source:

ShuffleWriteProcessor.scala
Get the shuffle manager and the writer:

 def write(
      rdd: RDD[_],
      dep: ShuffleDependency[_, _, _],
      mapId: Long,
      context: TaskContext,
      partition: Partition): MapStatus = {
    var writer: ShuffleWriter[Any, Any] = null
    try {
      val manager = SparkEnv.get.shuffleManager
      writer = manager.getWriter[Any, Any](
        dep.shuffleHandle,
        ...................

Click into manager.getWriter.
getWriter pattern-matches on three kinds of shuffle handles:
SerializedShuffleHandle
BypassMergeSortShuffleHandle
BaseShuffleHandle

override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Long,
      context: TaskContext,
      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V] = {
    ...................
    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        new UnsafeShuffleWriter(
          env.blockManager,
          context.taskMemoryManager(),
          unsafeShuffleHandle,
          mapId,
          context,
          env.conf,
          metrics,
          shuffleExecutorComponents)
      case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
        new BypassMergeSortShuffleWriter(
          env.blockManager,
          bypassMergeSortHandle,
          mapId,
          env.conf,
          metrics,
          shuffleExecutorComponents)
      case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
        new SortShuffleWriter(
          shuffleBlockResolver, other, mapId, context, shuffleExecutorComponents)
    }
  }

5.1 SerializedShuffleHandle

Go back to ShuffleWriteProcessor.scala: shuffleHandle is the first argument passed to getWriter.
Click into dep.shuffleHandle:

 val manager = SparkEnv.get.shuffleManager
      writer = manager.getWriter[Any, Any](
        dep.shuffleHandle,
        mapId,
        context,
        createMetricsReporter(context))
      writer.write(
        rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      writer.stop(success = true).get
        ...................

The handle is registered here; click into registerShuffle():

 val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, this)

The registerShuffle method:

else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
      new SerializedShuffleHandle[K, V](
        shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])

canUseSerializedShuffle: can the serialized (unsafe) shuffle be used?

def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
    val shufId = dependency.shuffleId
    val numPartitions = dependency.partitioner.numPartitions
    if (!dependency.serializer.supportsRelocationOfSerializedObjects) { // does the serializer support relocation of serialized objects?
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
        s"${dependency.serializer.getClass.getName}, does not support object relocation")
      false
    } else if (dependency.mapSideCombine) {           // does this shuffle require map-side combine (pre-aggregation)?
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because we need to do " +
        s"map-side aggregation")
      false
    } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) { // are there more than 16777216 (= 16777215 + 1) partitions?
      log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
        s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
      false
    } else {
      log.debug(s"Can use serialized shuffle for shuffle $shufId")
      true
    }
  }

Summary:
1. The serializer must support relocation of serialized objects (e.g. Kryo).
2. Map-side combine must not be required, which rules out pre-aggregating operators such as reduceByKey and aggregateByKey (groupByKey is fine, since it does not combine on the map side).
3. The number of partitions must not exceed 16777216 (2^24). See the combined example under 5.3 below for when this handle is actually chosen.

5.2 shouldBypassMergeSort: the bypass merge-sort writer

SortShuffleManager.scala

 if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
      // need map-side aggregation, then write numPartitions files directly and just concatenate
      // them at the end. This avoids doing serialization and deserialization twice to merge
      // together the spilled files, which would happen with the normal code path. The downside is
      // having multiple files open at a time and thus more memory allocated to buffers.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    }

Click into shouldBypassMergeSort:

  def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
    // We cannot bypass sorting if we need to do map-side aggregation.
    if (dep.mapSideCombine) {
      false
    } else {
      val bypassMergeThreshold: Int = conf.get(config.SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD)
      dep.partitioner.numPartitions <= bypassMergeThreshold
    }
  }

Summary:
1. Map-side combine must not be required (no pre-aggregating operators).
2. The number of partitions is at most 200, the default of spark.shuffle.sort.bypassMergeThreshold (configurable, as shown below).
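
The threshold comes from spark.shuffle.sort.bypassMergeThreshold (default 200), so it can be tuned per job, for example:

import org.apache.spark.SparkConf

// Raise the bypass threshold: shuffles with up to 400 partitions (and no map-side
// combine) can then take the BypassMergeSortShuffleWriter path.
val conf = new SparkConf().set("spark.shuffle.sort.bypassMergeThreshold", "400")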

5.3 SortShuffleWriter: all remaining cases
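
Putting 5.1 through 5.3 together, here is a hedged sketch (default Spark 3.x settings assumed; the app name, master and numbers are arbitrary) of which handle each shuffle below would get when registerShuffle runs:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: which registerShuffle branch each shuffle below would hit.
val conf = new SparkConf()
  .setAppName("shuffle-writer-demo")
  .setMaster("local[2]")
  // Kryo supports relocation of serialized objects (condition for 5.1).
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val pairs = sc.parallelize(1 to 100000).map(i => (i % 1000, i))

// No map-side combine and 100 <= 200 partitions -> BypassMergeSortShuffleHandle (5.2).
val a = pairs.groupByKey(100)

// No map-side combine, Kryo, 1000 partitions (> 200 but far below 16777216)
// -> SerializedShuffleHandle, i.e. UnsafeShuffleWriter (5.1).
val b = pairs.groupByKey(1000)

// reduceByKey sets mapSideCombine = true -> falls through to BaseShuffleHandle,
// i.e. SortShuffleWriter (5.3), regardless of the partition count.
val c = pairs.reduceByKey(_ + _, 1000)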
