Spark Shuffle Flow

This article is based on CDH Spark 2.4.0.cloudera1.

ShuffleManager

ShuffleManager configuration and instantiation

The ShuffleManager is initialized in SparkEnv. It can be configured via the spark.shuffle.manager parameter; the default is SortShuffleManager.
// TODO: difference between sort and tungsten-sort (both short names resolve to SortShuffleManager, as the snippet below shows)

// ShuffleManager initialization on the driver
sc = new SparkContext("local[2, 4]", "test", conf)
SparkContext(config)
SparkContext::createSparkEnv()
SparkEnv::createDriverEnv()
SparkEnv::create()

// ShuffleManager initialization on the executor
CoarseGrainedExecutorBackend::run()
SparkEnv.createExecutorEnv(driverConf, executorId, hostname, cores, cfg.ioEncryptionKey, isLocal = false)
SparkEnv::create()
    // SparkEnv::create()
    // Let the user specify short names for shuffle managers
    val shortShuffleMgrNames = Map(
      "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
      "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
    val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
    val shuffleMgrClass =
      shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)
    val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
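
For illustration, a minimal sketch (not from the Spark source; app name and master are arbitrary) of setting the parameter explicitly, which the resolution logic above then picks up:

  import org.apache.spark.{SparkConf, SparkContext}

  // Minimal sketch: explicitly configuring the shuffle manager.
  // "sort" and "tungsten-sort" both resolve to SortShuffleManager via shortShuffleMgrNames above;
  // a fully qualified class name of a custom ShuffleManager would also be accepted.
  val conf = new SparkConf()
    .setAppName("shuffle-manager-demo")   // hypothetical app name
    .setMaster("local[2]")
    .set("spark.shuffle.manager", "sort")
  val sc = new SparkContext(conf)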

Instantiating the shuffleHandle via the ShuffleManager

  • While building the RDD DAG of dependencies, a ShuffledRDD is created wherever the data requires a shuffle (a small usage example follows the code below)
  • When the ShuffledRDD's dependencies are requested, it creates a ShuffleDependency
  • Inside the ShuffleDependency, a shuffleHandle is obtained from the shuffleManager
  // ShuffledRDD
  override def getDependencies: Seq[Dependency[_]] = {
    val serializer = userSpecifiedSerializer.getOrElse {
      val serializerManager = SparkEnv.get.serializerManager
      if (mapSideCombine) {
        serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
      } else {
        serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
      }
    }
    List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
  }
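
For context, a minimal (hypothetical) job that produces such a dependency: reduceByKey on a pair RDD yields a ShuffledRDD whose single dependency is the ShuffleDependency shown next.

  // Hypothetical example, not taken from the Spark source.
  val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1), numSlices = 4)
  val counts = pairs.reduceByKey(_ + _, numPartitions = 4)   // creates a ShuffledRDD
  // counts.dependencies.head is a ShuffleDependency[String, Int, Int]; constructing it calls
  // SparkEnv.get.shuffleManager.registerShuffle(...) to obtain the shuffleHandle (see below).
  counts.collect()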

// ShuffleDependency 
  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)
  // SortShuffleManager
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
      // need map-side aggregation, then write numPartitions files directly and just concatenate
      // them at the end. This avoids doing serialization and deserialization twice to merge
      // together the spilled files, which would happen with the normal code path. The downside is
      // having multiple files open at a time and thus more memory allocated to buffers.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // Otherwise, buffer map outputs in a deserialized form:
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }

ShuffleHandle selection criteria

  • BypassMergeSortShuffleHandle
    • chosen when the dependency needs no map-side aggregation and its number of partitions is at most bypassMergeThreshold (default = 200); the exact check, shouldBypassMergeSort, is reproduced after this list
    • advantage: the partition files are written directly and simply concatenated at the end, which avoids serializing and deserializing the data a second time to merge spilled files
    • drawback: many files are open at the same time, so more memory is allocated to buffers
  • SerializedShuffleHandle
    • the serializer must support relocation of the objects it serializes; the default serializer does not, UnsafeRowSerializer does
    • as above, map-side aggregation is not allowed, and the number of partitions must not exceed 16777216 (2^24)
  • BaseShuffleHandle
    • the fallback handle
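
The first condition above corresponds to SortShuffleWriter.shouldBypassMergeSort, lightly abridged here from the Spark 2.4 source:

  // SortShuffleWriter companion object (abridged)
  private[spark] object SortShuffleWriter {
    def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
      // We cannot bypass sorting if we need to do map-side aggregation.
      if (dep.mapSideCombine) {
        false
      } else {
        val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
        dep.partitioner.numPartitions <= bypassMergeThreshold
      }
    }
  }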

ShuffleHandle class diagram

ShuffleHandle itself contains essentially no implementation, so there is little to say about it; BypassMergeSortShuffleHandle and SerializedShuffleHandle, for example, differ only in name.
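
In place of the diagram, the hierarchy looks roughly like this (paraphrased from the Spark 2.4 source, constructors abridged):

  abstract class ShuffleHandle(val shuffleId: Int) extends Serializable

  class BaseShuffleHandle[K, V, C](
      shuffleId: Int,
      val numMaps: Int,
      val dependency: ShuffleDependency[K, V, C])
    extends ShuffleHandle(shuffleId)

  // The two subclasses only mark which write path to take; they add no state of their own.
  class BypassMergeSortShuffleHandle[K, V](
      shuffleId: Int, numMaps: Int, dependency: ShuffleDependency[K, V, V])
    extends BaseShuffleHandle(shuffleId, numMaps, dependency)

  class SerializedShuffleHandle[K, V](
      shuffleId: Int, numMaps: Int, dependency: ShuffleDependency[K, V, V])
    extends BaseShuffleHandle(shuffleId, numMaps, dependency)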

How a map task obtains its ShuffleWriter

Executor$TaskRunner :: run()
Task :: run()
ShuffleMapTask :: runTask()
SortShuffleManager :: getWriter()

  /** Get a writer for a given partition. Called on executors by map tasks. */
  override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V] = {
    numMapsForShuffle.putIfAbsent(
      handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
    val env = SparkEnv.get
    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        new UnsafeShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          context.taskMemoryManager(),
          unsafeShuffleHandle,
          mapId,
          context,
          env.conf)
      case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
        new BypassMergeSortShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          bypassMergeSortHandle,
          mapId,
          context,
          env.conf)
      case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
        new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
    }
  }

So the shuffleHandle above is really just used to pick the ShuffleWriter (amusingly, BypassMergeSortShuffleWriter and UnsafeShuffleWriter are written in Java...).

ShuffleHandle → ShuffleWriter

  • BypassMergeSortShuffleHandle → BypassMergeSortShuffleWriter
  • SerializedShuffleHandle → UnsafeShuffleWriter
  • BaseShuffleHandle → SortShuffleWriter

BypassMergeSortShuffleWriter

  • blockManager.diskBlockManager().createTempShuffleBlock() creates a temporary file and generates a blockId
  • blockId -> file --> partitionWriters
  • every record is written as a KV pair through the writer of its partition: partitionWriters[partitioner.getPartition(key)].write(key, record._2());
  • every writer is committed and closed: partitionWriterSegments[i] = writer.commitAndGet();
  • writePartitionedFile(tmp) concatenates the per-partition files
  • a MapStatus is produced

Characteristics:

  • this writer does no sorting at all; records are written straight to files
  • map-side aggregation is not possible: without any sorting there is simply no way to combine values
  • it is only suitable when the number of reduce partitions is small; with many reducers the I/O pressure grows. The final merge is cheap (only two files are open at a time), but while iterating over the records numPartitions files are open simultaneously

Detailed code:

  @Override
  public void write(Iterator<Product2<K, V>> records) throws IOException {
    assert (partitionWriters == null);
    if (!records.hasNext()) {
      partitionLengths = new long[numPartitions];
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, null);
      mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
      return;
    }
    final SerializerInstance serInstance = serializer.newInstance();
    final long openStartTime = System.nanoTime();
    partitionWriters = new DiskBlockObjectWriter[numPartitions];
    partitionWriterSegments = new FileSegment[numPartitions];
    // create numPartitions writers, each writing to its own file
    for (int i = 0; i < numPartitions; i++) {
      final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
        blockManager.diskBlockManager().createTempShuffleBlock();
      // the DiskBlockManager creates the directory and the file for this blockId
      final File file = tempShuffleBlockIdPlusFile._2();
      // blockId : "temp_shuffle_" + uuid
      final BlockId blockId = tempShuffleBlockIdPlusFile._1();

      // wrap the file in a DiskBlockObjectWriter, which writes JVM objects to the file; it does not support concurrent writes (only one write at a time)
      partitionWriters[i] =
        blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, and can take a long time in aggregate when we open many files, so should be
    // included in the shuffle write time.
    writeMetrics.incWriteTime(System.nanoTime() - openStartTime);

    while (records.hasNext()) {
      final Product2<K, V> record = records.next();
      final K key = record._1();
      partitionWriters[partitioner.getPartition(key)].write(key, record._2());
    }

    // flush and close every per-partition writer
    for (int i = 0; i < numPartitions; i++) {
      final DiskBlockObjectWriter writer = partitionWriters[i];
      // flush and close the output stream; the segment between the previous committedPosition and the current one becomes a FileSegment
      partitionWriterSegments[i] = writer.commitAndGet();
      writer.close();
    }

    /**
        Build the final output file instance; its name (with reduceId fixed to 0) is:
        "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
        "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId + ".data"
        The local directory holding the file is chosen by hashing this file name.
        1. When running on YARN, the local directories are created at startup
        according to the 'LOCAL_DIRS' configuration.
    **/
    File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
    // a uuid is appended to the output file name to mark it as being written; it is renamed once writing completes
    File tmp = Utils.tempFileWith(output);
    try {
      // concatenate the file of every FileSegment into a single file
      partitionLengths = writePartitionedFile(tmp);
      // index file: "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId + ".index" (while being written, the temp file name carries a uuid suffix)
      // records the offset of every FileSegment, then renames the data file and index file to their final names
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
      }
    }
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
  }
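
The naming described in the comment above comes from the shuffle block-id classes; a small sketch with hypothetical ids:

  import org.apache.spark.storage.{ShuffleBlockId, ShuffleDataBlockId, ShuffleIndexBlockId}

  // Hypothetical shuffleId = 3, mapId = 7; reduceId is fixed to 0 (NOOP_REDUCE_ID) because one
  // map task writes a single data file covering all reduce partitions.
  ShuffleBlockId(3, 7, 0).name       // "shuffle_3_7_0"
  ShuffleDataBlockId(3, 7, 0).name   // "shuffle_3_7_0.data"  -> shuffleBlockResolver.getDataFile(...)
  ShuffleIndexBlockId(3, 7, 0).name  // "shuffle_3_7_0.index" -> written by writeIndexFileAndCommit(...)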
  • DiskBlockObjectWriter's write() method

  private def initialize(): Unit = {
    // a plain Java FileOutputStream, opened in append mode
    fos = new FileOutputStream(file, true)
    // obtain the file channel (it is not used directly here)
    channel = fos.getChannel()
    // wrap fos so that the time spent on every write is recorded
    ts = new TimeTrackingOutputStream(writeMetrics, fos)
    class ManualCloseBufferedOutputStream
      extends BufferedOutputStream(ts, bufferSize) with ManualCloseOutputStream
    // wrap ts to add buffered writes
    mcs = new ManualCloseBufferedOutputStream
  }

  def open(): DiskBlockObjectWriter = {
    // ... (precondition: the stream is not closed and has been initialized)
    // encrypt the data if so configured,
    // and decide from the blockId whether the data in mcs should be compressed with a codec
    bs = serializerManager.wrapStream(blockId, mcs)
    // serializeStream
    objOut = serializerInstance.serializeStream(bs)
    streamOpen = true
    this
  }

  def write(key: Any, value: Any) {
    if (!streamOpen) {
      open()
    }

    // as seen above, the outermost wrapper is the serialization stream; writeKey and writeValue both call writeObject to write the data
    objOut.writeKey(key)
    objOut.writeValue(value)
    // update the write metrics
    recordWritten()
  }

SortShuffleWriter

// Oddly, why does the shouldBypassMergeSort() method above live in the SortShuffleWriter class? What does it have to do with it?

This writer mainly delegates to ExternalSorter to implement the sort-based shuffle.

  /** Write a bunch of records to this task's output */
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    sorter = if (dep.mapSideCombine) {
      new ExternalSorter[K, V, C](
        context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
    } else {
      // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
      // care whether the keys get sorted in each partition; that will be done on the reduce side
      // if the operation being run is sortByKey.
      new ExternalSorter[K, V, V](
        context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
    }
    sorter.insertAll(records)

    // Don't bother including the time to open the merged output file in the shuffle write time,
    // because it just opens a single file, so is typically too fast to measure accurately
    // (see SPARK-3570).
    val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
    val tmp = Utils.tempFileWith(output)
    try {
      val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
      val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
      shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
      mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
      }
    }
  }

ExternalSorter

  def insertAll(records: Iterator[Product2[K, V]]): Unit = {
    // TODO: stop combining if we find that the reduction factor isn't high
    val shouldCombine = aggregator.isDefined

    if (shouldCombine) {
      // Combine values in-memory first using our AppendOnlyMap
      val mergeValue = aggregator.get.mergeValue
      val createCombiner = aggregator.get.createCombiner
      var kv: Product2[K, V] = null
      val update = (hadValue: Boolean, oldValue: C) => {
        if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
      }
      while (records.hasNext) {
        addElementsRead()
        kv = records.next()
        // with map-side aggregation, records are continuously merged into an in-memory map
        map.changeValue((getPartition(kv._1), kv._1), update)
        maybeSpillCollection(usingMap = true)
      }
    } else {
      // Stick values into our buffer
      while (records.hasNext) {
        addElementsRead()
        val kv = records.next()
        // without map-side aggregation, records are simply appended to an in-memory buffer
        buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
        maybeSpillCollection(usingMap = false)
      }
    }
  }

// maybeSpillCollection checks whether the buffer or the map needs to be spilled to disk
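
For completeness, maybeSpillCollection, lightly abridged from the Spark 2.4 source: it estimates the in-memory size of the current collection and, via Spillable.maybeSpill, spills it to disk and starts a fresh collection when more memory cannot be acquired.

  // ExternalSorter (abridged)
  private def maybeSpillCollection(usingMap: Boolean): Unit = {
    var estimatedSize = 0L
    if (usingMap) {
      estimatedSize = map.estimateSize()
      if (maybeSpill(map, estimatedSize)) {
        map = new PartitionedAppendOnlyMap[K, C]   // spilled: start a fresh map
      }
    } else {
      estimatedSize = buffer.estimateSize()
      if (maybeSpill(buffer, estimatedSize)) {
        buffer = new PartitionedPairBuffer[K, C]   // spilled: start a fresh buffer
      }
    }
    if (estimatedSize > _peakMemoryUsedBytes) {
      _peakMemoryUsedBytes = estimatedSize
    }
  }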
