Spark Shuffle Flow

This article is based on CDH Spark 2.4.0.cloudera1.

ShuffleManager

ShuffleManager configuration and instantiation

The ShuffleManager is initialized in SparkEnv. It can be configured via the spark.shuffle.manager parameter; the default is SortShuffleManager.
// TODO: difference between sort and tungsten-sort (both short names resolve to SortShuffleManager, as the snippet below shows)

// ShuffleManager initialization on the driver
sc = new SparkContext("local[2, 4]", "test", conf)
SparkContext(config)
SparkContext::createSparkEnv()
SparkEnv::createDriverEnv()
SparkEnv::create()

// ShuffleManager initialization on the executor
CoarseGrainedExecutorBackend::run()
SparkEnv.createExecutorEnv(driverConf, executorId, hostname, cores, cfg.ioEncryptionKey, isLocal = false)
SparkEnv::create()
    // SparkEnv::create()
    // Let the user specify short names for shuffle managers
    val shortShuffleMgrNames = Map(
      "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
      "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
    val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
    val shuffleMgrClass =
      shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)
    val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
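
For illustration, a minimal sketch (not from the Spark source; app name and master are arbitrary) of setting the parameter explicitly, which the resolution logic above then picks up:

  import org.apache.spark.{SparkConf, SparkContext}

  // Minimal sketch: explicitly configuring the shuffle manager.
  // "sort" and "tungsten-sort" both resolve to SortShuffleManager via shortShuffleMgrNames above;
  // a fully qualified class name of a custom ShuffleManager would also be accepted.
  val conf = new SparkConf()
    .setAppName("shuffle-manager-demo")   // hypothetical app name
    .setMaster("local[2]")
    .set("spark.shuffle.manager", "sort")
  val sc = new SparkContext(conf)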

Instantiating the shuffleHandle via the ShuffleManager

  • While building the RDD DAG of dependencies, a ShuffledRDD is created wherever the data requires a shuffle (a small usage example follows the code below)
  • When the ShuffledRDD's dependencies are requested, it creates a ShuffleDependency
  • Inside the ShuffleDependency, a shuffleHandle is obtained from the shuffleManager
  // ShuffledRDD
  override def getDependencies: Seq[Dependency[_]] = {
    val serializer = userSpecifiedSerializer.getOrElse {
      val serializerManager = SparkEnv.get.serializerManager
      if (mapSideCombine) {
        serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
      } else {
        serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
      }
    }
    List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
  }
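
For context, a minimal (hypothetical) job that produces such a dependency: reduceByKey on a pair RDD yields a ShuffledRDD whose single dependency is the ShuffleDependency shown next.

  // Hypothetical example, not taken from the Spark source.
  val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1), numSlices = 4)
  val counts = pairs.reduceByKey(_ + _, numPartitions = 4)   // creates a ShuffledRDD
  // counts.dependencies.head is a ShuffleDependency[String, Int, Int]; constructing it calls
  // SparkEnv.get.shuffleManager.registerShuffle(...) to obtain the shuffleHandle (see below).
  counts.collect()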

// ShuffleDependency 
  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)
  // SortShuffleManager
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
      // need map-side aggregation, then write numPartitions files directly and just concatenate
      // them at the end. This avoids doing serialization and deserialization twice to merge
      // together the spilled files, which would happen with the normal code path. The downside is
      // having multiple files open at a time and thus more memory allocated to buffers.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // Otherwise, buffer map outputs in a deserialized form:
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }

ShuffleHandle selection criteria

  • BypassMergeSortShuffleHandle
    • chosen when the dependency needs no map-side aggregation and its number of partitions is at most bypassMergeThreshold (default = 200); the exact check, shouldBypassMergeSort, is reproduced after this list
    • advantage: the partition files are written directly and simply concatenated at the end, which avoids serializing and deserializing the data a second time to merge spilled files
    • drawback: many files are open at the same time, so more memory is allocated to buffers
  • SerializedShuffleHandle
    • the serializer must support relocation of the objects it serializes; the default serializer does not, UnsafeRowSerializer does
    • as above, map-side aggregation is not allowed, and the number of partitions must not exceed 16777216 (2^24)
  • BaseShuffleHandle
    • the fallback handle
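
The first condition above corresponds to SortShuffleWriter.shouldBypassMergeSort, lightly abridged here from the Spark 2.4 source:

  // SortShuffleWriter companion object (abridged)
  private[spark] object SortShuffleWriter {
    def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
      // We cannot bypass sorting if we need to do map-side aggregation.
      if (dep.mapSideCombine) {
        false
      } else {
        val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
        dep.partitioner.numPartitions <= bypassMergeThreshold
      }
    }
  }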

ShuffleHandle class diagram

ShuffleHandle itself contains essentially no implementation, so there is little to say about it; BypassMergeSortShuffleHandle and SerializedShuffleHandle, for example, differ only in name.
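
In place of the diagram, the hierarchy looks roughly like this (paraphrased from the Spark 2.4 source, constructors abridged):

  abstract class ShuffleHandle(val shuffleId: Int) extends Serializable

  class BaseShuffleHandle[K, V, C](
      shuffleId: Int,
      val numMaps: Int,
      val dependency: ShuffleDependency[K, V, C])
    extends ShuffleHandle(shuffleId)

  // The two subclasses only mark which write path to take; they add no state of their own.
  class BypassMergeSortShuffleHandle[K, V](
      shuffleId: Int, numMaps: Int, dependency: ShuffleDependency[K, V, V])
    extends BaseShuffleHandle(shuffleId, numMaps, dependency)

  class SerializedShuffleHandle[K, V](
      shuffleId: Int, numMaps: Int, dependency: ShuffleDependency[K, V, V])
    extends BaseShuffleHandle(shuffleId, numMaps, dependency)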

How a map task obtains its ShuffleWriter

Executor$TaskRunner :: run()
Task :: run()
ShuffleMapTask :: runTask()
SortShuffleManager :: getWriter()

  /** Get a writer for a given partition. Called on executors by map tasks. */
  override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V] = {
    numMapsForShuffle.putIfAbsent(
      handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
    val env = SparkEnv.get
    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        new UnsafeShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          context.taskMemoryManager(),
          unsafeShuffleHandle,
          mapId,
          context,
          env.conf)
      case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
        new BypassMergeSortShuffleWriter(
          env.blockManager,
          shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
          bypassMergeSortHandle,
          mapId,
          context,
          env.conf)
      case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
        new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
    }
  }

So the shuffleHandle above is really just used to pick the ShuffleWriter (amusingly, BypassMergeSortShuffleWriter and UnsafeShuffleWriter are written in Java...).

ShuffleHandle → ShuffleWriter

  • BypassMergeSortShuffleHandle → BypassMergeSortShuffleWriter
  • SerializedShuffleHandle → UnsafeShuffleWriter
  • BaseShuffleHandle → SortShuffleWriter

BypassMergeSortShuffleWriter

  • blockManager.diskBlockManager().createTempShuffleBlock() creates a temporary file and generates a blockId
  • blockId -> file --> partitionWriters
  • every record is written as a KV pair through the writer of its partition: partitionWriters[partitioner.getPartition(key)].write(key, record._2());
  • every writer is committed and closed: partitionWriterSegments[i] = writer.commitAndGet();
  • writePartitionedFile(tmp) concatenates the per-partition files
  • a MapStatus is produced

Characteristics:

  • this writer does no sorting at all; records are written straight to files
  • map-side aggregation is not possible: without any sorting there is simply no way to combine values
  • it is only suitable when the number of reduce partitions is small; with many reducers the I/O pressure grows. The final merge is cheap (only two files are open at a time), but while iterating over the records numPartitions files are open simultaneously

Detailed code:

  @Override
  public void write(Iterator<Product2<K, V>> records) throws IOException {
    assert (partitionWriters == null);
    if (!records.hasNext()) {
      partitionLengths = new long[numPartitions];
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, null);
      mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
      return;
    }
    final SerializerInstance serInstance = serializer.newInstance();
    final long openStartTime = System.nanoTime();
    partitionWriters = new DiskBlockObjectWriter[numPartitions];
    partitionWriterSegments = new FileSegment[numPartitions];
    // create numPartitions writers, each writing to its own file
    for (int i = 0; i < numPartitions; i++) {
      final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
        blockManager.diskBlockManager().createTempShuffleBlock();
      // the DiskBlockManager creates the directory and the file for this blockId
      final File file = tempShuffleBlockIdPlusFile._2();
      // blockId : "temp_shuffle_" + uuid
      final BlockId blockId = tempShuffleBlockIdPlusFile._1();

      // wrap the file in a DiskBlockObjectWriter, which writes JVM objects to the file; it does not support concurrent writes (only one write at a time)
      partitionWriters[i] =
        blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, and can take a long time in aggregate when we open many files, so should be
    // included in the shuffle write time.
    writeMetrics.incWriteTime(System.nanoTime() - openStartTime);

    while (records.hasNext()) {
      final Product2<K, V> record = records.next();
      final K key = record._1();
      partitionWriters[partitioner.getPartition(key)].write(key, record._2());
    }

    // flush and close every per-partition writer
    for (int i = 0; i < numPartitions; i++) {
      final DiskBlockObjectWriter writer = partitionWriters[i];
      // flush and close the output stream; the segment between the previous committedPosition and the current one becomes a FileSegment
      partitionWriterSegments[i] = writer.commitAndGet();
      writer.close();
    }

    /**
        Build the final output file instance; its name (with reduceId fixed to 0) is:
        "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
        "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId + ".data"
        The local directory holding the file is chosen by hashing this file name.
        1. When running on YARN, the local directories are created at startup
        according to the 'LOCAL_DIRS' configuration.
    **/
    File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
    // a uuid is appended to the output file name to mark it as being written; it is renamed once writing completes
    File tmp = Utils.tempFileWith(output);
    try {
      // concatenate the file of every FileSegment into a single file
      partitionLengths = writePartitionedFile(tmp);
      // index file: "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId + ".index" (while being written, the temp file name carries a uuid suffix)
      // records the offset of every FileSegment, then renames the data file and index file to their final names
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
      }
    }
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
  }
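
The naming described in the comment above comes from the shuffle block-id classes; a small sketch with hypothetical ids:

  import org.apache.spark.storage.{ShuffleBlockId, ShuffleDataBlockId, ShuffleIndexBlockId}

  // Hypothetical shuffleId = 3, mapId = 7; reduceId is fixed to 0 (NOOP_REDUCE_ID) because one
  // map task writes a single data file covering all reduce partitions.
  ShuffleBlockId(3, 7, 0).name       // "shuffle_3_7_0"
  ShuffleDataBlockId(3, 7, 0).name   // "shuffle_3_7_0.data"  -> shuffleBlockResolver.getDataFile(...)
  ShuffleIndexBlockId(3, 7, 0).name  // "shuffle_3_7_0.index" -> written by writeIndexFileAndCommit(...)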
  • DiskBlockObjectWriter's write() method

  private def initialize(): Unit = {
    // a plain Java FileOutputStream, opened in append mode
    fos = new FileOutputStream(file, true)
    // obtain the file channel (it is not used directly here)
    channel = fos.getChannel()
    // wrap fos so that the time spent on every write is recorded
    ts = new TimeTrackingOutputStream(writeMetrics, fos)
    class ManualCloseBufferedOutputStream
      extends BufferedOutputStream(ts, bufferSize) with ManualCloseOutputStream
    // wrap ts to add buffered writes
    mcs = new ManualCloseBufferedOutputStream
  }

  def open(): DiskBlockObjectWriter = {
    // ... (precondition: the stream is not closed and has been initialized)
    // encrypt the data if so configured,
    // and decide from the blockId whether the data in mcs should be compressed with a codec
    bs = serializerManager.wrapStream(blockId, mcs)
    // serializeStream
    objOut = serializerInstance.serializeStream(bs)
    streamOpen = true
    this
  }

  def write(key: Any, value: Any) {
    if (!streamOpen) {
      open()
    }

    // as seen above, the outermost wrapper is the serialization stream; writeKey and writeValue both call writeObject to write the data
    objOut.writeKey(key)
    objOut.writeValue(value)
    // update the write metrics
    recordWritten()
  }

SortShuffleWriter

// Oddly, why does the shouldBypassMergeSort() method above live in the SortShuffleWriter class? What does it have to do with it?

This writer mainly delegates to ExternalSorter to implement the sort-based shuffle.

  /** Write a bunch of records to this task's output */
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    sorter = if (dep.mapSideCombine) {
      new ExternalSorter[K, V, C](
        context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
    } else {
      // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
      // care whether the keys get sorted in each partition; that will be done on the reduce side
      // if the operation being run is sortByKey.
      new ExternalSorter[K, V, V](
        context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
    }
    sorter.insertAll(records)

    // Don't bother including the time to open the merged output file in the shuffle write time,
    // because it just opens a single file, so is typically too fast to measure accurately
    // (see SPARK-3570).
    val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
    val tmp = Utils.tempFileWith(output)
    try {
      val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
      val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
      shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
      mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
      }
    }
  }

ExternalSorter

  def insertAll(records: Iterator[Product2[K, V]]): Unit = {
    // TODO: stop combining if we find that the reduction factor isn't high
    val shouldCombine = aggregator.isDefined

    if (shouldCombine) {
      // Combine values in-memory first using our AppendOnlyMap
      val mergeValue = aggregator.get.mergeValue
      val createCombiner = aggregator.get.createCombiner
      var kv: Product2[K, V] = null
      val update = (hadValue: Boolean, oldValue: C) => {
        if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
      }
      while (records.hasNext) {
        addElementsRead()
        kv = records.next()
        // with map-side aggregation, records are continuously merged into an in-memory map
        map.changeValue((getPartition(kv._1), kv._1), update)
        maybeSpillCollection(usingMap = true)
      }
    } else {
      // Stick values into our buffer
      while (records.hasNext) {
        addElementsRead()
        val kv = records.next()
        // without map-side aggregation, records are simply appended to an in-memory buffer
        buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
        maybeSpillCollection(usingMap = false)
      }
    }
  }

// maybeSpillCollection checks whether the buffer or the map needs to be spilled to disk
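
For completeness, maybeSpillCollection, lightly abridged from the Spark 2.4 source: it estimates the in-memory size of the current collection and, via Spillable.maybeSpill, spills it to disk and starts a fresh collection when more memory cannot be acquired.

  // ExternalSorter (abridged)
  private def maybeSpillCollection(usingMap: Boolean): Unit = {
    var estimatedSize = 0L
    if (usingMap) {
      estimatedSize = map.estimateSize()
      if (maybeSpill(map, estimatedSize)) {
        map = new PartitionedAppendOnlyMap[K, C]   // spilled: start a fresh map
      }
    } else {
      estimatedSize = buffer.estimateSize()
      if (maybeSpill(buffer, estimatedSize)) {
        buffer = new PartitionedPairBuffer[K, C]   // spilled: start a fresh buffer
      }
    }
    if (estimatedSize > _peakMemoryUsedBytes) {
      _peakMemoryUsedBytes = estimatedSize
    }
  }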
