Spark 1.2.0 Source Code Analysis: How ShuffledRDD Fetches Shuffle Data

After the shuffle write phase has persisted map output to local disk, the data has to be read back for the reduce side. That is the job of ShuffledRDD:

  override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
      .read()     // each reduce task reads exactly one partition
      .asInstanceOf[Iterator[(K, C)]]
  }
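
For context, a ShuffledRDD typically sits behind wide transformations such as reduceByKey. The following standalone sketch (my own example, not part of the Spark source) builds a job whose reduce-side tasks end up calling the compute method shown above:

  import org.apache.spark.{SparkConf, SparkContext}

  object ShuffledRDDDemo {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setMaster("local[2]").setAppName("shuffled-rdd-demo"))
      val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
      // reduceByKey produces a ShuffledRDD carrying a ShuffleDependency with an
      // Aggregator and mapSideCombine = true; each reduce task then invokes
      // ShuffledRDD.compute, which calls getReader(...).read()
      val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
      counts.collect().foreach(println)
      sc.stop()
    }
  }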

As you can see, a reader is obtained through ShuffleManager.getReader. In Spark 1.2.0 there is only one reader implementation, HashShuffleReader (both the hash-based and the sort-based shuffle managers return it). Here is the source:

  override def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C] = {
    // We currently use the same block store shuffle fetcher as the hash-based shuffle.
    new HashShuffleReader(
      handle.asInstanceOf[BaseShuffleHandle[K, _, C]], startPartition, endPartition, context)
  }
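
Which ShuffleManager answers this getReader call is selected by the spark.shuffle.manager setting ("sort" is the default in 1.2.0, "hash" the alternative); either way the read path goes through HashShuffleReader. A minimal configuration sketch, assuming you want to pick the manager explicitly:

  import org.apache.spark.SparkConf

  // In 1.2.0 both managers hand back a HashShuffleReader from getReader,
  // so switching this setting changes the write path, not the read path.
  val conf = new SparkConf()
    .setAppName("shuffle-manager-demo")
    .set("spark.shuffle.manager", "sort")   // or "hash"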

Continuing into its read() method:

  override def read(): Iterator[Product2[K, C]] = {
    val ser = Serializer.getSerializer(dep.serializer)
    // This is where the reducer's input is actually fetched from the map output files
    val iter = BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context, ser)

    val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
      if (dep.mapSideCombine) {
        // The map side already combined values per key, so merge the combiners here
        new InterruptibleIterator(context, dep.aggregator.get.combineCombinersByKey(iter, context))
      } else {
        // No map-side combine was performed, so combine the raw values here
        new InterruptibleIterator(context, dep.aggregator.get.combineValuesByKey(iter, context))
      }
    } else if (dep.aggregator.isEmpty && dep.mapSideCombine) {
      throw new IllegalStateException("Aggregator is empty for map-side combine")
    } else {
      // No aggregator: convert the Product2s to pairs since this is what downstream RDDs currently expect
      iter.asInstanceOf[Iterator[Product2[K, C]]].map(pair => (pair._1, pair._2))
    }

    // Sort the output if there is a sort ordering defined.
    dep.keyOrdering match {
      case Some(keyOrd: Ordering[K]) =>   // a key ordering is defined, so sort the output
        // Create an ExternalSorter to sort the data. Note that if spark.shuffle.spill is disabled,
        // the ExternalSorter won't spill to disk.
        val sorter = new ExternalSorter[K, C, C](ordering = Some(keyOrd), serializer = Some(ser))
        sorter.insertAll(aggregatedIter)
        context.taskMetrics.memoryBytesSpilled += sorter.memoryBytesSpilled
        context.taskMetrics.diskBytesSpilled += sorter.diskBytesSpilled
        sorter.iterator
      case None =>
        aggregatedIter
    }
  }
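
To see which branch of read() a given transformation exercises, here is an illustrative sketch (my own example, not from the Spark source) mapping common pair-RDD operations to the code paths above:

  import org.apache.spark.{SparkConf, SparkContext}

  object ReadPathsDemo {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setMaster("local[2]").setAppName("read-paths-demo"))
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

      // aggregator defined + mapSideCombine = true  -> combineCombinersByKey merges map-side combiners
      val reduced = pairs.reduceByKey(_ + _)

      // aggregator defined + mapSideCombine = false -> combineValuesByKey merges the raw values
      val grouped = pairs.groupByKey()

      // keyOrdering defined -> the ExternalSorter branch sorts the aggregated output
      val sorted = pairs.sortByKey()

      println(reduced.collect().toList)
      println(grouped.collect().toList)
      println(sorted.collect().toList)
      sc.stop()
    }
  }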