Spark Source Code Study Notes (Informal): Is groupByKey() a Wide Dependency?

lzy2014

Article link: https://blog.csdn.net/lzy2014/article/details/72898580

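Recently I have been revisiting the Spark fundamentals I learned earlier, this time from the source code. In the section on RDD Dependency, one question kept coming up: for a given transformation, is the resulting dependency a Narrow Dependency or a Shuffle Dependency?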

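For operations such as map and filter it is easy to see that the dependency is narrow. For more complex or less obvious transformations, such as groupByKey(), it is harder to tell which kind of dependency is produced. Many blog posts state flatly that this transformation is a wide dependency. Is it really?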
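As a quick sanity check of the map/filter claim, here is a minimal sketch that inspects `RDD.dependencies` directly. It assumes Spark is on the classpath and runs in local mode; the object name, method name, app name, and sample data are all made up for illustration.

```scala
import org.apache.spark.{NarrowDependency, SparkConf, SparkContext}

// Hypothetical helper object, not part of Spark.
object NarrowDependencyCheck {
  // Returns true if both map and filter produce only narrow dependencies.
  def allNarrow(sc: SparkContext): Boolean = {
    val nums = sc.parallelize(1 to 10, 2)
    val mapped = nums.map(_ * 2)
    val filtered = mapped.filter(_ % 4 == 0)
    // Each RDD lists the dependencies on its direct parents.
    (mapped.dependencies ++ filtered.dependencies)
      .forall(_.isInstanceOf[NarrowDependency[_]])
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("narrow-dependency-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(s"map/filter are narrow: ${allNarrow(sc)}")
    finally sc.stop()
  }
}
```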

Let's look at the source:

def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}

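It obtains the default partitioner (if this RDD already has a partitioner, that one is reused; otherwise a HashPartitioner is created) and passes it to the other overload of groupByKey, the one that takes a Partitioner: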

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
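To make the roles of createCombiner and mergeValue concrete, here is a small Spark-free sketch of how they fold one partition's records into per-key buffers. It is only an approximation of what the Aggregator does: `ArrayBuffer` stands in for Spark's internal `CompactBuffer`, and the object and method names are hypothetical.

```scala
import scala.collection.mutable

// Hypothetical, Spark-free model of groupByKey's per-key combining.
object GroupByKeyCombinerSketch {
  def groupValues[K, V](records: Iterator[(K, V)]): Map[K, List[V]] = {
    // Same shape as the functions in groupByKey, with ArrayBuffer
    // used in place of CompactBuffer (an assumption for this sketch).
    val createCombiner = (v: V) => mutable.ArrayBuffer(v)
    val mergeValue = (buf: mutable.ArrayBuffer[V], v: V) => buf += v

    val combiners = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[V]]
    records.foreach { case (k, v) =>
      combiners.get(k) match {
        case Some(buf) => mergeValue(buf, v)            // key seen before: append
        case None      => combiners(k) = createCombiner(v) // first value for this key
      }
    }
    combiners.map { case (k, buf) => k -> buf.toList }.toMap
  }

  def main(args: Array[String]): Unit = {
    val grouped = groupValues(Iterator(("a", 1), ("b", 2), ("a", 3)))
    println(grouped)
  }
}
```

Since every value ends up in a buffer anyway, combining on the map side would not shrink the data, which is what the comment in the Spark source is pointing at.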

In the end it calls combineByKeyWithClassTag. Let's look at that function:

def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
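The final if/else in the function above is the part that matters for the question: when the parent RDD's partitioner already equals the requested one, the result is built with mapPartitions and no ShuffledRDD is created; otherwise a ShuffledRDD is returned. A minimal sketch to observe both branches through `RDD.dependencies` follows. It assumes Spark is on the classpath and runs in local mode; the object name, method name, app name, and sample data are hypothetical.

```scala
import org.apache.spark.{HashPartitioner, OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}

// Hypothetical helper object, not part of Spark.
object GroupByKeyDependencyCheck {
  // Returns (shuffleWhenNotPartitioned, narrowWhenAlreadyPartitioned).
  def check(sc: SparkContext): (Boolean, Boolean) = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
    val partitioner = new HashPartitioner(2)

    // pairs has no partitioner, so groupByKey takes the ShuffledRDD branch.
    val shuffled = pairs.groupByKey(partitioner)
    val isShuffle = shuffled.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]

    // After partitionBy, the RDD's partitioner equals the one passed to
    // groupByKey, so the mapPartitions branch is taken instead.
    val prePartitioned = pairs.partitionBy(partitioner)
    val grouped = prePartitioned.groupByKey(partitioner)
    val isNarrow = grouped.dependencies.head.isInstanceOf[OneToOneDependency[_]]

    (isShuffle, isNarrow)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("groupByKey-dependency-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(check(sc))
    finally sc.stop()
  }
}
```

Note that the partitionBy step itself still shuffles; the point of the sketch is only that the groupByKey call on an already-partitioned RDD does not add a second shuffle.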