最近从源码角度温习之前学的Spark的基础,在RDD的Dependency这一节中,关于一些Transition操作是Narrow Dependency还是Shuffle Dependency。
对于map/filter等操作我们能很清晰的知道它是窄依赖,对于一些复杂的或者不是那么明确的转换操作就不太能区分是什么依赖,如groupByKey()。较多博客直接说这个转换操作是宽依赖,真的是宽依赖吗?
我们看看源码:
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
groupByKey(defaultPartitioner(self))
}
取默认分区方式(该RDD如有分区方式则使用该分区方式)作为参数调用另一个带参数的groupByKey:
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
最后调用了函数combineByKeyWithClassTag,看看这个函数:
def combineByKeyWithClassTag[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
if (keyClass.isArray) {
if (mapSideCombine) {
throw new SparkException("Cannot use map-side combining with array keys.")
}
if (partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
}
val aggregator = new Aggregator[K, V, C](
self.context.clean(createCombiner),
self.context.clean(mergeValue),
self.context.clean(mergeCombiners))
if (self.partitioner == Some(partitioner)) {
self.mapPartitions(iter => {
val context = TaskContext.get()
new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
}, preservesPartitioning = true)
} else {
new ShuffledRDD[K, V, C](self, partitioner)
.setSerializer(serializer)
.setAggregator(aggregator)
.setMapSideCombine(mapSideCombine)
}
}
上述函数中可以看到,如果该RDD的分区方式与参数中的分区方式相同,则调用mapPartitions函数,该函数生成MapPartitionsRDD,为窄依赖。分区方式不同,才生成ShuffledRDD,为宽依赖。
groupByKey还有另外两个函数:groupByKey(numPartitions: Int)和groupByKey(partitioner: Partitioner)。这两个函数会有新的分区方式。
因此,groupByKey()应该不一定是宽依赖吧!