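Preface
This article looks at the source code to explain how four operators work: groupByKey, reduceByKey, foldByKey, and aggregateByKey. All four are member methods of PairRDDFunctions, and all of them ultimately call combineByKeyWithClassTag, which produces a ShuffledRDD (when a shuffle is actually needed). So the first step is to understand combineByKeyWithClassTag.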
combineByKeyWithClassTag
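combineByKeyWithClassTag aggregates records by key and then applies the caller-supplied functions to the values collected for each key. The details are easiest to see in the source code: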
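Before reading the real implementation, it may help to see the semantics modeled with plain Scala collections. The following is only a minimal sketch, not Spark's code: the object `CombineByKeySketch`, its `combineByKey` helper, and the sample `data` are hypothetical names introduced for illustration, each inner `Seq` stands in for one partition, and the model always combines within a partition first (roughly what `mapSideCombine = true` suggests).

```scala
// A plain-Scala model of combine-by-key semantics; no Spark dependency.
// All names here are hypothetical and exist only for illustration.
object CombineByKeySketch {

  // Each inner Seq stands in for one partition of an RDD[(K, V)].
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {

    // Step 1: inside each partition, fold every key's values into a combiner.
    // The first value of a key goes through createCombiner, later ones through mergeValue.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val combined = acc.get(k) match {
          case Some(existing) => mergeValue(existing, v)
          case None           => createCombiner(v)
        }
        acc.updated(k, combined)
      }
    }

    // Step 2: merge the per-partition combiners that share a key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, partial) =>
      partial.foldLeft(acc) { case (merged, (k, c)) =>
        merged.updated(k, merged.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  // Two "partitions" of sample data.
  val data: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "b" -> 2, "a" -> 3),
    Seq("a" -> 4, "b" -> 5)
  )

  // reduceByKey-like: the combiner type equals the value type.
  val sums: Map[String, Int] =
    combineByKey[String, Int, Int](data, v => v, _ + _, _ + _)

  // groupByKey-like: the combiner is a collection of values.
  val groups: Map[String, Vector[Int]] =
    combineByKey[String, Int, Vector[Int]](data, v => Vector(v), _ :+ _, _ ++ _)
}
```

With this sample data, `CombineByKeySketch.sums` is `Map("a" -> 8, "b" -> 7)` and `CombineByKeySketch.groups` is `Map("a" -> Vector(1, 3, 4), "b" -> Vector(2, 5))`. The point of the sketch is that the same three functions can express both a reduction and a grouping; only the combiner type `C` changes.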
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  // Non-core code omitted
  // ...
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.