groupByKey and reduceByKey are two commonly used aggregation functions; both operate on pair RDDs (RDD[(K, V)]).
The reduceByKey function signature:

```scala
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
```
As we can see, it takes two parameters: the first is a Partitioner, and the second is a function that combines two values. When the overload without a partitioner argument is called, the default HashPartitioner is used.
A reduceByKey example:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("jiangtao_demo").setMaster("local")
val sc = new SparkContext(conf)
val data = sc.makeRDD(List("pandas","numpy","pip","pip","pip"))
// map each element to a (word, 1) pair
val dataPair = data.map((_,1))
// reduceByKey sums the values for each key
val result1 = dataPair.reduceByKey(_+_)
// or equivalently:
// val result2 = dataPair.reduceByKey((x,y)=>(x+y))
```
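Building on the sample above, here is a minimal sketch of the two-argument overload with an explicit partitioner; the partition count of 4 is an arbitrary choice for illustration:

```scala
import org.apache.spark.HashPartitioner

// Explicitly control how the shuffled output is partitioned.
val result3 = dataPair.reduceByKey(new HashPartitioner(4), _ + _)
```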
The groupByKey function signature:

```scala
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
```
We can see that groupByKey takes a Partitioner as its input; when called without one, the default HashPartitioner is used. The return type is RDD[(K, Iterable[V])].
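A minimal usage sketch, reusing the dataPair RDD from the reduceByKey example above; note that the values for each key arrive as an Iterable:

```scala
// Group all values for each key; no map-side combine happens here.
val grouped = dataPair.groupByKey()   // RDD[(String, Iterable[Int])]
grouped.collect().foreach { case (word, counts) =>
  println(s"$word -> ${counts.mkString(",")}")   // e.g. pip -> 1,1,1
}
```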
In fact, both reduceByKey and groupByKey are implemented on top of combineByKeyWithClassTag. The key difference is that reduceByKey first combines values within each partition and then shuffles the partial results, whereas groupByKey shuffles every record directly without a map-side combine (note mapSideCombine = false in its source above).
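To make the difference concrete, here is a sketch computing the same word counts both ways, again reusing dataPair; the results are identical, but the groupByKey version shuffles every (word, 1) record while reduceByKey shuffles only one partial sum per key per partition:

```scala
// Map-side combine: partial sums are computed before the shuffle.
val viaReduce = dataPair.reduceByKey(_ + _)

// No map-side combine: all pairs are shuffled, then summed afterwards.
val viaGroup = dataPair.groupByKey().mapValues(_.sum)
```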