Differences and Usage of reduceByKey, groupByKey, and aggregateByKey
Preface
In an interview I was asked about the differences between reduceByKey, groupByKey, and aggregateByKey. Since I had no related project experience and had never dug into them while self-studying, the question caught me off guard. Afterwards I went through the official documentation, other references, and the source code to understand how they are used. Enough preamble, let's get into it.
From the official documentation
reduceByKey(func, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
groupByKey([numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different from the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
Usage
Preparing the data
import org.apache.spark.rdd.RDD

// create an array of words
val words: Array[String] = Array("one", "two", "two", "three", "three", "three", "111")
// load the data with parallelize (scrdd is the SparkContext)
val wordPairsRdd: RDD[String] = scrdd.parallelize(words)
// use map to turn the data into (K, V) pairs
val line: RDD[(String, Int)] = wordPairsRdd.map(word => (word, 1))
Using each operator
reduceByKey
val word: RDD[(String, Int)] = line.reduceByKey(_ + _)
val word1: RDD[(String, Int)] = line.reduceByKey(_ + _, 2)
word.foreach(println)
word1.foreach(println)
The results of reduceByKey with the default partitioning and with two partitions are shown below.
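For the sample data above, both calls should produce the same per-key sums (the printed order may vary across runs and partitions):

// expected output of word.foreach(println) and word1.foreach(println)
(two,2)
(one,1)
(three,3)
(111,1)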
groupByKey
// the number of partitions is optional; groupByKey() also works
val wordcount: RDD[(String, Iterable[Int])] = line.groupByKey(2)
val count: RDD[(String, Int)] = wordcount.map(t => (t._1, t._2.sum))
The results of wordcount and of count are shown below.
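For the same sample data, the expected output is roughly the following (order may vary):

// wordcount.foreach(println) -- the grouped values for each key
(two,CompactBuffer(1, 1))
(one,CompactBuffer(1))
(three,CompactBuffer(1, 1, 1))
(111,CompactBuffer(1))
// count.foreach(println) -- the per-key sums
(two,2)
(one,1)
(three,3)
(111,1)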
aggregateByKey
// without specifying partitions
val agree1: RDD[(String, Int)] = line.aggregateByKey(0)((x, y) => x + y, (x, y) => x + y)
agree1.foreach(println)
// with two partitions
val agree2: RDD[(String, Int)] = line.aggregateByKey(0, 2)((x, y) => x + y, (x, y) => x + y)
agree2.foreach(println)
With a zero value of 0 and addition for both functions, the aggregateByKey calls above produce the same per-key sums as reduceByKey.
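One detail worth noting (an illustration of my own, not from the original screenshots): the zero value is folded in once per key in each input partition, so a non-zero zero value can change the result depending on how the data is partitioned:

// hypothetical variant: a non-zero zero value makes the per-partition behaviour visible
val skewed: RDD[(String, Int)] = line.aggregateByKey(100, 2)((x, y) => x + y, (x, y) => x + y)
// if the three "three" elements are split across two input partitions, this prints (three,203);
// if they all sit in one partition, it prints (three,103)
skewed.foreach(println)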
Differences between reduceByKey, groupByKey, and aggregateByKey
reduceByKey
Here is the source code:
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
*/
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}
// reduceByKey(func) delegates to the overload that takes an explicit Partitioner
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
On the map side, a local combine is performed for each key first, and only the pre-aggregated results are sent to the reducers.
groupByKey
Here is the source code:
/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD with into `numPartitions` partitions. The ordering of elements within
 * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
 */
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(new HashPartitioner(numPartitions))
}
// groupByKey(numPartitions) delegates to the overload that takes an explicit Partitioner
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
From these comments we can see:
- The values for each key in the RDD are grouped into a single sequence, and the resulting RDD is hash-partitioned into numPartitions partitions. The ordering of elements within each group is not guaranteed and may even differ each time the resulting RDD is evaluated.
- This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
- As currently implemented, groupByKey must be able to hold all the values for any single key in memory. If a key has too many values, this can result in an OutOfMemoryError.
So if we use groupByKey to process large amounts of data, we need to guard against running out of memory (OOM).
aggregateByKey
Before the source code, a note on the signature from the official docs: aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]).
zeroValue is the initial value, seqOp performs the aggregation inside each partition, and combOp merges the per-partition results into a global result. It is worth thinking about how the classic word count would be implemented with this function and how that differs from the reduceByKey version.
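To see why a result type different from the input type matters, here is a small sketch of my own (not from the original post) that computes the average value per key by carrying a (sum, count) pair as the intermediate type:

// intermediate type U = (sum, count) differs from the input value type Int
val sumCount: RDD[(String, (Int, Int))] = line.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into the per-partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))   // combOp: merge accumulators from different partitions
val avg: RDD[(String, Double)] = sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }
avg.foreach(println)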
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
* as in scala.TraversableOnce. The former operation is used for merging values within a
* partition, and the latter is used for merging values between partitions. To avoid memory
* allocation, both of these functions are allowed to modify and return their first argument
* instead of creating a new U.
*/
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
}
// aggregateByKey(zeroValue) delegates to the overload that takes an explicit Partitioner
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp = self.context.clean(seqOp)
  combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
    cleanedSeqOp, combOp, partitioner)
}
As you can see, all three ultimately call the same method, combineByKeyWithClassTag. Here is its source code:
/**
* :: Experimental ::
* Generic function to combine the elements for each key using a custom set of aggregation
* functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
*
* Users provide three functions:
*
* - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
* - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
* - `mergeCombiners`, to combine two C's into a single one.
*
* In addition, users can control the partitioning of the output RDD, and whether to perform
* map-side aggregation (if a mapper can produce multiple items with the same key).
*
* @note V and C can be different -- for example, one might group an RDD of type
* (Int, Int) into an RDD of type (Int, Seq[Int]).
*/
@Experimental
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
Compared with reduceByKey, aggregateByKey looks considerably more complex, and in practice that is indeed how it feels to use.
In the reduceByKey source we can see the call
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
Its createCombiner: V => C is just the identity (v: V) => v, so no aggregation happens when a combiner is created, while mergeValue: (C, V) => C and mergeCombiners: (C, C) => C are the same function, both being the user-supplied func.
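To make the relationship concrete, here is a sketch of my own using the public combineByKey API (not the internal combineByKeyWithClassTag); since CompactBuffer is private to Spark, a ListBuffer stands in for it:

import scala.collection.mutable.ListBuffer

// roughly what reduceByKey(_ + _) passes down: identity createCombiner, the same func for both merges
val reduceLike: RDD[(String, Int)] =
  line.combineByKey((v: Int) => v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2)

// roughly what groupByKey builds (in the real implementation map-side combine is disabled)
val groupLike: RDD[(String, ListBuffer[Int])] = line.combineByKey(
  (v: Int) => ListBuffer(v),                                 // createCombiner
  (buf: ListBuffer[Int], v: Int) => buf += v,                // mergeValue
  (b1: ListBuffer[Int], b2: ListBuffer[Int]) => b1 ++= b2)   // mergeCombiners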
In the aggregateByKey source, by contrast, the two functions are more flexible and you define them yourself: seqOp performs the aggregation inside each partition, while combOp merges the per-partition results into the global result. Think about how the classic word count would be implemented with this function versus with reduceByKey; a sketch comparing the two follows after the source below.
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
}
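As promised above, here is the word count written both ways over the line RDD prepared earlier (a sketch of my own):

// reduceByKey: one function serves as both the in-partition and the cross-partition merge
val viaReduce: RDD[(String, Int)] = line.reduceByKey(_ + _)

// aggregateByKey: an explicit zero value plus separate seqOp and combOp
val viaAggregate: RDD[(String, Int)] = line.aggregateByKey(0)(
  (acc, v) => acc + v,   // seqOp
  (a, b) => a + b)       // combOp

Both produce the same per-key counts; the difference is that aggregateByKey lets the accumulator type and the two merge steps differ, which reduceByKey cannot express.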
That is my current understanding; I have not written up an exhaustive comparison, so feel free to dig deeper on your own.