Spark中针对键值对类型的RDD做各种操作比较常用的两个方法就是ReduceByKey与GroupByKey方法,下面从源码里面看看ReduceByKey与GroupByKey方法的使用以及内部逻辑。
官方源码解释:三种形式的reduceByKey
总体来说下面三种形式的方法备注大意为:
根据用户传入的函数来对(K,V)中每个K对应的所有values做merge操作(具体的操作类型根据用户定义的函数),在将结果发送给reducer节点前该merge操作首先会在本地Mapper端进行。但是具体到每个方法,根据传入的参数其含义又有所延伸,下面会具体解释:
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce.
* 传入分区器,根据分区器重新分区
*/
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
* 重新设置分区数
*/
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
reduceByKey(new HashPartitioner(numPartitions), func)
}
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
* 使用默认分区器
*/
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
reduceByKey(defaultPartitioner(self), func)
}
接着往下面来看,reduceByKey方法主要执行逻辑在combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)这个方法中,贴出源码:
def combineByKeyWithClassTag[C](
createCombiner: V => C, //把V装进C中
mergeValue: (C, V) => C, //把V整合进入C中
mergeCombiners: (C, C) => C, //整合两个C成为一个
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
//这里可以看到,pairRDD的key类型不能为数组,否则会报错
if (keyClass.isArray) {
if (mapSideCombine) {
throw new SparkException("Cannot use map-side combining with array keys.")
}
//hash分区器不能作用于数组键
if (partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
}
val aggregator = new Aggregator[K, V, C](
self.context.clean(createCombiner),
self.context.clean(mergeValue),
self.context.clean(mergeCombiners))
//判断传入分区器是否相同
if (self.partitioner == Some(partitioner)) {
self.mapPartitions(iter => {
val context = TaskContext.get()
new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
}, preservesPartitioning = true)
} else {
//不相同的话重新返回shufferRDD
new ShuffledRDD[K, V, C](self, partitioner)
.setSerializer(serializer)
.setAggregator(aggregator)
.setMapSideCombine(mapSideCombine)
}
}
官方源码解释:三种形式的groupByKey
三个方法只是传递的参数不同,整体需要实现的功能是相同的,需要对结果的分区进行控制的话可以使用带有分区器参数的方法,需要重新设置分区数量的话可以使用带有分区数参数的方法,使用官方默认设置的话则是用无参数的方法。
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level. The ordering of elements
* within each group is not guaranteed, and may even differ each time the resulting RDD is
* evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
* 默认设置的方法
*/
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
groupByKey(defaultPartitioner(self))
}
/**
* Group the values for each key in the RDD into a single sequence. Allows controlling the
* partitioning of the resulting key-value pair RDD by passing a Partitioner.
* The ordering of elements within each group is not guaranteed, and may even differ
* each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*
* @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
* 带有分区器参数的方法
*/
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with into `numPartitions` partitions. The ordering of elements within
* each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*
* @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
* 带有分区数量参数的方法
*/
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
groupByKey(new HashPartitioner(numPartitions))
}
groupByKey方法主要作用是将键相同的所有的键值对分组到一个集合序列当中,如(a,1),(a,3),(b,1),(c,1),(c,3),分组后结果是((a,1),(a,3)),(b,1),((c,1),(c,3)),分组后的集合中的元素顺序是不确定的,比如键a的值集合也可能是((a,3),(a,1)).
相对而言,groupByKey方法是比较昂贵的操作,意思就是说比较消耗资源。所以如果你的目的是分组后对每一个键所对应的所有值进行求和或者取平均的话,那么请使用PairRDD中的reduceByKey方法或者aggregateByKey方法,这两种方法可以提供更好的性能
groupBykey是把所有的键值对集合都加载到内存中存储计算,所以如果一个键对应的值太多的话,就会导致内存溢出的错误,这是需要重点关注的地方
reduceByKey与groupByKey进行对比
- 返回值类型不同:reduceByKey返回的是RDD[(K, V)],而groupByKey返回的是RDD[(K, Iterable[V])],举例来说这两者的区别。比如含有一下数据的rdd应用上面两个方法做求和:(a,1),(a,2),(a,3),(b,1),(b,2),(c,1);reduceByKey产生的中间结果(a,6),(b,3),(c,1);而groupByKey产生的中间结果结果为((a,1)(a,2)(a,3)),((b,1)(b,2)),(c,1),(以上结果为一个分区中的中间结果)可见groupByKey的结果更加消耗资源
- 作用不同,reduceByKey作用是聚合,异或等,groupByKey作用主要是分组,也可以做聚合(分组之后)
- map端中间结果对键对应的值得聚合方式不同
单词计数说明两种方式的区别
val words = Array("a", "a", "a", "b", "b", "b")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _) //reduceByKey
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum)) //groupByKey
上面两种方法的结果是相同的,但是计算过程却又很大的区别,借用网上的一幅对比图来说明:
- reduceByKey在每个分区移动数据之前,会对每一个分区中的key所对应的values进行求和,然后再利用reduce对所有分区中的每个键对应的值进行再次聚合。整个过程如图:
- groupByKey是把分区中的所有的键值对都进行移动,然后再进行整体求和,这样会导致集群节点之间的开销较大,传输效率较低,也是上文所说的内存溢出错误出现的根本原因