reduceByKey: merges the values belonging to the same key using the supplied reduce function.
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
*/
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}
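For a concrete feel, here is a minimal word-count sketch using reduceByKey (a sketch only, assuming a live SparkContext named sc, e.g. inside spark-shell; the sample data is invented for illustration):

// Sum the counts for each word; partial sums are combined map-side first.
val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)
counts.collect().foreach(println) // e.g. (a,3), (b,2), (c,1)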
groupByKey: operates on an RDD of (K, V) pairs and groups the values by key, returning an RDD of (K, Iterable[V]).
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level. The ordering of elements
* within each group is not guaranteed, and may even differ each time the resulting RDD is
* evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}
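For comparison, the same aggregation expressed with groupByKey (again a sketch under the same assumed SparkContext sc); note that every value is shuffled before anything is summed:

// Group all values per key, then sum on the reduce side; no map-side combining.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val grouped = pairs.groupByKey() // RDD[(String, Iterable[Int])]
grouped.mapValues(_.sum).collect().foreach(println) // e.g. (a,2), (b,1)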
From the source code we can conclude:
1. reduceByKey performs "the merging locally on each mapper before sending results to a reducer", like a combiner in MapReduce. This greatly shrinks the data before it is shuffled, so less is transferred over the network and the reduce side can compute its results faster.
2. groupByKey collects all values for each key into a sequence (an Iterable). Every key-value pair is shuffled, and the grouping happens on the reduce side, so large amounts of data travel over the network, which is inefficient.
Summary: reduceByKey is the better choice on large datasets.
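The @note above also recommends aggregateByKey for aggregations that need an accumulator, such as a per-key average. Here is a hedged sketch of that pattern (names and sample data are invented, assuming the same SparkContext sc):

// Accumulate (sum, count) per key; both functions also run map-side,
// so only one small accumulator per key per partition is shuffled.
val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
val sumCount = scores.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),  // fold one value into the accumulator
  (x, y) => (x._1 + y._1, x._2 + y._2))  // merge two partition-level accumulators
sumCount.mapValues { case (s, c) => s / c }.collect().foreach(println) // e.g. (a,2.0), (b,2.0)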