Spark reduceByKey & groupByKey


groupByKey source code

The previous post walked through the source of reduceByKey. Another common operator is groupByKey; its source is as follows:

/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD into `numPartitions` partitions. The ordering of elements within
 * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
 */
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(new HashPartitioner(numPartitions))
}

The groupByKey method groups all values that share a key into a single collection. For example, grouping (a,1),(a,3),(b,1),(c,1),(c,3) yields (a,[1,3]),(b,[1]),(c,[1,3]). The order of elements inside each group is not guaranteed; a's group could just as well come back as [3,1].
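A minimal sketch of this behavior, assuming a spark-shell session where the SparkContext `sc` already exists (the data here is made up for the example):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 1), ("c", 1), ("c", 3)))

// groupByKey returns RDD[(String, Iterable[Int])]
val grouped = pairs.groupByKey()

// prints e.g. (a,CompactBuffer(1, 3)), (b,CompactBuffer(1)), (c,CompactBuffer(1, 3));
// element order within each group is not guaranteed
grouped.collect().foreach(println)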

By comparison, groupByKey is an expensive operation, meaning it consumes more resources. If your goal after grouping is to sum or average the values for each key, prefer reduceByKey or aggregateByKey, both of which offer much better performance.

groupByKey must hold all the K,V pairs for any single key in memory for computation, so a key with too many values can trigger an OutOfMemoryError (OOM).
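To illustrate that recommendation, here is a sketch of a per-key average using aggregateByKey instead of groupByKey; the dataset and variable names are invented for the example, and `sc` is again assumed to exist:

// The accumulator is (sum, count); the seqOp runs on the map side, so only one
// (sum, count) pair per key per partition is shuffled, not every raw value
val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
val sumCount = scores.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // fold a value into the accumulator
  (x, y) => (x._1 + y._1, x._2 + y._2)     // merge accumulators across partitions
)
val avg = sumCount.mapValues { case (sum, count) => sum / count }
avg.collect().foreach(println)   // e.g. (a,2.0), (b,2.0)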

reduceByKey vs. groupByKey

  1. Different return types: reduceByKey returns RDD[(K,V)], while groupByKey returns RDD[(K,Iterable[V])]. To see the difference, sum the sequence (a,1),(a,2),(a,3),(b,1),(b,2),(c,1) with each operator: reduceByKey yields (a,6),(b,3),(c,1) directly, whereas groupByKey first materializes the intermediate groups (a,[1,2,3]),(b,[1,2]),(c,[1]), which visibly consumes more resources.
  2. Different purposes: reduceByKey is for aggregation (sums, XORs, and so on), while groupByKey is mainly for grouping, optionally followed by an aggregation.
  3. Different map-side treatment of each key's values before the shuffle, as the word-count comparison below shows:
val words = Array("a", "a", "a", "b", "b", "b")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

// reduceByKey: counts are partially combined on the map side before the shuffle
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _)

// groupByKey: every (word, 1) pair is shuffled, then summed per key
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum))

Both methods compute the same result, here the counts (a,3) and (b,3), but the computation differs greatly; the key difference is at which stage the aggregation happens.

  1. reduceByKey aggregates the values for each key inside every partition before any data is moved, then reduces the per-partition results for each key once more across partitions, as shown in the figure below:

    (figure: reduceByKey combines values within each partition before shuffling)

  2. groupByKey moves every K,V pair between partitions and only then performs the overall sum. This incurs heavy traffic across the cluster and poor transfer efficiency, and it is also the root cause of the OutOfMemoryError described above, as shown in the figure below:

    (figure: groupByKey shuffles all K,V pairs before aggregating)
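This map-side difference is visible in the source itself: groupByKey (quoted earlier) explicitly passes mapSideCombine = false to combineByKeyWithClassTag, while reduceByKey leaves map-side combine enabled and reuses its reduce function for both the map-side and reduce-side merges. Condensed from the same PairRDDFunctions file (not standalone code):

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // func serves as both mergeValue and mergeCombiners; mapSideCombine
  // defaults to true, so partial results are produced before the shuffle
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}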
