Spark reduceByKey & groupByKey


groupByKey source code

The previous post walked through the source of reduceByKey. Another common operator is groupByKey; its source is as follows:

/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD into `numPartitions` partitions. The ordering of elements within
 * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
 */
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(new HashPartitioner(numPartitions))
}

The groupByKey method groups all values that share a key into a single collection. For example, grouping (a,1),(a,3),(b,1),(c,1),(c,3) yields (a,[1,3]),(b,[1]),(c,[1,3]). The order of elements inside each group is not guaranteed; a's group could just as well come back as [3,1].
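A minimal sketch of this behavior, assuming a spark-shell session where the SparkContext `sc` already exists (the data here is made up for the example):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 1), ("c", 1), ("c", 3)))

// groupByKey returns RDD[(String, Iterable[Int])]
val grouped = pairs.groupByKey()

// prints e.g. (a,CompactBuffer(1, 3)), (b,CompactBuffer(1)), (c,CompactBuffer(1, 3));
// element order within each group is not guaranteed
grouped.collect().foreach(println)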

By comparison, groupByKey is an expensive operation, meaning it consumes more resources. If your goal after grouping is to sum or average the values for each key, prefer reduceByKey or aggregateByKey, both of which offer much better performance.

groupByKey must hold all the K,V pairs for any single key in memory for computation, so a key with too many values can trigger an OutOfMemoryError (OOM).
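To illustrate that recommendation, here is a sketch of a per-key average using aggregateByKey instead of groupByKey; the dataset and variable names are invented for the example, and `sc` is again assumed to exist:

// The accumulator is (sum, count); the seqOp runs on the map side, so only one
// (sum, count) pair per key per partition is shuffled, not every raw value
val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
val sumCount = scores.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // fold a value into the accumulator
  (x, y) => (x._1 + y._1, x._2 + y._2)     // merge accumulators across partitions
)
val avg = sumCount.mapValues { case (sum, count) => sum / count }
avg.collect().foreach(println)   // e.g. (a,2.0), (b,2.0)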

reduceByKey vs. groupByKey

  1. Different return types: reduceByKey returns RDD[(K,V)], while groupByKey returns RDD[(K,Iterable[V])]. To see the difference, sum the sequence (a,1),(a,2),(a,3),(b,1),(b,2),(c,1) with each operator: reduceByKey yields (a,6),(b,3),(c,1) directly, whereas groupByKey first materializes the intermediate groups (a,[1,2,3]),(b,[1,2]),(c,[1]), which visibly consumes more resources.
  2. Different purposes: reduceByKey is for aggregation (sums, XORs, and so on), while groupByKey is mainly for grouping, optionally followed by an aggregation.
  3. Different map-side treatment of each key's values before the shuffle, as the word-count comparison below shows:
val words = Array("a", "a", "a", "b", "b", "b")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

// reduceByKey: counts are partially combined on the map side before the shuffle
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _)

// groupByKey: every (word, 1) pair is shuffled, then summed per key
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum))

Both methods compute the same result, here the counts (a,3) and (b,3), but the computation differs greatly; the key difference is at which stage the aggregation happens.

  1. reduceByKey aggregates the values for each key inside every partition before any data is moved, then reduces the per-partition results for each key once more across partitions, as shown in the figure below:

    (figure: reduceByKey combines values within each partition before shuffling)

  2. groupByKey moves every K,V pair between partitions and only then performs the overall sum. This incurs heavy traffic across the cluster and poor transfer efficiency, and it is also the root cause of the OutOfMemoryError described above, as shown in the figure below:

    (figure: groupByKey shuffles all K,V pairs before aggregating)
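This map-side difference is visible in the source itself: groupByKey (quoted earlier) explicitly passes mapSideCombine = false to combineByKeyWithClassTag, while reduceByKey leaves map-side combine enabled and reuses its reduce function for both the map-side and reduce-side merges. Condensed from the same PairRDDFunctions file (not standalone code):

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // func serves as both mergeValue and mergeCombiners; mapSideCombine
  // defaults to true, so partial results are produced before the shuffle
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}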
