Differences and Usage of reduceByKey, groupByKey, and aggregateByKey

Preface

In an interview I was asked about the differences between reduceByKey, groupByKey, and aggregateByKey. Since I had no real project experience with them and had never looked into them closely while self-studying, the question left me rather embarrassed. Afterwards I went through the official documentation and the source code to understand how they are used. Without further ado, here is what I found.

Reference: the official documentation

  • reduceByKey(func, [numPartitions])
    When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. As with groupByKey, the number of reduce tasks is configurable through an optional second argument.

  • groupByKey([numPartitions])
    When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
    Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
    Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.

  • aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
    When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different from the input value type, while avoiding unnecessary allocations. As with groupByKey, the number of reduce tasks is configurable through an optional second argument.

Usage

Data preparation

// Imports (assuming a SparkContext named sc is already available)
import org.apache.spark.rdd.RDD

// Create an array of words
val words: Array[String] = Array("one", "two", "two", "three", "three", "three", "111")
// Load the data with parallelize
val wordPairsRdd: RDD[String] = sc.parallelize(words)
// Use map to turn the data into (K, V) pairs
val line: RDD[(String, Int)] = wordPairsRdd.map(word => (word, 1))

Using each operator

reduceByKey

// Without specifying a partition count (default parallelism)
val word: RDD[(String, Int)] = line.reduceByKey(_ + _)
// With the optional second argument: two reduce partitions
val word1: RDD[(String, Int)] = line.reduceByKey(_ + _, 2)
word.foreach(println)
word1.foreach(println)

[Figure: output of word (reduceByKey with default partitioning) and word1 (reduceByKey with two partitions)]
groupByKey

// The partition count is optional; groupByKey() with no argument also works
val wordcount: RDD[(String, Iterable[Int])] = line.groupByKey(2)
val count: RDD[(String, Int)] = wordcount.map(t => (t._1, t._2.sum))

[Figure: result of wordcount]
[Figure: result of count]

aggregateByKey

// Without specifying partitions
val agree1: RDD[(String, Int)] = line.aggregateByKey(0)((x, y) => x + y, (x, y) => x + y)
agree1.foreach(println)
// With two partitions
val agree2: RDD[(String, Int)] = line.aggregateByKey(0, 2)((x, y) => x + y, (x, y) => x + y)

Result of the aggregateByKey operator:
[Figure: printed output of agree1]

Differences between reduceByKey, groupByKey, and aggregateByKey

reduceByKey

The source code:

/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
 * parallelism level.
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}

// reduceByKey(func) delegates to reduceByKey(partitioner, func)
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

As the comment says, reduceByKey performs local (map-side) aggregation on each mapper first, and only the merged results are sent to the reducers.
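To make the "combiner" idea concrete, here is a minimal sketch (purely illustrative, not Spark's actual implementation) that imitates map-side combine by hand: it pre-aggregates each partition with mapPartitions before the shuffle, using the line RDD from the data-preparation step. The names preAggregated and wordCounts are made up for this example.

// Illustrative only: reduceByKey already does this internally via its combiner.
// Pre-aggregate the (word, 1) pairs inside each partition so that at most one
// record per key per partition is shuffled, then finish the reduction globally.
val preAggregated: RDD[(String, Int)] = line.mapPartitions { iter =>
  val localCounts = scala.collection.mutable.Map.empty[String, Int]
  iter.foreach { case (word, n) =>
    localCounts(word) = localCounts.getOrElse(word, 0) + n
  }
  localCounts.iterator
}
val wordCounts: RDD[(String, Int)] = preAggregated.reduceByKey(_ + _)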

groupByKey

The source code:

/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD into `numPartitions` partitions. The ordering of elements within
 * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
 */
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(new HashPartitioner(numPartitions))
  }
 // groupByKey(numPartitions) delegates to groupByKey(partitioner)
 def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

From the comments we can see that:

  • groupByKey groups the values for each key in the RDD into a single sequence, and hash-partitions the resulting RDD into numPartitions partitions. The ordering of elements within each group is not guaranteed and may even differ each time the resulting RDD is evaluated.
  • This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
  • As currently implemented, groupByKey must be able to hold all the values for any key in memory. If a key has too many values, it can result in an OutOfMemoryError. So when using groupByKey on large datasets we need to guard against OOM (see the sketch below).
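Here, reusing the line RDD defined earlier (with made-up variable names), the two jobs below compute the same per-key sums but shuffle very different amounts of data:

// Shuffles every (word, 1) record and materializes an Iterable[Int] per key: can OOM on hot keys.
val viaGroup: RDD[(String, Int)] = line.groupByKey().mapValues(_.sum)

// Map-side combined: only one partial sum per key per partition crosses the network.
val viaReduce: RDD[(String, Int)] = line.reduceByKey(_ + _)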

aggregateByKey

From the official docs the signature is aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]): zeroValue is the initial value of the accumulator, seqOp performs the aggregation inside each partition, and combOp merges the per-partition results into the global result. It is worth thinking about how the classic word count would be implemented with this function compared with the reduceByKey implementation. The source code follows the sketch below.
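Here is a minimal sketch of those three pieces in action: it computes a per-key average over the line RDD from the data-preparation step. The accumulator type (Int, Int), a running (sum, count), differs from the input value type Int, which is exactly what aggregateByKey allows; the names sumCount and avgPerKey are only illustrative.

// zeroValue: the (sum, count) accumulator each key starts from in every partition.
// seqOp:  folds one Int value into the per-partition accumulator.
// combOp: merges the accumulators produced by different partitions.
val sumCount: RDD[(String, (Int, Int))] = line.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: runs inside each partition
  (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: merges partition results
)
val avgPerKey: RDD[(String, Double)] = sumCount.mapValues { case (sum, count) =>
  sum.toDouble / count
}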

/**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
 def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
  }
  // aggregateByKey(zeroValue) delegates to aggregateByKey(zeroValue, partitioner)
  
 def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
     combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
   // Serialize the zero value to a byte array so that we can get a new clone of it on each key
   val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
   val zeroArray = new Array[Byte](zeroBuffer.limit)
   zeroBuffer.get(zeroArray)

   lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
   val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

   // We will clean the combiner closure later in `combineByKey`
   val cleanedSeqOp = self.context.clean(seqOp)
   combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
     cleanedSeqOp, combOp, partitioner)
 }

As you can see, all three operators ultimately call the same method:

combineByKeyWithClassTag. Its source code:

 /**
   * :: Experimental ::
   * Generic function to combine the elements for each key using a custom set of aggregation
   * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
   *
   * Users provide three functions:
   *
   *  - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
   *  - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
   *  - `mergeCombiners`, to combine two C's into a single one.
   *
   * In addition, users can control the partitioning of the output RDD, and whether to perform
   * map-side aggregation (if a mapper can produce multiple items with the same key).
   *
   * @note V and C can be different -- for example, one might group an RDD of type
   * (Int, Int) into an RDD of type (Int, Seq[Int]).
   */
  @Experimental
  def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }

Compared with reduceByKey, aggregateByKey looks quite a bit more complex, and in practice that is indeed how it feels to use.

In the reduceByKey source we can see that it calls
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)

  • createCombiner: V => C is the identity (v: V) => v, so no aggregation happens when a combiner is created.
  • mergeValue: (C, V) => C and mergeCombiners: (C, C) => C are the same function: both are the func passed to reduceByKey.
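In other words, reduceByKey(_ + _) behaves like a combineByKey call whose createCombiner is the identity and whose two merge functions are both func. A minimal sketch of that equivalence, written against the public combineByKey (which itself delegates to combineByKeyWithClassTag) and the line RDD from above:

// Equivalent to line.reduceByKey(_ + _):
val viaCombine: RDD[(String, Int)] = line.combineByKey(
  (v: Int) => v,                 // createCombiner: the identity, no aggregation yet
  (c: Int, v: Int) => c + v,     // mergeValue: folds a value into the running sum...
  (c1: Int, c2: Int) => c1 + c2  // mergeCombiners: ...and this merges two partial sums; for reduceByKey both are func
)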

The aggregateByKey source, by contrast, shows that its two functions are flexible: you define them yourself. seqOp performs the aggregation inside each partition, combOp merges the per-partition results into the global result, and neither of them (nor the accumulator type) has to match the input value type. It is worth comparing how the classic word count looks with this function versus the reduceByKey implementation, as the sketch below shows.
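The two calls below return identical results on the line RDD; the only difference is that aggregateByKey lets the in-partition function, the cross-partition function, and the accumulator type all be chosen independently, while reduceByKey forces a single func (the names wc1 and wc2 are illustrative):

// reduceByKey: one function plays both roles (in-partition merge and cross-partition merge).
val wc1: RDD[(String, Int)] = line.reduceByKey(_ + _)

// aggregateByKey: zeroValue 0; here seqOp and combOp happen to be the same function,
// but they could differ, and the accumulator type U could differ from the value type.
val wc2: RDD[(String, Int)] = line.aggregateByKey(0)(_ + _, _ + _)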

That is the extent of my understanding; I have not written up an exhaustive point-by-point comparison, so feel free to dig deeper into the source yourself.
