Differences and Usage of reduceByKey, groupByKey, and aggregateByKey
Preface
In an interview I was asked about the differences between reduceByKey, groupByKey, and aggregateByKey. Since I had no related project experience and had never dug into them while self-studying, the question caught me off guard. Afterwards I went through the official documentation, other references, and the source code to understand how they are used. Enough preamble, let's get into it.
From the official documentation
reduceByKey(func, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
groupByKey([numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different from the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
Usage
Preparing the data
import org.apache.spark.rdd.RDD

// create an array of words
val words: Array[String] = Array("one", "two", "two", "three", "three", "three", "111")
// load the data with parallelize (scrdd is the SparkContext)
val wordPairsRdd: RDD[String] = scrdd.parallelize(words)
// use map to turn the data into (K, V) pairs
val line: RDD[(String, Int)] = wordPairsRdd.map(word => (word, 1))
Using each operator
reduceByKey
val word: RDD[(String, Int)] = line.reduceByKey(_ + _)
val word1: RDD[(String, Int)] = line.reduceByKey(_ + _, 2)
word.foreach(println)
word1.foreach(println)
The results of reduceByKey with the default partitioning and with two partitions are shown below.
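For the sample data above, both calls should produce the same per-key sums (the printed order may vary across runs and partitions):

// expected output of word.foreach(println) and word1.foreach(println)
(two,2)
(one,1)
(three,3)
(111,1)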
groupByKey
// the number of partitions is optional; groupByKey() also works
val wordcount: RDD[(String, Iterable[Int])] = line.groupByKey(2)
val count: RDD[(String, Int)] = wordcount.map(t => (t._1, t._2.sum))
The results of wordcount and of count are shown below.
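For the same sample data, the expected output is roughly the following (order may vary):

// wordcount.foreach(println) -- the grouped values for each key
(two,CompactBuffer(1, 1))
(one,CompactBuffer(1))
(three,CompactBuffer(1, 1, 1))
(111,CompactBuffer(1))
// count.foreach(println) -- the per-key sums
(two,2)
(one,1)
(three,3)
(111,1)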
aggregateByKey
// without specifying partitions
val agree1: RDD[(String, Int)] = line.aggregateByKey(0)((x, y) => x + y, (x, y) => x + y)
agree1.foreach(println)
// with two partitions
val agree2: RDD[(String, Int)] = line.aggregateByKey(0, 2)((x, y) => x + y, (x, y) => x + y)
agree2.foreach(println)
With a zero value of 0 and addition for both functions, the aggregateByKey calls above produce the same per-key sums as reduceByKey.
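One detail worth noting (an illustration of my own, not from the original screenshots): the zero value is folded in once per key in each input partition, so a non-zero zero value can change the result depending on how the data is partitioned:

// hypothetical variant: a non-zero zero value makes the per-partition behaviour visible
val skewed: RDD[(String, Int)] = line.aggregateByKey(100, 2)((x, y) => x + y, (x, y) => x + y)
// if the three "three" elements are split across two input partitions, this prints (three,203);
// if they all sit in one partition, it prints (three,103)
skewed.foreach(println)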
Differences between reduceByKey, groupByKey, and aggregateByKey
reduceByKey
Here is the source code:
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
*/
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}
// reduceByKey(func) delegates to the overload that takes an explicit Partitioner
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
On the map side, a local combine is performed for each key first, and only the pre-aggregated results are sent to the reducers.
groupByKey
Here is the source code:
/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD with into `numPartitions` partitions. The ordering of elements within
 * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
 */
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(new HashPartitioner(numPartitions))
}
// groupByKey(numPartitions) delegates to the overload that takes an explicit Partitioner
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
From these comments we can see:
- The values for each key in the RDD are grouped into a single sequence, and the resulting RDD is hash-partitioned into numPartitions partitions. The ordering of elements within each group is not guaranteed and may even differ each time the resulting RDD is evaluated.
- This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
- As currently implemented, groupByKey must be able to hold all the values for any single key in memory. If a key has too many values, this can result in an OutOfMemoryError.
So if we use groupByKey to process large amounts of data, we need to guard against running out of memory (OOM).
aggregateByKey
Before the source code, a note on the signature from the official docs: aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]).
zeroValue is the initial value, seqOp performs the aggregation inside each partition, and combOp merges the per-partition results into a global result. It is worth thinking about how the classic word count would be implemented with this function and how that differs from the reduceByKey version.
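To see why a result type different from the input type matters, here is a small sketch of my own (not from the original post) that computes the average value per key by carrying a (sum, count) pair as the intermediate type:

// intermediate type U = (sum, count) differs from the input value type Int
val sumCount: RDD[(String, (Int, Int))] = line.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into the per-partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))   // combOp: merge accumulators from different partitions
val avg: RDD[(String, Double)] = sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }
avg.foreach(println)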
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
* as in scala.TraversableOnce. The former operation is used for merging values within a
* partition, and the latter is used for merging values between partitions. To avoid memory
* allocation, both of these functions are allowed to modify and return their first argument
* instead of creating a new U.
*/
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
}
// aggregateByKey(zeroValue) delegates to the overload that takes an explicit Partitioner
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp = self.context.clean(seqOp)
  combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
    cleanedSeqOp, combOp, partitioner)
}
As you can see, all three ultimately call the same method, combineByKeyWithClassTag. Here is its source code:
/**
* :: Experimental ::
* Generic function to combine the elements for each key using a custom set of aggregation
* functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
*
* Users provide three functions:
*
* - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
* - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
* - `mergeCombiners`, to combine two C's into a single one.
*
* In addition, users can control the partitioning of the output RDD, and whether to perform
* map-side aggregation (if a mapper can produce multiple items with the same key).
*
* @note V and C can be different -- for example, one might group an RDD of type
* (Int, Int) into an RDD of type (Int, Seq[Int]).
*/
@Experimental
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
Compared with reduceByKey, aggregateByKey looks considerably more complex, and in practice that is indeed how it feels to use.
In the reduceByKey source we can see the call
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
Its createCombiner: V => C is just the identity (v: V) => v, so no aggregation happens when a combiner is created, while mergeValue: (C, V) => C and mergeCombiners: (C, C) => C are the same function, both being the user-supplied func.
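To make the relationship concrete, here is a sketch of my own using the public combineByKey API (not the internal combineByKeyWithClassTag); since CompactBuffer is private to Spark, a ListBuffer stands in for it:

import scala.collection.mutable.ListBuffer

// roughly what reduceByKey(_ + _) passes down: identity createCombiner, the same func for both merges
val reduceLike: RDD[(String, Int)] =
  line.combineByKey((v: Int) => v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2)

// roughly what groupByKey builds (in the real implementation map-side combine is disabled)
val groupLike: RDD[(String, ListBuffer[Int])] = line.combineByKey(
  (v: Int) => ListBuffer(v),                                 // createCombiner
  (buf: ListBuffer[Int], v: Int) => buf += v,                // mergeValue
  (b1: ListBuffer[Int], b2: ListBuffer[Int]) => b1 ++= b2)   // mergeCombiners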
In the aggregateByKey source, by contrast, the two functions are more flexible and you define them yourself: seqOp performs the aggregation inside each partition, while combOp merges the per-partition results into the global result. Think about how the classic word count would be implemented with this function versus with reduceByKey; a sketch comparing the two follows after the source below.
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
}
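As promised above, here is the word count written both ways over the line RDD prepared earlier (a sketch of my own):

// reduceByKey: one function serves as both the in-partition and the cross-partition merge
val viaReduce: RDD[(String, Int)] = line.reduceByKey(_ + _)

// aggregateByKey: an explicit zero value plus separate seqOp and combOp
val viaAggregate: RDD[(String, Int)] = line.aggregateByKey(0)(
  (acc, v) => acc + v,   // seqOp
  (a, b) => a + b)       // combOp

Both produce the same per-key counts; the difference is that aggregateByKey lets the accumulator type and the two merge steps differ, which reduceByKey cannot express.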
That is my current understanding; I have not written up an exhaustive comparison, so feel free to dig deeper on your own.