The difference between groupByKey and reduceByKey

1. Both operations go through a shuffle. groupByKey does not combine values before the shuffle; the records are shuffled as-is. reduceByKey performs a map-side combine before the shuffle, which reduces the amount of data sent over shuffle I/O, so it is generally more efficient.
Example:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}

object GroupyKeyAndReduceByKeyDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    val config = new SparkConf().setAppName("GroupyKeyAndReduceByKeyDemo").setMaster("local")
    val sc = new SparkContext(config)
    val arr = Array("val config", "val arr")
    val socketDS = sc.parallelize(arr).flatMap(_.split(" ")).map((_, 1))
    // Difference between groupByKey and reduceByKey:
    // both go through a shuffle, but groupByKey shuffles the records as-is without combining them,
    // while reduceByKey combines before the shuffle, which reduces shuffle I/O and makes it a bit faster.
    socketDS.groupByKey().map(tuple => (tuple._1, tuple._2.sum)).foreach(x => {
      println(x._1 + " " + x._2)
    })
    println("----------------------")
    socketDS.reduceByKey(_ + _).foreach(x => {
      println(x._1 + " " + x._2)
    })
    sc.stop()
  }
}
2. groupByKey has three overloads

Looking at the source, groupByKey() delegates to groupByKey(defaultPartitioner(self)):
/**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
   * within each group is not guaranteed, and may even differ each time the resulting RDD is
   * evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }

Looking at the source, groupByKey(numPartitions: Int) delegates to groupByKey(new HashPartitioner(numPartitions)):

/**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with into `numPartitions` partitions. The ordering of elements within
   * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   *
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
   */
  def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(new HashPartitioner(numPartitions))
  }

Both of the above ultimately call groupByKey(partitioner: Partitioner):

/**
   * Group the values for each key in the RDD into a single sequence. Allows controlling the
   * partitioning of the resulting key-value pair RDD by passing a Partitioner.
   * The ordering of elements within each group is not guaranteed, and may even differ
   * each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   *
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }
And groupByKey(partitioner: Partitioner) is in turn implemented with combineByKeyWithClassTag, so all three groupByKey overloads are ultimately built on combineByKeyWithClassTag; they differ only in how the partitioner is chosen.
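To make the relationship concrete, here is a minimal sketch (assuming a SparkContext named sc; the data and the two-way HashPartitioner are purely illustrative) that reproduces groupByKey's behaviour through the public combineByKey API, passing mapSideCombine = false exactly as groupByKey does internally:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.combineByKey(
  (v: Int) => ArrayBuffer(v),                                // createCombiner: start a buffer for each key
  (buf: ArrayBuffer[Int], v: Int) => buf += v,               // mergeValue: append within a partition
  (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2, // mergeCombiners: concatenate across partitions
  new HashPartitioner(2),
  mapSideCombine = false)                                    // same choice groupByKey makes: no map-side merge
grouped.collect().foreach(println)                           // e.g. (a,ArrayBuffer(1, 3)), (b,ArrayBuffer(2))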


3. reduceByKey also has three overloads


/**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }
/**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
   */
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }
/**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }
Looking at these three reduceByKey overloads, the first two delegate to the last one, and the last one is itself implemented with combineByKeyWithClassTag.
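As a rough sketch (again assuming a SparkContext named sc and an illustrative HashPartitioner), the last overload means that reduceByKey(_ + _) behaves like a combineByKey call whose createCombiner is the identity and whose two merge functions are both the reduce function, with map-side combining left at its default:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val viaReduce  = pairs.reduceByKey(_ + _)
val viaCombine = pairs.combineByKey(
  (v: Int) => v,                 // createCombiner: the first value is the initial combiner
  (c: Int, v: Int) => c + v,     // mergeValue: the reduce function, applied on the map side
  (c1: Int, c2: Int) => c1 + c2, // mergeCombiners: the reduce function, applied after the shuffle
  new HashPartitioner(2))        // mapSideCombine keeps its default value

assert(viaReduce.collect().toMap == viaCombine.collect().toMap)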

### How groupByKey is implemented

combineByKeyWithClassTag[CompactBuffer[V]](createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)

### How reduceByKey is implemented
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)

Comparing the two calls, groupByKey passes mapSideCombine = false, so nothing is merged on the map side, i.e. nothing is combined before the shuffle. reduceByKey does not pass this argument at all.

Does that mean reduceByKey combines by default?

4. Next, let's take a closer look at combineByKeyWithClassTag
/**
   * :: Experimental ::
   * Generic function to combine the elements for each key using a custom set of aggregation
   * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
   *
   * Users provide three functions:
   *
   *  - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
   *  - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
   *  - `mergeCombiners`, to combine two C's into a single one.
   *
   * In addition, users can control the partitioning of the output RDD, and whether to perform
   * map-side aggregation (if a mapper can produce multiple items with the same key).
   *
   * @note V and C can be different -- for example, one might group an RDD of type
   * (Int, Int) into an RDD of type (Int, Seq[Int]).
   */
  @Experimental
  def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }
Looking at combineByKeyWithClassTag's signature, mapSideCombine defaults to true, so reduceByKey does combine on the map side, i.e. before the shuffle. Because part of the data has already been merged, less data has to be spilled and written to disk during the shuffle, which is why reduceByKey is faster.
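Following the advice in the doc comments quoted above, when the goal is a per-key aggregate, reduceByKey or aggregateByKey should be preferred over groupByKey followed by a sum. A minimal sketch (assuming a SparkContext named sc; the word list is illustrative):

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map((_, 1))

// shuffles every (word, 1) pair, then sums on the reduce side
val viaGroup = words.groupByKey().mapValues(_.sum)

// pre-aggregates within each map task, so at most one record per key per partition is shuffled
val viaReduce = words.reduceByKey(_ + _)

// aggregateByKey behaves the same way and additionally allows a result type different from the value type
val viaAggregate = words.aggregateByKey(0)(_ + _, _ + _)

assert(viaGroup.collect().toMap == viaReduce.collect().toMap)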

