reduceByKey: merges the values belonging to the same key using the supplied reduce function.
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
*/
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}
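For a concrete feel, here is a minimal word-count sketch using reduceByKey (a sketch only, assuming a live SparkContext named sc, e.g. inside spark-shell; the sample data is invented for illustration):

// Sum the counts for each word; partial sums are combined map-side first.
val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)
counts.collect().foreach(println) // e.g. (a,3), (b,2), (c,1)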
groupByKey: operates on an RDD of (K, V) pairs and groups the values by key, returning an RDD of (K, Iterable[V]).
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level. The ordering of elements
* within each group is not guaranteed, and may even differ each time the resulting RDD is
* evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}
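For comparison, the same aggregation expressed with groupByKey (again a sketch under the same assumed SparkContext sc); note that every value is shuffled before anything is summed:

// Group all values per key, then sum on the reduce side; no map-side combining.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val grouped = pairs.groupByKey() // RDD[(String, Iterable[Int])]
grouped.mapValues(_.sum).collect().foreach(println) // e.g. (a,2), (b,1)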
From the source code we can conclude:
1. reduceByKey performs "the merging locally on each mapper before sending results to a reducer", like a combiner in MapReduce. This greatly shrinks the data before it is shuffled, so less is transferred over the network and the reduce side can compute its results faster.
2. groupByKey collects all values for each key into a sequence (an Iterable). Every key-value pair is shuffled, and the grouping happens on the reduce side, so large amounts of data travel over the network, which is inefficient.
Summary: reduceByKey is the better choice on large datasets.
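The @note above also recommends aggregateByKey for aggregations that need an accumulator, such as a per-key average. Here is a hedged sketch of that pattern (names and sample data are invented, assuming the same SparkContext sc):

// Accumulate (sum, count) per key; both functions also run map-side,
// so only one small accumulator per key per partition is shuffled.
val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
val sumCount = scores.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),  // fold one value into the accumulator
  (x, y) => (x._1 + y._1, x._2 + y._2))  // merge two partition-level accumulators
sumCount.mapValues { case (s, c) => s / c }.collect().foreach(println) // e.g. (a,2.0), (b,2.0)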