Spark version: 2.4.0
Code location: org.apache.spark.rdd.PairRDDFunctions
groupByKey(): RDD[(K, Iterable[V])]
groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
Both overloads ultimately call combineByKeyWithClassTag.
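For reference, the signature of combineByKeyWithClassTag in this class is roughly as follows (quoted from memory of Spark 2.4.x; consult the source for the exact form, but note the mapSideCombine default, which matters later):
def combineByKeyWithClassTag[C](
  createCombiner: V => C,       // turn the first V seen for a key into a combiner C
  mergeValue: (C, V) => C,      // fold another V into an existing C within a partition
  mergeCombiners: (C, C) => C,  // merge two Cs for the same key across partitions
  partitioner: Partitioner,
  mapSideCombine: Boolean = true,
  serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)]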
Usage example:
import org.apache.spark.rdd.RDD

val source: RDD[(Int, Int)] = sc.parallelize(Seq((1, 1), (1, 2), (2, 2), (2, 3)))
val groupByKeyRDD: RDD[(Int, Iterable[Int])] = source.groupByKey()
// Sum the grouped values for each key; foreach(println) runs on the executors,
// so the output order is not guaranteed
groupByKeyRDD.map(tup => (tup._1, tup._2.sum)).foreach(println)
Output (order may vary):
(1,3)
(2,5)
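Since this example only sums the values per key, the scaladoc advice quoted below applies directly: reduceByKey produces the same result while combining on the map side, so far less data crosses the shuffle. The equivalent:
val reduceByKeyRDD: RDD[(Int, Int)] = source.reduceByKey(_ + _)
reduceByKeyRDD.foreach(println) // also prints (1,3) and (2,5)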
The source code is shown below; methods 1 and 2 ultimately delegate to method 3.
Method 1
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level. The ordering of elements
* within each group is not guaranteed, and may even differ each time the resulting RDD is
* evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
// Takes no arguments; uses the default partitioner and returns RDD[(K, Iterable[V])]
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}
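In Spark 2.4, defaultPartitioner(self) roughly does the following: reuse the largest eligible existing partitioner among the upstream RDDs, otherwise build a HashPartitioner sized by spark.default.parallelism (falling back to the maximum upstream partition count). A quick way to inspect the result, reusing the sample RDD above (the exact numbers depend on your configuration):
val grouped = source.groupByKey()
println(grouped.partitioner)      // e.g. Some(org.apache.spark.HashPartitioner@...)
println(grouped.getNumPartitions) // driven by spark.default.parallelism here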
Method 2
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD into `numPartitions` partitions. The ordering of elements within
* each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*
* @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
*/
// Takes a user-specified partition count (numPartitions: Int)
// and returns RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(new HashPartitioner(numPartitions))
}
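Because this overload just wraps the count in a HashPartitioner, the following two calls behave identically (a small sketch reusing the sample RDD):
import org.apache.spark.HashPartitioner

val byCount: RDD[(Int, Iterable[Int])] = source.groupByKey(4)
val byPartitioner: RDD[(Int, Iterable[Int])] = source.groupByKey(new HashPartitioner(4))
println(byCount.getNumPartitions) // 4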
Method 3. Note the warnings in the scaladoc below: prefer reduceByKey or aggregateByKey, and avoid this operator where possible.
/**
* Group the values for each key in the RDD into a single sequence. Allows controlling the
* partitioning of the resulting key-value pair RDD by passing a Partitioner.
* The ordering of elements within each group is not guaranteed, and may even differ
* each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*
* @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
*/
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  // Note: groupByKey performs no map-side combine, a major difference from the
  // other *ByKey operators built on combineByKeyWithClassTag
  val createCombiner = (v: V) => CompactBuffer(v) // build a one-element buffer from the first value seen for a key
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v // append a value to an existing buffer within the same partition
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2 // merge buffers for the same key across partitions
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false // map-side combine disabled
  )
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
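Method 3 accepts any Partitioner, so you can decide exactly where each key lands. A minimal sketch with a hypothetical partitioner (not part of Spark) that sends even keys to partition 0 and odd keys to partition 1:
import org.apache.spark.Partitioner

// Hypothetical partitioner for illustration only; assumes non-negative Int keys
class EvenOddPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] % 2
}

val custom: RDD[(Int, Iterable[Int])] = source.groupByKey(new EvenOddPartitioner)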
----------------------------------------------------------------------------------------------
private[spark] object CompactBuffer {
  def apply[T: ClassTag](): CompactBuffer[T] = new CompactBuffer[T]
  def apply[T: ClassTag](value: T): CompactBuffer[T] = { // called by groupByKey's createCombiner
    val buf = new CompactBuffer[T]
    buf += value // += returns the buffer itself, so this is the return value
  }
}
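Why CompactBuffer rather than ArrayBuffer? Per its own scaladoc, ArrayBuffer always allocates a backing object array (16 slots by default), which is wasteful when most keys carry only a few values; CompactBuffer keeps its first two elements in plain fields and only allocates an array from the third element on. A stripped-down sketch of that idea (illustrative only; the real class lives in org.apache.spark.util.collection and also implements iteration, ++=, and more):
// Illustrative buffer: two inline fields, array allocated lazily for element 3+
class TinyBuffer[T] {
  private var element0: T = _                     // first element stored inline
  private var element1: T = _                     // second element stored inline
  private var otherElements: Array[AnyRef] = null // allocated only when needed
  private var curSize = 0

  def +=(value: T): this.type = {
    if (curSize == 0) element0 = value
    else if (curSize == 1) element1 = value
    else {
      if (otherElements == null) otherElements = new Array[AnyRef](8)
      else if (curSize - 2 == otherElements.length) {
        // double the spill-over array when full
        val grown = new Array[AnyRef](otherElements.length * 2)
        System.arraycopy(otherElements, 0, grown, 0, otherElements.length)
        otherElements = grown
      }
      otherElements(curSize - 2) = value.asInstanceOf[AnyRef]
    }
    curSize += 1
    this
  }

  def size: Int = curSize
}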