Spark Core Programming (RDD Transformation Operators): Aggregation Operators

溜三丝耶

Category: Spark. Tags: spark, big data

Article link: https://blog.csdn.net/Sarahdsy/article/details/106592984

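This article covers Spark's aggregation operators in detail: reduceByKey, groupByKey, aggregateByKey, foldByKey, and combineByKey. It highlights how they differ and how they are implemented underneath, with a focus on which operators pre-aggregate on the map side and which do not, and on how to pick the right aggregation for a given business requirement.

(Summary generated automatically by CSDN.)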

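RDD Transformation Operators: Aggregation Operators

Aggregation operators are arguably the core of Spark computation, so it is worth understanding how they are implemented under the hood.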

reduceByKey

Description

Aggregates the values of all records that share the same key.

  /**
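   * Merge the values for each key using an associative and commutative reduce function.
   * This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.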
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    // The first value for a key is passed through unchanged; the same function is used within and across partitions
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

  /**
   * Output will be hash-partitioned with numPartitions partitions.
   */
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }

  /**
   * Output will be hash-partitioned with the existing partitioner or parallelism level.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }

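Example

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReduceByKeyExample {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("partition by")
    val sc = new SparkContext(conf)

    val rdd: RDD[(String, Int)] = sc.makeRDD(
      List(
        ("a", 1), ("b", 2),
        ("c", 3), ("d", 4),
        ("a", 5), ("b", 6)
      )
    )

    // Hash-partition the output using the existing partitioner or parallelism level
    val rdd1: RDD[(String, Int)] = rdd.reduceByKey(_ + _)
    println(rdd1.collect().mkString(", "))

    // Hash-partition the output into numPartitions partitions
    val rdd2: RDD[(String, Int)] = rdd.reduceByKey(_ + _, 2)
    println(rdd2.collect().mkString(", "))

    // Partition the output with an explicitly supplied partitioner
    val rdd3: RDD[(String, Int)] = rdd.reduceByKey(new HashPartitioner(2), _ + _)
    println(rdd3.collect().mkString(", "))

    // Release the local Spark resources once the results are printed
    sc.stop()
  }
}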
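The "combiner" behaviour mentioned in the source comment can be illustrated without Spark. The following is only a plain-Scala sketch: the two-partition layout, the object name `MapSideCombineSketch`, and the helpers `combineLocally` and `mergeAcrossPartitions` are assumptions made for illustration, not Spark APIs. It shows that merging values per key inside each partition first reduces the number of records that would have to cross the shuffle.

```scala
object MapSideCombineSketch {
  // The reduce function, the same one used in the example: addition.
  val sum: (Int, Int) => Int = _ + _

  // Hypothetical layout: the six example records spread over two partitions.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("a", 1), ("b", 2), ("a", 5)),
    Seq(("c", 3), ("d", 4), ("b", 6))
  )

  // Phase 1: merge values per key inside a single partition (the "combiner" step).
  def combineLocally(part: Seq[(String, Int)], f: (Int, Int) => Int): Map[String, Int] =
    part.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(prev => f(prev, v)).getOrElse(v))
    }

  // Phase 2: merge the per-partition results, as the reduce side would after the shuffle.
  def mergeAcrossPartitions(parts: Seq[Map[String, Int]], f: (Int, Int) => Int): Map[String, Int] =
    parts.flatMap(_.toSeq).foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(prev => f(prev, v)).getOrElse(v))
    }

  val locallyCombined: Seq[Map[String, Int]] = partitions.map(p => combineLocally(p, sum))

  // Records that would be shuffled with and without the local merge.
  val shuffledWithCombine: Int = locallyCombined.map(_.size).sum
  val shuffledWithoutCombine: Int = partitions.map(_.size).sum

  val result: Map[String, Int] = mergeAcrossPartitions(locallyCombined, sum)

  def main(args: Array[String]): Unit = {
    println(result.toSeq.sorted.mkString(", ")) // (a,6), (b,8), (c,3), (d,4)
    println(s"shuffled records: $shuffledWithCombine with local combine, $shuffledWithoutCombine without")
  }
}
```

With this assumed layout, the key "a" appears twice in the first partition, so one record is saved before the shuffle. On real data with many repeated keys per partition the saving would be correspondingly larger.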

reduceByKey in turn delegates to:

reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
-- combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
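Since the call chain passes the identity function as the first argument and `func` for both merge steps, the same result should be obtainable through the public `combineByKey` operator. The sketch below assumes a local Spark installation; the object name `ReduceVsCombineSketch` and its two helper methods are illustrative names, not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReduceVsCombineSketch {
  // Sum per key with reduceByKey.
  def sumWithReduceByKey(rdd: RDD[(String, Int)]): Map[String, Int] =
    rdd.reduceByKey(_ + _).collect().toMap

  // The same sum spelled out with combineByKey, mirroring the delegation above.
  def sumWithCombineByKey(rdd: RDD[(String, Int)]): Map[String, Int] =
    rdd.combineByKey[Int](
      (v: Int) => v,                 // createCombiner: the first value is kept as-is
      (c: Int, v: Int) => c + v,     // mergeValue: rule within a partition
      (c1: Int, c2: Int) => c1 + c2  // mergeCombiners: rule across partitions
    ).collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("reduce vs combine")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 5), ("b", 6)))
      // Both calls are expected to yield the same key-to-sum mapping.
      println(sumWithReduceByKey(rdd))
      println(sumWithCombineByKey(rdd))
    } finally {
      sc.stop()
    }
  }
}
```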

Looking at the combineByKeyWithClassTag method:
/**
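   * Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C.
   * 
   * Users provide three functions:
   * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list) // converts the structure of the first value seen for a key
   * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list) // the rule applied within a partition
   * - `mergeCombiners`, to combine two C's into a single one // the rule applied across partitions
   * 
   * In addition, users can control the partitioning of the output RDD, and whether to perform map-side aggregation (if a mapper can produce multiple items with the same key).
   * 
   * Note that V and C can be different -- for example, one might group an RDD of type (Int, Int) into an RDD of type (Int, Seq[Int]).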
   */
def combineByKeyWithClassTag[C](
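              createCombiner: V => C, // converts the structure of the first value seen for a key
              mergeValue: (C, V) => C, // the rule applied within a partition
              mergeCombiners: (C, C) => C, // the rule applied across partitions
              partitioner: Partitioner, // controls the partitioning of the output RDD
              mapSideCombine: Boolean = true, // whether to perform map-side aggregation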
              serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.
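The roles of the three functions described in the doc comment above can be modelled in plain Scala. This is a rough sketch under stated assumptions, not Spark's implementation: `combineByKeyModel` is a hypothetical stand-in in which each inner `Seq` plays the part of one partition, and the partition layout is invented for illustration. It uses a combined type C = (sum, count) that differs from the value type V = Int, and also shows the reduceByKey special case where createCombiner is the identity.

```scala
object CombineByKeyModel {
  // Hypothetical model of the operator over in-memory "partitions".
  def combineByKeyModel[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Within a partition: the first value for a key goes through createCombiner,
    // every later value for that key goes through mergeValue.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
      }
    }
    // Across partitions: combiners for the same key are merged with mergeCombiners.
    perPartition.flatMap(_.toSeq).foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(prev => mergeCombiners(prev, c)).getOrElse(c))
    }
  }

  // Assumed layout of the example records across two partitions.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("a", 1), ("b", 2), ("a", 5)),
    Seq(("c", 3), ("d", 4), ("b", 6))
  )

  // V = Int, C = (sum, count): the value type and the combined type differ.
  val sumAndCount: Map[String, (Int, Int)] = combineByKeyModel[String, Int, (Int, Int)](
    partitions,
    v => (v, 1),
    (c, v) => (c._1 + v, c._2 + 1),
    (c1, c2) => (c1._1 + c2._1, c1._2 + c2._2)
  )

  val averages: Map[String, Double] =
    sumAndCount.map { case (k, (s, n)) => k -> (s.toDouble / n) }

  // The reduceByKey special case: identity createCombiner, same function twice.
  val sums: Map[String, Int] =
    combineByKeyModel[String, Int, Int](partitions, v => v, _ + _, _ + _)

  def main(args: Array[String]): Unit = {
    println(averages.toSeq.sorted.mkString(", ")) // (a,3.0), (b,4.0), (c,3.0), (d,4.0)
    println(sums.toSeq.sorted.mkString(", "))     // (a,6), (b,8), (c,3), (d,4)
  }
}
```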
