Spark Key-Value Aggregation Operators Explained


放赐~~


Tags: scala spark

Article link: https://blog.csdn.net/qq_34117176/article/details/123906339


1. The combineByKey() operator

All of Spark's key-value aggregation operators are implemented on top of combineByKeyWithClassTag, and combineByKey is the most general of them.

def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
  // combineByKeyWithClassTag holds the actual, most general aggregation logic
  combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
}

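Parameter createCombiner: the initialization function. It turns the first value seen for a key within a partition into the initial combiner of type C.

Parameter mergeValue: the map-side aggregation function. It merges each further value of that key in the same partition into the combiner.

Parameter mergeCombiners: the reduce-side aggregation function. It merges the combiners that different partitions produced for the same key.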

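import org.apache.spark.rdd.RDD

// `sc` is an existing SparkContext; the input is split into 3 partitions.
val inputRDD: RDD[(Int, Char)] = sc.parallelize(Array((1, 'a'), (2, 'b'), (2, 'k'), (3, 'c'), (4, 'd'), (3, 'e'), (3, 'f'), (2, 'g'), (2, 'h')), 3)
val resultRDD: RDD[(Int, String)] = inputRDD.combineByKey(
  (x: Char) => if (x == 'c') s"${x}0" else s"${x}1", // createCombiner: first value of a key in a partition
  (c: String, v: Char) => c + "+" + v,               // mergeValue: later values of that key in the same partition
  (c1: String, c2: String) => c1 + "_" + c2,         // mergeCombiners: combiners from different partitions
  2                                                  // number of result partitions
)
// Print every result together with the index of the partition that holds it.
resultRDD.mapPartitionsWithIndex((pid: Int, iter: Iterator[(Int, String)]) =>
  iter.map((value: (Int, String)) => s"PID: $pid, Value: $value")
).foreach(println)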

PID: 0, Value: (4,d1)
PID: 0, Value: (2,b1+k_g1+h)
PID: 1, Value: (1,a1)
PID: 1, Value: (3,c0+e_f1)

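combineByKey logical processing flow

Note: createCombiner is applied only to the first value of each key within a partition. Every later value of that key in the same partition goes through mergeValue instead.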
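To make the note above concrete, here is a minimal sketch that replays the example on plain Scala collections, with no Spark involved. The helper simulateCombineByKey is hypothetical (it is not a Spark API), and the partition layout is an assumption: it supposes that parallelize(..., 3) cuts the nine records into three consecutive slices of three, which is consistent with the output shown above. Real Spark does not guarantee the order in which combiners from different partitions are merged; the sketch merges them in partition order.

```scala
object CombineByKeySketch {
  // Hypothetical helper (not a Spark API): mimics combineByKey on data that is
  // already split into partitions, using plain Scala collections.
  def simulateCombineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Map side: within each partition, the first value of a key goes through
    // createCombiner; every later value of that key goes through mergeValue.
    val mapSide: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case Some(c) => acc.updated(k, mergeValue(c, v))
          case None    => acc.updated(k, createCombiner(v))
        }
      }
    }
    // Reduce side: combiners of the same key coming from different partitions
    // are merged with mergeCombiners (here simply in partition order).
    mapSide.foldLeft(Map.empty[K, C]) { (acc, partResult) =>
      partResult.foldLeft(acc) { case (m, (k, c)) =>
        m.get(k) match {
          case Some(prev) => m.updated(k, mergeCombiners(prev, c))
          case None       => m.updated(k, c)
        }
      }
    }
  }

  // Assumed partition layout of the example input (three slices of three).
  val partitions: Seq[Seq[(Int, Char)]] = Seq(
    Seq((1, 'a'), (2, 'b'), (2, 'k')),
    Seq((3, 'c'), (4, 'd'), (3, 'e')),
    Seq((3, 'f'), (2, 'g'), (2, 'h')))

  // Same three functions as in the Spark example.
  val result: Map[Int, String] = simulateCombineByKey[Int, Char, String](
    partitions,
    x => if (x == 'c') s"${x}0" else s"${x}1",
    (c, v) => s"${c}+${v}",
    (c1, c2) => s"${c1}_${c2}")

  def main(args: Array[String]): Unit =
    result.toSeq.sortBy(_._1).foreach(println)
}
```

Key 3 shows the rule most clearly: 'c' and 'f' are each the first value for key 3 in their partition, so both pass through createCombiner, while 'e' is merged with mergeValue.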

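2. The aggregateByKey() operator

In generality it is second only to combineByKey. It likewise has a combine-side (map-side) aggregation function, seqOp, and a reduce-side aggregation function, combOp. The main difference from combineByKey is that aggregateByKey takes its initial value directly (zeroValue), whereas combineByKey derives it from an initialization function, which makes combineByKey the more flexible of the two.

Note that aggregateByKey is declared with curried parameter lists (zeroValue sits in the first parameter list, the two functions in the second), but there is no need to actually apply it in curried fashion.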

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

    // We will clean the combiner closure later in `combineByKey`
    val cleanedSeqOp = self.context.clean(seqOp)
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, partitioner)
  }

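Parameter zeroValue: the default (initial) value; its type is the result type U.

Parameter seqOp: the combine-side (map-side) aggregation function.

Parameter combOp: the reduce-side aggregation function.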

val inputRDD: RDD[(Int, String)] = sc.parallelize(Array[(Int, String)]((1, "a"), (2, "b"), (3, "c"), (4, "d"), (2, "e"), (3, "f"), (2, "g"), (1, "h"), (2, "i")), 3)
val resultRDD: RDD[(Int, String)] = inputRDD.aggregateByKey("x", 2)(_ + "_" + _, _ + "@" + _)
resultRDD.foreach(println)

(4,x_d)
(2,x_b@x_e@x_g_i)
(1,x_a@x_h)
(3,x_c@x_f)

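aggregateByKey logical processing flow

Note that applying the default value is itself a seqOp call made in the map stage. As with combineByKey, it happens only for the first value of each key in each partition.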
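The relationship between aggregateByKey and combineByKey can be sketched on plain Scala collections as follows. Both combine and this aggregateByKey are hypothetical helpers written for illustration, not Spark APIs, and the partition layout again assumes three consecutive slices of three records. Unlike the real implementation, the sketch does not clone zeroValue through the serializer for each key, which makes no difference for an immutable String.

```scala
object AggregateByKeySketch {
  // Hypothetical helper (not a Spark API): per-partition combine followed by a
  // cross-partition merge, on plain Scala collections.
  def combine[K, V, C](parts: Seq[Seq[(K, V)]])(
      create: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    val perPartition: Seq[Map[K, C]] = parts.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(create(v)))
      }
    }
    perPartition.flatten.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(prev => mergeCombiners(prev, c)).getOrElse(c))
    }
  }

  // aggregateByKey expressed through combine: the "createCombiner" is just
  // seqOp(zeroValue, firstValue), exactly the shape used in the source above.
  def aggregateByKey[K, V, U](parts: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] =
    combine[K, V, U](parts)(v => seqOp(zeroValue, v), seqOp, combOp)

  // Assumed partition layout of the example input (three slices of three).
  val partitions: Seq[Seq[(Int, String)]] = Seq(
    Seq((1, "a"), (2, "b"), (3, "c")),
    Seq((4, "d"), (2, "e"), (3, "f")),
    Seq((2, "g"), (1, "h"), (2, "i")))

  val result: Map[Int, String] =
    aggregateByKey(partitions, "x")(_ + "_" + _, _ + "@" + _)

  def main(args: Array[String]): Unit =
    result.toSeq.sortBy(_._1).foreach(println)
}
```

For key 2 the third partition holds both "g" and "i": zeroValue is prepended only once there ("x_g_i"), whereas the first two partitions each contribute their own "x_" prefix.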

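3. The foldByKey() operator

This is also an aggregation operator built on combineByKey; in generality it sits above reduceByKey and below aggregateByKey. It accepts an initial value (a fixed value as in aggregateByKey, not a function), but the combine and reduce stages share a single aggregation function, as in reduceByKey.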

def foldByKey(
    zeroValue: V,
    partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)

  // When deserializing, use a lazy val to create just one instance of the serializer per task
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))

  val cleanedFunc = self.context.clean(func)
  combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
    cleanedFunc, cleanedFunc, partitioner)
}

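Parameter zeroValue: the default (initial) value; its type is the result type, which here equals the value type V.

Parameter func: the aggregation function used on both the combine side and the reduce side.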

val inputRDD: RDD[(Int, String)] = sc.parallelize(Array[(Int, String)]((1, "a"), (2, "b"), (3, "c"), (4, "d"), (2, "e"), (3, "f"), (2, "g"), (1, "h"), (2, "i")), 3)
val resultRDD: RDD[(Int, String)] = inputRDD.foldByKey("x")(_ + "_" + _)
resultRDD.foreach(println)

(3,x_c_x_f)
(4,x_d)
(1,x_a_x_h)
(2,x_b_x_e_x_g_i)

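foldByKey logical processing flow

The processing logic is essentially the same as in aggregateByKey: zeroValue goes through one combine step, but only together with the first value of each key in each partition. The reduce side then performs one global aggregation with the same function.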
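One consequence of this behavior is that zeroValue enters the result once per key per partition, so a zeroValue that is not neutral for func makes the result depend on how the data is partitioned. The sketch below illustrates this on plain Scala collections; the foldByKey helper here is hypothetical (not the Spark method), and the integer data in the second part is made up purely for illustration.

```scala
object FoldByKeySketch {
  // Hypothetical helper (not a Spark API) mimicking foldByKey on plain collections:
  // func(zeroValue, v) for the first value of a key in a partition, func for the
  // remaining values, and func again to merge the per-partition results.
  def foldByKey[K, V](parts: Seq[Seq[(K, V)]], zeroValue: V)(func: (V, V) => V): Map[K, V] = {
    val perPartition: Seq[Map[K, V]] = parts.map { part =>
      part.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
        acc.updated(k, func(acc.getOrElse(k, zeroValue), v))
      }
    }
    perPartition.flatten.foldLeft(Map.empty[K, V]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(prev => func(prev, c)).getOrElse(c))
    }
  }

  // The article's example, assuming three consecutive slices of three records.
  val strings: Map[Int, String] = foldByKey(Seq(
    Seq((1, "a"), (2, "b"), (3, "c")),
    Seq((4, "d"), (2, "e"), (3, "f")),
    Seq((2, "g"), (1, "h"), (2, "i"))), "x")(_ + "_" + _)

  // Made-up data: the same three records, laid out in one or in three partitions.
  // With zeroValue = 10 the sum changes with the layout; with 0 it would not.
  val onePartition: Map[Int, Int] =
    foldByKey(Seq(Seq((1, 1), (1, 2), (1, 3))), 10)(_ + _)
  val threePartitions: Map[Int, Int] =
    foldByKey(Seq(Seq((1, 1)), Seq((1, 2)), Seq((1, 3))), 10)(_ + _)

  def main(args: Array[String]): Unit = {
    strings.toSeq.sortBy(_._1).foreach(println)
    println(s"one partition: $onePartition, three partitions: $threePartitions")
  }
}
```

In practice this is why zeroValue is normally chosen as a neutral element of func, such as 0 for addition or an empty collection for concatenation.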

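4. The reduceByKey() operator

reduceByKey is one of the most frequently used RDD operators, and it too is built on combineByKey. Like groupByKey, however, it is a fairly specialized aggregation operator and is not suited to general-purpose aggregation: it takes a single function parameter (not counting numPartitions), and that function is used for both the combine and the reduce step, so the result type is the same as the value type V.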

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

val inputRDD: RDD[(Int, String)] = sc.parallelize(Array[(Int, String)]((1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (3, "f"), (2, "g"), (1, "h"), (2, "i")), 3)
val resultRDD: RDD[(Int, String)] = inputRDD.reduceByKey(_ + "_" + _, 2)
resultRDD.foreach(println)

(4,d)
(2,b_g_i)
(1,a_h)
(3,c_f)
(5,e)

reduceByKey logical processing flow
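The source above passes the identity function as createCombiner and func for both merge steps. A minimal sketch of that reduction on plain Scala collections might look like this; combine and this reduceByKey are hypothetical helpers, not Spark APIs, and the partition layout assumes three consecutive slices of three records.

```scala
object ReduceByKeySketch {
  // Hypothetical helper (not a Spark API): per-partition combine followed by a
  // cross-partition merge, on plain Scala collections.
  def combine[K, V, C](parts: Seq[Seq[(K, V)]])(
      create: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    val perPartition: Seq[Map[K, C]] = parts.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(create(v)))
      }
    }
    perPartition.flatten.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(prev => mergeCombiners(prev, c)).getOrElse(c))
    }
  }

  // reduceByKey: the first value of a key is taken as is (identity), and the
  // same func serves as both mergeValue and mergeCombiners.
  def reduceByKey[K, V](parts: Seq[Seq[(K, V)]])(func: (V, V) => V): Map[K, V] =
    combine[K, V, V](parts)(v => v, func, func)

  // Assumed partition layout of the example input (three slices of three).
  val partitions: Seq[Seq[(Int, String)]] = Seq(
    Seq((1, "a"), (2, "b"), (3, "c")),
    Seq((4, "d"), (5, "e"), (3, "f")),
    Seq((2, "g"), (1, "h"), (2, "i")))

  val result: Map[Int, String] = reduceByKey(partitions)(_ + "_" + _)

  def main(args: Array[String]): Unit =
    result.toSeq.sortBy(_._1).foreach(println)
}
```

Because there is no zeroValue and no separate createCombiner, keys with a single value (4 and 5 here) come out unchanged.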

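5. The groupByKey() operator

groupByKey is probably the first Spark shuffle operator most people meet when learning big data. It is the least general of these operators: it has no combine stage, and the reduce-stage aggregation cannot be customized. In most cases it is not recommended.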

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

val inputRDD: RDD[(Int, String)] = sc.parallelize(Array[(Int, String)]((1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (3, "f"), (2, "g"), (1, "h"), (2, "i")), 3)
val resultRDD: RDD[(Int, Iterable[String])] = inputRDD.groupByKey(2)
resultRDD.foreach(println)

(4,CompactBuffer(d))
(2,CompactBuffer(b, g, i))
(1,CompactBuffer(a, h))
(3,CompactBuffer(c, f))
(5,CompactBuffer(e))

groupByKey logical processing flow
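The practical effect of mapSideCombine = false can be sketched by counting how many records would cross the shuffle with and without a map-side combine. This is a rough illustration on plain Scala collections, not a measurement of Spark itself; the partition layout assumes three consecutive slices of three records, and the names shuffledByGroup and shuffledByReduce are invented for this example.

```scala
object GroupByKeySketch {
  // Assumed partition layout of the example input (three slices of three).
  val partitions: Seq[Seq[(Int, String)]] = Seq(
    Seq((1, "a"), (2, "b"), (3, "c")),
    Seq((4, "d"), (5, "e"), (3, "f")),
    Seq((2, "g"), (1, "h"), (2, "i")))

  // groupByKey: no map-side combine, so every record crosses the shuffle
  // unchanged and values are only collected per key on the reduce side.
  val shuffledByGroup: Seq[(Int, String)] = partitions.flatten
  val grouped: Map[Int, Seq[String]] =
    shuffledByGroup.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // reduceByKey: values are first merged per key inside each partition, so at
  // most one record per key and partition crosses the shuffle.
  val shuffledByReduce: Seq[(Int, String)] = partitions.flatMap { part =>
    part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + "_" + _) }
  }
  val reduced: Map[Int, String] =
    shuffledByReduce.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + "_" + _) }

  def main(args: Array[String]): Unit = {
    println(s"records shuffled by groupByKey:  ${shuffledByGroup.size}")
    println(s"records shuffled by reduceByKey: ${shuffledByReduce.size}")
    grouped.toSeq.sortBy(_._1).foreach(println)
  }
}
```

On this tiny input the saving is a single record, because only key 2 repeats within a partition. The gap grows with the number of duplicate keys per partition, which is the usual reason to prefer reduceByKey or aggregateByKey when the values are going to be aggregated anyway.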

Reference: Xu Lijie (许利杰), Fang Yafen (方亚芬), 《大数据处理框架 Apache Spark 设计与实现》.
