combineByKey

  /**
   * Generic function to combine the elements for each key using a custom set of aggregation
   * functions. This method is here for backward compatibility. It does not provide combiner
   * classtag information to the shuffle.
   *
   * @see [[combineByKeyWithClassTag]]
   */
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
      partitioner, mapSideCombine, serializer)(null)
  }

  /**
   * :: Experimental ::
   * Generic function to combine the elements for each key using a custom set of aggregation
   * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C.
   * Note that V and C can be different -- for example, one might group an RDD of type
   * (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
   *
   *  - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
   *  - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
   *  - `mergeCombiners`, to combine two C's into a single one.
   *
   * In addition, users can control the partitioning of the output RDD, and whether to perform
   * map-side aggregation (if a mapper can produce multiple items with the same key).
   */
  @Experimental
  def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("Default partitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }
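Note the branch at the end: when the input RDD is already partitioned by the same partitioner, the aggregation runs inside mapPartitions and no shuffle is produced. A minimal sketch of exploiting this (the partitioner, data, and names are illustrative; sc is assumed to be a SparkContext):

import org.apache.spark.HashPartitioner

// Sketch: pre-partitioning the input with the same partitioner lets combineByKey
// take the mapPartitions branch above, so no ShuffledRDD is created.
val part = new HashPartitioner(4)
val prePartitioned = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).partitionBy(part)
val combined = prePartitioned.combineByKey(
  (v: Int) => List(v),                          // createCombiner
  (c: List[Int], v: Int) => v :: c,             // mergeValue
  (c1: List[Int], c2: List[Int]) => c1 ::: c2,  // mergeCombiners
  part)                                         // same partitioner as the input => no shuffle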

The functional style differs from the imperative style in that it states what the code does rather than how it does it. combineByKey takes three functions as its main arguments: createCombiner, mergeValue, and mergeCombiners. These three functions are enough to describe what the operation actually does; once you understand them, you understand combineByKey.

combineByKey combines an RDD[(K, V)] into an RDD[(K, C)], so the first thing you must supply is a function that starts the combination from a V to a C, called the combiner. If V and C are the same type, the function is simply V => V; if C is a collection such as Iterable[V], then createCombiner is V => Iterable[V].

mergeValue merges a Value from the original RDD's pairs into the C produced so far. How this merge is implemented determines how the result is computed, so mergeValue is really a declaration of a merge strategy, driven by what the overall combine should produce. Its inputs are a C plus a V from the original RDD's pairs, and its output is the updated C of the result RDD's pairs.

Finally, mergeCombiners merges the multiple C values produced for each key into one.
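To pin down the shapes of the three functions, here is a minimal sketch (the types and data are chosen purely for illustration, with V = Int and C = Set[Int]; sc is assumed to be a SparkContext) that collects the distinct values per key:

// Illustrative shapes of the three functions for V = Int, C = Set[Int].
val createCombiner: Int => Set[Int]                  = v => Set(v)            // V => C
val mergeValue: (Set[Int], Int) => Set[Int]          = (c, v) => c + v        // (C, V) => C
val mergeCombiners: (Set[Int], Set[Int]) => Set[Int] = (c1, c2) => c1 ++ c2   // (C, C) => C

val distinctPerKey = sc.parallelize(Seq((1, 2), (1, 2), (2, 5)))
  .combineByKey(createCombiner, mergeValue, mergeCombiners)
// distinctPerKey: RDD[(Int, Set[Int])]; collecting yields something like Array((1,Set(2)), (2,Set(5)))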

Think of combineByKey as a very clever juicer. It accepts all kinds of fruit at once and, by fruit type, presses each kind into its own juice: apples into apple juice, oranges into orange juice, watermelons into watermelon juice. If we model the fruit as the type Fruit and the juice as Juice, then combineByKey combines an RDD[(String, Fruit)] into an RDD[(String, Juice)].

Note that before juicing there may be many pieces of fruit, and even pieces of the same kind appear as separate RDD elements:

("apple", apple1), ("orange", orange1), ("apple", apple2)

The result of the combine is one glass of juice per kind of fruit (only the volume differs):

("apple", appleJuice), ("orange", orangeJuice)

What components make up this juicer? First, it needs one that presses each kind of fruit into juice; second, one that pours juice together; and finally, to avoid mixing the wrong things, one that merges juice only within the same fruit type. Note the difference between the second and the third function: the former simply provides mixing, pouring juice from different containers into one, while the latter starts from the premise that the juice has already been separated by fruit type into different areas, and the juicer never confuses juice from different areas when it merges them.

The juicer's behavior resembles a groupByKey followed by foldByKey. It can be expressed by calling combineByKey:

case class Fruit(kind: String, weight: Int) {
  def makeJuice: Juice = Juice(weight * 100)
}
case class Juice(volume: Int) {
  def add(j: Juice): Juice = Juice(volume + j.volume)
}

val apple1  = Fruit("apple", 5)
val apple2  = Fruit("apple", 8)
val orange1 = Fruit("orange", 10)

val fruit = sc.parallelize(List(("apple", apple1), ("orange", orange1), ("apple", apple2)))
val juice = fruit.combineByKey(
  f => f.makeJuice,                            // createCombiner: press a Fruit into a Juice
  (j: Juice, f: Fruit) => j.add(f.makeJuice),  // mergeValue: press another Fruit and pour it in
  (j1: Juice, j2: Juice) => j1.add(j2)         // mergeCombiners: pour two Juices of the same key together
)

Running juice.collect gives, for apple, (5 + 8) * 100 = 1300 and, for orange, 10 * 100 = 1000:

Array[(String, Juice)] = Array((orange,Juice(1000)), (apple,Juice(1300)))

Many of the pair-RDD operations in Spark are implemented internally on top of combineByKey. groupByKey, for example:

class PairRDDFunctions[K, V](self: RDD[(K, V)])
  (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging
  with SparkHadoopMapReduceUtil
  with Serializable {
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKey[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }
}

groupByKey in PairRDDFunctions groups the values of an RDD[(K, V)] by key. It calls combineByKey internally, and the three functions it passes have the following roles:

createCombiner turns a V from the original RDD into an Iterable[V], implemented here as a CompactBuffer.
mergeValue simply appends an element of the original RDD to the CompactBuffer, that is, the append operation (+=) serves as the merge.
mergeCombiners then merges the Iterable[V] collections accumulated for each key.
Depending on the functions passed in, combineByKey can also be used to implement operations such as aggregate, fold, and average. It is a high level of abstraction, yet from the caller's declarative point of view it does not require knowing many implementation details. That is the appeal of functional programming.
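As an instance of the average case, a per-key average can be sketched as follows (a minimal sketch; the data and names are illustrative, and sc is assumed to be a SparkContext):

// Minimal sketch: per-key average via combineByKey, using a (sum, count) pair as the combiner.
val scores = sc.parallelize(Seq(("alice", 90.0), ("bob", 80.0), ("alice", 70.0)))
val avgByKey = scores.combineByKey(
  (v: Double) => (v, 1),                                              // createCombiner: start a (sum, count)
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold one more value in
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: add partial sums and counts
).mapValues { case (sum, count) => sum / count }
// avgByKey.collect() => Array((alice,80.0), (bob,80.0)) (order may vary)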

Excerpt from the book:
Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark (Apress)

Combines the values by key by invoking user-defined accumulation and merge functions.
Arguments:
1. createCombiner: Function to create the combiner
2. mergeValue: Function to accumulate the values of each partition
3. mergeCombiner: Function to merge two accumulators across partitions
4. partitioner: The partitioner to use
Every time a new key is encountered, the createCombiner method is called to spawn a new combiner
instance for it. On subsequent encounters, mergeValue is used to accumulate its values. Once all keys have
been exhausted, mergeCombiner is invoked to merge accumulators across partitions.
Unlike with reduceByKey() and its variants, the input and output types of the values can be different.
Note that reduce() under the hood invokes reduceByKey() with the key as null. In turn, reduceByKey()
calls combineByKey() by (see the sketch after this list):
• Using the same types for input and output
• Using the same reduceFunc for both mergeValue and mergeCombiner
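A minimal sketch of that equivalence (illustrative only, not Spark's actual source; sc is assumed to be a SparkContext):

// Sketch: reduceByKey(f) behaves like combineByKey(identity, f, f), with V and C being the same type.
val counts = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val viaReduce  = counts.reduceByKey(_ + _)
val viaCombine = counts.combineByKey(
  (v: Int) => v,                  // createCombiner: input and output types are the same
  (c: Int, v: Int) => c + v,      // mergeValue: the reduce function
  (c1: Int, c2: Int) => c1 + c2   // mergeCombiners: the same reduce function
)
// Both collect to Array((a,4), (b,2)) (order may vary).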
combineByKey() is useful to implement functionality that requires custom accumulation, such
as calculating averages. Listing 3-18 ranks the users in the dataset based on the average length of their
comments.

Ranking Authors on the Basis of the Average Length of Their Comments

val topAuthorsByAvgContent = comments
  .map(rec => ((parse(rec) \ "author").values.toString,
    (parse(rec) \ "body").values.toString.split(" ").length))
  .combineByKey(
    (v) => (v, 1),
    (accValue: (Int, Int), v) => (accValue._1 + v, accValue._2 + 1),
    (accCombine1: (Int, Int), accCombine2: (Int, Int)) =>
      (accCombine1._1 + accCombine2._1, accCombine1._2 + accCombine2._2),
    new HashPartitioner(ssc.sparkContext.defaultParallelism)
  )
  .map({ case (k, v) => (k, v._1 / v._2.toFloat) })
  .map(r => (r._2, r._1))
  .transform(rdd => rdd.sortByKey(ascending = false))

Reposted from the 过往记忆 (iteblog) blog:
http://www.iteblog.com/archives/1291

scala> val data = sc.parallelize(List((1, "www"), (1, "iteblog"), (1, "com"), (2, "bbs"), (2, "iteblog"), (2, "com"), (3, "good")))
data: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[15] at parallelize at <console>:12

scala> val result = data.combineByKey(List(_), (x: List[String], y: String) => y :: x, (x: List[String], y: List[String]) => x ::: y)
result: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[19] at combineByKey at <console>:14

scala> result.collect
res20: Array[(Int, List[String])] = Array((1,List(www, iteblog, com)), (2,List(bbs, iteblog, com)), (3,List(good)))

scala> val data = sc.parallelize(List(("iteblog", 1), ("bbs", 1), ("iteblog", 3)))
data: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:12

scala> val result = data.combineByKey(x => x, (x: Int, y: Int) => x + y, (x: Int, y: Int) => x + y)
result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[25] at combineByKey at <console>:14

scala> result.collect
res27: Array[(String, Int)] = Array((iteblog,4), (bbs,1))
