Spark functions explained: aggregate

漂浮的鱼~

Article link: https://blog.csdn.net/tolcf/article/details/51900440

Function prototype:

def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U
Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are allowed to modify and return their first argument instead of creating a new U to avoid memory allocation.
zeroValue
the initial value for the accumulated result of each partition for the seqOp operator, and also the initial value for the combine results from different partitions for the combOp operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
seqOp
an operator used to accumulate results within a partition
combOp
an associative operator used to combine results from different partitions

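The aggregate function first aggregates the elements inside each partition with seqOp, starting from zeroValue in every partition. It then uses combOp to combine the per-partition results, again starting from zeroValue. The type of the final result (U) does not have to match the element type of the RDD (T).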
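To make the "different result type" point concrete, here is a minimal sketch that models the two phases on plain Scala collections, with no Spark dependency. `simulateAggregate` is a hypothetical helper written for this illustration, not Spark's implementation; each inner `Seq` merely stands in for one partition.

```scala
// Local model of aggregate: each inner Seq stands in for one RDD partition.
// simulateAggregate is a hypothetical helper, not part of Spark.
def simulateAggregate[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U
): U = {
  // Phase 1: fold every partition on its own, starting from zeroValue.
  val perPartition = partitions.map(_.foldLeft(zeroValue)(seqOp))
  // Phase 2: merge the partition results, again starting from zeroValue.
  perPartition.foldLeft(zeroValue)(combOp)
}

val parts = Seq(Seq(1, 2, 3), Seq(4, 5, 6))

// T is Int, while U is (Int, Int): a running (sum, count) pair.
val (sum, count) = simulateAggregate(parts, (0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),   // seqOp: add one element to the pair
  (a, b) => (a._1 + b._1, a._2 + b._2)    // combOp: merge two pairs
)
val mean = sum.toDouble / count
```

Here `(0, 0)` is a true neutral element for both operators, so the result does not depend on how the elements are split across partitions.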

Example:

scala> def seqOP(a:Int, b:Int) : Int = {
     |     val r = a*b
     |     println("seqOp: " + a + "\t" + b+"=>"+r)
     |     r
     |   }
seqOP: (a: Int, b: Int)Int

scala>   def combOp(a:Int, b:Int): Int = {
     |     val r= a+b
     |     println("combOp: " + a + "\t" + b+"=>"+r)
     |     r
     |   }
combOp: (a: Int, b: Int)Int

scala> val z = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
z: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:27

scala> z.aggregate(3)(seqOP, combOp)
combOp: 3	18=>21
combOp: 21	360=>381
res20: Int = 381
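
The result 381 can be traced by hand. The sketch below replays the same arithmetic on plain Scala collections, so it needs no SparkContext. The split into (1, 2, 3) and (4, 5, 6) is an assumption inferred from the partial results 18 and 360 in the output above, and the three-way split is a hypothetical variation added for comparison.

```scala
// Local trace of the session above; plain Scala, no Spark required.
val zeroValue = 3

def seqOp(a: Int, b: Int): Int = a * b   // multiply inside a partition
def combOp(a: Int, b: Int): Int = a + b  // add partition results together

// Assumed two-way split, consistent with the partial results 18 and 360.
val twoWay = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
val partials = twoWay.map(_.foldLeft(zeroValue)(seqOp)) // 3*1*2*3 = 18, 3*4*5*6 = 360
val result = partials.foldLeft(zeroValue)(combOp)       // 3 + 18 + 360 = 381

// Hypothetical three-way split of the same data gives a different answer,
// because 3 is not a neutral element for either operator.
val threeWay = Seq(Seq(1, 2), Seq(3, 4), Seq(5, 6))
val result3 = threeWay
  .map(_.foldLeft(zeroValue)(seqOp)) // 6, 36, 90
  .foldLeft(zeroValue)(combOp)       // 3 + 6 + 36 + 90 = 135
```

zeroValue is applied once in every partition and once more in the final combine, so with a non-neutral value such as 3 the answer depends on the number of partitions. The seqOp lines are absent from the transcript, presumably because those calls ran inside the tasks and not in the shell that printed the combOp lines.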