- Define an RDD of tuples, where the key is a name and the value is a score. The RDD has three partitions, as shown below.
val input = sc.parallelize(Array(("Fred",88), ("Fred",95), ("Fred",91), ("Wilma",93), ("Wilma",95), ("Wilma",98)), 3)
scala> input.mapPartitionsWithIndex((index,iter)=>Iterator(index+":"+ iter.mkString("-"))).collect
res3: Array[String] = Array(0:(Fred,88)-(Fred,95), 1:(Fred,91)-(Wilma,93), 2:(Wilma,95)-(Wilma,98))
- Use combineByKey to build a (sum, count) pair per key
val combine = input.combineByKey(
  (v: Int) => (v, 1),
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
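As a follow-up sketch (not part of the original transcript), the `(sum, count)` pairs produced by `combine` can be mapped into per-key averages; `averages` is a hypothetical name:

```scala
// Turn each key's (sum, count) combiner into an average score
val averages = combine.map { case (name, (sum, count)) => (name, sum.toDouble / count) }
averages.collect
// Fred  -> 274 / 3
// Wilma -> 286 / 3
```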
Explanation:
The combineByKey operator, with the types specialized to this RDD, has the following signature:
def combineByKey[C](createCombiner: Int => C,mergeValue: (C, Int) => C,mergeCombiners: (C, C) => C): org.apache.spark.rdd.RDD[(String, C)]
The operator takes three functions:
- createCombiner: called the first time a key is seen within a partition; here it turns a single score v into the pair (v, 1), i.e. (sum, count).
- mergeValue: called for each subsequent value of a key within the same partition; here it adds the score to the running sum and increments the count.
- mergeCombiners: called across partitions to merge two partial (sum, count) combiners for the same key; here it adds the sums and the counts.
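The three functions can be traced in plain Scala without a Spark cluster. This is a sketch assuming the partition layout shown above, where Fred's scores 88 and 95 sit in partition 0 and 91 sits in partition 1:

```scala
// The same three functions passed to combineByKey
val createCombiner = (v: Int) => (v, 1)
val mergeValue = (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1)
val mergeCombiners = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)

// Partition 0: first Fred score creates a combiner, second merges into it
val part0 = mergeValue(createCombiner(88), 95)   // (183, 2)
// Partition 1: Fred appears once, so only createCombiner runs
val part1 = createCombiner(91)                   // (91, 1)
// Shuffle phase: partial combiners from the two partitions are merged
val fred = mergeCombiners(part0, part1)          // (274, 3)
```

This mirrors exactly what Spark does: createCombiner and mergeValue run map-side within each partition, and mergeCombiners runs after the shuffle.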