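combineByKey aggregates a dataset by key.
combineByKey(createCombiner, mergeValue, mergeCombiners, [partitioner], [mapSideCombine], [serializer])
Parameters:
createCombiner: converts the first value seen for a key in a partition into the initial combiner (the accumulator)
mergeValue: within each partition, merges each subsequent value for that key into the combiner
mergeCombiners: across all partitions, merges the per-partition combiners for each key
partitioner: optional, the partitioner to use
mapSideCombine: optional, whether to combine on the map side
serializer: optional, the serializer to use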
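As a rough illustration of the optional arguments listed above, the sketch below passes an explicit partitioner and mapSideCombine flag. The object name PartitionerSketch, the app name, the local[2] master, and the choice of HashPartitioner(2) are assumptions made for this example only; the serializer is left at its default.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical object name, used only for this sketch.
object PartitionerSketch {
  // The three required functions, written out with explicit types.
  val createCombiner: Double => (Double, Int) = score => (score, 1)
  val mergeValue: ((Double, Int), Double) => (Double, Int) =
    (acc, score) => (acc._1 + score, acc._2 + 1)
  val mergeCombiners: ((Double, Int), (Double, Int)) => (Double, Int) =
    (a, b) => (a._1 + b._1, a._2 + b._2)

  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master, just to make the sketch runnable.
    val conf = new SparkConf().setAppName("combineByKey-partitioner-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(("zhangsan", 99.0), ("zhangsan", 98.0), ("lisi", 97.0)))
      val combined = rdd.combineByKey(
        createCombiner,
        mergeValue,
        mergeCombiners,
        new HashPartitioner(2), // optional partitioner: two output partitions
        mapSideCombine = true   // optional: pre-aggregate before the shuffle
      )
      println(combined.getNumPartitions)
      combined.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```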
val rdd = sc.parallelize(Seq(
  ("zhangsan", 99.0),
  ("zhangsan", 98.0),
  ("zhangsan", 97.0),
  ("lisi", 97.0),
  ("lisi", 98.0),
  ("zhangsan", 97.0),
  ("zhangsan", 96.0)
))
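val combineRdd = rdd.combineByKey(
  // createCombiner: the first score for a key becomes (sum, count)
  (score: Double) => (score, 1),
  // mergeValue: fold the next score in the same partition into (sum, count)
  (scoreCount: (Double, Int), newScore: Double) => (scoreCount._1 + newScore, scoreCount._2 + 1),
  // mergeCombiners: add up the (sum, count) pairs coming from different partitions
  (scoreCount1: (Double, Int), scoreCount2: (Double, Int)) =>
    (scoreCount1._1 + scoreCount2._1, scoreCount1._2 + scoreCount2._2)
)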
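The combiner here is a (sum, count) pair, which suggests the usual next step of dividing one by the other to get an average score per key. The notes do not show that step, so the following is only a sketch of how it might look. The object name AverageScoreSketch, the helper average, the app name, and the local[2] master are all assumptions introduced for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name, used only for this sketch.
object AverageScoreSketch {
  // Hypothetical helper: turns a (sum, count) combiner into an average.
  def average(sumCount: (Double, Int)): Double = sumCount._1 / sumCount._2

  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master, just to make the sketch runnable.
    val conf = new SparkConf().setAppName("average-score-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(
        ("zhangsan", 99.0), ("zhangsan", 98.0), ("zhangsan", 97.0),
        ("lisi", 97.0), ("lisi", 98.0),
        ("zhangsan", 97.0), ("zhangsan", 96.0)
      ))
      // Same three functions as in the example above.
      val combineRdd = rdd.combineByKey(
        (score: Double) => (score, 1),
        (acc: (Double, Int), score: Double) => (acc._1 + score, acc._2 + 1),
        (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)
      )
      // Divide sum by count for every key.
      val averages = combineRdd.mapValues(average)
      averages.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With the seven records above, the sums and counts are (487.0, 5) for zhangsan and (195.0, 2) for lisi, so the averages work out to 97.4 and 97.5.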
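createCombiner and mergeValue both run within a single partition.
To understand mergeValue: its first argument is the combiner built so far (initially the result of createCombiner), and its second argument is the next value for the same key in the current RDD partition.
For example:
("zhangsan", 99.0),
("zhangsan", 98.0),
("zhangsan", 97.0),
Suppose these three records are in the same partition. createCombiner turns the value of ("zhangsan", 99.0) into (99.0, 1).
That output becomes the first argument of mergeValue (scoreCount). The next record in the same partition is ("zhangsan", 98.0), so 98.0 becomes the second argument (newScore).
Evaluating the function body (scoreCount._1 + newScore, scoreCount._2 + 1) gives (197.0, 2),
which becomes the first argument of the next mergeValue call. The next record is ("zhangsan", 97.0), so 97.0 is the second argument, and the function returns (294.0, 3).
After mergeValue has run over this partition, the result is ("zhangsan", (294.0, 3)).
Once every partition has finished, mergeCombiners merges the per-partition results for each key.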
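The walkthrough above can be replayed without Spark using plain Scala collections. This is only a simulation of the semantics, not how Spark implements combineByKey internally. The object name CombineByKeySketch, the helpers combinePartition and combineAll, and the split of the seven records into two partitions are assumptions made for illustration; a real job decides its own partitioning.

```scala
// Hypothetical object name, used only for this sketch.
object CombineByKeySketch {
  type Acc = (Double, Int) // (running sum, count)

  val createCombiner: Double => Acc = score => (score, 1)
  val mergeValue: (Acc, Double) => Acc = (acc, score) => (acc._1 + score, acc._2 + 1)
  val mergeCombiners: (Acc, Acc) => Acc = (a, b) => (a._1 + b._1, a._2 + b._2)

  // Within one partition: the first value for a key goes through createCombiner,
  // every later value for that key goes through mergeValue.
  def combinePartition(partition: Seq[(String, Double)]): Map[String, Acc] =
    partition.foldLeft(Map.empty[String, Acc]) { case (accs, (key, score)) =>
      val updated = accs.get(key) match {
        case Some(acc) => mergeValue(acc, score)
        case None      => createCombiner(score)
      }
      accs.updated(key, updated)
    }

  // Across partitions: per-partition results for the same key go through mergeCombiners.
  def combineAll(partitions: Seq[Seq[(String, Double)]]): Map[String, Acc] =
    partitions.map(combinePartition).foldLeft(Map.empty[String, Acc]) { (merged, partial) =>
      partial.foldLeft(merged) { case (m, (key, acc)) =>
        m.updated(key, m.get(key).map(mergeCombiners(_, acc)).getOrElse(acc))
      }
    }

  def main(args: Array[String]): Unit = {
    // Assumed partition layout, chosen to match the three-record example above.
    val partition1 = Seq(("zhangsan", 99.0), ("zhangsan", 98.0), ("zhangsan", 97.0))
    val partition2 = Seq(("lisi", 97.0), ("lisi", 98.0), ("zhangsan", 97.0), ("zhangsan", 96.0))

    println(combinePartition(partition1)) // Map(zhangsan -> (294.0,3))
    println(combineAll(Seq(partition1, partition2)))
  }
}
```

Under this assumed layout, the second partition yields (193.0, 2) for zhangsan, and mergeCombiners adds it to (294.0, 3) from the first partition to give (487.0, 5).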