val conf = new SparkConf().setMaster("local").setAppName("app_1")
val sc = new SparkContext(conf)
val people = List(("男", "李四"), ("男", "张三"), ("女", "韩梅梅"), ("女", "李思思"), ("男", "马云"))
val rdd = sc.parallelize(people, 2)
val result = rdd.combineByKey(
  (x: String) => (List(x), 1), // createCombiner
  (peo: (List[String], Int), x: String) => (x :: peo._1, peo._2 + 1), // mergeValue
  (sex1: (List[String], Int), sex2: (List[String], Int)) =>
    (sex1._1 ::: sex2._1, sex1._2 + sex2._2)) // mergeCombiners
result.foreach(println)
Result:
(男,(List(张三, 李四, 马云),3))
(女,(List(李思思, 韩梅梅),2))
Analysis: there are two partitions, and within a partition the records are visited in order; call them V1, V2, V3.
V1: the first time key=男 appears, createCombiner is called, i.e. (x: String) => (List(x), 1)
V2: the second time a key=男 element is met, mergeValue is called, i.e. (peo: (List[String], Int), x: String) => (x :: peo._1, peo._2 + 1)
V3: the first time key=女 appears, createCombiner is called again, i.e. (x: String) => (List(x), 1)
… …
Once every partition has finished its local pass, the data is shuffled and the per-partition results are merged with mergeCombiners, i.e. (sex1: (List[String], Int), sex2: (List[String], Int)) => (sex1._1 ::: sex2._1, sex1._2 + sex2._2)
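The walkthrough above can be reproduced in plain Scala without a Spark cluster. The sketch below is not Spark's implementation; it only mimics the call order (createCombiner on the first value of a key, mergeValue on later values in the same partition, mergeCombiners across partitions). The two-partition split of the five records is an assumption chosen to match the analysis.

```scala
object CombineByKeySketch {
  type Acc = (List[String], Int)

  // The same three functions as in the combineByKey call above.
  def createCombiner(x: String): Acc = (List(x), 1)
  def mergeValue(acc: Acc, x: String): Acc = (x :: acc._1, acc._2 + 1)
  def mergeCombiners(a: Acc, b: Acc): Acc = (a._1 ::: b._1, a._2 + b._2)

  // One partition pass: first value of a key -> createCombiner,
  // every later value of that key -> mergeValue.
  def processPartition(part: Seq[(String, String)]): Map[String, Acc] =
    part.foldLeft(Map.empty[String, Acc]) { case (m, (k, v)) =>
      m.updated(k, m.get(k) match {
        case None      => createCombiner(v)
        case Some(acc) => mergeValue(acc, v)
      })
    }

  def main(args: Array[String]): Unit = {
    // Assumed split of the 5 records into 2 partitions.
    val part1 = Seq(("男", "李四"), ("男", "张三"))
    val part2 = Seq(("女", "韩梅梅"), ("女", "李思思"), ("男", "马云"))

    // "Shuffle" step: merge per-partition maps key by key with mergeCombiners.
    val result = Seq(part1, part2).map(processPartition)
      .flatten.groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2).reduce(mergeCombiners) }
    result.foreach(println)
  }
}
```

Running it reproduces the output shown above: prepending with :: reverses each partition's order, and ::: concatenates partition 1's list before partition 2's.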
Summary: operating on a key-value RDD is no different from operating on a basic RDD. Once you grasp Spark's basic approach to merging data, the rest is just substituting your own functions into the formula.
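To illustrate that "substitute into the formula" point, here is the same three-function pattern with a different combiner type, (sum, count), used to compute an average score per key. localCombineByKey below is a hypothetical local stand-in written for this sketch, not a Spark API; on a real RDD you would pass the same three functions straight to combineByKey.

```scala
object AverageByKey {
  // A local stand-in for the combineByKey pattern, generic in the
  // key K, value V, and combiner C types.
  def localCombineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Per-partition pass.
    val partial = partitions.map(_.foldLeft(Map.empty[K, C]) {
      case (m, (k, v)) =>
        m.updated(k, m.get(k).fold(createCombiner(v))(mergeValue(_, v)))
    })
    // Cross-partition merge.
    partial.flatten.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(mergeCombiners)
    }
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(
      Seq(("张三", 90), ("李四", 80)),
      Seq(("张三", 70), ("李四", 100)))
    // Substitute the formulas: the combiner is now (sum, count).
    val sumCount = localCombineByKey(parts)(
      (v: Int) => (v, 1),
      (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))
    sumCount.map { case (k, (s, n)) => (k, s.toDouble / n) }.foreach(println)
  }
}
```

Only the three functions changed; the traversal and shuffle logic stayed exactly the same, which is the whole appeal of combineByKey.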