1. groupByKey
Function definitions
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
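groupByKey groups an RDD of (key, value) pairs by key, producing an RDD of the form RDD[(key, Iterable[value])]. It is loosely comparable to GROUP BY in SQL, in particular GROUP BY combined with MySQL's group_concat, since all values for a key are collected together rather than reduced to a single value.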
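To make the shape of the result concrete, here is a minimal sketch that models the grouping on an ordinary Java list. It is plain Java, not Spark: the class GroupByKeyModel and its helper methods are hypothetical names introduced only for illustration, and the comma-joined output merely imitates what group_concat would show.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByKeyModel {
    // Groups (key, value) pairs by key, keeping keys in first-seen order.
    // A local stand-in for the result shape of groupByKey; not Spark code.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Joins one group's values with commas, similar in spirit to group_concat.
    static String concat(List<Integer> values) {
        return values.stream().map(String::valueOf).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> scores = List.of(
                Map.entry("zhangsan", 97), Map.entry("zhangsan", 87),
                Map.entry("xiaoming", 75),
                Map.entry("lisi", 95), Map.entry("lisi", 88));
        // Prints one line per key, e.g. "zhangsan -> 97,87"
        groupByKey(scores).forEach((name, values) ->
                System.out.println(name + " -> " + concat(values)));
    }
}
```

Unlike this local model, a real RDD gives no guarantee about the order of keys or of the values within a group.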
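groupByKey does not take an aggregation function. It is also more expensive than reduceByKey: every value is sent through the shuffle, whereas reduceByKey first merges the values for each key within a partition.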
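A rough way to see the cost gap is to count how many records would leave each partition under the two operators. The sketch below is plain Java rather than Spark, the class ShuffleCostModel and the two-partition layout are hypothetical, and it counts records only (it ignores serialization and memory effects).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShuffleCostModel {
    // groupByKey model: every (key, value) record crosses the shuffle unchanged.
    static int recordsShuffledByGroup(List<List<Map.Entry<String, Integer>>> partitions) {
        int records = 0;
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            records += partition.size();
        }
        return records;
    }

    // reduceByKey model: each partition merges its own values per key first,
    // so at most one record per distinct key leaves each partition.
    static int recordsShuffledByReduce(List<List<Map.Entry<String, Integer>>> partitions) {
        int records = 0;
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            Map<String, Integer> combined = new HashMap<>();
            for (Map.Entry<String, Integer> pair : partition) {
                combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
            }
            records += combined.size();
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical split of the score data into two partitions.
        List<List<Map.Entry<String, Integer>>> partitions = List.of(
                List.of(Map.entry("zhangsan", 97), Map.entry("zhangsan", 87), Map.entry("xiaoming", 75)),
                List.of(Map.entry("lisi", 95), Map.entry("lisi", 88)));
        System.out.println("groupByKey model:  " + recordsShuffledByGroup(partitions));  // 5
        System.out.println("reduceByKey model: " + recordsShuffledByReduce(partitions)); // 3
    }
}
```

The gap widens as keys repeat more often within a partition, so when only a per-key aggregate such as a sum is needed, reduceByKey is usually the better fit.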
Example: grouping student scores by name
Scala version
import org.apache.spark.{SparkConf, SparkContext}

object GroupByKeyScala {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("groupByKey")
    val sc = new SparkContext(conf)
    val scoreDetails = sc.parallelize(List(("zhangsan", 97), ("zhangsan", 87), ("xiaoming", 75), ("lisi", 95), ("lisi", 88)))
    val groupByKeyRDD = scoreDetails.groupByKey()
    // Grouped by name: (name, (score1, score2))
    groupByKeyRDD.collect.foreach(println)
    // Flattened output: one (name, score) pair per line
    groupByKeyRDD.collect.foreach { case (name, scores) =>
      scores.foreach(score => println((name, score)))
    }
    // Release the local Spark resources
    sc.stop()
  }
}
Java version
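import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class GroupByKeyJava {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("groupByKey");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<Tuple2<String,Float>> scoreDetails = sc.parallelize(Arrays.asList(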
new Tuple2<>