Here is a quick overview of how several key-value (pair) RDD operations are used:
- groupByKey
- reduceByKey
- aggregateByKey
- foldByKey
- combineByKey
- mapValues
- join
groupByKey
Groups all values sharing a key, returning an RDD[(K, Iterable[V])].
// define two key-value RDDs
val rdd1: RDD[(Int, String)] = spark.sparkContext.makeRDD(Array((1, "r"), (2, "n"), (3, "g")))
val rdd2: RDD[(Int, String)] = spark.sparkContext.makeRDD(Array((1, "e"), (2, "d"), (3, "g")))
val resultRdd: RDD[(Int, Iterable[String])] = rdd1.union(rdd2).groupByKey()
// prints: ((1,("r","e")),(2,("n","d")),(3,("g","g")))
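The grouping semantics can be sketched with plain Scala collections, no Spark cluster required; this is a hypothetical stand-in, not Spark's implementation. `groupBy` on a `Seq` of pairs plays the role of `groupByKey` (ignoring partitioning and result ordering):

```scala
// Sketch of groupByKey semantics on plain Scala collections.
// The pairs mirror the union of rdd1 and rdd2 from the example above.
object GroupByKeyDemo {
  def main(args: Array[String]): Unit = {
    val pairs = Seq((1, "r"), (2, "n"), (3, "g"), (1, "e"), (2, "d"), (3, "g"))
    // Collect every value sharing a key into one collection per key
    val grouped: Map[Int, Seq[String]] =
      pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    println(grouped(1)) // List(r, e)
  }
}
```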
reduceByKey
Taking addition as the example:
Returns (key, reduced value) pairs, here (key, sum of that key's values).
val rdd1: RDD[(String, Int)] = spark.sparkContext.makeRDD(Array(("r", 1), ("n", 2), ("g", 3), ("r", 5)))
val resultRdd: RDD[(String, Int)] = rdd1.reduceByKey(_ + _)
// prints: List((n,2), (r,6), (g,3))
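As a hypothetical plain-Scala sketch (again not Spark itself): reducing per key is grouping followed by folding each key's values with the supplied function:

```scala
// Sketch of reduceByKey(_ + _) semantics on plain Scala collections.
object ReduceByKeyDemo {
  def main(args: Array[String]): Unit = {
    val pairs = Seq(("r", 1), ("n", 2), ("g", 3), ("r", 5))
    // Per key, combine all values with the supplied function (+ here)
    val reduced: Map[String, Int] =
      pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(_ + _)) }
    println(reduced("r")) // 6
  }
}
```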
aggregateByKey
The first argument (0) is the initial (zero) value.
The second argument is the function applied to values within each partition.
The third argument combines each partition's partial result with the other partitions' results.
Rule
Taking zhangsan with two partitions as an example (assuming its two records land in different partitions):
Partition 1: 0 + 5 = 5
Partition 2: 0 + 5 = 5
5 + 5 = 10
val kvRdd: RDD[(String, Int)] = spark.sparkContext.parallelize(Array(("zhangsan", 5), ("zhangsan", 5), ("wangwu", 6), ("zhaoliu", 4), ("tianqi", 3)), 2)
val resultRdd: RDD[(String, Int)] = kvRdd.aggregateByKey(0)(_ + _, _ + _)
// prints: List((zhangsan,10), (zhaoliu,4), (tianqi,3), (wangwu,6))
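The point of having two separate functions shows up when they differ. A hypothetical sketch in plain Scala, with the two partitions simulated as lists: take the max within each partition, then sum the per-partition maxima, mirroring what `aggregateByKey(0)(math.max, _ + _)` would do for one key:

```scala
// Sketch of aggregateByKey(0)(math.max, _ + _) for a single key,
// with the two partitions simulated as plain lists (hypothetical values).
object AggregateByKeyDemo {
  def main(args: Array[String]): Unit = {
    val partition1 = Seq(5, 5)
    val partition2 = Seq(4, 3)
    val zero = 0
    val seqOp: (Int, Int) => Int = math.max // within a partition: keep the max
    val combOp: (Int, Int) => Int = _ + _   // across partitions: add the partials
    val perPartition = Seq(partition1, partition2).map(_.foldLeft(zero)(seqOp))
    val result = perPartition.reduce(combOp)
    println(result) // max(5,5) + max(4,3) = 9
  }
}
```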
foldByKey
A simplified form of aggregateByKey.
First argument: the initial value.
Second argument: a single function used both within a partition and across partitions (pass subtraction and it subtracts in both stages), so you don't need to distinguish intra-partition from inter-partition logic.
val kvRdd: RDD[(String, Int)] = spark.sparkContext.parallelize(Array(("zhangsan", 5), ("zhangsan", 5), ("wangwu", 6), ("zhaoliu", 4), ("tianqi", 3)), 2)
val resultRdd: RDD[(String, Int)] = kvRdd.foldByKey(0)(_ + _)
// prints: List((zhangsan,10), (zhaoliu,4), (tianqi,3), (wangwu,6))
combineByKey
Lets you transform each original value into something else (such as a tuple) before combining.
To compute the average value per key we need each key's sum and count.
Taking zhangsan with two partitions as an example (assuming its records are split across partitions):
First argument: turns the first value 5 into (5, 1).
Second argument: within a partition, merges (5, 1) with the next value 3 to get (8, 2).
Third argument: merges partition 1's (8, 2) with partition 2's (5, 1) to get (13, 3).
val kvRdd: RDD[(String, Int)] = spark.sparkContext.parallelize(Array(("zhangsan", 5),("zhangsan", 3), ("zhangsan", 5), ("wangwu", 6), ("zhaoliu", 4), ("tianqi", 3)), 2)
val resultRdd: RDD[(String, (Int, Int))] = kvRdd.combineByKey((_, 1), (x: (Int, Int), y) => (x._1 + y, x._2 + 1), (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2))
// prints: List((zhangsan,(13,3)), (zhaoliu,(4,1)), (tianqi,(3,1)), (wangwu,(6,1)))
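The (sum, count) pairs are one step away from the average; on the RDD that last step would be mapValues, e.g. resultRdd.mapValues { case (sum, cnt) => sum.toDouble / cnt }. A plain-Scala sketch of that step, starting from the printed result above:

```scala
// Sketch of turning (sum, count) pairs into averages, as mapValues would.
object AverageDemo {
  def main(args: Array[String]): Unit = {
    val sumCounts = Map("zhangsan" -> (13, 3), "zhaoliu" -> (4, 1),
                        "tianqi" -> (3, 1), "wangwu" -> (6, 1))
    // Divide each key's sum by its count; keys stay untouched
    val averages: Map[String, Double] =
      sumCounts.map { case (k, (sum, cnt)) => (k, sum.toDouble / cnt) }
    println(averages("wangwu")) // 6.0
  }
}
```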
mapValues
Operates on the values only; keys are left untouched.
val kvRdd: RDD[(String, Int)] = spark.sparkContext.parallelize(Array(("zhangsan", 5), ("zhangsan", 5), ("wangwu", 6), ("zhaoliu", 4), ("tianqi", 3)), 2)
val resultRdd: RDD[(String, Int)] = kvRdd.mapValues(x => x * 10)
// prints: List((zhangsan,50), (zhangsan,50), (wangwu,60), (zhaoliu,40), (tianqi,30))
join
Inner join on keys: only keys present in both RDDs appear in the result, as (key, (leftValue, rightValue)).
val rdd1: RDD[(String, Int)] = spark.sparkContext.makeRDD(Array(("r", 1), ("n", 2), ("g", 3), ("r", 5)))
val rdd2: RDD[(String, Int)] = spark.sparkContext.makeRDD(Array(("e", 1), ("d", 2), ("g", 3)))
val resultRdd: RDD[(String, (Int, Int))] = rdd1.join(rdd2)
// prints: List((g,(3,3)))
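A hypothetical plain-Scala sketch of the inner-join semantics: every pair of records that share a key contributes one output record, so only the key "g" survives here:

```scala
// Sketch of inner-join semantics on plain Scala collections.
object JoinDemo {
  def main(args: Array[String]): Unit = {
    val left  = Seq(("r", 1), ("n", 2), ("g", 3), ("r", 5))
    val right = Seq(("e", 1), ("d", 2), ("g", 3))
    // Keep every (leftValue, rightValue) pairing whose keys match
    val joined = for {
      (k1, v1) <- left
      (k2, v2) <- right
      if k1 == k2
    } yield (k1, (v1, v2))
    println(joined) // List((g,(3,3)))
  }
}
```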