No. 1: combineByKey
Has an initial value (built by createCombiner from the first element per key), and that initial value can even change the data structure; the most flexible of the four.
combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
partitioner, mapSideCombine, serializer)(null)
// 3.4 Use combineByKey to compute the average per key
val list: List[(String, Int)] = List(("a", 88), ("b", 95), ("a", 91), ("b", 93), ("a", 95), ("b", 98))
val rdd4: RDD[(String, Int)] = sc.makeRDD(list, 2)
val value4: RDD[(String, (Int, Int))] = rdd4.combineByKey(
i => (i, 1),
(res: (Int, Int), elem: Int) => (res._1 + elem, res._2 + 1),
(res1: (Int, Int), res2: (Int, Int)) => (res1._1 + res2._1, res1._2 + res2._2)
)
value4.collect().foreach(println)
value4.mapValues({
case (sum, count) => sum.toDouble / count
}).collect().foreach(println)
No. 2: aggregateByKey
Has an initial value, and the intra-partition and inter-partition functions can differ; very flexible.
combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
cleanedSeqOp, combOp, partitioner)
// 3.3 Use aggregateByKey
val rdd3: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("a", 3), ("b", 5), ("b", 7), ("b", 2), ("b", 4), ("b", 6), ("a", 7)), 2)
// 3.3 Take the maximum value for each key within each partition, then add the per-partition maxima together
rdd3.aggregateByKey(Int.MinValue)((res: Int, elem: Int) => math.max(res, elem)
, (res1: Int, res2: Int) => res1 + res2)
.collect().foreach(println)
// 3.3 Written with combineByKey
rdd3.combineByKey(
// combineByKey builds its initial value from the first element seen for each
// key (createCombiner consumes that element), so constants like the
// commented-out lines below would drop or distort the first value:
// i => Int.MinValue,
// i => i - Int.MinValue,
i => i,
(res: Int, elem: Int) => math.max(res, elem),
(res: Int, elem: Int) => res + elem
).collect().foreach(println)
No. 3: foldByKey
Has an initial value; the intra-partition and inter-partition functions are the same.
combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
cleanedFunc, cleanedFunc, partitioner)
// 3.2 Use foldByKey
//3.1 Create the first RDD
val list1: List[(String, Int)] = List(("a", 1), ("a", 3), ("a", 5), ("b", 7), ("b", 2), ("b", 4), ("b", 6), ("a", 7))
val rdd2 = sc.makeRDD(list1, 2)
//3.2 Sum the values per key (wordcount-style)
rdd2.foldByKey(0)(_ + _).collect().foreach(println)
// rdd2.foldByKey(10)(_ + _).collect().foreach(println)
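A quick way to see what the commented-out foldByKey(10) call would do is to model it in plain Scala, with no Spark: the zero value is applied once per key per partition, not once per key. The two-partition split below is an assumed slicing of list1 for illustration; the actual split produced by makeRDD(list1, 2) may differ.

```scala
// Plain-Scala sketch (no Spark) of foldByKey(10)(_ + _) on list1.
object FoldByKeyZeroSim {
  val p0 = List(("a", 1), ("a", 3), ("a", 5), ("b", 7)) // assumed partition 0
  val p1 = List(("b", 2), ("b", 4), ("b", 6), ("a", 7)) // assumed partition 1

  // Within one partition, each key's fold starts from the zero value (10) once.
  def foldPart(p: List[(String, Int)]): Map[String, Int] =
    p.groupMapReduce(_._1)(_._2)(_ + _).map { case (k, s) => (k, s + 10) }

  // Across partitions, partial results merge with the same function and
  // no additional zero value.
  val merged: Map[String, Int] =
    (foldPart(p0).toList ++ foldPart(p1).toList).groupMapReduce(_._1)(_._2)(_ + _)

  def main(args: Array[String]): Unit =
    // "a" occurs in both partitions, so it picks up the 10 twice:
    // a -> 1+3+5+7+20 = 36, b -> 7+2+4+6+20 = 39
    println(merged)
}
```

This is why foldByKey(10) is only safe when the zero value is an identity for the fold function; with `_ + _`, only 0 leaves the result unchanged.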
// 3.2 Written with combineByKey
rdd2.combineByKey(
i => i,
(res: Int, elem: Int) => res + elem,
(res: Int, elem: Int) => res + elem
).collect().foreach(println)
// Equivalent to foldByKey(10): createCombiner, like the zero value, runs once per key per partition
rdd2.combineByKey(
i => i + 10,
(res: Int, elem: Int) => res + elem,
(res: Int, elem: Int) => res + elem
).collect().foreach(println)
No. 4: reduceByKey
No initial value; the intra-partition and inter-partition functions are the same.
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
//3.1.1 Create an RDD
val rdd1 = sc.makeRDD(List(("a", 1), ("b", 5), ("a", 5), ("b", 2)))
//3.1.2 Sum the values for each key
rdd1.reduceByKey(_ + _).collect().foreach(println)
//3.1.3 Written with combineByKey
rdd1.combineByKey(
i => i,
(res: Int, elem: Int) => res + elem,
(res: Int, elem: Int) => res + elem
).collect().foreach(println)
reduceByKey, foldByKey, and aggregateByKey can all be expressed in terms of combineByKey.
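That summary can be sketched without Spark at all: a plain-Scala model of combineByKey's two phases (a per-partition fold, then a cross-partition merge), from which the other three operators fall out as special cases. Each inner List stands in for one RDD partition; the object and helper names here are made up for this sketch and are not Spark APIs.

```scala
// Plain-Scala model of combineByKey's semantics (no Spark needed).
object CombineByKeySim {
  def combineByKey[K, V, C](
      partitions: List[List[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Phase 1, inside each partition: the first value seen for a key goes
    // through createCombiner; every later value is folded in with mergeValue.
    val perPartition: List[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).fold(createCombiner(v))(c => mergeValue(c, v)))
      }
    }
    // Phase 2, across partitions: combiners for the same key are merged.
    perPartition.flatten.groupMapReduce(_._1)(_._2)(mergeCombiners)
  }

  def main(args: Array[String]): Unit = {
    val parts = List(
      List(("a", 88), ("b", 95), ("a", 91)),
      List(("b", 93), ("a", 95), ("b", 98)))
    // reduceByKey(_ + _) is combineByKey(v => v, _ + _, _ + _):
    println(combineByKey[String, Int, Int](parts, v => v, _ + _, _ + _))
    // The (sum, count) pair used for the average in section 3.4:
    println(combineByKey[String, Int, (Int, Int)](
      parts,
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (c1, c2) => (c1._1 + c2._1, c1._2 + c2._2)))
  }
}
```

foldByKey(zero)(f) corresponds to createCombiner = v => f(zero, v) with f for both merge functions, and aggregateByKey(zero)(seqOp, combOp) to v => seqOp(zero, v) with seqOp and combOp, matching the combineByKeyWithClassTag calls quoted above.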
package com.huc.Spark1.KeyAndValue
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object Test05_ {
def main(args: Array[String]): Unit = {
//1. Create a SparkConf and set the App name
val conf: SparkConf = new SparkConf().setAppName("SparkCore").setMaster("local[*]")
//2. Create a SparkContext, the entry point for submitting a Spark App
val sc: SparkContext = new SparkContext(conf)
//3. Spark programming in Scala
//3.1.1 Create an RDD
val rdd1 = sc.makeRDD(List(("a", 1), ("b", 5), ("a", 5), ("b", 2)))
//3.1.2 Sum the values for each key
rdd1.reduceByKey(_ + _).collect().foreach(println)
//3.1.3 Written with combineByKey
rdd1.combineByKey(
i => i,
(res: Int, elem: Int) => res + elem,
(res: Int, elem: Int) => res + elem
).collect().foreach(println)
println("+++++++++++++++++++++++++++++++")
// 3.2 Use foldByKey
//3.1 Create the first RDD
val list1: List[(String, Int)] = List(("a", 1), ("a", 3), ("a", 5), ("b", 7), ("b", 2), ("b", 4), ("b", 6), ("a", 7))
val rdd2 = sc.makeRDD(list1, 2)
//3.2 Sum the values per key (wordcount-style)
rdd2.foldByKey(0)(_ + _).collect().foreach(println)
// rdd2.foldByKey(10)(_ + _).collect().foreach(println)
// 3.2 Written with combineByKey
rdd2.combineByKey(
i => i,
(res: Int, elem: Int) => res + elem,
(res: Int, elem: Int) => res + elem
).collect().foreach(println)
// Equivalent to foldByKey(10): createCombiner, like the zero value, runs once per key per partition
rdd2.combineByKey(
i => i + 10,
(res: Int, elem: Int) => res + elem,
(res: Int, elem: Int) => res + elem
).collect().foreach(println)
println("_____________________________")
// 3.3 Use aggregateByKey
val rdd3: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("a", 3), ("b", 5), ("b", 7), ("b", 2), ("b", 4), ("b", 6), ("a", 7)), 2)
// 3.3 Take the maximum value for each key within each partition, then add the per-partition maxima together
rdd3.aggregateByKey(Int.MinValue)((res: Int, elem: Int) => math.max(res, elem)
, (res1: Int, res2: Int) => res1 + res2)
.collect().foreach(println)
// 3.3 Written with combineByKey
rdd3.combineByKey(
// combineByKey builds its initial value from the first element seen for each
// key (createCombiner consumes that element), so constants like the
// commented-out lines below would drop or distort the first value:
// i => Int.MinValue,
// i => i - Int.MinValue,
i => i,
(res: Int, elem: Int) => math.max(res, elem),
(res: Int, elem: Int) => res + elem
).collect().foreach(println)
println("*&************************")
// 3.4 Use combineByKey to compute the average per key
val list: List[(String, Int)] = List(("a", 88), ("b", 95), ("a", 91), ("b", 93), ("a", 95), ("b", 98))
val rdd4: RDD[(String, Int)] = sc.makeRDD(list, 2)
val value4: RDD[(String, (Int, Int))] = rdd4.combineByKey(
i => (i, 1),
(res: (Int, Int), elem: Int) => (res._1 + elem, res._2 + 1),
(res1: (Int, Int), res2: (Int, Int)) => (res1._1 + res2._1, res1._2 + res2._2)
)
value4.collect().foreach(println)
value4.mapValues({
case (sum, count) => sum.toDouble / count
}).collect().foreach(println)
//4. Stop the SparkContext
sc.stop()
}
}