Problem:
Given the key-value pairs ("spark", 2), ("hadoop", 6), ("hadoop", 4), ("spark", 6), where each key is a book title and each value is that book's sales on one day, compute the average value for each key, i.e. the average daily sales of each book.
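For this data the expected result is: spark averages (2 + 6) / 2 = 4, and hadoop averages (6 + 4) / 2 = 5.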
The implementation in Scala is as follows:
The commands are run in a Jupyter notebook with a Scala/Spark kernel, so the SparkContext sc is already available:
Command 1:
val rdd = sc.parallelize(Array(("spark", 2), ("hadoop", 6), ("hadoop", 4), ("spark", 6))) // create the RDD from a local collection
Output 1:
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:25
Command 2:
rdd.foreach(elem => println(elem)) // print each element; foreach is an action, so this actually runs the job
Output 2:
(spark,2)
(hadoop,6)
(hadoop,4)
(spark,6)
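Note that in a local notebook session the println output shows up directly, but on a real cluster the closure runs on the executors, so the lines would land in executor stdout rather than in the driver. For a small RDD like this one, collecting first is a safer way to inspect the contents from the driver, as a minor variant:
rdd.collect().foreach(println)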
Command 3:
rdd.mapValues(x => (x, 1)).foreach(elem => println(elem)) // mapValues is a transformation, not an action; it turns each value v into the pair (v, 1)
Output 3:
(spark,(2,1))
(hadoop,(6,1))
(spark,(6,1))
(hadoop,(4,1))
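Pairing each value x with a 1 produces (sales, count) tuples, so both the running sum and the running count can be accumulated in a single pass; the elements may also print in a different order than before, because they sit in different partitions. The same transformation can be written with map instead of mapValues, as a small equivalent sketch (mapValues is preferable here because it keeps the existing partitioning):
rdd.map { case (book, sales) => (book, (sales, 1)) } // same result as rdd.mapValues(x => (x, 1))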
Command 4:
rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).foreach(elem => println(elem)) // reduceByKey adds up the per-key sums and counts
Output 4:
(hadoop,(10,2))
(spark,(8,2))
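For each key, reduceByKey repeatedly applies the supplied function to the (sum, count) tuples: the two "spark" records (2,1) and (6,1) combine into (2 + 6, 1 + 1) = (8,2), and the two "hadoop" records (6,1) and (4,1) combine into (10,2). The combining function can be tried on its own in the notebook to see this step by step:
val combine = (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)
combine((2, 1), (6, 1)) // (8,2), the "spark" entry
combine((6, 1), (4, 1)) // (10,2), the "hadoop" entry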
Command 5:
rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(x => (x._1 / x._2)).foreach(elem => println(elem)) // divide each sum by its count to get the average
Output 5:
(spark,4)
(hadoop,5)
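The final mapValues divides an Int sum by an Int count, so this is integer division; it happens to be exact here (8 / 2 and 10 / 2), but if fractional averages are possible, converting to Double first avoids truncation. A minimal variant:
rdd.mapValues(x => (x, 1))
   .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
   .mapValues(x => x._1.toDouble / x._2)
   .foreach(println) // e.g. (hadoop,5.0) and (spark,4.0)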
Command 6:
rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(x => (x._1 / x._2)).collect() // collect() is an action that returns the result to the driver as an Array
Output 6:
res10: Array[(String, Int)] = Array((hadoop,5), (spark,4))
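For completeness, the same averages can be computed without building the (value, 1) pairs up front by using aggregateByKey, which takes a zero value for the (sum, count) accumulator. This is just an alternative sketch, not part of the original walkthrough:
val averages = rdd.aggregateByKey((0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1), // fold one sales value into a partition-local (sum, count)
    (a, b) => (a._1 + b._1, a._2 + b._2)  // merge two partition-local accumulators
  ).mapValues { case (sum, count) => sum / count }
averages.collect() // Array((hadoop,5), (spark,4))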