Basic pairRDD operations in Spark (Part 3) — with a wordcount program
Since a pairRDD is also an RDD (or rather, a subclass of RDD), pairRDDs support all the ordinary RDD operations as well. Below is a combined example: first a filter, then a simple map/reduce-style aggregation (mapValues followed by reduceByKey), and finally a small wordcount program.
This post mainly draws on the book *Learning Spark* (O'Reilly).
OK, on to the code.
val a = sc.parallelize(Array((1,2),(3,4),(3,6)))
a.collect().foreach(x => print(x + " "))
println(" ")
// filter operation
val b = a.filter { case (key, value) => value < 5 && key < 2 }
b.collect().foreach(x => print(x + " "))
println(" ")
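As a quick check of the predicate, here is the same filter mirrored on a plain Scala list (no SparkContext needed): only (1,2) survives, since (3,4) and (3,6) both fail key < 2.

```scala
// Plain-Scala mirror of the pairRDD filter above:
// keep only pairs where value < 5 AND key < 2.
val pairs = List((1, 2), (3, 4), (3, 6))
val kept = pairs.filter { case (key, value) => value < 5 && key < 2 }
println(kept) // List((1,2))
```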
val c = sc.parallelize(Array(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))
c.collect().foreach(x => print(x + " "))
println(" ")
val d = c.mapValues(x => (x,1))
d.collect().foreach(x => print(x + " "))
println(" ")
val e = d.reduceByKey((x,y) => (x._1 + y._1, x._2 + y._2))
e.collect().foreach(x => print(x + " "))
println(" ")
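The mapValues + reduceByKey pipeline above is the classic (sum, count) pattern from *Learning Spark*; the usual follow-up is dividing sum by count to get a per-key average. A plain-Scala sketch of what those two steps compute (groupBy here stands in for Spark's shuffle; the data is the same as in `c` above):

```scala
// Plain-Scala mirror of mapValues(x => (x,1)) + reduceByKey:
// per key, accumulate (sum of values, count of values).
val data = List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4))
val sumCounts = data
  .map { case (k, v) => (k, (v, 1)) }          // mapValues(x => (x, 1))
  .groupBy(_._1)                               // stands in for the shuffle
  .map { case (k, kvs) =>
    val (sums, counts) = kvs.map(_._2).unzip
    (k, (sums.sum, counts.sum))                // reduceByKey's combine step
  }
// The natural next step: per-key average = sum / count.
val averages = sumCounts.map { case (k, (sum, cnt)) => (k, sum.toDouble / cnt) }
println(averages)
```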
// wordcount example
val input = sc.textFile("hdfs://192.168.1.221:9000/wordcountinput/123")
input.collect().foreach(x => print(x + ","))
println(" ")
// step-by-step version
val words = input.flatMap(x => x.split(" "))
words.collect().foreach(x => print(x + ","))
println(" ")
val result1 = words.map(x => (x, 1))
result1.collect().foreach(x => print(x + " "))
println(" ")
val result2 = result1.reduceByKey((x,y)=>x+y)
result2.collect().foreach(x => print(x + " "))
println(" ")
// chained one-liner version
val result3 = input.flatMap(x => x.split(" ")).map(x => (x,1)).reduceByKey((x,y)=>x + y)
result3.collect().foreach(x => print(x + " "))
println(" ")
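To see exactly what the map + reduceByKey chain computes, here is the same wordcount mirrored on a plain Scala list (the two input lines are a made-up stand-in for the HDFS file, which I can't reproduce here):

```scala
// Plain-Scala mirror of: flatMap(split) -> map(word -> 1) -> reduceByKey(_ + _)
val lines = List("hello spark", "hello world") // stand-in for the HDFS input (assumption)
val counts = lines
  .flatMap(_.split(" "))                       // flatMap: lines -> words
  .groupBy(identity)                           // stands in for the shuffle by key
  .map { case (word, ws) => (word, ws.size) }  // reduce step: count per word
println(counts)
```

In Spark itself, when the result is small enough to fit on the driver, `countByValue()` on the words RDD is an even shorter way to get the same map.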
Below is a screenshot of the run:
The map/reduce step is illustrated with a diagram in the book, so I won't repeat the explanation here; see the figure: