import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("ScalaWordCount").setMaster("local[4]")
// Create the entry point for Spark execution
val sc = new SparkContext(conf)
// Specify where to read the data from when creating the RDD (Resilient Distributed Dataset)
val lines: RDD[String] = sc.textFile("/D:/a.txt")
// Split each line into words and flatten the result
val words: RDD[String] = lines.flatMap(_.split(" "))
// Pair each word with the count 1
val wordAndOne: RDD[(String, Int)] = words.map((_, 1)).map(x=>{
println(x + "--" + Thread.currentThread())
x
})
// Aggregate by key
// val reduced:RDD[(String,Int)] = wordAndOne.reduceByKey(_+_)
// Note: repartition returns a new RDD; the result here is discarded,
// so wordAndOne itself keeps its original number of partitions.
wordAndOne.repartition(4).getNumPartitions
val reduced: RDD[(String, Int)] = wordAndOne.reduceByKey((x, y) => {
val z = x + y
z
})
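To make the semantics of the reduceByKey call above concrete, here is a minimal sketch in plain Scala (no Spark): `reduceByKeyLocal` is a hypothetical helper that groups pairs by key and folds the values with the supplied function, which is what reduceByKey does across partitions.

```scala
// Hypothetical local stand-in for RDD.reduceByKey, for illustration only:
// group by key, then combine all values for a key with f.
def reduceByKeyLocal[K](pairs: Seq[(K, Int)])(f: (Int, Int) => Int): Map[K, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }

val wordAndOnePairs = Seq(("spark", 1), ("scala", 1), ("spark", 1))
val reducedLocally  = reduceByKeyLocal(wordAndOnePairs)(_ + _)
// reducedLocally contains spark -> 2 and scala -> 1
```

The `(x, y) => x + y` function passed to reduceByKey in the code above plays exactly the role of `f` here.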
// Sort by key in descending order
val sorted: RDD[(String, Int)] = reduced.sortBy(_._1, false)
val array = sorted.collect()
for (e <- array) println(e._1 + "--" + e._2)
// Release resources
sc.stop()
Note on `val sorted: RDD[(String, Int)] = reduced.sortBy(_._1, false)`: although sortBy is a transformation, it triggers a job internally (its RangePartitioner samples the data to compute range boundaries). reduceByKey first reduces each partition locally (map-side combine) and writes the partial results to local shuffle files (shuffle write); the final reduce happens after the shuffle.
When sorted.collect() runs, the post-shuffle reduce can proceed directly: the shuffle write output is already saved on the machines, so only a shuffle read is needed.
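The two-stage reduce described above can be sketched in plain Scala. This is an illustration, not Spark's implementation: `localCombine` stands in for the map-side combine whose output is written to shuffle files, and `mergePartials` stands in for the reduce that runs after the shuffle read.

```scala
// Stage 1 (map side / shuffle write): combine counts within one partition.
def localCombine(partition: Seq[(String, Int)]): Map[String, Int] =
  partition.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// Stage 2 (reduce side / shuffle read): merge the per-partition partial results.
def mergePartials(partials: Seq[Map[String, Int]]): Map[String, Int] =
  partials.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

val partitions = Seq(
  Seq(("spark", 1), ("spark", 1), ("scala", 1)),
  Seq(("spark", 1), ("hadoop", 1))
)
val partials = partitions.map(localCombine) // conceptually: shuffle write output
val merged   = mergePartials(partials)      // conceptually: read after the shuffle
// merged contains spark -> 3, scala -> 1, hadoop -> 1
```

Because the partial results (stage 1) already exist when the second job runs, only the merge step (stage 2) needs to execute, which is why collect() after the shuffle write can skip straight to the shuffle read.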