scala> val t = sc.textFile("README.md")
I. Finding the largest number of words on any single line of the file:
1. t.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
2. import java.lang.Math
t.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
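Both variants return only the word count of the longest line, not the line itself. The sketch below shows the same map/reduce shape on a plain Scala collection, so it runs without a Spark context. The `lines` data is invented for illustration and is not the contents of README.md. It also shows `maxBy` as one possible way to recover the line itself.

```scala
// Plain Scala collections stand in for the RDD so this runs without Spark.
// `lines` is made-up sample data, not the real README.md.
val lines = Seq(
  "Apache Spark",
  "Spark is a fast and general cluster computing system",
  "",
  "Run the shell"
)

// Same shape as the RDD version: map each line to its word count, then reduce to the maximum.
val maxWords = lines.map(line => line.split(" ").size).reduce((a, b) => math.max(a, b))

// To get the line itself rather than only its word count, maxBy keeps the element.
val longestLine = lines.maxBy(line => line.split(" ").size)
```

Note that an empty line still counts as one "word" here, because splitting an empty string yields a single empty token.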
II. Counting the total number of words, and the number of lines that contain a given string
t.map(l => l.split(" ").size).reduce(_+_)
t.filter(line => line.contains("b")).count()
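The filter above counts lines, and `contains("b")` is a substring match, so it reports how many lines contain the character "b" rather than how often a word occurs. As a rough sketch on a plain Scala collection (the sample `lines` are invented, and `"be"` is just an example word), the difference looks like this:

```scala
// Made-up sample data standing in for the RDD of lines.
val lines = Seq(
  "to be or not to be",
  "a bee is not a bird",
  "nothing here"
)

// Total number of words across all lines (same shape as the map/reduce on the RDD).
val totalWords = lines.map(l => l.split(" ").size).reduce(_ + _)

// Lines containing the substring "b": a substring match, counted once per line.
val linesWithB = lines.filter(line => line.contains("b")).size

// Occurrences of the exact word "be": split into words first, then compare each token.
val countOfBe = lines.flatMap(line => line.split(" ")).count(word => word == "be")
```

The same flatMap-then-filter idea should carry over to the RDD, with `count()` in place of the collection methods.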
III. Counting word frequencies
val wordCounts = t.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
res1: Array[(String, Int)] = Array((For,5), (processing.,1), (Programs,1), (Because,1), (The,1), (agree,1), (cluster,9)...
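The pipeline above is flatMap to words, pair each word with 1, then sum per key. A minimal sketch of the same idea on a plain Scala collection follows. Here `groupBy` plus `map` is a stand-in for `reduceByKey`, which plain collections do not have, and the sample `lines` are invented. The last step sorts by count, which the original transcript does not do.

```scala
// Made-up sample data standing in for the RDD of lines.
val lines = Seq("the quick brown fox", "the lazy dog", "the fox")

// flatMap to words, pair each with 1, then sum the 1s per word.
// groupBy + map stands in here for reduceByKey on an RDD.
val wordCounts: Map[String, Int] =
  lines
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .groupBy { case (word, _) => word }
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

// Most frequent words first; ties are broken alphabetically for a stable order.
val topWords = wordCounts.toSeq.sortBy { case (word, count) => (-count, word) }.take(2)
```

On the real file, tokens such as "processing." keep their punctuation because the split is on single spaces only, which is why they appear as separate keys in the output above.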
def sample(withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong): RDD[T]
t.sample(true,0.2).count()
...
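In `sample`, `fraction` is an expected fraction rather than an exact size, so `t.sample(true, 0.2).count()` will usually differ from run to run unless a seed is passed. With `withReplacement = true` an element may also be drawn more than once. The sketch below illustrates only the simpler without-replacement case on a plain Scala collection. `bernoulliSample` is a hypothetical helper written for this note, not a Spark API, and it is not meant to reproduce Spark's internal sampler.

```scala
import scala.util.Random

// Hypothetical helper (not part of Spark): keep each element independently
// with probability `fraction`, using a seeded generator for repeatability.
def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
  val rng = new Random(seed)
  data.filter(_ => rng.nextDouble() < fraction)
}

val data = 1 to 1000

// Two runs with the same seed select the same elements.
val s1 = bernoulliSample(data, 0.2, seed = 42L)
val s2 = bernoulliSample(data, 0.2, seed = 42L)

// The sample size is near 0.2 * 1000 on average, but is not guaranteed to equal it.
val sampleSize = s1.size
```

Passing an explicit seed to the RDD method in the same way should make repeated sampling reproducible.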
Links:
Understanding Spark's core abstraction, the RDD: http://www.infoq.com/cn/articles/spark-core-rdd/
Action operators: http://www.myexception.cn/other/1961287.html
Writing standalone applications with the Spark API in Scala (SBT): http://www.51studyit.com/html/notes/20150123/1094.html
Quick Start: http://blog.csdn.net/luyee2010/article/details/39291139