Project Requirements
Count how many times each word occurs in an English txt file. The file contains English text copied at random, for example:
The scientists re-analysed a sample collected by NASA astronauts during the 1972 Apollo mission.
What they found suggests large portions of the crust were formed at temperatures in excess of 2,300 degrees Celsius, which they say, could have been achieved by the melting of the outer layer.
These temperatures are incredibly high and suggest a terrific impact helped not only to destroy the lunar surface, but to build it. The idea overturns previous theories that colliding asteroids and comets were a purely destructive process with the lunar crust being created by magmas rising from the interior.
Workflow
- Initialize the Spark configuration.
- Read every txt file in the folder with the `textFile` method; each element of the resulting RDD is one line of a txt file.
- Split each line on whitespace with the `flatMap` method (unlike the ordinary `map` method, which returns one element per input, `flatMap` can return a whole sequence) and emit every word of the line as a `(word, 1)` key-value pair.
- Aggregate all pairs with the same key using `reduceByKey`; the aggregation function `lambda x, y: x + y` adds the values pairwise, yielding the total count for each word.
- Sort by frequency with `sortBy`, where `lambda x: x[1]` selects the second item of each key-value pair, i.e. the value (the count).
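The difference between `map` and `flatMap` described above can be illustrated with a pure-Python analogy (no Spark involved; the list comprehension stands in for `map`, and flattening with `itertools.chain` stands in for `flatMap`):

```python
from itertools import chain

lines = ["to be or", "not to be"]

# map-like: one output element per input line -> a list of lists
mapped = [line.split() for line in lines]

# flatMap-like: the per-line lists are flattened into one sequence
flat = list(chain.from_iterable(line.split() for line in lines))

print(mapped)  # [['to', 'be', 'or'], ['not', 'to', 'be']]
print(flat)    # ['to', 'be', 'or', 'not', 'to', 'be']
```

Because `flatMap` flattens the per-line word lists, the downstream RDD contains individual `(word, 1)` pairs rather than one list per line.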
```python
from pyspark import SparkContext, SparkConf

def split(x):
    """Split one line into a list of (word, 1) pairs."""
    words = x.split()
    return [(word, 1) for word in words]

# Set up the SparkContext
conf = SparkConf().setMaster("local[*]").setAppName("My App")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")

# Read the input file; each RDD element is one line
rdd = sc.textFile('words.txt')

# Emit (word, 1) pairs, sum the counts per word, sort by count descending
words = rdd.flatMap(split)
count = words.reduceByKey(lambda x, y: x + y).sortBy(lambda x: x[1],
                                                     ascending=False)
count.foreach(print)

# Stop Spark
sc.stop()
```
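As a quick sanity check of the counting logic, the same pipeline can be reproduced in plain Python with `collections.Counter`. The input line below is illustrative (not taken from the sample file); like the Spark job, the split is on whitespace and is case-sensitive:

```python
from collections import Counter

# Illustrative input line (not from the sample file)
line = "to be or not to be that is the question"

# Mirrors flatMap(split) + reduceByKey: whitespace-split words, summed counts
counts = Counter(line.split())

# Mirrors sortBy(lambda x: x[1], ascending=False): sort by count, descending
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # 'to' and 'be' each appear twice, every other word once
```

Note that no normalization is applied, so "The" and "the" count as different words and trailing punctuation stays attached (e.g. "mission."); lowercasing and stripping punctuation before counting would merge such variants.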