[Spark Starter Project] Word Count

Project Requirements

The task is to count how many times each word appears in an English .txt file. The file contains some English text copied in at random, for example:

The scientists re-analysed a sample collected by NASA astronauts during the 1972 Apollo mission.
What they found suggests large portions of the crust were formed at temperatures in excess of 2,300 degrees Celsius, which they say, could have been achieved by the melting of the outer layer.
These temperatures are incredibly high and suggest a terrific impact helped not only to destroy the lunar surface, but to build it. The idea overturns previous theories that colliding asteroids and comets were a purely destructive process with the lunar crust being created by magmas rising from the interior.

Workflow

  1. Initialize the Spark configuration.
  2. Read the input via the textFile method; it accepts either a single .txt file or a folder, in which case every .txt file inside is read.
  3. Each RDD element is one line of the file. Use flatMap to split each line on whitespace and emit every word as a key-value pair (word, 1). Unlike map, which returns exactly one output element per input element, flatMap may return a whole sequence, which is then flattened into the RDD (see the sketch after this list).
  4. Use reduceByKey to aggregate all pairs that share the same key; the reduce function lambda x, y: x + y adds the counts together pairwise.
  5. Use sortBy to sort by frequency; lambda x: x[1] selects the second element of each key-value pair, i.e. the value (the count).
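A minimal sketch of steps 3-5 on two in-memory lines (the sample lines and app name here are made up for illustration; sc.parallelize stands in for textFile so no input file is needed):

from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local[*]").setAppName("wordcount-sketch")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")

lines = sc.parallelize(["a b a", "c b"])

# Step 3: flatMap flattens the per-line lists of (word, 1) pairs into one RDD;
# map would instead keep one list element per input line.
pairs = lines.flatMap(lambda line: [(w, 1) for w in line.split()])
print(pairs.collect())   # [('a', 1), ('b', 1), ('a', 1), ('c', 1), ('b', 1)]

# Step 4: reduceByKey adds the counts of identical words together pairwise.
counts = pairs.reduceByKey(lambda x, y: x + y)

# Step 5: sortBy orders the pairs by their value (x[1]), highest count first.
print(counts.sortBy(lambda x: x[1], ascending=False).collect())
# e.g. [('a', 2), ('b', 2), ('c', 1)] -- the order of equal counts may vary

sc.stop()

The complete script below applies the same pipeline to words.txt.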
from pyspark import SparkContext, SparkConf

def split(x):
    # Split one line on whitespace and emit a (word, 1) pair for every word.
    words = x.split()
    return [(word, 1) for word in words]

# Initialize the Spark configuration and create the SparkContext (step 1).
conf = SparkConf().setMaster("local[*]").setAppName("My App")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")

# Read the input file; each RDD element is one line of text (step 2).
rdd = sc.textFile('words.txt')
# Flatten the per-line (word, 1) pairs into a single RDD (step 3).
words = rdd.flatMap(split)
# Sum the counts per word, then sort by count in descending order (steps 4-5).
count = words.reduceByKey(lambda x, y: x+y).sortBy(lambda x: x[1],
                                                   ascending=False)
# Print each (word, count) pair; foreach runs on the executors, and with
# local[*] the output appears in the same console.
count.foreach(print)
# Stop the SparkContext to release resources.
sc.stop()
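When only the most frequent words are needed, a full sortBy over every key can be avoided. The sketch below (same assumed file name words.txt) uses takeOrdered to pull just the top 10 pairs back to the driver as a plain Python list:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local[*]").setAppName("WordCount TopN")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")

counts = (sc.textFile('words.txt')
            .flatMap(lambda line: [(word, 1) for word in line.split()])
            .reduceByKey(lambda x, y: x + y))

# takeOrdered keeps only the 10 smallest items under the given key (the negated
# count, so effectively the 10 most frequent words) and returns a Python list.
for word, freq in counts.takeOrdered(10, key=lambda x: -x[1]):
    print(word, freq)

sc.stop()

Compared with sortBy followed by foreach, this prints on the driver, which is usually more convenient when inspecting results interactively.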