[Spark Beginner Project] Word Frequency Count

Latest recommended article published on 2024-04-02 13:01:45

腾阳山泥若

Category: Spark

Article link: https://blog.csdn.net/weixin_43486780/article/details/107736025

Project requirements

Count how many times each word occurs in an English txt file. The file contains arbitrary English text copied from elsewhere, for example:

The scientists re-analysed a sample collected by NASA astronauts during the 1972 Apollo mission.
What they found suggests large portions of the crust were formed at temperatures in excess of 2,300 degrees Celsius, which they say, could have been achieved by the melting of the outer layer.
These temperatures are incredibly high and suggest a terrific impact helped not only to destroy the lunar surface, but to build it. The idea overturns previous theories that colliding asteroids and comets were a purely destructive process with the lunar crust being created by magmas rising from the interior.

Workflow

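1. Initialize the Spark configuration and create a SparkContext.
2. Read the input with textFile. The path can be a single file, as with words.txt here, or a directory, in which case every file in it is read.
3. Each element of the resulting RDD is one line of text. flatMap splits each line on whitespace and emits one (word, 1) key-value pair per word. Unlike map, which produces exactly one output element per input element, flatMap lets the function return a sequence and flattens those sequences into a single RDD.
4. reduceByKey merges the values of all pairs that share the same key (the word). The merge function lambda x, y: x+y adds the values two at a time, which yields the total count for each word.
5. sortBy orders the result by frequency: lambda x: x[1] selects the second item of each key-value pair, i.e. the count, and ascending=False puts the most frequent words first.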
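
Before the Spark version, it may help to trace the same steps in plain Python. The sketch below is only an analogue that runs without Spark: the two sample lines are made up for illustration, and the list comprehension, dictionary loop, and sorted call stand in for flatMap, reduceByKey, and sortBy respectively.

```python
from collections import defaultdict

# Two made-up input lines, standing in for the lines of the RDD
lines = [
    "the moon and the crust",
    "the impact",
]

# map analogue: one output element (a list of words) per input line
nested = [line.split() for line in lines]

# flatMap analogue: the per-line lists are flattened into one list of (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey analogue: values that share a key are combined with x + y
counts = defaultdict(int)
for word, one in pairs:
    counts[word] = counts[word] + one

# sortBy(lambda x: x[1], ascending=False) analogue: order by the count, largest first
result = sorted(counts.items(), key=lambda x: x[1], reverse=True)

print(len(nested), len(pairs))  # 2 lines become 7 pairs
print(result[0])                # ('the', 3)
```

The map analogue keeps one element per line, while the flatMap analogue yields one element per word, which is why the word count needs flatMap.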

from pyspark import SparkContext, SparkConf

def split(x):
    words = x.split()
    return [(word, 1) for word in words]

# set sparkcontext
conf = SparkConf().setMaster("local[*]").setAppName("My App")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")

rdd = sc.textFile('words.txt')
words = rdd.flatMap(split)
count = words.reduceByKey(lambda x, y: x+y).sortBy(lambda x: x[1],
                                                   ascending=False)
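# collect() brings the sorted pairs back to the driver so they print in order;
# foreach(print) runs on the executors and does not guarantee the sorted order
for pair in count.collect():
    print(pair)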
# stop spark
sc.stop()
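
One thing to watch with the sample text: splitting on whitespace alone counts "The" and "the" as different words and keeps trailing punctuation, as in "mission.". A possible refinement, shown here as a sketch only, is to normalize each line before emitting pairs. The function name tokenize and the regular expression are assumptions for illustration, not part of the original project.

```python
import re

# Hypothetical pattern: runs of letters/digits, optionally joined by "-" or "'"
WORD_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")

def tokenize(line):
    """Hypothetical drop-in for split(): lowercase the line and strip punctuation."""
    words = WORD_RE.findall(line.lower())
    return [(word, 1) for word in words]

print(tokenize("The crust, the Crust."))
# [('the', 1), ('crust', 1), ('the', 1), ('crust', 1)]
```

Under these assumptions, passing tokenize to flatMap in place of split would leave the rest of the pipeline unchanged. Note that this simple pattern also splits numbers such as 2,300 at the comma.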