Writing a Chinese WordCount for Spark in Python with the jieba word segmentation library

Environment setup link: setting up a Python 3 (pyspark) environment for Spark 2.3 on Windows 10

The IDE used is PyCharm.

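Open WordCount.py and enter the following code. It is a Chinese version of WordCount, a classic distributed program: it uses the Chinese word segmentation library jieba to tokenize the text, removes stop words, and then counts the remaining words.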

from pyspark.context import SparkContext
import jieba

sc = SparkContext("local", "WordCount")   # initialize a local SparkContext
data = sc.textFile(r"D:\WordCount.txt")   # the input file must be UTF-8 encoded

# Load the Chinese stop-word list, one word per line
with open(r'd:\中文停用词库.txt', 'r', encoding='utf-8') as f:
    stop = [line.replace('\n', '') for line in f]

# Also drop punctuation, whitespace and a few tokens specific to this text
stop.extend([',','的','我','他','','。',' ','\n','?',';',':','-','(',')','!','1909','1920','325','B612','II','III','IV','V','VI','—','‘','’','“','”','…','、'])
stop = set(stop)   # a set makes the per-token membership test fast

# Tokenize each line, drop stop words, count each word, sort by count (descending)
data = (data.flatMap(lambda line: jieba.cut(line, cut_all=False))
            .filter(lambda w: w not in stop)
            .map(lambda w: (w, 1))
            .reduceByKey(lambda w0, w1: w0 + w1)
            .sortBy(lambda x: x[1], ascending=False))

print(data.take(100))   # the 100 most frequent words
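To make it easier to see what each stage of the RDD chain computes, here is a minimal plain-Python sketch of the same steps that runs without Spark or jieba. It is only an illustration: `tokenize` is a hypothetical stand-in that splits on whitespace (jieba.cut would segment Chinese text instead), and the sample lines and stop words are made up.

```python
from collections import Counter
from itertools import chain


def tokenize(line):
    # Hypothetical stand-in for jieba.cut: split on whitespace only.
    return line.split()


def word_count(lines, stop_words, top_n):
    # flatMap: merge the tokens of all lines into one flat stream
    tokens = chain.from_iterable(tokenize(line) for line in lines)
    # filter: drop stop words
    kept = (w for w in tokens if w not in stop_words)
    # map + reduceByKey: add 1 per occurrence of each word
    counts = Counter(kept)
    # sortBy(count, descending) followed by take(top_n)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


# Made-up sample data, not taken from the original input file
sample_lines = [
    "the fox met the prince",
    "the prince tamed the fox",
    "the prince left",
]
sample_stop = {"the"}

print(word_count(sample_lines, sample_stop, 2))
```

The difference in the Spark version is that each of these steps runs on partitions of the data, and reduceByKey combines the partial counts across partitions.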

The output is:

C:\Anaconda3.5.2.0\python.exe D:/Project/WordCount.py

WARNING: An illegal reflective access operation has occurred

WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/D:/spark/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()

WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil

WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations

WARNING: All illegal access operations will be denied in a future release

2100-01-01 10:00:00 WARN  NativeCodeLoader:100 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

2100-01-01 10:00:00 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

[Stage 0:>                                                          (0 + 1) / 1]Building prefix dict from the default dictionary ...

Loading model from cache C:\Temp\jieba.cache

Loading model cost 0.9 seconds.

Prefix dict has been built succesfully.

[('小王子', 419), ('说', 360), ('没有', 200), ('一个', 199), ('说道', 120), ('星星', 119), ('星球', 104), ('会', 98), ('回答', 91), ('地方', 80), ('国王', 78), ('画', 74), ('狐狸', 72), ('知道', 68), ('中', 67), ('花', 64), ('羊', 62), ('一只', 61), ('道', 57), ('非常', 56), ('看到', 53), ('命令', 52), ('有点', 50), ('这是', 48), ('不会', 48), ('朋友', 47), ('沙漠', 46), ('走', 46), ('地理学家', 46), ('.', 45), ('时', 43), ('想', 42), ('事', 42), ('感到', 42), ('行星', 42), ('问题', 41), ('可能', 40), ('真', 40), ('重要', 39), ('猴面包树', 38), ('&#', 38), ('39', 38), (';', 38), ('时间', 37), ('象', 36), ('问', 36), ('笑', 36), ('地球', 36), ('里', 35), ('爱', 34), ('花儿', 34), ('这种', 32), ('喜欢', 32), ('做', 32), ('蛇', 32), ('驯服', 32), ('一点', 31), (':', 31), ('看着', 30), ('一种', 30), ('发现', 30), ('一定', 30), ('一颗', 30), ('\u3000', 30), ('你好', 30), ('点灯', 30), ('探察', 30), ('大人', 29), ('家', 29), ('东西', 28), ('看见', 28), ('好象', 28), ('这位', 28), ('提出', 28), ('问道', 28), ('应该', 28), ('吃', 28), ('一天', 28), ('请', 27), ('住', 27), ('起来', 27), ('现在', 27), ('奇怪', 26), ('从来', 26), ('已经', 26), ('明白', 26), ('朵花', 26), ('路灯', 26), ('寻找', 26), ('十分', 24), ('小家伙', 24), ('是从', 24), ('地说', 24), ('年', 24), ('自言自语', 24), ('虚荣', 24), ('生活', 22), ('严肃', 22), ('工作', 22), ('想要', 22)]

Process finished with exit code 0
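The tokens '&#', '39' and ';' in the output above each occur 38 times, which suggests that the input text contains the HTML entity `&#39;` (an escaped apostrophe) and that the tokenizer split it into pieces. The token '\u3000' is the full-width space. Below is a minimal sketch, assuming that is the cause, of cleaning each line before tokenization with the standard library's `html.unescape`. The helper name `clean_line` is hypothetical and not part of the original program.

```python
import html


def clean_line(line):
    # Decode HTML entities such as &#39; back into the characters they represent
    decoded = html.unescape(line)
    # Replace the full-width space U+3000 with a normal space, then trim both ends
    return decoded.replace("\u3000", " ").strip()


print(clean_line("\u3000It&#39;s a sheep"))
```

In the Spark job this could be applied with `data.map(clean_line)` before the flatMap step, so that the leftover entity fragments no longer appear among the most frequent words.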

