Python 3.6: Using jieba to segment Chinese text, remove stopwords, and count word frequencies

from collections import Counter

import jieba

# jieba.load_userdict('userdict.txt')  # optionally load a custom dictionary

# Build the stopword list
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

# Segment a sentence and drop stopwords
def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    stopwords = stopwordslist('stop_words.txt')  # path to the stopword file
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':
                outstr += word
                outstr += " "
    return outstr

inputs = open('wordsbag2.txt', 'r', encoding='utf-8')   # path of the file to process
outputs = open('result2.txt', 'w', encoding='utf-8')    # path of the segmented output file
for line in inputs:
    line_seg = seg_sentence(line)  # the return value is a space-separated string
    outputs.write(line_seg)
outputs.close()
inputs.close()

# WordCount
with open('result2.txt', 'r', encoding='utf-8') as fr:  # read the file with stopwords already removed
    # the text is already segmented and space-separated, so just split it
    data = fr.read().split()
data = dict(Counter(data))

with open('wordcount2.txt', 'w', encoding='utf-8') as fw:  # file that stores the word counts
    for k, v in data.items():
        fw.write('%s,%d\n' % (k, v))
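The loop above writes the counts in whatever order the dictionary happens to hold them. If a frequency-ranked file is more useful, `Counter.most_common` returns the pairs already sorted. Below is a minimal sketch under that assumption; `result2.txt` is the segmented file produced above, while `wordcount_sorted.txt` is a hypothetical output name.

```python
from collections import Counter

# Minimal sketch: same counting step as above, but written out sorted by
# frequency, highest first. result2.txt comes from the script above;
# wordcount_sorted.txt is a hypothetical output filename.
with open('result2.txt', 'r', encoding='utf-8') as fr:
    counts = Counter(fr.read().split())

with open('wordcount_sorted.txt', 'w', encoding='utf-8') as fw:
    for word, freq in counts.most_common():
        fw.write('%s,%d\n' % (word, freq))
```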

Here is how to use Python to segment《西游记》(Journey to the West), filter out stopwords, and count word frequencies. A stopword list is needed first. The steps are as follows:

1. Download a stopword list. You can take one from an NLP toolkit such as NLTK, or find one on GitHub; the GitHub list is used here.

```python
import urllib.request
import os

if not os.path.exists('stopwords.txt'):
    print('Downloading stopwords...')
    url = 'https://raw.githubusercontent.com/goto456/stopwords/master/stopwords.txt'
    urllib.request.urlretrieve(url, 'stopwords.txt')
    print('Stopwords download complete.')
```

2. Read the text of Journey to the West.

```python
with open('journey_to_the_west.txt', 'r', encoding='utf-8') as f:
    text = f.read()
```

3. Segment the text, here using the jieba library.

```python
import jieba

words = jieba.lcut(text)
```

4. Filter out the stopwords.

```python
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = f.read().split('\n')

filtered_words = []
for word in words:
    if word not in stopwords and word != '\n':
        filtered_words.append(word)
```

5. Count the word frequencies.

```python
from collections import Counter

word_count = Counter(filtered_words)
print(word_count.most_common(20))
```

The full script:

```python
import urllib.request
import os
import jieba
from collections import Counter

if not os.path.exists('stopwords.txt'):
    print('Downloading stopwords...')
    url = 'https://raw.githubusercontent.com/goto456/stopwords/master/stopwords.txt'
    urllib.request.urlretrieve(url, 'stopwords.txt')
    print('Stopwords download complete.')

with open('journey_to_the_west.txt', 'r', encoding='utf-8') as f:
    text = f.read()

words = jieba.lcut(text)

with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = f.read().split('\n')

filtered_words = []
for word in words:
    if word not in stopwords and word != '\n':
        filtered_words.append(word)

word_count = Counter(filtered_words)
print(word_count.most_common(20))
```

This script prints the 20 most frequent words and their counts.
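One further note: the first script contains a commented-out `jieba.load_userdict('userdict.txt')` line. If names or domain-specific terms in your text come out split into fragments, a user dictionary tells jieba to keep them whole before the counting step. A minimal sketch, assuming a hypothetical `userdict.txt` in jieba's one-entry-per-line format:

```python
import jieba

# Hypothetical userdict.txt, one entry per line in jieba's
# "word [frequency] [POS tag]" format, e.g.:
#   美猴王 10 nr
#   花果山 10 ns
jieba.load_userdict('userdict.txt')

# Single terms can also be registered directly in code.
jieba.add_word('齐天大圣')

# Registered terms come out as single tokens when segmenting.
print(jieba.lcut('齐天大圣美猴王回到了花果山'))
```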
