Python Chatbot Project Development (Part 4)
Learning basic Chinese word segmentation
Install the jieba package (pip install jieba).
jieba1.py
import jieba

text = '我们一定可以顺利地走出困境'  # text to segment
text_cut = jieba.cut(text)           # segment; returns a generator of tokens
print(' '.join(text_cut))
Output:
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\Dell\AppData\Local\Temp\jieba.cache
Loading model cost 2.750 seconds.
Prefix dict has been built successfully.
我们 一定 可以 顺利 地 走出 困境
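Under the hood, jieba builds a prefix dictionary from its word list and picks the most probable segmentation, falling back to an HMM for words not in the dictionary. As a rough illustration of the dictionary-based idea only, here is a minimal forward-maximum-matching segmenter in pure Python; the tiny dictionary is made up for this one sentence and is not jieba's:

```python
def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word that matches; fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens

# toy dictionary covering this sentence only (hypothetical, not jieba's)
toy_dict = {'我们', '一定', '可以', '顺利', '走出', '困境'}
print(' '.join(fmm_segment('我们一定可以顺利地走出困境', toy_dict)))
# prints: 我们 一定 可以 顺利 地 走出 困境
```

On this sentence the greedy dictionary match happens to reproduce jieba's output; jieba's probabilistic approach handles ambiguous sentences that greedy matching gets wrong.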
jieba2.py
import jieba
import re
import collections  # jieba for segmentation, re for cleanup, Counter for frequencies

text = 'test.txt'        # file to analyze
num = 10                 # how many top words to report
sw = 'stop_words.txt'    # stop-word list, one word per line

with open(text, 'r', encoding='UTF-8') as fn:
    string_data = fn.read()

# strip whitespace and punctuation; note the pattern must not end with a
# trailing '|', which would add an empty alternative that matches everywhere
pattern = re.compile(r'\t|\n|\.|-|:|;|\)|\(|\?|“|”|,|。')
string_data = re.sub(pattern, '', string_data)

text_cut = jieba.cut(string_data, cut_all=False, HMM=True)  # accurate mode, HMM for unknown words

with open(sw, 'r', encoding='UTF-8') as useless_file:
    stopwords = set(useless_file.read().split('\n'))
stopwords.add(' ')

result_list = [word for word in text_cut if word not in stopwords]

word_counts = collections.Counter(result_list)  # count word frequencies
word_top = word_counts.most_common(num)         # the num most frequent words

print('\nWord\tFrequency')
print('_____________')
for top_word, frequency in word_top:
    print(top_word + '\t' + str(frequency))
Output:
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Dell\AppData\Local\Temp\jieba.cache
Loading model cost 1.948 seconds.
Prefix dict has been built successfully.
Word    Frequency
_____________
投资 20
收益 17
的 12
元 11
净 10
= 10
回收 9
时间 9
项目 8
每天 8
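The counting stage is independent of jieba: once you have a token list, collections.Counter does all the work. A minimal sketch of the same cleanup, stop-word filtering, and frequency-counting pipeline, using a hypothetical English sentence and stop-word set so it runs without jieba or the test files:

```python
import re
import collections

raw = 'Investment returns: the investment pays 20 yuan per day; returns recovered in 9 days.'

# strip punctuation with a character class (the alternation pattern used
# above works too, as long as it has no trailing '|')
cleaned = re.sub(r'[\t\n.\-:;()?,]', '', raw.lower())

stopwords = {'the', 'is', 'in', 'per'}  # hypothetical stop-word list
tokens = [w for w in cleaned.split() if w and w not in stopwords]

word_counts = collections.Counter(tokens)
for word, freq in word_counts.most_common(3):
    print(word + '\t' + str(freq))
```

The top entries here are 'investment' and 'returns' with a count of 2 each, mirroring how 投资 and 收益 dominate the table above once stop words and punctuation are removed.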