Python Chatbot Project Development (Part 4)
Learning basic Chinese word segmentation
Install the jieba package (pip install jieba).
jieba1.py
import jieba

text = '我们一定可以顺利地走出困境'  # text to segment
text_cut = jieba.cut(text)           # segment; returns a generator of tokens
print(' '.join(text_cut))
Output:
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\Dell\AppData\Local\Temp\jieba.cache
Loading model cost 2.750 seconds.
Prefix dict has been built successfully.
我们 一定 可以 顺利 地 走出 困境
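Under the hood, jieba builds a prefix dictionary from its word list and picks the most probable segmentation, falling back to an HMM for words not in the dictionary. As a rough illustration of the dictionary-based idea only, here is a minimal forward-maximum-matching segmenter in pure Python; the tiny dictionary is made up for this one sentence and is not jieba's:

```python
def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word that matches; fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens

# toy dictionary covering this sentence only (hypothetical, not jieba's)
toy_dict = {'我们', '一定', '可以', '顺利', '走出', '困境'}
print(' '.join(fmm_segment('我们一定可以顺利地走出困境', toy_dict)))
# prints: 我们 一定 可以 顺利 地 走出 困境
```

On this sentence the greedy dictionary match happens to reproduce jieba's output; jieba's probabilistic approach handles ambiguous sentences that greedy matching gets wrong.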
jieba2.py
import jieba
import re
import collections  # jieba for segmentation, re for cleanup, Counter for frequencies

text = 'test.txt'        # file to analyze
num = 10                 # how many top words to report
sw = 'stop_words.txt'    # stop-word list, one word per line

with open(text, 'r', encoding='UTF-8') as fn:
    string_data = fn.read()

# strip whitespace and punctuation; note the pattern must not end with a
# trailing '|', which would add an empty alternative that matches everywhere
pattern = re.compile(r'\t|\n|\.|-|:|;|\)|\(|\?|“|”|,|。')
string_data = re.sub(pattern, '', string_data)

text_cut = jieba.cut(string_data, cut_all=False, HMM=True)  # accurate mode, HMM for unknown words

with open(sw, 'r', encoding='UTF-8') as useless_file:
    stopwords = set(useless_file.read().split('\n'))
stopwords.add(' ')

result_list = [word for word in text_cut if word not in stopwords]

word_counts = collections.Counter(result_list)  # count word frequencies
word_top = word_counts.most_common(num)         # the num most frequent words

print('\nWord\tFrequency')
print('_____________')
for top_word, frequency in word_top:
    print(top_word + '\t' + str(frequency))
Output:
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Dell\AppData\Local\Temp\jieba.cache
Loading model cost 1.948 seconds.
Prefix dict has been built successfully.
Word    Frequency
_____________
投资 20
收益 17
的 12
元 11
净 10
= 10
回收 9
时间 9
项目 8
每天 8
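The counting stage is independent of jieba: once you have a token list, collections.Counter does all the work. A minimal sketch of the same cleanup, stop-word filtering, and frequency-counting pipeline, using a hypothetical English sentence and stop-word set so it runs without jieba or the test files:

```python
import re
import collections

raw = 'Investment returns: the investment pays 20 yuan per day; returns recovered in 9 days.'

# strip punctuation with a character class (the alternation pattern used
# above works too, as long as it has no trailing '|')
cleaned = re.sub(r'[\t\n.\-:;()?,]', '', raw.lower())

stopwords = {'the', 'is', 'in', 'per'}  # hypothetical stop-word list
tokens = [w for w in cleaned.split() if w and w not in stopwords]

word_counts = collections.Counter(tokens)
for word, freq in word_counts.most_common(3):
    print(word + '\t' + str(freq))
```

The top entries here are 'investment' and 'returns' with a count of 2 each, mirroring how 投资 and 收益 dominate the table above once stop words and punctuation are removed.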