Word Frequency Statistics of Character Appearances in Romance of the Three Kingdoms

代码拖拉鸡

Category: python. Tags: self-study notes

Article link: https://blog.csdn.net/qq_38290604/article/details/86837106

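This article uses the jieba library to extract words from the text of Romance of the Three Kingdoms, and stresses that the text must be stored as UTF-8 to avoid errors. Running the code shows that non-name words such as '二人' ("the two of them") and '玄德曰' ("Xuande said") are wrongly extracted. The proposed refinement is to filter out high-frequency non-name words and merge synonyms, so that character names are extracted more accurately.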

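The jieba library is used to split the text into words. Note that the text file must be saved as UTF-8. The code opens it with encoding="utf-8", so a file stored in another encoding (GBK, for example) will typically fail with a UnicodeDecodeError when it is read.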
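If the downloaded text is not already UTF-8, it can be re-encoded before running the word count. The following is a minimal sketch, not part of the original post: the helper name `convert_to_utf8` is hypothetical, and the default source encoding `gb18030` is an assumption (it covers the GBK/GB2312 files commonly produced on Chinese-language Windows systems). Adjust it to whatever encoding the file actually uses.

```python
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="gb18030"):
    """Re-encode a text file to UTF-8 and return the decoded text.

    src_encoding is an assumption; change it if the file uses
    a different encoding.
    """
    # Decode with the (assumed) original encoding.
    text = Path(src).read_text(encoding=src_encoding)
    # Write the same text back out as UTF-8.
    Path(dst).write_text(text, encoding="utf-8")
    return text
```

For example, `convert_to_utf8("threekingdoms_gbk.txt", "threekingdoms.txt")` would produce the file that the code below expects (the input file name here is made up for illustration).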

Code

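import jieba

# The file must be saved as UTF-8.
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()

words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens such as punctuation
        continue
    counts[word] = counts.get(word, 0) + 1

# Sort by frequency, highest first, and print the top 15 entries.
items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))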
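The summary at the top notes that this first pass also picks up non-name words such as '二人' and '玄德曰', and that the fix is to filter out such words and merge different forms of the same name. A rough sketch of that idea follows. The function name `count_names` and the contents of `EXCLUDES` and `ALIASES` are illustrative assumptions, not the author's actual lists; in practice they would be built up by inspecting the output and rerunning.

```python
from collections import Counter

# Illustrative lists only (assumptions); extend them after looking at the output.
EXCLUDES = {"二人", "却说", "不可", "不能", "如此"}
ALIASES = {
    "玄德曰": "刘备",
    "玄德": "刘备",
    "孔明曰": "孔明",
    "诸葛亮": "孔明",
}


def count_names(words, excludes=EXCLUDES, aliases=ALIASES):
    """Count words, merging aliases and dropping known non-name words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:  # skip single-character tokens
            continue
        word = aliases.get(word, word)  # map variants to one canonical name
        if word in excludes:  # drop high-frequency non-name words
            continue
        counts[word] += 1
    return counts
```

With the segmented list from the code above, `count_names(words).most_common(15)` would give the 15 most frequent entries after filtering and merging.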