Word segmentation and word-frequency counting with jieba

weixin_52593633

Last modified: 2022-12-02 12:27:22


Category: Notes. Tags: word, python, Chinese word segmentation

First published: 2022-12-02 12:25:29

Article link: https://blog.csdn.net/weixin_52593633/article/details/128145936


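The text used here is How the Steel Was Tempered (《钢铁是怎样炼成的》).

Chapter 8 original text: .txt link

Chapter 9 original text: .txt link

The text file and the code must be in the same directory.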
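Strictly speaking, `open()` with a bare filename resolves it against the current working directory, which matches the script's directory only when the script is launched from there. A minimal sketch of reading the file from an explicit directory follows; the helper name `read_chapter` and the UTF-8 default are assumptions for illustration, not part of the original code.

```python
from pathlib import Path


def read_chapter(filename, base_dir=None, encoding="utf-8"):
    """Read a text file from base_dir (hypothetical helper).

    If base_dir is None, the current working directory is used,
    which is what a bare open(filename) would do as well.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    # The encoding is an assumption; switch to e.g. "gbk" if decoding fails.
    return (base / filename).read_text(encoding=encoding)
```

Inside a script, passing `Path(__file__).resolve().parent` as `base_dir` would make the lookup independent of where the script is started from.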

import jieba

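# Words to drop from the result. This set was built by hand from earlier runs;
# add more if the output is still noisy (for example "今天", which shows up in the result).
excludes={"他们","我们","还有","这个","一个","时候","没有","已经","这样","什么","一样","就是"}
# Assumes the file is UTF-8 encoded; change the encoding (e.g. "gbk") if decoding fails.
with open("钢铁是怎样炼成的第八章.txt","r",encoding="utf-8") as f:
    txt=f.read()

# Default (precise) mode. cut_all=True switches to full mode, which in the author's runs also output English symbols; not recommended.
words=jieba.lcut(txt)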
counts={}
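for word in words:
    if len(word)==1: # skip single-character tokens
        continue
    elif word=="保尔" or word=="保尔柯察金":
        reword="保尔" # merge the full name into the short form
    elif word=="达雅" or word=="达雅柯察金":
        reword="达雅"
    else:
        reword=word
    counts[reword]=counts.get(reword,0)+1  # count occurrences
for word in excludes: # remove filler words and other uninformative words
    counts.pop(word,None) # pop with a default avoids KeyError if the word never occurred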

items=list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)  # sort by count, descending
for word,count in items[:10]: # top 10; slicing is safe even with fewer than 10 entries
    print("{0:<10}{1:>5}".format(word,count))
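The counting, alias merging, and exclusion steps above can also be sketched with `collections.Counter` from the standard library. This is only an alternative sketch, not the original author's method: the names `top_words` and `ALIASES` are hypothetical, and the function takes an already segmented token list so it runs without jieba installed.

```python
from collections import Counter

# Hypothetical alias table: map full names onto the short form before counting.
ALIASES = {"保尔柯察金": "保尔", "达雅柯察金": "达雅"}


def top_words(tokens, excludes=(), aliases=None, n=10):
    """Return the n most frequent tokens as (word, count) pairs."""
    aliases = aliases or {}
    excluded = set(excludes)
    counts = Counter()
    for token in tokens:
        if len(token) == 1:  # skip single-character tokens
            continue
        token = aliases.get(token, token)  # merge aliases
        if token in excluded:  # drop filler words before counting
            continue
        counts[token] += 1
    return counts.most_common(n)


sample = ["保尔", "保尔柯察金", "我们", "的", "达雅", "保尔", "今天"]
print(top_words(sample, excludes={"我们"}, aliases=ALIASES, n=2))
# → [('保尔', 3), ('达雅', 1)]
```

With the real text, `tokens` would be the list returned by `jieba.lcut(txt)` and `excludes` the set defined earlier. Filtering before counting means no deletion step is needed afterwards, and `most_common(n)` simply returns fewer pairs when there are fewer than `n` distinct words.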