Chinese Word Frequency Counting in Python

lger_Pro

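Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/lger_pro/article/details/79732766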

This post counts Chinese word frequencies in a novel.

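Three files are used: novel.txt, punctuation.txt and meaningless.txt.
They hold, respectively, the novel text, the special symbols to filter out, and the meaningless words to ignore.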
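
The script only ever uses the first character of each line of punctuation.txt and splits meaningless.txt on newlines, so both files are presumably laid out with one entry per line. The following is a minimal sketch of that assumed layout. The sample symbols and words, the temporary directory, and the helper names `write_sample_files`, `load_symbols` and `load_words` are all illustrative and are not taken from the author's actual files.

```python
import os
import tempfile


def write_sample_files(directory):
    """Write tiny stand-ins for punctuation.txt and meaningless.txt.

    Assumed layout (not confirmed by the post): one entry per line.
    """
    punctuation_path = os.path.join(directory, 'punctuation.txt')
    meaningless_path = os.path.join(directory, 'meaningless.txt')
    with open(punctuation_path, 'w', encoding='UTF-8') as f:
        f.write('，\n。\n！\n')
    with open(meaningless_path, 'w', encoding='UTF-8') as f:
        f.write('的\n了\n是')
    return punctuation_path, meaningless_path


def load_symbols(path):
    """Return the first character of every non-empty line."""
    with open(path, 'r', encoding='UTF-8') as f:
        return [line[0] for line in f if line.strip()]


def load_words(path):
    """Return the set of words, one per line, ignoring blank lines."""
    with open(path, 'r', encoding='UTF-8') as f:
        return {word for word in f.read().split('\n') if word}


with tempfile.TemporaryDirectory() as tmp:
    punctuation_file, meaningless_file = write_sample_files(tmp)
    symbols = load_symbols(punctuation_file)
    stop_words = load_words(meaningless_file)

print(symbols)
print(sorted(stop_words))
```

Skipping blank lines here is a small departure from the script, which would treat an empty line in punctuation.txt as a newline character to replace.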

The Python code that counts the word frequencies is as follows:

import jieba  # jieba: Chinese word segmentation library

# Read the novel from file
with open('novel.txt', 'r', encoding='UTF-8') as novelFile:
    novel = novelFile.read()

# Replace the special symbols in the novel with spaces
with open('punctuation.txt', 'r', encoding='UTF-8') as punctuationFile:
    for punctuation in punctuationFile.readlines():
        novel = novel.replace(punctuation[0], ' ')

# Add novel-specific words to the jieba dictionary
jieba.add_word('凤十')
jieba.add_word('林胖子')
jieba.add_word('黑道')
jieba.add_word('饿狼帮')
# Read the meaningless words from file
with open('meaningless.txt', 'r', encoding='UTF-8') as meaninglessFile:
    mLessSet = set(meaninglessFile.read().split('\n'))
mLessSet.add(' ')

novelList = list(jieba.cut(novel))
novelSet = set(novelList) - mLessSet  # remove the meaningless words from the set of distinct words
novelDict = {}
# Build the word-frequency dictionary
for word in novelSet:
    novelDict[word] = novelList.count(word)

# Sort the words by frequency, highest first
novelListSorted = list(novelDict.items())
novelListSorted.sort(key=lambda e: e[1], reverse=True)

# Print the 20 most frequent words
for topWordTup in novelListSorted[:20]:
    print(topWordTup)

# Printed output:
# ('杨易', 906)
# ('说道', 392)
# ('一个', 349)
# ('林胖子', 338)
# ('知道', 295)
# ('和', 218)
# ('心里', 217)
# ('已经', 217)
# ('没有', 217)
# ('这个', 206)
# ('有点', 198)
# ('道', 195)
# ('徐明', 194)
# ('就是', 192)
# ('看', 191)
# ('走', 185)
# ('有', 178)
# ('上', 176)
# ('好', 176)
# ('来', 170)
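
The counting loop in the script calls `novelList.count(word)` once per distinct word, which rescans the whole token list each time. A possible alternative, sketched here on the assumption that the tokens are already available as a list of strings (for example from `jieba.cut`), is to count everything in a single pass with `collections.Counter`. The function name `top_words` and the sample tokens are made up for illustration, and words with equal counts may come out in a different order than in the original script.

```python
from collections import Counter


def top_words(tokens, stop_words, n=20):
    """Return the n most frequent tokens that are not in stop_words."""
    counts = Counter(token for token in tokens if token not in stop_words)
    return counts.most_common(n)


# Hypothetical, already-segmented sample standing in for list(jieba.cut(novel))
sample_tokens = ['杨易', '说道', ' ', '杨易', '的', '林胖子', '杨易', '说道']
sample_stop_words = {' ', '的'}

result = top_words(sample_tokens, sample_stop_words, n=2)
print(result)  # [('杨易', 3), ('说道', 2)]
```

`Counter.most_common(n)` also replaces the manual sort and the top-20 loop, so the same function covers counting, ranking and truncation.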

The related code has been uploaded to CSDN.
