[Homework] Python: word-frequency count for the book Walden, sorted from high to low


iHUQAQ


Article link: https://blog.csdn.net/iHUQAQ/article/details/109756963


Environment used in this article: PyCharm with Python 3.8

I. Locate the file

First decide where Walden.txt is stored.
The simplest option is to put Walden.txt in the same folder as the .py script.

II. Step by step

1. Open the file

f=open('Walden.txt','r',encoding='utf-8')

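Because the script and the text file sit in the same folder, the relative path Walden.txt is enough. If they are in different folders, right-click Walden.txt and choose Properties to find its location.

For example, for a Walden.txt placed on the desktop, the Location field in Properties shows C:\Users\iHU\Desktop, so the full path is C:\Users\iHU\Desktop\Walden.txt.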
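As a minimal sketch of handling the two cases (same folder versus another folder), the path can be built with pathlib. The helper name resolve_text_path is hypothetical and only for illustration. Note that a plain string literal such as 'C:\Users\...' is a syntax error in Python 3 because \U starts a Unicode escape, so the folder is written as a raw string here.

```python
from pathlib import Path


def resolve_text_path(filename, folder=None):
    """Return the path to filename, joined to folder when one is given.

    Hypothetical helper for illustration; it does not check that the file exists.
    """
    if folder is None:
        # Relative path: resolved against the current working directory
        return Path(filename)
    return Path(folder) / filename


# Same folder as the script (when the script is run from that folder)
same_folder = resolve_text_path('Walden.txt')

# Another folder: the raw string keeps the backslashes literal
desktop = resolve_text_path('Walden.txt', r'C:\Users\iHU\Desktop')

print(same_folder)
print(desktop)
```

Either result can be passed directly to open(), for example open(desktop, 'r', encoding='utf-8').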

'r' opens the file in text read mode, so the text in Walden.txt can be read but not modified.

encoding='utf-8' does not convert the file. It tells Python to decode the bytes in the file as UTF-8 when reading them into a string.

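Adding a temporary print(f.read()) shows the text that was read. Remove it afterwards, because a second f.read() on the same file object returns an empty string.

If encoding='utf-8' is omitted, Python falls back to the system default encoding. On a Chinese-locale Windows system that is typically GBK, and reading the file fails with a UnicodeDecodeError (illegal multibyte sequence).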
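The following is a small sketch of reading a file with an explicit encoding. It uses a with block so the file is closed automatically, which is a variation on the bare open() call above. The helper name read_text and the sample sentence are assumptions for illustration; a temporary file stands in for Walden.txt so the example runs anywhere.

```python
import os
import tempfile


def read_text(path, encoding='utf-8'):
    """Read a whole text file, decoding its bytes with the given encoding."""
    with open(path, 'r', encoding=encoding) as f:
        return f.read()


# Sample text with non-ASCII curly quotes, standing in for the real book
sample = 'The chapter \u201cEconomy\u201d opens the book.\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    # Write the sample as UTF-8 so that reading it back as UTF-8 succeeds
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    text = read_text(path)

print(text)
```

Passing the encoding explicitly makes the script behave the same way regardless of the system default.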

2. Normalize the text so it is easier to count

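First import re and read the whole file into one string with line=f.read()
Convert uppercase letters to lowercase: line=line.lower()
Remove punctuation: line=re.sub('[,.?;:"\'!]','',line)
The replacement is the empty string, so the listed characters are deleted rather than turned into spaces. For instance, "don't" becomes "dont", while hyphens and other characters outside the brackets are left in place.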
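To make the effect of this step concrete, here is a short sketch that wraps the two calls in a function and applies them to sample sentences. The function name normalize and the sample text are assumptions for illustration only.

```python
import re


def normalize(text):
    """Lowercase the text and delete the punctuation listed in the pattern."""
    text = text.lower()
    # Same character class as in the article; matches are replaced by ''
    return re.sub('[,.?;:"\'!]', '', text)


sample = 'I went to the woods, because I wished to live deliberately!'
cleaned = normalize(sample)
print(cleaned)

# An apostrophe inside a word is deleted too, which merges the two parts
print(normalize("Don't stop."))
```

If merged words such as "dont" are unwanted, one option is to drop the apostrophe from the character class; that is a design choice the article does not make.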

3. Split the text on whitespace and store the result in the list words

words=line.split()
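A quick sketch of why split() is called without an argument: it splits on any run of whitespace, including newlines, and never produces empty strings. The sample string is an assumption for illustration.

```python
# Two spaces and a newline separate the words in this sample
text = 'simplify  simplify\nsimplicity of life'

# No argument: any run of whitespace is one separator
words = text.split()
print(words)

# Splitting on a single space keeps empty strings and the newline
naive = 'a  b'.split(' ')
print(naive)
```

With a whole book, where lines end in newlines and spacing is irregular, the argument-free form avoids counting empty strings as words.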

4. Define the counter function

from collections import Counter
def counter(words):
    return Counter(words).most_common(10000)

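Counter tallies how often each word occurs in the list words, and most_common returns (word, count) pairs sorted from most to least frequent. The argument 10000 is the maximum number of pairs returned; calling most_common() with no argument returns every word.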
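A small sketch of what most_common returns, using a made-up word list rather than the real text of the book:

```python
from collections import Counter

# Hypothetical word list standing in for the words taken from Walden.txt
words = ['the', 'pond', 'the', 'woods', 'the', 'pond']

counts = Counter(words)

# Limit the output to the two most frequent words
top_two = counts.most_common(2)
print(top_two)

# No argument: every distinct word, most frequent first
everything = counts.most_common()
print(everything)
```

Each element is a (word, count) tuple, so the result can be unpacked in a loop such as for word, count in top_two.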

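5. Store and print the result

word_counts=counter(words)
print(word_counts)

Counter is a dict subclass, so every distinct word appears exactly once as a key, which is what removes the duplicates. The value returned by counter(words) is a list of (word, count) tuples rather than a dictionary, so it is stored under the name word_counts; naming it dict would shadow Python's built-in dict type.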
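If a real dictionary is wanted, for example to look up the count of a single word, the list of pairs can be converted with the built-in dict(). This is a sketch with a made-up word list; the names pairs and freq are assumptions for illustration, and the call only works if the name dict has not been reassigned earlier in the script.

```python
from collections import Counter

# Hypothetical word list standing in for the words taken from Walden.txt
words = ['the', 'pond', 'the', 'woods', 'the', 'pond']

# List of (word, count) tuples, most frequent first
pairs = Counter(words).most_common()

# dict() keeps the order of the pairs (Python 3.7+)
freq = dict(pairs)

print(freq['the'])
print(list(freq))
```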

III. Final result

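import re
from collections import Counter

def counter(words):
    # Return (word, count) pairs, most frequent first (at most 10000 pairs)
    return Counter(words).most_common(10000)

# Read the whole book; the with block closes the file automatically
with open('Walden.txt','r',encoding='utf-8') as f:
    line=f.read()
line=line.lower()                    # convert to lowercase
line=re.sub('[,.?;:"\'!]','',line)   # delete punctuation
words=line.split()                   # split on whitespace
word_counts=counter(words)
print(word_counts)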

Run the script. It prints the list of (word, count) pairs, most frequent first.
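Printing the whole list at once is hard to read for a full book. As an optional sketch, the steps above can be wrapped in one function and the top entries printed one per line. The function name word_frequencies, the top_n parameter, and the sample sentence are assumptions for illustration, not part of the original script.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=None):
    """Return (word, count) pairs for text, most frequent first."""
    # Same normalization as in the article: lowercase, delete punctuation
    cleaned = re.sub('[,.?;:"\'!]', '', text.lower())
    # top_n=None returns every distinct word
    return Counter(cleaned.split()).most_common(top_n)


# Short sample standing in for the contents of Walden.txt
sample = 'The pond, the woods, the sky. The pond!'
top = word_frequencies(sample, top_n=2)

# One word per line, padded so the counts line up
for word, count in top:
    print(f'{word:<10}{count}')
```

To apply it to the book, pass the string read from Walden.txt instead of sample.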
