Counting Word Frequencies and Printing the Most Frequent Words

通宵不会困

Last modified 2024-09-05 22:24:33

Tags: python, artificial intelligence

First published 2024-09-05 22:23:18

Article link: https://blog.csdn.net/m0_75197005/article/details/141941811

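Task: count word frequencies and print the most frequent words.

The data provided is a news report from one day's English edition of China Daily. Write a Python program that finds the ten most frequent words in it and prints each word together with its frequency (as a dictionary).

Notes:

(1) The code must eliminate interference from punctuation: a punctuation mark is neither a word nor part of a word.

(2) Words containing uppercase letters must be converted to lowercase before counting, so the count is case-insensitive.

(3) Whitespace is not counted as a word.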

Code:

First, read the text:

with open("./word.txt", "r", encoding="utf-8") as file:
    text = file.read()

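Create a translation table, translator, that deletes every punctuation character from the string.

string.punctuation is a constant in the string module of the Python standard library. It contains all ASCII punctuation characters, such as commas, periods, semicolons, colons, quotation marks, brackets, and other special symbols. Its value is: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~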

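import string

translator = str.maketrans('', '', string.punctuation)
# str.maketrans(x, y, z): each character in x maps to the character at the same position in y; every character in z is deleted
cleaned_text = text.translate(translator)
# translate() removes all punctuation; the result is stored in cleaned_text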
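
To see what the translation table does, here is a minimal sketch on a made-up sentence (the `sample` and `curly` strings are assumptions for illustration, not taken from the news report). It also shows two side effects worth knowing: the apostrophe and the hyphen are both in `string.punctuation`, so "It's" becomes "Its" and "well-known" becomes "wellknown", while typographic (curly) quotes are not ASCII and are left untouched.

```python
import string

# Hypothetical sample text, used only for illustration
sample = "Hello, world! It's a well-known fact."

# Third argument: every character listed here is deleted by translate()
translator = str.maketrans('', '', string.punctuation)
cleaned = sample.translate(translator)
print(cleaned)  # Hello world Its a wellknown fact

# Curly quotes are not part of string.punctuation, so they survive
curly = "“Quoted” text"
print(curly.translate(translator))
```

If the input text contains such non-ASCII punctuation, those characters would have to be added to the third argument of `str.maketrans` or handled separately.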

Convert the text to lowercase and split it into a list of words:

cleaned_text = cleaned_text.lower()
words = cleaned_text.split()
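
Calling `split()` without arguments splits on any run of whitespace (spaces, tabs, newlines) and never produces empty strings, which is what satisfies note (3). A small sketch with a made-up string (`raw` is an assumption for illustration), contrasted with `split(" ")`:

```python
# Hypothetical input with repeated spaces and a newline
raw = "The  quick\nbrown   Fox "

words = raw.lower().split()
print(words)  # ['the', 'quick', 'brown', 'fox']

# Splitting on a single space keeps empty strings and does not break on newlines
naive = raw.lower().split(" ")
print(naive)
```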

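Use Counter (from the collections module) to count how many times each word occurs, then call most_common to get the ten most frequent words.

Finally, convert the result to a dictionary and print it.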

from collections import Counter

word_counts = Counter(words)
top_ten = word_counts.most_common(10)
print(dict(top_ten))
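
As a quick illustration of `Counter` and `most_common`, the sketch below counts a tiny hand-written word list (an assumption for the example, not the real data). `most_common(n)` returns a list of `(word, count)` tuples sorted by count in descending order; since Python 3.7, words with equal counts keep the order in which they were first seen, and `dict` preserves that order.

```python
from collections import Counter

# Hypothetical word list standing in for the real tokens
words = ["china", "the", "of", "the", "china", "the"]

word_counts = Counter(words)
top_two = word_counts.most_common(2)
print(top_two)        # [('the', 3), ('china', 2)]
print(dict(top_two))  # {'the': 3, 'china': 2}
```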

Complete code:

import string
from collections import Counter

# Read the news report from word.txt into text
with open("./word.txt", "r", encoding="utf-8") as file:
    text = file.read()

# Remove punctuation
translator = str.maketrans('', '', string.punctuation)
cleaned_text = text.translate(translator)

# Convert to lowercase
cleaned_text = cleaned_text.lower()

# Split the text into a list of words
words = cleaned_text.split()

# Count word frequencies
word_counts = Counter(words)

# Get the ten most frequent words
top_ten = word_counts.most_common(10)

# Print the result as a dictionary
print(dict(top_ten))
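
The same pipeline can be wrapped in a function so it is easy to try on short strings. The sketch below is a variant, not the code above: it extracts words with `re.findall` and the pattern `[a-z]+`, which also discards digits and non-ASCII punctuation such as curly quotes, so its counts can differ from the `string.punctuation` approach (for example, "don't" yields "don" and "t"). The function name `top_words` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent lowercase words in text as a dict."""
    # Keep only runs of ASCII letters; everything else acts as a separator
    words = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(words).most_common(n))


# Hypothetical sample sentence, used only for illustration
sample = "The cat saw the dog. “The dog,” it said, ran!"
print(top_words(sample, 2))  # {'the': 3, 'dog': 2}
```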

Output: