8-2 Word Frequency Statistics: Hamlet

一根晓猪

Last modified 2023-07-19 21:22:15


First published 2023-07-18 01:28:41

Article link: https://blog.csdn.net/c3872931/article/details/131778290


Hamlet is one of Shakespeare's classic tragedies. The text of the play is provided as a file: hamlet.txt.

Count how often each English word occurs in this file, then print the 10 most frequent words in the following format:

the       , 1138

That is: the English word (left-aligned, width 10) + a comma + the number of times the word occurs (right-aligned, width 5).
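The format rule above can be sketched with an f-string format specification. This is only an illustration: the helper name `format_entry` is hypothetical and not part of the task, and the count 1138 is the one from the sample line.

```python
def format_entry(word, freq):
    # Word left-aligned in 10 columns, then a comma,
    # then the count right-aligned in 5 columns.
    return f"{word:<10},{freq:>5}"


line = format_entry("the", 1138)
# repr() makes the padding spaces visible.
print(repr(line))  # 'the       , 1138'
```

The single space in front of 1138 comes from padding a four-digit number to width 5, not from the format string itself.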
Requirements and notes:

Punctuation marks and combinations of them do not count as English words. The punctuation and special characters to remove are: !"#$%&()*+,-./:;<=>?@^_‘{|}~
Different capitalizations of the same word count as one word, for example The and the are the same.
In the program, open the file by its file name: hamlet.txt
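One possible way to apply the first two requirements is to lowercase the text and replace every listed character with a space before splitting. The sketch below assumes that reading of the task; the names `PUNCTUATION` and `normalize` are hypothetical, and the character list is copied from the statement as given, including the curly quote. Note that this is a different technique from stripping punctuation off the ends of each word: here a word with punctuation inside it is split into parts rather than left whole.

```python
# Characters copied from the task statement (assumption: the list is complete as shown).
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@^_‘{|}~'


def normalize(text):
    # Map every listed character to a space, after lowercasing the text.
    table = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))
    return text.lower().translate(table)


words = normalize("The time is out of joint:--O cursed spite!").split()
print(words)
# ['the', 'time', 'is', 'out', 'of', 'joint', 'o', 'cursed', 'spite']
```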
hamlet.docx

import string

def process_word(word):
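    # Strip punctuation and special characters from both ends of the word
    # (characters inside the word are left in place)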
    word = word.strip(string.punctuation)
    return word.lower()

def main():
    word_freq = {}
    
    with open("hamlet.txt", "r", encoding="utf-8") as file:
        for line in file:
            words = line.split()
            for word in words:
                word = process_word(word)
                if word.isalpha():  # count only words made up entirely of letters
                    word_freq[word] = word_freq.get(word, 0) + 1
    
    sorted_word_freq = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
    top_10_words = sorted_word_freq[:10]
    
    for word, freq in top_10_words:
        # Word (width 10, left-aligned) + comma + count (width 5, right-aligned)
        print(f"{word:<10},{freq:>5}")

if __name__ == "__main__":
    main()
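As a variation on the counting step, the standard library's `collections.Counter` can replace the hand-written dictionary and the explicit sort. This is a minimal sketch, not the author's solution: the function name `top_words` is hypothetical, and it takes a string instead of reading hamlet.txt so that it can be tried on a short sample.

```python
import string
from collections import Counter


def top_words(text, n=10):
    # Same cleaning as above: strip punctuation from both ends, then lowercase.
    cleaned = (w.strip(string.punctuation).lower() for w in text.split())
    # Keep only purely alphabetic words and count them.
    counts = Counter(w for w in cleaned if w.isalpha())
    # most_common(n) returns the n (word, count) pairs with the highest counts.
    return counts.most_common(n)


for word, freq in top_words("The king, the queen; the KING!", 2):
    print(f"{word:<10},{freq:>5}")
```

For the real task, the text passed in would be the contents of hamlet.txt, for example read with `open("hamlet.txt", encoding="utf-8").read()`.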
