python统计文本中各词数_统计文本中单词数python-CSDN博客

本文链接：https://blog.csdn.net/O_C_E_A_N_/article/details/124354238

【问题描述】

请统计hamlet.txt文件中出现的英文单词情况，统计并输出出现最多的前n个单词，注意：‪‬‪‬‪‬‪‬‪‬‮‬‪‬‭‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‮‬

(1) 单词不区分大小写，即需将大写转换成小写；‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‭‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‮‬

(2) 请在文本中剔除如下特殊符号：!"#$%&()*+,-./:;<=>?@[\]^_‘{|}~‪‬‪‬‪‬‪‬‪‬‮‬‪‬‭‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‮‬

(3) 输出n个单词和其出现次数，每个单词一行；‪‬‪‬‪‬‪‬‪‬‮‬‪‬‭‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‮‬

(4) 输出单词为小写形式。

此题不涉及编码转换，若想指定编码可在开始加上

#-- coding: utf-8 --

或在文件打开处指定编码

with open(“hamlet.txt”, “r”, encoding=‘utf-8’) as f：

 ........

 ........

【输入形式】

【输出形式】

以下仅是输出样例（仅列出3个，需要列出n个），不是最终结果：‪‬‪‬‪‬‪‬‪‬‮‬‪‬‭‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‮‬

the 1138

and 965

to 754

单词左对齐，并占10个位置；次数右对齐，并占5个位置

题解

def getText():
    t = open("hamlet.txt", "r").read()  # 读
    t = t.lower()  # 全体小写
    for i in "!\"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~":  # 略
        t = t.replace(i, " ")
    return t


h = getText()
s = int(input())  # 指定数
words = h.split()  # 列表化
counts = {}  # 暂存字典
for word in words:  # 统计
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # 降序统计
for i in range(s):  # 格式化输出
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))