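Word Frequency Count
Requirements:
1. Save the content of an English article in a text file (named a.txt in the code below).
2. Count how many times each word occurs in the text file, and print the ten most frequent words together with their counts.
Environment: Windows 10, Python 3.7.6, PyCharm
The article saved in a.txt reads as follows: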
Companionship of Books
A man may usually be known by the books he reads as well as by the company hekeeps; for there is a companionship of books as well as of men; and one should always livein the best company, whether it be of books or of men.
A good book may be among the best of friends.It is the same today that it alwayswas, and it will never change.It is the most patient and cheerful of companions. lt doesnot turn its back upon us in times of adversity or distress. It always receives us with thesame kindness; amusing and instructing us in youth, and comforting and consoling us inage.
Men often discover their affinity to each other by the mutual love they have for abook just as two persons sometimes discover a friend by the admiration which bothentertain for a third.There is an old proverb, Love me, love my dog.But there is morewisdom in
# Read the article
with open('a.txt', 'r', encoding='UTF-8') as fp:
    context = fp.readlines()  # Read the file line by line into a list of strings
# print(context)  # Each line is one list element: ['Companionship of Books\n', 'A man may usually be...]
# print(type(context))  # Check the data type: <class 'list'>
# Define function f to count how many times each word occurs
result = {}  # Empty dict that accumulates the counts (word -> count)
def f(s):
    # Example value of s: 'Companionship of Books'
    words = s.split()  # Split on whitespace into a list of words
    # print(words)  # Prints: ['Companionship', 'of', 'Books']
    for word in words:
        # print(word)  # Prints: Companionship
        if word in result:
            result[word] += 1
        else:
            result[word] = 1
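As a side note, the if/else counting in f can also be expressed with collections.Counter from the standard library. The following is only a minimal sketch, not the script used in this post; the name count_words and the two sample lines are made up for illustration.

```python
from collections import Counter


def count_words(line, counts):
    """Add every whitespace-separated word of `line` to `counts` (a Counter)."""
    # Counter.update() increments the count of each element in the iterable,
    # so no explicit "if word in counts" check is needed.
    counts.update(line.split())
    return counts


counts = Counter()
count_words("Companionship of Books", counts)
count_words("a companionship of books as well as of men", counts)

print(counts["of"])       # 3
print(counts["as"])       # 2
print(counts["missing"])  # 0, a Counter returns 0 for unseen keys
```

Like f, this sketch is case-sensitive, so "Companionship" and "companionship" are counted as two different words.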
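# Loop over the lines in context, replace punctuation with spaces, then count the words of each line
for lines in context:
    lines = lines.replace(',', ' ')  # Replace commas with spaces
    lines = lines.replace('.', ' ')  # Replace periods with spaces
    lines = lines.replace(';', ' ')  # Replace semicolons with spaces
    # print(lines)  # Prints: Companionship of Books
    f(lines)  # Count the words of this line with f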
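The three replace() calls handle only commas, periods, and semicolons. One possible alternative, sketched below under the assumption that a "word" is a run of ASCII letters and apostrophes, is to extract words with a regular expression. The names WORD_RE and tokenize are hypothetical, and the optional lowercasing is an addition that the script above does not perform.

```python
import re

# Assumption: a word is one or more ASCII letters or apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def tokenize(line, lowercase=True):
    """Return the words in `line`, ignoring all punctuation and digits."""
    if lowercase:
        line = line.lower()  # merge "Love" and "love" into one word
    return WORD_RE.findall(line)


words = tokenize("There is an old proverb, Love me, love my dog.But there is more")
print(words)
```

With lowercase=False the function keeps the original capitalisation, which is closer to what the replace()-based loop does.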
# Print every word and its count
# for k, v in result.items():
#     print(k, v)  # Prints: Companionship 1
# Print the ten most frequent words and their counts.
# Sort
# sorted() orders the (word, count) pairs returned by result.items().
# The key (d[1], d[0]) sorts by the count d[1] first and breaks ties by the word d[0];
# reverse=True puts the largest values first.
result = sorted(result.items(), key=lambda d: (d[1], d[0]), reverse=True)
# print(result)  # Prints: [('the', 8), ('of', 8), ('and', 6), ...]
for k, v in result[:10]:
    print("The word %s appears %d times" % (k, v))
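Because reverse=True applies to the whole key, words with equal counts come out in reverse alphabetical order ('the' before 'of'). If alphabetical order among ties is preferred, one option is to negate the count instead of reversing. This is only a sketch: the helper name top_words is hypothetical, and the sample dictionary is made up for illustration, with only the counts for 'the', 'of', and 'and' taken from the output shown above.

```python
def top_words(counts, n=10):
    """Return the n most frequent (word, count) pairs.

    Sorts by count descending (via the negated count) and, for equal
    counts, by word ascending.
    """
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Illustrative data; 'a' and 'books' are invented values.
sample = {"the": 8, "of": 8, "and": 6, "a": 6, "books": 3}

for word, freq in top_words(sample, n=4):
    print("The word %s appears %d times" % (word, freq))
```

Here 'of' is listed before 'the' and 'a' before 'and', since ties are resolved alphabetically.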
The final output of the word-count script is shown in the figure below: