2020-12-27

Word Frequency Count

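Requirements:
1. Save the text of an English article in 文章.txt (named a.txt in the rest of this post).
2. Count how many times each word occurs in the text file, then print the ten most frequent words together with their counts.
Environment: Windows 10, Python 3.7.6, PyCharm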


The contents of a.txt are as follows:
Companionship of Books
A man may usually be known by the books he reads as well as by the company hekeeps; for there is a companionship of books as well as of men; and one should always livein the best company, whether it be of books or of men.
A good book may be among the best of friends.It is the same today that it alwayswas, and it will never change.It is the most patient and cheerful of companions. lt doesnot turn its back upon us in times of adversity or distress. It always receives us with thesame kindness; amusing and instructing us in youth, and comforting and consoling us inage.
Men often discover their affinity to each other by the mutual love they have for abook just as two persons sometimes discover a friend by the admiration which bothentertain for a third.There is an old proverb, Love me, love my dog.But there is morewisdom in

# Read the article
with open('a.txt', 'r', encoding='UTF-8') as f:
    context = f.readlines()        # Read the file line by line into a list of strings
    # print(context)                  # Each line is one element of the list: ['Companionship of Books\n', 'A man may usually be...]
    # print(type(context))            # Check the data type: <class 'list'>
# Define function f, which counts how many times each word occurs
result = {}     # Empty dict that accumulates the counts
def f(s):
    # Example value of s: Companionship of Books
    words = s.split()      # Split on whitespace so that each word becomes one list element
    # print(words)    # Output: ['Companionship', 'of', 'Books']
    for word in words:
        # print(word)   # Output: Companionship
        if word in result:
            result[word] += 1
        else:
            result[word] = 1
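# Iterate over the lines in context, replace punctuation with spaces, and count the words in each line
for lines in context:
    lines = lines.replace(',', ' ')        # Replace commas with spaces
    lines = lines.replace('.', ' ')         # Replace periods with spaces
    lines = lines.replace(';', ' ')         # Replace semicolons with spaces
    # print(lines)      # Output: Companionship of Books
    f(lines)      # Count the words in this line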
# Print the raw counts
# for k, v in result.items():
#     print(k, v)     # Output: Companionship 1
# Print the ten most frequent words and their counts.
# Sort with the built-in sorted(): result.items() yields (key, value) pairs, and
# key=lambda d: (d[1], d[0]) sorts by d[1] (the count) first, then by d[0] (the word)
# when the counts are equal. Because reverse=True, both orders are descending.
result = sorted(result.items(), key=lambda d: (d[1], d[0]), reverse=True)
# print(result)       # Output: [('the', 8), ('of', 8), ('and', 6), ...]
for k, v in result[:10]:
    print("The word %s occurs %d time(s)" % (k, v))

The final output is shown in the figure below:
[Figure: program output]
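
For comparison, the same counting can be sketched with the standard library's collections.Counter and a regular expression. This is only a minimal sketch under a few assumptions that differ from the script above: the function name top_words is hypothetical, words are lowercased before counting, a word is taken to be a run of ASCII letters (optionally with an internal apostrophe), and ties are broken alphabetically rather than in reverse alphabetical order.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Assumption: a "word" is a run of ASCII letters, optionally with an
    # internal apostrophe; every other character acts as a separator.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text.lower())
    counts = Counter(words)
    # Sort by descending count, then alphabetically, so that ties are
    # ordered deterministically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    sample = "Love me, love my dog. A good book may be among the best of friends."
    for word, count in top_words(sample, 3):
        print("%s: %d" % (word, count))
```

To apply it to the article, read the whole file with f.read() and pass the resulting string to top_words. Because the regular expression treats every non-letter as a separator, punctuation such as colons or quotation marks is handled without adding further replace() calls.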
