[每日一题]39、用Python统计文件中单词出现个数

最新推荐文章于 2024-03-11 17:43:41 发布

Python3X

最新推荐文章于 2024-03-11 17:43:41 发布

阅读量1.7k

点赞数 1

本文链接：https://blog.csdn.net/weixin_43499626/article/details/102656402

版权

Python Every Day，第 39 期

题目：

任一个英文的纯文本文件，统计其中的单词出现的个数。如下：

In our world , one creature without any rivals is a lifeless creature. 

If a man lives without rivals, he is bound to be satisfied with the present and will not strive for the better. 

He would hold back before all difficulties and decline in inaction and laziness. 

Adverse environment tends to cultivate successful people. Therefore, your rivals are not your opponents or those you grudge. 

Instead , they are your good friends! 

In our lives, we need some rivals to "push us into the river", leaving us striving ahead in all difficulties and competitions. 

In our work, we need some rivals to be picky about us and supervise our work with rigorous requirements and standards. 
Due to our rivals, we can bring out our potential to the best; Due to our rivals, we will continuously promote our capabilities when competing with them!

常规解法

完整参考代码如下：

# coding=utf-8
from collections import defaultdict
import re
# 替换除了n't这类连字符外的所有非单词字符和数字字符
def replace(s):
    if s.group(1) == 'n\'t':
        return s.group(1)
    return ' '
def cal(filename='english.txt'):
    # 使用lambda来定义简单的函数
    dic = defaultdict(lambda: 0)#dic = defaultdict(int)也可以
    with open(filename, 'r') as f:
        data = f.read()
    # 全部变为小写字母
    data = data.lower()
    # 替换除了n't这类连字符外的所有非单词字符和数字字符
    data = re.sub(r'(n[\']t)|([\W\d])', replace, data)
    datalist = re.split(r'[\s\n]+', data)
    for item in datalist:
        dic[item] += 1
    del dic['']
    return dic

if __name__ == '__main__':
    dic = cal()
    for key, val in dic.items():
        print('%15s  --->   出现了 %s 次' % (key,val))

执行结果部分截图：

640?wx_fmt=png

以上，是一种比较常规的方法，有没有更简单的方法？并且要求按照出现次数排序显示。

Pythonic的解法：

我在每日一题36期中简单的讲了一下内置模块collections的用法

这里就可以借用里面的Counter方法的功能去完成我们的问题，

即： Counter用于追踪值的出现次数(列表中每个元素出现的次数)

# coding=utf-8
import re
from collections import Counter


def cal(filename='english.txt'):
    with open(filename, 'r') as f:
        data = f.read()
    data = data.lower().replace(',', '').replace('.', '')
    # 替换除了n't这类连字符外的所有非单词字符和数字字符
    datalist = re.split(r'[\s\n]+', data)
    return Counter(datalist).most_common()


if __name__ == '__main__':
    dic = cal()
    for i in range(len(dic)):
        print('%15s  --->   出现了 %s 次' % (dic[i][0], dic[i][1]))

输出结果部分截图：

640?wx_fmt=png