Word Frequency Counting (Python Implementation)

Python word frequency

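Preface

A classmate recently sent me an exercise on word segmentation in Python and asked me to implement it. Because the text is in English and contains only a small number of words, the task is not hard; it only requires Python's common data types.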

Timeline

  • June 7, 2021: first draft completed
  • June 8, 2021: revised

Content

The text to analyze is the classic set of Python design aphorisms, the Zen of Python, shown below:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

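Problem

The exercise consists of the following tasks:

  • Count the total number of words in the text
  • Count how many times each word occurs in the text
  • Find the sets of least frequent and most frequent words, together with their occurrence counts
  • Count the number of lines in the text
  • Get the current time
  • Write the results above to a new text file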

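Design

The program is not difficult. In keeping with "Complex is better than complicated" from the Zen of Python, the functions that solve the tasks above are kept as pure as possible. The overall flow is as follows:

  1. Read the text content.
  2. Clean the punctuation: expand English contractions and remove punctuation marks.
  3. Split the cleaned content into a list of strings.
  4. Iterate over the list and store the results in a dictionary, with each word as the key and its number of occurrences as the value. There are two cases:
    • The word is already in the dictionary: increase its value by 1.
    • The word is not in the dictionary: add it with a value of 1.
  5. Find the smallest and largest values in the dictionary, then iterate over the dictionary to collect the corresponding keys.
  6. Use readlines() to count the lines of the text.
  7. Use time to get the current time and format it in a common format.
  8. Assemble the results above and write them to a text file.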
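Steps 3 to 5 can also be sketched with the standard library's collections.Counter, which does the "already in the dictionary or not" bookkeeping from step 4 internally. This is only a minimal sketch, not the implementation used in this post; the function names count_words and extremes and the sample string are made up for illustration.

```python
from collections import Counter


def count_words(text):
    """Hypothetical helper: return (total words, Counter of lowercase words)."""
    words = text.lower().split()
    return len(words), Counter(words)


def extremes(counts):
    """Hypothetical helper: return the least and most frequent words with their counts."""
    lo, hi = min(counts.values()), max(counts.values())
    least = [w for w, c in counts.items() if c == lo]
    most = [w for w, c in counts.items() if c == hi]
    return least, lo, most, hi


# Sample input with punctuation already removed (illustrative only)
sample = "Now is better than never Although never is often better than right now"
total, counts = count_words(sample)
least, lo, most, hi = extremes(counts)
print(total, least, lo, most, hi)
```

Counter is a dict subclass, so the min/max lookup over values() works the same way as with a plain dictionary.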

Implementation

Implemented in Python:

import time


class WordCount:
    def __init__(self, filePath):
        self.filePath = filePath
        # Read the file, clean the punctuation, then count the words
        self.fileStr = self.readFile(self.filePath)
        self.fileStr = self.cleanPunct(self.fileStr)
        self.wordNums, self.wordDicts = self.countWord(self.fileStr)

    def readFile(self, filePath):
        # The with block closes the file once it has been read
        with open(filePath, "r", encoding='utf-8') as f:
            return f.read()

    def cleanPunct(self, fileStr: str):
        # Remove punctuation and expand the contractions n't, 's and 're
        return fileStr.replace(',', '').replace('.', '').replace(':', '')\
            .replace('!', '').replace('--', '').replace('*', '')\
            .replace("n't", ' not').replace("'s", ' is').replace("'re", ' are')

    def countWord(self, wordStr):
        # Return the total word count and a dict of lowercase word -> occurrences
        res = {}
        count = 0
        for i in wordStr.split():
            i = i.lower()
            count += 1
            if i not in res:
                res[i] = 1
            else:
                res[i] += 1
        return count, res

    def writeFile(self, filePath, content):
        with open(filePath, 'w', encoding='utf-8') as f:
            f.write(content)

    def getMinCountOfWord(self):
        # Collect every word whose count equals the minimum
        minCount = min(self.wordDicts.values())
        res = [key for key, value in self.wordDicts.items() if value == minCount]
        return {'words': res, 'word_count': len(res), 'occurrences': minCount}

    def getMaxCountOfWord(self):
        # Collect every word whose count equals the maximum
        maxCount = max(self.wordDicts.values())
        res = [key for key, value in self.wordDicts.items() if value == maxCount]
        return {'words': res, 'word_count': len(res), 'occurrences': maxCount}

    def getCountByWord(self, word):
        # Returns None when the word does not occur in the text
        return self.wordDicts.get(word)

    def getLenOfFile(self):
        # Number of lines in the original file
        with open(self.filePath, 'r', encoding='utf-8') as f:
            return len(f.readlines())

    def getLocalTime(self):
        return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())


if __name__ == "__main__":
    word = 'python'
    filePath = 'words.txt'
    solution = WordCount(filePath)
    wordNums = solution.wordNums
    min_word, max_word = solution.getMinCountOfWord(), solution.getMaxCountOfWord()
    count = solution.getCountByWord(word)
    lineCount = solution.getLenOfFile()
    now = solution.getLocalTime()
    # res = {
    #     'least frequent words': min_word,
    #     'most frequent words': max_word,
    #     'occurrences of the word {}'.format(word): count,
    #     'number of lines': lineCount,
    #     'time': now
    # }
    content = '''Total words: {}\nLeast frequent words: {}\nMost frequent words: {}\nOccurrences of the word {}: {}\nNumber of lines: {}\nTime: {}'''.format(wordNums, min_word, max_word, word, count, lineCount, now)
    fileResultPath = 'twordResult.txt'
    solution.writeFile(fileResultPath, content)

Output

Total words: 140
Least frequent words: {'words': ['beautiful', 'ugly', 'explicit', 'implicit', 'simple', 'complicated', 'flat', 'nested', 'sparse', 'dense', 'readability', 'counts', 'cases', 'enough', 'break', 'rules', 'practicality', 'beats', 'purity', 'errors', 'pass', 'silently', 'explicitly', 'silenced', 'in', 'face', 'ambiguity', 'refuse', 'temptation', 'guess', 'there', 'and', 'preferably', 'only', 'that', 'at', 'first', 'you', 'dutch', 'often', 'right', 'hard', 'bad', 'easy', 'good', 'namespaces', 'honking', 'great', 'let', 'more', 'those'], 'word_count': 51, 'occurrences': 1}
Most frequent words: {'words': ['is'], 'word_count': 1, 'occurrences': 12}
Occurrences of the word python: None
Number of lines: 19
Time: 2021-06-08 10:18:33
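The chained replace() calls in cleanPunct remove only the punctuation marks they list, so any other mark (a semicolon or a question mark, for instance) would stay attached to a word. As a hedged alternative, the sketch below expands the same three contractions and then keeps only runs of letters using the standard re module. The function name tokenize is hypothetical, and the pattern assumes plain ASCII English text; digits and non-ASCII letters would be dropped.

```python
import re


def tokenize(text):
    """Hypothetical helper: lowercase, expand contractions, return letter-only tokens."""
    text = text.lower()
    # Same contraction rules as cleanPunct: n't -> not, 's -> is, 're -> are
    text = text.replace("n't", " not").replace("'s", " is").replace("'re", " are")
    # Assumption: words consist of ASCII letters only
    return re.findall(r"[a-z]+", text)


tokens = tokenize("Special cases aren't special enough; you're Dutch?")
print(tokens)
```

Like the original rule, this treats every 's as "is", so possessives would be miscounted; for the Zen of Python that does not matter.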

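Summary

As mentioned earlier, because the text is English and the number of words is small, the task is not difficult; it mainly tests basic Python usage.

Of course, when the number of words becomes very large, the program needs to be optimized accordingly.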
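One possible direction for larger inputs, sketched here under assumptions rather than measured: instead of reading the whole file with read() and again with readlines(), iterate over the lines once and update a collections.Counter as you go, so that only one line is held in memory at a time. The function name count_stream is hypothetical, and punctuation cleaning is left out to keep the sketch short.

```python
import io
from collections import Counter


def count_stream(lines):
    """Hypothetical helper: count lines and words from any iterable of lines in one pass."""
    counts = Counter()
    line_total = 0
    for line in lines:
        line_total += 1
        counts.update(line.lower().split())
    return line_total, counts


# An in-memory stream stands in for a large file in this demo
demo = io.StringIO("Now is better than never\nAlthough never is often better than right now\n")
line_total, counts = count_stream(demo)
print(line_total, sum(counts.values()), counts.most_common(2))
```

With a real file, the same function could be called as count_stream(f) inside a with open(path, encoding='utf-8') as f: block, because a text file object yields its lines lazily.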
