Counting the distribution of words, letters, and phrases in a text file with Python

ThomasMrY

Article link: https://blog.csdn.net/qq_35001962/article/details/83627235

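This blog post shows how to analyze text with Python, including counting the frequencies of the 26 letters and of the most common words and phrases in a text. Through unit testing, performance profiling and code optimization, it arrives at an efficient and accurate analysis tool. The post links to the complete project code and walks through the implementation and optimization of each step.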

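This is the pair-programming assignment for the advanced software design course at MSRA.

This post covers the concrete implementation and process, including performance analysis and unit testing.

For how to use the analysis tools, see these two blog posts:

For the complete code of the project, see the GitHub repository below: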

https://github.com/ThomasMrY/ASE-project-MSRA

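First, a look at the requirements of the project:

User need: How are the frequencies of the 26 English letters distributed in a novel? Which words appear most often in a certain type of article? Which words does a particular author use most? What is the most common phrase in Harry Potter? And so on. We will write some programs to answer these questions and satisfy our curiosity.

Requirements: unit testing, regression testing, and performance testing of the program; use and debugging of basic languages such as C/C++/C#.

Tasks:

Step-0: Output the frequency of each of the 26 letters in an English text file, sorted from high to low, and show each letter's percentage to two decimal places.

Step-1: Output the top N most frequent English words in a single file.

Step-2: Support stop words. We can create a stop-word file (a stop list) and skip those words when counting.

Step-3: Output the frequency of word phrases in an English text file, sorted from high to low, and show the percentage of occurrences to two decimal places.

Step-4: Normalize verb forms before counting.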

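Step-0: Output the frequency of each of the 26 letters in an English text file, sorted from high to low, and show each letter's percentage to two decimal places.

The initial idea was to strip out the assorted symbols and then iterate over every letter in the text file, keeping the counts in a dictionary: look up the letter's entry each time and add one to its value. The code is as follows: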

#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author: Enoch time:2018/10/22 0031

import re
import time


def ProcessLine(line, counts):
    '''function: Count the letters in one line
        input:  line: a lowercased string holding one line of text
                counts: a dictionary of running letter counts
        output: counts: the same dictionary, keys are letters and values are counts
        date: 2018/10/22
    '''
    # Drop every character that is not a lowercase letter
    line = re.sub('[^a-z]', '', line)
    for ch in line:
        counts[ch] = counts.get(ch, 0) + 1
    return counts


def main():
    alphabetCounts = {}
    with open("../Gone With The Wind.txt", 'r') as file:
        for line in file:
            alphabetCounts = ProcessLine(line.lower(), alphabetCounts)
    lettersCount = sum(alphabetCounts.values())
    # Sort alphabetically first so that ties keep a stable order, then by count
    alphabetCounts = sorted(alphabetCounts.items(), key=lambda k: k[0])
    alphabetCounts = sorted(alphabetCounts, key=lambda k: k[1], reverse=True)
    for letter, fre in alphabetCounts:
        print("|\t{:15}|{:<11.2%}|".format(letter, fre / lettersCount))


if __name__ == '__main__':
    # time.clock() was removed in Python 3.8; perf_counter() replaces it
    start = time.perf_counter()
    main()
    end = time.perf_counter()
    print(end - start)

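In theory this code is correct. To verify it, we run unit tests against three text files: an empty file, a small sample file, and a larger sample file. The unit test code can be written as follows: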

from count import CountLetters
CountLetters("Null.txt")
CountLetters("Test.txt")
CountLetters("gone_with_the_wind.txt")

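Where:

Null.txt is an empty text file
gone_with_the_wind.txt is the text of Gone with the Wind
Test.txt is a text file with fixed content that we specified ourselves, so the counts can be checked against known values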
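
The same idea can be expressed with assertions so that the checks run automatically. The following is only a minimal sketch using the standard unittest module: count_letters is a hypothetical helper that returns the counts instead of printing them (it is not the project's CountLetters), and the file contents are made up for illustration.

```python
import os
import tempfile
import unittest
from collections import Counter


def count_letters(path):
    """Hypothetical helper: return a Counter of the letters a-z in the file."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    return Counter(ch for ch in text if "a" <= ch <= "z")


class CountLettersTest(unittest.TestCase):
    def _write(self, content):
        # Create a temporary file with known content; remove it after the test.
        fd, path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
        self.addCleanup(os.remove, path)
        return path

    def test_empty_file(self):
        # An empty file must produce no counts at all.
        self.assertEqual(dict(count_letters(self._write(""))), {})

    def test_small_known_file(self):
        # Case is ignored; punctuation, digits and newlines are skipped.
        counts = count_letters(self._write("Aab, b!\nC3"))
        self.assertEqual(dict(counts), {"a": 2, "b": 2, "c": 1})
```

Saved as a test module, this can be run with python -m unittest. A large file such as the novel is better suited to regression checks, comparing the output before and after an optimization.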

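Our checks confirmed that the results are correct. That establishes correctness, but it does not tell us how much of the code the tests cover, so we analyzed the code with the coverage tool, using the following command line: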

coverage run my_program.py arg1 arg2

The result is as follows:

Name                      Stmts   Exec  Cover
---------------------------------------------
CountLetters                 56     50    100%
---------------------------------------------
TOTAL                        56     50    100%

As you can see, the code runs correctly with 100% code coverage.

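But how fast does the program run? To understand its speed better and improve performance, we profiled it with cProfile. The result of the cProfile run was:

From it we can tell that main, ProcessLine and ReplacePunctuations are the three most time-consuming parts, with ProcessLine the largest, so we need to look at which functions ProcessLine() calls and how long each takes.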
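
For readers who want to reproduce this kind of measurement, here is a small sketch of driving cProfile from code and printing the entries sorted by cumulative time. The functions process_line and count_all are simplified stand-ins written for this example, and the input lines are synthetic, so the numbers will not match the ones discussed in this post.

```python
import cProfile
import io
import pstats


def process_line(line, counts):
    # Stand-in for the per-character counting loop being profiled.
    for ch in line:
        if "a" <= ch <= "z":
            counts[ch] = counts.get(ch, 0) + 1
    return counts


def count_all(lines):
    counts = {}
    for line in lines:
        process_line(line.lower(), counts)
    return counts


def profile_report(lines, top=5):
    """Profile count_all and return the report text, sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    count_all(lines)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


# Synthetic input: the same sentence repeated 1000 times.
report = profile_report(["The quick brown fox jumps over the lazy dog\n"] * 1000)
print(report)
```

The report lists the call count and time of each function, which is how call counts for functions such as get and lower can be read off. A whole script can also be profiled without changing it by running python -m cProfile -s cumulative followed by the script name.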

Finally, we used the graphical tool graphviz to draw the detailed time breakdown:

[Figure: step0]

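From the figure above we can see that the text has a little over 9,000 lines: lower and re.sub were each called 9,023 times, and the per-letter get was called 1,765,982 times. Indexing one letter at a time like this is too slow. Looking for a different approach, we turned to regular expressions: iterate over the alphabet and match each letter with a regular expression. That gave us the second version of the function: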

###################################################################################
#Name:CountLetters
#Inputs:file name, number of entries to show, stop-word file name, verb file name
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
import os
import re
import time
from string import ascii_lowercase as letters   # the 26 letters to count

def CountLetters(file_name,n,stopName,verbName):
    print("File name:" + os.path.abspath(file_name))
    # stopName is kept in the signature but has no effect here:
    # stop words are whole words, not letters.
    if verbName is not None:
        print("Verb tenses normalizing is not supported in this function!")
    totalNum = 0
    dicNum = {}
    t0 = time.perf_counter()   # time.clock() was removed in Python 3.8
    with open(file_name) as f:
        txt = f.read().lower()
    for letter in letters:
        dicNum[letter] = len(re.findall(letter,txt))
        totalNum += dicNum[letter]
    # Sort alphabetically first so that ties keep a stable order, then by count
    dicNum = sorted(dicNum.items(), key=lambda k: k[0])
    dicNum = sorted(dicNum, key=lambda k: k[1], reverse=True)
    t1 = time.perf_counter()
    # display() is the project's table-printing helper; it is not shown here
    display(dicNum[:n],'character',totalNum,9)
    print("Time Consuming:%4f" % (t1 - t0))

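This function cut the running time from 1.14 s to 0.2 s. Repeating the unit tests and performance analysis from before (I will not paste the results again) confirmed that the code still runs correctly at 100% code coverage, and showed that the regular expression is now the most time-consuming part. So we looked for yet another solution and eventually found the string's built-in count method. We replaced the regular expression with it, that is, we replaced this line of the code above: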

dicNum[letter] = len(re.findall(letter,txt))

with

dicNum[letter] = txt.count(letter) #here count is faster than re

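This brought the time down to 5.83e-5 s, an improvement of several orders of magnitude. At this point the optimization has essentially hit its limit, and we could not speed it up further.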
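
The three approaches discussed so far can be compared side by side. The sketch below is an illustration rather than the project's code: the function names are invented for this example, the sample text is synthetic, and the timings it prints depend on the machine and the input, so they should not be expected to match the figures quoted in this post.

```python
import re
import string
import timeit


def count_by_loop(text):
    # Version 1: visit every character and update a dictionary.
    counts = dict.fromkeys(string.ascii_lowercase, 0)
    for ch in text:
        if ch in counts:
            counts[ch] += 1
    return counts


def count_by_regex(text):
    # Version 2: one re.findall pass per letter of the alphabet.
    return {c: len(re.findall(c, text)) for c in string.ascii_lowercase}


def count_by_str_count(text):
    # Version 3: one str.count pass per letter of the alphabet.
    return {c: text.count(c) for c in string.ascii_lowercase}


# Synthetic sample text; a real novel would be read from a file instead.
sample = "the quick brown fox jumps over the lazy dog. " * 2000

for func in (count_by_loop, count_by_regex, count_by_str_count):
    seconds = timeit.timeit(lambda: func(sample), number=5)
    print("{:<20} {:.4f} s for 5 runs".format(func.__name__, seconds))
```

All three functions return the same dictionary, so switching between them changes only the running time. Checking that equality is a cheap regression test after each optimization.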

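Note: later versions added many features, and the code shown here is the version after those additions. To run the original functionality, pass None for the remaining parameters.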

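Step-1: Output the top N most frequent English words in a single file.

First, we need to pin down what counts as a word:

Word: a string that starts with an English letter and consists of English letters and alphanumeric characters. Words are delimited by separators and are case-insensitive. All words are printed in lowercase.

English letters: A-Z, a-z
Alphanumeric characters: A-Z, a-z, 0-9
Separators: spaces and non-alphanumeric characters
Example: good123 is a word, 123good is not a word. good, Good and GOOD are the same word.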
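
One way to turn this definition into code is a single regular expression. The sketch below is an assumption about how the rule could be encoded, not the implementation used in this project; WORD_RE and top_words are names invented for the example. The lookbehind rejects a letter that directly follows a letter or digit, which is what rules out 123good.

```python
import re
from collections import Counter

# A word starts with a letter that is not preceded by a letter or digit,
# and continues with any run of letters and digits.
WORD_RE = re.compile(r"(?<![a-z0-9])[a-z][a-z0-9]*")


def top_words(text, n):
    """Return the n most common words as (word, count) pairs."""
    words = WORD_RE.findall(text.lower())
    return Counter(words).most_common(n)


print(top_words("Good good123 123good GOOD, good!", 2))
# [('good', 3), ('good123', 1)]
```

Everything that is not a letter or digit acts as a separator here, including the underscore, so foo_bar is read as the two words foo and bar.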

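The initial idea was to strip out the assorted symbols, split on spaces to get the words, then iterate over every word in the text, keeping the counts in a dictionary: look up the word's entry each time and add one to its value. The code is as follows: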

#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author: Eron time:2018/10/22 0022
import time
import re
start = time.time()
from string import punctuation           #Temporarily useless
 
'''function: Calculate the word frequency of each line
    input:  line : a string holding one line of text
            counts: a dictionary of running word counts
    output: counts: the same dictionary, keys are words and values are frequencies
    date:2018/10/22
'''
def ProcessLine(line,counts):
    #Replace the punctuation mark with a space
    #line=ReplacePunctuations(line)
    line = re.sub('[^a-z0-9]', ' ', line)
    words = line.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return  counts


'''function: Replace the punctuation marks with spaces
    input:  line : a string holding one line of the original text
    output: line: the same string with its punctuation replaced by spaces
    date:2018/10/22
'''
def ReplacePunctuations(line):
    for ch in line :
        #Create our own symbol list
        tags = [',','.','?','"','“','”','—']
        if ch in tags:
            line=line.replace(ch," "