Hi everyone, I'm W.
I'd been meaning to build this for a long time, but I could never find the past exam papers in Word format, so I kept putting it off. Today my roommate reminded me they can be bought on Taobao, and sure enough, there they were. OK, let's do it!
Resources
The papers cost me a whole two yuan. I won't keep them to myself, so I'm sharing them here; feel free to download them and follow along by clicking the link.
Converting everything to DOCX
Some of the Word documents are .doc and some are .docx. To read them uniformly, we first need to convert the .doc files to .docx. The tool used here is pypiwin32, which you can install from the command line:
pip install pypiwin32
The pypiwin32 library lets us drive Word to convert .doc to .docx; we then use python-docx to read the resulting .docx files:
pip install python-docx
With these two libraries installed, we can convert the files:
import os
import win32com.client as wc

def read_word(base_path, file_name_list):
    print(file_name_list)
    for file_name in file_name_list:
        word = wc.Dispatch('Word.Application')
        print(base_path + file_name + ".doc")
        doc = word.Documents.Open(base_path + file_name + ".doc")
        # 12 is Word's wdFormatXMLDocument constant, i.e. save as .docx
        doc.SaveAs("D:/My_IDE/PyCharm/Project/python_basic/英语二词频分析/data/English_2_new/" + file_name + ".docx",
                   12, False, "", True, "", False, False, False, False)
        doc.Close()
        word.Quit()

if __name__ == '__main__':
    base_path = "D:/My_IDE/PyCharm/Project/python_basic/英语二词频分析/data/English_2_origin/"
    file_name_list = []
    for file_name in os.listdir(base_path):
        print(base_path + file_name)
        if file_name.endswith(".doc"):
            file_name_list.append(file_name.replace(".doc", ""))
    read_word(base_path, file_name_list)
There's also a blog post worth reading on this: reading Word files with Python.
Reading DOCX files
Read the files with python-docx:
from docx import Document

def read_DOCX(base_path, file_name):
    document = Document(base_path + file_name)
    full_text = ""
    for paragraph in document.paragraphs:
        # print(paragraph.text)
        full_text += paragraph.text
    return full_text
Overall flow
This main block shows the overall flow. Since it's a simple little tool, the code isn't heavily commented; it should be easy to follow:
if __name__ == '__main__':
    base_path = "D:/My_IDE/PyCharm/Project/python_basic/英语二词频分析/data/English_2_new/"
    file_name_list = os.listdir(base_path)
    total_word_dict = {}
    total_text = ""
    exclude_list = read_exclude_text()
    # 1. read the files
    for file_name in file_name_list:
        full_text = read_DOCX(base_path, file_name)
        total_text += full_text
        # 2. tokenize
        word_list = jb.cut(full_text)
        for word in word_list:
            if is_alphabet(word):
                total_word_dict[word] = total_word_dict.get(word, 0) + 1
    print(total_word_dict)
    # sort by frequency
    total_word_list = sorted(total_word_dict.items(), key=lambda x: x[1], reverse=True)
    print(total_word_list)
    # open the output once, in append mode, rather than reopening it per word
    with open("word_frequency.txt", 'a', encoding='utf-8') as f:
        for item in total_word_list:
            # TODO: look up a translation for each word
            # (no batch-translation API at hand; one could try driving the Baidu Translate endpoint word by word)
            if item[0] not in exclude_list:
                item = str(item[0]) + " " + str(item[1])
                f.write(item + "\n")
    # 3. word cloud
    draw_wordCloud(total_text)
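The counting-and-ranking pattern above (dict.get with a default, then sorted by count) can be tried on a toy token list, independent of jieba and the Word files:

```python
# Count token frequencies with dict.get, then rank descending by count,
# mirroring the pattern used in the main flow.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
freq = {}
for t in tokens:
    freq[t] = freq.get(t, 0) + 1

ranked = sorted(freq.items(), key=lambda x: x[1], reverse=True)
print(ranked[:2])  # [('the', 3), ('cat', 2)]
```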
A few small issues along the way deserve a mention.
Detecting English tokens
The documents still contain punctuation and Chinese, but we only want English words, so we need a way to test whether a token is English:
def is_alphabet(word):
    # every character must be an ASCII letter; comparing the whole string
    # against a single character (word >= '\u0061' etc.) would wrongly
    # reject multi-letter words such as "zoo"
    return all(('\u0041' <= c <= '\u005a') or ('\u0061' <= c <= '\u007a') for c in word)
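The filter can be sanity-checked on a few tokens. Here is a self-contained sketch that applies the same letter-range test character by character (the helper name is_english_token is just for this illustration):

```python
# Keep only tokens made entirely of ASCII letters.
def is_english_token(token):
    return all(('A' <= c <= 'Z') or ('a' <= c <= 'z') for c in token)

samples = ["economy", "Zebra", "2021", "经济", "co-op", ","]
kept = [t for t in samples if is_english_token(t)]
print(kept)  # ['economy', 'Zebra']
```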
Excluding common words
After tokenizing with jieba, we also need to weed out some trivial words. I chose to put the words to exclude in a txt file; you can add your own in the same format if needed:
def read_exclude_text():
    file_path = "./data/exclud_words.txt"
    exclude_list = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            exclude_list.append(line.strip('\r\n'))
    return exclude_list
Any entry whose item[0] appears in the exclude list is skipped:
if item[0] not in exclude_list:
    item = str(item[0]) + " " + str(item[1])
The exclud_words.txt file looks like this:
the
to
of
a
and
in
that
is
A
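Since the membership test runs once per distinct word, a set gives O(1) lookups instead of scanning a list each time. A small self-contained sketch with sample data only:

```python
# Use a set for O(1) stop-word lookups when filtering the ranked list.
exclude_set = {"the", "to", "of", "a", "and", "in", "that", "is"}

freq = [("the", 120), ("economy", 34), ("of", 90), ("growth", 21)]
kept = [(w, n) for (w, n) in freq if w not in exclude_set]
print(kept)  # [('economy', 34), ('growth', 21)]
```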
Word clouds and jieba are both simple
I'll point you to a couple of blog posts, or you can pick them up just by reading the code:
Complete code
Converting doc to docx:
import os
import win32com.client as wc

def read_word(base_path, file_name_list):
    print(file_name_list)
    for file_name in file_name_list:
        word = wc.Dispatch('Word.Application')
        print(base_path + file_name + ".doc")
        doc = word.Documents.Open(base_path + file_name + ".doc")
        # 12 is Word's wdFormatXMLDocument constant, i.e. save as .docx
        doc.SaveAs("....python_basic/英语二词频分析/data/English_2_new/" + file_name + ".docx",
                   12, False, "", True, "", False, False, False, False)
        doc.Close()
        word.Quit()

if __name__ == '__main__':
    base_path = ".../python_basic/英语二词频分析/data/English_2_origin/"
    file_name_list = []
    for file_name in os.listdir(base_path):
        print(base_path + file_name)
        if file_name.endswith(".doc"):
            file_name_list.append(file_name.replace(".doc", ""))
    read_word(base_path, file_name_list)
Tokenization and word-cloud code:
from wordcloud import WordCloud
import jieba as jb
import os
from docx import Document

def read_DOCX(base_path, file_name):
    document = Document(base_path + file_name)
    full_text = ""
    for paragraph in document.paragraphs:
        # print(paragraph.text)
        full_text += paragraph.text
    return full_text

def is_alphabet(word):
    # every character must be an ASCII letter
    return all(('\u0041' <= c <= '\u005a') or ('\u0061' <= c <= '\u007a') for c in word)

def draw_wordCloud(total_text):
    WC = WordCloud(width=940, height=1080, background_color="white")
    WC.generate(total_text)
    WC.to_file("./data/wordCloud.png")

def read_exclude_text():
    file_path = "./data/exclud_words.txt"
    exclude_list = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            exclude_list.append(line.strip('\r\n'))
    return exclude_list

if __name__ == '__main__':
    base_path = "..../python_basic/英语二词频分析/data/English_2_new/"
    file_name_list = os.listdir(base_path)
    total_word_dict = {}
    total_text = ""
    exclude_list = read_exclude_text()
    # 1. read the files
    for file_name in file_name_list:
        full_text = read_DOCX(base_path, file_name)
        total_text += full_text
        # 2. tokenize
        word_list = jb.cut(full_text)
        for word in word_list:
            if is_alphabet(word):
                total_word_dict[word] = total_word_dict.get(word, 0) + 1
    print(total_word_dict)
    # sort by frequency
    total_word_list = sorted(total_word_dict.items(), key=lambda x: x[1], reverse=True)
    print(total_word_list)
    # open the output once, in append mode, rather than reopening it per word
    with open("word_frequency.txt", 'a', encoding='utf-8') as f:
        for item in total_word_list:
            # TODO: look up a translation for each word
            # (no batch-translation API at hand; one could try driving the Baidu Translate endpoint word by word)
            if item[0] not in exclude_list:
                item = str(item[0]) + " " + str(item[1])
                f.write(item + "\n")
    # 3. word cloud
    draw_wordCloud(total_text)
Project repository
If you'd like to run it directly, you can download the project from GitHub: project link