Hi everyone, I'm W.
I'd been meaning to build this for a long time, but I could never find the past exam papers in Word format, so I kept putting it off. Today my roommate reminded me they can be bought on Taobao, and sure enough, there they were. OK, let's do it!
Resources
The papers cost me a whole two yuan. I won't keep them to myself, so I'm sharing them here; feel free to download them and follow along by clicking the link.
Converting everything to DOCX
Some of the Word documents are .doc and some are .docx. To read them uniformly, we first need to convert the .doc files to .docx. The tool used here is pypiwin32, which you can install from the command line:
pip install pypiwin32
The pypiwin32 library lets us drive Word to convert .doc to .docx; we then use python-docx to read the resulting .docx files:
pip install python-docx
With these two libraries installed, we can convert the files:
import os
import win32com.client as wc

def read_word(base_path, file_name_list):
    print(file_name_list)
    for file_name in file_name_list:
        word = wc.Dispatch('Word.Application')
        print(base_path + file_name + ".doc")
        doc = word.Documents.Open(base_path + file_name + ".doc")
        # 12 is Word's wdFormatXMLDocument constant, i.e. save as .docx
        doc.SaveAs("D:/My_IDE/PyCharm/Project/python_basic/英语二词频分析/data/English_2_new/" + file_name + ".docx",
                   12, False, "", True, "", False, False, False, False)
        doc.Close()
        word.Quit()

if __name__ == '__main__':
    base_path = "D:/My_IDE/PyCharm/Project/python_basic/英语二词频分析/data/English_2_origin/"
    file_name_list = []
    for file_name in os.listdir(base_path):
        print(base_path + file_name)
        if file_name.endswith(".doc"):
            file_name_list.append(file_name.replace(".doc", ""))
    read_word(base_path, file_name_list)
There's also a blog post worth reading on this: reading Word files with Python.
Reading DOCX files
Read the files with python-docx:
from docx import Document

def read_DOCX(base_path, file_name):
    document = Document(base_path + file_name)
    full_text = ""
    for paragraph in document.paragraphs:
        # print(paragraph.text)
        full_text += paragraph.text
    return full_text
Overall flow
This main block shows the overall flow. Since it's a simple little tool, the code isn't heavily commented; it should be easy to follow:
if __name__ == '__main__':
    base_path = "D:/My_IDE/PyCharm/Project/python_basic/英语二词频分析/data/English_2_new/"
    file_name_list = os.listdir(base_path)
    total_word_dict = {}
    total_text = ""
    exclude_list = read_exclude_text()
    # 1. read the files
    for file_name in file_name_list:
        full_text = read_DOCX(base_path, file_name)
        total_text += full_text
        # 2. tokenize
        word_list = jb.cut(full_text)
        for word in word_list:
            if is_alphabet(word):
                total_word_dict[word] = total_word_dict.get(word, 0) + 1
    print(total_word_dict)
    # sort by frequency
    total_word_list = sorted(total_word_dict.items(), key=lambda x: x[1], reverse=True)
    print(total_word_list)
    # open the output once, in append mode, rather than reopening it per word
    with open("word_frequency.txt", 'a', encoding='utf-8') as f:
        for item in total_word_list:
            # TODO: look up a translation for each word
            # (no batch-translation API at hand; one could try driving the Baidu Translate endpoint word by word)
            if item[0] not in exclude_list:
                item = str(item[0]) + " " + str(item[1])
                f.write(item + "\n")
    # 3. word cloud
    draw_wordCloud(total_text)
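The counting-and-ranking pattern above (dict.get with a default, then sorted by count) can be tried on a toy token list, independent of jieba and the Word files:

```python
# Count token frequencies with dict.get, then rank descending by count,
# mirroring the pattern used in the main flow.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
freq = {}
for t in tokens:
    freq[t] = freq.get(t, 0) + 1

ranked = sorted(freq.items(), key=lambda x: x[1], reverse=True)
print(ranked[:2])  # [('the', 3), ('cat', 2)]
```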
A few small issues along the way deserve a mention.
Detecting English tokens
The documents still contain punctuation and Chinese, but we only want English words, so we need a way to test whether a token is English:
def is_alphabet(word):
    # every character must be an ASCII letter; comparing the whole string
    # against a single character (word >= '\u0061' etc.) would wrongly
    # reject multi-letter words such as "zoo"
    return all(('\u0041' <= c <= '\u005a') or ('\u0061' <= c <= '\u007a') for c in word)
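The filter can be sanity-checked on a few tokens. Here is a self-contained sketch that applies the same letter-range test character by character (the helper name is_english_token is just for this illustration):

```python
# Keep only tokens made entirely of ASCII letters.
def is_english_token(token):
    return all(('A' <= c <= 'Z') or ('a' <= c <= 'z') for c in token)

samples = ["economy", "Zebra", "2021", "经济", "co-op", ","]
kept = [t for t in samples if is_english_token(t)]
print(kept)  # ['economy', 'Zebra']
```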
Excluding common words
After tokenizing with jieba, we also need to weed out some trivial words. I chose to put the words to exclude in a txt file; you can add your own in the same format if needed:
def read_exclude_text():
    file_path = "./data/exclud_words.txt"
    exclude_list = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            exclude_list.append(line.strip('\r\n'))
    return exclude_list
Any entry whose item[0] appears in the exclude list is skipped:
if item[0] not in exclude_list:
    item = str(item[0]) + " " + str(item[1])
The exclud_words.txt file looks like this:
the
to
of
a
and
in
that
is
A
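Since the membership test runs once per distinct word, a set gives O(1) lookups instead of scanning a list each time. A small self-contained sketch with sample data only:

```python
# Use a set for O(1) stop-word lookups when filtering the ranked list.
exclude_set = {"the", "to", "of", "a", "and", "in", "that", "is"}

freq = [("the", 120), ("economy", 34), ("of", 90), ("growth", 21)]
kept = [(w, n) for (w, n) in freq if w not in exclude_set]
print(kept)  # [('economy', 34), ('growth', 21)]
```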
Word clouds and jieba are both simple
I'll point you to a couple of blog posts, or you can pick them up just by reading the code:
Complete code
Converting doc to docx:
import os
import win32com.client as wc

def read_word(base_path, file_name_list):
    print(file_name_list)
    for file_name in file_name_list:
        word = wc.Dispatch('Word.Application')
        print(base_path + file_name + ".doc")
        doc = word.Documents.Open(base_path + file_name + ".doc")
        # 12 is Word's wdFormatXMLDocument constant, i.e. save as .docx
        doc.SaveAs("....python_basic/英语二词频分析/data/English_2_new/" + file_name + ".docx",
                   12, False, "", True, "", False, False, False, False)
        doc.Close()
        word.Quit()

if __name__ == '__main__':
    base_path = ".../python_basic/英语二词频分析/data/English_2_origin/"
    file_name_list = []
    for file_name in os.listdir(base_path):
        print(base_path + file_name)
        if file_name.endswith(".doc"):
            file_name_list.append(file_name.replace(".doc", ""))
    read_word(base_path, file_name_list)
Tokenization and word-cloud code:
from wordcloud import WordCloud
import jieba as jb
import os
from docx import Document

def read_DOCX(base_path, file_name):
    document = Document(base_path + file_name)
    full_text = ""
    for paragraph in document.paragraphs:
        # print(paragraph.text)
        full_text += paragraph.text
    return full_text

def is_alphabet(word):
    # every character must be an ASCII letter
    return all(('\u0041' <= c <= '\u005a') or ('\u0061' <= c <= '\u007a') for c in word)

def draw_wordCloud(total_text):
    WC = WordCloud(width=940, height=1080, background_color="white")
    WC.generate(total_text)
    WC.to_file("./data/wordCloud.png")

def read_exclude_text():
    file_path = "./data/exclud_words.txt"
    exclude_list = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            exclude_list.append(line.strip('\r\n'))
    return exclude_list

if __name__ == '__main__':
    base_path = "..../python_basic/英语二词频分析/data/English_2_new/"
    file_name_list = os.listdir(base_path)
    total_word_dict = {}
    total_text = ""
    exclude_list = read_exclude_text()
    # 1. read the files
    for file_name in file_name_list:
        full_text = read_DOCX(base_path, file_name)
        total_text += full_text
        # 2. tokenize
        word_list = jb.cut(full_text)
        for word in word_list:
            if is_alphabet(word):
                total_word_dict[word] = total_word_dict.get(word, 0) + 1
    print(total_word_dict)
    # sort by frequency
    total_word_list = sorted(total_word_dict.items(), key=lambda x: x[1], reverse=True)
    print(total_word_list)
    # open the output once, in append mode, rather than reopening it per word
    with open("word_frequency.txt", 'a', encoding='utf-8') as f:
        for item in total_word_list:
            # TODO: look up a translation for each word
            # (no batch-translation API at hand; one could try driving the Baidu Translate endpoint word by word)
            if item[0] not in exclude_list:
                item = str(item[0]) + " " + str(item[1])
                f.write(item + "\n")
    # 3. word cloud
    draw_wordCloud(total_text)
Project repository
If you'd like to run it directly, you can download the project from GitHub: project link