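I want to learn English! Following the advice from 奶爸的小屋, my plan is to first memorize all the words in one American TV series and then watch it with English subtitles. So I started hunting for scripts, and once I had found the scripts for seasons 1-8 I began analyzing them.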
Converting to txt format
I used win32com from pywin32; a detailed tutorial is in an article I wrote earlier. The code is:
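from win32com import client
import os

# Get every file in the "dh" folder, which holds the downloaded scripts
path = os.getcwd()
dh_path = os.path.join(path, "dh")
doc_files = os.listdir(dh_path)

# Convert each document to a txt file through Word
word = client.Dispatch("Word.Application")
for doc_file in doc_files:
    file_path = os.path.join(dh_path, doc_file)
    print(file_path, "loading...", end=" ")
    doc = word.Documents.Open(file_path)
    # 2 is wdFormatText, i.e. save as plain text
    doc.SaveAs(os.path.join(path, os.path.splitext(doc_file)[0]) + '.txt', 2)
    doc.Close()
    print("done")
# Shut down the Word instance once every file has been converted
word.Quit()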
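The conversion loop hands every entry of the folder to Word, so a stray non-Word file or one of Word's "~$" lock files would make it fail. Below is a minimal sketch of one way to filter the file list first. The helper name plan_conversions, the sample file names, and the decision to also accept .docx are my own assumptions, not part of the original script.

```python
import os

def plan_conversions(names, src_dir, out_dir):
    """Map each Word document in names to (source path, target txt path).

    Hypothetical helper: it only builds paths and does not touch Word.
    """
    jobs = []
    for name in sorted(names):
        stem, ext = os.path.splitext(name)
        # Skip non-Word files and the "~$" lock files Word leaves behind
        if ext.lower() not in (".doc", ".docx") or name.startswith("~$"):
            continue
        jobs.append((os.path.join(src_dir, name),
                     os.path.join(out_dir, stem + ".txt")))
    return jobs

# Made-up file names, for illustration only
names = ["s01e01.doc", "~$s01e01.doc", "notes.pdf", "S01E02.DOCX"]
for src, dst in plan_conversions(names, "dh", "."):
    print(src, "->", dst)
```

In the loop, each pair could then be passed to word.Documents.Open and doc.SaveAs in place of the raw os.listdir result.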
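Word frequency analysis
The idea comes from Zhihu and is very simple: turn every non-English character into a newline, then count the words by incrementing a dictionary entry for each one. The code is: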
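import re, os

files = os.listdir()
all_text = ""
for fil in files:
    # Read only the converted scripts; skip count.txt left over from an earlier run
    if os.path.splitext(fil)[1] == '.txt' and fil != 'count.txt':
        with open(os.path.join(os.getcwd(), fil)) as f:
            print('loading', fil)
            # Replace every non-letter with a newline; the trailing newline
            # keeps the last word of one file apart from the first word of the next
            all_text = all_text + re.sub('[^a-zA-Z]', '\n', f.read()) + '\n'

# Count how often each word occurs
D = dict()
for i in all_text.split():
    D[i] = D.get(i, 0) + 1

# Write the words from most to least frequent, skipping words seen only once
with open(os.path.join(os.getcwd(), 'count.txt'), 'w') as f2:
    for key, value in sorted(D.items(), key=lambda item: item[1], reverse=True):
        if value == 1:
            continue
        f2.write(key + " " + str(value))
        f2.write("\n")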
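The same counting idea can be tried on a short string before running it over all the scripts. This is a minimal sketch using collections.Counter from the standard library. The function name count_words, the min_count parameter, and the sample sentence are my own assumptions. Lowercasing is also an addition: the script above is case-sensitive, so it counts "The" and "the" separately.

```python
import re
from collections import Counter

def count_words(text, min_count=2):
    """Return (word, count) pairs, most frequent first.

    Hypothetical helper; min_count=2 mirrors the script's rule of
    dropping words that appear only once.
    """
    # Lowercasing is an assumption added here, not in the original script
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [(w, n) for w, n in counts.most_common() if n >= min_count]

# Made-up sample text, for illustration only
sample = "The doctor said: the Doctor will see the patient. Doctor who?"
for word, n in count_words(sample):
    print(word, n)
```

As in the original approach, an apostrophe counts as a non-letter, so a word like "don't" is split into "don" and "t".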