Preprocessing task: split the text into words, count each word's frequency and sort by it, then encode the words, assigning ids in order of decreasing frequency, starting at 0 with a step of 1. When the text is fed in again, each word can then be replaced by its corresponding id.
import codecs
import collections
import re
from operator import itemgetter
data_path = "Lord of the rings.txt"  # input text
vocab_path = "vocab.txt"             # vocabulary file
output_path = "train.txt"            # ids of the words in each sentence
counter = collections.Counter()      # empty counter used to tally word frequencies
with codecs.open(data_path, "r", "utf-8") as f:
    for line in f:
        for word in re.split(r"\W+", line.strip()):
            if word:  # skip empty strings produced by the split
                counter[word] += 1
# no explicit f.close() needed: the with statement closes the file
sorted_word_to_cnt = sorted(counter.items(), key=itemgetter(1), reverse=True)  # sort by word frequency, highest first
sorted_words = [x[0] for x in sorted_word_to_cnt]
if len(sorted_words) > 10000:
    sorted_words = sorted_words[:10000]  # keep only the 10,000 most frequent words
with codecs.open(vocab_path, 'w', 'utf-8') as file_output:
    for word in sorted_words:
        file_output.write(word + '\n')
with codecs.open(vocab_path, 'r', 'utf-8') as f_vocab:
vocab = [w.strip() for w in f_vocab.readlines()]
word_to_id = {k: v for (k, v) in zip(vocab, range(len(vocab)))}
def get_id(word):
    # words outside the truncated vocabulary fall back to an out-of-vocabulary id
    return word_to_id.get(word, len(word_to_id))
fin = codecs.open(data_path, 'r', 'utf-8')
fout = codecs.open(output_path, 'w', 'utf-8')
for line in fin:
    words = [w for w in re.split(r"\W+", line.strip()) if w]
    out_line = ' '.join(str(get_id(w)) for w in words) + '\n'  # look up each word's id
    fout.write(out_line)
fin.close()
fout.close()
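Before looking at the outputs, the frequency-rank encoding itself can be sanity-checked on a toy sentence (a minimal, self-contained sketch; the sentence here is made up purely for illustration):

```python
import collections
import re
from operator import itemgetter

toy_text = "the ring is the one ring"
# count word frequencies, skipping empty strings from the split
counter = collections.Counter(w for w in re.split(r"\W+", toy_text) if w)
# sort by frequency, highest first; ids are assigned in rank order starting at 0
ranked = sorted(counter.items(), key=itemgetter(1), reverse=True)
word_to_id = {word: i for i, (word, _) in enumerate(ranked)}
print(word_to_id)  # → {'the': 0, 'ring': 1, 'is': 2, 'one': 3}
```

Ties ("is" and "one" both appear once) keep their first-seen order because Python's sort is stable.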
The input text looks like this:
The resulting word list (sorted by frequency, high to low; only the first few lines are shown):
Feeding the text in again, the ids of the words in each sentence come out as follows (only the first few lines are shown):
We can run a quick test to check the result:
print(sorted_words[746] + " " + sorted_words[1053] + " " + sorted_words[448] + " "
      + sorted_words[1053] + " " + sorted_words[448] + " " + sorted_words[1053] + " "
      + sorted_words[332] + " " + sorted_words[4769] + " " + sorted_words[104])
Book 1 Chapter 1 Chapter 1 Minas Tirith Pippin looked
This clearly matches the first few words of the first line of the input text.
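Since ids are assigned by frequency rank, the vocabulary list itself is the inverse mapping (line number = id), so a whole line of train.txt can be decoded back into words. A minimal sketch (the helper `decode_line` is introduced here for illustration and is not part of the script above):

```python
def decode_line(line, id_to_word):
    """Map one line of space-separated ids back to the original words."""
    return " ".join(id_to_word[int(x)] for x in line.split())

# Hypothetical usage, assuming vocab.txt and train.txt were produced above:
#   with codecs.open("vocab.txt", "r", "utf-8") as f:
#       id_to_word = [w.strip() for w in f]   # line number == id
#   with codecs.open("train.txt", "r", "utf-8") as f:
#       for line in f:
#           print(decode_line(line, id_to_word))
```

Decoding the first few lines this way should reproduce the opening words of the input text, up to the punctuation discarded by the `\W+` split.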