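I came across a Tang poetry generation tutorial in an online video course and typed out the code by hand, following its example. Simply retyping what a tutorial shows only teaches you so much, though, so I started wondering: what about acrostic poems (藏头诗), where the first character of each line is fixed in advance? How would you generate those? Other people have already published implementations online, but the source code I tried was not very pleasant to use, so I built my own version by modifying my earlier Tang poetry code (code download).
Below I walk through the approach: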
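1. Data preprocessing
Here we read the data file (the training set consists entirely of Tang poems) and turn each poem into a vector of integer IDs. The code is as follows: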
import collections

start_token = 'G'
end_token = 'E'


def process_poems(file_name):
    poems = []
    with open(file_name, "r", encoding='utf-8') as f:
        for line in f:
            try:
                title, content = line.strip().split(':')
                content = content.replace(' ', '')
                # Skip poems containing annotation marks or the reserved start/end tokens
                if '_' in content or '(' in content or '《' in content or '[' in content or \
                        start_token in content or end_token in content:
                    continue
                # Skip poems that are too short or too long
                if len(content) < 5 or len(content) > 79:
                    continue
                content = start_token + content + end_token
                poems.append(content)
            except ValueError:
                # The line did not split into exactly "title:content"; skip it
                pass
    # Sort the poems by length (the key must measure each poem, not the last line read)
    poems = sorted(poems, key=len)
    # Count how often each character occurs
    all_words = []
    for poem in poems:
        all_words += [word for word in poem]
    # counter holds the frequency of every character
    counter = collections.Counter(all_words)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])
    words, _ = zip(*count_pairs)
    # print(words)
    # Keep the top N characters (here, all of them) and append a space used for padding
    words = words[:len(words)] + (' ',)
    # print(words)
    # Map every character to an integer ID; unknown characters get len(words)
    word_int_map = dict(zip(words, range(len(words))))
    poems_vector = [list(map(lambda word: word_int_map.get(word, len(words)), poem)) for poem in poems]
    return poems_vector, word_int_map, words
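To make the character-to-ID mapping easier to picture, here is a minimal sketch of the same idea on a toy corpus. The names `toy_poems`, `build_vocab`, and `encode` are hypothetical and not part of the original code, and the tie-breaking rule in the sort is an assumption added only so the result is deterministic.

```python
import collections

START, END = 'G', 'E'
# Hypothetical two-poem stand-in for the real Tang poem file.
toy_poems = [START + '白日依山尽' + END, START + '山中白云' + END]


def build_vocab(poems):
    """Map each character to an integer ID, most frequent characters first."""
    counter = collections.Counter(ch for poem in poems for ch in poem)
    # Sort by descending frequency; ties are broken by the character itself
    # (an assumption of this sketch, so the ordering is reproducible).
    pairs = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    # The trailing space is the padding character, so it gets the last ID.
    words = tuple(ch for ch, _ in pairs) + (' ',)
    return dict(zip(words, range(len(words)))), words


def encode(poem, word_to_id):
    """Turn a poem into a list of IDs; unseen characters map to len(word_to_id)."""
    unknown_id = len(word_to_id)
    return [word_to_id.get(ch, unknown_id) for ch in poem]


word_to_id, words = build_vocab(toy_poems)
vectors = [encode(poem, word_to_id) for poem in toy_poems]
```

Because `words` is ordered by ID, a vector can be decoded back into text with `''.join(words[i] for i in vector)`, which is handy when checking generated poems.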
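2. Generating batches
Note that poems differ in length, so within each batch the shorter poems must be padded to a common length. Once that is done, x_data is straightforward to build, and y_data is simply x_data shifted by one position, so that each target is the character following the corresponding input.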
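The padding and the one-position shift can be sketched on a tiny batch in plain Python. The helper `pad_and_shift` and the ID values are hypothetical, and keeping the last input value in the final target column is an assumption of this sketch rather than something the text specifies.

```python
def pad_and_shift(batch, pad_id):
    """Pad every sequence to the longest one, then build next-character targets."""
    length = max(len(seq) for seq in batch)
    # Inputs: each sequence padded on the right with pad_id.
    x_data = [seq + [pad_id] * (length - len(seq)) for seq in batch]
    # Targets: each input row shifted left by one position. The final column
    # has no successor, so it repeats the row's last value (assumption).
    y_data = [row[1:] + row[-1:] for row in x_data]
    return x_data, y_data


# Two encoded "poems" of different lengths; 9 plays the role of the padding ID.
x, y = pad_and_shift([[1, 5, 6, 0], [1, 7, 0]], pad_id=9)
# x is [[1, 5, 6, 0], [1, 7, 0, 9]]
# y is [[5, 6, 0, 0], [7, 0, 9, 9]]
```

In practice the same operation is usually done with array slicing on a whole batch at once; the list version above is only meant to show the data layout.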
def generate_batch(batch_size, poems_vec, word_to_int):
    # Take batch_size poems per training step (64 in this setup)
    n_chunk = len(poems_vec) // batch_size
    x_batches = []
    y_batches = []
    for i in range(n_chunk):
        start_index = i * batch_size
        end_index = start_index + bat