[Learning Keras from the Official Examples] Character-level LSTM Text Generation
Note: This series is only meant to help you quickly understand the frameworks and learn to use them independently for deep-learning work; please fill in the theory on your own. The official classic examples for each framework are very well written and well worth studying. Once you fully understand an official example, adapting it is usually enough to solve most common tasks of the same kind.
Abstract: character-level text generation with an LSTM
1 Introduction
This example shows how to use an LSTM model for character-by-character text generation. At least 20 epochs of training are needed before the generated text starts to sound locally coherent. A GPU is recommended, since the recurrent computation is expensive.
If you want to try this on new data, make sure your corpus contains at least ~100k characters; ~1M is better.
2 Setup
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import random
import io
3 Prepare the data
The dataset can be downloaded here:
https://s3.amazonaws.com/text-datasets/nietzsche.txt
If the raw text is
I have an apple today
then with maxlen = 9 and step = 3 the training pairs are built like this:
x | y |
---|---|
`I have an` | `' '` (space) |
`ave an ap` | `p` |
` an apple` | `' '` (space) |
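To make the windowing concrete, here is a minimal sketch of the same slicing on the toy sentence (`toy_text` is our name; `sentences` and `next_chars` mirror the real code below):

```python
toy_text = "I have an apple today"
maxlen, step = 9, 3

sentences, next_chars = [], []
for i in range(0, len(toy_text) - maxlen, step):
    sentences.append(toy_text[i : i + maxlen])  # input window of maxlen chars
    next_chars.append(toy_text[i + maxlen])     # the single char to predict

print(sentences[:3])   # ['I have an', 'ave an ap', ' an apple']
print(next_chars[:3])  # [' ', 'p', ' ']
```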
# path = keras.utils.get_file(
# "nietzsche.txt", origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt"
# )
path = './input/nietzsche.txt'
with io.open(path, encoding="utf-8") as f:
text = f.read().lower()
text = text.replace("\n", " ")  # we remove newline chars for nicer display
print("Corpus length:", len(text))
chars = sorted(list(set(text)))
print("Total chars:", len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
sentences.append(text[i : i + maxlen])
next_chars.append(text[i + maxlen])
print("Number of sequences:", len(sentences))
# np.bool was removed in recent NumPy releases; use the builtin bool instead
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
for t, char in enumerate(sentence):
x[i, t, char_indices[char]] = 1
y[i, char_indices[next_chars[i]]] = 1
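A quick sanity check, not part of the original example, to confirm that the one-hot encoding round-trips back to text:

```python
# x[0] has shape (maxlen, len(chars)); argmax over the last axis
# recovers each character index, which we map back through indices_char.
decoded = "".join(indices_char[idx] for idx in x[0].argmax(axis=-1))
assert decoded == sentences[0]
print("First window:", repr(decoded), "-> next char:", repr(next_chars[0]))
```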
4 Prepare the text sampling function
Define the text sampling function. The parameter worth noting here is temperature: the lower the value, the more conservative and repetitive the predictions; the higher the value, the more varied and surprising the sampled characters.
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype("float64")
    # rescale the distribution: take logs, divide by temperature,
    # then re-normalize with a softmax
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw a single sample from the rescaled distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
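To see what the rescaling actually does, here is a small illustration on a made-up probability vector (this toy example is ours, not part of the original): low temperature sharpens the distribution, high temperature flattens it.

```python
import numpy as np

toy = np.array([0.5, 0.3, 0.2])  # hypothetical output probabilities
for t in [0.2, 1.0, 2.0]:
    rescaled = np.exp(np.log(toy) / t)
    rescaled /= rescaled.sum()
    print(t, np.round(rescaled, 3))
# 0.2 -> [0.919 0.071 0.009]  (sharper: index 0 is sampled almost always)
# 1.0 -> [0.5   0.3   0.2  ]  (unchanged)
# 2.0 -> [0.415 0.322 0.263]  (flatter: samples are more diverse)
```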
5 Build the model: a single LSTM layer
model = keras.Sequential(
[
keras.Input(shape=(maxlen, len(chars))),
layers.LSTM(128),
layers.Dense(len(chars), activation="softmax"),
]
)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer)
model.summary()
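As a sanity check on the summary: an LSTM layer has 4 * units * (input_dim + units + 1) parameters. Assuming the Nietzsche corpus yields 56 distinct characters, that is 4 * 128 * (56 + 128 + 1) = 94,720 parameters for the LSTM layer, plus 128 * 56 + 56 = 7,224 for the Dense layer.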
6 Train the model
epochs = 40
batch_size = 128
for epoch in range(epochs):
model.fit(x, y, batch_size=batch_size, epochs=1)
print()
print("Generating text after epoch: %d" % epoch)
start_index = random.randint(0, len(text) - maxlen - 1)
for diversity in [0.2, 0.5, 1.0, 1.2]:
print("...Diversity:", diversity)
generated = ""
sentence = text[start_index : start_index + maxlen]
print('...Generating with seed: "' + sentence + '"')
for i in range(400):
x_pred = np.zeros((1, maxlen, len(chars)))
for t, char in enumerate(sentence):
x_pred[0, t, char_indices[char]] = 1.0
preds = model.predict(x_pred, verbose=0)[0]
next_index = sample(preds, diversity)
next_char = indices_char[next_index]
sentence = sentence[1:] + next_char
generated += next_char
print("...Generated: ", generated)
print()
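After training, you will usually want to generate text without rerunning the whole loop. Below is a minimal convenience wrapper (the helper name generate_text and the save path are ours, not part of the official example):

```python
def generate_text(model, seed, length=400, diversity=0.5):
    """Generate `length` characters from a seed string of exactly maxlen chars."""
    assert len(seed) == maxlen
    sentence, generated = seed, ""
    for _ in range(length):
        # one-hot encode the current window, exactly as in the training loop
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.0
        preds = model.predict(x_pred, verbose=0)[0]
        next_char = indices_char[sample(preds, diversity)]
        sentence = sentence[1:] + next_char  # slide the window forward
        generated += next_char
    return generated

model.save("nietzsche_lstm.h5")  # persist the trained weights
print(generate_text(model, seed=text[:maxlen], diversity=0.5))
```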
The training results are quite good, and the effect is even more striking if you switch to a Chinese corpus.