1. Text generation (char-level)
Text generation with an LSTM
A small example to see how an LSTM works in practice.
We use a biography of Winston Churchill as our training corpus.
# -*- coding: utf-8 -*-
'''
Text generation with an RNN, using a biography of Winston Churchill as the corpus.
The prediction task here is simple: given the preceding characters, what is the next one?
For example: "importan" -> "t", "Winsto" -> "n", "Britai" -> "n".
'''
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
raw_text = open("input/Winston_Churchil.txt",encoding="utf-8").read()
raw_text = raw_text.lower()
chars = sorted(list(set(raw_text)))
char_to_int = dict((c,i) for i,c in enumerate(chars))
int_to_char = dict((i,c) for i,c in enumerate(chars))
'''
Build the training set.
We need to turn the raw text into (x, y) pairs we can train on:
x is the preceding characters, y is the character that follows.
'''
seg_length = 100
x = []
y = []
for i in range(0, len(raw_text) - seg_length):
    given = raw_text[i:i + seg_length]
    predict = raw_text[i + seg_length]
    x.append([char_to_int[char] for char in given])
    y.append(char_to_int[predict])
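As a sanity check, the sliding-window construction above can be reproduced on a toy string (the string and window length below are made up purely for illustration):

```python
# Toy corpus and a short window, standing in for the real text and seg_length.
toy_text = "hello world"
toy_seg = 3

toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}

toy_x, toy_y = [], []
for i in range(len(toy_text) - toy_seg):
    given = toy_text[i:i + toy_seg]    # the 3 preceding characters
    predict = toy_text[i + toy_seg]    # the character to predict
    toy_x.append([toy_char_to_int[c] for c in given])
    toy_y.append(toy_char_to_int[predict])

# Each sample is a window of 3 indices; each label is the index of the next char.
print(len(toy_x), len(toy_x[0]))  # 8 windows of length 3
```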
'''
At this point each sample is just a sequence of indices - essentially a bag-of-words-style index representation.
Next we do two things:
1. We already have a numeric (index) representation of the input; we reshape it into the array format the LSTM expects: [samples, time steps, features].
2. For the output, as we saw with Word2Vec, predicting a one-hot vector gives better results than regressing a single numeric y value directly.
'''
n_patterns = len(x)
n_vocab = len(chars)
# Reshape x into the form the LSTM expects
x = np.reshape(x, (n_patterns, seg_length, 1))
# Simple normalization to the 0-1 range
x = x / float(n_vocab)
# Turn the output into one-hot vectors
y = np_utils.to_categorical(y)
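Continuing the toy example, the reshape and one-hot steps can be sketched in plain NumPy (`np.eye` stands in for `np_utils.to_categorical`, which does the same thing for integer labels):

```python
import numpy as np

# Toy index data: 5 samples, window length 3, vocabulary of 8 characters.
toy_x = np.array([[3, 2, 4], [2, 4, 4], [4, 4, 5], [4, 5, 0], [5, 0, 7]])
toy_y = np.array([4, 5, 0, 7, 5])
n_vocab = 8

# [samples, time steps, features], with one feature (the scaled index) per step.
X = np.reshape(toy_x, (len(toy_x), 3, 1)) / float(n_vocab)

# One-hot labels: row i has a 1 in column toy_y[i].
Y = np.eye(n_vocab)[toy_y]

print(X.shape, Y.shape)  # (5, 3, 1) (5, 8)
```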
'''
Build the model (LSTM)
'''
model = Sequential()
model.add(LSTM(256,input_shape=(x.shape[1],x.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1],activation="softmax"))
model.compile(loss="categorical_crossentropy",optimizer="adam")
model.fit(x, y, epochs=50, batch_size=4096)
'''
Test the trained LSTM and see how it performs.
'''
def predict_next(input_array):
    x = np.reshape(input_array, (1, seg_length, 1))
    x = x / float(n_vocab)
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    res = []
    for c in raw_input[(len(raw_input) - seg_length):]:
        res.append(char_to_int[c])
    return res

def y_to_char(y):
    largest_index = y.argmax()
    c = int_to_char[largest_index]
    return c

def generate_article(init, rounds=200):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_char(predict_next(string_to_index(in_string)))
        in_string += n
    return in_string
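One common refinement (not in the original script): instead of always taking the argmax, sample the next character from the predicted distribution with a temperature parameter, which trades determinism for variety. The helper below is a sketch in plain NumPy; `sample_next_index` and `temperature` are names introduced here for illustration:

```python
import numpy as np

def sample_next_index(probs, temperature=1.0, rng=None):
    """Sample an index from a probability vector; low temperature
    sharpens the distribution (approaching argmax as it -> 0)."""
    rng = rng or np.random.default_rng(0)
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature
    exp = np.exp(logits - logits.max())
    p = exp / exp.sum()
    return int(rng.choice(len(p), p=p))

# With a very low temperature the sample matches the argmax.
probs = [0.1, 0.6, 0.3]
print(sample_next_index(probs, temperature=0.01))  # 1
```

Swapping this in for `y.argmax()` inside `y_to_char` would make `generate_article` less prone to repeating the same high-probability loops.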
init = 'His object in coming to New York was to engage officers for that service. He came at an opportune moment'
article = generate_article(init)
print(article)
Training output:
Epoch 1/50
276730/276730 [==============================] - 197s - loss: 3.1120
Epoch 2/50
276730/276730 [==============================] - 197s - loss: 3.0227
Epoch 3/50
276730/276730 [==============================] - 197s - loss: 2.9910
Epoch 4/50
276730/276730 [==============================] - 197s - loss: 2.9337
Epoch 5/50
276730/276730 [==============================] - 197s - loss: 2.8971
Epoch 6/50
276730/276730 [==============================] - 197s - loss: 2.8784
Epoch 7/50
276730/276730 [==============================] - 197s - loss: 2.8640
Epoch 8/50
276730/276730 [==============================] - 197s - loss: 2.8516
Epoch 9/50
276730/276730 [==============================] - 197s - loss: 2.8384
Epoch 10/50
276730/276730 [==============================] - 197s - loss: 2.8254
Epoch 11/50
276730/276730 [==============================] - 197s - loss: 2.8133
Epoch 12/50
276730/276730 [==============================] - 197s - loss: 2.8032
Epoch 13/50
276730/276730 [==============================] - 197s - loss: 2.7913
Epoch 14/50
276730/276730 [==============================] - 197s - loss: 2.7831
Epoch 15/50
276730/276730 [==============================] - 197s - loss: 2.7744
Epoch 16/50
276730/276730 [==============================] - 197s - loss: 2.7672
Epoch 17/50
276730/276730 [==============================] - 197s - loss: 2.7601
Epoch 18/50
276730/276730 [==============================]