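In the previous post we trained on a small, hand-typed dataset. It was too small, so the generated text was poor. This time we train on a much larger corpus.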
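1. Getting the data
The data has the same format as before, there is just more of it. The download is saved to /tmp/irish-lyrics-eof.txt (change the path to suit your setup).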
!wget --no-check-certificate \
https://storage.googleapis.com/laurencemoroney-blog.appspot.com/irish-lyrics-eof.txt \
-O /tmp/irish-lyrics-eof.txt
2. Preprocessing
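# Imports inferred from the names used below (TensorFlow 2.x Keras API)
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
with open('/tmp/irish-lyrics-eof.txt') as f:
    data = f.read()
corpus = data.lower().split("\n")
tokenizer.fit_on_texts(corpus)
# +1 because word indices start at 1; index 0 is reserved for padding
total_words = len(tokenizer.word_index) + 1
print(total_words)

# Turn each line into its n-gram prefixes: [w1, w2], [w1, w2, w3], ...
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label: the last token of each row is the word to predict
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)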
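To make the prefix-and-pad step above concrete, here is a minimal pure-Python sketch of the same idea on a toy input. The token ids are made up for illustration, and the list comprehension that pads with zeros only imitates what pad_sequences(padding='pre') does; it is not the library's implementation.

```python
# Toy stand-in for tokenizer.texts_to_sequences: one list of word ids per line.
# The ids are invented for illustration.
token_lists = [[4, 2, 7], [5, 9]]

# Every prefix of length >= 2 becomes one training example.
sequences = []
for token_list in token_lists:
    for i in range(1, len(token_list)):
        sequences.append(token_list[: i + 1])

# Left-pad with zeros to a common length, imitating padding='pre'.
max_len = max(len(s) for s in sequences)
padded = [[0] * (max_len - len(s)) + s for s in sequences]

# All ids but the last are the input; the last id is the word to predict.
xs = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

for x, y in zip(xs, labels):
    print(x, "->", y)
# [0, 4] -> 2
# [4, 2] -> 7
# [0, 5] -> 9
```

Padding on the left keeps the label in the last column of every row, which is why a single slice can split inputs from labels.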
3. Building the model
This time we use a larger network.
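# Imports inferred from the names used below (TensorFlow 2.x Keras API)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
# `learning_rate` is the current name of the deprecated `lr` argument
adam = Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
# Early stopping is left disabled: it monitors val_loss, and fit() below gets no validation data
#earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
history = model.fit(xs, ys, epochs=100, verbose=1)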
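The model is trained with categorical_crossentropy against one-hot labels, so ys has one column per vocabulary word. The NumPy sketch below, using a made-up vocabulary size and made-up softmax outputs, shows what that encoding looks like and that the loss reduces to the negative log-probability at the label's index. Keras also provides a sparse_categorical_crossentropy loss that takes the integer labels directly; whether switching to it is worthwhile here is a judgment call, not something the original post does.

```python
import numpy as np

total_words = 5                      # toy vocabulary size (assumption)
labels = np.array([2, 4, 1])         # toy next-word ids

# One-hot encode by indexing an identity matrix; the result has the same
# shape as to_categorical(labels, num_classes=total_words).
ys = np.eye(total_words)[labels]

# Made-up softmax outputs, one row per example; each row sums to 1.
probs = np.array([
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.20, 0.20, 0.20, 0.20, 0.20],
    [0.05, 0.80, 0.05, 0.05, 0.05],
])

# Categorical cross-entropy against the one-hot targets...
loss_one_hot = -np.sum(ys * np.log(probs), axis=1)
# ...equals the negative log-probability selected by the integer label.
loss_from_ids = -np.log(probs[np.arange(len(labels)), labels])

print(ys.shape)                                   # (3, 5)
print(np.allclose(loss_one_hot, loss_from_ids))   # True
```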
4. Plotting accuracy
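import matplotlib.pyplot as plt

def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.show()

# The history key follows the metric name passed to compile() ('accuracy');
# older Keras versions recorded it as 'acc'.
plot_graphs(history, 'accuracy')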
5. Prediction
seed_text = "I've got a bad feeling about this"
next_words = 100

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted = model.predict(token_list, verbose=0)
    # Index of the most probable next word, as a plain int
    predicted_index = int(np.argmax(predicted, axis=1)[0])
    output_word = ""
    # Scan the vocabulary for the word whose index matches the prediction,
    # then append that word to the sentence
    for word, index in tokenizer.word_index.items():
        if index == predicted_index:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)
Result:
I’ve got a bad feeling about this half dozen stout died and laid the song of the warrior bard shaken silver home now since i spent up in dublin call belfast city and the brother william stood at the door and they ring at me diggin for erin go bragh together again soon i patrick up her pipes bellows chanters and all up with a glass of love easy as the sea came to james connolly cry my a going to a baby on free sweep but away me forget old ireland all care of me darling in strife roaming tomorrow i paid than him went by
Translated into Chinese:
我有一种不好的感觉,因为我在都柏林度过了一个叫贝尔法斯特的城市,威廉兄弟站在门口,他们给我打电话,艾琳·迪金,你们再一起去吹牛吧,很快我拿起她的风箱唱诗班,喝了一杯爱大海来了詹姆斯·康诺利哭了我的一个即将成为一个婴儿的自由扫荡,但带走了我忘记了旧爱尔兰所有的关心我亲爱的在纷争中漫游明天我付出的比他过去的还要多
As you can see, the sentences do not connect logically, but each individual phrase reads fairly smoothly.
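The prediction loop in step 5 scans the whole vocabulary for every generated word. A reverse dictionary built once makes the lookup direct. The sketch below uses a toy word_index and a hypothetical fake_predict function in place of the trained model, so it runs without TensorFlow; both are assumptions for illustration only. The Keras Tokenizer also exposes an index_word attribute that serves the same purpose.

```python
# Toy vocabulary standing in for tokenizer.word_index (word -> id, ids start at 1).
word_index = {"the": 1, "green": 2, "fields": 3, "of": 4, "home": 5}

# Build the reverse map once instead of scanning word_index for every word.
index_word = {index: word for word, index in word_index.items()}

def fake_predict(token_ids):
    """Hypothetical stand-in for the model: proposes the id after the
    last one seen, wrapping around the vocabulary."""
    last = token_ids[-1] if token_ids else 0
    return last % len(word_index) + 1

seed_text = "the green"
for _ in range(4):
    # Words missing from the vocabulary are skipped.
    token_ids = [word_index[w] for w in seed_text.lower().split() if w in word_index]
    predicted_id = fake_predict(token_ids)
    # .get() returns "" for id 0 (padding), which has no word.
    seed_text += " " + index_word.get(predicted_id, "")

print(seed_text)  # the green fields of home the
```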