TensorFlow 2.0: Predicting New Text (Part 2)

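In the previous post we trained on a small, hand-entered dataset. There was too little data, so the generated text was poor. This time we train on a larger dataset before predicting.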

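1. Getting the data

The data format is the same as before, only the dataset is larger. The download is saved to /tmp/irish-lyrics-eof.txt (change the path to suit your needs).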

!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/irish-lyrics-eof.txt \
    -O /tmp/irish-lyrics-eof.txt

2. Preprocessing

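import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()

# Read the whole lyrics file and close it afterwards
with open('/tmp/irish-lyrics-eof.txt') as f:
    data = f.read()

# One training sentence per line of the file, lowercased
corpus = data.lower().split("\n")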

tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

print(total_words)
input_sequences = []
for line in corpus:
	token_list = tokenizer.texts_to_sequences([line])[0]
	for i in range(1, len(token_list)):
		n_gram_sequence = token_list[:i+1]
		input_sequences.append(n_gram_sequence)

# pad sequences 
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]

ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
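To make the preprocessing above concrete, here is a minimal pure-Python sketch of what happens to a single line. The token ids and the vocabulary size of 7 are made up for illustration, and the helper names `make_ngram_prefixes` and `pad_pre` are hypothetical stand-ins for the loop and for `pad_sequences(..., padding='pre')`; this is not the Keras implementation.

```python
def make_ngram_prefixes(token_list):
    """Every prefix of length >= 2, mirroring the inner loop above."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

def pad_pre(seq, maxlen):
    """Left-pad with zeros, like pad_sequences(..., padding='pre')."""
    return [0] * (maxlen - len(seq)) + seq

token_list = [4, 2, 6, 3]   # hypothetical ids for one four-word line
total_words = 7             # hypothetical vocabulary size (+1 for padding index 0)

prefixes = make_ngram_prefixes(token_list)
max_len = max(len(p) for p in prefixes)
padded = [pad_pre(p, max_len) for p in prefixes]

# Everything but the last token is the input; the last token is the label
xs = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

# One-hot encode the labels, as to_categorical does
ys = [[1 if k == label else 0 for k in range(total_words)] for label in labels]

for x, label in zip(xs, labels):
    print(x, "->", label)
```

A four-word line therefore yields three training examples, each asking the model to predict the next word from the words seen so far.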

3. Building the model

This time we use a larger network.

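from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
# Map each word index to a 100-dimensional vector; the input is the padded prefix without its label
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
# One output unit per vocabulary entry
model.add(Dense(total_words, activation='softmax'))
adam = Adam(learning_rate=0.01)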
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
#earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
history = model.fit(xs, ys, epochs=100, verbose=1)
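To get a feel for how large this network is, the parameter counts can be worked out by hand. The sketch below assumes the usual Keras LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector) and a hypothetical vocabulary size of 2000; the real value is whatever `total_words` printed earlier. The helper names are illustrative only.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def model_params(total_words, embed_dim=100, units=150):
    embedding = total_words * embed_dim          # one vector per word index
    bilstm = 2 * lstm_params(embed_dim, units)   # forward + backward LSTM
    dense = (2 * units + 1) * total_words        # weights + bias per output word
    return {
        "embedding": embedding,
        "bilstm": bilstm,
        "dense": dense,
        "total": embedding + bilstm + dense,
    }

counts = model_params(total_words=2000)  # 2000 is a made-up vocabulary size
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the embedding and output layers both grow linearly with the vocabulary, so a larger corpus makes the model larger even when the layer sizes stay fixed. `model.summary()` reports the actual counts.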

4. Plotting accuracy

import matplotlib.pyplot as plt


def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.show()
  
# With metrics=['accuracy'], TensorFlow 2.x stores the metric under the key 'accuracy'
plot_graphs(history, 'accuracy')

[Figure: training accuracy per epoch]

5. Prediction

seed_text = "I've got a bad feeling about this"
next_words = 100
  

for _ in range(next_words):
  token_list = tokenizer.texts_to_sequences([seed_text])[0]
  token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
  predicted = model.predict(token_list, verbose=0)
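  # argmax returns a length-1 array; take the scalar index of the most likely word
  predicted2 = int(np.argmax(predicted, axis=1)[0])
  output_word = ""
  # Scan the word index for the entry whose index matches the predicted label;
  # once found, that word is appended to the sentence
  for word, index in tokenizer.word_index.items():
    if index == predicted2:
      output_word = word
      break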
  seed_text += " " + output_word

print(seed_text)

Result:
I’ve got a bad feeling about this half dozen stout died and laid the song of the warrior bard shaken silver home now since i spent up in dublin call belfast city and the brother william stood at the door and they ring at me diggin for erin go bragh together again soon i patrick up her pipes bellows chanters and all up with a glass of love easy as the sea came to james connolly cry my a going to a baby on free sweep but away me forget old ireland all care of me darling in strife roaming tomorrow i paid than him went by
Chinese translation of the output:
我有一种不好的感觉,因为我在都柏林度过了一个叫贝尔法斯特的城市,威廉兄弟站在门口,他们给我打电话,艾琳·迪金,你们再一起去吹牛吧,很快我拿起她的风箱唱诗班,喝了一杯爱大海来了詹姆斯·康诺利哭了我的一个即将成为一个婴儿的自由扫荡,但带走了我忘记了旧爱尔兰所有的关心我亲爱的在纷争中漫游明天我付出的比他过去的还要多

As you can see, although the sentences do not follow logically from one another, each individual sentence reads fluently.
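The prediction loop in section 5 is a greedy decoder: at each step it picks the single most likely next word and appends it to the seed. The sketch below reproduces that idea in pure Python, using a reverse dictionary instead of scanning `word_index` on every step. The vocabulary and the `fake_predict` function are hypothetical stand-ins for the tokenizer and the trained model, not part of the original code.

```python
word_index = {"the": 1, "sea": 2, "came": 3, "home": 4}   # hypothetical vocabulary
index_word = {i: w for w, i in word_index.items()}        # built once, O(1) lookups

def fake_predict(token_ids):
    """Stand-in for model.predict: one score per vocabulary index (0 = padding)."""
    scores = [0.0] * (len(word_index) + 1)
    last = token_ids[-1] if token_ids else 0
    scores[last % len(word_index) + 1] = 1.0   # toy rule: always favor the "next" id
    return scores

def generate(seed_words, next_words):
    words = list(seed_words)
    for _ in range(next_words):
        # Words missing from the vocabulary are skipped, as texts_to_sequences does by default
        token_ids = [word_index[w] for w in words if w in word_index]
        scores = fake_predict(token_ids)
        best = max(range(len(scores)), key=scores.__getitem__)   # greedy argmax
        words.append(index_word.get(best, ""))                   # "" if padding index wins
    return " ".join(words)

result = generate(["the"], 3)
print(result)
```

With a real Keras tokenizer, `tokenizer.index_word` provides the same reverse mapping. Because greedy decoding always takes the top word, the same seed always yields the same text; sampling from the predicted distribution is one common way to get more varied output.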
