TensorFlow 2.0: Predicting New Text (Part 2)


smallworldxyl

Article link: https://blog.csdn.net/smallworldxyl/article/details/120472218

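In the previous post we trained on a small block of text typed in by hand. With so little data, the generated text was poor, so this time we train on a larger corpus.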

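1. Getting the data

The data has the same format as before; there is simply more of it. The download is saved to /tmp/irish-lyrics-eof.txt (change the path to suit your needs).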

!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/irish-lyrics-eof.txt \
    -O /tmp/irish-lyrics-eof.txt
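
The `!wget` line above only works inside a notebook. Outside one, a minimal sketch of the same download using only the Python standard library could look like this. The helper name `download` is an assumption made for illustration, not part of the original post, and unlike `--no-check-certificate` this version keeps normal certificate verification.

```python
import shutil
import urllib.request


def download(url, destination):
    """Copy the resource at `url` to the local path `destination`."""
    # `download` is a hypothetical helper name, not a library function.
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return destination
```

To fetch the lyrics file, call `download("https://storage.googleapis.com/laurencemoroney-blog.appspot.com/irish-lyrics-eof.txt", "/tmp/irish-lyrics-eof.txt")`.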

2. Preprocessing

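import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()

# Read the whole lyrics file; the context manager closes it afterwards
with open('/tmp/irish-lyrics-eof.txt') as f:
    data = f.read()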

corpus = data.lower().split("\n")

tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

print(total_words)

input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Every prefix of length 2..n becomes one training sequence
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# pad sequences 
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]

ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
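
To make the shape of the training data concrete, here is a small NumPy-only sketch of the same three steps: prefix expansion, left padding, and splitting off the last token as the label. The function name `build_training_arrays` and the token ids are made up for illustration; the real pipeline uses the Keras `Tokenizer` and `pad_sequences` as shown above.

```python
import numpy as np


def build_training_arrays(token_lists):
    """Expand tokenized lines into prefixes, left-pad them, and split off the label."""
    sequences = []
    for tokens in token_lists:
        # A line with n tokens yields n-1 prefixes of length 2..n.
        for i in range(1, len(tokens)):
            sequences.append(tokens[: i + 1])

    max_len = max(len(seq) for seq in sequences)
    padded = np.zeros((len(sequences), max_len), dtype=int)
    for row, seq in enumerate(sequences):
        # 'pre' padding: zeros go on the left, tokens are right-aligned.
        padded[row, max_len - len(seq):] = seq

    # Everything except the last column is the input; the last column is the label.
    return padded[:, :-1], padded[:, -1]


# Two hypothetical tokenized lines.
xs, labels = build_training_arrays([[4, 2, 7, 5], [3, 9]])
print(xs)
print(labels)
```

Right-aligning the tokens keeps the most recent words next to the label, which is why the post pads with `padding='pre'`.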

3. Building the model

This time we use a larger network.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
adam = Adam(learning_rate=0.01)  # `lr` is a deprecated alias in TF 2.x
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
#earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
history = model.fit(xs, ys, epochs=100, verbose=1)
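
To get a feel for how large this network is, the sketch below counts trainable parameters with plain arithmetic, assuming the Keras defaults for these layers (biases enabled, one bias vector per LSTM gate). The vocabulary size of 1000 is a placeholder, since the post does not show the printed value of `total_words`; `model.summary()` reports the real figures.

```python
def count_parameters(total_words, embed_dim=100, lstm_units=150):
    """Estimate trainable parameters for Embedding -> Bidirectional(LSTM) -> Dense."""
    embedding = total_words * embed_dim
    # One LSTM direction has 4 gates, each with input weights,
    # recurrent weights, and a bias vector.
    one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    bilstm = 2 * one_direction
    # The Dense layer sees the concatenated forward and backward states.
    dense = (2 * lstm_units + 1) * total_words
    return {
        "embedding": embedding,
        "bilstm": bilstm,
        "dense": dense,
        "total": embedding + bilstm + dense,
    }


# 1000 is a placeholder vocabulary size, not the value from the post.
print(count_parameters(total_words=1000))
```

The embedding and output layers both grow linearly with the vocabulary, so a larger corpus makes the model bigger even though the LSTM itself stays the same size.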

4. Plotting accuracy

import matplotlib.pyplot as plt


def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.show()
  
# TF 2.x stores the metric under the name passed to compile(), i.e. 'accuracy'
plot_graphs(history, 'accuracy')

[Figure: training accuracy plotted against epochs]

5. Prediction

seed_text = "I've got a bad feeling about this"
next_words = 100
  

for _ in range(next_words):
  token_list = tokenizer.texts_to_sequences([seed_text])[0]
  token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
  predicted = model.predict(token_list, verbose=0)
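  # argmax returns an array of shape (1,); take the single element as a plain int
  predicted_index = int(np.argmax(predicted, axis=1)[0])
  output_word = ""
  # Match the predicted index against the tokenizer's word index;
  # once the word is found, it is appended to the end of the sentence
  for word, index in tokenizer.word_index.items():
    if index == predicted_index:
      output_word = word
      break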
  seed_text += " " + output_word

print(seed_text)

Result:
I’ve got a bad feeling about this half dozen stout died and laid the song of the warrior bard shaken silver home now since i spent up in dublin call belfast city and the brother william stood at the door and they ring at me diggin for erin go bragh together again soon i patrick up her pipes bellows chanters and all up with a glass of love easy as the sea came to james connolly cry my a going to a baby on free sweep but away me forget old ireland all care of me darling in strife roaming tomorrow i paid than him went by

As you can see, the phrases do not connect logically with one another, but each individual phrase reads fairly smoothly.
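
The prediction loop in step 5 is greedy decoding: at every step it appends the single most probable word and feeds the longer text back in. The sketch below traces that loop with a stand-in for `model.predict`, so it runs without TensorFlow. The toy vocabulary and the `fake_predict` function are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved for padding,
# as in the Keras Tokenizer.
word_index = {"the": 1, "wild": 2, "rover": 3, "sings": 4}
index_word = {index: word for word, index in word_index.items()}


def fake_predict(token_ids):
    """Stand-in for model.predict: favours the id that follows the last token."""
    probs = np.full((1, len(word_index) + 1), 0.01)
    next_id = token_ids[-1] % len(word_index) + 1
    probs[0, next_id] = 0.9
    return probs / probs.sum()


def generate(seed_text, next_words):
    """Greedy decoding: append the most probable word, then predict again."""
    text = seed_text
    for _ in range(next_words):
        token_ids = [word_index[w] for w in text.lower().split() if w in word_index]
        probs = fake_predict(token_ids)
        predicted_id = int(np.argmax(probs, axis=1)[0])
        # A reverse dictionary replaces the linear search over word_index.
        text += " " + index_word.get(predicted_id, "")
    return text


print(generate("the", 3))
```

The Keras `Tokenizer` exposes the same reverse mapping as `tokenizer.index_word`, so the inner `for` loop in step 5 could be replaced by a single dictionary lookup.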
