Seq2seq translation model

weixin_43191401

Tags: natural language processing, deep learning

Article link: https://blog.csdn.net/weixin_43191401/article/details/89060431

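Today I read through the LSTM-based translation model and annotated it, for everyone's reference. The English comments come from the original script; the remaining comments are my own annotations (originally written in Chinese).
Dataset:
Link: https://pan.baidu.com/s/1C2iwlEMnH9pSpf73tIBMbA
Extraction code: mq95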
'''
# Sequence to sequence example in Keras (character-level).

This script demonstrates how to implement a basic character-level
sequence-to-sequence model. We apply it to translating
short English sentences into short French sentences,
character-by-character. Note that it is fairly unusual to
do character-level machine translation, as word-level
models are more common in this domain.

Summary of the algorithm

- We start with input sequences from a domain (e.g. English sentences)
  and corresponding target sequences from another domain
  (e.g. French sentences).
- An encoder LSTM turns input sequences to 2 state vectors
  (we keep the last LSTM state and discard the outputs).
- A decoder LSTM is trained to turn the target sequences into
  the same sequence but offset by one timestep in the future,
  a training process called "teacher forcing" in this context.
  It uses as initial state the state vectors from the encoder.
  Effectively, the decoder learns to generate targets[t+1...]
  given targets[...t], conditioned on the input sequence.
- In inference mode, when we want to decode unknown input sequences, we:
    - Encode the input sequence into state vectors
    - Start with a target sequence of size 1
      (just the start-of-sequence character)
    - Feed the state vectors and 1-char target sequence
      to the decoder to produce predictions for the next character
    - Sample the next character using these predictions
      (we simply use argmax).
    - Append the sampled character to the target sequence
    - Repeat until we generate the end-of-sequence character or we
      hit the character limit.

Data download

English to French sentence pairs.
Lots of neat sentence pairs datasets.

References

- Sequence to Sequence Learning with Neural Networks
- Learning Phrase Representations using
  RNN Encoder-Decoder for Statistical Machine Translation
'''
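
The inference procedure summarized in the docstring can be sketched without Keras. This is only a minimal illustration under stated assumptions: `toy_step`, `VOCAB` and `SCRIPT` are hypothetical names that are not part of the script, and `toy_step` stands in for the trained decoder by always steering toward a fixed output. What matters is the shape of the loop: start from the start character, take the argmax, feed it back, and stop on the end character or the length limit.

```python
# Toy vocabulary: '\t' is the start character, '\n' the end character.
VOCAB = ['\t', '\n', 'a', 'b']
SCRIPT = 'ab\n'  # the output the toy decoder is rigged to produce


def toy_step(prev_char, state):
    """Hypothetical stand-in for one decoder step.

    `state` is just a position counter here and `prev_char` is ignored;
    a real decoder would use the previous character and carry the LSTM
    states (h, c) instead.
    """
    next_char = SCRIPT[min(state, len(SCRIPT) - 1)]
    rest = 0.1 / (len(VOCAB) - 1)
    probs = [0.9 if ch == next_char else rest for ch in VOCAB]
    return probs, state + 1


def greedy_decode(initial_state, max_len):
    prev_char = '\t'        # start with the start-of-sequence character
    state = initial_state   # in the real model: state vectors from the encoder
    decoded = ''
    while True:
        probs, state = toy_step(prev_char, state)
        # Greedy sampling: take the index of the most probable character
        best = max(range(len(probs)), key=probs.__getitem__)
        prev_char = VOCAB[best]
        decoded += prev_char
        # Stop on the end character or when the length limit is reached
        if prev_char == '\n' or len(decoded) >= max_len:
            return decoded


full = greedy_decode(0, 10)      # stops on the end character
truncated = greedy_decode(0, 2)  # stops on the length limit
```

In the actual script the same roles are played by `decoder_model.predict`, `np.argmax` and `max_decoder_seq_length`.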
from __future__ import print_function

from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np
from keras.utils import plot_model
batch_size = 64 # Batch size for training.
epochs = 100 # Number of epochs to train for.
latent_dim = 256 # Latent dimensionality of the encoding space.
num_samples = 10000 # Number of samples to train on.

data_path = r'C:\Users\seu\Desktop\deskfile\fra.txt'

input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
# Read the file contents
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

# Split each line into the input text and the target text
# (this assumes exactly two tab-separated fields per line)
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    # Mark the start and the end of every target sentence
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    # Record every character that occurs in the inputs and in the targets
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

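# sorted() orders the characters by Unicode code point (not by frequency),
# which gives every character a fixed position
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
# Number of distinct characters in each of the two languages
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)

# Length of the longest input sentence and of the longest target sentence
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])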

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

# Give every input character and every target character a unique integer ID
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])
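
The vocabulary step can be tried on a few strings in isolation. The sketch below uses made-up sample texts (an assumption, not taken from the dataset) and shows that the IDs follow the sorted order of the characters, which is Unicode code point order and has nothing to do with how often a character occurs.

```python
texts = ['go.', 'hi.', 'hi!']  # made-up sample sentences

# Collect the distinct characters
characters = set()
for text in texts:
    characters.update(text)

# sorted() orders by Unicode code point, not by frequency
characters = sorted(characters)

# Map each character to its position in the sorted list
token_index = {char: i for i, char in enumerate(characters)}
```

Here `'!'` occurs only once but still gets ID 0, because its code point is the smallest.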

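# Encoder input data, one-hot encoded, with shape
# (number of sentences, max_encoder_seq_length, num_encoder_tokens).
# num_encoder_tokens is the number of distinct input characters;
# max_encoder_seq_length plays the role of the number of timesteps.
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
# max_decoder_seq_length is the length of the longest target sentence
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
# The decoder target is also a sequence: one one-hot vector per timestep
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')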

# A seq2seq model outputs a sequence.
# input_token_index[char] is the ID of that character.
# One-hot encode character t of sentence i, for both inputs and targets.
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # The target is the decoder input shifted by one timestep, which
            # trains the network to predict the next character.
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            # The start character is always known, so it is never a target;
            # as a result the last target timestep stays all zeros.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
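
The one-timestep offset is easier to see on a single short sentence. This is a minimal sketch using plain Python lists instead of NumPy arrays; the sample text `'ab'` and the helper names `zero_rows` and `decode_rows` are assumptions made for illustration.

```python
target_text = '\t' + 'ab' + '\n'   # start char + sentence + end char
chars = sorted(set(target_text))   # ['\t', '\n', 'a', 'b']
index = {ch: i for i, ch in enumerate(chars)}


def zero_rows(n_steps, n_tokens):
    """Build an all-zero (n_steps x n_tokens) table."""
    return [[0.0] * n_tokens for _ in range(n_steps)]


decoder_input = zero_rows(len(target_text), len(chars))
decoder_target = zero_rows(len(target_text), len(chars))
for t, ch in enumerate(target_text):
    decoder_input[t][index[ch]] = 1.0
    if t > 0:
        # Same character, one timestep earlier, and no start character
        decoder_target[t - 1][index[ch]] = 1.0


def decode_rows(rows):
    """Map each one-hot row back to its character ('' for an all-zero row)."""
    return [chars[row.index(1.0)] if 1.0 in row else '' for row in rows]


input_chars = decode_rows(decoder_input)
target_chars = decode_rows(decoder_target)
```

At every timestep the target is the character that follows the input character, and the final target row is empty because nothing follows the end character.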

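# Define the encoder input layer.
# num_encoder_tokens is the number of distinct input characters.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
# latent_dim is the number of hidden units in the LSTM
encoder = LSTM(latent_dim, return_state=True)
# Connect the LSTM to the input layer
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# state_h and state_c are the hidden state and the cell state after the
# last timestep. return_sequences is not set on the encoder LSTM, so
# encoder_outputs is only the output of the last timestep (it equals
# state_h) rather than one output per timestep.
# This is what "encoder" means here: it turns a sequence input into a
# fixed-size, non-sequence representation.
encoder_states = [state_h, state_c]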

# Define the decoder input
decoder_inputs = Input(shape=(None, num_decoder_tokens))

decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)

# decoder_outputs holds the LSTM output at every timestep;
# the decoder starts from the encoder's final states
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)

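# Define the output layer of the network.
# It has one unit per character of the target language and is applied to
# the decoder output at every timestep, so the model output is still a
# sequence of shape (batch, timesteps, num_decoder_tokens).
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
print(decoder_outputs.shape)
print("Model output shape: " + str(decoder_outputs.shape))
print('Target data shape: ' + str(decoder_target_data.shape))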

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
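
To make the loss concrete, here is the textbook categorical cross-entropy for a single timestep, written in plain Python. It is a sketch of the formula only, not Keras's exact implementation, and the probabilities in `pred` are made up. It also shows that an all-zero target row, such as the padded timesteps in `decoder_target_data`, yields a loss of zero.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy for one timestep: -sum(y_true * log(y_pred))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


pred = [0.7, 0.2, 0.1]  # made-up softmax output over three characters

# Target is the character the model considers most likely
loss_correct = categorical_crossentropy([1, 0, 0], pred)
# Target is the character the model considers least likely
loss_wrong = categorical_crossentropy([0, 0, 1], pred)
# All-zero target row (a padded timestep)
loss_padding = categorical_crossentropy([0, 0, 0], pred)
```

The loss is the negative log of the probability assigned to the true character, so it is small when the model puts most of its probability mass on the right character.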

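# Note: this run trains for a single epoch; pass epochs=epochs to use
# the 100 epochs configured at the top of the script.
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=1,
          validation_split=0.2)

print(decoder_input_data.shape)
model.save('s2s.h5')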

# Wrap the encoder LSTM as a model of its own: input sequence -> final states
encoder_model = Model(encoder_inputs, encoder_states)

# The decoder model takes a hidden state and a cell state as extra inputs;
# at the first decoding step these are the encoder's final states
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
# Re-wire the decoder, reusing the trained decoder_lstm layer
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
# state_h and state_c are now the decoder's output states. Note that the
# names have been rebound and no longer refer to the encoder's states.
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
print(len([decoder_inputs] + decoder_states_inputs))
# decoder_inputs: the sequence input
# decoder_outputs: the sequence output
# decoder_states_inputs: the state inputs, i.e. the encoder's output states
# decoder_states: the decoder's output states
# Define the decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
print(decoder_inputs.shape)

# Reverse lookup from ID back to character,
# mainly used to turn the predicted IDs into the translated text
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    # :param input_seq: one-hot input of shape (1, timesteps, num_encoder_tokens)
    # :return: the decoded sentence as a string
    # Get the encoder's final states
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        # Q: the decoder expects a sequence as input; is a single timestep enough?
        # A: yes. Training used fixed-length sequences, but the same weights
        # are shared across timesteps, so at prediction time we can feed one
        # timestep at a time as long as the states are passed along.
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        print('---------------------------------------')
        print(output_tokens.shape)
        # Sample a token
        # Take the output of the last timestep; np.argmax returns the index
        # of the largest probability, which tells us which character it is
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
                len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence
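
The question raised inside the loop, whether a decoder trained on whole sequences can be fed one timestep at a time, can be checked with a small NumPy LSTM. This is a sketch under stated assumptions: `lstm_run` is a hypothetical helper implementing the standard LSTM equations with random weights, not the Keras layer or the trained model. It shows two things: the last per-timestep output equals the final hidden state, and stepping one timestep at a time while passing `(h, c)` along gives the same outputs as one call over the whole sequence.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_run(inputs, h, c, W, U, b):
    """Run a plain LSTM over `inputs` (timesteps, features) from state (h, c).

    Returns the per-timestep outputs and the final (h, c).
    """
    units = h.shape[0]
    outputs = []
    for x_t in inputs:
        z = x_t @ W + h @ U + b
        i = sigmoid(z[:units])                # input gate
        f = sigmoid(z[units:2 * units])       # forget gate
        g = np.tanh(z[2 * units:3 * units])   # candidate cell values
        o = sigmoid(z[3 * units:])            # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs), h, c


rng = np.random.default_rng(0)
features, units, timesteps = 3, 4, 5
W = rng.normal(size=(features, 4 * units))   # random, untrained weights
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)
x = rng.normal(size=(timesteps, features))
h0, c0 = np.zeros(units), np.zeros(units)

# Whole sequence in one call
full_out, full_h, full_c = lstm_run(x, h0, c0, W, U, b)

# One timestep per call, passing the states along
h, c = h0, c0
step_rows = []
for t in range(timesteps):
    out, h, c = lstm_run(x[t:t + 1], h, c, W, U, b)
    step_rows.append(out[0])
step_out = np.stack(step_rows)
```

In `decode_sequence` the input at each step is the previously sampled character rather than a known sequence, but the state handling is the same idea.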

for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)
