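Disclaimer
Some of the code in this article is partial and illustrative, and cannot be copied and run as-is. One reason is that the complete code would make the article too long; the other is that I would rather "teach how to fish" than hand over the fish. Corrections are welcome!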
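Model
The project is built on the Seq2Seq model; its basic structure is shown in the figure below:
Note that I feed the whole source sentence to the encoder at once, while the decoder predicts one time step at a time, as in the figure. This reflects a training strategy called Teacher Forcing: in the decoder, the correct result of the previous time step is used as the input of the current time step to produce the current prediction. With the example in the figure: the decoder is given "bos" and is expected to predict "I"; it is given "I" and is expected to predict "am"; and so on. The objective is to maximize the product of the conditional probabilities P(I | bos), P(am | I), P(author | am), … This article is application-oriented, so I will not attempt a rigorous mathematical derivation.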
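To make the teacher-forcing pairing concrete, here is a minimal sketch in plain Python. The helper name `teacher_forcing_pairs` and the `<eos>` token are assumptions made for illustration; they do not come from the project code.

```python
def teacher_forcing_pairs(target_tokens):
    """Pair each decoder input with the word the decoder is expected to predict.

    The decoder input at step t is the ground-truth token at step t - 1,
    never the decoder's own previous prediction.
    """
    return list(zip(target_tokens[:-1], target_tokens[1:]))


# Illustrative target sentence; "<eos>" is an assumed end-of-sentence token.
pairs = teacher_forcing_pairs(["<bos>", "I", "am", "author", "<eos>"])
for decoder_in, expected in pairs:
    print("feed {!r}, expect {!r}".format(decoder_in, expected))
```

Each pair corresponds to one factor of the product of conditional probabilities that training tries to maximize.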
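In practice, an attention mechanism is used in addition to the encoder and decoder. The simplified structure of the encoder is shown in the figure:
The simplified structure of the decoder is shown in the figure:
There are several kinds of attention mechanism; the additive form used here can be written simply as: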
$$
\begin{aligned}
attention &= v\left(\tanh\left(W_1 s + W_2 h\right)\right) \\
score &= \mathrm{softmax}(attention) \\
context &= score \times encoder_{out}
\end{aligned}
$$
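The three formulas can be traced by hand with a small pure-Python sketch. It assumes that s is the decoder hidden state and h is one encoder output per time step (the article does not say which is which), and every name in it (`additive_attention`, `matvec`, the toy weights) is hypothetical. It only illustrates the arithmetic and is not the TensorFlow layer used in the project.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(value - peak) for value in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(s, encoder_out, w1, w2, v):
    """Return (score, context) for one decoder state and T encoder outputs."""
    projected_s = matvec(w1, s)
    energies = []
    for h in encoder_out:
        projected_h = matvec(w2, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_s, projected_h)]
        # v reduces the hidden vector to one scalar energy per time step
        energies.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    score = softmax(energies)
    # context = sum over time steps of score[t] * encoder_out[t]
    context = [
        sum(score[t] * encoder_out[t][d] for t in range(len(encoder_out)))
        for d in range(len(encoder_out[0]))
    ]
    return score, context


# Toy example: 3 encoder time steps, 2-dimensional states, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
score, context = additive_attention(
    s=[0.0, 0.0],
    encoder_out=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    w1=identity,
    w2=identity,
    v=[1.0, 1.0],
)
print(score)    # three weights that sum to 1; the last time step gets the largest
print(context)  # weighted average of the encoder outputs
```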
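The LSTM in the model can be replaced with another recurrent unit such as GRU; that choice is left to the reader. In the data processing part, short sentences are padded with 0, but 0 is not a word in the vocabulary, so the padded positions must be ignored. TensorFlow already supports this: set the mask_zero parameter of the embedding layer to True, and make sure the embedding layer is the first layer of the network. The code is as follows: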
from tensorflow import keras, nn, expand_dims, squeeze, reduce_sum, concat


class EncoderLSTM(keras.Model):
    def __init__(self, source_vocab_size):
        super(EncoderLSTM, self).__init__()
        self.units = 256
        self.embedding_dim = 128
        # mask_zero=True makes downstream layers skip the positions padded with 0
        self.embedding_layer = keras.layers.Embedding(source_vocab_size, self.embedding_dim, mask_zero=True)
        self.lstm1 = keras.layers.LSTM(self.units, return_sequences=True, return_state=True)
        # self.hidd_state_w = keras.layers.Dense(self.units)
        # self.cell_state_w = keras.layers.Dense(self.units)
        self.dropout = keras.layers.Dropout(0.2)

    def call(self, input, init_hidden_state, init_cell_state, training=None):
        """
        :param input: shape is (batch size, sentence length)
        :param init_hidden_state: shape is (batch size, encoder units)
        :param init_cell_state: shape is (batch size, encoder units)
        :param training: dropout is only active when this is True
        :return: **encoder_out**, *shape is (batch size, time step, encoder units),
                 it contains the hidden states of all time steps;*
                 **decoder_hidden_state** and **decoder_cell_state**, *the final hidden
                 and cell states, each of shape (batch size, encoder units), used to
                 initialise the decoder*
        """
        output = self.dropout(self.embedding_layer(input), training=training)
        encoder_out, decoder_hidden_state, decoder_cell_state = self.lstm1(output, initial_state=[init_hidden_state, init_cell_state])
        # decoder_init_hidden_state = self.hidd_state_w(hidden_state)
        # decoder_init_cell_state = self.cell_state_w(cell_state)
        return encoder_out, decoder_hidden_state, decoder_cell_state


class AddictiveAttention(keras.layers.Layer):
    """Additive attention over the encoder outputs."""

    def __init__(self, units):
        super(AddictiveAttention, self).__init__()
        self.encoder_out_w = keras.layers.Dense(units)
        self.decoder_hidden_w = keras.layers.Dense(units)
        # In theory the two projections above could use 1 as units, which would give
        # weight matrices of shape (decoder units, 1), but that means fewer parameters.
        # So I let their shape be (decoder units, decoder units) to learn more
        # parameters when the layer is used in the DecoderLSTM class.
        self.v = keras.layers.Dense(1)
        self.tanh = nn.tanh

    def call(self, encoder_out, decoder_hidden_state):
        """
        :param encoder_out: shape is (batch size, time step, encoder units)
        :param decoder_hidden_state: shape is (batch size, decoder units)
        :return: context, shape is (batch size, encoder units)
        """
        # expand_dims adds a time axis so the decoder state broadcasts over all time steps
        attention = self.v(self.tanh(self.encoder_out_w(encoder_out) + expand_dims(self.decoder_hidden_w(decoder_hidden_state), 1)))
        # the shape of attention is (batch size, time step, 1)
        attention_score = nn.softmax(squeeze(attention, 2), 1)
        # softmax over the time axis turns attention into attention scores of shape (batch size, time step)
        context = reduce_sum(expand_dims(attention_score, 2) * encoder_out, 1)
        # context is the score-weighted sum of the encoder outputs, shape (batch size, encoder units)
        return context


class DecoderLSTM(keras.Model):
    def __init__(self, target_vocab_size):
        super(DecoderLSTM, self).__init__()
        self.units = 256
        self.embedding_dim = 128
        self.embedding_layer = keras.layers.Embedding(target_vocab_size, self.embedding_dim, mask_zero=True)
        self.lstm1 = keras.layers.LSTM(self.units, return_sequences=True, return_state=True)
        self.attention_layer = AddictiveAttention(self.units)
        self.dropout = keras.layers.Dropout(0.2)
        self.output_layer = keras.layers.Dense(target_vocab_size)

    def call(self, input, init_hidden_state, init_cell_state, encoder_out, training=None):
        """
        Run one decoding time step.
        :param input: shape is (batch size, ), the word index fed in at this time step
        :param init_hidden_state: shape is (batch size, decoder units)
        :param init_cell_state: shape is (batch size, decoder units)
        :param encoder_out: shape is (batch size, time step, encoder units)
        :param training: dropout is only active when this is True
        :return: output, logits of shape (batch size, 1, target vocab size), and the
                 new hidden and cell states, each of shape (batch size, decoder units)
        """
        embedded = self.dropout(self.embedding_layer(input), training=training)
        context = self.attention_layer(encoder_out, init_hidden_state)
        # concatenate the word embedding with the context vector, then add a time axis of length 1
        input_lstm = expand_dims(concat((embedded, context), 1), 1)
        decoder_out, decoder_hidden_state, decoder_cell_state = self.lstm1(input_lstm, initial_state=[init_hidden_state, init_cell_state])
        output = self.output_layer(decoder_out)
        return output, decoder_hidden_state, decoder_cell_state
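Training
There is not much to explain about training: instantiate the models, set the hyperparameters, load the data, then train for a number of epochs. The objective function, i.e. the loss, is the cross-entropy loss. At each time step the label is a word and the prediction is a probability vector; the word whose vocabulary index has the highest probability is the final prediction, so this is essentially a classification problem. (In the code, the network outputs unnormalized logits and the loss applies softmax internally through from_logits=True.) For example, if the vocabulary has 10 words, the prediction is a one-dimensional vector with 10 components [p1, p2, p3, …, p10]. If the label is the word that 3 maps to in the vocabulary, then the larger p3 is and the smaller the other components are, the smaller the cross-entropy loss; ideally p3 is 1 and the rest are 0. The loss also needs a mask to exclude the positions padded with 0. An evaluation step can be added to training as well: after every few epochs, test the current parameters and report the result. This part is fairly flexible.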
from tensorflow import GradientTape, cast, keras, math, reduce_sum, squeeze


def ComputeCrossEntropyLossOneBatch(predict, label):
    """
    Use a mask to ignore the padding value (usually we use 0 for padding if a sentence
    is shorter than max length) and compute the cross-entropy loss of one batch of data
    when training or testing.
    :param predict: shape is (batch size, target vocab size), unnormalized logits.
    :param label: shape is (batch size, ).
    :return: loss summed over the non-padded positions of the batch
    """
    # reduction='none' keeps one loss value per sample so the mask can be applied
    loss_object = keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
    mask = math.logical_not(math.equal(label, 0))
    loss = loss_object(label, predict)
    mask = cast(mask, dtype=loss.dtype)
    loss *= mask
    return reduce_sum(loss)


def TrainOneBatch(source, target, encoder, decoder, optimizer, encoder_hidden_state, encoder_cell_state):
    """
    This function assists the function **Train**: it processes one batch of the train
    dataset, computes the gradients and applies them (backpropagation).
    :param source: source language data of the train dataset.
    :param target: target language data of the train dataset.
    :param encoder: an RNN that transforms the input sequence into a context.
    :param decoder: an RNN that transforms the context into the output sequence.
    :param optimizer: when I trained my models, RMSProp performed better; it may
                      perform better on your task, too.
    :param encoder_hidden_state: for batch data, this is usually a zero tensor.
    :param encoder_cell_state: like encoder_hidden_state, usually a zero tensor.
    :return: the batch loss divided by the target sentence length
    """
    batch_loss = 0
    with GradientTape() as tape:
        # training=True switches the dropout layers on
        encoder_out, decoder_hidden_state, decoder_cell_state = encoder(source, encoder_hidden_state, encoder_cell_state, training=True)
        decoder_in = target[:, 0]
        for word_index in range(1, target.shape[1]):
            predict, decoder_hidden_state, decoder_cell_state = decoder(decoder_in, decoder_hidden_state, decoder_cell_state, encoder_out, training=True)
            predict = squeeze(predict, 1)
            batch_loss += ComputeCrossEntropyLossOneBatch(predict, target[:, word_index])
            # Teacher Forcing: feed the correct word, not the prediction, into the next step
            decoder_in = target[:, word_index]
    batch_avg_loss = batch_loss / target.shape[1]
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradient = tape.gradient(batch_loss, variables)
    optimizer.apply_gradients(zip(gradient, variables))
    return batch_avg_loss
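The masked cross-entropy described in this section can be checked by hand with a small pure-Python sketch. For readability it takes probabilities instead of the logits used in the TensorFlow code, and the function name `masked_cross_entropy` and the toy vectors are assumptions made for illustration only.

```python
import math


def masked_cross_entropy(probabilities, labels, pad_index=0):
    """Sum of -log(p[label]) over a batch, skipping padded positions.

    probabilities: one probability vector per sample (each sums to 1).
    labels: one vocabulary index per sample; pad_index marks padding.
    """
    total = 0.0
    for probs, label in zip(probabilities, labels):
        if label == pad_index:
            continue  # padded position: contributes nothing to the loss
        total += -math.log(probs[label])
    return total


# Vocabulary of 10 words, correct label is index 3.
confident = [0.01] * 10
confident[3] = 0.91          # almost all probability mass on the label
uniform = [0.1] * 10         # the model has learned nothing yet

print(masked_cross_entropy([confident], [3]))  # about 0.09
print(masked_cross_entropy([uniform], [3]))    # about 2.30
print(masked_cross_entropy([uniform], [0]))    # padded label, loss is 0.0
```

The closer the probability of the label is to 1, the closer the loss is to 0, and padded positions never affect the gradient.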
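Application
With the dictionary file saved in the data processing part and the trained parameters, simple translation is possible. The results are in fact not very good: mine, for example, keeps repeating the last word, so the translation needs extra post-processing. Punctuation is another issue: it was discarded during data processing, so the model cannot predict it. Readers may want to try keeping the punctuation.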
import re

import tensorflow
from tensorflow import convert_to_tensor, squeeze, train, zeros

# ProcessSentence, LoadDataInformation and GetWordIndex are helpers defined in the
# other source files of this project.


def Translate(text):
    """
    Translate plain text in Chinese to text in English.
    """
    model_load_path = './checkpoint/checkpoint-aic-l1-zh2en/model-lstm-l1-u256-e128-drop0.25-aic-20'
    # Split on sentence-ending punctuation; the capturing group keeps the marks in the result
    corpus = re.split(r'(。|!|\!|\.|?|\?)', text)
    punctuations = ['。', '!', '?', '.', '!', '?']
    # Re-attach every sentence to the punctuation mark that follows it
    corpus = [corpus[i] + corpus[i + 1] for i in range(len(corpus) - 1) if corpus[i] not in punctuations]
    print('Split sentences : {}'.format(corpus))
    sent_words = [ProcessSentence(sent) for sent in corpus]
    print('Word segmentation : {}'.format(sent_words))
    source_vocab_size, target_vocab_size, source_w2i, target_i2w, _, target_max_length = LoadDataInformation('./DataInformation.txt')
    encoder = EncoderLSTM(source_vocab_size)
    decoder = DecoderLSTM(target_vocab_size)
    checkpoint = train.Checkpoint(encoder=encoder, decoder=decoder)
    checkpoint.restore(model_load_path)
    text_trans = []
    for sent in sent_words:
        encoder_hidden_state = zeros((1, encoder.units))
        encoder_cell_state = zeros((1, encoder.units))
        # sent_code = CodeSentence(sent, source_w2i)
        sent_code = convert_to_tensor([[GetWordIndex(word, source_w2i) for word in sent]])
        print('Sentence code : {}'.format(sent_code))
        encoder_out, decoder_hidden_state, decoder_cell_state = encoder(sent_code, encoder_hidden_state, encoder_cell_state)
        # "<bos>" is looked up in the source vocabulary, which assumes it has the
        # same index in the target vocabulary
        decoder_in = convert_to_tensor([source_w2i["<bos>"]])
        sent_trans = [" "]
        for i in range(1, target_max_length):
            predict, decoder_hidden_state, decoder_cell_state = decoder(decoder_in, decoder_hidden_state, decoder_cell_state, encoder_out)
            predict = squeeze(predict, 1)
            # greedy decoding: take the most likely word and feed it back in
            pred_word_index = tensorflow.argmax(predict[0]).numpy()
            decoder_in = convert_to_tensor([pred_word_index])
            # stop at padding (0) or at index 2 (presumably the end-of-sentence token)
            if pred_word_index == 0 or pred_word_index == 2:
                break
            elif target_i2w[pred_word_index] != sent_trans[-1]:
                # skip a word that merely repeats the previous one
                sent_trans.append(target_i2w[pred_word_index])
        text_trans.append(" ".join(sent_trans).strip(" "))
    return text_trans
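The repeated-word clean-up that Translate performs inline can also be written as a small standalone helper, which makes it easier to test or to swap for something smarter. This is only a sketch; the name `collapse_repeats` is hypothetical and not part of the project.

```python
def collapse_repeats(words):
    """Drop every word that is identical to the word right before it."""
    result = []
    for word in words:
        if not result or word != result[-1]:
            result.append(word)
    return result


cleaned = collapse_repeats(["i", "am", "the", "author", "author", "author"])
print(" ".join(cleaned))  # i am the author
```

Note that this only removes consecutive duplicates, so a legitimate repetition such as "very very" would be collapsed as well.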
Here are some of the better results from my translations:
Full source code for this part
Seq2Seq.py
Train.py
Translate.py
Test.py
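Article series
[TensorFlow] A complete Chinese-English machine translation project based on the Seq2Seq model (1): Data processing
[TensorFlow] A complete Chinese-English machine translation project based on the Seq2Seq model (2): Model training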