[TensorFlow] A Complete Chinese-English Machine Translation Project Based on the Seq2Seq Model (Part 2): Model Training

Latest recommended article published on 2024-08-08 14:28:23

何文轩v2021


Category column: AI-TensorFlow. Article tags: tensorflow, machine translation, artificial intelligence, python, pycharm, neural networks, natural language processing

Article link: https://blog.csdn.net/m0_73954336/article/details/140892006


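Disclaimer

Some of the code in this article is partial and illustrative, so it cannot be copied and used as-is. One reason is that the complete code would make the article too long; the other is that I want to write in the spirit of "teaching people to fish". Corrections are welcome wherever the article falls short!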

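Model

The project is built on the Seq2Seq model, whose basic structure is shown in the figure below:
[Figure: basic structure of the Seq2Seq model]
Note that I feed the whole source sentence into the encoder in one go, whereas the decoder predicts one time step at a time, as in the figure. This reflects a training strategy called Teacher Forcing: in the decoder, the correct result of the previous time step is used as the input of the current time step to produce the current prediction. Using the example in the figure: we give the decoder "bos" as the current input and expect it to predict "I"; we give it "I" and expect it to predict "am". The goal is to maximize the product of the conditional probabilities P(I|bos), P(am|I), P(author|am)… This article is application-oriented, and rigorous mathematical derivation is beyond me, haha.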
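To make the teacher-forcing pairing concrete, here is a minimal sketch in plain Python. The helper name `teacher_forcing_pairs` and the `<bos>`/`<eos>` token strings are assumptions made for illustration; they are not part of the project code.

```python
def teacher_forcing_pairs(target_tokens):
    """Pair each decoder input with the token it should predict.

    Hypothetical helper for illustration: at step t the decoder is fed the
    ground-truth token t-1 (not its own previous prediction) and is trained
    to output token t.
    """
    return list(zip(target_tokens[:-1], target_tokens[1:]))


# Example sentence; the token strings are assumptions for illustration.
pairs = teacher_forcing_pairs(["<bos>", "I", "am", "author", "<eos>"])
for decoder_input, expected_output in pairs:
    print(f"feed {decoder_input!r} -> expect {expected_output!r}")
```

In the training loop this corresponds to feeding `target[:, t-1]` and comparing the prediction against `target[:, t]`.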
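In practice, an attention mechanism is used in addition to the encoder and decoder. The simple structure of the encoder is shown in the figure:
[Figure: encoder structure]
The simple structure of the decoder is shown in the figure:
[Figure: decoder structure]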
There are many kinds of attention mechanisms; the one used here can be written simply as:
$attention = v^{\top}\tanh(W_1 s + W_2 h)\\ score = softmax(attention)\\ context = score \times encoder_{out}$
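The last two steps of the formula (softmax over the raw attention values, then a weighted sum of the encoder outputs) can be checked by hand with a small sketch in plain Python. The raw attention values are taken as given here, since they come from learned layers; the function names and toy numbers are assumptions for illustration only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(attention, encoder_out):
    """Turn raw attention values into a context vector.

    attention:   one raw value per source time step
    encoder_out: one hidden-state vector (list of floats) per source time step
    """
    score = softmax(attention)  # weights over time steps, summing to 1
    dim = len(encoder_out[0])
    context = [
        sum(score[t] * encoder_out[t][d] for t in range(len(encoder_out)))
        for d in range(dim)
    ]
    return score, context


# Toy numbers (assumed): 3 source time steps, hidden size 2.
raw_attention = [0.0, 0.0, 0.0]
toy_encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score, context = attention_context(raw_attention, toy_encoder_out)
print(score)    # equal raw values give equal weights
print(context)  # so the context is the plain average of the three vectors
```

The context vector has the same size as one encoder hidden state, whatever the source sentence length.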
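The LSTM in the model can be replaced by another recurrent unit such as GRU; that choice is left to the reader. In the data processing part, short sentences are padded with 0, but 0 is not a word in the vocabulary, so it has to be ignored. TensorFlow already supports this: set the embedding layer's mask_zero parameter to True, and make sure the embedding layer is the first layer of the network. The code is as follows: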

from tensorflow import keras, nn, concat, expand_dims, reduce_sum, squeeze


class EncoderLSTM(keras.Model):
    def __init__(self, source_vocab_size):
        super(EncoderLSTM, self).__init__()
        self.units = 256
        self.embedding_dim = 128
        # mask_zero=True treats index 0 as padding so that mask-aware layers can skip it
        self.embedding_layer = keras.layers.Embedding(source_vocab_size, self.embedding_dim, mask_zero=True)
        self.lstm1 = keras.layers.LSTM(self.units, return_sequences=True, return_state=True)
        # self.hidd_state_w = keras.layers.Dense(self.units)
        # self.cell_state_w = keras.layers.Dense(self.units)
        self.dropout = keras.layers.Dropout(0.2)

    def call(self, input, init_hidden_state, init_cell_state):
        """
        :param input: shape is (batch size, sentence length)
        :param init_hidden_state: shape is (batch size, encoder units)
        :param init_cell_state: shape is (batch size, encoder units)
        :return: **encoder_out**, *shape is (batch size, time step, encoder units),
                 it contains the hidden states of all time steps;*
                 **decoder_hidden_state** and **decoder_cell_state**, *the final hidden
                 and cell states of the (unidirectional) LSTM, each of shape
                 (batch size, encoder units), used to initialise the decoder*
        """
        output = self.dropout(self.embedding_layer(input))
        encoder_out, decoder_hidden_state, decoder_cell_state = self.lstm1(
            output, initial_state=[init_hidden_state, init_cell_state])
        # decoder_init_hidden_state = self.hidd_state_w(hidden_state)
        # decoder_init_cell_state = self.cell_state_w(cell_state)
        return encoder_out, decoder_hidden_state, decoder_cell_state


class AddictiveAttention(keras.layers.Layer):
    def __init__(self, units):
        super(AddictiveAttention, self).__init__()
        self.encoder_out_w = keras.layers.Dense(units)
        self.decoder_hidden_w = keras.layers.Dense(units)
        # In theory the two projections above could use 1 unit, giving weight
        # matrices of shape (decoder units, 1), but that means fewer parameters.
        # So they are given shape (decoder units, decoder units) to learn more
        # parameters; `units` is set in the DecoderLSTM class.
        self.v = keras.layers.Dense(1)
        self.tanh = nn.tanh

    def call(self, encoder_out, decoder_hidden_state):
        """
        :param encoder_out: shape is (batch size, time step, encoder units)
        :param decoder_hidden_state: shape is (batch size, decoder units)
        :return: context, shape is (batch size, encoder units)
        """
        attention = self.v(self.tanh(self.encoder_out_w(encoder_out) + expand_dims(self.decoder_hidden_w(decoder_hidden_state), 1)))
        # the shape of attention is (batch size, time step, 1)
        attention_score = nn.softmax(squeeze(attention, 2), 1)
        # softmax over the time axis gives the attention score, shape (batch size, time step)
        context = reduce_sum(expand_dims(attention_score, 2) * encoder_out, 1)
        # context is the score-weighted sum of encoder outputs, shape (batch size, encoder units)
        return context


class DecoderLSTM(keras.Model):
    def __init__(self, target_vocab_size):
        super(DecoderLSTM, self).__init__()
        self.units = 256
        self.embedding_dim = 128
        self.embedding_layer = keras.layers.Embedding(target_vocab_size, self.embedding_dim, mask_zero=True)
        self.lstm1 = keras.layers.LSTM(self.units, return_sequences=True, return_state=True)
        self.attention_layer = AddictiveAttention(self.units)
        self.dropout = keras.layers.Dropout(0.2)
        self.output_layer = keras.layers.Dense(target_vocab_size)

    def call(self, input, init_hidden_state, init_cell_state, encoder_out):
        """
        :param input: shape is (batch size, ), the word indices of one time step
        :param init_hidden_state: shape is (batch size, decoder units)
        :param init_cell_state: shape is (batch size, decoder units)
        :param encoder_out: shape is (batch size, time step, encoder units)
        :return: **output**, *logits of shape (batch size, 1, target vocab size);*
                 **decoder_hidden_state** and **decoder_cell_state**, *each of shape
                 (batch size, decoder units)*
        """
        embedded = self.dropout(self.embedding_layer(input))
        # attend over the encoder outputs using the previous decoder hidden state
        context = self.attention_layer(encoder_out, init_hidden_state)
        # concatenate along the feature axis, then add a time axis of length 1 for the LSTM
        input_lstm = expand_dims(concat((embedded, context), 1), 1)
        decoder_out, decoder_hidden_state, decoder_cell_state = self.lstm1(
            input_lstm, initial_state=[init_hidden_state, init_cell_state])
        output = self.output_layer(decoder_out)
        return output, decoder_hidden_state, decoder_cell_state

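Training

There is not much to explain about training. As usual, you instantiate the model, set the hyperparameters, load the data, and then train for a number of epochs. The objective function, i.e. the loss, is the cross-entropy loss: at each time step the label is a word and the prediction is a probability vector, and the word whose vocabulary index has the highest probability is the final prediction, so this is essentially a classification problem. For example, if the vocabulary has 10 words, the prediction is a one-dimensional vector with 10 components [p1, p2, p3, … , p10]. Suppose the label is the word at index 3 in the vocabulary: the larger p3 is and the smaller the other components are, the smaller the cross-entropy loss, and the ideal case is p3 = 1 with all others 0. The loss also needs a mask to exclude the zeros used for padding. You can also add an evaluation step to training: after every few epochs, test the network with its current parameters and display the result. This part is fairly flexible.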
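The 10-word example can be worked through numerically. The sketch below is plain Python and operates on probabilities directly, whereas the project code passes logits to Keras; the function name `masked_cross_entropy` and the toy numbers are assumptions for illustration.

```python
import math


def masked_cross_entropy(probs_per_example, labels, pad_index=0):
    """Sum of -log p[label] over a batch, skipping padded labels.

    probs_per_example: list of probability vectors (each sums to 1)
    labels:            list of word indices; pad_index marks padding
    """
    total = 0.0
    for probs, label in zip(probs_per_example, labels):
        if label == pad_index:
            continue  # padding contributes nothing to the loss
        total += -math.log(probs[label])
    return total


# Vocabulary of 10 words; the labelled word sits at list position 3.
confident = [0.01] * 10
confident[3] = 0.91          # most of the probability mass on the correct word
uniform = [0.1] * 10         # no preference for any word

print(masked_cross_entropy([confident], [3]))  # small loss
print(masked_cross_entropy([uniform], [3]))    # larger loss, equal to log(10)
print(masked_cross_entropy([uniform], [0]))    # padded label is ignored
```

The loss shrinks as the probability of the labelled word approaches 1, and a padded position adds nothing regardless of the prediction.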

# note: `math` here is tensorflow.math, not the standard-library module
from tensorflow import cast, keras, math, reduce_sum


def ComputeCrossEntropyLossOneBatch(predict, label):
    """
    Use a mask to ignore the padding value (0 is used for padding when a sentence
    is shorter than the max length) and compute the cross-entropy loss of one
    time step over one batch, when training or testing.
    :param predict: logits, shape is (batch size, target vocab size).
    :param label: word indices, shape is (batch size, ).
    :return: loss summed over the non-padding entries of the batch
    """
    # reduction='none' keeps one loss value per example so it can be masked
    loss_object = keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
    mask = math.logical_not(math.equal(label, 0))
    loss = loss_object(label, predict)
    mask = cast(mask, dtype=loss.dtype)
    loss *= mask
    return reduce_sum(loss)


from tensorflow import GradientTape, squeeze


def TrainOneBatch(source, target, encoder, decoder, optimizer, encoder_hidden_state, encoder_cell_state):
    """
    Helper for the function **Train**: processes one batch of the training
    dataset, computes the gradients and applies backpropagation.
    :param source: source language data of the train dataset.
    :param target: target language data of the train dataset.
    :param encoder: an RNN that turns the input sequence into a context.
    :param decoder: an RNN that turns the context into the output sequence.
    :param optimizer: RMSProp performed better when I trained my models; it may
           perform better on your task, too.
    :param encoder_hidden_state: for batch data, this is usually a zero tensor.
    :param encoder_cell_state: for batch data, this is usually a zero tensor.
    :return: batch loss divided by the target sequence length
    """
    batch_loss = 0
    with GradientTape() as tape:
        encoder_out, decoder_hidden_state, decoder_cell_state = encoder(source, encoder_hidden_state, encoder_cell_state)
        # teacher forcing: the ground-truth word of step t-1 is the decoder input at step t
        decoder_in = target[:, 0]
        for word_index in range(1, target.shape[1]):
            predict, decoder_hidden_state, decoder_cell_state = decoder(decoder_in, decoder_hidden_state, decoder_cell_state, encoder_out)
            predict = squeeze(predict, 1)
            batch_loss += ComputeCrossEntropyLossOneBatch(predict, target[:, word_index])
            decoder_in = target[:, word_index]
    # the forward pass is recorded; gradients are computed outside the tape context
    batch_avg_loss = batch_loss / target.shape[1]
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradient = tape.gradient(batch_loss, variables)
    optimizer.apply_gradients(zip(gradient, variables))
    return batch_avg_loss

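Application

With the dictionary file saved by the data processing part and the trained parameters, you can do simple translation. In practice the results will not be great: mine, for example, always repeats the last word, so the translation needs some extra post-processing. There is also the punctuation problem: punctuation was discarded during data processing, so my model cannot predict it. Readers can try keeping the punctuation.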
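One simple form of that post-processing is to collapse runs of the same word. The sketch below uses `itertools.groupby` from the standard library; the function name `collapse_repeats` and the sample sentence are assumptions for illustration.

```python
from itertools import groupby


def collapse_repeats(words):
    """Drop consecutive duplicate words, keeping the first of each run."""
    return [word for word, _run in groupby(words)]


cleaned = collapse_repeats(["i", "am", "a", "student", "student", "student"])
print(" ".join(cleaned))
```

This also removes legitimate repetitions (for example "very very"), so it is a blunt fix for the symptom, not for the model.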

import re

import tensorflow
from tensorflow import convert_to_tensor, squeeze, train, zeros


def Translate(text):
    """
    Translate plain text in Chinese to text in English.
    ProcessSentence, LoadDataInformation and GetWordIndex are defined elsewhere
    in the project (not shown here).
    """
    model_load_path = './checkpoint/checkpoint-aic-l1-zh2en/model-lstm-l1-u256-e128-drop0.25-aic-20'
    # Split on sentence-ending punctuation. The capture group keeps each mark at
    # an odd index so it can be re-attached; a final sentence without a closing
    # mark is kept as well.
    pieces = re.split(r'([。！？.!?])', text)
    corpus = [''.join(pieces[i:i + 2]) for i in range(0, len(pieces), 2) if pieces[i].strip()]
    print('Split sentences : {}'.format(corpus))
    sent_words = [ProcessSentence(sent) for sent in corpus]
    print('Word segmentation : {}'.format(sent_words))
    source_vocab_size, target_vocab_size, source_w2i, target_i2w, _, target_max_length = LoadDataInformation('./DataInformation.txt')
    encoder = EncoderLSTM(source_vocab_size)
    decoder = DecoderLSTM(target_vocab_size)
    checkpoint = train.Checkpoint(encoder=encoder, decoder=decoder)
    checkpoint.restore(model_load_path)
    text_trans = []
    for sent in sent_words:
        encoder_hidden_state = zeros((1, encoder.units))
        encoder_cell_state = zeros((1, encoder.units))
        # sent_code = CodeSentence(sent, source_w2i)
        sent_code = convert_to_tensor([[GetWordIndex(word, source_w2i) for word in sent]])
        print('Sentence code : {}'.format(sent_code))
        encoder_out, decoder_hidden_state, decoder_cell_state = encoder(sent_code, encoder_hidden_state, encoder_cell_state)
        # NOTE: the <bos> index is read from the source vocabulary; this is only
        # correct if <bos> has the same index in the target vocabulary.
        decoder_in = convert_to_tensor([source_w2i["<bos>"]])
        sent_trans = [" "]
        for i in range(1, target_max_length):
            predict, decoder_hidden_state, decoder_cell_state = decoder(decoder_in, decoder_hidden_state, decoder_cell_state, encoder_out)
            predict = squeeze(predict, 1)
            # greedy decoding: the most likely word becomes the next decoder input
            pred_word_index = tensorflow.argmax(predict[0]).numpy()
            decoder_in = convert_to_tensor([pred_word_index])
            # stop at padding (0) or at index 2, presumably the end-of-sentence token
            if pred_word_index == 0 or pred_word_index == 2:
                break
            # skip a word that merely repeats the previous output word
            if target_i2w[pred_word_index] != sent_trans[-1]:
                sent_trans.append(target_i2w[pred_word_index])
        text_trans.append(" ".join(sent_trans).strip(" "))
    return text_trans
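
The inner loop of Translate is a greedy decoder: at each step the highest-scoring word is chosen and fed back in as the next input, until a stop index appears or the length limit is reached. The sketch below isolates that control flow in plain Python. The `step` callable and `toy_step` are hypothetical stand-ins for the trained decoder, and treating indices 0 and 2 as stop tokens mirrors the condition in Translate and is an assumption about the vocabulary layout.

```python
def greedy_decode(step, bos_index, stop_indices, max_length):
    """Feed each prediction back in as the next input until a stop token.

    step:         hypothetical callable mapping the previous word index to a
                  list of scores over the vocabulary (stands in for the decoder)
    stop_indices: indices that end decoding (e.g. padding or end-of-sentence)
    """
    decoded = []
    previous = bos_index
    for _ in range(1, max_length):
        scores = step(previous)
        previous = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if previous in stop_indices:
            break
        decoded.append(previous)
    return decoded


def toy_step(previous):
    """Toy 'decoder' (assumed): a fixed chain 1 -> 3 -> 4 -> 5 -> 2."""
    next_index = {1: 3, 3: 4, 4: 5, 5: 2}[previous]
    scores = [0.0] * 6
    scores[next_index] = 1.0
    return scores


result = greedy_decode(toy_step, bos_index=1, stop_indices={0, 2}, max_length=10)
print(result)
```

Greedy decoding commits to one word per step and never revisits it, which is one reason a single bad prediction can derail the rest of the sentence.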