A Guide to the Encoder-Decoder Model and the Attention Mechanism | by Eduardo Muñoz | Better Programming

Create and train a neural machine translation model with attention in TF2

Photo by Alireza Attari on Unsplash

Today, we’ll continue our journey through the world of NLP. In this article, we’re going to describe the basic architecture of an encoder-decoder model that we’ll apply to a neural machine translation problem, translating texts from English to Spanish.

Later, we’ll introduce a technique that has been a great step forward in the treatment of NLP tasks: the attention mechanism. We’ll detail the basic processing of attention applied to a sequence-to-sequence model with a many-to-many approach.

But for the moment, it’ll be a simple attention model. We won’t comment (yet) on the more complex models that will be discussed in future articles, such as when we address the subject of transformers.

“Machine translation is the task of automatically converting source text in one language to text in another language. Given a sequence of text in a source language, there is no one single best translation of that text to another language. This is because of the natural ambiguity and flexibility of human language. This makes the challenge of automatic machine translation difficult, perhaps one of the most difficult in artificial intelligence.”

— “Machine Learning Mastery” by Jason Brownlee, Ph.D. [1]

Initially, machine translation (MT) problems were tackled with statistical approaches, based mainly on Bayesian probabilities. But as neural networks became more powerful and popular, researchers began to explore the capabilities of this technology, and new solutions were found. This approach is called neural machine translation (NMT).

From the above, we can deduce that NMT is a problem where we process an input sequence to produce an output sequence — that is, a sequence-to-sequence (seq2seq) problem. Specifically, it’s of the many-to-many type, with a sequence of several elements both at the input and at the output, and the encoder-decoder architecture for recurrent neural networks is the standard method.

Depiction of Sutskever Encoder-Decoder Model for Text Translation
Taken from “Sequence to Sequence Learning with Neural Networks,” 2014

The seq2seq model consists of two subnetworks, the encoder and the decoder. The encoder, on the left-hand side, receives sequences from the source language as inputs and produces, as a result, a compact representation of the input sequence, trying to summarize or condense all of its information. Then that output becomes an input or initial state to the decoder, which can also receive another external input.

At each time step, the decoder generates an element of its output sequence based on the input received and its current state, as well as updating its own state for the next time step.

The input and output sequences are of fixed size, but they don’t have to match — the length of the input sequence may differ from that of the output sequence.

The critical point of this model is how to get the encoder to provide the most complete and meaningful representation of its input sequence in a single output element to the decoder, because this vector or state is the only information the decoder will receive from the input to generate the corresponding output. The longer the input, the harder it’ll be to compress into a single vector.

We’ll describe the model in detail and build it in a later section.

Image by mcmurryjulie on Pixabay

For this exercise, we’ll use pairs of simple sentences. The source text will be in English, and the target text will be in Spanish, from the Tatoeba project, where people contribute and add translations every day. This is the link to some translations in different languages. There you can download the Spanish/English spa_eng.zip file; it contains 124,457 pairs of sentences.

The text sentences are almost clean; they’re simple plain text, so we only need to remove accents, lowercase the sentences, and replace everything with a space except the characters a-z, A-Z, ".", "?", "!", and ",". The code to apply this preprocessing has been taken from the TensorFlow tutorial for neural machine translation.
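Below is a minimal sketch of that preprocessing step, adapted from the pattern used in the TensorFlow NMT tutorial. The function name and the <sos>/<eos> tokens follow the conventions used later in this article, and the inverted question mark is kept for Spanish (as the sample predictions further down suggest).

```python
import re
import unicodedata

def preprocess_sentence(sentence):
    # Remove accents: decompose the characters and drop the combining marks
    sentence = ''.join(c for c in unicodedata.normalize('NFD', sentence)
                       if unicodedata.category(c) != 'Mn')
    sentence = sentence.lower().strip()
    # Put a space around the punctuation we keep, replace everything else with a space
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)
    sentence = re.sub(r"[^a-zA-Z?.!,¿]+", " ", sentence)
    sentence = re.sub(r"\s+", " ", sentence).strip()
    # Delimit the sequence with the start and end tokens
    return '<sos> ' + sentence + ' <eos>'
```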

Next, let’s see how to prepare the data for our model. It’s very simple, and the steps are the following:

  • Tokenize the data to convert the raw text into a sequence of integers. First, we create a Tokenizer object from the Keras library and fit it to our text (one tokenizer for the input and another one for the output).
  • Extract a sequence of integers from the text: We call the texts_to_sequences method of the tokenizer for every input and output text.
  • Calculate the maximum length of the input and output sequences.
  • Create the input and output vocabularies: Using the tokenizers we’ve created previously, we can retrieve the vocabularies, one to match word to integer (word2idx) and a second one to match the integer to the corresponding word (idx2word).
  • Pad the sentences: We need to pad zeros at the end of the sequences so all sequences have the same length. Otherwise, we won’t be able to train the model on batches.
  • Create a batch data generator: We want to train the model on batches/groups of sentences, so we need to create a data set using the tf.data library, slicing and batching the input and output sequences (see the sketch after this list).
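A hedged sketch of these data-preparation steps follows. The variable names (input_texts, target_texts, BATCH_SIZE) are illustrative, and the lists of sentences are assumed to have already been run through preprocess_sentence.

```python
import tensorflow as tf

def tokenize(texts):
    # Keep every character produced by the preprocessing (filters='')
    tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    # Pad with zeros at the end so every sequence has the same length
    sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding='post')
    return sequences, tokenizer

input_tensor, input_tokenizer = tokenize(input_texts)      # English sentences
target_tensor, target_tokenizer = tokenize(target_texts)   # Spanish sentences

max_input_len = input_tensor.shape[1]
max_target_len = target_tensor.shape[1]

# Batch data generator built with tf.data
BATCH_SIZE = 64
dataset = (tf.data.Dataset.from_tensor_slices((input_tensor, target_tensor))
           .shuffle(len(input_tensor))
           .batch(BATCH_SIZE, drop_remainder=True))
```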

For better understanding, we can divide the model into three basic components:

From “Understanding Encoder-Decoder Sequence to Sequence Model” by Simeon Kostadinov [3]

The encoder

Layers of recurrent units where, at each time step, an input token is received, relevant information is collected, and a hidden state is produced. The details depend on the type of RNN; in our example, an LSTM, the unit mixes the current hidden state and the input and returns an output (which is discarded) and a new hidden state.

The encoder vector

The encoder vector is the last hidden state of the encoder, and it tries to contain as much of the useful input information as possible to help the decoder get the best results. It’s the only information from the input that the decoder will get.

The decoder

Layers of recurrent units — e.g., LSTMs — where each unit produces an output at time step t. The hidden state of the first unit is the encoder vector, and the rest of the units accept the hidden state from the previous unit. The output is calculated using a softmax function to obtain a probability for every token in the output vocabulary.

The Encoder class:
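Since the original code listing isn’t reproduced here, the following is a hedged sketch of such an Encoder in TF2, assuming an embedding layer followed by a single LSTM; hyperparameter names are illustrative.

```python
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_size):
        super().__init__()
        self.batch_size = batch_size
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_sequences=True keeps every hidden state (the attention model
        # will need them later); return_state=True exposes the final h and c
        self.lstm = tf.keras.layers.LSTM(enc_units, return_sequences=True,
                                         return_state=True)

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state_h, state_c = self.lstm(x, initial_state=hidden)
        return output, state_h, state_c

    def init_hidden_state(self):
        return [tf.zeros((self.batch_size, self.enc_units)),
                tf.zeros((self.batch_size, self.enc_units))]
```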

The Decoder class:
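Again as a hedged sketch, a Decoder for the plain seq2seq model can mirror the encoder: an embedding layer, an LSTM initialised with the encoder states, and a dense layer that outputs logits over the target vocabulary (the softmax is applied inside the loss function).

```python
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_size):
        super().__init__()
        self.batch_size = batch_size
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(dec_units, return_sequences=True,
                                         return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)   # logits over the target vocabulary

    def call(self, x, hidden):
        x = self.embedding(x)
        # hidden is [state_h, state_c], coming from the encoder (or the previous step)
        output, state_h, state_c = self.lstm(x, initial_state=hidden)
        logits = self.fc(output)                      # [batch, seq_len, vocab_size]
        return logits, state_h, state_c
```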

Once our encoder and decoder are defined, we can init them and set the initial hidden state. We’ve included a simple test, calling the encoder and decoder to check that they work fine:
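For instance (with illustrative hyperparameters), the initialisation and a quick sanity check could look like this:

```python
EMBEDDING_DIM, UNITS = 256, 512
encoder = Encoder(len(input_tokenizer.word_index) + 1, EMBEDDING_DIM, UNITS, BATCH_SIZE)
decoder = Decoder(len(target_tokenizer.word_index) + 1, EMBEDDING_DIM, UNITS, BATCH_SIZE)

# Simple test on one batch to check the shapes
example_input, example_target = next(iter(dataset))
enc_output, enc_h, enc_c = encoder(example_input, encoder.init_hidden_state())
dec_logits, _, _ = decoder(example_target[:, :-1], [enc_h, enc_c])
print(enc_output.shape, dec_logits.shape)
```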

Now we need to define a custom loss function to avoid taking into account the 0 values and padding values when calculating the loss. And also we have to define a custom accuracy function.
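A minimal sketch of such masked loss and accuracy functions; padded positions have token id 0 and are excluded from both:

```python
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction='none')

def loss_function(real, logits):
    mask = tf.cast(tf.not_equal(real, 0), tf.float32)     # 0 marks padding
    loss = loss_object(real, logits) * mask
    return tf.reduce_sum(loss) / tf.maximum(tf.reduce_sum(mask), 1.0)

def accuracy_function(real, logits):
    pred_ids = tf.cast(tf.argmax(logits, axis=-1), real.dtype)
    matches = tf.cast(tf.equal(real, pred_ids), tf.float32)
    mask = tf.cast(tf.not_equal(real, 0), tf.float32)
    return tf.reduce_sum(matches * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)
```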

As we mentioned before, we’re interested in training the network in batches; therefore, we create a function that carries out the training of a batch of the data (a sketch follows the list below):

  • Call the encoder for the batch input sequence — the output is the encoded vector.
  • Set the decoder initial states to the encoded vector.
  • Call the decoder, taking the right-shifted target sequence as the input. The outputs are the logits (the softmax function is applied in the loss function).
  • Calculate the loss and accuracy of the batch data.
  • Update the learnable parameters of the encoder and the decoder.
  • Update the optimizer.
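A hedged sketch of that training step, assuming the Encoder, Decoder, and masked loss/accuracy functions defined above:

```python
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(input_seq, target_seq_out, target_seq_in, enc_hidden):
    with tf.GradientTape() as tape:
        # Encode the batch; the final states are the "encoded vector"
        _, enc_h, enc_c = encoder(input_seq, enc_hidden)
        # Decode the right-shifted target sequence (teacher forcing)
        logits, _, _ = decoder(target_seq_in, [enc_h, enc_c])
        loss = loss_function(target_seq_out, logits)
        acc = accuracy_function(target_seq_out, logits)
    # Backpropagate and let the optimizer update the encoder and decoder weights
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss, acc
```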

As you can observe, our train function receives three sequences:

  • Input sequence: An array of integers of shape [batch_size, max_seq_len]. It’s the input sequence to the encoder (the embedding to embedding_dim is applied inside the model).
  • Target sequence: An array of integers of shape [batch_size, max_seq_len]. It’s the target of our model, the output that we want for our model.
  • Target input sequence: An array of integers of shape [batch_size, max_seq_len]. It’s the input sequence to the decoder, because we use teacher forcing.

Teacher forcing

Teacher forcing is a training method critical to the development of deep learning models in NLP. “It’s a way for quickly and efficiently training recurrent neural network models that use the ground truth from a prior time step as the input” [8], “What is Teacher Forcing for Recurrent Neural Networks?” by Jason Brownlee, Ph.D.

In a recurrent network, usually the input to an RNN at the time step t is the output of the RNN in the previous time step, t-1. But with teacher forcing, we can use the actual output to improve the learning capabilities of the model.

“Teacher forcing works by using the actual or expected output from the training dataset at the current time step y(t) as input in the next time step X(t+1), rather than the output generated by the network.”

— “Deep Learning” by Ian Goodfellow

So in our example, the target output at time step t is the decoder input at time step t+1. Our input sequence to the decoder will be the expected target sequence shifted one position to the right. To do this, we insert the sequence start token <sos> in the first position so the token in position 1 goes to position 2, the token from 2 to 3, and so on. To equalize the lengths of the sequences and delimit their end, in the target sequence we’ll place a sequence end token <eos> in the last position.
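As a small illustration, assuming the target tensors already contain the <sos> and <eos> tokens added during preprocessing, the shift is just a pair of slices:

```python
# Target sentence:          <sos> no vamos . <eos>
# Decoder input (shifted):  <sos> no vamos .
# Expected decoder output:        no vamos . <eos>
target_seq_in = target_tensor[:, :-1]   # drop <eos>: what we feed the decoder
target_seq_out = target_tensor[:, 1:]   # drop <sos>: what we compare the logits against
```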

When our model’s expected output doesn’t vary much from what the model saw during training, teacher forcing is very effective. But if we need a more creative model, where a given input sequence can have several possible outputs, we should avoid this technique or apply it randomly (only at some random time steps).

Now, we can code the main train function:
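A hedged sketch of a main training loop, wired to the train_step above. The epoch count and logging are illustrative, and the checkpoint object it saves every two epochs is created in the next snippet.

```python
import time

def main_train(dataset, n_epochs, steps_per_epoch):
    for epoch in range(n_epochs):
        start = time.time()
        enc_hidden = encoder.init_hidden_state()
        total_loss, total_acc = 0.0, 0.0
        for input_batch, target_batch in dataset.take(steps_per_epoch):
            # Teacher forcing: right-shift the target sequence
            target_seq_in = target_batch[:, :-1]
            target_seq_out = target_batch[:, 1:]
            loss, acc = train_step(input_batch, target_seq_out, target_seq_in, enc_hidden)
            total_loss += loss
            total_acc += acc
        print(f'Epoch {epoch + 1}: loss {float(total_loss) / steps_per_epoch:.4f}, '
              f'accuracy {float(total_acc) / steps_per_epoch:.4f}, '
              f'{time.time() - start:.0f} s')
        # Save the model every two epochs
        if (epoch + 1) % 2 == 0:
            checkpoint.save(file_prefix=checkpoint_prefix)
```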

We’re almost ready — our last step includes a call to the main train function, and we create a checkpoint object to save our model. Because the training process requires a long time to run, every two epochs we’ll save it. Later we can restore it and use it to make predictions.
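For example (the path and epoch count are illustrative):

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt')
checkpoint = tf.train.Checkpoint(optimizer=optimizer, encoder=encoder, decoder=decoder)

STEPS_PER_EPOCH = len(input_tensor) // BATCH_SIZE
main_train(dataset, n_epochs=10, steps_per_epoch=STEPS_PER_EPOCH)

# Later, restore the latest checkpoint before making predictions:
# checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
```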

We train our encoder-decoder model for a limited amount of time (approximately one hour), using 40,000 pairs of sentences and RNNs of 512 units. We achieve good results:

Make predictions

In the prediction step, our input is a sequence of length one, the <sos> token. Then we call the encoder and then the decoder repeatedly until we get the <eos> token or reach the maximum length defined.
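A hedged sketch of such a greedy prediction loop for the plain seq2seq model (the function name is illustrative):

```python
def predict_seq2seq(sentence):
    sentence = preprocess_sentence(sentence)
    inputs = input_tokenizer.texts_to_sequences([sentence])
    inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs, maxlen=max_input_len,
                                                           padding='post')
    # Encode the input once; its final states initialise the decoder
    init_state = [tf.zeros((1, UNITS)), tf.zeros((1, UNITS))]
    _, enc_h, enc_c = encoder(tf.constant(inputs), init_state)
    dec_state = [enc_h, enc_c]
    dec_input = tf.constant([[target_tokenizer.word_index['<sos>']]])

    result = []
    for _ in range(max_target_len):
        logits, dec_h, dec_c = decoder(dec_input, dec_state)
        predicted_id = int(tf.argmax(logits[0, -1]))
        if predicted_id == 0:                    # padding id: stop decoding
            break
        word = target_tokenizer.index_word[predicted_id]
        result.append(word)
        if word == '<eos>':
            break
        # Feed the prediction back in (no teacher forcing at inference time)
        dec_input = tf.constant([[predicted_id]])
        dec_state = [dec_h, dec_c]
    return ' '.join(result)
```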

Some examples of the predictions we got are:

['we re not going .', 'why are you sad ?']
['no vamos . <eos>', '¿ por que estas triste ? <eos>']

The previously described model based on RNNs has a serious problem when working with long sequences, because the information from the first tokens is lost or diluted as more tokens are processed. The context vector has been given the responsibility of encoding all of the information in a given source sentence into a vector of a few hundred elements. This made it challenging for the models to deal with long sentences. A solution was proposed in Bahdanau et al., 2014 [4] and Luong et al., 2015 [5].

They introduced a technique called attention, which greatly improved the quality of machine-translation systems. “Attention allows the model to focus on the relevant parts of the input sequence as needed, accessing all the past hidden states of the encoder, instead of just the last one” [8], “Seq2seq Model with Attention” by Zhang Handou. At each decoding step, the decoder gets to look at any particular state of the encoder and can selectively pick out specific elements from that sequence to produce the output. We’ll focus on the Luong perspective.

“Attention Mechanism” by Gabriel Loye [6]

There are two relevant points to focus on:

The alignment vector

“The alignment vector is a vector with the same length as the input or source sequence and is computed at every time step of the decoder”, [9] “Attention: Sequence 2 Sequence model with Attention Mechanism” by Renu Khandelwal. Each of its values is the score (or the probability) of the corresponding word within the source sequence; they tell the decoder what to focus on at each time step. There are three ways to calculate the alignment scores:

  • Dot product: We only need to take the hidden states of the encoder and multiply them by the hidden state of the decoder.
  • General: Very similar to the dot product, but a weight matrix is included.
  • Concat: The decoder hidden state and encoder hidden states are added together first before being passed through a linear layer with a tanh activation function and, finally, being multiplied by a weight matrix.

Decoder output

The alignment scores are softmaxed so the weights will be between 0 and 1.

The context vector

The context vector is the weighted sum of the encoder’s outputs: the dot product of the alignment vector and the encoder’s output.
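A hedged sketch of a Luong-style attention layer supporting the three score functions described above ('dot', 'general', and 'concat'); the shapes are noted in the comments.

```python
class LuongAttention(tf.keras.layers.Layer):
    def __init__(self, units, method='general'):
        super().__init__()
        self.method = method
        if method == 'general':
            self.wa = tf.keras.layers.Dense(units)
        elif method == 'concat':
            self.wa = tf.keras.layers.Dense(units, activation='tanh')
            self.va = tf.keras.layers.Dense(1)

    def call(self, dec_hidden, enc_outputs):
        # dec_hidden: [batch, units] -> [batch, 1, units]; enc_outputs: [batch, src_len, units]
        dec_hidden = tf.expand_dims(dec_hidden, 1)
        if self.method == 'dot':
            score = tf.matmul(dec_hidden, enc_outputs, transpose_b=True)
        elif self.method == 'general':
            score = tf.matmul(dec_hidden, self.wa(enc_outputs), transpose_b=True)
        else:  # 'concat'
            score = self.va(self.wa(dec_hidden + enc_outputs))   # [batch, src_len, 1]
            score = tf.transpose(score, [0, 2, 1])               # [batch, 1, src_len]
        # Alignment vector: softmax over the source positions
        alignment = tf.nn.softmax(score, axis=-1)                # [batch, 1, src_len]
        # Context vector: weighted sum of the encoder outputs
        context = tf.matmul(alignment, enc_outputs)              # [batch, 1, units]
        return context, alignment
```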

Once our Attention class has been defined, we can create the decoder. The complete sequence of steps when calling the decoder is the following (a sketch follows the list):

  1. Generate the encoder hidden states as usual, one for every input token.
  2. Apply an RNN to produce a new hidden state, taking its previous hidden state and the target output from the previous time step.
  3. Calculate the alignment scores, as described previously.
  4. Calculate the context vector.
  5. In the last operation, the context vector is concatenated with the decoder hidden state we generated previously. Then, it’s passed through a linear layer, which acts as a classifier for us to obtain the probability scores of the next predicted word.
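A hedged sketch of a decoder with Luong attention that follows those steps; it decodes a single time step per call, so the training loop has to iterate over the target sequence (shown further below).

```python
class AttentionDecoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, method='general'):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(dec_units, return_sequences=True,
                                         return_state=True)
        self.attention = LuongAttention(dec_units, method)
        self.wc = tf.keras.layers.Dense(dec_units, activation='tanh')
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, x, hidden, enc_outputs):
        # x: one target token per example, shape [batch, 1]
        x = self.embedding(x)
        lstm_out, state_h, state_c = self.lstm(x, initial_state=hidden)
        # Alignment scores and context vector over all encoder hidden states
        context, alignment = self.attention(state_h, enc_outputs)
        # Concatenate the context vector with the decoder hidden state, then classify
        combined = tf.concat([tf.squeeze(context, 1), tf.squeeze(lstm_out, 1)], axis=1)
        attention_vector = self.wc(combined)
        logits = self.fc(attention_vector)            # [batch, vocab_size]
        return logits, [state_h, state_c], alignment
```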

And that’s all: we’ve just built a decoder with Luong’s attention mechanism. As you can guess, the encoder is the same one we created previously for our initial seq2seq model.

Now we can define our step-train function to train on batch data (sketched below). It’s very similar to the one we saw previously, but this time, we pass all the hidden states returned by the encoder to the decoder. And we need to create a loop to iterate through the target sequences, calling the decoder for each one and calculating the loss function, comparing the output to the expected target.
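A hedged sketch of that step, assuming an attention_decoder instance built from the AttentionDecoder class above:

```python
attention_decoder = AttentionDecoder(len(target_tokenizer.word_index) + 1,
                                     EMBEDDING_DIM, UNITS)

@tf.function
def train_step_att(input_seq, target_seq_out, target_seq_in, enc_hidden):
    loss = 0.0
    with tf.GradientTape() as tape:
        # Keep every encoder hidden state, not just the last one
        enc_outputs, enc_h, enc_c = encoder(input_seq, enc_hidden)
        dec_state = [enc_h, enc_c]
        for t in range(target_seq_in.shape[1]):
            # Feed the t-th ground-truth token (teacher forcing), score the t-th prediction
            dec_input = tf.expand_dims(target_seq_in[:, t], 1)
            logits, dec_state, _ = attention_decoder(dec_input, dec_state, enc_outputs)
            loss += loss_function(target_seq_out[:, t], logits)
    variables = encoder.trainable_variables + attention_decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss / target_seq_in.shape[1]
```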

That’s all — we’re ready to train our encoder-decoder with attention. We just need to create the checkpoint and call the main train function (it’s the same as we coded for the model without attention).

This time we improve our results in approximately the same amount of time:

The predict function for our model with attention iterates over the target sequence, starting with the <sos> token, and at each step receives the next word and the alignment vector from the decoder. This vector can be plotted to show us which input tokens the decoder pays more attention to.

Then we can call the predict_att_seq2seq function and plot the alignments to observe how our model works:
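As a hedged sketch, a plotting helper for the alignments could look like this; it assumes the predict function returns an attention matrix of shape [target_len, source_len] along with the source and predicted words.

```python
import matplotlib.pyplot as plt

def plot_attention(attention_matrix, source_words, predicted_words):
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.matshow(attention_matrix, cmap='viridis')
    ax.set_xticks(range(len(source_words)))
    ax.set_xticklabels(source_words, rotation=90)
    ax.set_yticks(range(len(predicted_words)))
    ax.set_yticklabels(predicted_words)
    plt.show()
```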

You can access and get the code in my GitHub repository at this link, or you can get this article along with the code at my blog.
