Highly Recommended: The Illustrated Transformer, Explained in Detail (with Many Figures)

Original post: The Illustrated Transformer

In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.

The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand to people without in-depth knowledge of the subject matter.

A High-Level Look
Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.
[figure]
Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
[figure]
The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.
[figure]

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
[figure]

The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.

The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).
[figure]
Bringing The Tensors Into The Picture

Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.

[figure]
Each word is embedded into a vector of size 512. We’ll represent those vectors with these simple boxes.

The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512 – in the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
[figure]
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
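To make that per-position independence concrete, here is a minimal NumPy sketch of a position-wise feed-forward layer. The two-linear-layers-with-ReLU form comes from Section 3.3 of the paper rather than from anything shown above, and all of the shapes here are toy assumptions:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Apply the same two-layer feed-forward network to every position independently.

    x has shape (seq_len, d_model); every row is transformed with identical weights
    and no row looks at any other, which is why these paths can run in parallel.
    """
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU, shape (seq_len, d_ff)
    return hidden @ W2 + b2               # back to (seq_len, d_model)

# Toy shapes loosely following the paper: d_model = 512, d_ff = 2048
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 512))             # a 6-word sentence
W1, b1 = rng.normal(size=(512, 2048)), np.zeros(2048)
W2, b2 = rng.normal(size=(2048, 512)), np.zeros(512)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (6, 512)
```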

Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.

Now We’re Encoding!
As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
[figure]
The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network – the exact same network with each vector flowing through it separately.

Self-Attention at a High Level

Don’t be fooled by me throwing around the word “self-attention” like it’s a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.

Say the following sentence is an input sentence we want to translate:

”The animal didn’t cross the street because it was too tired”

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
[figure]
As we are encoding the word “it” in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on “The Animal”, and baked a part of its representation into the encoding of “it”.

Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.

Self-Attention in Detail
Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller; this is an architecture choice to make the computation of multiheaded attention (mostly) constant.

[figure]
Multiplying x1 by the WQ weight matrix produces q1, the “query” vector associated with that word. We end up creating a “query”, a “key”, and a “value” projection of each word in the input sentence.

What are the “query”, “key”, and “value” vectors?

They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.
(See also: a separate post on why these three sets of vectors are needed.)

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

[figure]
The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.

[figure]

This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
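As a minimal sketch of steps two through six for a single position, here is some toy NumPy code with made-up 4-dimensional q/k/v vectors (the paper uses 64):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Toy q/k/v vectors for a two-word sentence ("Thinking", "Machines"), d_k = 4
q1 = np.array([1.0, 0.0, 1.0, 0.0])                # query for position #1
keys = np.array([[1.0, 0.5, 1.0, 0.0],             # k1
                 [0.0, 1.0, 0.0, 1.0]])            # k2
values = np.array([[0.2, 0.4, 0.6, 0.8],           # v1
                   [1.0, 1.0, 1.0, 1.0]])          # v2

scores = keys @ q1                                  # step 2: q1·k1, q1·k2
scores = scores / np.sqrt(keys.shape[1])            # step 3: divide by sqrt(d_k)
weights = softmax(scores)                           # step 4: softmax
z1 = (weights[:, None] * values).sum(axis=0)        # steps 5-6: weight the values and sum them
print(weights, z1)
```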

[figure]
That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level.

Matrix Calculation of Self-Attention

The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).

[figure]
Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure).

Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
[figure]
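The formula in the figure is the scaled dot-product attention, softmax(Q·Kᵀ / √d_k)·V. As a rough sketch (toy shapes, not the Tensor2Tensor implementation), the whole matrix-form calculation might look like this:

```python
import numpy as np

def self_attention(X, WQ, WK, WV):
    """Scaled dot-product self-attention for a single head.

    X: (seq_len, d_model) packed embeddings; WQ/WK/WV: (d_model, d_k) trained projections.
    """
    Q, K, V = X @ WQ, X @ WK, X @ WV
    scores = Q @ K.T / np.sqrt(K.shape[1])                     # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # (seq_len, d_k)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))                                  # two words, d_model = 512
WQ, WK, WV = (rng.normal(size=(512, 64)) for _ in range(3))
print(self_attention(X, WQ, WK, WV).shape)                     # (2, 64)
```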
The Beast With Many Heads

The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:

It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, it would be useful to know which word “it” refers to.

It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

[figure]
With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.

[figure]
This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that? We concat the matrices then multiply them by an additional weights matrix WO.

[figure]
That’s pretty much all there is to multi-headed self-attention. It’s quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place.
[figure]
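Tying the same pieces together in code, here is a rough NumPy sketch of multi-headed self-attention: one set of Q/K/V projection matrices per head, a concatenation, and a final projection with WO. The shapes and random initialization are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, WQs, WKs, WVs, WO):
    """Run one attention pass per head, concatenate the results, then project with WO."""
    heads = []
    for WQ, WK, WV in zip(WQs, WKs, WVs):               # one set of weight matrices per head
        Q, K, V = X @ WQ, X @ WK, X @ WV
        weights = softmax(Q @ K.T / np.sqrt(K.shape[1]))
        heads.append(weights @ V)                       # (seq_len, d_k)
    Z = np.concatenate(heads, axis=-1)                  # (seq_len, num_heads * d_k)
    return Z @ WO                                       # back to (seq_len, d_model)

rng = np.random.default_rng(0)
num_heads, d_model, d_k, seq_len = 8, 512, 64, 2
X = rng.normal(size=(seq_len, d_model))
WQs = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
WKs = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
WVs = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
WO = rng.normal(size=(num_heads * d_k, d_model))
print(multi_head_self_attention(X, WQs, WKs, WVs, WO).shape)   # (2, 512)
```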
Now that we have touched upon attention heads, let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:
[figure]
As we encode the word “it”, one attention head is focusing most on “the animal”, while another is focusing on “tired” – in a sense, the model’s representation of the word “it” bakes in some of the representation of both “animal” and “tired”.

If we add all the attention heads to the picture, however, things can be harder to interpret:
[figure]

Representing The Order of The Sequence Using Positional Encoding


One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.
[figure]
To give the model a sense of the order of the words, we add positional encoding vectors – the values of which follow a specific pattern.

If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:

[figure]
A real example of positional encoding with a toy embedding size of 4

What might this pattern look like?

In the following figure, each row corresponds to a positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We’ve color-coded them so the pattern is visible.

[figure]
A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That’s because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They’re then concatenated to form each of the positional encoding vectors.

The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).

July 2020 Update: The positional encoding shown above is from the Tensor2Tensor implementation of the Transformer. The method shown in the paper is slightly different in that it doesn’t directly concatenate, but interweaves the two signals. The following figure shows what that looks like. Here’s the code to generate it:
[figure]
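The original post links to the code that generates this. As a minimal sketch of the interleaved formula from Section 3.5 of the paper (even indices use sine, odd indices use cosine), something like the following would produce the pattern in the figure:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding as described in Section 3.5 of the paper.

    Even columns use sine and odd columns use cosine (interleaved, not concatenated).
    """
    positions = np.arange(max_len)[:, None]                    # (max_len, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)  # 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

pe = positional_encoding(max_len=20, d_model=512)
print(pe.shape)   # (20, 512) -- row i is added to the embedding at position i
```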

The Residuals

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
[figure]
If we’re to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:

[figure]
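In code, each sub-layer output is wrapped as LayerNorm(x + Sublayer(x)). A minimal sketch (the learned scale and bias of layer normalization are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer_output)

# Inside an encoder, roughly: x = add_and_norm(x, self_attention(x))
#                             x = add_and_norm(x, feed_forward(x))
```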
This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:
[figure]

The Decoder Side

Now that we’ve covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let’s take a look at how they work together.

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer which helps the decoder focus on appropriate places in the input sequence:

[figure]

After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case).

The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

[figure: https://jalammar.github.io/images/t/transformer_decoding_2.gif]
The self attention layers in the decoder operate in a slightly different way than the one in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
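A minimal sketch of that masking step, assuming a square matrix of attention scores:

```python
import numpy as np

def masked_softmax(scores):
    """Mask future positions (set them to -inf) before the softmax, as in decoder self-attention.

    scores: (seq_len, seq_len) attention scores; row i may only attend to columns at or before i.
    """
    seq_len = scores.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

print(np.round(masked_softmax(np.zeros((4, 4))), 2))
# row 0 attends only to position 0, row 1 to positions 0 and 1, and so on
```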

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.

The Final Linear and Softmax Layer


The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.

Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

[figure]
This figure starts from the bottom with the vector produced as the output of the decoder stack. It is then turned into an output word.
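Putting the last two layers into a small sketch (toy shapes and a made-up six-word vocabulary; a real output vocabulary would be tens of thousands of words):

```python
import numpy as np

def decode_step(decoder_output, W_vocab, b_vocab, index_to_word):
    """Project the decoder's output vector to vocabulary logits, softmax, pick the top word."""
    logits = decoder_output @ W_vocab + b_vocab         # one score per vocabulary word
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                         # softmax: all positive, sums to 1.0
    return index_to_word[int(np.argmax(probs))], probs  # greedy choice for this time step

rng = np.random.default_rng(0)
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]  # toy six-word vocabulary
decoder_output = rng.normal(size=512)                   # vector produced by the decoder stack
W_vocab, b_vocab = rng.normal(size=(512, len(vocab))), np.zeros(len(vocab))
word, probs = decode_step(decoder_output, W_vocab, b_vocab, vocab)
print(word, np.round(probs, 3))
```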

Recap Of Training

Now that we’ve covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.

During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.

To visualize this, let’s assume our output vocabulary only contains six words (“a”, “am”, “i”, “thanks”, “student”, and “<eos>” (short for ‘end of sentence’)).
The output vocabulary of our model is created in the preprocessing phase before we even begin training

Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This also known as one-hot encoding. So for example, we can indicate the word “am” using the following vector:

[figure]
Example: one-hot encoding of our output vocabulary
“a”: [1, 0, 0, 0, 0, 0]
“am”: [0, 1, 0, 0, 0, 0]
“i”: [0, 0, 1, 0, 0, 0]
“thanks”: [0, 0, 0, 1, 0, 0]
“student”: [0, 0, 0, 0, 1, 0]
“<eos>”: [0, 0, 0, 0, 0, 1]

Following this recap, let’s discuss the model’s loss function – the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.

The Loss Function

Say we are training our model. Say it’s our first step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.

What this means, is that we want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.

[figure]
Since the model’s parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model’s weights using backpropagation to make the output closer to the desired output.

How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
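As a tiny worked example of the cross-entropy loss mentioned above, using the six-word vocabulary from this post and a one-hot target for “thanks”:

```python
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a one-hot target distribution and the model's output distribution."""
    return float(-np.sum(target * np.log(predicted + eps)))

target    = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])   # one-hot for "thanks"
predicted = np.array([0.2, 0.1, 0.1, 0.4, 0.1, 0.1])   # an untrained model's arbitrary output
print(cross_entropy(target, predicted))                 # shrinks toward 0 as probability mass moves to "thanks"
```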

But note that this is an oversimplified example. More realistically, we’ll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means, is that we want our model to successively output probability distributions where:

Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 30,000 or 50,000)

The first probability distribution has the highest probability at the cell associated with the word “i”

The second probability distribution has the highest probability at the cell associated with the word “am”

And so on, until the fifth output distribution indicates the ‘<end of sentence>’ symbol, which also has a cell associated with it from the 10,000 element vocabulary.
[figure]
The targeted probability distributions we’ll train our model against in the training example for one sample sentence.

After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:
[figure]
Hopefully upon training, the model would output the right translation we expect. Of course it’s no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it’s unlikely to be the output of that time step – that’s a very useful property of softmax which helps the training process.

Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, ‘I’ and ‘a’ for example), then in the next step, run the model twice: once assuming the first output position was the word ‘I’, and another time assuming the first output position was the word ‘a’, and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called “beam search”, where in our example, beam_size was two (meaning that at all times, two partial hypotheses (unfinished translations) are kept in memory), and top_beams is also two (meaning we’ll return two translations). These are both hyperparameters that you can experiment with.
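A minimal sketch of greedy decoding; `step_fn` here is a hypothetical stand-in for a full forward pass through the decoder stack plus the final linear and softmax layers:

```python
import numpy as np

def greedy_decode(step_fn, start_token, end_token, max_len=20):
    """Greedy decoding: at every time step keep only the single highest-probability word.

    step_fn(tokens) is assumed to return a probability distribution over the vocabulary
    for the next position, given the tokens produced so far.
    """
    tokens = [start_token]
    for _ in range(max_len):
        probs = step_fn(tokens)
        next_token = int(np.argmax(probs))   # highest-probability word only
        tokens.append(next_token)
        if next_token == end_token:
            break
    return tokens[1:]

# Toy usage with a fake "model" that always predicts token 3 (here the end token):
print(greedy_decode(lambda tokens: np.eye(6)[3], start_token=0, end_token=3))

# A beam search would instead keep the beam_size best partial translations alive at each
# step and expand all of them, trading extra compute for better final hypotheses.
```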

Go Forth And Transform


I hope you’ve found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I’d suggest these next steps:

  • Read the Attention Is All You Need paper, the Transformer blog post (Transformer: A Novel Neural Network Architecture for Language Understanding), and the Tensor2Tensor announcement.
  • Watch Łukasz Kaiser’s talk walking through the model and its details.
  • Play with the Jupyter Notebook provided as part of the Tensor2Tensor repo.
  • Explore the Tensor2Tensor repo.