The Illustrated Transformer (Translation)


Discussions: Hacker News (65 points, 4 comments), Reddit r/MachineLearning (29 points, 3 comments)
Translations: Chinese (Simplified), French, Japanese, Korean, Russian, Spanish, Vietnamese
Watch: MIT’s Deep Learning State of the Art lecture referencing this post

In the previous post, we looked at attention, a concept that helps improve the performance of neural networks. In this post, we look at the Transformer, a model that uses attention to dramatically speed up training. On many specific tasks, the Transformer outperforms Google's neural machine translation (GNMT) model. Let's now open up the model and see how it works.

The Transformer was proposed in the paper Attention is All You Need. In this post, we try to simplify things and introduce the relevant concepts one by one, in the hope that the material is easy to follow even without background knowledge in the area.

A High-Level Look

To start, we can look at the model as a single black box: it takes a sentence in one language and outputs the sentence in another language.

Opening up that black box, we find an encoding component, a decoding component, and connections between them.

The encoding component and the decoding component are each made up of a stack of encoders and decoders, respectively:

All the encoders (and decoders) in the stack share exactly the same structure. Each can be broken down into two sub-layers: a self-attention layer and a feed-forward neural network.

We will take a closer look at the self-attention layer later in the post. The feed-forward network is applied to each position independently. The decoder has the same two layers, but between them sits an additional attention layer that helps the decoder focus on relevant parts of the input sentence.

Bringing the Tensors Into the Picture

Let's now look at the various vectors and tensors and how they flow between these components. As is typical in NLP, we start with an embedding algorithm that turns each input word into a vector.

 


Each input word is embedded into a vector of size 512. We represent these vectors with simple boxes.

The encoder receives a list of vectors, each of size 512. The length of this list is a hyperparameter, usually the length of the longest sentence the model can accept (the maximum number of words a sentence may contain). After the words are embedded into vectors, each of them flows through the two layers of the encoder.

Here we see one key property of the Transformer: the word at each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer, but the feed-forward layer has no such dependencies, which is why the various paths can be executed in parallel while flowing through the feed-forward layer.

Next, we'll switch to a shorter example sentence and look at what happens in each sub-layer of the encoder.

Now We're Encoding!

As we mentioned already, an encoder receives a list of vectors as input. It processes this list by passing the vectors into the self-attention layer, then into the feed-forward neural network, and finally sends the output up to the next encoder.

 

Self-Attention at a High Level

Don't be fooled by the term "self-attention" into feeling silly for not quite understanding it; it is not a concept everyone should already be familiar with. I personally had never come across the concept before reading the Attention is All You Need paper. Let's distill how it works.

假设我们要翻译以下句子:

The animal didn't cross the street because it was too tired

What does "it" refer to in this sentence? The street, or the animal? It's a simple question for a human, but not as simple for an algorithm. When the model is processing the word "it", self-attention allows it to associate "it" with "animal".

As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that help it encode this word better.

If you're familiar with RNNs, think of how maintaining a hidden state allows an RNN to combine its representation of previously processed words/vectors with the word/vector it is currently processing. Self-attention is the method the Transformer uses to bake the "understanding" of other relevant words into the word we're currently processing.

As we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The Animal" and bakes (blends) a part of its representation into the encoding of "it".

Be sure to check out the Tensor2Tensor notebook, where you can load a Transformer model and examine it using this interactive visualization.

Self-Attention in Detail

Let's first look at how to calculate self-attention using vectors, then move on to how it is actually implemented, using matrices. The first step is to create three vectors from each of the encoder's input vectors (the embedding of each word). So for each word we create a Query vector, a Key vector, and a Value vector. (One vector becomes three; what is that good for? Keep reading.) These three vectors are created by multiplying the embedding by three matrices that are learned during training. Notice that the new vectors are smaller in dimension than the embedding vector: their dimension is 64, while the embedding and the encoder's input/output vectors have a dimension of 512. They don't have to be smaller; this is an architecture choice that keeps the computation of multi-headed attention (mostly) constant.

 


Multiplying x1 by the WQ weight matrix produces q1, the query vector for that word. We eventually create a key vector and a value vector in the same way.
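As a quick illustration of this first step, here is a minimal NumPy sketch (not code from the post or from Tensor2Tensor); the shapes follow the figures, and the random weight matrices stand in for the matrices learned during training:

```python
import numpy as np

# Minimal sketch of step one: project a 512-dimensional word embedding into
# 64-dimensional query, key and value vectors with three learned matrices.
d_model, d_k = 512, 64
rng = np.random.default_rng(0)

x1 = rng.normal(size=(d_model,))       # embedding of the word "Thinking"
WQ = rng.normal(size=(d_model, d_k))   # stand-ins for the trained weight matrices
WK = rng.normal(size=(d_model, d_k))
WV = rng.normal(size=(d_model, d_k))

q1 = x1 @ WQ   # query vector, shape (64,)
k1 = x1 @ WK   # key vector
v1 = x1 @ WV   # value vector
```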
 

The second step is to calculate a score. Say we are computing the self-attention for the first word in this example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode the word at a given position. The remaining steps below describe how the final value for each word is computed.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

 

 

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.

This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.

 

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).

That concludes the self-attention calculation. After step six, each word no longer outputs only the information it carries itself; it blends the information carried by the surrounding words into its own representation and outputs this fused value (z1 in the figure above).
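For concreteness, here is a minimal NumPy sketch of steps two through six for position #1, assuming the query, key and value vectors have already been created as described above (the random values are placeholders, not real model weights):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_k, seq_len = 64, 2                      # two words: "Thinking Machines"
rng = np.random.default_rng(0)
q1 = rng.normal(size=(d_k,))              # query for position #1
K  = rng.normal(size=(seq_len, d_k))      # one key vector per input word
V  = rng.normal(size=(seq_len, d_k))      # one value vector per input word

scores  = K @ q1                          # step 2: dot products q1 · k_i
scores /= np.sqrt(d_k)                    # step 3: divide by 8 (= sqrt(64))
weights = softmax(scores)                 # step 4: softmax over the positions
z1 = weights @ V                          # steps 5-6: weight the values and sum them
```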

Matrix Calculation of Self-Attention

This section essentially restates how the Z values are computed, this time in matrix form.

The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).


Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure)

 

Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.


The self-attention calculation in matrix form

(Translator's note: it may look as though a summation sign is missing from the formula, but the matrix product with V already performs the weighted sum of step six.)
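As a sanity check, here is a minimal NumPy sketch of the condensed formula softmax(QK^T / sqrt(d_k)) V, again with random placeholder weights and the illustrative shapes from the figures (2 words, d_model = 512, d_k = 64):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 2, 512, 64

X  = rng.normal(size=(seq_len, d_model))   # one row per input word
WQ = rng.normal(size=(d_model, d_k))
WK = rng.normal(size=(d_model, d_k))
WV = rng.normal(size=(d_model, d_k))

Q, K, V = X @ WQ, X @ WK, X @ WV
Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V    # steps 2-6 in one line, shape (2, 64)
```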

The Beast With Many Heads

The paper further refines the self-attention layer with a mechanism called "multi-headed" attention, which improves the performance of the attention layer in two ways. With multiple heads, each word is no longer associated with a single set of (Q, K, V) but with several sets: instead of computing Q/K/V once, we now compute eight sets. What is that good for? Let's read on.

  1. It expands the model's ability to focus on different positions. Yes, in the example above z1 contains a little bit of every other word's encoding, but it could still be dominated by the word itself.

  2. It gives the attention layer multiple "representation subspaces". (Translator's note: this justification felt somewhat hand-wavy to me, but in deep learning the empirical results are usually the deciding argument.)

With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.


How do we do that? We concat the matrices then multiply them by an additional weights matrix WO.
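Putting the pieces together, here is a minimal NumPy sketch of multi-headed attention under the same placeholder assumptions: eight heads each produce their own Z matrix, which are concatenated and projected back to the model dimension by WO:

```python
import numpy as np

def attention(X, WQ, WK, WV):
    # single-head self-attention, as sketched above
    d_k = WQ.shape[1]
    Q, K, V = X @ WQ, X @ WK, X @ WV
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 2, 512, 8
d_k = d_model // n_heads                   # 64 dimensions per head

X = rng.normal(size=(seq_len, d_model))
heads = []
for _ in range(n_heads):
    WQ = rng.normal(size=(d_model, d_k))   # separate Q/K/V weights for each head
    WK = rng.normal(size=(d_model, d_k))
    WV = rng.normal(size=(d_model, d_k))
    heads.append(attention(X, WQ, WK, WV)) # each Z_i has shape (2, 64)

WO = rng.normal(size=(n_heads * d_k, d_model))
Z = np.concatenate(heads, axis=-1) @ WO    # single (2, 512) matrix for the feed-forward layer
```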

Below, we put all of these steps together in a single figure.

 

Now that we have touched on attention heads, let's revisit our earlier example and see where the different attention heads focus as we encode the word "it".


As we encode the word "it", one attention head focuses most on "the animal" (the darkest orange on the right of the figure), while another head focuses most on "tired" (the darkest green) -- in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".

If we add all the attention heads to the picture, however, it becomes much harder to interpret intuitively.

Representing the Order of the Sequence Using Positional Encoding

One thing we've left out of the model as described so far is a way to account for the order of the words in the input sentence.

To address this, the Transformer adds an extra vector to each input word. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they are projected into Q/K/V vectors and during dot-product attention.

 


To give the model a sense of word order, we add positional encoding vectors whose values follow a specific pattern.

If we assume the embedding has a dimensionality of 4, the actual positional encodings would look like this:

A real example of positional encoding with a toy embedding size of 4

What might this pattern look like?

In the following figure, each row corresponds to the positional encoding of one vector, so the first row is the vector we would add to the embedding of the first word in the sentence. Each row contains 512 values, each between -1 and 1, which we color-code so the pattern is visible.

A real example with 20 words, each encoded with 512 dimensions. You can see a clear split down the middle: the left half is generated with a sine function and the right half with a cosine function. The two halves of the positional encoding are generated by different functions and then concatenated.

The method for generating positional encodings is described in section 3.5 of the paper; you can also look at the code in get_timing_signal_1d(). This is not the only possible method for positional encoding, but whatever method is used must be able to handle sequences of unseen lengths. Besides concatenating the sine and cosine halves, the encoding vectors can also be generated by interleaving the two signals.
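For reference, here is a minimal NumPy sketch of the sinusoidal encoding from section 3.5 of the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Note that this version interleaves the sine and cosine values, whereas the Tensor2Tensor get_timing_signal_1d() implementation concatenates the two halves, which is what the figure above shows:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: sin at even dimensions, cos at odd dimensions."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    dim = np.arange(0, d_model, 2)[None, :]        # even dimension indices 2i
    angle = pos / np.power(10000.0, dim / d_model)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(max_len=20, d_model=512)
# The encodings are simply added to the word embeddings:
# embeddings_with_position = word_embeddings + pe[:seq_len]
```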

 

The Residuals

One detail we should mention before moving on: each sub-layer in each encoder has a residual connection around it, followed by a layer-normalization step.

If we were to visualize the vectors and the layer-norm operation associated with self-attention, it would look like this:

The same applies to the sub-layers of the decoder. If we consider a Transformer made of a stack of two encoders and two decoders, it would look like this:

The Decoder Side

Now that we have covered most of the concepts on the encoder side, we basically know how the components of the decoder work as well. But let's look at how they work together.

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V, which the decoders use to help them focus on the appropriate positions in the input.

Once the encoding phase is finished, we begin the decoding phase. Each step of the decoding phase outputs one element of the output sequence. This process repeats until a special symbol is produced, indicating that decoding is complete. The output of each step is fed to the bottom decoder at the next time step, and, just like the encoders, the decoders bubble their results upward.

The self-attention layers in the decoder operate in a slightly different way than the ones in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions before the softmax step.

The attention layer between the encoder and decoder works just like multi-headed self-attention, except that it takes its Keys and Values from the output of the encoder stack.
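To make the masking concrete, here is a minimal NumPy sketch (with random placeholder scores) of how future positions can be hidden before the softmax so that each position attends only to itself and to earlier positions:

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))      # stand-in for Q K^T / sqrt(d_k)

# Mask out everything above the diagonal: position i must not see positions > i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row i attends only to positions 0..i
```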

The Final Linear and Softmax Layer

The decoder stack outputs a vector of floats. How do we turn that into a word? That is the job of the final Linear layer, followed by a softmax layer. (Translator's note: this is the same recipe used at the end of a CNN classifier; if you already know how it works, you can skim this section.)

The Linear layer is a fully connected layer that projects the vector produced by the decoder stack into a much larger vector called the logits vector.

Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.


This figure starts from the bottom with the vector produced as the output of the decoder stack. It is then turned into an output word.
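Here is a minimal NumPy sketch of these two final steps, assuming a 10,000-word output vocabulary and a 512-dimensional decoder output (the random weights are placeholders for the trained Linear layer):

```python
import numpy as np

d_model, vocab_size = 512, 10_000
rng = np.random.default_rng(0)

decoder_output = rng.normal(size=(d_model,))   # vector from the top of the decoder stack
W = rng.normal(size=(d_model, vocab_size))     # fully connected (Linear) layer
b = np.zeros(vocab_size)

logits = decoder_output @ W + b                # one score per word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax: all positive, sums to 1.0
predicted_word_id = int(np.argmax(probs))      # the output word for this time step
```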

 

A Recap of Training

Now that we’ve covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.

During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.

To visualize this, let’s assume our output vocabulary only contains six words (“a”, “am”, “i”, “thanks”, “student”, and “<eos>” (short for ‘end of sentence’)).


The output vocabulary of our model is created in the preprocessing phase before we even begin training.

Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word “am” using the following vector:


Example: one-hot encoding of our output vocabulary
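In code, this encoding is simply a vector with a single 1; here is a minimal sketch using the toy six-word vocabulary:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("am"))   # [0. 1. 0. 0. 0. 0.]
```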

Following this recap, let’s discuss the model’s loss function – the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.

The Loss Function

Say we are training our model on a simple example: translating "merci" into "thanks". This means we want the output to be a probability distribution indicating the word "thanks". But since the model has not been trained yet, that is unlikely to happen.

 

(Translator's note: at this point the general idea should be clear: we use cross-entropy or the Kullback–Leibler divergence to measure the distance between the two probability distributions. Feel free to skim the rest of this section if that is enough detail for you.)

Since the model's parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model's weights using backpropagation to make the output closer to the desired output.

How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
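As a small illustration (with made-up numbers, using the toy six-word vocabulary), cross-entropy between the one-hot target and the model's distribution looks like this:

```python
import numpy as np

# vocabulary order: ["a", "am", "i", "thanks", "student", "<eos>"]
target = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])   # desired output: "thanks"
model  = np.array([0.2, 0.1, 0.1, 0.3, 0.2, 0.1])   # an untrained model's output

cross_entropy = -np.sum(target * np.log(model))     # ≈ 1.20; lower is better
# For a one-hot target, the KL divergence equals the cross-entropy,
# since the target distribution has zero entropy.
```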

But note that this is an oversimplified example. More realistically, we’ll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means, is that we want our model to successively output probability distributions where:

  • Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 30,000 or 50,000)
  • The first probability distribution has the highest probability at the cell associated with the word “i”
  • The second probability distribution has the highest probability at the cell associated with the word “am”
  • And so on, until the fifth output distribution indicates ‘<end of sentence>’ symbol, which also has a cell associated with it from the 10,000 element vocabulary.

The targeted probability distributions we'll train our model against in the training example for one sample sentence.

After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:


Hopefully upon training, the model would output the right translation we expect. Of course it's no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output of that time step -- that's a very useful property of softmax which helps the training process.

Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, ‘I’ and ‘a’ for example), then in the next step, run the model twice: once assuming the first output position was the word ‘I’, and another time assuming the first output position was the word ‘a’, and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called “beam search”, where in our example, beam_size was two (meaning that at all times, two partial hypotheses (unfinished translations) are kept in memory), and top_beams is also two (meaning we’ll return two translations). These are both hyperparameters that you can experiment with.
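As a rough sketch of the greedy variant (the `model` and `vocab` objects here are hypothetical placeholders, not part of the post), decoding one word at a time might look like this; beam search would instead keep the top beam_size partial hypotheses alive at each step:

```python
import numpy as np

def greedy_decode(model, source, vocab, max_len=50):
    """Pick the most likely word at every time step and discard the rest."""
    output = []
    for _ in range(max_len):
        probs = model(source, output)        # distribution over the output vocabulary
        word = vocab[int(np.argmax(probs))]  # keep only the highest-probability word
        if word == "<eos>":                  # stop at the end-of-sentence symbol
            break
        output.append(word)
    return output
```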

Go Forth And Transform

I hope you’ve found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I’d suggest these next steps:

Follow-up works:

Acknowledgements

Thanks to Illia Polosukhin, Jakob Uszkoreit, Llion Jones, Lukasz Kaiser, Niki Parmar, and Noam Shazeer for providing feedback on earlier versions of this post.

Please hit me up on Twitter for any corrections or feedback.

Written on June 27, 2018