The Illustrated Transformer (translation)

The Illustrated Transformer

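In the previous post, we looked at Attention, a concept that helps improve the performance of neural networks. In this post, we will look at the Transformer, a model that uses attention to greatly speed up training. The Transformer outperforms the Google Neural Machine Translation (GNMT) model on a number of specific tasks. So let’s lift the veil on the model and see how it works.

The Transformer was proposed in the paper Attention is All You Need. In this post, we will try to simplify things as much as possible and introduce the concepts one by one, so that readers without prior knowledge of the subject can follow along.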




The encoding component and the decoding component are, in turn, made up of a stack of encoders and a stack of decoders:














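Don’t let the term “self-attention” throw you, or make you feel foolish if you don’t understand it right away: it is not a concept everyone is expected to know. I had personally never come across it until I read the Attention is All You Need paper. Let us distill how it works.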


The animal didn't cross the street because it was too tired

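What does “it” in this sentence refer to? Is it the street or the animal? This is a simple question for a human, but not as simple for an algorithm. When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.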


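If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to combine its representation of the words/vectors it has already processed with the one it is currently processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the word we are currently processing.

As we encode the word “it” in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on “The Animal” and bakes a part of its representation into the encoding of “it”.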



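Let’s first look at how to calculate self-attention using vectors, then move on to how it is actually implemented, using matrices. The first step is to create three vectors from each input vector (the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. (One vector becomes three: what is the point of that?) These three vectors are created by multiplying the embedding by three matrices, and those matrices are learned during training. Note that the new vectors are smaller in dimension than the embedding vector: their dimensionality is 64, while the embedding and the encoder input/output vectors have a dimensionality of 512. They don’t have to be smaller; this is an architecture choice that keeps the cost of the multi-headed attention computation roughly constant.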


Multiplying x1 by the WQ weight matrix produces q1, the query vector for that word. In the same way, we end up with a key vector and a value vector for each word.
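To make the projection concrete, here is a minimal sketch in plain Python. The sizes are shrunk to an embedding width of 4 and a q/k/v width of 3 (the post uses 512 and 64), and the embedding and the three weight matrices are made-up values for illustration, not trained weights.

```python
def matvec(x, W):
    """Multiply a row vector x (length n) by a matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Hypothetical embedding of the first word (toy width 4 instead of 512).
x1 = [1.0, 0.0, 1.0, 0.0]

# Hypothetical weight matrices, each 4 x 3 (toy q/k/v width 3 instead of 64).
# In a real model these are learned during training.
W_Q = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
W_K = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
W_V = [[1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]

q1 = matvec(x1, W_Q)  # query vector for word 1
k1 = matvec(x1, W_K)  # key vector for word 1
v1 = matvec(x1, W_V)  # value vector for word 1
print(q1, k1, v1)
```

The same three matrices are reused for every word in the sentence; only the embedding changes.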

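The second step is to calculate a score. Say we are calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position. The steps that follow describe in detail how the final output for each word is computed; they are left in the original English rather than translated one by one.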

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.
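As a small sketch of this scoring step, with made-up query and key vectors for a two-word input:

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vectors (toy width 3).
q1 = [1.0, 0.0, 1.0]  # query for the word in position #1
k1 = [1.0, 1.0, 0.0]  # key for the word in position #1
k2 = [0.0, 2.0, 3.0]  # key for the word in position #2

# One score per input word, all computed against the same query q1.
scores = [dot(q1, k) for k in (k1, k2)]
print(scores)
```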



The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.

This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.
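Steps three and four can be sketched as follows. The raw scores 112 and 96 are illustrative values chosen so the arithmetic is easy to follow; they are an assumption, not numbers taken from the text above.

```python
import math

def softmax(xs):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(xs)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                    # dimension of the key vectors
raw_scores = [112.0, 96.0]  # hypothetical q1.k1 and q1.k2

scaled = [s / math.sqrt(d_k) for s in raw_scores]  # divide by 8
weights = softmax(scaled)
print(scaled, weights)
```

With these numbers the first word gets roughly 88% of the weight and the second roughly 12%.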


The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
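Steps five and six amount to a weighted sum of the value vectors. A minimal sketch, assuming softmax weights of 0.88 and 0.12 and two made-up value vectors:

```python
# Hypothetical softmax weights for position #1 and value vectors (toy width 3).
weights = [0.88, 0.12]
values = [
    [1.0, 2.0, 0.0],  # v1
    [0.0, 1.0, 4.0],  # v2
]

# Step five: scale each value vector by its softmax weight.
weighted = [[w * x for x in v] for w, v in zip(weights, values)]

# Step six: sum the weighted vectors element-wise to get z1,
# the self-attention output for the first word.
z1 = [sum(column) for column in zip(*weighted)]
print(z1)
```

Because v2 is multiplied by a small weight, it contributes little to z1, which is the "drown-out" effect described above.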




The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).

Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure).


Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.

The self-attention calculation in matrix form
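Written out, the formula is Z = softmax(Q Kᵀ / √d_k) V, with the softmax applied to each row. Below is a minimal sketch of it in plain Python, using nested lists as matrices. The input X and the weight matrices are made-up toy values; a real implementation would use an array library rather than these hand-written helpers.

```python
import math

def matmul(A, B):
    """Matrix product of A (n x m) and B (m x p), both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Apply softmax independently to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(X, W_Q, W_K, W_V):
    """Z = softmax(Q K^T / sqrt(d_k)) V, one output row per input word."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), V)

# Hypothetical input: two words, embedding width 4, q/k/v width 3.
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0]]
W_Q = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
W_K = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
W_V = [[1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]

Z = self_attention(X, W_Q, W_K, W_V)
print(Z)
```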



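The paper goes on to introduce the idea of “multi-headed” attention, which improves the performance of the attention layer in two ways. With multiple heads, each word no longer has a single set of (Q, K, V) but several: where we computed Q/K/V once before, we now compute eight sets. What is that good for? Let’s read on.

  1. It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other word’s encoding, but it could be dominated by the actual word itself.

  2. It gives the attention layer multiple “representation subspaces”. (I did not fully follow the reasoning here and it feels a little forced to me, but deep learning is an empirical field: as long as the results are good, almost any explanation will do.)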

With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.


How do we do that? We concatenate the matrices, then multiply them by an additional weights matrix WO.
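A minimal sketch of that concatenate-then-project step, shrunk to two heads instead of eight. The per-head Z matrices and the WO matrix are made-up values for illustration.

```python
def matmul(A, B):
    """Matrix product of A (n x m) and B (m x p), both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def concat_heads(heads):
    """Join the per-head Z matrices side by side, one row per word."""
    rows = []
    for r in range(len(heads[0])):
        row = []
        for head in heads:
            row.extend(head[r])
        rows.append(row)
    return rows

# Hypothetical outputs of two attention heads: 2 words, head width 2.
Z0 = [[1.0, 2.0], [3.0, 4.0]]
Z1 = [[5.0, 6.0], [7.0, 8.0]]

# Hypothetical W_O: maps the concatenated width (2 heads x 2) to model width 3.
W_O = [[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 1, 1]]

Z_concat = concat_heads([Z0, Z1])  # shape: 2 words x 4
Z = matmul(Z_concat, W_O)          # shape: 2 words x 3, one vector per word
print(Z)
```

The result has one row per word again, which is the single matrix the feed-forward layer expects.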




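As we encode the word “it”, one attention head focuses most on “the animal” (the darkest orange on the right-hand side of the figure), while another focuses on “tired” (the darkest green). In a sense, the model’s representation of the word “it” bakes in some of the representation of both “animal” and “tired”.

If we add all the attention heads to the picture, however, things become much harder to interpret.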



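To account for the order of the words, the Transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they are projected into Q/K/V vectors and during dot-product attention.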




A real example of positional encoding with a toy embedding size of 4
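The exact pattern is not spelled out in this section. As a sketch, the code below assumes the fixed sinusoidal scheme from the Attention is All You Need paper, where even dimensions use a sine and odd dimensions use a cosine of pos / 10000^(2i/d_model). The function names and the toy embeddings are made up for illustration.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings, one row of width d_model per position."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add_positions(embeddings):
    """Add the positional encoding to each word embedding, element-wise."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, enc)] for emb, enc in zip(embeddings, pe)]

# Three words with a toy embedding size of 4 (hypothetical values).
embeddings = [[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]]

pe = positional_encoding(3, 4)
with_positions = add_positions(embeddings)
print(pe[0], pe[1])
```

Other layouts exist, for example placing all the sine values in the first half of the vector and all the cosine values in the second half; the interleaved form here is only one option.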


















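The decoder stack outputs a vector of floats. How do we turn that into a word? That is the job of the final Linear layer, followed by a Softmax layer. (This looks like the same pattern used at the end of a CNN; once you understand how it is used, you can skip this subsection.)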


Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
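Here is a minimal sketch of that last stage, shrunk to a six-word vocabulary and a decoder output of width 3. The vocabulary, the decoder output, and the Linear layer's weights are all made up for illustration, and the bias term is left out to keep it short.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output vocabulary (a real one would have thousands of words).
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

# Hypothetical decoder output for one time step (toy width 3).
decoder_output = [1.0, 0.0, 2.0]

# Hypothetical Linear layer weights: 3 rows (input width) x 6 columns (vocab size).
W = [[0.1, 2.0, 0.3, 0.0, 0.5, 0.2],
     [0.4, 0.1, 0.9, 0.3, 0.2, 0.6],
     [0.0, 1.0, 0.1, 0.2, 0.1, 0.0]]

# Linear layer: project the decoder output onto one score (logit) per word.
logits = [sum(x * W[i][j] for i, x in enumerate(decoder_output))
          for j in range(len(vocab))]

# Softmax layer: scores -> probabilities, then pick the most likely word.
probs = softmax(logits)
best = max(range(len(vocab)), key=lambda j: probs[j])
word = vocab[best]
print(word)
```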

This figure starts from the bottom with the vector produced as the output of the decoder stack. It is then turned into an output word.



Now that we’ve covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.

During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.

To visualize this, let’s assume our output vocabulary only contains six words (“a”, “am”, “i”, “thanks”, “student”, and “<eos>” (short for ‘end of sentence’)).

The output vocabulary of our model is created in the preprocessing phase before we even begin training.

Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word “am” using the following vector:

Example: one-hot encoding of our output vocabulary
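A small sketch of one-hot encoding over the six-word vocabulary, assuming the words are indexed in the order listed above:

```python
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def one_hot(word, vocab):
    """Vector as wide as the vocabulary: 1.0 at the word's index, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

am_vector = one_hot("am", vocab)
print(am_vector)  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```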

Following this recap, let’s discuss the model’s loss function – the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.


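Say we are training our model, and take a simple example: translating “merci” into “thanks”. This means we want the output to be a probability distribution that points to the word “thanks”. But since the model has not been trained yet, that is unlikely to happen.


By this point you can roughly see how it works: cross-entropy and Kullback–Leibler divergence are used to measure the distance between two probability distributions. The rest can be skipped.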

Since the model's parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model's weights using backpropagation to make the output closer to the desired output.

How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
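For a rough idea of what those two measures compute, here is a sketch in plain Python. The “untrained” and “trained” distributions are made-up numbers, and the small eps guarding against log(0) is an implementation choice of this example.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) = sum_i target_i * log(target_i / predicted_i)."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)

# Vocabulary order: a, am, i, thanks, student, <eos>
target    = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]        # one-hot for "thanks"
untrained = [0.2, 0.2, 0.1, 0.2, 0.2, 0.1]        # hypothetical untrained output
trained   = [0.01, 0.01, 0.01, 0.93, 0.02, 0.02]  # hypothetical trained output

print(cross_entropy(target, untrained), cross_entropy(target, trained))
print(kl_divergence(target, untrained), kl_divergence(target, trained))
```

With a one-hot target the two measures give essentially the same number, since the target distribution itself has zero entropy. Both shrink as the predicted probability of the correct word approaches 1.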

But note that this is an oversimplified example. More realistically, we’ll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means, is that we want our model to successively output probability distributions where:

  • Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 30,000 or 50,000)
  • The first probability distribution has the highest probability at the cell associated with the word “i”
  • The second probability distribution has the highest probability at the cell associated with the word “am”
  • And so on, until the fifth output distribution indicates ‘<end of sentence>’ symbol, which also has a cell associated with it from the 10,000 element vocabulary.

The targeted probability distributions we'll train our model against in the training example for one sample sentence.

After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:

Hopefully upon training, the model would output the right translation we expect. Of course it's no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output of that time step -- that's a very useful property of softmax which helps the training process.

Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, ‘I’ and ‘a’ for example), then in the next step, run the model twice: once assuming the first output position was the word ‘I’, and another time assuming the first output position was the word ‘a’, and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called “beam search”, where in our example, beam_size was two (meaning that at all times, two partial hypotheses (unfinished translations) are kept in memory), and top_beams is also two (meaning we’ll return two translations). These are both hyperparameters that you can experiment with.
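The difference between the two strategies can be sketched with a toy stand-in for the model. Everything here is hypothetical: TOY_MODEL is a hand-written lookup table rather than a trained Transformer, and hypotheses are ranked by their summed log-probability, which is one common way to express “less error”.

```python
import math

# Hypothetical stand-in for a trained model: maps the words produced so far
# to a probability distribution over the next word.
TOY_MODEL = {
    (): {"i": 0.6, "a": 0.4},
    ("i",): {"am": 0.55, "a": 0.45},
    ("a",): {"student": 0.9, "am": 0.1},
}

def next_probs(prefix):
    # Any prefix not listed above simply ends the sentence.
    return TOY_MODEL.get(prefix, {"<eos>": 1.0})

def greedy_decode(max_len=5):
    """Keep only the single most probable word at every step."""
    seq = ()
    while len(seq) < max_len and (not seq or seq[-1] != "<eos>"):
        probs = next_probs(seq)
        seq += (max(probs, key=probs.get),)
    return seq

def beam_search(beam_size=2, max_len=5):
    """Keep the beam_size best partial hypotheses at every step."""
    beams = [((), 0.0)]  # (sequence, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, logp))  # finished hypothesis, keep as is
                continue
            for word, p in next_probs(seq).items():
                candidates.append((seq + (word,), logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == "<eos>" for seq, _ in beams):
            break
    return beams

print(greedy_decode())
print(beam_search(beam_size=2))
```

In this toy table, greedy decoding commits to “i” at the first position because it is the most probable first word, while beam search also keeps “a” alive and ends up ranking the hypothesis that starts with “a” higher overall (0.4 × 0.9 = 0.36 versus 0.6 × 0.55 = 0.33).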

Go Forth And Transform

I hope you’ve found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I’d suggest these next steps:

Follow-up works:


Thanks to Illia Polosukhin, Jakob Uszkoreit, Llion Jones, Lukasz Kaiser, Niki Parmar, and Noam Shazeer for providing feedback on earlier versions of this post.

Please hit me up on Twitter for any corrections or feedback.

Written on June 27, 2018






