Transformers: State-of-the-Art Natural Language Processing

This is a 3-part series where we will go through Transformers, BERT, and a hands-on Kaggle challenge — Google QUEST Q&A Labeling to see Transformers in action (top 4.4% on the leaderboard). In this part (1/3) we will look at how Transformers became state-of-the-art in various modern natural language processing tasks, and how they work.


The Transformer is a deep learning model proposed in the paper Attention is All You Need by researchers at Google and the University of Toronto in 2017, used primarily in the field of natural language processing (NLP).


Like recurrent neural networks (RNNs), Transformers are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization. However, unlike RNNs, Transformers do not require that the sequential data be processed in order. For example, if the input data is a natural language sentence, the Transformer does not need to process the beginning of it before the end. Due to this feature, the Transformer allows for much more parallelization than RNNs and therefore reduces training time.


Transformers were designed around the attention mechanism, which was originally devised to help memorize long source sentences in neural machine translation.


Sounds cool, right? Let's take a look under the hood and see how things work.



Transformers are based on an encoder-decoder architecture: the encoder consists of a stack of encoding layers that process the input iteratively, one layer after another, and the decoder consists of a stack of decoding layers that do the same to the output of the encoder.


So, when we pass a sentence into a transformer, it is embedded and passed into a stack of encoders. The output from the final encoder is then passed into each decoder block in the decoder stack. The decoder stack then generates the output.


All the encoder blocks in the transformer are identical and similarly, all the decoder blocks in the transformer are identical.


[Image source: http://jalammar.github.io/illustrated-transformer/]

This was a very high-level representation of a transformer, and on its own it probably doesn't explain why transformers are so effective in modern NLP tasks. Don't worry; to make things clearer, we will now go through the internals of an encoder and a decoder cell…


Encoder


The encoder has two parts: a self-attention layer and a feed-forward neural network.


[Image source: http://jalammar.github.io/illustrated-transformer/]

The encoder's inputs first flow through a self-attention layer — a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. Basically, for each input word x, the self-attention layer generates a vector Z by taking all the input words (x1, x2, x3, …, xn) into account before producing Z. I'll come to why it considers all the input words' embeddings and how it generates Z later in this blog, but for now just remember this brief high-level summary of the subcomponents of an encoder.


The outputs of the self-attention layer are fed to a feed-forward neural network. The feed-forward neural network generates an output for each input Z, and its output is passed into the next encoder block's self-attention layer, and so on.

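To keep this picture in mind, here is a minimal sketch of how data flows through one encoder block; self_attention and feed_forward are just placeholder functions (not from any library) that we will flesh out as we go:

```python
def encoder_block(x, self_attention, feed_forward):
    """Conceptual data flow through a single encoder block.
    x: matrix of shape (n_words, d_model) holding the input word embeddings."""
    z = self_attention(x)    # every word looks at every other word in the input
    return feed_forward(z)   # applied position-wise; result goes to the next encoder block
```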

Now that we have an idea of what is inside an encoder, let's understand the tensor operations happening inside each component.


First comes the input:


We know that transformers are used for NLP tasks, so the data we deal with is usually a corpus of sentences, but since machine learning algorithms are all about matrix operations, we first need to convert the human-readable sentences into a machine-readable format (numbers). To convert the sentences into numbers, we use word embeddings. This step is simple: each word in a sentence is represented as an n-dimensional vector (n is usually 512), and for transformers we typically use the GloVe embedding representation of words. There is also something called positional encoding that is applied to these embeddings, but I'll come to it later. Once we have the embedding for each input word, we pass these embeddings simultaneously to the self-attention layer.

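As a rough sketch of this step (the tiny vocabulary and the random lookup table below are only stand-ins for real GloVe or learned embeddings), turning a sentence into the matrix the encoder consumes looks something like:

```python
import numpy as np

d_model = 512                                           # embedding size per word
vocab = {"the": 0, "animal": 1, "didn't": 2, "cross": 3, "street": 4}
embedding_table = np.random.randn(len(vocab), d_model)  # stand-in for GloVe / learned embeddings

sentence = ["the", "animal", "didn't", "cross", "the", "street"]
x = np.stack([embedding_table[vocab[w]] for w in sentence])
print(x.shape)  # (6, 512) -> n input words, each a 512-dimensional vector
```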

The training parameters of the self-attention layer:


Different layers have different learnable parameters, e.g. a Dense layer has weights and biases, and a Convolutional layer has kernels. Similarly, the self-attention layer has 4 learnable parameters:
- Query matrix: Wq
- Key matrix: Wk
- Value matrix: Wv
- Output matrix: Wo (this is not the layer's output itself, but a trainable parameter used to generate the final output Z of the self-attention layer).


The first 3 trainable parameters have a special purpose: they are used to generate 3 new tensors:
- Query: Q
- Key: K
- Value: V
which are later used to generate the output Z from the input x. Let's see how.


Some points to keep in mind:
- The input tensor x has n rows and m columns, where n is the number of input words and m is the vector size of each word, i.e. 512.
- The output tensors Q, K, V, and Z have n rows and dk columns, where n is the number of input words and dk is 64.
The values of m and dk are not random; they were found to work best by the researchers who came up with this architecture.

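In tensor terms, generating Q, K, and V is just three matrix multiplications; here is a minimal NumPy sketch with random weights standing in for the learned Wq, Wk, and Wv:

```python
import numpy as np

n, d_model, d_k = 6, 512, 64        # 6 input words, 512-dim embeddings, 64-dim Q/K/V

x  = np.random.randn(n, d_model)    # input word embeddings, one row per word
Wq = np.random.randn(d_model, d_k)  # trainable query matrix
Wk = np.random.randn(d_model, d_k)  # trainable key matrix
Wv = np.random.randn(d_model, d_k)  # trainable value matrix

Q, K, V = x @ Wq, x @ Wk, x @ Wv    # each of shape (n, 64)
```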

[Image source: http://jalammar.github.io/illustrated-transformer/]

After calculating the 3 tensors Q, K, and V as mentioned above, the self-attention layer then calculates scores: a vector for each of the input words.


Dot-product attention:


The next step in the self-attention layer is to calculate the score vector corresponding to each input word. This score calculation is one of the most crucial steps that bring the attention mechanism to life (well… not literally). The score vector has a size of n, where n is the number of input words, and each element of this vector is a number that tells how much the word it corresponds to contributes to the current word. Let's consider an example to get the intuition:

"The animal didn't cross the street because it was too tired"

In the above sentence, the word it refers to the animal and not the street. For us this is pretty simple to grasp, but not for a machine with no attention: we know how grammar works, and we have developed a sense that it is more likely to refer to animal than to words like cross or street. Transformers develop this sense of grammar through training, but the fact that, for a given word, the model considers all the words in the input and then has the ability to select the ones it thinks contribute the most is what the attention mechanism is about. For the above sentence, the score vector generated for the word it will have 11 numbers, each corresponding to a word in the input sentence. For a well-trained model, this score vector will have larger numbers at positions 2 and 8, because the words at position 2 (animal) and position 8 (it) contribute the most to it. It may look something like: [2, 60, 4, 5, 3, 8, 5, 90, 7, 6, 3]. Notice that the values at positions 2 and 8 are greater than the values at the other positions.


[Image source: http://jalammar.github.io/illustrated-transformer/]

Let's see how these scores are generated in the self-attention layer. So far, for each word, we have the Q, K, and V vectors. To generate the score vector, we use something called dot-product attention, where we take the dot product between the Q and K vectors to generate a score. Q corresponds to the query of the word for which we are calculating the scores (in the above example, the word it), whereas there are n values of K, each corresponding to the key vector of one input word. So, if we want to generate the scores for the word it:


  1. We take the query vector of the word it: Q


  2. We take the key vectors of all the words in the input sentence: K1, K2, K3, …, Kn.


  3. We take a dot product between Q and K’s and obtain n scores.


After calculating the scores, we scale them by dividing them by the square root of dk, the column dimension of the vectors Q, K, and V. This step is needed because the creators of the transformer found that dividing the scores by sqrt(dk) gives better results.


After scaling the score vectors, we pass them through a softmax function so that the output weights still reflect the original scores but all the values sum up to 1.


Once we have the 'softmaxed' scores ready, we simply multiply each score element with the value vector V corresponding to it, so that we get n weighted value vectors after this operation: [V1, V2, V3, …, Vn]. Now, to obtain the output Z of the self-attention layer, we simply add up all n of these weighted value vectors.

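Putting the scoring, scaling, softmax, and weighted-sum steps together, a minimal sketch of the whole dot-product attention computation (assuming Q, K, and V were produced as shown earlier) looks like this:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # dot products between queries and keys, scaled by sqrt(dk)
    weights = softmax(scores)        # each row of weights now sums to 1
    return weights @ V               # weighted sum of the value vectors = Z

n, d_k = 6, 64
Q, K, V = (np.random.randn(n, d_k) for _ in range(3))
Z = scaled_dot_product_attention(Q, K, V)  # shape (n, 64): one output vector per input word
```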

[Image source: http://jalammar.github.io/illustrated-transformer/]

The above diagrams illustrate the steps of the self-attention layer.


Multi-head Attention:


Now that we know how an attention head works and how amazing it is, there is a catch. A single attention head can sometimes miss some of the input words that contribute most to the word in the spotlight; in the example before, the attention head may fail to pay attention to the word animal while encoding the word it, and this can cause problems. To tackle this issue, instead of just a single attention head we use multiple attention heads, each working in a similar manner. This helps reduce the error or miscalculation of any single attention head. This is also referred to as multi-head attention.


[Image: The scores from 2 different attention heads are represented in orange and green. We can see how one attention head pays more attention to words like the, animal, cross, whereas the other pays more attention to words like street, was, tired. (image source)]

In the transformer, multi-head attention typically uses 8 attention heads. Now notice that the output of a single attention head is 64-dimensional, but if we use multi-head attention, we get 8 such 64-dimensional vectors as output.


[Image source: http://jalammar.github.io/illustrated-transformer/]

This is where the final trainable parameter, the output matrix Wo that I mentioned before, comes into play. In the final step of self-attention, all the head outputs [Z0, Z1, Z2, …, Z7] are concatenated and multiplied with Wo, so that the final output Z has the model dimension (512) and can be fed to the feed-forward network.

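Here is a minimal sketch of multi-head attention, with random matrices standing in for the trained Wq, Wk, Wv, and Wo, and the output dimension following the original paper's d_model of 512:

```python
import numpy as np

def multi_head_attention(x, heads=8, d_k=64, seed=0):
    """Sketch of multi-head self-attention; weights are random stand-ins for trained parameters."""
    rng = np.random.default_rng(seed)
    n, d_model = x.shape
    head_outputs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        scores = Q @ K.T / np.sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        head_outputs.append(weights @ V)             # one (n, 64) matrix per head
    concat = np.concatenate(head_outputs, axis=-1)   # (n, 8 * 64) = (n, 512)
    Wo = rng.normal(size=(heads * d_k, d_model))     # trainable output matrix
    return concat @ Wo                               # final Z: (n, d_model)

Z = multi_head_attention(np.random.randn(6, 512))
print(Z.shape)  # (6, 512)
```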


Positional encoding:


Remember that in the 'First comes the input' section I mentioned positional encoding; let's see what it is and how it helps. The problem with our current awesome transformer is that it does not take the position of the input words into account. Unlike an RNN, where timesteps denote which word comes before and after, in transformers the words are fed in simultaneously, so we need some kind of positional encoding that defines which word comes after which. Positional encoding comes to our rescue, as it gives the input embeddings a sense of position: we first generate a position embedding for each of the input words, and these position embeddings are then added to the word embeddings of the respective words to produce embeddings with a time signal.


There were many proposed methods for generating the positional embeddings, like one-hot encoded vectors or binary encoding, but what the researchers found to work best was using the equations below to generate the embeddings:


PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position of the word in the sentence and i indexes the dimensions of the embedding.

When we plot the 128-dimensional positional encoding for a sentence with a maximum length of 50, it looks something like:

当我们绘制最大长度为50的句子128维位置编码时,它看起来像:

[Image: Each row represents the embedding vector (Image by Author)]
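
A small sketch of the sinusoidal positional encoding described by the equations above (positional_encoding is just an illustrative helper name, not a library function):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, one row per position."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model / 2)
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even embedding dimensions
    pe[:, 1::2] = np.cos(angles)               # odd embedding dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=128)  # the matrix plotted above
# x_with_position = word_embeddings + positional_encoding(n_words, 512)
```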

Residual connections:


Finally, there is one more improvement added to the encoders, known as residual connections or skip connections, which allow the output of a previous layer to bypass the layers in between. This helps in deep networks with many hidden layers: if a layer in between is not of much use or is not learning much, the skip connection lets the signal bypass it. Another thing to note is that after the residual connection is added, the result is normalized (layer normalization).

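As a rough sketch (using a simplified layer normalization without the learnable gain and bias used in practice), the add-and-normalize step around any sub-layer looks like:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual (skip) connection around a sub-layer, followed by normalization."""
    return layer_norm(x + sublayer(x))  # the input can bypass the sub-layer entirely

x = np.random.randn(6, 512)
out = add_and_norm(x, sublayer=lambda t: 0.1 * t)  # toy sub-layer, just for illustration
```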


Decoder


A decoder is very similar to the encoder. Like the encoder, it has the self-attention and feed-forward layers, but it also has an additional block known as Encoder-Decoder Attention sandwiched between the two. The Encoder-Decoder Attention layer works just like multi-headed self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack. The remaining 2 layers work exactly the same as those in the encoder cell.

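The only real difference from self-attention is where Q, K, and V come from. A minimal sketch of the Encoder-Decoder Attention step, again with random matrices standing in for trained weights:

```python
import numpy as np

def encoder_decoder_attention(decoder_x, encoder_out, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values come from the encoder output."""
    Q = decoder_x @ Wq               # built from the layer below in the decoder
    K = encoder_out @ Wk             # built from the final encoder's output
    V = encoder_out @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V               # each decoder position attends over all encoder positions

dec = np.random.randn(3, 512)                 # embeddings for 3 decoder positions
enc = np.random.randn(6, 512)                 # final encoder output for 6 input words
W = [np.random.randn(512, 64) for _ in range(3)]
Z = encoder_decoder_attention(dec, enc, *W)   # shape (3, 64): one vector per decoder position
```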


The input to the decoder stack is sequential, unlike the simultaneous input to the encoder stack: the first output word is passed into the decoder as an input, which the decoder uses to generate the second output; this output is then passed back in as an input, which the decoder uses to generate the third output, and so on…

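A sketch of that generation loop; everything here is hypothetical: transformer is assumed to be a callable that returns a probability distribution over the vocabulary for the next word, and start_token / end_token are assumed special tokens:

```python
def greedy_decode(transformer, encoder_output, start_token, end_token, max_len=50):
    """Generate the output sequence one word at a time (hypothetical sketch)."""
    output = [start_token]
    for _ in range(max_len):
        probs = transformer(encoder_output, output)  # condition on everything generated so far
        next_word = probs.argmax()                   # pick the most likely next word
        output.append(next_word)
        if next_word == end_token:                   # stop once the end-of-sentence token appears
            break
    return output
```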


The output of the decoder stack is passed into a linear layer with softmax activation, which produces the probability of each vocabulary word being the next output word.


Once the transformer predicts a word using forward propagation, the prediction is compared with the actual label using a loss function like cross-entropy, and then all the trainable parameters are updated using back-propagation. Well, this is one simplified way of understanding how learning happens in transformers. There are variations, such as using the complete output sentence to calculate the loss. To know more, you can check out this amazing blog on the Transformer by Jay Alammar.

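For intuition, here is a tiny sketch of the cross-entropy loss for a single predicted word (the probabilities below are made up):

```python
import numpy as np

def cross_entropy(predicted_probs, target_index):
    """Loss for one predicted word: the negative log probability of the correct word."""
    return -np.log(predicted_probs[target_index] + 1e-9)

probs = np.array([0.05, 0.15, 0.70, 0.10])   # model's softmax output over a toy 4-word vocabulary
loss = cross_entropy(probs, target_index=2)  # ≈ 0.357; back-propagation then nudges Wq, Wk, Wv, Wo, ...
print(loss)
```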

With this, we have come to the end of this blog. I hope the read was pleasant. I would like to thank all the creators of the awesome content I referred to while writing this blog.


Reference links:

- Attention Is All You Need (Vaswani et al., 2017): https://arxiv.org/abs/1706.03762
- The Illustrated Transformer by Jay Alammar: http://jalammar.github.io/illustrated-transformer/

Final note


Thank you for reading the blog. I hope it was useful for some of you aspiring to do projects or learn some new concepts in NLP.


In part 2/3 we will go through BERT (Bidirectional Encoder Representations from Transformers).


In part 3/3 we will go through a hands-on Kaggle challenge — Google QUEST Q&A Labeling to see Transformers in action (top 4.4% on the leaderboard).


Find me on LinkedIn: www.linkedin.com/in/sarthak-vajpayee


Peace! ☮


Translated from: https://towardsdatascience.com/transformers-state-of-the-art-natural-language-processing-1d84c4c7462b
