Attention Please!

THE LANGUAGE MODELING PATH: CHAPTER 2

Hi, and welcome to the second step of the Language Modeling Path, a series of articles from Machine Learning Reply covering the most important milestones behind the huge language models, such as BERT and GPT-3, that are able to imitate and (let’s say) understand human language.

In this article we are going to talk about attention layers. This architectural trick was first applied in the computer vision field [1], but here we will focus only on neural Natural Language Processing applications and, in particular, on sequence-to-sequence applications for Neural Machine Translation (NMT). While this article is based on two papers ([2] and [3]), attention is a widespread technique that you can find explained in many places if you need to go deeper.

To better understand this chapter, it is strongly suggested that you have a good grasp of encoder-decoder sequence-to-sequence models. If you need a refresher on the key concepts of those architectures, you can start from the first chapter of our Language Modeling Path.

Attention in the real world

The introduction of the attention mechanism in common seq2seq applications allows the processing of longer and more complex sentences. As anticipated, the basic insight behind this trick was born in the computer vision field and then developed around natural language for Neural Machine Translation (NMT) applications. In this article we will focus on NMT simply because it is the natural habitat of this algorithm; nevertheless, the family of attention-based models (models that rely on this particular architectural pattern) counts among its ranks many state-of-the-art models in most Natural Language Processing application fields.

This is because the attention mechanism is the key that lets Google Assistant and Amazon Alexa understand our intentions even when we use more than a simple sentence to express them.

It can give a boost in accuracy in all applications that require text embeddings. Here is a brief (incomplete) list of topics where we were able to observe its improvements with respect to non-attention-based models.

  • Document retrieval
  • Text classification
  • Text clustering
  • Text similarity
  • Personalized search engines
  • Text generation

Moreover, the attention mechanism gives a more detailed insight into which part of the input had the highest impact on the decision made by our model, and this is a huge advantage in a production environment, making the black-box neural network a little less black.

A problem in standard Sequence-to-Sequence NMT

In the previous chapter we discussed one of the most effective architectures still used nowadays for the NMT problem: the sequence-to-sequence model. In this kind of network an input sequence, made of the single words that compose the sentence in the source language, is fed into a recurrent neural network, for example an LSTM. This network tries to collect all the information from every input word and stores it inside a fixed-length array. This first part of the model is called the encoder. The encoded array is then passed as the initial hidden state of a second recurrent neural network that tries to generate the correct translation one word at a time, starting at each step from the previous hidden state and the previously generated word; this second part is called the decoder.
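As a concrete (and deliberately oversimplified) illustration of the encoder side, here is a minimal NumPy sketch. The cell rnn_step, the parameter names and the sizes are assumptions made for this example, not the implementation of [2] or [3]; the point is only that, whatever the input length, everything the decoder will ever see ends up in one fixed-length vector.

```python
import numpy as np

# Minimal, schematic encoder sketch (illustrative only, not the model of [2] or [3]).
# `rnn_step` stands in for any recurrent cell (LSTM, GRU, ...); parameter names and
# sizes are assumptions made for this example.
rng = np.random.default_rng(0)
d_hidden, d_emb, T = 8, 4, 6                      # hidden size, embedding size, sentence length

W_h = 0.1 * rng.standard_normal((d_hidden, d_hidden))
W_x = 0.1 * rng.standard_normal((d_hidden, d_emb))

def rnn_step(h_prev, x_t):
    """One recurrent update: the new hidden state mixes the old state and the current word."""
    return np.tanh(W_h @ h_prev + W_x @ x_t)

x = rng.standard_normal((T, d_emb))               # embedded source words x_1 ... x_T
h = np.zeros(d_hidden)
for t in range(T):                                # read the whole source sentence
    h = rnn_step(h, x[t])

encoded = h                                       # the only thing the decoder will see
print(encoded.shape)                              # (8,) no matter how long the sentence is
```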


This kind of approach allows the input and output sequences to have great flexibility in length and opened up the way for deep learning inside the automatic translation field of research. However, this approach still presents some intuitive downsides.

In fact, from [2] we can read:

“A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Cho et al.(2014b) showed that indeed the performance of a basic encoder–decoder deteriorates rapidly as the length of an input sentence increases.”

Even if LSTM recurrent neural networks are able to carry information from the beginning of the input sentence up to the end, i.e. up to the encoded array, the space where this information can be stored is limited by the fixed dimension of the encoded array. If the input sentence is too long with respect to this fixed dimension, we cannot avoid losing some important information, and the translation will be affected by it.

The attention-based model

The idea behind the attention mechanism is the following. During the encoding phase, at each step of the RNN I store the hidden state related to each of the input sentence’s words. Then, during decoding, at each step of the RNN, in addition to the decoder’s previous hidden state and previously generated word, I also take into account a weighted average of the previously stored encoder hidden states. The weights of this average depend directly on the decoder’s previous hidden state and are learned during training, in order to give more attention to the encoder hidden states related to the words that are most “aligned” (semantically, syntactically, etc.) with the translated word I am trying to generate at the current step.

At each step of the decoder I highlight the information of a small subset of the input sentence that is strictly related to the current translation step.

Let’s see in more detail how this idea has been implemented. Let’s start by defining some initial symbols.

  • X = (x₁, …, xT): input sentence
  • Y = (y₁, …, yS): output sentence
  • hᵢ = encoder’s hidden state at step i, with i in [1, …, T]
  • sₜ = decoder’s hidden state at step t, with t in [1, …, S]

As we saw in chapter one of the Language Modeling Path, the target here is, as always, estimating the conditional probability of the next generated word given all the previous ones and some kind of information taken from the input sentence X. This conditional probability is computed by applying some function g (for example a dense neural network) to the hidden state sₜ of the current decoder step. In attention-based models, the current state sₜ is in turn computed starting not only from the previously generated word yₜ₋₁ and the previous hidden state sₜ₋₁, as in common seq2seq architectures, but also from the so-called context vector cₜ, which represents the weighted average we were talking about a few moments ago.

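In formulas, the structure described above can be written as follows (a reconstruction in the notation of this article; in [2] the output function g also receives yₜ₋₁ and cₜ directly, but the idea is the same):

$$
p(y_t \mid y_1, \dots, y_{t-1}, X) = g(s_t),
\qquad
s_t = f(s_{t-1},\, y_{t-1},\, c_t)
$$

where f is the decoder’s recurrent update (an LSTM or GRU cell, for example).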

The key element of the attention mechanism is indeed the context vector cₜ. Note that this context vector is specific to the sequence step t; in this way, the generation of each different word building up the output translation will give more importance to different sections of the input sentence.

Figure: generic schema of the decoding phase using the attention mechanism. The red intensity represents the weight given to each encoder step’s hidden state. For example, h₃ (the hidden state of the Italian word “intelligenza”) has a higher weight when the decoder tries to generate the word “intelligence”, and a lower weight at the other steps.

In most cases the context vector is just a weighted average of the encoder hidden states that we stored at each step of the encoding phase. The weights aₜᵢ are just a softmax rescaling (so that they lie between 0 and 1 and sum to 1) of certain values eₜᵢ called alignment scores.

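Written out with the definitions above (the same form used in [2]):

$$
c_t = \sum_{i=1}^{T} a_{ti}\, h_i,
\qquad
a_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{T} \exp(e_{tk})}
$$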

These definitions are quite common among the many different kinds of attention that have been implemented. On the other hand, the way to compute the scores eₜᵢ can differ from paper to paper. In the original implementation described in [2] the scores are simply computed by applying a small feedforward neural network (referred to as the alignment model) to the concatenation of sₜ₋₁ and hᵢ, but remember that this is not the only way to implement the attention mechanism.

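One way to write the alignment model just described (a single hidden layer with a tanh non-linearity applied to the concatenation of sₜ₋₁ and hᵢ; the names vₐ and Wₐ for its learnable weights are a choice of this write-up, equivalent to the additive form used in [2]):

$$
e_{ti} = v_a^{\top} \tanh\!\big( W_a\, [\, s_{t-1} \,;\, h_i \,] \big)
$$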

“The probability aₜᵢ, or its associated energy eₜᵢ, reflects the importance of the annotation hᵢ with respect to the previous hidden state sₜ₋₁ in deciding the next state sₜ and generating yₜ. […] The decoder decides which parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector” [2]

Figure: a more detailed schema of attention as implemented in [2].

At the end of the translation, by collecting all the aₜᵢ weights it is possible to build up an alignment matrix that, for each generated word, shows which input words were the most influential for that particular generation step (a small code sketch of this computation follows the figure below).

Figure: the weights aₜᵢ of the annotation of the i-th source word for the t-th target word, in grayscale (0: black, 1: white). Notice the inverse alignment of “European Economic Area” with “zone économique européenne” and the influence of the tri-gram “a été signé” on the bi-gram “was signed” in example (a).
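To make the mechanics concrete, here is a small NumPy sketch of one decoding pass. It follows the definitions above (a feedforward alignment model on the concatenation of sₜ₋₁ and hᵢ, a softmax over the scores, a weighted average of the encoder states, and the collection of the weights into an alignment matrix), but the weight names, sizes and random inputs are assumptions of this sketch, not the trained model of [2].

```python
import numpy as np

rng = np.random.default_rng(0)
T, S = 6, 5                     # source and target sentence lengths
d_h, d_s = 8, 8                 # encoder / decoder hidden sizes
d_a = 10                        # alignment model hidden size

H = rng.standard_normal((T, d_h))                 # encoder hidden states h_1 ... h_T (stored during encoding)
W_a = 0.1 * rng.standard_normal((d_a, d_s + d_h))
v_a = 0.1 * rng.standard_normal(d_a)

def softmax(z):
    z = z - z.max()                               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(s_prev, H):
    """Alignment scores e_ti, attention weights a_ti and context vector c_t for one decoder step."""
    concat = np.concatenate([np.tile(s_prev, (len(H), 1)), H], axis=1)  # [s_{t-1} ; h_i] for every i
    e = np.tanh(concat @ W_a.T) @ v_a             # e_ti = v_a^T tanh(W_a [s_{t-1}; h_i])
    a = softmax(e)                                # a_ti: between 0 and 1, summing to 1
    c = a @ H                                     # c_t: weighted average of the encoder states
    return a, c

# Toy decoding loop: collect the weights of every step into the alignment matrix.
s = np.zeros(d_s)
alignment = np.zeros((S, T))
for t in range(S):
    a, c = attend(s, H)
    alignment[t] = a
    s = np.tanh(s + c)                            # placeholder decoder update; a real model also uses y_{t-1}
print(alignment.round(2))                         # rows: target steps, columns: source words
```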

Local Attention

Unfortunately, this improvement of the architecture brings with it a strong downside in terms of computational and memory cost. In fact, the longer the input sequence, the greater the number of hidden states I must store and of additional weights I must train in order to learn to combine them in an appropriate way. Moreover, this combination of numerous arrays must be repeated for each step of the generated sentence, increasing the number of operations required. Since the attention mechanism was developed precisely to improve the handling of long sentences, this issue can’t be ignored. This is why, in [3], a small variation on the standard attention mechanism called local attention has been developed.

The intuition behind local attention is simply that, instead of considering all the encoder hidden states to compute the context vector, at each decoder step we select only those inside a small moving window W that slides over the encoder hidden states.

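In symbols, only the encoder hidden states whose position i falls inside the window around a center pₜ contribute to the context vector (a sketch consistent with [3]; pₜ and D are discussed just below):

$$
c_t = \sum_{i \,:\, |i - p_t| \le D} a_{ti}\, h_i
$$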

The size of the window, D, is usually selected empirically, while its center pₜ can be identified in two different ways, depending on the implementation.

  • Monotonic → pₜ = t. This approach assumes that source and target sentence are already roughly aligned. In many combinations of source and target languages this is not necessarily a wrong assumption, especially if the dimension of the window D is large enough. It simply means that to generate the 5th word of the output translation I will pay attention to the words in the area around the 5th input word.
  • Predictive → pₜ = T · sigmoid(vₚ · tanh(Wₚhₜ)). In this case the value of pₜ is computed by another feedforward neural network with learnable weights Wₚ and vₚ (here hₜ follows the notation of [3], where it denotes the current decoder hidden state, i.e. sₜ in our notation). The sigmoid function and the multiplication by T (quick reminder: T is the length of the source sentence) ensure that the value of pₜ lies between 0 and T, hence representing a position in the input sentence. By training these weights, the model is able to learn on its own, depending on the current decoder hidden state, at which step of the input sentence the small window should be centered, and in this way which subset of the encoder hidden states must be considered to compute the context vector (a small sketch of both variants follows this list).
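The following sketch contrasts the two ways of choosing pₜ and then restricts attention to the window around it. As before, the weight names (Wₚ, vₚ), sizes and inputs are assumptions of this illustration, not the trained model of [3].

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 12, 3                                  # source length, half-width of the local window
d_s = 8                                       # decoder hidden size

H = rng.standard_normal((T, d_s))             # encoder hidden states h_1 ... h_T
W_p = 0.1 * rng.standard_normal((d_s, d_s))   # learnable weights of the position predictor
v_p = 0.1 * rng.standard_normal(d_s)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def monotonic_center(t):
    """Monotonic alignment: the window is centered on the same position as the decoder step."""
    return t

def predictive_center(s_t):
    """Predictive alignment: p_t = T * sigmoid(v_p . tanh(W_p s_t)), so 0 <= p_t <= T."""
    return T * sigmoid(v_p @ np.tanh(W_p @ s_t))

def local_window(p_t):
    """Indices of the encoder states inside [p_t - D, p_t + D], clipped to the sentence."""
    lo, hi = max(0, int(round(p_t)) - D), min(T, int(round(p_t)) + D + 1)
    return np.arange(lo, hi)

s_t, t = rng.standard_normal(d_s), 5          # pretend we are at decoder step t = 5
for name, p_t in [("monotonic", monotonic_center(t)), ("predictive", predictive_center(s_t))]:
    idx = local_window(p_t)
    H_local = H[idx]                          # only these hidden states enter the context vector
    print(name, "center:", round(float(p_t), 2), "window:", idx)
```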

While the monotonic approach is surely simpler and lighter, the predictive one has proven to give more accurate translations. In any case, both methods have surpassed the results obtained by the classical (global) attention mechanism as presented in [2].

Results

To compare the improvements of a new model with respect to the previous state of the art, the common procedure is to select some benchmark datasets on which previous models have already been tested and see if the newly proposed one can do better. One of these benchmark datasets (very popular when [2] and [3] were published) is the WMT’14 dataset, a corpus made of pairs of sentences where each sentence is associated with a valid translation. In the corpus there are five language pairs that can be selected:

  • French-English
  • Hindi-English
  • German-English
  • Czech-English
  • Russian-English

The papers at the base of this article selected the French-English [2] and German-English [3] corpora to evaluate model performance.

In addition to a benchmark dataset, a common metric is also required to compare the accuracy of each translation. One of the most popular is the BLEU metric, which stands for bilingual evaluation understudy. Long story short, this score is a sort of precision computed between the output generated by the model and the reference translation, and it:

  1. Takes into account how many times a certain generated token is present in the target sentence.
  2. Instead of evaluating one word at a time, is computed over n-grams (groups of n words).
  3. Penalizes shorter translations (a simplified sketch of these three ingredients in code follows this list).
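Here is a deliberately simplified, single-reference sketch of those three ingredients (clipped n-gram counts, n-gram orders up to 4, and a brevity penalty). It is illustrative only and does not reproduce the exact corpus-level BLEU implementation used to score the models in [2] and [3].

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Very simplified single-reference BLEU: clipped n-gram precisions + brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)                         # avoid log(0)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()
print(round(simple_bleu(candidate, reference), 3))
```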

Another article would be necessary to discuss the BLEU score and the benchmarking of language models, but if you want to know more about BLEU, you can refer to its original paper.

This boring introduction was only meant to let you understand the following few statements about model results. For example, the first attention-based model [2] was able to outperform the non-attentional models on the French-English task by an amount of BLEU points between 7 and 11, depending on the maximum length of the test sentences. Obviously, the greatest improvements were observed on the longest sentences. As for [3], the model that introduced the local attention approach was able to surpass the results of the basic attention model on the German-English dataset by an additional 1-3 points.

To better understand what kind of improvement the attention mechanism has brought, let’s also look at some qualitative results of those models, taken from [2].

As an example, consider this source sentence from the test set:

“An admitting privilege is the right of a doctor to admit a patient to a hospital or a medical centre to carry out a diagnosis or a procedure, based on his status as a healthcare worker at a hospital.”

The RNNencdec-50 [non-attentional seq2seq] translated this sentence into:

“Un privilège d’admission est le droit d’un médecin de reconnaître un patient à l’hôpital ou un centre médical d’un diagnostic ou de prendre un diagnostic en fonction de son état de santé.”

The RNNencdec-50 [non-attentional seq2seq] correctly translated the source sentence until “a medical center”. However, from there on (underlined in the original paper), it deviated from the original meaning of the source sentence. For instance, it replaced “based on his status as a health care worker at a hospital” in the source sentence with “en fonction de son état de santé” (“based on his state of health”). On the other hand, the RNNsearch-50 [attention-based model] generated the following correct translation, preserving the whole meaning of the input sentence without omitting any details:

“Un privilège d’admission est le droit d’un médecin d’admettre un patient à un hôpital ou un centre médical pour effectuer un diagnostic ou une procédure, selon son statut de travailleur des soins de santé à l’hôpital.”

This example strongly highlights the true essence of the improvement brought by attention. Using the old seq2seq model, the encoded array was able to store only enough information to correctly generate the first half of the sentence. The second part of the generated sentence is simply a reshuffle of previous elements, with no new components added. Words like “status” or “healthcare worker” are completely ignored, since there is no more room in the encoded hidden state to bring them from the encoder to the decoder. Using the attention mechanism, on the other hand, this problem is solved. To generate the first half of the sentence I can focus only on the first words, forgetting what appears at the end; then, as we approach the second half of the sentence, our attention moves to another set of input words, and words like “status” or “healthcare worker” are allowed to have their impact on the generation of the newly translated words.

Conclusions

The attention mechanism is a keystone for fully understanding the current state of the art of neural language models. One of the most beautiful aspects (in my opinion) of its development is that, as with most deep learning evolutions (or even deep learning itself), the intuition behind it reflects human thinking behaviour. When we approach a translation, we do not simply read the input sentence once and then write the output; this could work only for very short sentences, and we can’t require a machine to work like this. What a human does is read the sentence once and then read small pieces of the sentence at a time, focusing only on that piece’s translation but keeping in mind where the whole sentence wants to go. This is exactly how sequence-to-sequence attention-based models work. The ability to memorize longer and longer sentences opens up the way to models suitable for training on enormous corpora of text, and this, as you can imagine, is how a neural language model learns the way we humans are used to communicating.

I hope this article was of some help in understanding the attention mechanism, and I can’t wait to see you in the next chapter of the Language Modeling Path. We will talk about the Transformer, a model first published in the paper “Attention Is All You Need” (coincidence? I think not) and that is at the base of all the most popular deep learning language models, like BERT, GPT-2, T-NLG and GPT-3.

Bye!

Original article: https://medium.com/swlh/attention-please-1e16e7011a08
