Attention Is All You Need (20.11.21)

Abstract

  • The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.
  • We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
  • Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.
  • On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
  • We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

1. Introduction

  • Since the introduction of the RNN variants LSTM and GRU, numerous efforts have continued to push the boundaries of recurrent language models and encoder-decoder architectures.
  • Recurrent models typically factor computation along the symbol positions of the input and output sequences.
  • Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
  • Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
  • Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
  • In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization …

2. Background

  • Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.
  • End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.
  • The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

3. Model Architecture

  • Most competitive neural sequence transduction models have an encoder-decoder structure.
  • Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next (a small decoding sketch follows Figure 1).
  • The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

[Figure 1: The Transformer model architecture, with the encoder on the left and the decoder on the right.]
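
The auto-regressive behaviour described above can be made concrete with a small sketch. This is not the paper's decoding procedure (Section 6.1 uses beam search); it is a minimal greedy loop, and `decoder_step`, the token ids and the toy vocabulary below are hypothetical stand-ins.

```python
# Minimal greedy sketch of auto-regressive decoding: given the encoder output z,
# generate one symbol at a time, feeding all previously generated symbols back in.
# `decoder_step` is a hypothetical stand-in for the full Transformer decoder.
import numpy as np

def greedy_decode(z, decoder_step, bos_id=1, eos_id=2, max_len=20):
    ys = [bos_id]                                  # start from a beginning-of-sequence token
    for _ in range(max_len):
        logits = decoder_step(z, np.array(ys))     # scores over the vocabulary for the next symbol
        next_id = int(np.argmax(logits))           # greedy choice (the paper uses beam search)
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys[1:]                                  # drop the start token

# Toy stand-in decoder that just predicts token (current length mod vocab size).
toy_vocab = 10
toy_step = lambda z, ys: np.eye(toy_vocab)[len(ys) % toy_vocab]
print(greedy_decode(np.zeros((5, 512)), toy_step))  # e.g. [1, 2]
```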

  • 3.1 Encoder and Decoder Stacks
    1) Encoder:
    i. The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers.
    ii. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1].
    iii. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512. (A minimal sketch of this sub-layer pattern follows this list.)
    2) Decoder:
    i. The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
    ii. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
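
Below is a minimal NumPy sketch of the sub-layer pattern described above, output = LayerNorm(x + Sublayer(x)). It omits the learnable gain and bias of layer normalization and uses a toy stand-in sub-layer; names such as `residual_sublayer` are illustrative, not from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's d_model-dimensional vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # x: (seq_len, d_model); sublayer: any map from that shape to itself
    # (multi-head self-attention or the position-wise feed-forward network).
    return layer_norm(x + sublayer(x))

d_model = 512
x = np.random.randn(10, d_model)                 # 10 positions
out = residual_sublayer(x, lambda h: 0.1 * h)    # toy stand-in for a real sub-layer
print(out.shape)                                 # (10, 512)
```
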
  • 3.2 Attention
    An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

[Figure 2: (left) Scaled Dot-Product Attention; (right) Multi-Head Attention, consisting of several attention layers running in parallel.]

  • 3.2.1 Scaled Dot-Product Attention

  • We call our particular attention “Scaled Dot-Product Attention” (Figure 2). The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.

  • In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:
    Attention(Q, K, V) = softmax(QK^T / √d_k) V

  • The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√d_k. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

  • While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot-product attention without scaling for larger values of d_k [3]. We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/√d_k. (A minimal code sketch follows.)
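
A minimal NumPy sketch of scaled dot-product attention as described in this subsection: softmax(QK^T / √d_k)V, with an optional boolean mask whose False entries are set to a large negative value before the softmax, as the decoder does. Shapes and names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (m, d_k), K: (n, d_k), V: (n, d_v) -> output: (m, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # compatibility of each query with each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)     # ~ -inf at disallowed positions
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ V                            # weighted sum of the values

Q = np.random.randn(4, 64); K = np.random.randn(6, 64); V = np.random.randn(6, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
```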

  • 3.2.2 Multi-Head Attention

  • Instead of performing a single attention function with d_model-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

  • Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
    MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

  • In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality. (A small multi-head attention sketch follows.)
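
A NumPy sketch of the multi-head computation just described, reusing `scaled_dot_product_attention` from the previous sketch. The projection matrices are random stand-ins for learned parameters; h = 8 and d_k = d_v = d_model/h = 64 follow the text.

```python
import numpy as np

def multi_head_attention(Q, K, V, h=8, d_model=512, rng=np.random.default_rng(0)):
    d_k = d_v = d_model // h                      # 512 / 8 = 64
    heads = []
    for _ in range(h):
        # Per-head projections to d_k / d_v dimensions (stand-ins for learned weights).
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_v)) / np.sqrt(d_model)
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.standard_normal((h * d_v, d_model)) / np.sqrt(h * d_v)
    return np.concatenate(heads, axis=-1) @ W_o   # concatenate heads, project back to d_model

x = np.random.randn(10, 512)                      # self-attention: Q = K = V = x
print(multi_head_attention(x, x, x).shape)        # (10, 512)
```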

  • 3.2.3 Applications of Attention in our Model
    The Transformer uses multi-head attention in three different ways:
    1) In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
    2) The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
    3) Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2 and the mask sketch below.
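
A small sketch of the mask used in point 3: a lower-triangular boolean matrix in which position i may attend only to positions ≤ i. It plugs into the `mask` argument of the scaled dot-product attention sketch above (True = allowed, False = masked out).

```python
import numpy as np

def causal_mask(n):
    # (n, n) boolean matrix; entry [i, j] is True iff j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```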

  • 3.3 Position-wise Feed-Forward Networks
    In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between:
    FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
    While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is d_model = 512, and the inner-layer has dimensionality d_ff = 2048. (A sketch follows.)
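
A NumPy sketch of the position-wise feed-forward network, FFN(x) = max(0, xW_1 + b_1)W_2 + b_2, applied identically at every position; the weights here are random stand-ins for learned parameters.

```python
import numpy as np

def position_wise_ffn(x, d_model=512, d_ff=2048, rng=np.random.default_rng(0)):
    W1 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
    b1 = np.zeros(d_ff)
    W2 = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)
    b2 = np.zeros(d_model)
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU between the two linear maps

x = np.random.randn(10, 512)                        # every position transformed identically
print(position_wise_ffn(x).shape)                   # (10, 512)
```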

  • 3.4 Embeddings and Softmax
    Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by √d_model. (A short sketch of this weight sharing follows.)
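
A short sketch of the weight sharing described in 3.4: one matrix serves as the embedding table and, transposed, as the pre-softmax projection, and embeddings are multiplied by √d_model. The vocabulary size and token ids are made up for illustration.

```python
import numpy as np

vocab_size, d_model = 1000, 512
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model)) / np.sqrt(d_model)  # shared weight matrix

def embed(token_ids):
    return E[token_ids] * np.sqrt(d_model)     # embedding lookup, scaled by sqrt(d_model)

def next_token_logits(decoder_output):
    return decoder_output @ E.T                # pre-softmax linear transformation shares E

x = embed(np.array([3, 17, 42]))               # (3, 512)
print(next_token_logits(x).shape)              # (3, 1000)
```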

  • 3.5 Positional Encoding

  1. Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
    In this work, we use sine and cosine functions of different frequencies:
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_(pos+k) can be represented as a linear function of PE_pos. (A sketch follows.)
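
A NumPy sketch of the sinusoidal positional encodings defined above; the resulting (max_len, d_model) matrix is added to the input embeddings.

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # dimension-pair index, (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions: cosine
    return pe

print(positional_encoding(50).shape)   # (50, 512)
```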

4. Why Self-Attention

5. Training

  • 5.1 Training Data and Batching
  • 5.2 Hardware and Schedule
    8 NVIDIA P100 GPUs.
  • 5.3 Optimizer
  • 5.4 Regularization

6. Results

  • 6.1 Machine Translation
    1) On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
    2) On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate P_drop = 0.1, instead of 0.3.
    3) For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty α = 0.6 [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [38]. (A checkpoint-averaging sketch follows this list.)
    4) Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU.
    [Table 2: BLEU scores and training costs of the Transformer compared to previous model architectures on the WMT 2014 English-to-German and English-to-French tests.]
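
A sketch of the checkpoint averaging mentioned in point 3: the final model is the element-wise mean of the parameters from the last k checkpoints (5 for the base models, 20 for the big ones). Checkpoints are represented here as plain dicts of NumPy arrays, a hypothetical stand-in for a real training framework's format.

```python
import numpy as np

def average_checkpoints(checkpoints):
    # checkpoints: list of {parameter_name: array} dicts with identical keys and shapes.
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

# Toy example standing in for "the last 5 checkpoints, written at 10-minute intervals".
ckpts = [{"W": np.full((2, 2), float(step))} for step in range(5)]
print(average_checkpoints(ckpts)["W"])   # every entry is 2.0 (mean of 0..4)
```
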
  • 6.2 Model Variations
    To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.
    In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.

[Table 3: Variations on the Transformer architecture, evaluated on English-to-German translation (newstest2013 development set).]
In Table 3 rows (B), we observe that reducing the attention key size d_k hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.

  • 6.3 English Constituency Parsing
  • To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].
  • We trained a 4-layer transformer with d_model = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora with approximately 17M sentences [37]. We used a vocabulary of 16K tokens for the WSJ-only setting and a vocabulary of 32K tokens for the semi-supervised setting.
  • We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set; all other parameters remained unchanged from the English-to-German base translation model. During inference, we increased the maximum output length to input length + 300. We used a beam size of 21 and α = 0.3 for both WSJ only and the semi-supervised setting.
  • Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8]. In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.

7. Conclusion

  • In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
  • For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.
  • We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.