Attention Is All You Need: Reading Notes and Translation

The Transformer is a sequence transduction model based entirely on attention mechanisms, discarding the traditional recurrent or convolutional layers. With multi-head self-attention, the Transformer improves both training speed and translation quality: it surpasses all previously reported models on the WMT 2014 English-to-German task with a BLEU score of 28.4, reaches a new state-of-the-art BLEU score of 41.8 on the English-to-French task, and does so at a significantly lower training cost.

Abstract

 

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality, more parallelizable, and significantly faster to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training cost of the best models in the literature. By applying the Transformer successfully to English constituency parsing with both large and limited training data, we show that it generalizes well to other tasks.

 

1 Introduction


Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state-of-the-art approaches for sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].


Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning positions to steps in computation time, they generate the hidden state h_t for position t as a function of the previous hidden state h_{t-1} and the input at that position. This inherently sequential nature precludes parallelization within a training example, which becomes critical at longer sequence lengths, since limited memory constrains the batch size across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], with the latter also improving model performance. The fundamental constraint of sequential computation, however, remains.
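As a rough illustration of this sequential bottleneck (a minimal sketch, not code from the paper; the tanh cell and the weight names W_xh, W_hh, b_h are assumptions for the example), the recurrence h_t = f(h_{t-1}, x_t) can be written as:

```python
import numpy as np

def rnn_hidden_states(x, W_xh, W_hh, b_h):
    """Compute hidden states sequentially: h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h).

    x: array of shape (seq_len, input_dim); returns (seq_len, hidden_dim).
    Each step depends on the previous one, so the loop cannot be parallelized
    across time steps; this is the constraint the paragraph above describes.
    """
    seq_len, _ = x.shape
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)
    states = []
    for t in range(seq_len):          # inherently sequential over positions
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

# Toy usage: 10 positions of 8-dim inputs, 16-dim hidden state.
rng = np.random.default_rng(0)
H = rnn_hidden_states(rng.standard_normal((10, 8)),
                      rng.standard_normal((8, 16)),
                      rng.standard_normal((16, 16)),
                      np.zeros(16))
print(H.shape)  # (10, 16)
```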


Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing dependencies to be modeled without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.


2 Background


The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as their basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between the positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention, as described in Section 3.2.
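To make the scaling concrete, here is a back-of-the-envelope sketch (not an experiment from the paper; the kernel width k = 5 is an arbitrary assumption) counting roughly how many layers are needed to connect two positions a distance d apart:

```python
import math

def conv_layers_to_connect(d, k):
    """Stacked convolutions with kernel width k: the receptive field grows by
    k - 1 per layer, so connecting positions d apart needs about d / (k - 1)
    layers (linear in d, as for ConvS2S)."""
    return math.ceil(d / (k - 1))

def dilated_conv_layers_to_connect(d, k):
    """Dilated convolutions (ByteNet-style): the receptive field grows by a
    factor of k per layer, so roughly log_k(d) layers suffice (logarithmic in d)."""
    return max(1, math.ceil(math.log(d, k)))

def self_attention_layers_to_connect(d):
    """Self-attention connects every pair of positions within a single layer."""
    return 1

for d in (8, 64, 512):
    print(d, conv_layers_to_connect(d, 5),
          dilated_conv_layers_to_connect(d, 5),
          self_attention_layers_to_connect(d))
```

The exact constants depend on the architecture, but the growth rates (linear, logarithmic, constant) are the point of the comparison.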


Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].

Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations [4, 27, 28, 22].

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence, and have been shown to perform well on simple-language question answering and language modeling tasks [34].

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we describe the Transformer, motivate self-attention, and discuss its advantages over models such as [17, 18] and [9].
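Because the rest of the paper builds on this idea, a minimal NumPy sketch of single-head self-attention may help. It uses the scaled dot-product form the paper defines later in Section 3.2, with random matrices standing in for the learned projections W_q, W_k, W_v:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention over one sequence.

    x: (seq_len, d_model). Every position attends to every other position,
    so any two positions are related by a single weighted sum.
    """
    q, k, v = x @ W_q, x @ W_k, x @ W_v          # project to queries, keys, values
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)              # (seq_len, seq_len) compatibilities
    weights = softmax(scores, axis=-1)           # attention distribution per position
    return weights @ v                           # attention-weighted average of values

# Tiny usage example with arbitrary dimensions.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 16))                 # 6 positions, d_model = 16
W_q, W_k, W_v = (rng.standard_normal((16, 16)) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)    # (6, 16)
```

Each output position is a weighted average over all positions, which is why any pair of positions is connected by a constant number of operations.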

 

3 Model Architecture


Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. The encoder maps an input sequence (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder generates the output sequence (y_1, ..., y_m) one element at a time; at each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next one.

The encoder and decoder of the Transformer both use stacked self-attention and point-wise, fully connected layers, shown in the left and right halves of Figure 1, respectively.
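The auto-regressive generation loop can be sketched schematically as below; `encode` and `decode_step` are hypothetical placeholders for the trained encoder and decoder, not the paper's API:

```python
def greedy_autoregressive_decode(encode, decode_step, src_tokens,
                                 bos_id, eos_id, max_len=50):
    """Generate an output sequence one symbol at a time.

    encode(src_tokens)         -> memory z (continuous representations)
    decode_step(z, out_so_far) -> id of the next output symbol
    Both callables are placeholders standing in for the encoder and decoder.
    """
    z = encode(src_tokens)               # encode the source once
    out = [bos_id]                       # start-of-sequence symbol
    for _ in range(max_len):
        next_id = decode_step(z, out)    # condition on everything generated so far
        out.append(next_id)              # previously generated symbols become input
        if next_id == eos_id:
            break
    return out[1:]

# Toy usage: an "identity" model that copies the source, just to exercise the loop.
copied = greedy_autoregressive_decode(
    encode=lambda s: list(s),
    decode_step=lambda z, out: z[len(out) - 1] if len(out) - 1 < len(z) else 2,
    src_tokens=[5, 7, 9], bos_id=1, eos_id=2)
print(copied)  # [5, 7, 9, 2]
```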

 


3.1 Encoder and Decoder Stacks


Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
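A compact sketch of one such encoder layer follows (a minimal single-head stand-in, assuming the residual connection and layer normalization around each sub-layer that the paper describes next; the weight names are made up for the example):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(x, params):
    """One encoder layer: a self-attention sub-layer, then a position-wise
    feed-forward sub-layer, each followed by a residual connection and
    layer normalization (single attention head only, for brevity)."""
    W_q, W_k, W_v, W_o, W_1, b_1, W_2, b_2 = params

    # Sub-layer 1: (single-head) self-attention, residual connection, layer norm.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1) @ v
    x = layer_norm(x + attn @ W_o)

    # Sub-layer 2: position-wise feed-forward network, residual, layer norm.
    ffn = np.maximum(0.0, x @ W_1 + b_1) @ W_2 + b_2      # ReLU(x W1 + b1) W2 + b2
    return layer_norm(x + ffn)

# Toy usage with arbitrary sizes (d_model = 16, d_ff = 32, seq_len = 6).
rng = np.random.default_rng(0)
d_model, d_ff = 16, 32
shapes = [(d_model, d_model)] * 4 + [(d_model, d_ff), (d_ff,), (d_ff, d_model), (d_model,)]
params = [rng.standard_normal(s) * 0.1 for s in shapes]   # W_q, W_k, W_v, W_o, W_1, b_1, W_2, b_2
x = rng.standard_normal((6, d_model))
print(encoder_layer(x, params).shape)                     # (6, 16)
```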
