《Attention Is All You Need》:首次提出Transformer模型的论文,中英文对照学习

论文摘要

英文

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
†Work performed while at Google Brain.
‡Work performed while at Google Research.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
arXiv:1706.03762v7 [cs.CL] 2 Aug 2023

中文

主导的序列转换模型基于复杂的循环或卷积神经网络,包括编码器和解码器。性能最好的模型还通过注意机制连接编码器和解码器。我们提出了一种新的简单网络架构,Transformer,仅基于注意机制,完全摒弃了循环和卷积。对两个机器翻译任务的实验表明,这些模型在质量上更为优秀,同时更易于并行化,并且训练时间显著缩短。我们的模型在WMT 2014英德翻译任务上实现了28.4的BLEU分数,在现有最佳结果基础上提高了超过2 BLEU,包括模型集成在内。在WMT 2014英法翻译任务中,我们的模型在8个GPU上训练3.5天后,取得了41.8的BLEU分数,创下了新的单模型最优结果,训练成本仅为文献中最佳模型的一小部分。我们展示了Transformer在其他任务上的泛化能力,成功将其应用于使用大量和有限训练数据的英语成分解析。

*平等贡献。列出顺序是随机的。Jakob提出用自注意力替换RNN,并开始评估这个想法。Ashish与Illia设计并实现了第一个Transformer模型,并在工作的各个方面起着关键作用。Noam提出了缩放的点积注意力,多头注意力和无参数位置表示,并成为几乎每个细节中涉及的另一个人。Niki在我们的原始代码库和tensor2tensor中设计、实现、调整和评估了无数的模型变体。Llion还尝试了新颖的模型变体,负责我们的初始代码库以及高效的推理和可视化。Lukasz和Aidan花费了无数的日夜设计和实现tensor2tensor的各个部分,取代了我们早期的代码库,大大改善了结果并加速了我们的研究。

†在Google Brain工作时完成的工作。
‡在Google Research工作时完成的工作。
第31届神经信息处理系统会议(NIPS 2017),美国加州长滩。
arXiv:1706.03762v7 [cs.CL] 2023年8月2日

中英对照-正文

1 Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

1 简介

循环神经网络,尤其是长短期记忆(LSTM)[13]和门控循环神经网络[7],已经被确定为序列建模和转换问题的最先进方法,如语言建模和机器翻译[35, 2, 5]。此后,已经进行了大量努力,以推动循环语言模型和编码器-解码器架构的边界[38, 24, 15]。
循环模型通常沿着输入和输出序列的符号位置分解计算。将这些位置与计算时间步对齐,它们生成一系列隐藏状态h_t,作为前一个隐藏状态h_{t-1}和位置t处输入的函数。这种固有的顺序性质排除了训练样本内部的并行化,这在序列长度较长时变得至关重要,因为内存约束限制了跨样本的批处理。最近的工作通过因子化技巧[21]和条件计算[32]在计算效率方面取得了显著改进,对于后者还同时提高了模型性能。然而,顺序计算这一基本约束仍然存在。
注意力机制已经成为各种任务中引人注目的序列建模和转换模型的重要组成部分,允许对依赖关系进行建模,而不考虑它们在输入或输出序列中的距离[2, 19]。然而,除了少数情况[27]之外,这种注意力机制都是与循环网络结合使用的。
在这项工作中,我们提出了Transformer,这是一种完全摒弃循环、仅依靠注意力机制来建立输入和输出之间全局依赖关系的模型架构。Transformer支持显著更高的并行化程度,在8个P100 GPU上训练仅12小时后,就能在翻译质量上达到新的最先进水平。
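To make the sequential bottleneck concrete, the following is a minimal sketch (not from the paper) of how a plain recurrent model produces its hidden states: each h_t can only be computed after h_{t-1}, so the loop over time cannot be parallelized within a single training example. The tanh cell and the weight shapes below are illustrative placeholders only.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Sequentially compute h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h).

    x: (seq_len, d_in), one training example; weights are illustrative placeholders.
    """
    seq_len, _ = x.shape
    d_h = W_hh.shape[0]
    h = np.zeros(d_h)
    states = []
    for t in range(seq_len):          # inherently sequential: step t needs h_{t-1}
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)           # (seq_len, d_h)

# Toy example: a sequence of length 5 with d_in=8 and d_h=16.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
H = rnn_forward(x, rng.normal(size=(8, 16)), rng.normal(size=(16, 16)), np.zeros(16))
print(H.shape)  # (5, 16)
```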

2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].

2 背景

减少顺序计算的目标也构成了Extended Neural GPU[16]、ByteNet[18]和ConvS2S[9]的基础,它们都使用卷积神经网络作为基本构建块,在所有输入和输出位置上并行计算隐藏表示。在这些模型中,将来自两个任意输入或输出位置的信号关联起来所需的操作数量随位置之间的距离增加而增长,对于ConvS2S是线性增长,对于ByteNet是对数增长。这使得学习远距离位置之间的依赖关系变得更加困难[12]。在Transformer中,这被减少为常数数量的操作,尽管代价是由于对注意力加权的位置取平均而降低了有效分辨率;我们用第3.2节中描述的多头注意力来抵消这一影响。
自注意力(有时称为内部注意力)是一种将单个序列中不同位置相互关联、以计算该序列表示的注意力机制。自注意力已经成功地应用于各种任务,包括阅读理解、生成式摘要、文本蕴含以及学习与任务无关的句子表示[4, 27, 28, 22]。
端到端记忆网络基于循环注意力机制而不是序列对齐的循环,并且已被证明在简单语言问答和语言建模任务中表现良好[34]。
然而,据我们所知,Transformer是第一个完全依赖自注意力来计算其输入和输出表示、而不使用序列对齐的RNN或卷积的转换模型。在接下来的章节中,我们将描述Transformer,阐述使用自注意力的动机,并讨论它相对于[17, 18]和[9]等模型的优势。
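As a concrete illustration of an attention mechanism "relating different positions of a single sequence", below is a minimal NumPy sketch of single-head scaled dot-product self-attention, the building block the paper extends to Multi-Head Attention in section 3.2. The projection matrices and sizes are arbitrary placeholders rather than the paper's configuration; the point to notice is that every position attends to every other position in one step, so any pair of positions is connected by a constant number of operations.

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model). Every position attends to every position,
    so the path between any two positions is a single weighted sum.
    """
    Q, K, V = x @ W_q, x @ W_k, x @ W_v             # (seq_len, d_k / d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) pairwise compatibilities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key positions
    return weights @ V                              # attention-weighted average of the values

# Toy example: 4 positions, d_model = d_k = d_v = 8 (placeholder sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = self_attention(x, *(rng.normal(size=(8, 8)) for _ in range(3)))
print(out.shape)  # (4, 8)
```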

3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.

3 模型架构

大多数有竞争力的神经序列转换模型都具有编码器-解码器结构[5, 2, 35]。在这里,编码器将符号表示的输入序列(x_1, …, x_n)映射到连续表示的序列z = (z_1, …, z_n)。给定z,解码器随后逐个元素地生成符号输出序列(y_1, …, y_m)。在每一步,模型都是自回归的[10],在生成下一个符号时,将先前生成的符号作为附加输入。
[图1:Transformer模型架构(Figure 1: The Transformer model architecture)]
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
Transformer遵循这一整体架构,编码器和解码器都使用堆叠的自注意力层和逐位置(point-wise)的全连接层,分别如图1的左半部分和右半部分所示。
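To illustrate the auto-regressive generation described above, here is a minimal greedy decoding loop. The `model.encode` / `model.decode` interface and the greedy argmax strategy are hypothetical stand-ins used for illustration, not an API defined by the paper.

```python
def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=128):
    """Generate y_1..y_m one element at a time, feeding previous outputs back in.

    `model.encode` / `model.decode` are hypothetical: encode() returns the
    continuous representations z, decode() returns scores over the vocabulary
    for the next position given z and the symbols generated so far.
    """
    z = model.encode(src_tokens)          # z = (z_1, ..., z_n)
    output = [bos_id]
    for _ in range(max_len):
        scores = model.decode(z, output)  # conditioned on all previously generated symbols
        next_token = max(range(len(scores)), key=scores.__getitem__)  # greedy argmax
        output.append(next_token)
        if next_token == eos_id:          # stop once the end-of-sequence symbol is produced
            break
    return output[1:]
```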

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

3.1 编码器和解码器堆栈

编码器:编码器由N = 6个相同的层堆叠而成。每一层有两个子层:第一个是多头自注意力机制,第二个是简单的逐位置全连接前馈网络。我们在每个子层周围采用残差连接[11],随后进行层归一化[1]。也就是说,每个子层的输出是LayerNorm(x + Sublayer(x)),其中Sublayer(x)是该子层自身实现的函数。为了便于使用这些残差连接,模型中的所有子层以及嵌入层的输出维度均为d_model = 512。
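As a sketch of what one encoder layer computes, the PyTorch module below applies the two sub-layers with the residual-plus-LayerNorm pattern LayerNorm(x + Sublayer(x)). It reuses torch.nn.MultiheadAttention instead of the paper's own implementation; the 8 heads and inner feed-forward size of 2048 follow the base configuration described later in the paper (not shown in this excerpt), and dropout, padding masks and positional encodings are omitted for brevity.

```python
import torch
from torch import nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention plus a position-wise FFN,
    each wrapped as LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(                     # position-wise feed-forward network
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)         # multi-head self-attention sub-layer
        x = self.norm1(x + attn_out)                  # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))               # same pattern around the FFN sub-layer
        return x

# The encoder stacks N = 6 identical layers; all outputs have dimension d_model = 512.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
x = torch.randn(2, 10, 512)                           # (batch, seq_len, d_model)
print(encoder(x).shape)                               # torch.Size([2, 10, 512])
```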
