《Attention Is All You Need》是 Ashish Vaswani 等人于 2017 年发表在 NIPS(现 NeurIPS)会议上的一篇论文,提出了 Transformer 架构,彻底改变了自然语言处理(NLP)和人工智能领域的研究范式。
原文地址:https://arxiv.org/pdf/1706.03762
原文 1
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
翻译
摘要
主流的序列转换模型基于复杂的循环神经网络或卷积神经网络,这些网络通常包含编码器和解码器结构。性能最佳的模型还会通过注意力机制连接编码器和解码器。我们提出了一种全新的简单网络架构——Transformer,该架构完全基于注意力机制,彻底摒弃了循环和卷积操作。在两个机器翻译任务上的实验表明,该模型不仅质量更优,而且具有更高的并行性,训练时间显著缩短。我们的模型在WMT 2014英德翻译任务中取得了28.4的BLEU评分,比现有最佳结果(包括集成模型)提高了超过2个BLEU值。在WMT 2014英法翻译任务中,该模型在8块GPU上训练3.5天后,创下了41.8的BLEU评分新纪录——这是单模型的最先进水平,而训练成本仅为文献中最佳模型的一小部分。通过成功应用于英语成分句法分析任务(无论训练数据量大小),我们证明了Transformer架构能够很好地推广至其他任务。
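摘要中所说的"完全基于注意力机制",核心就是论文中定义的缩放点积注意力:Attention(Q, K, V) = softmax(QK^T / √d_k)·V。下面给出一个最简的 NumPy 示意草图,帮助理解这一公式;其中的函数名和随机生成的 Q、K、V 数据都是为演示而假设的,并非论文的原始实现:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """缩放点积注意力:softmax(Q·K^T / sqrt(d_k))·V,即论文的核心运算。"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # 先打分,再用 sqrt(d_k) 缩放
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # 按行做 softmax,得到注意力权重
    return weights @ V                                 # 用权重对 V 加权求和

# 演示用的随机数据:4 个位置、每个向量 8 维(数值为假设示例)
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)     # 输出 (4, 8)
```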
重点句子解析
- The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.
【解析】
这是一个含有定语从句的主从复合句。主句是:The dominant sequence transduction models are based on complex recurrent or convolutional neural networks。我们可以把主语 The dominant sequence transduction models 看作 A, 把介词 on 后边的宾语 complex recurrent or convolutional neural networks 看作 B,从而把主句简化为:A is based on B,即:A 基于 B。that include an encoder and a decoder是一个定语从句,其中引导词that在定语从句中充当主语,并且指代前边的networks。
【参考翻译】
主流的序列转换模型基于复杂的循环神经网络或卷积神经网络,这些网络通常包含编码器和解码器结构。
- The best performing models also connect the encoder and decoder through an attention mechanism.
【解析】
这是一个主谓宾结构的简单句。谓语是动词 connect,结尾的介词短语 through an attention mechanism 做状语,修饰 connect,表示:通过…。我们可以把 The best performing models 看作 A,把 the encoder and decoder 看作 B,把 an attention mechanism 看作 C,从而把句子简化为:A (also) connect B through C,即:A 还通过 C 连接 B。
【参考翻译】
性能最佳的模型还会通过注意力机制连接编码器和解码器。
- We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
【解析】
这是一个简单句,只不过在简单句后边出现了插入语、过去分词短语和现在分词短语等语法现象。其中,We propose a new simple network architecture是句子的主干,其结构为主谓宾,动词propose是谓语,前后分别是主语和宾语。原句中,两个逗号之间的the Transformer是插入语,同时也是a new simple network architecture的同位语,说明其名称。插入语在分析句子成分的时候可以忽略不计。过去分词短语based solely on attention mechanisms做后置定语,修饰宾语a new simple network architecture,相当于定语从句which is based solely on attention mechanisms,其中which指代a new simple network architecture。现在分词短语dispensing with recurrence and convolutions entirely同样做后置定语,修饰a new simple network architecture,相当于定语从句which dispenses with recurrence and convolutions entirely,其中which同样指代a new simple network architecture。
【参考翻译】
我们提出了一种全新的简单网络架构——Transformer,该架构完全基于注意力机制,彻底摒弃了循环和卷积操作。
- Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
【解析】
这个句子的主干是:Experiments…show these models…,其中动词show是谓语,前后分别是主语和宾语。原句中的介词短语on two machine translation tasks做后置定语,修饰Experiments;
show these models to be…是"动词+宾语+宾语补足语"的复合宾语结构:不定式短语to be superior in quality做宾语补足语,说明these models的特性;while表示"与此同时",引导一个省略了主语的状语部分;being more parallelizable和requiring significantly less time to train是and连接的两个并列现在分词短语,接在while之后,进一步描述these models;其中的to train是不定式做后置定语,修饰time。
【参考翻译】
在两个机器翻译任务上的实验表明,该模型不仅质量更优,而且具有更高的并行性,训练时间显著缩短。
- Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.
【解析】
这个句子的主干是:Our model achieves 28.4 BLEU。后边的介词短语on the WMT 2014 English-to-German translation task做状语,表示在什么情况下得分;现在分词短语improving…做伴随状语,修饰achieves,表示与谓语动词achieves同时发生的情况。介词over表示“胜过,优于”;两个逗号之间的 including ensembles是插入语,对前边的the existing best results进行补充说明;by…做状语,引出具体的差额或幅度;over表示:超过,多于。
【参考翻译】
我们的模型在WMT 2014英德翻译任务中取得了28.4的BLEU评分,比现有最佳结果(包括集成模型)提高了超过2个BLEU值。
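顺带说明摘要中反复出现的 BLEU:它通过统计机器译文与参考译文之间 n-gram 的重合程度来衡量翻译质量,常见取值范围为 0 到 100,越高越好。下面是一个最简的计算示例草图,假设已安装第三方库 sacrebleu,译文数据为虚构示例,仅用于说明用法:

```python
import sacrebleu  # 第三方 BLEU 计算库,需先安装:pip install sacrebleu

# 虚构的机器译文与参考译文,仅作演示
hypotheses = ["the cat sat on the mat"]                # 系统输出,每个元素是一句译文
references = [["the cat is sitting on the mat"]]       # 外层列表对应一组(或多组)参考译文

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # 语料级 BLEU 分值(0-100);论文报告的 28.4、41.8 属于同类指标
```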
- On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
【解析】
句子的主干是our model establishes a BLEU score of 41.8。句首的介词短语On the WMT 2014 English-to-French translation task交代事件发生的背景;new、single-model、state-of-the-art是三个叠加的前置定语,共同修饰BLEU score of 41.8;after training for 3.5 days on eight GPUs是时间状语;逗号后边的a small fraction of…对前边的training for 3.5 days on eight GPUs进行补充说明,表示这样的训练开销只是文献中最佳模型训练成本的一小部分。其中a small fraction of 可以看作A,the training costs可以看作B,the best models from the literature可以看作C,因此逗号后边的内容在结构上可以简化为:A of B of C, 即:C的B的A。
【参考翻译】
在WMT 2014英法翻译任务中,该模型在8块GPU上训练3.5天后,创下了41.8的BLEU评分新纪录——这是单模型的最先进水平,而训练成本仅为文献中最佳模型的一小部分。
- We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
【解析】
这是一个含有宾语从句的主从复合句。主句是:We show…,后接that引导的宾语从句。that引导宾语从句时,不充当任何成分,因此也可以省略。这个宾语从句的主干是:the Transformer generalizes well to other tasks。介词短语by applying it successfully to…做方式状语,修饰主句的谓语动词show,其中代词it指代the Transformer。介词短语both with large and limited training data做状语,修饰applying,其中both…and…表示无论训练数据充足还是有限,两种情况都适用。
【参考翻译】
通过成功将其应用于英语成分句法分析任务(无论训练数据量大小),我们证明了Transformer架构能够很好地推广至其他任务。
原文 2
1 Introduction
Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.
翻译
1 引言
循环神经网络(RNN),尤其是长短期记忆(LSTM)神经网络和门控循环神经网络,已被公认为序列建模和转换任务(如语言建模和机器翻译)中最先进的解决方案。此后,大量研究持续推动着循环语言模型与编码器-解码器架构的发展。
重点句子解析
- Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation.
【解析】
句中两个逗号之间的成分是插入语,在拆分句子时可以暂时忽略。句子的主体是:Recurrent neural networks…have been firmly established as state of the art approaches. 为了使结构更加清晰,我们可以把Recurrent neural networks看作A, 把state of the art approaches看作B,从而把主体部分进一步简化为:A have been firmly established as B。其中,Recurrent neural和state of the art做定语,分别修饰networks和approaches;firmly修饰过去分词established,做状语。再来看插入语部分,即(long short-term) memory and (gated recurrent neural) networks (in particular),这里同样用括号标出了修饰成分。其中,long short-term和gated recurrent neural都是定语,分别修饰memory和networks;in particular做状语,表示强调意味。整个插入语放在主语后边,对主语进行补充说明;介词短语in sequence modeling and transduction problems做状语,说明这些方法应用的领域;such as language modeling and machine translation用于对problems进行举例说明。
【参考翻译】
循环神经网络(RNN),尤其是长短期记忆(LSTM)神经网络和门控循环神经网络,已被公认为序列建模和转换任务(如语言建模和机器翻译)中最先进的解决方案。
- Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.
【解析】
句子的主干是:Numerous efforts have continued to push the boundaries。我们可以把Numerous efforts看作A,把the boundaries看作B,从而把整句话简化为:A have continued to push B. 此外,the boundaries后边的介词短语of recurrent language models and encoder-decoder architectures是后置定语,修饰boundaries。如果我们再把recurrent language models和encoder-decoder architectures分别看作C和D,整句话也可以看作A have continued to push B (of C and D)。其中C和D是并列关系,of C and D是后置定语,修饰B,意思是:C和D的B。此外,还需要注意的是句中的since用作副词,表示:此后,后来。
【参考翻译】
此后,大量研究持续推动着循环语言模型与编码器-解码器架构的发展。