研读论文《Attention Is All You Need》
原文 3
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
翻译
循环神经网络模型通常沿着输入和输出序列的符号位置(symbol positions)进行分步计算。这些模型将序列位置与计算时间步骤对齐,生成隐藏状态序列 $h_t$,其中 $h_t$ 是前一个隐藏状态 $h_{t-1}$ 与位置 $t$ 处输入的函数。这种固有的顺序计算特性阻碍了训练样本内的并行化处理,而随着序列长度的增加,这一问题变得尤为关键——因为内存限制会制约跨样本的批处理能力。近期研究通过因子分解技巧(factorization tricks)和条件计算(conditional computation)显著提升了计算效率,后者还同时提升了模型性能。然而,顺序计算这一根本性约束依然存在。
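为帮助理解"固有的顺序计算特性",下面给出一个示意性的最小实现(函数名 rnn_step、run_rnn 均为本文为说明而假设的命名,并非论文中的实现):隐藏状态只能沿序列位置逐步计算,第 $t$ 步必须等待第 $t-1$ 步完成,因此无法在单个训练样本内部并行。

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x):
    """一步循环更新:h_t 是前一隐藏状态 h_{t-1} 与位置 t 处输入 x_t 的函数(此处用 tanh 单元示意)。"""
    return np.tanh(h_prev @ W_h + x_t @ W_x)

def run_rnn(xs, hidden_dim):
    """沿符号位置逐步计算隐藏状态序列:循环存在顺序依赖,无法在样本内部并行化。"""
    seq_len, input_dim = xs.shape
    rng = np.random.default_rng(0)
    W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    W_x = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
    h = np.zeros(hidden_dim)
    hs = []
    for t in range(seq_len):              # 第 t 步依赖第 t-1 步的结果
        h = rnn_step(h, xs[t], W_h, W_x)
        hs.append(h)
    return np.stack(hs)                   # 形状 (seq_len, hidden_dim)

hs = run_rnn(np.random.randn(6, 4), hidden_dim=8)
print(hs.shape)  # (6, 8)
```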
重点句子解析
- Recurrent models typically factor computation along the symbol positions of the input and output sequences.
【解析】
句子的主干是:Recurrent models factor computation. 这是一个主谓宾结构的简单句,其中factor用作动词,是句子的谓语;前边的Recurrent models和后边的computation分别是主语和宾语。原句中的typically是副词做状语,修饰动词factor;介词短语along the symbol positions…做地点状语,修饰factor;介词短语of the input and output sequences做后置定语,修饰the symbol positions。
【参考翻译】
循环神经网络模型通常沿着输入和输出序列的符号位置进行分步计算。
- Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$.
【解析】
本句的结构是:现在分词短语+主体+介词短语。主体是:they generate a sequence of hidden states $h_t$。这是一个主谓宾结构的简单句,其中generate是谓语,前后两边分别是主语和宾语。句首的现在分词短语Aligning the positions to steps in computation time做方式状语或伴随状语,表示align和谓语动词generate同时发生或伴随发生。这个分词短语可以简化为:Aligning A to B,其中A代表the positions,B代表steps in computation time。介词短语as a function of the previous hidden state $h_{t-1}$ and the input for position $t$ 做后置定语,修饰a sequence of hidden states $h_t$。这部分也可以简化为:as a function of C and D,表示“作为C和D的函数”。其中,C和D是并列关系,C代表the previous hidden state $h_{t-1}$,D代表the input for position $t$。
【参考翻译】
这些模型将序列位置与计算时间步骤对齐,生成隐藏状态序列 $h_t$,其中 $h_t$ 是前一个隐藏状态 $h_{t-1}$ 与位置 $t$ 处输入的函数。
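把解析中的"as a function of C and D"落到符号上,就是下面这个示意公式(其中 $f$ 泛指循环单元的状态更新函数,$x_t$ 表示位置 $t$ 处的输入;原文并未给出显式公式,此处仅为帮助理解):

$$h_t = f\bigl(h_{t-1},\, x_t\bigr)$$

即 C 对应 $h_{t-1}$,D 对应位置 $t$ 处的输入 $x_t$。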
- This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
【解析】
这个句子的结构是:主句+非限制性定语从句+原因状语从句。主句This (inherently sequential) nature precludes parallelization (within training examples)是主谓宾结构,其中precludes是谓语,前后分别是主语和宾语,第一个括号中是前置定语,修饰nature,第二个括号中是后置定语,修饰parallelization。which引导非限制性定语从句,修饰整个主句,其中which指代前边的整个主句所提到的问题或情况;as引导原因状语从句,对前边定语从句中的观点给出原因和依据。这个原因状语从句属于主谓宾结构,其谓语是动词limit,limit前后分别是主语和宾语。across examples是后置定语,修饰batching。
【参考翻译】
这种固有的顺序计算特性阻碍了训练样本内的并行化处理,而随着序列长度的增加,这一问题变得尤为关键——因为内存限制会制约跨样本的批处理能力。
- Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in case of the latter.
【解析】
这句话的主干是:Recent work has achieved significant improvements. 介词短语in computational efficiency做后置定语,修饰improvements;through factorization tricks and conditional computation做方式状语,修饰achieve;while also improving model performance…是“连词(while)+现在分词短语(improving…)”做状语,表示补充或递进,这里的while相当于and at the same time。此外,这一部分也可以改写为完整的状语从句:while it also improves model performance…,其中it指代recent work。句尾的介词短语in case of the latter指的是in case of conditional computation,意思是:就后者(即条件计算)而言。
【参考翻译】
近期研究通过因子分解技巧和条件计算显著提升了计算效率,后者还同时提升了模型性能。
- The fundamental constraint of sequential computation, however, remains.
【解析】
这是带有插入语的一个简单句。其中两个逗号之间的however是插入语,也可以把它挪到句首,即:However, the fundamental constraint of sequential computation remains. 其中,however是副词,做状语;后边是主谓结构。主语是the fundamental constraint,谓语是remains。介词短语of sequential computation做后置定语,修饰主语名词。
【参考翻译】
然而,顺序计算这一根本性约束依然存在。
原文 4
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
翻译
注意力机制已成为各类任务中强大序列建模与转换模型不可或缺的组成部分,它能够对依赖关系直接建模,而无需考虑这些依赖在输入或输出序列中的距离。然而,除少数特例外,这类注意力机制均与循环网络结合使用。
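为说明"无需考虑距离"这一点,下面给出注意力机制的一个最小示意实现(写法即标准的缩放点积注意力,变量名与示例数据均为本文假设,并非论文提供的代码):任意两个位置之间的注意力权重直接由向量相似度给出,与它们在序列中相隔多远无关。

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """缩放点积注意力(示意):输出是 V 的加权和,
    位置 i 与 j 之间的权重只取决于 Q[i] 与 K[j] 的相似度,与 |i - j| 无关。"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n, n) 两两位置的相似度
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # 按行做 softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                # 示意输入:5 个位置,每个位置 8 维
out, w = scaled_dot_product_attention(X, X, X)
print(out.shape, w.shape)                  # (5, 8) (5, 5)
```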
重点句子解析
- Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences
【解析】
句子的主干是:Attention mechanisms have become an integral part. 其语法结构为主系表。其中,have become是系动词做谓语,Attention mechanisms是主语,an integral part是表语。原句中的介词短语of compelling sequence modeling and transduction models是后置定语,修饰an integral part;另一个介词短语in various tasks也是后置定语,修饰sequence modeling and transduction models。现在分词短语allowing modeling of dependencies…做伴随状语,表示和主要谓语动词become同时发生或伴随发生的动作,其逻辑主语是Attention mechanisms;介词短语without regard to their distance是条件状语,修饰动名词modeling,表示:无需考虑其间距;in the input or output sequences是后置定语,修饰distance(其中their指代dependencies)。
【参考翻译】
注意力机制已成为各类任务中强大序列建模与转换模型不可或缺的组成部分,它能够对依赖关系直接建模,而无需考虑这些依赖在输入或输出序列中的距离。
- In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
【解析】
这是一个相对简单的句子,其结构是:介词短语+插入语+主体。主干是:such attention mechanisms are used. 这是一个主谓结构的被动句,主语是such attention mechanisms,谓语是are used。后边的介词短语in conjunction with a recurrent network做方式状语,修饰谓语动词。句首的介词短语做状语,表示适用范围;这里需要注意的是but用作介词,相当于except, 表示:除…之外。两个逗号之间的however是插入语,翻译的时候往往提到句首。
【参考翻译】
然而,除少数特例外,这类注意力机制均与循环网络结合使用。
原文 5
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
翻译
本研究提出Transformer——该模型架构完全摒弃循环结构,仅依赖注意力机制即可捕捉输入与输出间的全局依赖关系。Transformer实现了更高程度的并行化计算,仅需在8块P100 GPU上训练12小时,便能达到翻译质量的新标杆。
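与前文逐步循环的 RNN 示意代码对照,下面的片段(单头自注意力,权重矩阵与变量名均为本文为说明而假设)可以说明 Transformer 为什么"允许更高程度的并行化":整个序列所有位置的输出在几次矩阵乘法中同时得到,不需要像循环网络那样按位置顺序迭代。

```python
import numpy as np

def self_attention_layer(X, W_q, W_k, W_v):
    """单头自注意力(示意):没有沿序列位置的循环,
    所有位置的查询、键、值与输出都在整块矩阵运算中一次算出,天然适合并行。"""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # 形状 (seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
print(self_attention_layer(X, W_q, W_k, W_v).shape)    # (6, 16)
```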
重点句子解析
- In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.
【解析】
句子的主干是:we propose the Transformer。句首的介词短语做地点状语;逗号后边的a model architecture是同位语,对the Transformer进行解释说明。eschewing recurrence and instead relying entirely on an attention mechanism…是由and连接的两个并列的现在分词短语,共同做后置定语,修饰a model architecture。不定式短语to draw global dependencies between input and output做目的状语。
【参考翻译】
本研究提出Transformer——该模型架构完全摒弃循环结构,仅依赖注意力机制即可捕捉输入与输出间的全局依赖关系。
- The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
【解析】
这句话的主干是:The Transformer allows for …parallelization and can reach a … state of the art… 这句话包含了and连接的两个并列的谓语结构,主语是The Transformer,谓语分别是allows for和can reach,宾语中心词是parallelization和state of the art,其余都是修饰成分。significantly more和new都是前置定语,分别修饰parallelization和state of the art;介词短语in translation quality是后置定语,修饰state of the art。after being trained…是时间状语,相当于被动语态的时间状语从句,即:after the Transformer is trained…;介词短语for as little as twelve hours做时间状语,on eight P100 GPUs做地点状语,都修饰trained。需要注意的是,state of the art是固定短语,本意为:最先进的技术水平,此处可以意译为:标杆。
【参考翻译】
Transformer实现了更高程度的并行化计算,仅需在8块P100 GPU上训练12小时,便能达到翻译质量的新标杆。