“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”
“1 Introduction”
“Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.”
“The Transformer allows for significantly more parallelization”
“2 Background”
“In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2” (Here “this” is the number of operations required to relate signals from two arbitrary input or output positions.)
- “Self-attention”: “an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.”
- “End-to-end memory networks” are “based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks”
“3 Model Architecture”
“the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time.”
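The paper adds that at each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. A minimal Python sketch of that interface; `encode`, `decode_step`, `bos_id`, and `eos_id` are hypothetical stand-ins, not names from the paper:

```python
def greedy_decode(encode, decode_step, src_tokens, bos_id, eos_id, max_len=100):
    """Illustrative only: encode maps (x1, ..., xn) to z; decode_step looks at z
    and the symbols generated so far and returns the next symbol (e.g. the argmax
    of the predicted next-token distribution)."""
    z = encode(src_tokens)            # continuous representations z = (z1, ..., zn)
    ys = [bos_id]                     # start-of-sequence symbol
    for _ in range(max_len):
        next_sym = decode_step(z, ys) # one element at a time, conditioned on z and y1..yt
        ys.append(next_sym)
        if next_sym == eos_id:
            break
    return ys[1:]
```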
“3.1 Encoder and Decoder Stacks”
“Encoder”
“The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.”
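The paper also wraps each sub-layer in a residual connection followed by layer normalization. A rough sketch of one encoder layer under that reading, with `self_attn` and `ffn` passed in as placeholder callables (NumPy arrays assumed; learned layer-norm gain and bias omitted):

```python
def layer_norm(x, eps=1e-6):
    # Normalize each position's d_model-dimensional vector.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attn, ffn):
    """x: array of shape (seq_len, d_model); self_attn and ffn are the two sub-layers."""
    # Sub-layer 1: multi-head self-attention, with residual connection + layer norm.
    x = layer_norm(x + self_attn(x))
    # Sub-layer 2: position-wise fully connected feed-forward network, same wrapping.
    x = layer_norm(x + ffn(x))
    return x
```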
“Decoder”
“the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack”
“3.2 Attention”
“An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.”
“3.2.1 Scaled Dot-Product Attention”
“The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the values.”
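This recipe corresponds to Attention(Q, K, V) = softmax(QK^T / √dk) V. A minimal NumPy sketch, with batching and masking omitted:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); returns (n_q, d_v)."""
    d_k = Q.shape[-1]
    # Dot products of each query with all keys, scaled by 1/sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys gives the weights on the values.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```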
“3.2.2 Multi-Head Attention”
“we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively.”
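A sketch of that projection scheme, reusing the scaled_dot_product_attention function above. In the paper h = 8 and dk = dv = dmodel / h = 64; the weight names here (W_q, W_k, W_v, W_o) are only illustrative:

```python
import numpy as np

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o):
    """X_q: (n_q, d_model), X_kv: (n_kv, d_model).
    W_q, W_k, W_v: lists of h matrices projecting to d_k, d_k, d_v;
    W_o: (h * d_v, d_model) output projection."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        # Each head attends over its own learned projection of queries, keys, and values.
        heads.append(scaled_dot_product_attention(X_q @ Wq, X_kv @ Wk, X_kv @ Wv))
    # Concatenate the h heads and project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_o
```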
“3.2.3 Applications of Attention in our Model”
The Transformer uses multi-head attention in three different ways (see the sketch after this list):
- “In "encoder-decoder attention" layers,”
- “The encoder contains self-attention layers”
- “self-attention layers in the decoder”
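In the paper, these three uses differ only in where the queries, keys, and values come from: encoder self-attention takes all three from the previous encoder layer; decoder self-attention does the same on the decoder side (with masking of future positions to keep it auto-regressive); and encoder-decoder attention takes its queries from the decoder and its keys and values from the encoder output. A toy single-head sketch of those call patterns, where the learned projections are omitted so `attend` is just scaled dot-product attention between its two inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_src, n_tgt = 8, 5, 4
enc_x = rng.normal(size=(n_src, d_model))   # stand-in for the previous encoder layer output
dec_x = rng.normal(size=(n_tgt, d_model))   # stand-in for the previous decoder layer output

def attend(x_q, x_kv):
    scores = x_q @ x_kv.T / np.sqrt(d_model)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ x_kv

enc_out = attend(enc_x, enc_x)      # 1) encoder self-attention: Q, K, V all from the encoder
dec_self = attend(dec_x, dec_x)     # 2) decoder self-attention (future-position mask omitted here)
cross = attend(dec_self, enc_out)   # 3) encoder-decoder attention: Q from decoder, K and V from encoder
```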
“3.3 Position-wise Feed-Forward Networks”
“In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.”
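In the paper this network is FFN(x) = max(0, xW1 + b1)W2 + b2, with dmodel = 512 and inner dimension dff = 2048, using the same weights at every position. A NumPy sketch:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    The same two linear transformations (with a ReLU in between) are applied
    to each position separately and identically."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```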
“3.4 Embeddings and Softmax”
“we use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.”
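The paper additionally shares one weight matrix between the two embedding layers and the pre-softmax linear transformation, and multiplies the embedding weights by √dmodel. A sketch of that arrangement (the name W_embed is illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def embed(token_ids, W_embed):
    """W_embed: (vocab_size, d_model). Embeddings are scaled by sqrt(d_model)."""
    d_model = W_embed.shape[-1]
    return W_embed[token_ids] * np.sqrt(d_model)

def next_token_probs(decoder_output, W_embed):
    """Shared-weight linear transformation followed by softmax over the vocabulary."""
    logits = decoder_output @ W_embed.T
    return softmax(logits)
```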
“3.5 Positional Encoding”
“Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.”
PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)), where pos is the position and i is the dimension.
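A NumPy sketch of those sinusoids, assuming an even dmodel as in the paper's dmodel = 512:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2): the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe                                    # added to the input embeddings
```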
“4 Why Self-Attention”
“One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. The third is the path length between long-range dependencies in the network.”
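As reported in Table 1 of the paper, for sequence length n, representation dimension d, and convolution kernel width k: a self-attention layer costs O(n²·d) per layer with O(1) sequential operations and O(1) maximum path length; a recurrent layer costs O(n·d²) with O(n) sequential operations and O(n) path length; and a convolutional layer costs O(k·n·d²) with O(1) sequential operations and O(log_k(n)) path length.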