Attention Is All You Need: Reading Notes

yyyybupt

Category: nlp

Article link: https://blog.csdn.net/qq_41747565/article/details/100806703

Paper link

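Highlights

The paper proposes a new, simple network architecture: the Transformer

Based entirely on attention mechanisms
Dispenses with recurrence and convolutions entirely

Recurrent models factor computation along the symbol positions of the input and output sequences. This inherently sequential nature prevents parallelization within a training example, and for longer sequences memory constraints further limit batching across examples
Attention mechanisms allow dependencies to be modeled regardless of their distance in the input or output sequences, but in most cases they are used together with a recurrent network

The Transformer allows significantly more parallelization and can reach high translation quality after as little as 12 hours of training on 8 P100 GPUs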

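Background

Extended Neural GPU, ByteNet and ConvS2S all use convolutional neural networks as the basic building block and compute hidden representations in parallel for all input and output positions. In these models the number of operations needed to relate signals from two positions grows with the distance between them. The Transformer reduces this to a constant number of operations, at the cost of reduced effective resolution caused by averaging attention-weighted positions, an effect that Multi-Head Attention counteracts
Self-attention relates different positions of a single sequence in order to compute a representation of that sequence
End-to-end memory networks are based on a recurrent attention mechanism rather than sequence-aligned recurrence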

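Model Overview

The encoder maps an input sequence of symbol representations $(x_1,\dots,x_n)$ to a sequence of continuous representations $z=(z_1,\dots,z_n)$
Given $z$, the decoder generates an output sequence of symbols $(y_1,\dots,y_m)$ one element at a time
At each step the model is auto-regressive: it consumes the previously generated symbols as additional input when generating the next one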
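
The auto-regressive loop described above can be sketched as follows. This is only an illustration: `greedy_decode`, `toy_next_token` and the `<bos>`/`<eos>` markers are hypothetical names introduced here, and the toy "decoder" simply copies the encoder output instead of running a real network.

```python
def greedy_decode(next_token, z, bos, eos, max_len=10):
    """Generate symbols one at a time, feeding earlier outputs back in."""
    ys = [bos]                      # previously generated symbols
    for _ in range(max_len):
        y = next_token(z, ys)       # next symbol given encoder output z and history ys
        ys.append(y)
        if y == eos:
            break
    return ys[1:]                   # drop the start marker

def toy_next_token(z, ys):
    # Hypothetical stand-in for the decoder: copy z element by element,
    # then emit the end-of-sequence marker.
    pos = len(ys) - 1
    return z[pos] if pos < len(z) else "<eos>"

decoded = greedy_decode(toy_next_token, ["a", "b", "c"], "<bos>", "<eos>")
```

The point of the sketch is only the data flow: each call sees `z` plus everything generated so far.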

Straight to the architecture figure (Figure 1 in the paper):

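1. Encoder and Decoder Stacks

Encoder

6 identical layers, each with two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network
A residual connection is applied around each sub-layer, followed by layer normalization: $\mathrm{LayerNorm}(x+\mathrm{Sublayer}(x))$
To make these residual connections possible, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model}=512$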
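
A minimal sketch of $\mathrm{LayerNorm}(x+\mathrm{Sublayer}(x))$ for a single position vector, assuming a layer normalization without the learned gain and bias parameters. `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, and `eps` is an assumed stabilizer value.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    # The residual sum only works if the sub-layer keeps the model dimension.
    assert len(y) == len(x), "sub-layer must preserve the model dimension"
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    # Hypothetical sub-layer: doubles every component.
    return [2.0 * v for v in x]

out = residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

The dimension check in `residual_block` is the reason every sub-layer output shares the same size $d_{model}$.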

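Decoder

6 identical layers, each with three sub-layers: a masked multi-head self-attention mechanism, multi-head attention over the output of the encoder stack, and a position-wise fully connected feed-forward network
A residual connection is applied around each sub-layer, followed by layer normalization: $\mathrm{LayerNorm}(x+\mathrm{Sublayer}(x))$
The decoder's self-attention sub-layer is masked so that a position cannot attend to subsequent positions. Combined with the output embeddings being offset by one position, this ensures that the prediction for position $i$ depends only on the known outputs at positions less than $i$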

2. Attention

(1) Scaled Dot-Product Attention

Inputs: queries and keys of dimension $d_k$, and values of dimension $d_v$; $\frac1{\sqrt{d_k}}$ is the scaling factor

$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V$
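
The formula can be traced with a small pure-Python sketch. Matrices are lists of rows, and the helper names (`softmax`, `matmul`, `transpose`, `attention`) and toy inputs are illustrative choices, not taken from the paper.

```python
import math

def softmax(row):
    m = max(row)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V; also returns the attention weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))              # one score per (query, key) pair
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

Q = [[1.0, 0.0], [0.0, 1.0]]                      # 2 queries, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # 3 keys,    d_k = 2
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # 3 values,  d_v = 2
out, weights = attention(Q, K, V)
```

Each output row is a weighted average of the rows of `V`, with weights that sum to 1 for every query.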

(2) Multi-Head Attention

$\begin{array}{l}\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(head_1,\dots,head_h)W^O\\\mathrm{where}\;head_i=\mathrm{Attention}(QW_i^Q,KW_i^K,VW_i^V)\end{array}$

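The queries, keys and values are projected with parameter matrices: $W_i^Q\in R^{d_{model}\times d_k},W_i^K\in R^{d_{model}\times d_k},W_i^V\in R^{d_{model}\times d_v},W^O\in R^{hd_v\times d_{model}}$

The model uses h=8 parallel attention layers (heads): $d_k=d_v=\frac{d_{model}}h=64$

Because each head works in a reduced dimension, the total computational cost of multi-head attention is similar to that of single-head attention with full dimensionality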
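
A rough sketch of the multi-head computation with toy sizes ($d_{model}=8$, $h=2$ instead of 512 and 8). The random weights, the seed and all helper names are assumptions made for illustration; only the shapes follow the definitions above.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """Concat(head_1, ..., head_h) W^O, with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    # Concatenate the head outputs along the feature axis, position by position.
    concat = [sum((head[i] for head in heads), []) for i in range(len(q))]
    return matmul(concat, w_o)

rng = random.Random(0)                            # arbitrary seed for reproducibility

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d_model, h = 8, 2                                 # toy sizes
d_k = d_v = d_model // h
W_Q = [rand_matrix(d_model, d_k) for _ in range(h)]
W_K = [rand_matrix(d_model, d_k) for _ in range(h)]
W_V = [rand_matrix(d_model, d_v) for _ in range(h)]
W_O = rand_matrix(h * d_v, d_model)

x = rand_matrix(3, d_model)                       # 3 positions
y = multi_head(x, x, x, W_Q, W_K, W_V, W_O)       # self-attention: Q = K = V = x
```

Since $h\cdot d_v=d_{model}$, the concatenated heads have the same width as a single full-dimension head, which is why the cost stays comparable.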

(3) Applications of Attention in the Model

encoder-decoder attention: the queries <-- the previous decoder layer, the memory keys and values <-- the output of the encoder
encoder self-attention: all of the keys, values and queries <-- the output of the previous layer in the encoder
decoder self-attention:

all of the keys, values and queries <-- the output of the previous layer in the decoder
preserve the auto-regressive property and prevent leftward information flow <-- masking out (setting to $-\infty$ ) all values in the input of the softmax that correspond to illegal connections
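
The masking step in decoder self-attention can be illustrated with a small sketch. `causal_mask` and `masked_softmax` are hypothetical helper names, and the all-zero scores are a toy input chosen so the effect of the mask is easy to read.

```python
import math

def causal_mask(n):
    """mask[i][j] is True where query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    out = []
    for row, allowed in zip(scores, mask):
        # Illegal connections are set to -inf, so exp() maps them to exactly 0.
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(masked)                           # finite: the diagonal is always allowed
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 4 for _ in range(4)]            # toy scores: every pair equally similar
weights = masked_softmax(scores, causal_mask(4))
```

With equal scores, position 0 attends only to itself and position 3 spreads its weight evenly over positions 0 to 3; no weight ever lands on a later position.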

(4) Self-Attention

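Self-attention is compared with recurrent and convolutional layers on three criteria:
The total computational complexity per layer
The amount of computation that can be parallelized --> measured by the minimum number of sequential operations required
The path length between long-range dependencies in the network --> the maximum path length between any two input and output positions in networks composed of the different layer types

where n is the sequence length, d is the representation dimension, k is the convolution kernel size, and r is the neighborhood size in restricted self-attention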
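
The symbols n, d, k and r belong to the comparison table in the paper (Table 1). As a hedged aid, the sketch below encodes the asymptotic terms from that table as I recall them, with constants dropped, so they can be evaluated for concrete sizes. The function name and the sample values are illustrative, and the numbers are orders of growth rather than measured costs.

```python
import math

def per_layer_costs(n, d, k, r):
    """Big-O terms per layer type: complexity, sequential operations, maximum path length."""
    return {
        "self-attention": {
            "complexity": n * n * d, "sequential_ops": 1, "max_path": 1},
        "recurrent": {
            "complexity": n * d * d, "sequential_ops": n, "max_path": n},
        "convolutional": {
            "complexity": k * n * d * d, "sequential_ops": 1, "max_path": math.log(n, k)},
        "self-attention (restricted)": {
            "complexity": r * n * d, "sequential_ops": 1, "max_path": n / r},
    }

# Sample sizes (assumed): a 50-token sentence with 512-dimensional representations.
costs = per_layer_costs(n=50, d=512, k=3, r=10)
```

For these sizes the self-attention term is smaller than the recurrent one, which holds whenever n is smaller than d.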

3. Position-wise Feed-Forward Networks

$\mathrm{FFN}(x)=\max(0,xW_1+b_1)W_2+b_2$

The input and output dimension is $d_{model}=512$ , and the inner-layer dimension is $d_{ff}=2048$
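
A small sketch of $\mathrm{FFN}(x)$ applied to one position vector, using toy sizes (input dimension 2, inner dimension 3) instead of 512 and 2048. The weight values are made up for illustration.

```python
def ffn(x, w1, b1, w2, b2):
    """max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    # First linear map followed by ReLU; zip(*w1) iterates over the columns of W1.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    # Second linear map back to the model dimension.
    return [sum(hj * w for hj, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]

W1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]          # shape (2, 3): model dim -> inner dim
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 2.0],
      [3.0, 4.0],
      [5.0, 6.0]]               # shape (3, 2): inner dim -> model dim
b2 = [0.5, -0.5]

out = ffn([1.0, -2.0], W1, b1, W2, b2)
```

Here $xW_1=[1,-2,-3]$, the ReLU keeps only the first component, and the result is `[1.5, 1.5]`. "Position-wise" means the same `ffn` is applied independently to every position.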

4. Embeddings and Softmax

Embeddings: convert the input and output tokens to vectors of dimension $d_{model}$

Linear transformation and softmax function: convert the decoder output into predicted next-token probabilities
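
Both steps can be sketched with a toy vocabulary of 3 tokens and $d_{model}=2$. The embedding table, the projection weights and the function names are invented for illustration.

```python
import math

def embed(token_ids, table):
    """Look up one d_model-dimensional vector per token id."""
    return [table[t] for t in token_ids]

def next_token_probs(h, w, b):
    """Linear transformation of a decoder output vector h, followed by softmax."""
    logits = [sum(hi * wij for hi, wij in zip(h, col)) + bj
              for col, bj in zip(zip(*w), b)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]      # shape (vocab, d_model)
vectors = embed([2, 0], table)

W = [[2.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]                             # shape (d_model, vocab)
b = [0.0, 0.0, 0.0]
probs = next_token_probs([1.0, 0.0], W, b)        # logits are [2, 0, -1]
```

The probabilities sum to 1, and the token with the largest logit (index 0 here) is the most likely next token.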

5. Positional Encoding

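Since the model contains no recurrence and no convolution, it needs some information about the relative or absolute position of the tokens in order to make use of the sequence order (the encoding has dimension $d_{model}$ )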

$\begin{array}{l}PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})\\PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})\end{array}$

where pos is the position and i is the dimension index --> each dimension of the positional encoding corresponds to a sinusoid
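
The two formulas translate directly into code. In this sketch the loop variable `two_i` plays the role of $2i$; the function name and the toy sizes are illustrative.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for two_i in range(0, d_model, 2):
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)          # even dimensions use sine
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=4, d_model=6)
```

Position 0 always encodes to alternating 0 and 1, and the wavelength grows with the dimension index, so lower dimensions change quickly with position while higher ones change slowly.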

Model Training

Training data: the WMT 2014 English-German dataset (about 4.5 million sentence pairs) and the WMT 2014 English-French dataset (36M sentences)
Sentence pairs are batched by approximate sequence length; each batch contains roughly 25000 source tokens and 25000 target tokens
Hardware: one machine with 8 NVIDIA P100 GPUs
Training hyperparameters: see part 2 of the experimental results
Optimizer: Adam optimizer with $\beta_1=0.9,\;\beta_2=0.98\;\mathrm{and}\;\varepsilon=10^{-9}$ , and a learning rate of $lrate=d_{model}^{-0.5}\cdot\min(step\_num^{-0.5},step\_num\cdot warmup\_steps^{-1.5})$
Regularization: Residual Dropout and Label Smoothing
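
The learning-rate formula can be evaluated directly. In this sketch `warmup_steps=4000` is an assumption: it is the value I recall from the paper and is not stated in these notes.

```python
def lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (step_num starts at 1)."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

early = lrate(1000)     # inside the warmup phase: grows linearly with the step
peak = lrate(4000)      # both arguments of min() meet at step_num == warmup_steps
late = lrate(16000)     # afterwards: decays with the inverse square root of the step
```

The rate rises linearly during warmup, peaks at `step_num == warmup_steps`, and then halves every time the step count quadruples.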