Attention: "Attention Is All You Need"

卉卉卉大爷

Original link: http://nlp.seas.harvard.edu/2018/04/03/attention.html

Translated from The Annotated Transformer

Attention

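An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The attention used in this article is "Scaled Dot-Product Attention". The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of a query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax to obtain the weights on the values.
In practice, we compute the attention function on a set of queries simultaneously (in parallel, which suits GPU acceleration), packed together into a matrix $Q$. The keys and values are likewise packed into matrices $K$ and $V$. The matrix of outputs is computed as:
$Attention(Q,K,V)=softmax(QK^{T}/\sqrt{d_k})V$

The corresponding implementation is: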

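# Imports used by the code in this article
import copy
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    # Scaled scores QK^T / sqrt(d_k), shape (..., n_queries, n_keys)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    # Normalize over the keys
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn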
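
To make the computation concrete, here is a minimal sketch of the same formula for a single query in plain Python, without PyTorch, masking, or dropout. The helper names `softmax` and `attention_single` and the toy vectors are assumptions made up for this illustration; they are not part of the original code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_single(query, keys, values):
    """Scaled dot-product attention for one query vector (no mask, no dropout)."""
    d_k = len(query)
    # One score per key: (q . k) / sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    # Output is the weighted sum of the value vectors
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(d_v)]
    return output, weights

# Toy data (hypothetical): the query is aligned with the first key.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

output, weights = attention_single(query, keys, values)
print(weights)  # the first weight is larger, since the query matches the first key
print(output)   # so the output is pulled toward the first value
```

The weights sum to 1, and the output lies between the two value vectors, closer to the one whose key best matches the query.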

The two most commonly used attention functions are additive attention and dot-product (multiplicative) attention. Dot-product attention is identical to the algorithm used here, except for the scaling factor of $1/\sqrt{d_k}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is faster and more space-efficient in practice, since it can be implemented with matrix multiplication.

For small values of $d_k$ the two mechanisms perform similarly, but for larger values of $d_k$ additive attention outperforms dot-product attention without scaling. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean 0 and variance 1. Then their dot product, $q \cdot k=\sum_{i=1}^{d_k}q_ik_i$, has mean 0 and variance $d_k$. To counteract this effect, we scale the dot products by $1/\sqrt{d_k}$.
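
The variance argument can be checked empirically. The following is a rough simulation sketch, assuming standard normal components for $q$ and $k$ (one distribution that satisfies the mean 0, variance 1 assumption). The function name `dot_product_variance`, the trial count, and the seed are arbitrary choices for this illustration.

```python
import random

def dot_product_variance(d_k, trials=5000, seed=0):
    """Estimate the variance of q . k for random q, k with N(0, 1) components."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials

d_k = 64
raw_var = dot_product_variance(d_k)
# Dividing each score by sqrt(d_k) divides its variance by d_k.
scaled_var = raw_var / d_k

print(raw_var)     # should be close to d_k
print(scaled_var)  # should be close to 1
```

With the scaling in place, the scores fed to the softmax keep roughly unit variance regardless of $d_k$.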

Multi-head attention

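Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
$MultiHead(Q,K,V)=Concat(head_1,...,head_h)W^{O} \\ where \quad head_i=Attention(QW_i^{Q},KW_i^{K},VW_i^{V})$
where the projections are parameter matrices $W_i^{Q}\in \mathbb{R}^{d_{model}\times d_k}$, $W_i^{K}\in \mathbb{R}^{d_{model}\times d_k}$, $W_i^{V}\in \mathbb{R}^{d_{model}\times d_v}$, and $W^{O}\in \mathbb{R}^{h d_v\times d_{model}}$. In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_k=d_v=d_{model}/h=64$. Because of the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.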

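# Helper from The Annotated Transformer; needs `import copy` and `import torch.nn as nn`
def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class MultiHeadedAttention(nn.Module):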
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        
        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        
        # 2) Apply attention on all the projected vectors in batch. 
        x, self.attn = attention(query, key, value, mask=mask, 
                                 dropout=self.dropout)
        
        # 3) "Concat" using a view and apply a final linear. 
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
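
The statement that splitting into heads keeps the cost close to that of one full-width head can be checked with simple arithmetic. The sketch below counts projection weights (biases ignored) and the multiply-adds needed for the score matrix $QK^T$. The sequence length `n` is a hypothetical value chosen only for illustration.

```python
d_model = 512
h = 8
d_k = d_v = d_model // h  # 64 per head

# Projection weights for h heads: one W_i^Q, W_i^K, W_i^V per head.
multi_head_proj = h * (d_model * d_k + d_model * d_k + d_model * d_v)
# A single head that keeps the full dimensionality uses three
# d_model x d_model projections.
single_head_proj = 3 * d_model * d_model

# Multiply-adds for the score matrix QK^T over n positions.
n = 10  # hypothetical sequence length
multi_head_scores = h * (n * n * d_k)
single_head_scores = n * n * d_model

print(multi_head_proj, single_head_proj)      # 786432 786432
print(multi_head_scores, single_head_scores)  # 51200 51200
```

This is also why the implementation can use one `nn.Linear(d_model, d_model)` per projection and reshape its output into `h` heads of size `d_k`.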

Applications of Attention in our Model

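The Transformer uses multi-head attention in three different ways:

In the "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.
The encoder contains self-attention layers. In a self-attention layer all of the keys, values, and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. Later positions are masked out so that the prediction for a position cannot depend on tokens that come after it.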
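
The "up to and including that position" rule for decoder self-attention amounts to a lower-triangular mask. Below is a minimal plain-Python sketch of such a mask and of how it would be applied to a score matrix. The function names are assumptions for this illustration, and the fill value of -1e9 mirrors the `masked_fill(mask == 0, -1e9)` call in the `attention` function above.

```python
def subsequent_mask(size):
    """mask[i][j] is 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]

def apply_mask(scores, mask, fill=-1e9):
    """Replace disallowed scores with a large negative number before softmax."""
    return [[s if m else fill for s, m in zip(score_row, mask_row)]
            for score_row, mask_row in zip(scores, mask)]

mask = subsequent_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

After the softmax, the masked entries receive a weight that is effectively zero, so row `i` of the attention matrix only mixes values from positions `0..i`.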

Position-wise Feed-Forward Networks

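In addition to the attention sub-layers, each layer in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between.
$FFN(x) = max(0,xW_1+b_1)W_2 + b_2$
While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of the input and output is $d_{model}=512$, and the inner layer has dimensionality $d_{ff}=2048$.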

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))
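
"Position-wise" means the same two-layer network is applied to every position independently. A small plain-Python sketch of the FFN equation illustrates this, using toy sizes ($d_{model}=2$, $d_{ff}=3$) and made-up weights that are assumptions for illustration only; dropout is omitted.

```python
def linear(x, weight, bias):
    """Compute x @ weight + bias for one vector x; weight is (len(x), len(bias))."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    hidden = [max(0.0, v) for v in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

# Hypothetical toy parameters: d_model = 2, d_ff = 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]

# Three positions; the first and third hold the same vector.
sequence = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]
outputs = [ffn(x, w1, b1, w2, b2) for x in sequence]
print(outputs)
```

Positions 0 and 2 produce identical outputs because each output depends only on the vector at that position, not on its neighbours or its index.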

Embeddings and Softmax

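Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. We also use a learned linear transformation and a softmax function to convert the decoder output to predicted next-token probabilities. In this model, the same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation. In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$.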

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
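
The `Embeddings` class above covers only the lookup and the $\sqrt{d_{model}}$ scaling. As a rough illustration of what sharing one weight matrix between an embedding layer and the pre-softmax linear transformation means, here is a plain-Python sketch. The three-token vocabulary, the weights, and the function names are all hypothetical, and feeding an embedding straight into the output projection is done only to keep the example short.

```python
import math

d_model = 4
# Hypothetical shared weight matrix: one row of size d_model per vocabulary token.
shared = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, -0.1, 0.0, 0.2],
    [-0.3, 0.3, 0.1, -0.2],
]

def embed(token_id):
    """Embedding lookup, scaled by sqrt(d_model)."""
    return [w * math.sqrt(d_model) for w in shared[token_id]]

def logits(hidden):
    """Pre-softmax scores: dot product of the hidden vector with every shared row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in shared]

vec = embed(1)        # sqrt(4) = 2, so each weight in row 1 is doubled
scores = logits(vec)  # one score per vocabulary token
print(vec)
print(scores)
```

The same rows serve as input vectors in `embed` and as output classifiers in `logits`, which is the sense in which the weights are shared.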

Positional Encoding

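Since the model contains no recurrence and no convolution, it needs some information about the position of each token in order to make use of the order of the sequence. To this end, we add "positional encodings" to the input embeddings. The positional encodings have the same dimension $d_{model}$ as the embeddings, so the two can be summed. There are many choices of positional encodings, learned and fixed.
In this work, we use sine and cosine functions of different frequencies:
$PE_{(pos,2i)}=sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos,2i+1)}=cos(pos/10000^{2i/d_{model}})$
where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000\cdot2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

In addition, we apply dropout to the sums of the embeddings and the positional encodings. For the base model, we use a rate of $P_{drop}=0.1$.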

class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
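        # Use float tensors explicitly so the products below are not integer-typed
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) *
                             -(math.log(10000.0) / d_model))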
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # pe is a registered buffer, not a parameter, so it carries no gradient
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
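
The claim that $PE_{pos+k}$ is a linear function of $PE_{pos}$ follows from the angle-addition identities: each (sin, cos) pair is rotated by an angle that depends only on $k$ and the pair's frequency. The plain-Python sketch below checks this numerically. The function names are assumptions for this illustration, and `d_model` is assumed to be even.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position, following the PE formula above."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):  # i runs over the even indices (2i in the formula)
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        pe[i + 1] = math.cos(angle)
    return pe

def shift(pe, k, d_model):
    """Map PE_pos to PE_{pos+k} by rotating each (sin, cos) pair; linear in pe."""
    out = [0.0] * d_model
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))  # frequency of this pair
        s, c = pe[i], pe[i + 1]
        out[i] = s * math.cos(k * w) + c * math.sin(k * w)      # sin(a + b)
        out[i + 1] = c * math.cos(k * w) - s * math.sin(k * w)  # cos(a + b)
    return out

d_model = 8
shifted = shift(positional_encoding(5, d_model), 3, d_model)
direct = positional_encoding(8, d_model)
max_error = max(abs(a - b) for a, b in zip(shifted, direct))
print(max_error)  # tiny, limited only by floating-point rounding
```

The coefficients in `shift` depend on `k` but not on `pos`, so the same linear map works for every starting position.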

Full Model

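Here we define a function that builds the full model from hyperparameters. It relies on the EncoderDecoder, Encoder, EncoderLayer, Decoder, DecoderLayer, and Generator classes from The Annotated Transformer, which are not reproduced in this article.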

def make_model(src_vocab, tgt_vocab, N=6, 
               d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), 
                             c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))
    
    # This was important from their code. 
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model