Notes on Dissecting the Transformer (Self-Attention), Widely Used in NLP

zhyuxie

Categories: Deep Learning, NLP

Article link: https://blog.csdn.net/dakenz/article/details/85150676

Module Breakdown
Animated Demo
References

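Since Google published the paper Attention is All You Need at NIPS 2017, the transformer has been adopted in many NLP scenarios. Some work uses the full seq2seq architecture, while many tasks use only part of it, including the recently popular BERT (encoder stack only) and GPT (a decoder-style stack with masked self-attention). This post follows the paper and a highly starred GitHub implementation, transformer, to walk through how the transformer works.
Before the transformer, seq2seq tasks were mainly handled by encoder+decoder frameworks built from recurrent networks such as RNN/LSTM, or from CNNs, optionally with an Attention mechanism on top. These approaches were very successful but still had pain points; for example, an RNN processes tokens one after another, so it cannot be parallelized across time steps and is slow. Google researchers therefore proposed the transformer. Before describing it in detail, here is a list of its advantages.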

The transformer achieved state-of-the-art performance in translation models; the Google blog provides a performance comparison chart.

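Lower per-layer computational complexity

In the paper's comparison table, a self-attention layer costs O(n^2 · d) per layer, a recurrent layer O(n · d^2), a convolutional layer O(k · n · d^2), and restricted self-attention O(r · n · d). Here n is the sentence length, d is the embedding dimension, k is the CNN kernel size, and r is the window size of restricted self-attention. In most cases the sentence length satisfies n < d, so self-attention has a lower per-layer complexity than a recurrent or convolutional layer.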

Computation can be parallelized
It handles long-range dependencies better, which is a hard problem in sequence tasks such as natural language
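
The per-layer complexity comparison above can be made concrete with a rough operation count. This is only a sketch: constant factors are dropped, and the values chosen for n, d, k and r are illustrative assumptions rather than figures from the paper.

```python
def per_layer_cost(n, d, k=3, r=16):
    """Rough per-layer operation counts with constant factors dropped."""
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
        "restricted_self_attention": r * n * d,
    }

# Hypothetical setting: a 50-token sentence and 512-dimensional embeddings.
costs = per_layer_cost(n=50, d=512)
for name, ops in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name:>26}: {ops:,}")
```

With n < d the self-attention count stays below the recurrent and convolutional ones; once n grows past d the ordering flips, which is the case restricted self-attention is meant to address.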

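Module Breakdown

With so many advantages, the transformer deserves a closer look. The walkthrough below roughly follows the data flow, working from the top level down.

Overall block diagram

Overall network diagram

Encoder-Decoder framework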

Encoder

Input Embedding

First, an embedding layer maps tokens to vectors; it works like an ordinary embedding layer.

def embedding(inputs, 
              vocab_size, 
              num_units, 
              zero_pad=True, 
              scale=True,
              scope="embedding", 
              reuse=None):
    '''Embeds a given tensor.
    Args:
      inputs: A `Tensor` with type `int32` or `int64` containing the ids
         to be looked up in `lookup table`.
      vocab_size: An int. Vocabulary size.
      num_units: An int. Number of embedding hidden units.
      zero_pad: A boolean. If True, all the values of the first row (id 0)
        should be constant zeros.
      scale: A boolean. If True, the outputs are multiplied by sqrt(num_units).
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A `Tensor` with one more rank than inputs's. The last dimensionality
        should be `num_units`.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        lookup_table = tf.get_variable('lookup_table',
                                       dtype=tf.float32,
                                       shape=[vocab_size, num_units],
                                       initializer=tf.contrib.layers.xavier_initializer())
        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                      lookup_table[1:, :]), 0)
        outputs = tf.nn.embedding_lookup(lookup_table, inputs)
        
        if scale:
            outputs = outputs * (num_units ** 0.5) 
            
    return outputs

Positional Encoding

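This module compensates for a weakness of self attention: on its own it has no notion of where each token sits in the sequence. So that it can be summed with the embedding output, the positional encoding layer has the same dimensionality as the embedding layer.
The paper describes two kinds of position embedding: one is learned directly, and the other uses sine/cosine functions to represent token positions, with the following formulas

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$
where pos is the position of the token (its index in the sequence), i indexes the dimension, and d_model is the embedding dimension
The two position embedding methods performed about the same in experiments, but the paper chose the second one, citing two advantages:

It makes relative positions representable: PE_{pos+k} can be written as a linear function of PE_{pos}
It may generalize better to sentences longer than those seen in the training set.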
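
The first advantage can be checked numerically. The sketch below builds the sinusoidal table from the formula and a block-diagonal matrix that shifts every position by k. The helper names `sinusoid_table` and `shift_matrix` are made up for this illustration, and an even d_model is assumed.

```python
import numpy as np

def sinusoid_table(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2.0 * i / d_model)
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

def shift_matrix(k, d_model):
    """Block-diagonal matrix M_k with PE[pos + k] = M_k @ PE[pos] for every pos."""
    freqs = 1.0 / np.power(10000.0, 2.0 * np.arange(d_model // 2) / d_model)
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        # Angle-addition identities for one (sin, cos) pair.
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

pe = sinusoid_table(max_len=64, d_model=16)
M = shift_matrix(k=5, d_model=16)
print(np.allclose(pe[:-5] @ M.T, pe[5:]))   # the same M works for all positions
```

The matrix depends only on the offset k, not on pos, which is what makes a relative offset expressible as a fixed linear map.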

The Positional Encoding source code from the GitHub repository is as follows

def positional_encoding(inputs,
                        num_units,
                        zero_pad=True,
                        scale=True,
                        scope="positional_encoding",
                        reuse=None):
    '''Sinusoidal Positional_Encoding.
    Args:
      inputs: A 2d Tensor with shape of (N, T).
      num_units: Output dimensionality
      zero_pad: Boolean. If True, all the values of the first row (id = 0) should be constant zero
      scale: Boolean. If True, the output will be multiplied by sqrt num_units(check details from paper)
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
        A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units'
    '''

    N, T = inputs.get_shape().as_list()
    with tf.variable_scope(scope, reuse=reuse):
        position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])

        # First part of the PE function: sin and cos argument.
        # Columns 2i and 2i+1 share the exponent 2i/num_units, as in the paper's formula.
        position_enc = np.array([
            [pos / np.power(10000, 2.*(i//2)/num_units) for i in range(num_units)]
            for pos in range(T)])

        # Second part, apply the sine to even columns and the cosine to odd columns.
        position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # dim 2i
        position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # dim 2i+1

        # Convert to a float32 tensor so it can be concatenated with tf.zeros below
        lookup_table = tf.convert_to_tensor(position_enc, dtype=tf.float32)

        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                      lookup_table[1:, :]), 0)
        outputs = tf.nn.embedding_lookup(lookup_table, position_ind)

        if scale:
            outputs = outputs * num_units**0.5

        return outputs

Encoder stacks

Multi-Head Attention

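[Figure: Multi-Head Attention]

The mathematical form of Multi-head Attention is as follows

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$
Multi-head Attention can be divided into three modules

[Figure: Multi-Head Attention split into three modules]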

Linear Projection
This module (module 1) maps K/Q/V to the inputs of the h heads; the code is as follows

# Linear projections
Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu) # (N, T_q, C)
K = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
V = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
        
# Split and concat
Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, C/h) 
K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 
V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, C/h)

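Taken separately, the Linear projections and Split and concat steps in the code are hard to follow. Together they fuse h separate linear projections, each mapping d dimensions to d/h dimensions, into a single operation. Note that this implementation applies a ReLU activation to the projections, whereas the projections in the paper are purely linear.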
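
That a single wide projection followed by a split matches h separate per-head projections can be checked with a small NumPy sketch. The sizes, the random weights, and the omission of the bias are assumptions made for illustration; an elementwise activation such as ReLU would not change the result.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d, h = 2, 5, 8, 4            # batch, length, model width, heads (illustrative)
x = rng.normal(size=(N, T, d))
W = rng.normal(size=(d, d))        # one fused projection matrix

# Fused route: one (d -> d) projection, then split the last axis and stack on the batch axis.
Q = x @ W                                               # (N, T, d)
Q_ = np.concatenate(np.split(Q, h, axis=2), axis=0)     # (h*N, T, d/h)

# Per-head route: h separate (d -> d/h) projections using column blocks of W.
heads = [x @ W[:, i * d // h:(i + 1) * d // h] for i in range(h)]
Q_heads = np.concatenate(heads, axis=0)                 # (h*N, T, d/h)

print(Q_.shape, np.allclose(Q_, Q_heads))
```

Stacking the heads on the batch axis lets the later attention matmul treat every head as just another batch element.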

Scaled Dot-Product Attention

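[Figure: Scaled Dot-Product Attention]
This module (module 2) is self attention, the core idea of the paper; its mathematical expression is as follows

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

The two mainstream forms of attention are additive attention and dot-product attention. Their theoretical complexity is similar; dot-product attention was chosen because in practice it is faster and more space-efficient. As for dividing by sqrt(d_k), the paper argues that when the key dimension d_k is large, the dot products can grow large in magnitude and push the softmax into a region with very small gradients, which hurts training.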
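
The scaling argument can be illustrated numerically. The sketch below assumes query and key components are independent with zero mean and unit variance, which is the setting the paper uses to motivate the factor; the sample count and the d_k values are arbitrary choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
stds = {}
for d_k in (4, 64, 1024):
    q = rng.normal(size=(2000, d_k))          # unit-variance query components (assumption)
    k = rng.normal(size=(2000, d_k))          # unit-variance key components (assumption)
    raw = (q * k).sum(axis=1)                 # unscaled dot products, std about sqrt(d_k)
    stds[d_k] = (raw.std(), (raw / np.sqrt(d_k)).std())
    print(d_k, stds[d_k])

# Larger logits make the softmax peakier, which is where its gradients shrink.
logits = rng.normal(size=8)
peak_small = softmax(logits).max()
peak_large = softmax(logits * np.sqrt(1024)).max()
print(peak_small, peak_large)
```

The unscaled standard deviation grows roughly like sqrt(d_k), while the scaled one stays near 1 regardless of d_k.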

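In the encoder, Q=K=V. In the first layer they are all the sum of the input embedding and the positional encoding; in later layers they are the output of the previous encoder layer.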

The code for module 2 is as follows

        # Multiplication
        outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1])) # (h*N, T_q, T_k)
        
        # Scale
        outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)
        
        # Key Masking
        key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1))) # (N, T_k)
        key_masks = tf.tile(key_masks, [num_heads, 1]) # (h*N, T_k)
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1]) # (h*N, T_q, T_k)
        
        paddings = tf.ones_like(outputs)*(-2**32+1)
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs) # (h*N, T_q, T_k)
  
        # Causality = Future blinding
        if causality:
            diag_vals = tf.ones_like(outputs[0, :, :]) # (T_q, T_k)
            tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense() # (T_q, T_k)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1]) # (h*N, T_q, T_k)
   
            paddings = tf.ones_like(masks)*(-2**32+1)
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs) # (h*N, T_q, T_k)
  
        # Activation
        outputs = tf.nn.softmax(outputs) # (h*N, T_q, T_k)
         
        # Query Masking
        query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis=-1))) # (N, T_q)
        query_masks = tf.tile(query_masks, [num_heads, 1]) # (h*N, T_q)
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]]) # (h*N, T_q, T_k)
        outputs *= query_masks # elementwise; shape stays (h*N, T_q, T_k)
          
        # Dropouts
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))
               
        # Weighted sum
        outputs = tf.matmul(outputs, V_) # ( h*N, T_q, C/h)

Concat
This step is simple: it concatenates the results of the heads

        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2 ) # (N, T_q, C)

Example for how Self-Attention works

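[Figure: self-attention example on a two-word input]

step1. Apply the Linear Projection to obtain Qi/Ki/Vi
step2. Compute the weight that each word contributes to the current word (in the figure, the two words get weights 0.88 and 0.12 for the word "Thinking")
step3. Obtain the self-attention representation of the current word (the figure uses the first word as the example and obtains Z1)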
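
The three steps can be traced in NumPy for a two-word input. The embeddings and projection matrices below are random placeholders, so the weights will not reproduce the 0.88 and 0.12 from the figure; only the sequence of operations is the point.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 4, 3                       # illustrative sizes
X = rng.normal(size=(2, d_model))         # embeddings for a two-word sentence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# Step 1: linear projections give one query, key and value row per word.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Step 2: scores of word 1's query against every key, scaled and softmax-normalised.
scores = Q[0] @ K.T / np.sqrt(d_k)        # shape (2,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # the two weights sum to 1

# Step 3: the weighted sum of the value rows is word 1's self-attention output Z1.
Z1 = weights @ V                          # shape (d_k,)
print(weights, Z1)
```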

Why Multi-Head

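A single Scaled Dot-Product Attention can be called Single Attention. Multi-Head is introduced for the following reason: each head acts as a separate channel that can independently extract different useful features, which makes self-attention a much stronger feature extractor. Because each head works in a reduced dimension d/h, the total computational cost is about the same as Single Attention with full dimensionality.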

Feed Forward

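Position-wise Feed-Forward Networks consist of two linear transformations with a ReLU activation in between; the mathematical expression is as follows

$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2$$

It can also be viewed as a CNN with kernel size 1, and the code indeed implements it with conv1d(). The FFN layer is followed by a Residual connection and Normalization.
The Feed Forward code is as follows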

def feedforward(inputs, 
                num_units=[2048, 512],
                scope="multihead_attention", 
                reuse=None):
    '''Point-wise feed forward net.
    
    Args:
      inputs: A 3d tensor with shape of [N, T, C].
      num_units: A list of two integers.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
        
    Returns:
      A 3d tensor with the same shape and dtype as inputs
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Inner layer
        params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                  "activation": tf.nn.relu, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        
        # Readout layer
        params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                  "activation": None, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        
        # Residual connection
        outputs += inputs
        
        # Normalize
        outputs = normalize(outputs)
    
    return outputs
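
The "position-wise" property, which is also why a kernel-size-1 convolution is an equivalent implementation, can be checked in NumPy: applying the same two-layer network to each position separately gives the same result as applying it to the whole sequence. The sizes and random weights are assumptions for illustration, and the residual connection and normalization are left out.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied along the last axis."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
N, T, d_model, d_ff = 2, 5, 8, 32         # small stand-ins for the code's 512 and 2048
x = rng.normal(size=(N, T, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)

whole = ffn(x, W1, b1, W2, b2)            # all positions at once
per_pos = np.stack([ffn(x[:, t], W1, b1, W2, b2) for t in range(T)], axis=1)

print(whole.shape, np.allclose(whole, per_pos))
```

No information moves between positions in this sub-layer; mixing across positions happens only in the attention sub-layers.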

Decoder

Decoder demo

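The English blog post The Illustrated Transformer contains two excellent animations. The first shows the flow from the encoder to the decoder's first output, and the second shows the remaining steps. These carefully made animations should help a lot with understanding.
First decoder step

Subsequent decoder steps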

Output Embedding

Same as the Encoder embedding section

Positional Encoding

Same as the Encoder positional encoding section

Decoder stacks

Masked Multi-Head Attention

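The mask is needed because, in the decoder, the words that come next are still unknown at the current step, so Self-Attention must mask them out. As the figure below shows, Encoder-Decoder Attention is the Attention proposed in earlier translation models. The lower-left panel is the Self-Attention used in the Encoder: the sentence is complete there, so Self-Attention can be computed over all positions. The lower-right panel is Masked Self-Attention, which ensures that predicting the i-th word uses only the first i-1 words
[Figure: Encoder-Decoder Attention, Encoder Self-Attention (lower left), and Masked Self-Attention (lower right)]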
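
A minimal NumPy sketch of the masking idea follows. It is not the repository's implementation; the helper name `causal_attention_weights` and the constant -1e9 are choices made for this example. Future positions receive a large negative score before the softmax, so their weights come out as zero.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over keys after hiding future positions (key index > query index)."""
    T = scores.shape[-1]
    allowed = np.tril(np.ones((T, T), dtype=bool))        # lower triangle incl. diagonal
    masked = np.where(allowed, scores, -1e9)              # large negative for hidden keys
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.normal(size=(4, 4)))
print(np.round(weights, 2))
```

Row t has non-zero weights only in columns 0..t. Since the decoder input is the target sequence shifted right by one, attending to input positions up to t corresponds to using only previously generated words.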

Multi-Head Attention

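The difference from the Multi-Head Attention in the Encoder is that here Q comes from the preceding decoder sub-layer, while K=V is the Encoder output
This layer brings in the encoder's information: based on the output of the preceding Masked Multi-Head Attention, it focuses on the source words most relevant to generating the next word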
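
The shapes involved can be seen in a single-head NumPy sketch. The arrays `memory` and `dec` are random stand-ins for the encoder output and the decoder-side queries, and the lengths are arbitrary.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for a single head, no masking."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # (T_q, T_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
T_src, T_tgt, d = 6, 3, 8                 # illustrative lengths and width
memory = rng.normal(size=(T_src, d))      # stand-in for the encoder output
dec = rng.normal(size=(T_tgt, d))         # stand-in for the masked self-attention output

out, w = attention(Q=dec, K=memory, V=memory)
print(out.shape, w.shape)
```

Each target position gets one row of weights over the source positions, so the output has the target length while drawing its content from the encoder side.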

Feed Forward

Same as the Encoder section

Linear Projection

This is just an ordinary linear layer; the Linear projection source code from the GitHub repository is as follows

# Final linear projection
self.logits = tf.layers.dense(self.dec, len(en2idx))

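Animated Demo

[Animation: demo of the transformer translation model, from the Google blog]
The animation shows that the Encoder runs Self-Attention in parallel over 3 layers. The Decoder also has 3 layers. Starting from <start>, it produces Je, which depends only on the encoder output and <start>; the next word, suis, additionally depends on Je, so the amount of computation grows as decoding proceeds.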