The NLPer's Road to Growth (Part 1): Understanding the Transformer in Depth

由比ヶ浜結衣

Original article: https://blog.csdn.net/qq_44159956/article/details/111148955

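I recently needed to use the Transformer and spent a few days working through the source code. The elegance of its design is striking. This is the first part of a series of long-form articles.

1. Preliminaries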

1.1 BLEU

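BLEU (Bilingual Evaluation Understudy) measures the N-gram overlap between a model-generated sequence and a set of reference sequences. It was first proposed for evaluating machine translation and is now widely used across sequence generation tasks.

Let $x$ be a candidate sequence generated from the model distribution $p_\theta$, let $s^{(1)},\cdots,s^{(K)}$ be a set of reference sequences drawn from the true data distribution, and let $W$ be the set of all $N$-grams extracted from the candidate. The precision of these $N$-grams is
$P_N(x)=\frac{\sum\limits_{w \in W} \min\left(c_w(x),\max\limits_{k=1}^{K}c_w(s^{(k)})\right)}{\sum\limits_{w\in W}c_w(x)}$
where $c_w(x)$ is the number of times the $N$-gram $w$ occurs in the generated sequence $x$, and $c_w(s^{(k)})$ is the number of times it occurs in the reference $s^{(k)}$. $P_N(x)$ is thus the fraction of the candidate's $N$-grams that also appear in the references, with each count clipped to the largest count found in any single reference.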

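Because precision only checks whether the candidate's $N$-grams appear in the references, shorter candidates tend to score higher. A brevity penalty is therefore introduced, which penalizes candidates that are shorter than the reference:
$b(x)=\left \{ \begin{array}{ll} 1 & \text{if } l_x > l_s \\ \exp(1-l_s/l_x) & \text{if } l_x \leq l_s \end{array} \right.$
where $l_x$ is the length of the generated sequence $x$ and $l_s$ is the length of the shortest reference.

BLEU computes the precision for N-grams of different lengths ($N=1,2, \cdots $) and combines them with a weighted geometric mean, multiplied by the brevity penalty:
$\mathrm{BLEU}(x)=b(x)\times \exp\left(\sum\limits_{N=1}^{N'}a_N\log P_N(x)\right)$
where $N'$ is the longest N-gram length considered and $a_N$ is the weight of each N-gram length, usually set to $\frac {1}{N'}$. BLEU takes values in [0,1], and a larger value indicates better generation quality. Note that BLEU only measures precision and ignores recall (that is, whether the N-grams of the references appear in the generated sequence).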
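The three formulas above can be sketched in plain Python as follows. This is only an illustrative sketch: the function names (`ngrams`, `clipped_precision`, `brevity_penalty`, `bleu`) are made up for this example, uniform weights $a_N = 1/N'$ are assumed, and the score is simply set to 0 when any precision is 0 (real toolkits usually apply smoothing instead).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """P_N(x): candidate n-gram counts clipped by the max count in any reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for w, c in Counter(ngrams(ref, n)).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(candidate, references):
    """b(x): 1 if the candidate is longer than the shortest reference."""
    l_x = len(candidate)
    l_s = min(len(r) for r in references)
    return 1.0 if l_x > l_s else math.exp(1 - l_s / l_x)

def bleu(candidate, references, max_n=4):
    """BLEU with uniform weights a_N = 1 / max_n (an assumption of this sketch)."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)

reference = "the cat is on the mat".split()
print(clipped_precision("the the the the".split(), [reference], 1))
print(bleu(reference, [reference]))
```

The first call shows the effect of clipping: "the" appears four times in the candidate but only twice in the reference, so only two of the four unigrams count as matches.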

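1.2 Attention Is All You Need

1.2.1 Historical significance of the Transformer

It introduced an architecture built entirely on self-attention, opening the way for non-recurrent sequence models.
It laid a solid foundation for the pre-trained models that followed.

1.2.2 Pre-trained models based on the Transformer

BERT (uses the encoder part of the Transformer)

GPT (uses the decoder part of the Transformer)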

ALBERT and other compact (tiny) BERT-style models

1.2.3 Paper structure

1.2.4 Transformer architecture

input --> input embedding --> positional encoding

for i in range(6):  # encoder stack

self attention --> layer normalization --> feed forward --> layer normalization

for i in range(6):  # decoder stack

masked self attention --> layer normalization --> encoder-decoder attention --> layer normalization --> feed forward --> layer normalization

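Borrowing the idea of residual networks, the output of every sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}$ is the function implemented by the sub-layer itself.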

1.2.5 Transformer details

This part is best read alongside the source code, which can be opened and run in Colab:
https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/transformer.ipynb
Chinese version:
https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/tutorials/text/transformer.ipynb

1.2.5.1 Positional encoding

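Word embeddings carry no positional information, so information about the position of each word in the sentence has to be added to the model.

It is computed as follows: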
$\Large angle\_rate_d = (min\_rate)^{d / d_{max}}$

angle_rate_exponents = np.linspace(0,1,depth//2)
angle_rates = min_rate**(angle_rate_exponents)

The angle rates range from 1 [rads/step] down to min_rate [rads/step], sampled at depth//2 points (the index $d$ in the formula above runs up to $d_{max}$).

positions = np.arange(num_positions) 
angle_rads = (positions[:, np.newaxis])*angle_rates[np.newaxis, :]

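This computes angle_rads.

Converting the values from radians to degrees and plotting them gives a figure (not reproduced here).

However, these raw angles make a poor model input: they are either unbounded or, once wrapped into a fixed range, discontinuous (and therefore not differentiable).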

So the following transformation is applied:

$\Large{PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})}$
$\Large{PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})}$
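As a minimal sketch of the two formulas, the table of encodings can be built in plain Python. The helper name `sinusoidal_table` is hypothetical (it is not the `positional_encoding` function of the tutorial source), and it interleaves sine and cosine exactly as the formulas are written.

```python
import math

def sinusoidal_table(num_positions, d_model):
    """Return a num_positions x d_model list of sinusoidal position encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            i = j // 2  # index of the (sin, cos) pair this column belongs to
            angle = pos / (10000 ** (2 * i / d_model))
            # even columns use sin, odd columns use cos
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_table(num_positions=50, d_model=8)
print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```

Every entry stays within [-1, 1] no matter how large the position is, which addresses the "unbounded" problem of the raw angles.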

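The reason for this particular form is that the encoding of a new position can be obtained from the encoding of an earlier position by a linear operation: for a fixed offset, the transformation depends only on the offset and not on the position itself. This follows from the angle-sum identities:

$\Large { \sin(a+b) = \sin(a)\cos(b)+\cos(a)\sin(b)\\ \cos(a+b) = \cos(a)\cos(b)-\sin(a)\sin(b) }$

${PE_{(pos + step, 2i)} = \sin(pos \cdot angle\_rate_i)\cos(step \cdot angle\_rate_i)+ \cos(pos \cdot angle\_rate_i)\sin(step \cdot angle\_rate_i)}$
${PE_{(pos+step, 2i+1)} = \cos(pos \cdot angle\_rate_i)\cos(step \cdot angle\_rate_i)- \sin(pos \cdot angle\_rate_i)\sin(step \cdot angle\_rate_i)}$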

The sine wave is odd (antisymmetric about the origin), and the cosine wave is even (symmetric about the y-axis).

plt.plot(np.dot(pos_encoding,update)[:,60])

Every column is a sinusoid.
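The linear-shift property of the sinusoidal encoding can be checked numerically. The sketch below uses hypothetical helper names (`pe_pair`, `shift`) and an arbitrarily chosen pair index and model width; it applies the angle-sum identities to the (sin, cos) pair at one position and compares the result with the pair computed directly at the shifted position.

```python
import math

def pe_pair(pos, rate):
    """(PE_{pos,2i}, PE_{pos,2i+1}) for one angle rate."""
    return (math.sin(pos * rate), math.cos(pos * rate))

def shift(pair, step, rate):
    """Move a (sin, cos) pair forward by `step` positions using only
    sin/cos of the offset, i.e. a rotation that does not depend on pos."""
    s, c = pair
    cos_step, sin_step = math.cos(step * rate), math.sin(step * rate)
    return (s * cos_step + c * sin_step, c * cos_step - s * sin_step)

rate = 1 / (10000 ** (2 * 3 / 512))  # assumed: pair index i=3, d_model=512
shifted = shift(pe_pair(10, rate), step=5, rate=rate)
direct = pe_pair(15, rate)
print(shifted, direct)  # the two pairs agree up to floating-point error
```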

1.2.5.2 Scaled Dot-Product Attention

scaled_dot_product_attention

Step by step
The attention mechanism takes three inputs: Q (query), K (key), and V (value).
It is computed as:
$\Large{Attention(Q, K, V) = softmax_k(\frac{QK^T}{\sqrt{d_k}}) V}$

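Where do Q, K, and V come from?
Let the input word vectors be x, and let the matrices used to compute Q, K, and V be $W_Q,W_K,W_V$:
$Q = x \times W_Q\\ K = x \times W_K\\ V = x \times W_V$
Q: the query vector.
K: the vector used to measure how relevant the queried item is to the other items.
V: the vector that represents the content of the queried item.
query shape = (..., seq_len_q, depth)
key shape = (..., seq_len_k, depth)
value shape = (..., seq_len_v, depth_v)
Q, K, and V must agree in every dimension except the last two; seq_len_k must equal seq_len_v, and Q and K must share the same depth.
Scaling
Consider the shape of softmax: large inputs fall in regions where the gradient is small. If the depth ($d_k$) is large, the dot products of Q and K become large in magnitude, the resulting gradients become small, and gradient propagation suffers.
For example, suppose the components of Q and K are independent with mean 0 and variance 1. Their dot product then has mean 0 and variance $d_k$. Dividing by the square root of $d_k$ (rather than by any other number) brings the scaled product back to mean 0 and variance 1, which gives a gentler softmax.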
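The variance argument can be checked with a small simulation. This is a rough sketch under the stated assumption (independent standard-normal components); the helper names and the choice of 2000 trials are arbitrary.

```python
import math
import random

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def dot_product_variances(d_k, trials=2000, seed=0):
    """Empirical variance of q.k before and after dividing by sqrt(d_k)."""
    rng = random.Random(seed)
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [r / math.sqrt(d_k) for r in raw]
    return sample_variance(raw), sample_variance(scaled)

raw_var, scaled_var = dot_product_variances(d_k=64)
print(raw_var, scaled_var)  # roughly d_k and roughly 1, respectively
```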
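Computing the mask matrices
- The look-ahead mask is a matrix whose strictly upper triangle is all 1s; the 1s mark positions that must not be visible. It hides the later tokens of a sequence and is used in the decoder: when predicting a word, the model should not see the words that come after it.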
```
def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)
```
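- Padding mask
  Positions where the input token ID is 0 (padding) get a 1 in the padding mask, so that the padded part of the input is not treated as real input in the computation.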
```
def create_padding_mask(seq):
    # 1.0 where the token ID is 0 (padding), 0.0 elsewhere
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

    # add extra dimensions to add the padding
    # to the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)
```
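To make the two masks concrete, here is a small pure-Python sketch for a single sequence (no batch or head dimensions). The helper names are hypothetical, and combining the two masks with an element-wise maximum is an assumption about how a decoder would merge them, chosen so that a position is hidden if either mask hides it.

```python
def look_ahead_mask(size):
    """1 above the diagonal (future positions), 0 elsewhere."""
    return [[1 if j > i else 0 for j in range(size)] for i in range(size)]

def padding_mask(token_ids, pad_id=0):
    """1 where the token is padding, 0 elsewhere."""
    return [1 if t == pad_id else 0 for t in token_ids]

def combine(look_ahead, padding):
    """Broadcast the padding row over every query row and take the max."""
    return [[max(cell, padding[j]) for j, cell in enumerate(row)]
            for row in look_ahead]

tokens = [7, 3, 9, 0, 0]  # two trailing padding tokens
combined = combine(look_ahead_mask(len(tokens)), padding_mask(tokens))
for row in combined:
    print(row)
```

Row i of the result lists which key positions are hidden from query position i: everything in the future, plus the padded positions.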
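Where the mask is applied
Note that the mask is applied to $\frac{QK^T}{\sqrt{d_k}}$, a tensor whose last two dimensions are (seq_len_q, seq_len_k). In the padding mask, each row corresponds to one sentence and each column to a token position within it, which is exactly the layout such a tensor needs; the mask is broadcast to the shape of the logits during the computation.

Positions where the mask is 1 are multiplied by -1e9 and added to the logits. Given the shape of softmax, the corresponding outputs are effectively 0, and so are their gradients.
The source for this part: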

def scaled_dot_product_attention(q, k, v, mask):
  """Calculate the attention weights.
  q, k, v must have matching leading dimensions.
  k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
  The mask has different shapes depending on its type (padding or look ahead)
  but it must be broadcastable for addition.
  
  Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k). Defaults to None.
    
  Returns:
    output, attention_weights
  """

  matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
  
  # scale matmul_qk
  dk = tf.cast(tf.shape(k)[-1], tf.float32)
  scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

  # add the mask to the scaled tensor.
  if mask is not None:
    scaled_attention_logits += (mask * -1e9)  

  # softmax is normalized on the last axis (seq_len_k) so that the scores
  # add up to 1.
  attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

  output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

  return output, attention_weights
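For readers without TensorFlow at hand, the same computation can be sketched on plain nested lists for a single head and a single sequence. This is an illustrative re-implementation, not the tutorial's code; the names `softmax` and `attention` and the toy inputs are assumptions of the example.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, mask=None):
    """q: seq_len_q x depth, k: seq_len_k x depth, v: seq_len_k x depth_v."""
    d_k = len(k[0])
    weights = []
    for i, q_i in enumerate(q):
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k)
                  for k_j in k]
        if mask is not None:
            # masked positions (mask == 1) receive a very large negative logit
            logits = [l + m * -1e9 for l, m in zip(logits, mask[i])]
        weights.append(softmax(logits))
    output = [[sum(w * v_j[c] for w, v_j in zip(w_row, v))
               for c in range(len(v[0]))]
              for w_row in weights]
    return output, weights

k = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
v = [[1.0, 0.0], [10.0, 0.0], [100.0, 5.0]]
q = [[0.0, 10.0, 0.0]]  # aligned with the second key

out, w = attention(q, k, v)                              # attends to the second value
out_masked, w_masked = attention(q, k, v, mask=[[0, 1, 0]])  # second key hidden
print(out, out_masked)
```

With the second key masked, its weight becomes 0 and the remaining weight is split evenly between the first and third keys, so the output is the average of the first and third values.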

1.2.5.3 Multi-Head Attention

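A few questions

Split versus multiple Dense layers
In actual code there is no need to split first and then pass each head's input through its own Dense layer. It is enough to pass Q, K, and V through one Dense layer each and split afterwards. The so-called split is just a reshape that introduces a new dimension, num_heads. Scaled dot-product attention does not need to be run head by head either: the attention function above only operates on the last two dimensions, so although the input and every intermediate result look like a single 4-D tensor processed as a whole, the effect is the same as splitting.
What is multi-head attention actually for?
Multiple heads play a role similar to multiple feature maps in a CNN. Each projection matrix is randomly initialized, and different initializations can lead the model to converge to different solutions, so one set of initial parameters can be thought of as one viewpoint. Each extra head adds another set of parameters to tune and acts somewhat like one more model in an ensemble, which is why more heads are not always better, and neither are fewer.

The source: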

class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super(MultiHeadAttention, self).__init__()
    self.num_heads = num_heads
    self.d_model = d_model
    
    assert d_model % self.num_heads == 0
    
    self.depth = d_model // self.num_heads
    
    self.wq = tf.keras.layers.Dense(d_model)
    self.wk = tf.keras.layers.Dense(d_model)
    self.wv = tf.keras.layers.Dense(d_model)
    
    self.dense = tf.keras.layers.Dense(d_model)
        
  def split_heads(self, x, batch_size):
    """Split the last dimension into (num_heads, depth).
    Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
    """
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
    return tf.transpose(x, perm=[0, 2, 1, 3])
    
  def call(self, v, k, q, mask):
    batch_size = tf.shape(q)[0]
    
    q = self.wq(q)  # (batch_size, seq_len, d_model)
    k = self.wk(k)  # (batch_size, seq_len, d_model)
    v = self.wv(v)  # (batch_size, seq_len, d_model)
    
    q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
    k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
    v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)
    
    # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
    # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
    scaled_attention, attention_weights = scaled_dot_product_attention(
        q, k, v, mask)
    
    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

    concat_attention = tf.reshape(scaled_attention, 
                                  (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

    output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)
        
    return output, attention_weights
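The reshape-and-transpose in `split_heads` can be hard to picture, so here is a sketch of the same idea on nested lists for one sequence (no batch dimension). The functions below are hypothetical stand-ins written for this illustration: each head simply receives a contiguous slice of the feature dimension, and merging concatenates the slices back.

```python
def split_heads(x, num_heads):
    """x: seq_len x d_model  ->  num_heads x seq_len x depth."""
    d_model = len(x[0])
    assert d_model % num_heads == 0
    depth = d_model // num_heads
    return [[row[h * depth:(h + 1) * depth] for row in x]
            for h in range(num_heads)]

def merge_heads(heads):
    """num_heads x seq_len x depth  ->  seq_len x d_model (the concat step)."""
    seq_len = len(heads[0])
    return [[value for head in heads for value in head[t]]
            for t in range(seq_len)]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]  # seq_len=2, d_model=4
heads = split_heads(x, num_heads=2)
print(heads)               # [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
print(merge_heads(heads))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```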

1.2.5.4 Point wise feed forward network

This part is simple: two Dense layers with a ReLU in between.
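The later code calls a `point_wise_feed_forward_network(d_model, dff)` helper that is not shown in this article. As a rough, framework-free sketch of what such a network computes (the names `dense`, `relu`, and `point_wise_ffn` and the toy weights are assumptions of this example), the same two layers are applied to every position independently:

```python
def dense(x, weights, bias):
    """x: length n_in; weights: n_in x n_out; bias: length n_out."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def relu(x):
    return [max(0.0, v) for v in x]

def point_wise_ffn(sequence, w1, b1, w2, b2):
    # d_model -> dff -> d_model, applied separately to each token vector
    return [dense(relu(dense(token, w1, b1)), w2, b2) for token in sequence]

# toy sizes: d_model=2, dff=3
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

seq = [[1.0, 1.0], [2.0, 0.0]]
out = point_wise_ffn(seq, w1, b1, w2, b2)
print(out)  # [[1.0, 1.0], [3.0, 1.0]]
```

Because each position is processed on its own, changing one token never affects the output at another position; mixing across positions is left entirely to the attention sub-layers.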

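1.2.6 The Transformer encoder and decoder

The Transformer follows the same general pattern as a standard sequence-to-sequence model with attention.

The input sentence is passed through N encoder layers, which produce an output for every word/token in the sequence.
The decoder attends to the encoder's output and to its own input (self-attention) to predict the next word.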

1.2.6.1 Encoder Layer

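Each encoder layer consists of the following sub-layers:

Multi-head attention (with padding mask)
Point wise feed forward networks.

Each sub-layer has a residual connection around it, followed by layer normalization. Residual connections help avoid the vanishing gradient problem in deep networks.

The output of each sub-layer is LayerNorm(x + Sublayer(x)). The normalization is done on the d_model (last) axis. There are N encoder layers in the Transformer.

LayerNorm versus BatchNorm
LN is a normalization method very similar to BN. The difference is that BN normalizes the same feature across different samples, whereas LN normalizes across the different features of a single sample. In settings where both can be used, BN often performs better than LN; one explanation is that normalizing a feature across different samples is less likely to lose information.

There are settings where BN cannot be used, however, such as small batch sizes or RNNs. LN is an option there: it gives a more stable model and has a regularizing effect. LN works with small batches and in RNNs because its normalization statistics are computed independently of the batch size.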
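The difference between the two normalizations comes down to which axis the mean and variance are taken over. The sketch below illustrates this on a tiny batch of feature vectors; it leaves out the learned scale and shift parameters that real layers have, and the function names are made up for the example.

```python
import math

def normalize(values, eps=1e-6):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    """Statistics over the features of each sample (rows)."""
    return [normalize(sample) for sample in batch]

def batch_norm(batch):
    """Statistics over the samples for each feature (columns)."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]
ln = layer_norm(batch)
bn = batch_norm(batch)

# LN gives the same result for a sample whatever else is in the batch;
# BN with a batch of one collapses every feature to 0.
print(layer_norm([batch[0]])[0] == ln[0])
print(batch_norm([batch[0]]))
```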

class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(EncoderLayer, self).__init__()

    self.mha = MultiHeadAttention(d_model, num_heads)
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    
    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    
  def call(self, x, training, mask):

    attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
    attn_output = self.dropout1(attn_output, training=training)
    out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)
    
    ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
    ffn_output = self.dropout2(ffn_output, training=training)
    out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)
    
    return out2

1.2.6.2 Decoder Layer

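Each decoder layer consists of the following sub-layers:

Masked multi-head attention (with look-ahead mask and padding mask)
Multi-head attention (with padding mask). V (value) and K (key) receive the encoder output as inputs. Q (query) receives the output from the masked multi-head attention sub-layer.
Point wise feed forward networks

Each sub-layer has a residual connection around it, followed by layer normalization. The output of each sub-layer is LayerNorm(x + Sublayer(x)). The normalization is done on the d_model (last) axis.

There are N decoder layers in the Transformer.

As Q receives the output from the decoder's first attention block and K receives the encoder output, the attention weights represent the importance given to the decoder's input based on the encoder's output. In other words, the decoder predicts the next word by looking at the encoder output and self-attending to its own output. See the scaled dot-product attention section above.

A decoder layer thus contains two multi-head attention sub-layers: the first computes self-attention over the decoder input and supplies Q to the second, whose V and K are the encoder output.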

class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(DecoderLayer, self).__init__()

    self.mha1 = MultiHeadAttention(d_model, num_heads)
    self.mha2 = MultiHeadAttention(d_model, num_heads)

    self.ffn = point_wise_feed_forward_network(d_model, dff)
 
    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    
    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    self.dropout3 = tf.keras.layers.Dropout(rate)
    
    
  def call(self, x, enc_output, training, 
           look_ahead_mask, padding_mask):
    # enc_output.shape == (batch_size, input_seq_len, d_model)

    attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
    attn1 = self.dropout1(attn1, training=training)
    out1 = self.layernorm1(attn1 + x)
    
    attn2, attn_weights_block2 = self.mha2(
        enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
    attn2 = self.dropout2(attn2, training=training)
    out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)
    
    ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
    ffn_output = self.dropout3(ffn_output, training=training)
    out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)
    
    return out3, attn_weights_block1, attn_weights_block2

1.2.6.3 Encoder

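It consists of:

Input Embedding
Positional Encoding
N encoder layers

The input is passed through an embedding, which is summed with the positional encoding. The result of this sum is the input to the encoder layers. The output of the encoder is the input to the decoder.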

class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               maximum_position_encoding, rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers
    
    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(maximum_position_encoding, 
                                            self.d_model)
    
    
    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) 
                       for _ in range(num_layers)]
  
    self.dropout = tf.keras.layers.Dropout(rate)
        
  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]
    
    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)
    
    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)
    
    return x  # (batch_size, input_seq_len, d_model)

1.2.6.4 Decoder

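It consists of:

Output Embedding
Positional Encoding
N decoder layers

The target is passed through an embedding, which is summed with the positional encoding. The result of this sum is the input to the decoder layers. The output of the decoder is the input to the final linear layer.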

class Decoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
               maximum_position_encoding, rate=0.1):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers
    
    self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
    self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
    
    self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate) 
                       for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(rate)
    
  def call(self, x, enc_output, training, 
           look_ahead_mask, padding_mask):

    seq_len = tf.shape(x)[1]
    attention_weights = {}
    
    x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]
    
    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                             look_ahead_mask, padding_mask)
      
      attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
      attention_weights['decoder_layer{}_block2'.format(i+1)] = block2
    
    # x.shape == (batch_size, target_seq_len, d_model)
    return x, attention_weights

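1.2.7 The complete Transformer

Besides the encoder and decoder, the output also passes through a final Dense layer that projects to the target vocabulary. In the code below this layer has no activation, so it returns logits; the softmax is applied downstream (for example inside the loss).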

class Transformer(tf.keras.Model):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, 
               target_vocab_size, pe_input, pe_target, rate=0.1):
    super(Transformer, self).__init__()

    self.encoder = Encoder(num_layers, d_model, num_heads, dff, 
                           input_vocab_size, pe_input, rate)

    self.decoder = Decoder(num_layers, d_model, num_heads, dff, 
                           target_vocab_size, pe_target, rate)

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)
    
  def call(self, inp, tar, training, enc_padding_mask, 
           look_ahead_mask, dec_padding_mask):

    enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)
    
    # dec_output.shape == (batch_size, tar_seq_len, d_model)
    dec_output, attention_weights = self.decoder(
        tar, enc_output, training, look_ahead_mask, dec_padding_mask)
    
    final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)
    
    return final_output, attention_weights

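1.2.8 Other parts of the source

1.2.8.1 Optimizer

Adam is used, with the learning rate given by the schedule below: it grows linearly over the first warmup_steps training steps and then decays in proportion to the inverse square root of the step number.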

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()
    
    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps
    
  def __call__(self, step):
    # cast so that an integer step counter also works with rsqrt
    step = tf.cast(step, tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)
    
    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
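The shape of this schedule is easy to inspect without TensorFlow. The function below is a plain-Python restatement of the same formula (the name `transformer_lr` and the default `d_model=512` are assumptions of the example); it expects step numbers starting at 1.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    step = float(step)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)
for s in (1, 2000, 4000, 8000, 16000):
    print(s, transformer_lr(s))
```

The two branches of the `min` meet at `step == warmup_steps`, which is where the learning rate peaks: halfway through warmup the rate is half the peak, and it takes four times the warmup length to decay back to half the peak.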

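1.2.8.2 Loss and metrics

The target sequences are padded, so the loss cannot be computed over them directly: the padded positions have to be masked out first.
Padding is needed because sentences differ in length; zeros are appended to make them the same length.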

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask
  
  return tf.reduce_sum(loss_)/tf.reduce_sum(mask)


def accuracy_function(real, pred):
  accuracies = tf.equal(real, tf.argmax(pred, axis=2))
  
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  accuracies = tf.math.logical_and(mask, accuracies)

  accuracies = tf.cast(accuracies, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(accuracies)/tf.reduce_sum(mask)
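The masking logic in `loss_function` can be illustrated on a single sequence with plain Python. Note the assumptions: this sketch works on per-position probabilities rather than logits, treats token ID 0 as padding, and uses a made-up function name and toy numbers.

```python
import math

def masked_cross_entropy(real, probs, pad_id=0):
    """real: target token IDs; probs: one probability list per position."""
    losses = [-math.log(p[t]) for t, p in zip(real, probs)]
    mask = [0.0 if t == pad_id else 1.0 for t in real]
    # average only over the non-padding positions
    return sum(l * m for l, m in zip(losses, mask)) / sum(mask)

real = [2, 1, 0, 0]  # last two positions are padding
probs = [[0.1, 0.1, 0.8],
         [0.2, 0.7, 0.1],
         [0.9, 0.05, 0.05],
         [0.3, 0.3, 0.4]]
loss = masked_cross_entropy(real, probs)
print(loss)
```

Whatever the model predicts at the padded positions has no influence on the result, and the average is taken over two tokens rather than four.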

1.2.8.3 Training

The target is divided into tar_inp and tar_real. tar_inp is passed as an input to the decoder. tar_real is that same input shifted by 1: at each location in tar_inp, tar_real contains the next token that should be predicted.

For example, sentence = "SOS A lion in the jungle is sleeping EOS"

tar_inp = "SOS A lion in the jungle is sleeping"

tar_real = "A lion in the jungle is sleeping EOS"
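The shift can be reproduced with two slices. The sketch below uses whitespace-split words in place of token IDs, which is a simplification for illustration only.

```python
sentence = "SOS A lion in the jungle is sleeping EOS".split()

tar_inp = sentence[:-1]   # everything except the last token
tar_real = sentence[1:]   # everything except the first token

# each decoder input token is paired with the token it should predict
pairs = list(zip(tar_inp, tar_real))
for given, target in pairs:
    print(f"{given:>8} -> {target}")
```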

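The Transformer is an auto-regressive model: it makes predictions one part at a time, and uses its output so far to decide what to do next.

During training this example uses teacher forcing (as in the text generation tutorial). Teacher forcing passes the true output to the next time step regardless of what the model predicts at the current time step.

As the Transformer predicts each word, self-attention allows it to look at the previous words in the input sequence to better predict the next word.

To prevent the model from peeking at the expected output, the model uses a look-ahead mask.

During training the decoder receives the entire translated sentence as input at once, and the look-ahead mask in the multi-head attention hides the parts the model should not see; the training output is likewise a complete sentence produced in a single pass.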

# The @tf.function trace-compiles train_step into a TF graph for faster
# execution. The function specializes to the precise shape of the argument
# tensors. To avoid re-tracing due to the variable sequence lengths or variable
# batch sizes (the last batch is smaller), use input_signature to specify
# more generic shapes.

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]

@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:]
  
  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)
  
  with tf.GradientTape() as tape:
    predictions, _ = transformer(inp, tar_inp, 
                                 True, 
                                 enc_padding_mask, 
                                 combined_mask, 
                                 dec_padding_mask)
    loss = loss_function(tar_real, predictions)

  gradients = tape.gradient(loss, transformer.trainable_variables)    
  optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
  
  train_loss(loss)
  train_accuracy(accuracy_function(tar_real, predictions))

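1.2.8.4 Evaluation

The following steps are used for evaluation:

Encode the input sentence using the Portuguese tokenizer (tokenizer_pt). Moreover, add the start and end token so the input is equivalent to what the model is trained with. This is the encoder input.
The decoder input is the start token == tokenizer_en.vocab_size.
Calculate the padding masks and the look-ahead masks.
The decoder then outputs the predictions by looking at the encoder output and its own output (self-attention).
Select the last word and calculate the argmax of that.
Concatenate the predicted word to the decoder input and pass it to the decoder.
In this approach, the decoder predicts the next word based on the previous words it predicted.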

def evaluate(inp_sentence):
  start_token = [tokenizer_pt.vocab_size]
  end_token = [tokenizer_pt.vocab_size + 1]
  
  # inp sentence is portuguese, hence adding the start and end token
  inp_sentence = start_token + tokenizer_pt.encode(inp_sentence) + end_token
  encoder_input = tf.expand_dims(inp_sentence, 0)
  
  # as the target is english, the first word to the transformer should be the
  # english start token.
  decoder_input = [tokenizer_en.vocab_size]
  output = tf.expand_dims(decoder_input, 0)
    
  for i in range(MAX_LENGTH):
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(
        encoder_input, output)
  
    # predictions.shape == (batch_size, seq_len, vocab_size)
    predictions, attention_weights = transformer(encoder_input, 
                                                 output,
                                                 False,
                                                 enc_padding_mask,
                                                 combined_mask,
                                                 dec_padding_mask)
    
    # select the last word from the seq_len dimension
    predictions = predictions[: ,-1:, :]  # (batch_size, 1, vocab_size)

    predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)
    
    # return the result if the predicted_id is equal to the end token
    if predicted_id == tokenizer_en.vocab_size+1:
      return tf.squeeze(output, axis=0), attention_weights
    
    # concatentate the predicted_id to the output which is given to the decoder
    # as its input.
    output = tf.concat([output, predicted_id], axis=-1)

  return tf.squeeze(output, axis=0), attention_weights
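Stripped of the TensorFlow details, the loop in `evaluate` is plain greedy decoding. The sketch below shows only that control flow; `greedy_decode` and the toy scoring function are hypothetical stand-ins, with `next_token_scores` playing the role of "run the model and take the scores for the last position".

```python
def greedy_decode(next_token_scores, start_id, end_id, max_length):
    """Repeatedly append the argmax token until end_id or max_length."""
    output = [start_id]
    for _ in range(max_length):
        scores = next_token_scores(output)
        predicted_id = max(range(len(scores)), key=scores.__getitem__)
        if predicted_id == end_id:
            break  # as in evaluate(), the end token itself is not appended
        output.append(predicted_id)
    return output

def toy_scores(prefix):
    """Toy 'model' over a 5-token vocabulary: always prefers last token + 1."""
    scores = [0.0] * 5
    scores[min(prefix[-1] + 1, 4)] = 1.0
    return scores

result = greedy_decode(toy_scores, start_id=0, end_id=4, max_length=10)
print(result)  # [0, 1, 2, 3]
```

Each iteration feeds the whole prefix back in, which is why inference is sequential even though training processes the full target in one pass.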

A translation example

def plot_attention_weights(attention, sentence, result, layer):
  fig = plt.figure(figsize=(16, 8))
  
  sentence = tokenizer_pt.encode(sentence)
  
  attention = tf.squeeze(attention[layer], axis=0)
  
  for head in range(attention.shape[0]):
    ax = fig.add_subplot(2, 4, head+1)
    
    # plot the attention weights
    ax.matshow(attention[head][:-1, :], cmap='viridis')

    fontdict = {'fontsize': 10}
    
    ax.set_xticks(range(len(sentence)+2))
    ax.set_yticks(range(len(result)))
    
    ax.set_ylim(len(result)-1.5, -0.5)
        
    ax.set_xticklabels(
        ['<start>']+[tokenizer_pt.decode([i]) for i in sentence]+['<end>'], 
        fontdict=fontdict, rotation=90)
    
    ax.set_yticklabels([tokenizer_en.decode([i]) for i in result 
                        if i < tokenizer_en.vocab_size], 
                       fontdict=fontdict)
    
    ax.set_xlabel('Head {}'.format(head+1))
  
  plt.tight_layout()
  plt.show()


def translate(sentence, plot=''):
  result, attention_weights = evaluate(sentence)
  
  predicted_sentence = tokenizer_en.decode([i for i in result 
                                            if i < tokenizer_en.vocab_size])  

  print('Input: {}'.format(sentence))
  print('Predicted translation: {}'.format(predicted_sentence))
  
  if plot:
    plot_attention_weights(attention_weights, sentence, result, plot)

1.2.9 References and recommended reading

https://blog.csdn.net/qq_22795223/article/details/105676186
https://github.com/huggingface/transformers
https://huggingface.co/transformers/index.html
https://jalammar.github.io/illustrated-transformer/
https://www.nowcoder.com/discuss/258321
https://blog.csdn.net/longxinchen_ml/article/details/86533005
