From Theory to Practice: A Complete Guide to Implementing the Transformer Model

Nothing in this world can take the place of persistence. Talent will not: nothing is more common than unsuccessful men with talent. Genius will not; unrewarded genius is almost a proverb. Education will not: the world is full of educated derelicts. Persistence and determination alone are omnipotent.

Attention models are a powerful tool in the NLP practitioner's toolbox. Like LSTMs, they can learn which words matter most in a phrase, sentence, or paragraph, and attention models do an even better job than LSTMs at mitigating the vanishing gradient problem. In the earlier post "The Attention Mechanism vs. Traditional Methods: A New Era for NMT", we already saw how to combine attention with LSTMs to build encoder-decoder models for tasks such as machine translation.

This post integrates attention into Transformers. I covered the theory in an earlier article (Transformer); for more detail, refer to the original paper, "Attention Is All You Need".

Besides embeddings, positional encoding, dense layers, and residual connections, attention is the key building block of Transformers.

A simplified illustration is shown below:

Dot product attention

[Figure: simplified illustration of dot product attention]

Basic dot product attention captures the interaction between every word (embedding) in the query and every word in the key. If the query and key belong to the same sentence, this constitutes bi-directional self-attention. In some cases, however, it is more appropriate to consider only the words that come before the current one. That case, especially when the query and key come from the same sentence, is called causal attention.

Dot product attention uses the dot product of Q and K directly.

Scaled dot product attention normalizes the dot product by dividing by the square root of the key dimension, which makes it numerically more stable, especially for high-dimensional vectors.

The computation follows the formula below:

Attention(Q, K, V) = softmax( (Q K^T) / sqrt(dk) + M ) V

  • Q is the matrix of queries
  • K is the matrix of keys
  • V is the matrix of values
  • M is the optional mask you choose to apply
  • dk is the dimension of the keys, which is used to scale everything down so the softmax doesn’t explode
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Dot Product Attention
def dot_product_attention(q, k, v, mask, scale=True):
    """
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead) 
    but it must be broadcastable for addition.

    Returns:
        attention_output,attention_weights
    """
    
    # Multiply q and k transposed.
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk with the square root of dk
    if scale:
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        matmul_qk = matmul_qk / tf.math.sqrt(dk)
    # add the mask to the scaled tensor.
    if mask is not None:
        matmul_qk += (1. - mask) * -1e9 

    # softmax is normalized on the last axis (seq_len_k) so that the scores add up to 1.
    attention_weights = tf.keras.activations.softmax(matmul_qk)

    # Multiply the attention weights by v
    attention_output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return attention_output, attention_weights
q = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]).astype(np.float32)
k = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]).astype(np.float32)
v = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]).astype(np.float32)
mask = np.array([[1.0, 0.0], [1.0, 1.0]])

ou, atw = dot_product_attention(q, k, v, mask)
ou = np.around(ou, decimals=2)
atw = np.around(atw, decimals=2)


print(f"Output:\n {ou}")
print(f"\nAttention weights:\n {atw}")

Output:
 [[0.   1.   0.  ]
 [0.85 0.15 0.85]]

Attention weights:
 [[1.   0.  ]
 [0.15 0.85]]

For causal attention, you can add a mask to the argument of the softmax function, as shown below:
[Figure: dot product attention with a causal (look-ahead) mask added inside the softmax]


def causal_dot_product_attention(q, k, v, scale=True):
    """ Masked dot product self attention.
    Args:
        q (numpy.ndarray): queries.
        k (numpy.ndarray): keys.
        v (numpy.ndarray): values.
    Returns:
        numpy.ndarray: masked dot product self attention tensor.
    """
    
    # Size of the penultimate dimension of the query
    mask_size = q.shape[-2]

    mask = tf.experimental.numpy.tril(tf.ones((mask_size, mask_size)))  
    output, atw = dot_product_attention(q, k, v, mask, scale=scale)
    return output
def display_tensor(t, name):
    # Assumed small helper (consistent with the output shown below): print a tensor's shape and values.
    print(f"{name} shape: {t.shape}\n\n{t.numpy()}")

result = causal_dot_product_attention(q, k, v)
display_tensor(result, 'result')

Output
result shape: (2, 3)

[[0.         1.         0.        ]
 [0.8496746  0.15032543 0.8496746 ]]

Masking

Two types of masks are useful when building a Transformer network: the padding mask and the look-ahead mask. Both help the softmax assign appropriate weights to the words in the input.

  • Padding mask

The padding mask is the mechanism the Transformer uses to ignore padded positions (usually zeros). When processing variable-length sequences, shorter sequences are padded up to a common length.

E.g.
[["Do", "you", "know", "when", "Jane", "is", "going", "to", "visit", "Africa"], 
 ["Jane", "visits", "Africa", "in", "September" ],
 ["Exciting", "!"]
]

get vectorized as:
[[ 71, 121, 4, 56, 99, 2344, 345, 1284, 15],
 [ 56, 1285, 15, 181, 545],
 [ 87, 600]
]

With the maximum length set to 5:
[[ 71, 121, 4, 56, 99],
 [ 2344, 345, 1284, 15, 0],
 [ 56, 1285, 15, 181, 545],
 [ 87, 600, 0, 0, 0],
]
Sequences longer than the maximum length of five are truncated, and zeros are appended to the truncated sequences so that all lengths are uniform. Likewise, sequences shorter than the maximum length are padded with zeros.

When these vectors pass through the attention layers, the zeros typically get washed out. Still, you want the network to attend only to the first few numbers of each vector (given by the sentence length), and this is where the padding mask comes in handy. First define a boolean mask that marks which elements participate (1) and which are ignored (0); you can build it by locating all the zeros in the sequence. Then use the mask to set the values corresponding to the zeros in the initial vector to something close to negative infinity (-1e9).

def create_padding_mask(decoder_token_ids):
    """
    Creates a matrix mask for the padding cells
    
    Arguments:
        decoder_token_ids (matrix like): matrix of size (n, m)
    
    Returns:
        mask (tf.Tensor): binary tensor of size (n, 1, m)
    """    
    seq = 1 - tf.cast(tf.math.equal(decoder_token_ids, 0), tf.float32)
    return seq[:, tf.newaxis, :]
x = tf.constant([[7., 6., 0., 0., 0.], [1., 2., 3., 0., 0.], [3., 0., 0., 0., 0.]])
print(create_padding_mask(x))

tf.Tensor(
[[[1. 1. 0. 0. 0.]]
 [[1. 1. 1. 0. 0.]]
 [[1. 0. 0. 0. 0.]]], shape=(3, 1, 5), dtype=float32)

If you multiply (1 - mask) by -1e9 and add it to the sample input sequences, the zeros are essentially set to negative infinity.

mask = create_padding_mask(x)
# Extend the dimension of x to match the dimension of the mask
x_extended = x[:, tf.newaxis, :]
print("Softmax of non-masked vectors:\n")
print(tf.keras.activations.softmax(x_extended))
print("\nSoftmax of masked vectors:\n")
print(tf.keras.activations.softmax(x_extended + (1 - mask) * -1.0e9))

[Output: softmax of the non-masked vectors vs. the masked vectors]

  • Look-ahead mask

The look-ahead mask lets the model pretend it has correctly predicted part of the output, and checks whether, without looking ahead, it can correctly predict the next output.

For example, if the expected correct output is [1, 2, 3] and you want to see whether, given that the model predicted the first value correctly, it can predict the second one, you can mask out the second and third values. So you feed in the masked sequence [1, -1e9, -1e9] and see whether it can produce [1, 2, -1e9].

def create_look_ahead_mask(sequence_length):
    """
    Returns a lower triangular matrix filled with ones
    
    Arguments:
        sequence_length (int): matrix size
    
    Returns:
        mask (tf.Tensor): binary tensor of size (sequence_length, sequence_length)
    """
    mask = tf.linalg.band_part(tf.ones((1, sequence_length, sequence_length)), -1, 0)
    return mask
x = tf.random.uniform((1, 3))
temp = create_look_ahead_mask(x.shape[1])
temp

<tf.Tensor: shape=(1, 3, 3), dtype=float32, numpy=
array([[[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]]], dtype=float32)>

Positional Encoding

In seq2seq tasks, the relative order of the data matters a great deal. When you train a sequence model such as an RNN, the inputs are fed to the network step by step, so the order information is naturally baked into the model. When you train a Transformer with multi-head attention, however, the data is fed into the model all at once. This speeds up training dramatically, but the model itself does not capture any information about the order of the inputs. This is where positional encoding comes in: it lets the Transformer retain the order information of its inputs.

PE(pos, 2i)   = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

  • d is the dimension of the word embedding and positional encoding.
  • pos is the position of the word.
  • k refers to each of the different dimensions of the positional encoding, with i equal to k // 2.

To build intuition, you can think of the positional encoding broadly as a feature that captures information about the relative positions of words. The positional encoding is added to the word embedding, and the sum is fed into the model. If you simply hard-coded positions into the embedding matrix, for example by adding 1 or an integer index, you would distort the semantic information in the embeddings. Using values generated by sine and cosine functions, which range between -1 and 1, avoids that distortion: the positional encoding does not significantly alter the word embedding, yet it injects positional information. The combination of sines and cosines lets the Transformer attend to the relative order of its inputs, which improves the model's performance.

def get_angles(position, k, d_model):
    """
    Computes a positional encoding for a word 
    
    Arguments:
        position (int): position of the word
        k (int): refers to each of the different dimensions in the positional encodings, with i equal to k//2
        d_model(int): the dimension of the word embedding and positional encoding
    
    Returns:
        _ (float): positional embedding value for the word
    """
    i = k // 2
    angle_rates = 1 / np.power(10000, (2 * i) / np.float32(d_model))
    return position * angle_rates


def positional_encoding(positions, d):
    """
    Precomputes a matrix with all the positional encodings 
    
    Arguments:
        positions (int): Maximum number of positions to be encoded 
        d (int): Encoding size 
    
    Returns:
        pos_encoding (tf.Tensor): A matrix of shape (1, position, d_model) with the positional encodings
    """
    # initialize a matrix angle_rads of all the angles 
    angle_rads = get_angles(np.arange(positions)[:, np.newaxis],
                          np.arange(d)[np.newaxis, :],
                          d)
  
    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
  
    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    pos_encoding = angle_rads[np.newaxis, ...]
    
    return tf.cast(pos_encoding, dtype=tf.float32)

Now you can visualize the positional encodings.

pos_encoding = positional_encoding(128, 256)
plt.pcolormesh(pos_encoding[0], cmap='RdBu')
plt.xlabel('d')
plt.xlim((0, 256))
plt.ylabel('Position')
plt.colorbar()
plt.show()

[Figure: heatmap of the positional encodings (position vs. encoding dimension d)]

Each row represents one positional encoding. Notice that no two rows are identical: you now have a unique positional encoding for every position.
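
If you want to check that claim programmatically, a small verification sketch (not part of the original notebook) will do:

# Confirm that all 128 position rows of the encoding matrix are distinct.
enc = positional_encoding(128, 256)[0].numpy()        # shape (128, 256)
print(np.unique(np.round(enc, 6), axis=0).shape[0])   # expected: 128 distinct encodings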

Encoder

The Transformer encoder layer pairs self-attention with a convolutional-neural-network style of processing (a position-wise feed-forward network) to speed up training, and it passes the K and V matrices on to the decoder.

[Figure: the Transformer encoder layer architecture]
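
The encoder and decoder layers below rely on a FullyConnected helper that is not defined elsewhere in this post. Here is a minimal sketch of it, assuming the standard two-layer position-wise feed-forward network (a ReLU Dense layer followed by a linear Dense layer back to the embedding dimension):

# Assumed helper (not shown in the original post): the position-wise feed-forward
# block used by the encoder and decoder layers, i.e. Dense(ReLU) -> Dense(linear).
def FullyConnected(embedding_dim, fully_connected_dim):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(fully_connected_dim, activation='relu'),  # (batch, seq_len, fully_connected_dim)
        tf.keras.layers.Dense(embedding_dim)  # (batch, seq_len, embedding_dim)
    ])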

class EncoderLayer(tf.keras.layers.Layer):
    """
    The encoder layer is composed by a multi-head self-attention mechanism,
    followed by a simple, positionwise fully connected feed-forward network. 
    This architecture includes a residual connection around each of the two 
    sub-layers, followed by layer normalization.
    """
    def __init__(self, embedding_dim, num_heads, fully_connected_dim,
                 dropout_rate=0.1, layernorm_eps=1e-6):
        
        super(EncoderLayer, self).__init__()

        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=embedding_dim,
            dropout=dropout_rate
        )

        self.ffn = FullyConnected(
            embedding_dim=embedding_dim,
            fully_connected_dim=fully_connected_dim
        )

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=layernorm_eps)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=layernorm_eps)

        self.dropout_ffn = tf.keras.layers.Dropout(dropout_rate)
    
    def call(self, x, training, mask):
        """
        Forward pass for the Encoder Layer
        Returns:
            encoder_layer_out (tf.Tensor): Tensor of shape (batch_size, input_seq_len, embedding_dim)
        """
        # Dropout is added by Keras automatically if the dropout parameter is non-zero during training
        self_mha_output = self.mha(x, x, x, mask)  # Self attention (batch_size, input_seq_len, embedding_dim)
        
        # skip connection
        # apply layer normalization on sum of the input and the attention output to get the  
        # output of the multi-head attention layer
        skip_x_attention = self.layernorm1(x + self_mha_output)  # (batch_size, input_seq_len, embedding_dim)

        # pass the output of the multi-head attention layer through a ffn
        ffn_output = self.ffn(skip_x_attention)  # (batch_size, input_seq_len, embedding_dim)
        
        # apply dropout layer to ffn output during training
        ffn_output = self.dropout_ffn(ffn_output, training=training)
        
        # apply layer normalization on sum of the output from multi-head attention (skip connection) and ffn output
        # to get the output of the encoder layer
        encoder_layer_out = self.layernorm2(skip_x_attention + ffn_output)  # (batch_size, input_seq_len, embedding_dim)
        
        return encoder_layer_out
      
      
class Encoder(tf.keras.layers.Layer):
    """
    The entire Encoder starts by passing the input to an embedding layer 
    and using positional encoding to then pass the output through a stack of
    encoder Layers
        
    """  
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size,
               maximum_position_encoding, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Encoder, self).__init__()

        self.embedding_dim = embedding_dim
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, self.embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding, 
                                                self.embedding_dim)


        self.enc_layers = [EncoderLayer(embedding_dim=self.embedding_dim,
                                        num_heads=num_heads,
                                        fully_connected_dim=fully_connected_dim,
                                        dropout_rate=dropout_rate,
                                        layernorm_eps=layernorm_eps) 
                           for _ in range(self.num_layers)]

        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        
    def call(self, x, training, mask):
        """
        Forward pass for the Encoder

        Returns:
            x (tf.Tensor): Tensor of shape (batch_size, seq_len, embedding dim)
        """
        seq_len = tf.shape(x)[1]
        
        # Pass input through the Embedding layer
        x = self.embedding(x)  # (batch_size, input_seq_len, embedding_dim)
        # Scale embedding by multiplying it by the square root of the embedding dimension
        x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))
        # Add the position encoding to embedding
        x += self.pos_encoding[:, :seq_len, :]
        # Pass the encoded embedding through a dropout layer
        x = self.dropout(x, training=training)
        # Pass the output through the stack of encoding layers 
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, embedding_dim)
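
As a quick sanity check, you can run a forward pass through the Encoder on random token ids. The hyperparameters and tensor sizes below are illustrative, not from the original post:

# Illustrative smoke test for the Encoder (hypothetical hyperparameters).
sample_encoder = Encoder(num_layers=2, embedding_dim=128, num_heads=4,
                         fully_connected_dim=512, input_vocab_size=8500,
                         maximum_position_encoding=1000)
temp_input = tf.random.uniform((2, 40), minval=0, maxval=8500, dtype=tf.int64)
enc_padding_mask = create_padding_mask(temp_input)
enc_out = sample_encoder(temp_input, training=False, mask=enc_padding_mask)
print(enc_out.shape)  # expected: (2, 40, 128)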

Decoder

[Figure: the Transformer decoder layer architecture]

class DecoderLayer(tf.keras.layers.Layer):
    """
    The decoder layer is composed by two multi-head attention blocks, 
    one that takes the new input and uses self-attention, and the other 
    one that combines it with the output of the encoder, followed by a
    fully connected block. 
    """
    def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1, layernorm_eps=1e-6):
        super(DecoderLayer, self).__init__()

        self.mha1 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=embedding_dim,
            dropout=dropout_rate
        )

        self.mha2 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=embedding_dim,
            dropout=dropout_rate
        )

        self.ffn = FullyConnected(
            embedding_dim=embedding_dim,
            fully_connected_dim=fully_connected_dim
        )

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=layernorm_eps)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=layernorm_eps)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=layernorm_eps)

        self.dropout_ffn = tf.keras.layers.Dropout(dropout_rate)
    
    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        """
        Forward pass for the Decoder Layer
        """

        # BLOCK 1
        # calculate self-attention and return attention scores as attn_weights_block1.
        # Dropout will be applied during training.
        mult_attn_out1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask, return_attention_scores=True)
        
        # apply layer normalization (layernorm1) to the sum of the attention output and the input
        Q1 = self.layernorm1(mult_attn_out1 + x)

        # BLOCK 2
        # calculate cross-attention (encoder-decoder attention) using the Q from the first block and K and V from the encoder output.
        # Dropout will be applied during training
        # Return attention scores as attn_weights_block2  
        mult_attn_out2, attn_weights_block2 = self.mha2(Q1, enc_output, enc_output, padding_mask, return_attention_scores=True)
        
        # apply layer normalization (layernorm2) to the sum of the attention output and the Q from the first block
        mult_attn_out2 = self.layernorm2(mult_attn_out2 + Q1)
                
        # BLOCK 3
        # pass the output of the second block through a ffn
        ffn_output = self.ffn(mult_attn_out2)
        
        # apply a dropout layer to the ffn output
        ffn_output = self.dropout_ffn(ffn_output, training=training)
        
        # apply layer normalization (layernorm3) to the sum of the ffn output and the output of the second block
        out3 = self.layernorm3(ffn_output + mult_attn_out2)

        return out3, attn_weights_block1, attn_weights_block2
    
    
class Decoder(tf.keras.layers.Layer):
    """
    The entire Decoder starts by passing the target input to an embedding layer 
    and using positional encoding to then pass the output through a stack of
    decoder Layers
        
    """ 
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, target_vocab_size,
               maximum_position_encoding, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Decoder, self).__init__()

        self.embedding_dim = embedding_dim
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, self.embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.embedding_dim)

        self.dec_layers = [DecoderLayer(embedding_dim=self.embedding_dim,
                                        num_heads=num_heads,
                                        fully_connected_dim=fully_connected_dim,
                                        dropout_rate=dropout_rate,
                                        layernorm_eps=layernorm_eps) 
                           for _ in range(self.num_layers)]
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
    
    def call(self, x, enc_output, training, 
           look_ahead_mask, padding_mask):
        """
        Forward  pass for the Decoder
        """

        seq_len = tf.shape(x)[1]
        attention_weights = {}
        
        # create word embeddings 
        x = self.embedding(x)
        # scale embeddings by multiplying by the square root of their dimension
        x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))
        # add positional encodings to word embedding
        x += self.pos_encoding[:, :seq_len, :]
        # apply a dropout layer to x
        x = self.dropout(x, training=training)
        # use a for loop to pass x through a stack of decoder layers and update attention_weights 
        for i in range(self.num_layers):
            # pass x and the encoder output through a stack of decoder layers and save the attention weights
            x, block1, block2 = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)
            # update attention_weights dictionary with the attention weights of block 1 and block 2
            attention_weights['decoder_layer{}_block1_self_att'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2_decenc_att'.format(i+1)] = block2

        return x, attention_weights
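
And a matching smoke test for the Decoder, reusing temp_input and enc_out from the Encoder sketch above (again with illustrative, hypothetical hyperparameters):

# Illustrative smoke test for the Decoder, reusing temp_input and enc_out from above.
sample_decoder = Decoder(num_layers=2, embedding_dim=128, num_heads=4,
                         fully_connected_dim=512, target_vocab_size=8000,
                         maximum_position_encoding=1000)
temp_target = tf.random.uniform((2, 30), minval=0, maxval=8000, dtype=tf.int64)
look_ahead_mask = create_look_ahead_mask(temp_target.shape[1])
dec_out, attn = sample_decoder(temp_target, enc_out, training=False,
                               look_ahead_mask=look_ahead_mask,
                               padding_mask=create_padding_mask(temp_input))
print(dec_out.shape)  # expected: (2, 30, 128)
print(attn['decoder_layer1_block2_decenc_att'].shape)  # (2, num_heads, 30, 40)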

The code for assembling the full Transformer is not shown here; if you are interested, message me privately for it.

How to Learn Large AI Models

Large models are becoming more and more widespread, and many people want to get into the field but cannot find a suitable way to learn.

As a veteran programmer, I also paid a lot of tuition and stepped into countless pitfalls when I first got into large models. Now I want to share my experience and knowledge to help you learn large AI models and get past the difficulties you meet along the way.

The materials below are things I painstakingly compiled and paid for. I am now sharing the key large-model resources for free, including the major industry white papers, an AGI large-model learning roadmap, video tutorials, and hands-on practice recordings; anyone who needs them can scan the code to get them.

1. AGI large-model learning roadmap

Many people have no direction when they start learning about large models; they study a bit here and a bit there, like a headless fly bumping around. I hope the roadmap shared below helps you learn large AI models.


2. Large AI model video tutorials


3. Large AI model books


4. Large AI model real-world case studies


5. Closing words

Learning about large AI models follows the current direction of technology. It not only brings more opportunities and challenges, but also helps us better understand and apply artificial intelligence. By studying large models we can dig into core concepts such as deep learning and neural networks and apply them to natural language processing, computer vision, speech recognition, and other fields. Mastering large models also makes our careers more competitive and positions us to lead in future technology.

Moreover, learning about large AI models can create more value for ourselves, opening up more jobs and side income and improving our quality of life.

Therefore, learning about large AI models is a promising investment of time and effort.
