Self-Attention layer: three weight matrices Q, K, V (q_w, k_w, v_w) each apply a linear (fully connected) layer to the input X.
The hidden size hidden_size is 768 and num_attention_heads is 12,
so attention_head_size = hidden_size / num_attention_heads = 64. max_length is the length of the input X, here 4 (assuming the input is x = np.array([2450, 15486, 15167, 2110])).
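As a quick sanity check of these dimensions, a minimal sketch (variable names here are only illustrative):

# dimension bookkeeping for the example input
import numpy as np

hidden_size = 768
num_attention_heads = 12
attention_head_size = hidden_size // num_attention_heads  # 768 / 12 = 64

x = np.array([2450, 15486, 15167, 2110])  # example token ids
max_length = len(x)                        # 4
print(attention_head_size, max_length)     # 64 4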
Three embedding layers: word_embedding (Token Embeddings in the figure), position_embedding (Position Embeddings in the figure), and token_type_embedding (Segment Embeddings in the figure). The input X is passed through each of the three embedding layers, and the three outputs are summed into a single embedding.
After the embedding sum comes layer_norm: standardization followed by an element-wise scale and shift.
# BERT embedding: three embedding lookups summed, then a layer norm
import numpy as np

def embedding_forward(self, x):
    # word_embeddings: vocab_size * hidden_size = 21128 * 768
    we = self.get_embedding(self.word_embeddings, x)  # shape: [max_len, hidden_size]
    # position embedding input is [0, 1, 2, 3]; position_embeddings: 512 * 768
    pe = self.get_embedding(self.position_embeddings, np.array(list(range(len(x)))))  # shape: [max_len, hidden_size]
    # token type embedding; for a single-segment input it is [0, 0, 0, 0]
    # token_type_embeddings: 2 * 768
    te = self.get_embedding(self.token_type_embeddings, np.array([0] * len(x)))  # shape: [max_len, hidden_size]
    embedding = we + pe + te
    print(embedding.shape)
    # the sum is followed by a layer norm; layer norm weight/bias shape: [768]
    embedding = self.layer_norm(embedding, self.embeddings_layer_norm_weight, self.embeddings_layer_norm_bias)  # shape: [max_len, hidden_size]
    return embedding
# Embedding lookup: an embedding layer is simply indexing into the embedding matrix,
# which is equivalent to multiplying a one-hot input by the embedding matrix
def get_embedding(self, embedding_matrix, x):
    return np.array([embedding_matrix[index] for index in x])
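To make the index-lookup / one-hot equivalence concrete, a minimal check (the toy matrix below is made up purely for illustration):

# one-hot multiplication gives the same rows as index lookup
embedding_matrix = np.random.randn(6, 3)   # toy vocab of 6, hidden size 3
x = np.array([2, 5, 0])                    # toy token ids

lookup = embedding_matrix[x]               # index lookup, shape [3, 3]
one_hot = np.eye(6)[x]                     # shape [3, 6]
matmul = one_hot @ embedding_matrix        # one-hot input times embedding matrix

print(np.allclose(lookup, matmul))         # True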
# Run all transformer layers: 12 stacked Transformer encoder layers
def all_transformer_layer_forward(self, x):
    # self.num_layers = 12, so single_transformer_layer_forward is called 12 times
    for i in range(self.num_layers):
        x = self.single_transformer_layer_forward(x, i)
    return x
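single_transformer_layer_forward itself is not shown in this excerpt; as a rough sketch of what one BERT encoder layer does (the weight names and the per-layer weight container below are placeholders, not the actual attribute names in the code):

# sketch of one encoder layer (placeholder weight names, for illustration only):
# self-attention -> residual + layer norm -> feed-forward with gelu -> residual + layer norm
def single_transformer_layer_forward_sketch(self, x, layer_index):
    w = self.weights[layer_index]  # hypothetical container for this layer's weights
    attn = self.self_attention(x, w["q_w"], w["q_b"], w["k_w"], w["k_b"], w["v_w"], w["v_b"],
                               w["attn_out_w"], w["attn_out_b"],
                               self.num_attention_heads, self.hidden_size)
    x = self.layer_norm(x + attn, w["ln1_w"], w["ln1_b"])         # residual + layer norm
    ff = gelu(np.dot(x, w["ffn_w1"].T) + w["ffn_b1"])             # intermediate: 768 -> 3072
    ff = np.dot(ff, w["ffn_w2"].T) + w["ffn_b2"]                  # output: 3072 -> 768
    x = self.layer_norm(x + ff, w["ln2_w"], w["ln2_b"])           # residual + layer norm
    return x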
# Layer norm: standardize each position's vector, then apply an element-wise scale (w) and shift (b)
def layer_norm(self, x, w, b):
    x = (x - np.mean(x, axis=1, keepdims=True)) / np.std(x, axis=1, keepdims=True)
    x = x * w + b
    return x
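A quick check of what the standardization step does before the scale and shift (the input here is random, for illustration only):

# after standardization, each row has mean ~0 and std ~1
x = np.random.randn(4, 768)
normed = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
print(normed.mean(axis=1).round(6), normed.std(axis=1).round(6))  # ~[0 0 0 0], ~[1 1 1 1]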
Self-attention implementation:
## self attention: the self-attention mechanism
# num_attention_heads: number of heads, typically 12 (8 is also common)
def self_attention(self, x, q_w, q_b, k_w, k_b, v_w,
                   v_b, attention_output_weight, attention_output_bias,
                   num_attention_heads, hidden_size):
    # Q, K, V linear projections: x @ W.T + b
    q = np.dot(x, q_w.T) + q_b  # shape: [max_len, hidden_size] = 4 * 768
    k = np.dot(x, k_w.T) + k_b  # shape: [max_len, hidden_size]
    v = np.dot(x, v_w.T) + v_b  # shape: [max_len, hidden_size]
    # hidden_size = 768, num_attention_heads = 12, attention_head_size = 64
    attention_head_size = int(hidden_size / num_attention_heads)
    # transpose_for_scores reshapes [max_len, hidden_size] (4*768) into
    # [num_attention_heads, max_len, attention_head_size] (12*4*64)
    q = self.transpose_for_scores(q, attention_head_size, num_attention_heads)
    k = self.transpose_for_scores(k, attention_head_size, num_attention_heads)
    v = self.transpose_for_scores(v, attention_head_size, num_attention_heads)
    # qk.shape = [num_attention_heads, max_len, max_len]
    qk = np.matmul(q, k.swapaxes(1, 2))
    qk /= np.sqrt(attention_head_size)  # scaled dot product
    qk = softmax(qk)
    # qkv.shape = [num_attention_heads, max_len, attention_head_size]
    qkv = np.matmul(qk, v)
    # merge the heads back: qkv.shape = [max_len, hidden_size]
    qkv = qkv.swapaxes(0, 1).reshape(-1, hidden_size)
    # output linear layer after the scaled dot-product attention; shape: [max_len, hidden_size]
    attention = np.dot(qkv, attention_output_weight.T) + attention_output_bias
    return attention
The multi-head mechanism lets the model attend to different positions, and the different heads can represent different relationships in different subspaces, which single-head attention generally cannot do. The 768-dimensional vectors are split into 12 matrices of 64 dimensions each, i.e., num_attention_heads = 12 and each head has dimension attention_head_size = hidden_size / num_attention_heads = 64, which is the d_k = 64 in the formula below.
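The formula referred to here is the scaled dot-product attention that self_attention computes per head:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \qquad d_k = \text{attention\_head\_size} = 64$$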
The transpose_for_scores function takes the max_length * hidden_size matrix produced by the linear layer, first reshapes it into max_length * num_attention_heads * attention_head_size (4 * 12 * 64), and then swapaxes(1, 0) turns it into num_attention_heads * max_length * attention_head_size (12 * 4 * 64). Putting the head axis first lets np.matmul compute attention for all 12 heads in one batched multiplication, and after the heads are merged back the output is again max_length * hidden_size, as shown in the small round-trip check after the function below.
# Multi-head split: attention_head_size = hidden_size / num_attention_heads
def transpose_for_scores(self, x, attention_head_size, num_attention_heads):
    # hidden_size = 768, num_attention_heads = 12, attention_head_size = 64
    max_len, hidden_size = x.shape  # x is q, k or v; x.shape = (4, 768)
    # reshape: 4 * 768 -> 4 * 12 * 64
    x = x.reshape(max_len, num_attention_heads, attention_head_size)
    # swapaxes exchanges axis 0 and axis 1
    x = x.swapaxes(1, 0)  # output shape = [num_attention_heads, max_len, attention_head_size] = 12 * 4 * 64
    return x
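A small round-trip check of the split-and-merge logic used above and in self_attention (the random matrix is just a stand-in for q, k or v):

# splitting into heads and merging back recovers the original [max_len, hidden_size] layout
max_len, hidden_size, num_heads, head_size = 4, 768, 12, 64
x = np.random.randn(max_len, hidden_size)
heads = x.reshape(max_len, num_heads, head_size).swapaxes(1, 0)  # [12, 4, 64]
merged = heads.swapaxes(0, 1).reshape(-1, hidden_size)           # [4, 768]
print(np.allclose(x, merged))  # True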
# softmax normalization along the last axis
def softmax(x):
    # subtracting the row max does not change the result but avoids overflow in np.exp
    x = x - np.max(x, axis=-1, keepdims=True)
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)
# gelu activation (tanh approximation)
import math

def gelu(x):
    return 0.5 * x * (1 + np.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * np.power(x, 3))))
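For reference, the tanh form above approximates the exact GELU, 0.5 * x * (1 + erf(x / sqrt(2))); a quick comparison on a few arbitrary values:

# the tanh approximation is very close to the exact erf-based GELU
for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    exact = 0.5 * v * (1 + math.erf(v / math.sqrt(2)))
    approx = 0.5 * v * (1 + math.tanh(math.sqrt(2 / math.pi) * (v + 0.044715 * v ** 3)))
    print(round(exact, 4), round(approx, 4))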
Positional encoding
To let the model make use of the order of the sequence, we need to inject information about the relative or absolute position of each token in the sequence. Here the positional encoding is added to the input embeddings at the bottom of the encoder and decoder stacks. Since the positional encoding has the same dimension d_model as the embeddings, the two can simply be added together. (In BERT, as in the code above, this is a learned position_embeddings table rather than the fixed sinusoidal encoding of the original Transformer.)
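As a sketch of the fixed sinusoidal scheme this paragraph describes (the original Transformer formulation; it is not the lookup-table approach used by the BERT code above):

# sinusoidal positional encoding: PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
#                                 PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
def sinusoidal_position_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # [max_len, 1]
    i = np.arange(0, d_model, 2)[None, :]          # [1, d_model/2]
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe  # same shape as the token embeddings, so the two can be added directly

print(sinusoidal_position_encoding(4, 768).shape)  # (4, 768)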
Full code:
GitHub code link