Self-Attention and Multi-Head Attention in the Transformer

重生之光头强下海当程序猿

Published 2024-09-28 09:17:12

Tags: numpy, transformer, self-attention

Article link: https://blog.csdn.net/m0_50758340/article/details/142610555

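Preface

Self-attention is a core component of the Transformer block and, as of this writing (January 6, 2024), a standard part of networks large and small. Both LLMs and Stable Diffusion use self-attention and Transformer blocks internally, so let's work through how it operates.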

Code download

The source code is available on GitHub at:
https://github.com/bubbliiiing/blip-pytorch
Copy the URL into your browser's address bar to open it.

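Self-Attention in detail

I. Structure of self-attention

The structure of self-attention is easiest to follow through the series of figures below.

First, suppose the input is a sequence of three units.

Each unit of the sequence is passed through three separate transforms (fully connected layers, for example) to obtain its Query, Key, and Value: the query vector, the key vector, and the value vector.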
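
As a minimal sketch of this step, the three transforms can be written as three bias-free matrix multiplications. The input vectors and the weight matrices `W_query`, `W_key`, and `W_value` below are illustrative assumptions, not values given in this article; they are chosen so that the projections reproduce the Query, Key, and Value matrices used in the NumPy code later on.

```python
import numpy as np

# Hypothetical input sequence: 3 units, each with 4 features (assumed values).
inputs = np.array([
    [1, 0, 1, 0],
    [0, 2, 0, 2],
    [1, 1, 1, 1],
])

# Hypothetical projection weights (4 -> 3), standing in for three
# bias-free fully connected layers.
W_query = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
W_key = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]])
W_value = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]])

# One row per input unit: row i holds the query/key/value vector of input-(i+1).
Query = inputs @ W_query
Key = inputs @ W_key
Value = inputs @ W_value

print(Query)
print(Key)
print(Value)
```

In a trained network these weights are learned rather than hand-picked; the only point here is that Query, Key, and Value are three different linear views of the same input.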

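To compute the output for input-1, we proceed as follows:
1. Take the dot product of input-1's query vector with the key vectors of input-1, input-2, and input-3. This gives three scores.
2. Apply softmax to the three scores to obtain the importance (attention weight) of input-1, input-2, and input-3.
3. Multiply each weight by the corresponding value vector of input-1, input-2, and input-3, then sum the results.
4. The sum is the output for input-1.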

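Following the example in the figure, the steps work out as:
1. The query vector of input-1 is [1, 0, 2]. Its dot products with the key vectors of input-1, input-2, and input-3 give three scores: 2, 4, 4.

2. Applying softmax to the three scores gives the importance of input-1, input-2, and input-3: approximately 0.06, 0.47, 0.47, rounded here to 0.0, 0.5, 0.5.

3. These weights multiply the value vectors of input-1, input-2, and input-3 ([1, 2, 3], [2, 8, 0], [2, 6, 3], the rows of the Value matrix in the code later in this article), and the results are summed.

4. This gives the output for input-1: [2.0, 7.0, 1.5] with the rounded weights, or about [1.94, 6.68, 1.60] without rounding.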
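
The worked example can be checked with a few lines of NumPy. This is a minimal sketch for a single query; the key and value vectors are the rows of the Key and Value matrices from the code later in this article, and the variable names are my own.

```python
import numpy as np

query_1 = np.array([1, 0, 2])                         # query vector of input-1
keys = np.array([[0, 1, 1], [4, 4, 0], [2, 3, 1]])    # one key per input
values = np.array([[1, 2, 3], [2, 8, 0], [2, 6, 3]])  # one value per input

# Step 1: dot product of the query with every key.
scores = keys @ query_1

# Step 2: softmax over the three scores (max subtracted for stability).
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()

# Steps 3-4: weighted sum of the value vectors.
output_1 = weights @ values

print(scores)
print(np.round(weights, 2))
print(np.round(output_1, 2))
```

The scores come out as 2, 4, 4. The unrounded weights are roughly 0.06, 0.47, 0.47, so the exact output differs slightly from the rounded [2.0, 7.0, 1.5].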

In the example above the sequence length is only 3. In practice sequences are far longer, but the computation is the same, and it is carried out with matrices.

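II. Self-attention as matrix operations
The actual matrix computation is shown in the figure below. I will walk through it with concrete matrices:

The input Query, Key, and Value are shown in the figure below: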

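First, matrix-multiply the query matrix by the transposed key matrix. Intuitively, each query vector is used to probe the features of the sequence, which yields a score for how important each part of the sequence is.

Each row of the result holds the contributions of input-1, input-2, and input-3 to the current input. We apply softmax to these contributions row by row.

Then matrix-multiply the softmaxed scores by the value matrix. Intuitively, this applies the importance of each part of the sequence back onto the sequence's values.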

The code for this matrix computation is shown below; feel free to try it yourself.

import numpy as np

def soft_max(z):
    # Row-wise softmax; subtracting the row maximum avoids overflow in exp.
    t = np.exp(z - np.max(z, axis=1, keepdims=True))
    a = t / np.sum(t, axis=1, keepdims=True)
    return a

Query = np.array([
    [1,0,2],
    [2,2,2],
    [2,1,3]
])

Key = np.array([
    [0,1,1],
    [4,4,0],
    [2,3,1]
])

Value = np.array([
    [1,2,3],
    [2,8,0],
    [2,6,3]
])

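# Scores: every query dotted with every key, shape [3, 3].
scores = Query @ Key.T
print(scores)
# Row-wise softmax turns the scores into attention weights.
scores = soft_max(scores)
print(scores)
# Weighted sum of the value vectors, one output row per input.
out = scores @ Value
print(out)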
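
The code above applies softmax to the raw dot products. The original Transformer formulation additionally divides the scores by the square root of the key dimension before the softmax, which keeps the scores from growing with the vector size. Below is a minimal sketch of that scaled variant on the same matrices; the function name `scaled_dot_product_attention` is my own choice, not something from this article's repository.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the maximum along the axis before exponentiating for stability.
    e = np.exp(z - np.max(z, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(query, key, value):
    # Hypothetical helper: scores are divided by sqrt(d_k) before the softmax.
    d_k = key.shape[-1]
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ value, weights

Query = np.array([[1, 0, 2], [2, 2, 2], [2, 1, 3]])
Key = np.array([[0, 1, 1], [4, 4, 0], [2, 3, 1]])
Value = np.array([[1, 2, 3], [2, 8, 0], [2, 6, 3]])

out, weights = scaled_dot_product_attention(Query, Key, Value)
print(weights.sum(axis=-1))  # every row of attention weights sums to 1
print(out)
```

With vectors of size 3 the scaling changes the numbers only mildly; it matters more at sizes such as 64, where unscaled dot products can push the softmax toward a one-hot distribution.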

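III. Multi-head attention

The figure below shows a schematic of multi-head attention:

The figure can be confusing at first. It becomes much clearer if we set it aside and start from the shapes of the matrices.

Suppose we have a feature sequence of shape [3, 768], meaning the sequence length is 3 and each unit of the sequence has 768 features.
To apply multiple heads, we simply split the last dimension of [3, 768]. If we want 12 heads, for example, the shape of the matrix becomes [3, 12, 64].

We then transpose [3, 12, 64] to move the 12 to the front, giving a feature tensor of shape [12, 3, 64]. From here on we ignore the 12 and treat it like a batch dimension, operating only on the [3, 64] part, which is exactly the attention computation described above.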

import numpy as np

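def soft_max(z):
    # Softmax over the last axis; subtracting the maximum avoids overflow in exp.
    t = np.exp(z - np.max(z, axis=-1, keepdims=True))
    a = t / np.sum(t, axis=-1, keepdims=True)
    return a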

values_length = 3
num_attention_heads = 12  # 12 heads of size 64, matching the [3, 12, 64] example in the text
hidden_size = 768
attention_head_size = hidden_size // num_attention_heads

Query = np.random.rand(values_length, hidden_size)
Key = np.random.rand(values_length, hidden_size)
Value = np.random.rand(values_length, hidden_size)

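# Split the last dimension into heads:
# [values_length, hidden_size] -> [values_length, num_attention_heads, attention_head_size]
Query = np.reshape(Query, [values_length, num_attention_heads, attention_head_size])
Key = np.reshape(Key, [values_length, num_attention_heads, attention_head_size])
Value = np.reshape(Value, [values_length, num_attention_heads, attention_head_size])

# Move the head axis to the front so it behaves like a batch axis.
Query = np.transpose(Query, [1, 0, 2])
Key = np.transpose(Key, [1, 0, 2])
Value = np.transpose(Value, [1, 0, 2])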

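# Per-head scores: [num_attention_heads, values_length, values_length].
scores = Query @ np.transpose(Key, [0, 2, 1])
print(np.shape(scores))
scores = soft_max(scores)
print(np.shape(scores))
# Per-head outputs: [num_attention_heads, values_length, attention_head_size].
out = scores @ Value
print(np.shape(out))
# Move the head axis back, then merge the heads into the last dimension.
out = np.transpose(out, [1, 0, 2])
print(np.shape(out))
out = np.reshape(out, [values_length, hidden_size])
print(np.shape(out))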
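
To make the "treat the head axis like a batch axis" idea concrete, the sketch below compares the batched computation with an explicit loop that runs plain single-head attention on each 64-column slice. It assumes 12 heads as in the text; the names `split_heads`, `batched`, and `looped` are my own, and the inputs are random data with a fixed seed.

```python
import numpy as np

def softmax(z):
    # Softmax over the last axis, with the maximum subtracted for stability.
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

seq_len, num_heads, hidden_size = 3, 12, 768
head_size = hidden_size // num_heads  # 64

rng = np.random.default_rng(0)
Q = rng.random((seq_len, hidden_size))
K = rng.random((seq_len, hidden_size))
V = rng.random((seq_len, hidden_size))

def split_heads(x):
    # [3, 768] -> [3, 12, 64] -> [12, 3, 64]
    return x.reshape(seq_len, num_heads, head_size).transpose(1, 0, 2)

# Batched version: the head axis is handled like a batch axis by matmul.
q, k, v = split_heads(Q), split_heads(K), split_heads(V)
batched = softmax(q @ k.transpose(0, 2, 1)) @ v  # [12, 3, 64]
batched = batched.transpose(1, 0, 2).reshape(seq_len, hidden_size)

# Loop version: single-head attention on each 64-column slice.
per_head = []
for h in range(num_heads):
    cols = slice(h * head_size, (h + 1) * head_size)
    weights = softmax(Q[:, cols] @ K[:, cols].T)  # [3, 3]
    per_head.append(weights @ V[:, cols])         # [3, 64]
looped = np.concatenate(per_head, axis=1)         # [3, 768]

print(np.allclose(batched, looped))
```

The two results agree, which shows that each head attends independently over its own slice of the features and that the final reshape simply concatenates the heads again.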