Self-Attention layer: three weight matrices Q, K, V (q_w, k_w, v_w) each apply a linear (fully connected) layer to the input X.
The hidden size hidden_size is 768 and num_attention_heads is 12,
so attention_head_size = hidden_size / num_attention_heads = 64. max_length is the length of the input X, here 4 (assuming the input is x = np.array([2450, 15486, 15167, 2110])).
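As a quick sanity check of these dimensions, a minimal sketch (variable names here are only illustrative):

# dimension bookkeeping for the example input
import numpy as np

hidden_size = 768
num_attention_heads = 12
attention_head_size = hidden_size // num_attention_heads  # 768 / 12 = 64

x = np.array([2450, 15486, 15167, 2110])  # example token ids
max_length = len(x)                        # 4
print(attention_head_size, max_length)     # 64 4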
Three embedding layers: word_embedding (Token Embeddings in the figure), position_embedding (Position Embeddings in the figure), and token_type_embedding (Segment Embeddings in the figure). The input X is passed through each of the three embedding layers, and the three outputs are summed into a single embedding.
After the embedding sum comes layer_norm: standardization followed by an element-wise scale and shift.
# BERT embedding: three embedding lookups summed, then a layer norm
import numpy as np

def embedding_forward(self, x):
    # word_embeddings: vocab_size * hidden_size = 21128 * 768
    we = self.get_embedding(self.word_embeddings, x)  # shape: [max_len, hidden_size]
    # position embedding input is [0, 1, 2, 3]; position_embeddings: 512 * 768
    pe = self.get_embedding(self.position_embeddings, np.array(list(range(len(x)))))  # shape: [max_len, hidden_size]
    # token type embedding; for a single-segment input it is [0, 0, 0, 0]
    # token_type_embeddings: 2 * 768
    te = self.get_embedding(self.token_type_embeddings, np.array([0] * len(x)))  # shape: [max_len, hidden_size]
    embedding = we + pe + te
    print(embedding.shape)
    # the sum is followed by a layer norm; layer norm weight/bias shape: [768]
    embedding = self.layer_norm(embedding, self.embeddings_layer_norm_weight, self.embeddings_layer_norm_bias)  # shape: [max_len, hidden_size]
    return embedding
# Embedding lookup: an embedding layer is simply indexing into the embedding matrix,
# which is equivalent to multiplying a one-hot input by the embedding matrix
def get_embedding(self, embedding_matrix, x):
    return np.array([embedding_matrix[index] for index in x])
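To make the index-lookup / one-hot equivalence concrete, a minimal check (the toy matrix below is made up purely for illustration):

# one-hot multiplication gives the same rows as index lookup
embedding_matrix = np.random.randn(6, 3)   # toy vocab of 6, hidden size 3
x = np.array([2, 5, 0])                    # toy token ids

lookup = embedding_matrix[x]               # index lookup, shape [3, 3]
one_hot = np.eye(6)[x]                     # shape [3, 6]
matmul = one_hot @ embedding_matrix        # one-hot input times embedding matrix

print(np.allclose(lookup, matmul))         # True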
# Run all transformer layers: 12 stacked Transformer encoder layers
def all_transformer_layer_forward(self, x):
    # self.num_layers = 12, so single_transformer_layer_forward is called 12 times
    for i in range(self.num_layers):
        x = self.single_transformer_layer_forward(x, i)
    return x
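single_transformer_layer_forward itself is not shown in this excerpt; as a rough sketch of what one BERT encoder layer does (the weight names and the per-layer weight container below are placeholders, not the actual attribute names in the code):

# sketch of one encoder layer (placeholder weight names, for illustration only):
# self-attention -> residual + layer norm -> feed-forward with gelu -> residual + layer norm
def single_transformer_layer_forward_sketch(self, x, layer_index):
    w = self.weights[layer_index]  # hypothetical container for this layer's weights
    attn = self.self_attention(x, w["q_w"], w["q_b"], w["k_w"], w["k_b"], w["v_w"], w["v_b"],
                               w["attn_out_w"], w["attn_out_b"],
                               self.num_attention_heads, self.hidden_size)
    x = self.layer_norm(x + attn, w["ln1_w"], w["ln1_b"])         # residual + layer norm
    ff = gelu(np.dot(x, w["ffn_w1"].T) + w["ffn_b1"])             # intermediate: 768 -> 3072
    ff = np.dot(ff, w["ffn_w2"].T) + w["ffn_b2"]                  # output: 3072 -> 768
    x = self.layer_norm(x + ff, w["ln2_w"], w["ln2_b"])           # residual + layer norm
    return x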
# Layer norm: standardize each position's vector, then apply an element-wise scale (w) and shift (b)
def layer_norm(self, x, w, b):
    x = (x - np.mean(x, axis=1, keepdims=True)) / np.std(x, axis=1, keepdims=True)
    x = x * w + b
    return x
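A quick check of what the standardization step does before the scale and shift (the input here is random, for illustration only):

# after standardization, each row has mean ~0 and std ~1
x = np.random.randn(4, 768)
normed = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
print(normed.mean(axis=1).round(6), normed.std(axis=1).round(6))  # ~[0 0 0 0], ~[1 1 1 1]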
Self-attention implementation:
## self attention: the self-attention mechanism
# num_attention_heads: number of heads, typically 12 (8 is also common)
def self_attention(self, x, q_w, q_b, k_w, k_b, v_w,
                   v_b, attention_output_weight, attention_output_bias,
                   num_attention_heads, hidden_size):
    # Q, K, V linear projections: x @ W.T + b
    q = np.dot(x, q_w.T) + q_b  # shape: [max_len, hidden_size] = 4 * 768
    k = np.dot(x, k_w.T) + k_b  # shape: [max_len, hidden_size]
    v = np.dot(x, v_w.T) + v_b  # shape: [max_len, hidden_size]
    # hidden_size = 768, num_attention_heads = 12, attention_head_size = 64
    attention_head_size = int(hidden_size / num_attention_heads)
    # transpose_for_scores reshapes [max_len, hidden_size] (4*768) into
    # [num_attention_heads, max_len, attention_head_size] (12*4*64)
    q = self.transpose_for_scores(q, attention_head_size, num_attention_heads)
    k = self.transpose_for_scores(k, attention_head_size, num_attention_heads)
    v = self.transpose_for_scores(v, attention_head_size, num_attention_heads)
    # qk.shape = [num_attention_heads, max_len, max_len]
    qk = np.matmul(q, k.swapaxes(1, 2))
    qk /= np.sqrt(attention_head_size)  # scaled dot product
    qk = softmax(qk)
    # qkv.shape = [num_attention_heads, max_len, attention_head_size]
    qkv = np.matmul(qk, v)
    # merge the heads back: qkv.shape = [max_len, hidden_size]
    qkv = qkv.swapaxes(0, 1).reshape(-1, hidden_size)
    # output linear layer after the scaled dot-product attention; shape: [max_len, hidden_size]
    attention = np.dot(qkv, attention_output_weight.T) + attention_output_bias
    return attention
The multi-head mechanism lets the model attend to different positions, and the different heads can represent different relationships in different subspaces, which single-head attention generally cannot do. The 768-dimensional vectors are split into 12 matrices of 64 dimensions each, i.e., num_attention_heads = 12 and each head has dimension attention_head_size = hidden_size / num_attention_heads = 64, which is the d_k = 64 in the formula below.
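The formula referred to here is the scaled dot-product attention that self_attention computes per head:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \qquad d_k = \text{attention\_head\_size} = 64$$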
The transpose_for_scores function takes the max_length * hidden_size matrix produced by the linear layer, first reshapes it into max_length * num_attention_heads * attention_head_size (4 * 12 * 64), and then swapaxes(1, 0) turns it into num_attention_heads * max_length * attention_head_size (12 * 4 * 64). Putting the head axis first lets np.matmul compute attention for all 12 heads in one batched multiplication, and after the heads are merged back the output is again max_length * hidden_size, as shown in the small round-trip check after the function below.
# Multi-head split: attention_head_size = hidden_size / num_attention_heads
def transpose_for_scores(self, x, attention_head_size, num_attention_heads):
    # hidden_size = 768, num_attention_heads = 12, attention_head_size = 64
    max_len, hidden_size = x.shape  # x is q, k or v; x.shape = (4, 768)
    # reshape: 4 * 768 -> 4 * 12 * 64
    x = x.reshape(max_len, num_attention_heads, attention_head_size)
    # swapaxes exchanges axis 0 and axis 1
    x = x.swapaxes(1, 0)  # output shape = [num_attention_heads, max_len, attention_head_size] = 12 * 4 * 64
    return x
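A small round-trip check of the split-and-merge logic used above and in self_attention (the random matrix is just a stand-in for q, k or v):

# splitting into heads and merging back recovers the original [max_len, hidden_size] layout
max_len, hidden_size, num_heads, head_size = 4, 768, 12, 64
x = np.random.randn(max_len, hidden_size)
heads = x.reshape(max_len, num_heads, head_size).swapaxes(1, 0)  # [12, 4, 64]
merged = heads.swapaxes(0, 1).reshape(-1, hidden_size)           # [4, 768]
print(np.allclose(x, merged))  # True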
# softmax normalization along the last axis
def softmax(x):
    # subtracting the row max does not change the result but avoids overflow in np.exp
    x = x - np.max(x, axis=-1, keepdims=True)
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)
# gelu activation (tanh approximation)
import math

def gelu(x):
    return 0.5 * x * (1 + np.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * np.power(x, 3))))
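For reference, the tanh form above approximates the exact GELU, 0.5 * x * (1 + erf(x / sqrt(2))); a quick comparison on a few arbitrary values:

# the tanh approximation is very close to the exact erf-based GELU
for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    exact = 0.5 * v * (1 + math.erf(v / math.sqrt(2)))
    approx = 0.5 * v * (1 + math.tanh(math.sqrt(2 / math.pi) * (v + 0.044715 * v ** 3)))
    print(round(exact, 4), round(approx, 4))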
Positional encoding
To let the model make use of the order of the sequence, we need to inject information about the relative or absolute position of each token in the sequence. Here the positional encoding is added to the input embeddings at the bottom of the encoder and decoder stacks. Since the positional encoding has the same dimension d_model as the embeddings, the two can simply be added together. (In BERT, as in the code above, this is a learned position_embeddings table rather than the fixed sinusoidal encoding of the original Transformer.)
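As a sketch of the fixed sinusoidal scheme this paragraph describes (the original Transformer formulation; it is not the lookup-table approach used by the BERT code above):

# sinusoidal positional encoding: PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
#                                 PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
def sinusoidal_position_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # [max_len, 1]
    i = np.arange(0, d_model, 2)[None, :]          # [1, d_model/2]
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe  # same shape as the token embeddings, so the two can be added directly

print(sinusoidal_position_encoding(4, 768).shape)  # (4, 768)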
Full code:
GitHub code link