BERT Source Code (PyTorch Version): BertSelfAttention

BertSelfAttention

1. The __init__ function

1) Three derived variables

self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size

all_head_size = 768
attention_head_size = 768 / 12 = 64 (the size of each head)
num_attention_heads = 12

2) Defining Q, K, V

self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)

config.hidden_size = 768 and self.all_head_size = 768; nn.Linear(a, b) specifies an input dimension of 768 and an output dimension of 768.
The Q, K, V projections therefore each output a tensor of shape (1, 128, 768), i.e. (batch_size, seq_len, hidden_size).
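Putting 1) and 2) together, here is a minimal sketch of the full __init__. The dropout line is an assumption taken from the standard HuggingFace implementation, since self.dropout is used on attention_probs in forward below:

import torch.nn as nn

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        # for bert-base: 768 / 12 = 64 per head, 12 * 64 = 768 in total
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        # each projection maps hidden_size (768) -> all_head_size (768)
        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        # dropout applied to attention_probs in forward (assumed from the HuggingFace source)
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)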


2. The transpose_for_scores function

Purpose: split Q, K, V into 12 heads for multi-head attention.
After the split, Q, K, V are stored in query_layer, key_layer, and value_layer, each with shape (1, 12, 128, 64).

In the forward function we have:
query_layer = self.transpose_for_scores(mixed_query_layer)
key_layer = self.transpose_for_scores(mixed_key_layer)
value_layer = self.transpose_for_scores(mixed_value_layer)


This step calls the transpose_for_scores function:
def transpose_for_scores(self, x):
    new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
    x = x.view(*new_x_shape)
    return x.permute(0, 2, 1, 3)
(0, 1, 2, 3) --> (0, 2, 1, 3):
(batch_size, seq_len, num_heads, attn_head_size) --> (batch_size, num_heads, seq_len, attn_head_size)
i.e. (1, 12, 128, 64)
(1, 128, 768)
    ↓
self.transpose_for_scores()
    ↓
(1, 12, 128, 64)
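A quick standalone check of this reshaping, written as a free function rather than a method (a sketch using random data and the bert-base sizes assumed above):

import torch

def transpose_for_scores(x, num_attention_heads=12, attention_head_size=64):
    # (batch_size, seq_len, hidden_size) -> (batch_size, num_heads, seq_len, head_size)
    new_x_shape = x.size()[:-1] + (num_attention_heads, attention_head_size)
    return x.view(*new_x_shape).permute(0, 2, 1, 3)

mixed_query_layer = torch.randn(1, 128, 768)            # (batch_size, seq_len, hidden_size)
query_layer = transpose_for_scores(mixed_query_layer)
print(query_layer.shape)                                 # torch.Size([1, 12, 128, 64])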

3. The forward function

    def forward(self, hidden_states, attention_mask):
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(hidden_states)
        mixed_value_layer = self.value(hidden_states)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)
		
Everything above has now been explained; next comes the computation of the attention scores.

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
        attention_scores = attention_scores + attention_mask
        # Normalize the attention scores to probabilities.
        attention_probs = F.softmax(attention_scores, dim=-1)
        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)




Next comes the computation of the context layer: the attention_probs computed above are multiplied by the value matrix V (value_layer, which is constructed in the same way as K):
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = torch.reshape(context_layer, new_context_layer_shape)
        return context_layer
		
attention_probs   *   value_layer     =    context_layer
(1, 12, 128, 128)   (1, 12, 128, 64)      (1, 12, 128, 64)
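To verify these shapes with random tensors (a standalone sketch of the score, softmax, and context steps; dropout omitted):

import math
import torch
import torch.nn.functional as F

query_layer = torch.randn(1, 12, 128, 64)
key_layer = torch.randn(1, 12, 128, 64)
value_layer = torch.randn(1, 12, 128, 64)

attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) / math.sqrt(64)
print(attention_scores.shape)                       # torch.Size([1, 12, 128, 128])

attention_probs = F.softmax(attention_scores, dim=-1)
context_layer = torch.matmul(attention_probs, value_layer)
print(context_layer.shape)                          # torch.Size([1, 12, 128, 64])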

Afterwards, some trimming is done on the shape of context_layer:

context_layer.size()[:-2] + (self.all_head_size,)

The permute(0, 2, 1, 3) moves the head dimension back behind seq_len; context_layer.size()[:-2] then keeps everything except the last two dimensions, i.e. (1, 128), and appending (768,) gives the final shape of context_layer:
context_layer shape: (1, 128, 768)
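The same trimming step in isolation (a sketch on a random tensor of the post-attention shape):

import torch

context_layer = torch.randn(1, 12, 128, 64)                       # (batch, heads, seq, head_size)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()    # (1, 128, 12, 64)
new_context_layer_shape = context_layer.size()[:-2] + (768,)      # (1, 128) + (768,)
context_layer = torch.reshape(context_layer, new_context_layer_shape)
print(context_layer.shape)                                         # torch.Size([1, 128, 768])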

Finally, context_layer is returned as the output of BertSelfAttention.
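As a wrap-up, here is an end-to-end shape walkthrough of the whole computation with random inputs (a self-contained sketch that assumes the bert-base sizes used throughout, an all-zero extended attention_mask of shape (1, 1, 1, 128) as precomputed in BertModel.forward(), and with dropout omitted):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, seq_len, hidden_size = 1, 128, 768
num_heads, head_size = 12, 64

hidden_states = torch.randn(batch_size, seq_len, hidden_size)
attention_mask = torch.zeros(batch_size, 1, 1, seq_len)    # 0 for real tokens, -10000.0 for padding

query = nn.Linear(hidden_size, hidden_size)
key = nn.Linear(hidden_size, hidden_size)
value = nn.Linear(hidden_size, hidden_size)

def transpose_for_scores(x):
    new_x_shape = x.size()[:-1] + (num_heads, head_size)
    return x.view(*new_x_shape).permute(0, 2, 1, 3)

query_layer = transpose_for_scores(query(hidden_states))   # (1, 12, 128, 64)
key_layer = transpose_for_scores(key(hidden_states))       # (1, 12, 128, 64)
value_layer = transpose_for_scores(value(hidden_states))   # (1, 12, 128, 64)

attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) / math.sqrt(head_size)
attention_probs = F.softmax(attention_scores + attention_mask, dim=-1)     # (1, 12, 128, 128)

context_layer = torch.matmul(attention_probs, value_layer)                 # (1, 12, 128, 64)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
context_layer = context_layer.view(batch_size, seq_len, hidden_size)
print(context_layer.shape)                                                  # torch.Size([1, 128, 768])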
