# step 7: build scaled dot-product self-attention
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, attn_mask):
    # shape of Q, K, V: (batch_size * num_head, seq_len, model_dim / num_head)
    d_k = Q.size(-1)  # per-head dimension, i.e. model_dim / num_head
    # attention scores, scaled by sqrt(d_k); math.sqrt avoids passing a Python int to torch.sqrt
    score = torch.bmm(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # attn_mask is True at positions that should be masked out
    masked_score = score.masked_fill(attn_mask, float('-inf'))
    prob = F.softmax(masked_score, dim=-1)
    context = torch.bmm(prob, V)
    return context
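
A minimal usage sketch follows; the dimensions and the causal mask below are illustrative assumptions for testing, not values from the original steps:

# illustrative values: batch_size=2, num_head=4, seq_len=5, model_dim=32
batch_size, num_head, seq_len, model_dim = 2, 4, 5, 32
head_dim = model_dim // num_head

Q = torch.rand(batch_size * num_head, seq_len, head_dim)
K = torch.rand(batch_size * num_head, seq_len, head_dim)
V = torch.rand(batch_size * num_head, seq_len, head_dim)

# example causal mask: True above the diagonal marks positions to be masked out
attn_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn_mask = attn_mask.unsqueeze(0).expand(batch_size * num_head, -1, -1)

context = scaled_dot_product_attention(Q, K, V, attn_mask)
print(context.shape)  # torch.Size([8, 5, 8])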