Transformer - Attention Mechanism: Comparing Different Implementations of Scaled Dot-Product Attention

二分掌柜的

Last modified: 2024-04-10 11:22:50


Category: Deep Learning. Tags: deep learning, pytorch, artificial intelligence

First published: 2024-02-28 19:13:10

Article link: https://blog.csdn.net/flyfish1986/article/details/136352707


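This article compares two PyTorch implementations of Scaled Dot-Product Attention, the attention mechanism used in the Transformer model: a hand-written module and the efficient built-in function. It also walks through the dot-product calculation and shows how each implementation handles the mask.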


flyfish

${\text{Attention}}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$

import torch
import torch.nn as nn
import torch.nn.functional as F

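class ScaledDotProductAttention(nn.Module):
    ''' Scaled Dot-Product Attention '''

    def __init__(self, temperature, attn_dropout=0.1):
        super().__init__()
        # temperature is the divisor applied to q, typically sqrt(d_k)
        self.temperature = temperature
        self.dropout = nn.Dropout(attn_dropout)

    def forward(self, q, k, v, mask=None):
        # q, k, v: (batch, heads, seq_len, d); swap the last two dims of k to get K^T
        attn = torch.matmul(q / self.temperature, k.transpose(-2, -1))

        if mask is not None:
            # Positions where mask == 0 get a large negative score,
            # so their softmax weight is effectively zero
            attn = attn.masked_fill(mask == 0, -1e9)

        # Normalize over the key dimension, then apply dropout to the weights
        attn = self.dropout(F.softmax(attn, dim=-1))
        output = torch.matmul(attn, v)

        return output, attn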

Implementation from the official PyTorch documentation
torch.nn.functional.scaled_dot_product_attention

import math

# Reference code from the PyTorch documentation; the built-in function is an
# efficient implementation equivalent to the following:
def scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None) -> torch.Tensor:
    L, S = query.size(-2), key.size(-2)
    # Default scale is 1 / sqrt(d_k), matching the formula above
    scale_factor = 1 / math.sqrt(query.size(-1)) if scale is None else scale
    # Additive bias: 0 where attention is allowed, -inf where it is blocked
    attn_bias = torch.zeros(L, S, dtype=query.dtype, device=query.device)
    if is_causal:
        assert attn_mask is None
        # Lower-triangular mask: position i may attend only to positions <= i
        temp_mask = torch.ones(L, S, dtype=torch.bool, device=query.device).tril(diagonal=0)
        attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))

    if attn_mask is not None:
        if attn_mask.dtype == torch.bool:
            # Boolean mask: True means the position takes part in attention
            attn_bias.masked_fill_(attn_mask.logical_not(), float("-inf"))
        else:
            # Float mask is added to the scores; out-of-place so the mask can broadcast
            attn_bias = attn_mask + attn_bias
    attn_weight = query @ key.transpose(-2, -1) * scale_factor
    attn_weight += attn_bias
    attn_weight = torch.softmax(attn_weight, dim=-1)
    attn_weight = torch.dropout(attn_weight, dropout_p, train=True)
    return attn_weight @ value
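
As a rough sanity check, the two approaches can be compared numerically. The sketch below is not part of the original post: `manual_attention` is a hypothetical function-style restatement of the module-style code (with the temperature assumed to be sqrt(d_k) and dropout left out so the result is deterministic), and the tensor sizes are arbitrary. It assumes PyTorch 2.0 or later, where `torch.nn.functional.scaled_dot_product_attention` is available.

```python
import math
import torch
import torch.nn.functional as F

def manual_attention(q, k, v, mask=None):
    # Hypothetical restatement of the module-style code:
    # temperature = sqrt(d_k), no dropout.
    temperature = math.sqrt(q.size(-1))
    attn = torch.matmul(q / temperature, k.transpose(-2, -1))
    if mask is not None:
        # mask == 0 marks blocked positions, filled with a large negative score
        attn = attn.masked_fill(mask == 0, -1e9)
    attn = F.softmax(attn, dim=-1)
    return torch.matmul(attn, v)

torch.manual_seed(0)
# Arbitrary illustrative sizes: batch=2, heads=4, seq_len=5, d_k=8
q = torch.randn(2, 4, 5, 8)
k = torch.randn(2, 4, 5, 8)
v = torch.randn(2, 4, 5, 8)

# Lower-triangular mask: 1 = may attend, 0 = blocked
causal = torch.tril(torch.ones(5, 5))

# Without a mask
out_manual = manual_attention(q, k, v)
out_builtin = F.scaled_dot_product_attention(q, k, v)
unmasked_match = torch.allclose(out_manual, out_builtin, atol=1e-5)

# With a causal mask: -1e9 fill on one side, is_causal=True (-inf bias) on the other
out_manual_causal = manual_attention(q, k, v, mask=causal)
out_builtin_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)
causal_match = torch.allclose(out_manual_causal, out_builtin_causal, atol=1e-5)

print(unmasked_match, causal_match)
```

The `-1e9` fill and the `-inf` bias are expected to agree here because `exp(-1e9)` underflows to zero in float32, and every row of a causal mask keeps at least one unblocked position.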

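Dot product

Each entry of a matrix product is the dot product of one row of the left matrix with one column of the right matrix. Calculation: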

(1, 2, 3) • (7, 9, 11) = 1×7 + 2×9 + 3×11 = 58
(1, 2, 3) • (8, 10, 12) = 1×8 + 2×10 + 3×12 = 64
(4, 5, 6) • (7, 9, 11) = 4×7 + 5×9 + 6×11 = 139
(4, 5, 6) • (8, 10, 12) = 4×8 + 5×10 + 6×12 = 154
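
The four dot products above can be reproduced in a few lines of plain Python. This is only an illustrative sketch: the names `Q` and `K` are assumptions chosen to mirror the attention formula (the original example does not label its vectors), with each row of `K` treated as one key vector, so the result corresponds to $QK^{T}$ before scaling.

```python
def dot(a, b):
    # Sum of element-wise products of two equal-length vectors
    return sum(x * y for x, y in zip(a, b))

# Assumed labels: two "query" rows and two "key" rows
Q = [[1, 2, 3], [4, 5, 6]]
K = [[7, 9, 11], [8, 10, 12]]

# scores[i][j] is the dot product of query row i with key row j
scores = [[dot(q_row, k_row) for k_row in K] for q_row in Q]
print(scores)  # [[58, 64], [139, 154]]
```

With tensors, the same 2×2 result comes from a single matrix multiplication of `Q` with the transpose of `K`, which is what `torch.matmul(q, k.transpose(-2, -1))` computes in both implementations.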