Deep Learning: A Comprehensive Understanding of Self-Attention

Article link: https://blog.csdn.net/ueke1/article/details/137251088

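The Transformer model makes heavy use of attention: self-attention, its masked variant (masked self-attention), and cross-attention.
BERT roughly corresponds to the encoder part of the Transformer, and GPT roughly corresponds to the decoder part.
The encoder encodes the input information; the decoder generates the output.
Multi-Head Attention adds a head dimension on top of Self-Attention, so that each head can attend to a different kind of relationship in the input.
Scaled Dot-Product Attention is the single-head form of the mechanism.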
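
The single-head scaled dot-product form and the extra head dimension can be sketched as follows. This is a minimal illustration and not the article's own code: the function name `scaled_dot_product_attention`, the `causal` flag, and the toy shapes are assumptions made for this example, and the learned Q/K/V projections are left out, so the input is used directly as q, k, and v.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, causal=False):
    """Illustrative helper. q, k, v: [..., t, d]; returns (output, weights)."""
    d = q.size(-1)
    # Dot product of every query with every key, scaled by sqrt(d)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)  # [..., t, t]
    if causal:
        t_q, t_k = scores.shape[-2:]
        # True above the diagonal marks "future" positions that must be hidden
        future = torch.ones(t_q, t_k, device=scores.device).triu(diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return torch.matmul(weights, v), weights


torch.manual_seed(0)
n, t, e, num_heads = 2, 3, 4, 2
x = torch.randn(n, t, e)

# Single head: attention over the full e-dimensional vectors
out, weights = scaled_dot_product_attention(x, x, x)

# Multi-head: split e into num_heads chunks, [n,t,e] -> [n,h,t,e/h]
xh = x.view(n, t, num_heads, e // num_heads).transpose(1, 2)
out_h, weights_h = scaled_dot_product_attention(xh, xh, xh, causal=True)

# Merge the heads back: [n,h,t,e/h] -> [n,t,e]
merged = out_h.transpose(1, 2).reshape(n, t, e)
print(out.shape, weights_h.shape, merged.shape)
```

With `causal=True`, every position receives zero weight on the positions after it, which is the masking used in masked self-attention. Because the function only touches the last two dimensions, the same code serves both the single-head and the multi-head case.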

Self-Attention

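Each word is mapped to three vectors, Q (query), K (key), and V (value), by linear transformations (matrix multiplications).
The Q of each word attends to the K of every word, which in practice is a dot product. Because each word always attends to itself as well, the mechanism is called self-attention.
The attention step yields relevance scores, which are then applied to V to obtain the final output.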
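
The three steps above can be written compactly as follows. This is a sketch under assumed notation: $X$ is the matrix of input word vectors for one sentence, $W_Q$, $W_K$, $W_V$ are the three projection matrices (bias terms omitted), and the softmax is applied row by row.

```latex
\begin{align}
Q &= X W_Q, \qquad K = X W_K, \qquad V = X W_V \\
\alpha &= \operatorname{softmax}\left(Q K^{\top}\right) \\
\text{Output} &= \alpha V
\end{align}
```

The scaled dot-product variant differs only in dividing $Q K^{\top}$ by $\sqrt{d_k}$, the square root of the key dimension, before the softmax.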

import torch
from torch import nn


class SelfAttention(nn.Module):
    def __init__(self, hidden_size) -> None:
        super(SelfAttention, self).__init__()
        self.q_layer = nn.Sequential(
            nn.Linear(in_features=hidden_size, out_features=hidden_size)
        )
        self.k_layer = nn.Sequential(
            nn.Linear(in_features=hidden_size, out_features=hidden_size)
        )
        self.v_layer = nn.Sequential(
            nn.Linear(in_features=hidden_size, out_features=hidden_size)
        )

    def forward(self, x):
        """
        Forward pass.
        :param x: [n,t,e] n texts, t time steps, an e-dimensional vector per time step
        :return: [n,t,e]
        """
        # 1. Compute q, k, v
        q = self.q_layer(x)  # [n,t,e]
        k = self.k_layer(x)  # [n,t,e]
        v = self.v_layer(x)  # [n,t,e]

        # 2. Relevance between q and k (the scoring function F); plain dot product, no scaling
        scores = torch.matmul(q, torch.permute(k, dims=(0, 2, 1)))  # [n,t,t]

        # 3. Convert the scores into weights
        alpha = torch.softmax(scores, dim=2)  # [n,t,t]
        # 4. Weighted sum of the values
        v = torch.matmul(alpha, v)  # [n,t,e]
        return v
    

@torch.no_grad()
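def t0():
    token_id = torch.tensor([
        [1, 3, 5],  # one sample, three time steps
        [1, 6, 3]
    ])
    # Static feature vectors (as with Word2Vec), taken from an embedding layer
    emb_layer = nn.Embedding(num_embeddings=10, embedding_dim=4)
    x1 = emb_layer(token_id)  # [2,3,4]
    print(x1[0][0])  # vector of the first token of the first sample
    print(x1[1][0])  # vector of the first token of the second sample (same token id, same vector)
    print('='*100)

    # Features extracted with self-attention
    att = SelfAttention(hidden_size=4)
    x3 = att(x1)
    print(x3[0][0])  # vector of the first token of the first sample
    print(x3[1][0])  # vector of the first token of the second sample (context differs, so it generally differs)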
    print(x3[1]