Document Classification with a Transformer (PyTorch Implementation)

Before reading this, you should be familiar with how the Transformer works; see the companion post "Transformer详解" (a detailed walkthrough of the Transformer) first.

Model structure:

  • Input layer: padded token ids, [batch_size, sentence_len]
  • Word embedding layer: [batch_size, sentence_len, embed_size]
  • position_embedding: [batch_size, sentence_len, embed_size]
  • Encoder layers
  • Fully connected (classification) layer
position_embedding
class Positional_Encoding(nn.Module):
    def __init__(self, embed, pad_size, dropout, device):
        super(Positional_Encoding, self).__init__()
        self.device = device
        # Fixed sinusoidal table of shape [pad_size, embed]: even columns get sin, odd columns get cos
        self.pe = torch.tensor(
            [[pos / (10000.0 ** (i // 2 * 2.0 / embed)) for i in range(embed)] for pos in range(pad_size)])
        self.pe[:, 0::2] = torch.sin(self.pe[:, 0::2])
        self.pe[:, 1::2] = torch.cos(self.pe[:, 1::2])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Broadcast-add the non-trainable position table to the word embeddings
        out = x + self.pe.to(self.device)  # torch.Size([128, 32, 300])
        out = self.dropout(out)
        return out
  • Injecting order information: in the Transformer, positional encoding (Positional Encoding) represents where each element sits in the input sequence. Because the model is built entirely on attention and has no built-in notion of order, this information has to be added explicitly; the positional encoding is the mechanism that injects it into the input sequence.
  • Preserving position through the layers: each position gets a unique encoding that is added to the input representation and processed together with it, so even after several attention layers the relative position information is not lost.
  • Periodicity: the encoding is built from sine and cosine functions, which gives it a periodic structure and lets it represent sequences of arbitrary length, as written out below.
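
For reference, the table built in __init__ is the standard sinusoidal encoding from "Attention Is All You Need" (pos is the position in the sentence, d is the embedding dimension embed):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right),\qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
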
Encoder
class Encoder(nn.Module):
    def __init__(self, dim_model, num_head, hidden, dropout):
        super(Encoder, self).__init__()
        self.attention = Multi_Head_Attention(dim_model, num_head, dropout)
        self.feed_forward = Position_wise_Feed_Forward(dim_model, hidden, dropout)

    def forward(self, x):  # torch.Size([128, 32, 300])
        out = self.attention(x)  # torch.Size([128, 32, 300])
        out = self.feed_forward(out)  # torch.Size([128, 32, 300])
        return out
Multi_Head_Attention
  1. Q, K, V in the attention mechanism

    • Query: poses the question, i.e. which information we want to retrieve.
    • Key: identifies information, one key for each position of the input sequence.
    • Value: stores the actual information associated with each position of the input sequence.
  2. Assume the input tensor x has shape [batch_size, seq_length, dim_model], where:

    • batch_size: the batch size (how many samples are processed at once).
    • seq_length: the sequence length (e.g. the length of a sentence).
    • dim_model: the model dimension (e.g. the embedding dimension).
  3. The linear projection self.fc_Q maps the input x to the query tensor Q

    Q = self.fc_Q(x)
    

    At this point Q still has shape [batch_size, seq_length, dim_model].

  4. Splitting into heads

    To compute multi-head attention, the dim_model dimension is split into num_head heads, each of dimension dim_head, where dim_head = dim_model // num_head, so dim_model = num_head * dim_head.

    The shape we are after is [batch_size, seq_length, num_head, dim_head], but for the subsequent computation the tensor is reshaped one step further.

  5. Reshaping

    Q.view(batch_size * num_head, -1, dim_head) turns Q from [batch_size, seq_length, dim_model] into [batch_size * num_head, seq_length, dim_head]. Concretely:

    1. Shape adjustment
      • batch_size and num_head are merged into a single dimension: batch_size * num_head.
      • dim_model is split into num_head * dim_head.
    2. Meaning of the dimensions
      • batch_size * num_head: each attention head is treated as an independent sample.
      • seq_length: the sequence length is unchanged.
      • dim_head: the dimension of each head.

    The final shape [batch_size * num_head, seq_length, dim_head] is what the multi-head attention computation runs on in parallel, so that every head can compute its attention independently (see the sketch after this list for an explicit version of the split).

    With this layout each head computes its own attention weights, which improves both computational efficiency and the model's representational capacity.

  6. Scaling factor:

    $$scale = \frac{1}{\sqrt{dim\_head}}$$

    Why the scaling factor matters:

    • It keeps the dot products from growing too large.
    • It stabilizes the gradients.
    • It improves performance: with scaling, the attention scores are distributed more evenly and the softmax output is smoother, so the model learns useful attention weights more stably during training.
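
A note on step 5: the Q.view(batch_size * num_head, -1, dim_head) call in the code below reshapes the projected tensor by reinterpreting its memory directly. The more explicit split used in many Transformer implementations first reshapes to [batch_size, seq_length, num_head, dim_head] and then transposes. The hypothetical sketch below only illustrates those shapes and the scaling factor from step 6; it is not part of the model code in this post.

import torch

batch_size, seq_length, dim_model, num_head = 128, 32, 300, 5
dim_head = dim_model // num_head   # 60
scale = dim_head ** -0.5           # the 1/sqrt(dim_head) factor from step 6

Q = torch.randn(batch_size, seq_length, dim_model)   # stands in for fc_Q(x)

# Explicit head split: [128, 32, 300] -> [128, 32, 5, 60] -> [128, 5, 32, 60] -> [640, 32, 60]
Q_heads = (Q.view(batch_size, seq_length, num_head, dim_head)
             .permute(0, 2, 1, 3)
             .reshape(batch_size * num_head, seq_length, dim_head))
print(Q_heads.shape)  # torch.Size([640, 32, 60])
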
class Multi_Head_Attention(nn.Module):
    def __init__(self, dim_model, num_head, dropout=0.0):
        super(Multi_Head_Attention, self).__init__()
        self.num_head = num_head
        assert dim_model % num_head == 0
        self.dim_head = dim_model // self.num_head
        self.fc_Q = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_K = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_V = nn.Linear(dim_model, num_head * self.dim_head)
        self.attention = Scaled_Dot_Product_Attention()
        self.fc = nn.Linear(num_head * self.dim_head, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        # print("Multi_Head_Attention--x", x.shape)  # [128, 32, 300]
        batch_size = x.size(0)
        Q = self.fc_Q(x)  # [128, 32, 300]
        K = self.fc_K(x)  # [128, 32, 300]
        V = self.fc_V(x)  # [128, 32, 300]
        Q = Q.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        K = K.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        V = V.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        scale = K.size(-1) ** -0.5  # scaling factor 1/sqrt(dim_head)
        context = self.attention(Q, K, V, scale)  # [640, 32, 60]
        context = context.view(batch_size, -1, self.dim_head * self.num_head)  # [128, 32, 300]
        out = self.fc(context)  # [128, 32, 300]
        out = self.dropout(out)  # [128, 32, 300]
        out = out + x  # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)  # [128, 32, 300]
        return out
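
A quick shape check (a hypothetical snippet, assuming Scaled_Dot_Product_Attention from the next section is also defined): the module maps a [batch_size, pad_size, dim_model] input back to the same shape, which is what allows the encoder blocks to be stacked.

mha = Multi_Head_Attention(dim_model=300, num_head=5, dropout=0.1)
x = torch.randn(128, 32, 300)   # [batch_size, pad_size, dim_model]
print(mha(x).shape)             # torch.Size([128, 32, 300]), same shape as the input
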
Scaled_Dot_Product_Attention
class Scaled_Dot_Product_Attention(nn.Module):
    '''Scaled Dot-Product Attention '''

    def __init__(self):
        super(Scaled_Dot_Product_Attention, self).__init__()

    def forward(self, Q, K, V, scale=None):
        # Dot product of the query tensor Q and the key tensor K
        attention = torch.matmul(Q, K.permute(0, 2, 1))  # [640, 32, 32]
        if scale:
            attention = attention * scale
        # if mask:  # TODO change this
        #     attention = attention.masked_fill_(mask == 0, -1e9)
        # Apply softmax to the attention scores to obtain normalized attention weights
        attention = F.softmax(attention, dim=-1)  # [640, 32, 32]
        # Multiply the attention weights with the value tensor V to obtain the context vectors
        context = torch.matmul(attention, V)  # [640, 32, 60]
        return context

Dot-product attention scores:

$$attention\_scores = Q \cdot K^{T}$$

After scaling and softmax, the weights are applied to the values:

context = torch.matmul(attention, V)  # [640, 32, 60]

This multiplies the attention weights with the value tensor V to obtain the context vectors. The attention weights have shape [batch_size, len_Q, len_K], the value tensor V has shape [batch_size, len_V, dim_V], and the resulting context tensor has shape [batch_size, len_Q, dim_V].
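
Combined with the scaling and the softmax, the class computes the standard scaled dot-product attention (with d_k = dim_head):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$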

Position_wise_Feed_Forward
class Position_wise_Feed_Forward(nn.Module):
    def __init__(self, dim_model, hidden, dropout=0.0):
        super(Position_wise_Feed_Forward, self).__init__()
        self.fc1 = nn.Linear(dim_model, hidden)
        self.fc2 = nn.Linear(hidden, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):  # x.shape [128, 32, 300]
        out = self.fc1(x)  # [128, 32, 1024]
        out = F.relu(out)  # [128, 32, 1024]
        out = self.fc2(out)  # [128, 32, 300]
        out = self.dropout(out)  # [128, 32, 300]
        out = out + x  # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)
        return out

Role of this sublayer:

  • Increases the model's expressive power: the nonlinear transformation lets the model learn more complex features.
  • Keeps gradients flowing: the residual connection helps gradients propagate and prevents them from vanishing.
  • Improves training stability and speed: layer normalization keeps each layer's output distribution stable during training.
  • Improves generalization: dropout reduces overfitting and improves performance on unseen data.
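
Written out, the sublayer above computes:

$$\mathrm{FFN}(x) = \mathrm{ReLU}(x W_1 + b_1)\, W_2 + b_2,\qquad \mathrm{out} = \mathrm{LayerNorm}\big(x + \mathrm{Dropout}(\mathrm{FFN}(x))\big)$$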

Full code:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import copy


class Config(object):
    """配置参数"""

    def __init__(self, dataset, embedding):
        self.model_name = 'Transformer'
        self.train_path = dataset + '/data/train.txt'  # 训练集
        self.dev_path = dataset + '/data/dev.txt'  # 验证集
        self.test_path = dataset + '/data/test.txt'  # 测试集
        self.class_list = [x.strip() for x in open(
            dataset + '/data/class.txt', encoding='utf-8').readlines()]  # 类别名单
        self.vocab_path = dataset + '/data/vocab.pkl'  # 词表
        self.save_path = dataset + '/saved_dict/' + self.model_name + '.ckpt'  # 模型训练结果
        self.log_path = dataset + '/log/' + self.model_name
        self.embedding_pretrained = torch.tensor(
            np.load(dataset + '/data/' + embedding)["embeddings"].astype('float32')) \
            if embedding != 'random' else None  # 预训练词向量
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # 设备

        self.dropout = 0.5  # 随机失活
        self.require_improvement = 2000  # 若超过1000batch效果还没提升,则提前结束训练
        self.num_classes = len(self.class_list)  # 类别数
        self.n_vocab = 0  # 词表大小,在运行时赋值
        self.num_epochs = 20  # epoch数
        self.batch_size = 128  # mini-batch大小
        self.pad_size = 32  # 每句话处理成的长度(短填长切)
        self.learning_rate = 5e-4  # 学习率
        self.embed = self.embedding_pretrained.size(1) \
            if self.embedding_pretrained is not None else 300  # 字向量维度
        self.dim_model = 300
        self.hidden = 1024
        self.last_hidden = 512
        self.num_head = 5
        self.num_encoder = 2


'''Attention Is All You Need'''


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        self.postion_embedding = Positional_Encoding(config.embed, config.pad_size, config.dropout, config.device)
        self.encoder = Encoder(config.dim_model, config.num_head, config.hidden, config.dropout)
        self.encoders = nn.ModuleList([
            copy.deepcopy(self.encoder)
            # Encoder(config.dim_model, config.num_head, config.hidden, config.dropout)
            for _ in range(config.num_encoder)])

        self.fc1 = nn.Linear(config.pad_size * config.dim_model, config.num_classes)
        # self.fc2 = nn.Linear(config.last_hidden, config.num_classes)
        # self.fc1 = nn.Linear(config.dim_model, config.num_classes)
    def forward(self, x):
        out = self.embedding(x[0])  # torch.Size([128, 32, 300])
        out = self.postion_embedding(out)  # torch.Size([128, 32, 300])
        for encoder in self.encoders:
            out = encoder(out)
        out = out.view(out.size(0), -1)  # torch.Size([128, 9600])
        out = self.fc1(out)  # torch.Size([128, 10])
        return out


class Encoder(nn.Module):
    def __init__(self, dim_model, num_head, hidden, dropout):
        super(Encoder, self).__init__()
        self.attention = Multi_Head_Attention(dim_model, num_head, dropout)
        self.feed_forward = Position_wise_Feed_Forward(dim_model, hidden, dropout)

    def forward(self, x):  # torch.Size([128, 32, 300])
        out = self.attention(x)  # torch.Size([128, 32, 300])
        out = self.feed_forward(out)  # torch.Size([128, 32, 300])
        return out


class Positional_Encoding(nn.Module):
    def __init__(self, embed, pad_size, dropout, device):
        super(Positional_Encoding, self).__init__()
        self.device = device
        # Fixed sinusoidal table of shape [pad_size, embed]: even columns get sin, odd columns get cos
        self.pe = torch.tensor(
            [[pos / (10000.0 ** (i // 2 * 2.0 / embed)) for i in range(embed)] for pos in range(pad_size)])
        self.pe[:, 0::2] = torch.sin(self.pe[:, 0::2])
        self.pe[:, 1::2] = torch.cos(self.pe[:, 1::2])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Broadcast-add the non-trainable position table to the word embeddings
        out = x + self.pe.to(self.device)  # torch.Size([128, 32, 300])
        out = self.dropout(out)
        return out


class Scaled_Dot_Product_Attention(nn.Module):
    '''Scaled Dot-Product Attention '''

    def __init__(self):
        super(Scaled_Dot_Product_Attention, self).__init__()

    def forward(self, Q, K, V, scale=None):
        # Dot product of the query tensor Q and the key tensor K
        attention = torch.matmul(Q, K.permute(0, 2, 1))  # [640, 32, 32]
        if scale:
            attention = attention * scale
        # if mask:  # TODO change this
        #     attention = attention.masked_fill_(mask == 0, -1e9)
        # Apply softmax to the attention scores to obtain normalized attention weights
        attention = F.softmax(attention, dim=-1)  # [640, 32, 32]
        # Multiply the attention weights with the value tensor V to obtain the context vectors
        context = torch.matmul(attention, V)  # [640, 32, 60]
        return context


class Multi_Head_Attention(nn.Module):
    def __init__(self, dim_model, num_head, dropout=0.0):
        super(Multi_Head_Attention, self).__init__()
        self.num_head = num_head
        assert dim_model % num_head == 0
        self.dim_head = dim_model // self.num_head
        self.fc_Q = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_K = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_V = nn.Linear(dim_model, num_head * self.dim_head)
        self.attention = Scaled_Dot_Product_Attention()
        self.fc = nn.Linear(num_head * self.dim_head, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        # print("Multi_Head_Attention--x", x.shape)  # [128, 32, 300]
        batch_size = x.size(0)
        Q = self.fc_Q(x)  # [128, 32, 300]
        K = self.fc_K(x)  # [128, 32, 300]
        V = self.fc_V(x)  # [128, 32, 300]
        Q = Q.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        K = K.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        V = V.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        # if mask:  # TODO
        #     mask = mask.repeat(self.num_head, 1, 1)  # TODO change this
        scale = K.size(-1) ** -0.5  # scaling factor 1/sqrt(dim_head)
        context = self.attention(Q, K, V, scale)  # [640, 32, 60]
        context = context.view(batch_size, -1, self.dim_head * self.num_head)  # [128, 32, 300]
        out = self.fc(context)  # [128, 32, 300]
        out = self.dropout(out)  # [128, 32, 300]
        out = out + x  # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)  # [128, 32, 300]
        return out


class Position_wise_Feed_Forward(nn.Module):
    def __init__(self, dim_model, hidden, dropout=0.0):
        super(Position_wise_Feed_Forward, self).__init__()
        self.fc1 = nn.Linear(dim_model, hidden)
        self.fc2 = nn.Linear(hidden, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):  # x.shape [128, 32, 300]
        out = self.fc1(x)  # [128, 32, 1024]
        out = F.relu(out)  # [128, 32, 1024]
        out = self.fc2(out)  # [128, 32, 300]
        out = self.dropout(out)  # [128, 32, 300]
        out = out + x  # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)
        return out
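
To sanity-check the shapes end to end, here is a minimal smoke test. It deliberately bypasses Config (which expects a dataset directory on disk) and uses a hypothetical SimpleNamespace stand-in containing only the attributes Model actually reads; the values mirror the defaults above.

import torch
from types import SimpleNamespace

# Hypothetical stand-in for Config with only the fields Model reads;
# the values mirror the defaults in the Config class above.
cfg = SimpleNamespace(
    embedding_pretrained=None, n_vocab=5000, embed=300, pad_size=32,
    dropout=0.5, device=torch.device('cpu'), dim_model=300,
    num_head=5, hidden=1024, num_encoder=2, num_classes=10)

model = Model(cfg)
ids = torch.randint(0, cfg.n_vocab - 1, (4, cfg.pad_size))  # 4 dummy sentences of padded token ids
seq_len = torch.full((4,), cfg.pad_size)                    # sentence lengths (forward only uses x[0])
logits = model((ids, seq_len))
print(logits.shape)  # torch.Size([4, 10]) -- one score per class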
