Document Classification with a Transformer (PyTorch Implementation)

Before reading this, you should be familiar with how the Transformer works; see the companion post "Transformer详解" (a detailed walkthrough of the Transformer) first.

Model structure:

  • Input layer: padded token ids, [batch_size, sentence_len]
  • Word embedding layer: [batch_size, sentence_len, embed_size]
  • position_embedding: [batch_size, sentence_len, embed_size]
  • Encoder layers
  • Fully connected (classification) layer
position_embedding
class Positional_Encoding(nn.Module):
    def __init__(self, embed, pad_size, dropout, device):
        super(Positional_Encoding, self).__init__()
        self.device = device
        # Fixed sinusoidal table of shape [pad_size, embed]: even columns get sin, odd columns get cos
        self.pe = torch.tensor(
            [[pos / (10000.0 ** (i // 2 * 2.0 / embed)) for i in range(embed)] for pos in range(pad_size)])
        self.pe[:, 0::2] = torch.sin(self.pe[:, 0::2])
        self.pe[:, 1::2] = torch.cos(self.pe[:, 1::2])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Broadcast-add the non-trainable position table to the word embeddings
        out = x + self.pe.to(self.device)  # torch.Size([128, 32, 300])
        out = self.dropout(out)
        return out
  • Injecting order information: in the Transformer, positional encoding (Positional Encoding) represents where each element sits in the input sequence. Because the model is built entirely on attention and has no built-in notion of order, this information has to be added explicitly; the positional encoding is the mechanism that injects it into the input sequence.
  • Preserving position through the layers: each position gets a unique encoding that is added to the input representation and processed together with it, so even after several attention layers the relative position information is not lost.
  • Periodicity: the encoding is built from sine and cosine functions, which gives it a periodic structure and lets it represent sequences of arbitrary length, as written out below.
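
For reference, the table built in __init__ is the standard sinusoidal encoding from "Attention Is All You Need" (pos is the position in the sentence, d is the embedding dimension embed):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right),\qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
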
Encoder
class Encoder(nn.Module):
    def __init__(self, dim_model, num_head, hidden, dropout):
        super(Encoder, self).__init__()
        self.attention = Multi_Head_Attention(dim_model, num_head, dropout)
        self.feed_forward = Position_wise_Feed_Forward(dim_model, hidden, dropout)

    def forward(self, x):  # torch.Size([128, 32, 300])
        out = self.attention(x)  # torch.Size([128, 32, 300])
        out = self.feed_forward(out)  # torch.Size([128, 32, 300])
        return out
Multi_Head_Attention
  1. Q, K, V in the attention mechanism

    • Query: poses the question, i.e. which information we want to retrieve.
    • Key: identifies information, one key for each position of the input sequence.
    • Value: stores the actual information associated with each position of the input sequence.
  2. Assume the input tensor x has shape [batch_size, seq_length, dim_model], where:

    • batch_size: the batch size (how many samples are processed at once).
    • seq_length: the sequence length (e.g. the length of a sentence).
    • dim_model: the model dimension (e.g. the embedding dimension).
  3. The linear projection self.fc_Q maps the input x to the query tensor Q

    Q = self.fc_Q(x)
    

    At this point Q still has shape [batch_size, seq_length, dim_model].

  4. Splitting into heads

    To compute multi-head attention, the dim_model dimension is split into num_head heads, each of dimension dim_head, where dim_head = dim_model // num_head, so dim_model = num_head * dim_head.

    The shape we are after is [batch_size, seq_length, num_head, dim_head], but for the subsequent computation the tensor is reshaped one step further.

  5. Reshaping

    Q.view(batch_size * num_head, -1, dim_head) turns Q from [batch_size, seq_length, dim_model] into [batch_size * num_head, seq_length, dim_head]. Concretely:

    1. Shape adjustment
      • batch_size and num_head are merged into a single dimension: batch_size * num_head.
      • dim_model is split into num_head * dim_head.
    2. Meaning of the dimensions
      • batch_size * num_head: each attention head is treated as an independent sample.
      • seq_length: the sequence length is unchanged.
      • dim_head: the dimension of each head.

    The final shape [batch_size * num_head, seq_length, dim_head] is what the multi-head attention computation runs on in parallel, so that every head can compute its attention independently (see the sketch after this list for an explicit version of the split).

    With this layout each head computes its own attention weights, which improves both computational efficiency and the model's representational capacity.

  6. Scaling factor:

    $$scale = \frac{1}{\sqrt{dim\_head}}$$

    Why the scaling factor matters:

    • It keeps the dot products from growing too large.
    • It stabilizes the gradients.
    • It improves performance: with scaling, the attention scores are distributed more evenly and the softmax output is smoother, so the model learns useful attention weights more stably during training.
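
A note on step 5: the Q.view(batch_size * num_head, -1, dim_head) call in the code below reshapes the projected tensor by reinterpreting its memory directly. The more explicit split used in many Transformer implementations first reshapes to [batch_size, seq_length, num_head, dim_head] and then transposes. The hypothetical sketch below only illustrates those shapes and the scaling factor from step 6; it is not part of the model code in this post.

import torch

batch_size, seq_length, dim_model, num_head = 128, 32, 300, 5
dim_head = dim_model // num_head   # 60
scale = dim_head ** -0.5           # the 1/sqrt(dim_head) factor from step 6

Q = torch.randn(batch_size, seq_length, dim_model)   # stands in for fc_Q(x)

# Explicit head split: [128, 32, 300] -> [128, 32, 5, 60] -> [128, 5, 32, 60] -> [640, 32, 60]
Q_heads = (Q.view(batch_size, seq_length, num_head, dim_head)
             .permute(0, 2, 1, 3)
             .reshape(batch_size * num_head, seq_length, dim_head))
print(Q_heads.shape)  # torch.Size([640, 32, 60])
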
class Multi_Head_Attention(nn.Module):
    def __init__(self, dim_model, num_head, dropout=0.0):
        super(Multi_Head_Attention, self).__init__()
        self.num_head = num_head
        assert dim_model % num_head == 0
        self.dim_head = dim_model // self.num_head
        self.fc_Q = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_K = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_V = nn.Linear(dim_model, num_head * self.dim_head)
        self.attention = Scaled_Dot_Product_Attention()
        self.fc = nn.Linear(num_head * self.dim_head, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        # print("Multi_Head_Attention--x", x.shape)  # [128, 32, 300]
        batch_size = x.size(0)
        Q = self.fc_Q(x)  # [128, 32, 300]
        K = self.fc_K(x)  # [128, 32, 300]
        V = self.fc_V(x)  # [128, 32, 300]
        Q = Q.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        K = K.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        V = V.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        scale = K.size(-1) ** -0.5  # scaling factor 1/sqrt(dim_head)
        context = self.attention(Q, K, V, scale)  # [640, 32, 60]
        context = context.view(batch_size, -1, self.dim_head * self.num_head)  # [128, 32, 300]
        out = self.fc(context)  # [128, 32, 300]
        out = self.dropout(out)  # [128, 32, 300]
        out = out + x  # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)  # [128, 32, 300]
        return out
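
A quick shape check (a hypothetical snippet, assuming Scaled_Dot_Product_Attention from the next section is also defined): the module maps a [batch_size, pad_size, dim_model] input back to the same shape, which is what allows the encoder blocks to be stacked.

mha = Multi_Head_Attention(dim_model=300, num_head=5, dropout=0.1)
x = torch.randn(128, 32, 300)   # [batch_size, pad_size, dim_model]
print(mha(x).shape)             # torch.Size([128, 32, 300]), same shape as the input
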
Scaled_Dot_Product_Attention
class Scaled_Dot_Product_Attention(nn.Module):
    '''Scaled Dot-Product Attention '''

    def __init__(self):
        super(Scaled_Dot_Product_Attention, self).__init__()

    def forward(self, Q, K, V, scale=None):
        # Dot product of the query tensor Q and the key tensor K
        attention = torch.matmul(Q, K.permute(0, 2, 1))  # [640, 32, 32]
        if scale:
            attention = attention * scale
        # if mask:  # TODO change this
        #     attention = attention.masked_fill_(mask == 0, -1e9)
        # Apply softmax to the attention scores to obtain normalized attention weights
        attention = F.softmax(attention, dim=-1)  # [640, 32, 32]
        # Multiply the attention weights with the value tensor V to obtain the context vectors
        context = torch.matmul(attention, V)  # [640, 32, 60]
        return context

Dot-product attention scores:

$$attention\_scores = Q \cdot K^{T}$$

After scaling and softmax, the weights are applied to the values:

context = torch.matmul(attention, V)  # [640, 32, 60]

This multiplies the attention weights with the value tensor V to obtain the context vectors. The attention weights have shape [batch_size, len_Q, len_K], the value tensor V has shape [batch_size, len_V, dim_V], and the resulting context tensor has shape [batch_size, len_Q, dim_V].
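
Combined with the scaling and the softmax, the class computes the standard scaled dot-product attention (with d_k = dim_head):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$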

Position_wise_Feed_Forward
class Position_wise_Feed_Forward(nn.Module):
    def __init__(self, dim_model, hidden, dropout=0.0):
        super(Position_wise_Feed_Forward, self).__init__()
        self.fc1 = nn.Linear(dim_model, hidden)
        self.fc2 = nn.Linear(hidden, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):  # x.shape [128, 32, 300]
        out = self.fc1(x)  # [128, 32, 1024]
        out = F.relu(out)  # [128, 32, 1024]
        out = self.fc2(out)  # [128, 32, 300]
        out = self.dropout(out)  # [128, 32, 300]
        out = out + x  # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)
        return out

Role of this sublayer:

  • Increases the model's expressive power: the nonlinear transformation lets the model learn more complex features.
  • Keeps gradients flowing: the residual connection helps gradients propagate and prevents them from vanishing.
  • Improves training stability and speed: layer normalization keeps each layer's output distribution stable during training.
  • Improves generalization: dropout reduces overfitting and improves performance on unseen data.
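
Written out, the sublayer above computes:

$$\mathrm{FFN}(x) = \mathrm{ReLU}(x W_1 + b_1)\, W_2 + b_2,\qquad \mathrm{out} = \mathrm{LayerNorm}\big(x + \mathrm{Dropout}(\mathrm{FFN}(x))\big)$$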

Full code:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import copy


class Config(object):
    """配置参数"""

    def __init__(self, dataset, embedding):
        self.model_name = 'Transformer'
        self.train_path = dataset + '/data/train.txt'  # 训练集
        self.dev_path = dataset + '/data/dev.txt'  # 验证集
        self.test_path = dataset + '/data/test.txt'  # 测试集
        self.class_list = [x.strip() for x in open(
            dataset + '/data/class.txt', encoding='utf-8').readlines()]  # 类别名单
        self.vocab_path = dataset + '/data/vocab.pkl'  # 词表
        self.save_path = dataset + '/saved_dict/' + self.model_name + '.ckpt'  # 模型训练结果
        self.log_path = dataset + '/log/' + self.model_name
        self.embedding_pretrained = torch.tensor(
            np.load(dataset + '/data/' + embedding)["embeddings"].astype('float32')) \
            if embedding != 'random' else None  # 预训练词向量
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # 设备

        self.dropout = 0.5  # 随机失活
        self.require_improvement = 2000  # 若超过1000batch效果还没提升,则提前结束训练
        self.num_classes = len(self.class_list)  # 类别数
        self.n_vocab = 0  # 词表大小,在运行时赋值
        self.num_epochs = 20  # epoch数
        self.batch_size = 128  # mini-batch大小
        self.pad_size = 32  # 每句话处理成的长度(短填长切)
        self.learning_rate = 5e-4  # 学习率
        self.embed = self.embedding_pretrained.size(1) \
            if self.embedding_pretrained is not None else 300  # 字向量维度
        self.dim_model = 300
        self.hidden = 1024
        self.last_hidden = 512
        self.num_head = 5
        self.num_encoder = 2


'''Attention Is All You Need'''


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        self.postion_embedding = Positional_Encoding(config.embed, config.pad_size, config.dropout, config.device)
        self.encoder = Encoder(config.dim_model, config.num_head, config.hidden, config.dropout)
        self.encoders = nn.ModuleList([
            copy.deepcopy(self.encoder)
            # Encoder(config.dim_model, config.num_head, config.hidden, config.dropout)
            for _ in range(config.num_encoder)])

        self.fc1 = nn.Linear(config.pad_size * config.dim_model, config.num_classes)
        # self.fc2 = nn.Linear(config.last_hidden, config.num_classes)
        # self.fc1 = nn.Linear(config.dim_model, config.num_classes)
    def forward(self, x):
        out = self.embedding(x[0])  # torch.Size([128, 32, 300])
        out = self.postion_embedding(out)  # torch.Size([128, 32, 300])
        for encoder in self.encoders:
            out = encoder(out)
        out = out.view(out.size(0), -1)  # torch.Size([128, 9600])
        out = self.fc1(out)  # torch.Size([128, 10])
        return out


class Encoder(nn.Module):
    def __init__(self, dim_model, num_head, hidden, dropout):
        super(Encoder, self).__init__()
        self.attention = Multi_Head_Attention(dim_model, num_head, dropout)
        self.feed_forward = Position_wise_Feed_Forward(dim_model, hidden, dropout)

    def forward(self, x):  # torch.Size([128, 32, 300])
        out = self.attention(x)  # torch.Size([128, 32, 300])
        out = self.feed_forward(out)  # torch.Size([128, 32, 300])
        return out


class Positional_Encoding(nn.Module):
    def __init__(self, embed, pad_size, dropout, device):
        super(Positional_Encoding, self).__init__()
        self.device = device
        # Fixed sinusoidal table of shape [pad_size, embed]: even columns get sin, odd columns get cos
        self.pe = torch.tensor(
            [[pos / (10000.0 ** (i // 2 * 2.0 / embed)) for i in range(embed)] for pos in range(pad_size)])
        self.pe[:, 0::2] = torch.sin(self.pe[:, 0::2])
        self.pe[:, 1::2] = torch.cos(self.pe[:, 1::2])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Broadcast-add the non-trainable position table to the word embeddings
        out = x + self.pe.to(self.device)  # torch.Size([128, 32, 300])
        out = self.dropout(out)
        return out


class Scaled_Dot_Product_Attention(nn.Module):
    '''Scaled Dot-Product Attention '''

    def __init__(self):
        super(Scaled_Dot_Product_Attention, self).__init__()

    def forward(self, Q, K, V, scale=None):
        # Dot product of the query tensor Q and the key tensor K
        attention = torch.matmul(Q, K.permute(0, 2, 1))  # [640, 32, 32]
        if scale:
            attention = attention * scale
        # if mask:  # TODO change this
        #     attention = attention.masked_fill_(mask == 0, -1e9)
        # Apply softmax to the attention scores to obtain normalized attention weights
        attention = F.softmax(attention, dim=-1)  # [640, 32, 32]
        # Multiply the attention weights with the value tensor V to obtain the context vectors
        context = torch.matmul(attention, V)  # [640, 32, 60]
        return context


class Multi_Head_Attention(nn.Module):
    def __init__(self, dim_model, num_head, dropout=0.0):
        super(Multi_Head_Attention, self).__init__()
        self.num_head = num_head
        assert dim_model % num_head == 0
        self.dim_head = dim_model // self.num_head
        self.fc_Q = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_K = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_V = nn.Linear(dim_model, num_head * self.dim_head)
        self.attention = Scaled_Dot_Product_Attention()
        self.fc = nn.Linear(num_head * self.dim_head, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        # print("Multi_Head_Attention--x", x.shape)  # [128, 32, 300]
        batch_size = x.size(0)
        Q = self.fc_Q(x)  # [128, 32, 300]
        K = self.fc_K(x)  # [128, 32, 300]
        V = self.fc_V(x)  # [128, 32, 300]
        Q = Q.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        K = K.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        V = V.view(batch_size * self.num_head, -1, self.dim_head)  # [640, 32, 60]
        # if mask:  # TODO
        #     mask = mask.repeat(self.num_head, 1, 1)  # TODO change this
        scale = K.size(-1) ** -0.5  # scaling factor 1/sqrt(dim_head)
        context = self.attention(Q, K, V, scale)  # [640, 32, 60]
        context = context.view(batch_size, -1, self.dim_head * self.num_head)  # [128, 32, 300]
        out = self.fc(context)  # [128, 32, 300]
        out = self.dropout(out)  # [128, 32, 300]
        out = out + x  # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)  # [128, 32, 300]
        return out


class Position_wise_Feed_Forward(nn.Module):
    def __init__(self, dim_model, hidden, dropout=0.0):
        super(Position_wise_Feed_Forward, self).__init__()
        self.fc1 = nn.Linear(dim_model, hidden)
        self.fc2 = nn.Linear(hidden, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):  # x.shape [128, 32, 300]
        out = self.fc1(x)  # [128, 32, 1024]
        out = F.relu(out)  # [128, 32, 1024]
        out = self.fc2(out)  # [128, 32, 300]
        out = self.dropout(out)  # [128, 32, 300]
        out = out + x  # residual connection  # [128, 32, 300]
        out = self.layer_norm(out)
        return out
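
To sanity-check the shapes end to end, here is a minimal smoke test. It deliberately bypasses Config (which expects a dataset directory on disk) and uses a hypothetical SimpleNamespace stand-in containing only the attributes Model actually reads; the values mirror the defaults above.

import torch
from types import SimpleNamespace

# Hypothetical stand-in for Config with only the fields Model reads;
# the values mirror the defaults in the Config class above.
cfg = SimpleNamespace(
    embedding_pretrained=None, n_vocab=5000, embed=300, pad_size=32,
    dropout=0.5, device=torch.device('cpu'), dim_model=300,
    num_head=5, hidden=1024, num_encoder=2, num_classes=10)

model = Model(cfg)
ids = torch.randint(0, cfg.n_vocab - 1, (4, cfg.pad_size))  # 4 dummy sentences of padded token ids
seq_len = torch.full((4,), cfg.pad_size)                    # sentence lengths (forward only uses x[0])
logits = model((ids, seq_len))
print(logits.shape)  # torch.Size([4, 10]) -- one score per class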
