A Detailed Walkthrough of the Transformer Code

In "Attention Is All You Need", Google proposed a new model architecture called the Transformer. Many articles explain the general idea through diagrams, but diagrams alone make it hard to grasp the essentials; working from the code is a good way to get there. The original source code uses many abbreviated variable names and a wide variety of data types, which can still be hard for beginners to follow. After studying the source code, I rewrote it, added a type suffix to every variable name, and added extensive comments, both to deepen my own understanding and in the hope of helping other beginners. My knowledge is limited and mistakes are inevitable; corrections in the comments are welcome.

Related Resources

1. "Attention Is All You Need" (paper):
https://arxiv.org/abs/1706.03762
2. Transformer model explained:
https://blog.csdn.net/u012526436/article/details/86295971 (Chinese)
https://jalammar.github.io/illustrated-transformer/ (English)
3. Source code:
https://github.com/zingp/NLP/blob/master/P006TheAnnotatedTransformer/model_transformer.ipynb

Complete Code

Suggested learning path: illustrated guide (Chinese) → illustrated guide (English) → the original paper → this annotated walkthrough → the source code

"""

Copyright (C) 2019-2020 Dynamo
***************************************************
Author               : Dynamo 
Versions             : 1.0 
Time                 : 2020.1
File_name            : Transformer 代码详细解析


***************************************************
# ↓ 导入pytorch相关模块
import torch
import torch.nn as nn  # ↓ torch的主模块
import torch.nn.functional as TorchFunction_class
from torch.autograd import Variable
# ↓ 导入科学计算、画图等其它模块
import numpy as np
import matplotlib.pyplot as plt
import math, copy, time
import seaborn

seaborn.set_context(context="talk")


# ↓ ————————————————————————————————————Define the Encoder-Decoder class structure————————————————————————————————————

# ↓ The top-level EncoderDecoder model
class EncoderDecoder(nn.Module):
    # ↓ Initialize the network; instantiation takes 5 module arguments: encoder, decoder, source-embedding, target-embedding, and generator modules
    def __init__(self, Encoder_class, Decoder_class, SourceEmbedding_class, TargetEmbedding_class, Generator_class):
        super(EncoderDecoder, self).__init__()
        print('Initial-EncoderDecoder')
        # The Encoder stack
        self.Encoder_class = Encoder_class
        # The Decoder stack
        self.Decoder_class = Decoder_class
        # Embedding modules for the source and target languages
        self.SourceEmbedding_class = SourceEmbedding_class
        self.TargetEmbedding_class = TargetEmbedding_class
        # The generator produces the word at the current time step from the Decoder's hidden state:
        # the hidden state is fed through a fully connected layer whose output size is the vocabulary size,
        # followed by a softmax that turns the scores into probabilities.
        self.Generator_class = Generator_class

    # ↓ The encoding step
    def Encoder_function(self, src, src_mask):
        print('Forward-Encoder_function → ', end='')
        print('InputData:', type(src), type(src_mask), end='')
        OutPut_tensor = self.Encoder_class(self.SourceEmbedding_class(src), src_mask)
        print('OutPutData:', type(OutPut_tensor))
        print(OutPut_tensor)
        return OutPut_tensor

    # ↓ The decoding step; the decoder has one extra (context) attention sub-layer compared with the encoder
    def Decoder_function(self, memory, src_mask, tgt, tgt_mask):
        print('Forward-Decoder_function → ', end='')
        print('InputData:', type(memory), type(src_mask), type(tgt), type(tgt_mask))
        OutPut_tensor = self.Decoder_class(self.TargetEmbedding_class(tgt), memory, src_mask, tgt_mask)
        return OutPut_tensor

    # ↓ Forward pass (first call the encoding step on the input, then call the decoding step)
    # OneBatch_tensor.InputSourceBatch_variable, OneBatch_tensor.InputTargetBatchIn_variable,
    # OneBatch_tensor.InputSourceBatchMask_tensor, OneBatch_tensor.InputTargetBatchMask_tensor
    def forward(self, OneBatchSourceBatch_variable, OneBatchTargetBatchIn_variable, OneBatchSourceBatchMask_variable,
                OneBatchTargetBatchMask_variable):
        print('Forward-EncoderDecoder → ', end='')
        print('InputData:', type(OneBatchSourceBatch_variable), type(OneBatchTargetBatchIn_variable),
              type(OneBatchSourceBatchMask_variable),
              type(OneBatchTargetBatchMask_variable))
        # (OneBatchSourceBatch, OneBatchSourceBatchMask) → Encoder → MemoryEncoded_variable
        MemoryEncoded_variable = self.Encoder_function(OneBatchSourceBatch_variable, OneBatchSourceBatchMask_variable)
        # (MemoryEncoded_variable,SourceBatchMask,BatchTarget,TargetBatchMask) → Decoder → Decoded_variable
        Decoded_variable = self.Decoder_function(MemoryEncoded_variable, OneBatchSourceBatchMask_variable,
                                                 OneBatchTargetBatchIn_variable, OneBatchTargetBatchMask_variable)

        return Decoded_variable


# ↓ ————————————————————————————————————Define the standard linear + softmax generation step————————————————————————————————————
class Generator(nn.Module):
    """定义标准的线性+softmax生成步骤,这是在8. Embeddings和Softmax中"""

    def __init__(self, Hyper_ModelDimension_int, Hyper_AllVocabNumber_int):
        super(Generator, self).__init__()
        print('Initial-Generator')
        # ↓ Define a single linear layer (input: the word-vector/model dimension, output: the vocabulary size)
        self.GeneratorLayer = nn.Linear(Hyper_ModelDimension_int, Hyper_AllVocabNumber_int)

    def forward(self, Input_tensor):
        print('Forward-Generator')
        # ↓ Apply log-softmax to this layer's output
        Output_tensor = TorchFunction_class.log_softmax(self.GeneratorLayer(Input_tensor), dim=-1)
        return Output_tensor

    # ↓ ————————————————————————————————————Define the Encoder and Decoder stacks————————————————————————————————————


# ↓ Layer-cloning helper
def CloneLayer_function(module, N):
    # ↓ Produce N identical (deep-copied) layers.
    return nn.ModuleList([copy.deepcopy(module) for Layer in range(N)])


# ↓ ————————————————————————————————————Define the Encoder model
class Encoder(nn.Module):
    """The core encoder is a stack of N identical layers"""

    def __init__(self, EncodeLayer_class, N):
        super(Encoder, self).__init__()
        print('Initial-Encoder')
        # ↓ Build the Encoder as a stack of N layers from the given layer module
        self.AllEncoderLayersList_class = CloneLayer_function(EncodeLayer_class, N)  # clone N copies of EncodeLayer
        self.LayerNorm_class = LayerNorm(
            EncodeLayer_class.Hyper_ModelDimension_int)  # layer normalization (takes the EncodeLayer's ModelDimension)

    def forward(self, Input_tensor, mask):
        print('Forward-Encoder')
        # ↓ Loop the input through the N EncoderLayers
        for SubEncoderLayer_class in self.AllEncoderLayersList_class:
            Input_tensor = SubEncoderLayer_class(Input_tensor, mask)
        # ↓ Layer normalization
        OutPut_tensor = self.LayerNorm_class(Input_tensor)
        return OutPut_tensor


# ↓ ————————————————————————————————————Define residual connections and layer normalization
# ↓ Layer normalization
class LayerNorm(nn.Module):
    def __init__(self, Hyper_ModelDimension_int, Hyper_MinimumConstant=1e-6):
        super(LayerNorm, self).__init__()
        print('Initial-LayerNorm')
        # ↓ A tensor of ones with shape Hyper_ModelDimension_int; a learnable gain that restores scale after normalization
        self.Weight_tensor = nn.Parameter(torch.ones(Hyper_ModelDimension_int))
        # ↓ A tensor of zeros with shape Hyper_ModelDimension_int; a learnable bias that restores shift after normalization
        self.Bias_tensor = nn.Parameter(torch.zeros(Hyper_ModelDimension_int))
        # ↓ A tiny constant to avoid division by zero
        self.Hyper_MinimumConstant = Hyper_MinimumConstant

    def forward(self, Input_tensor):
        print('Forward-LayerNorm')
        # 1. Mean over the last (feature) dimension
        # 2. Standard deviation over the last (feature) dimension
        # 3. Element-wise multiplication (not a dot product)
        Mean = Input_tensor.mean(-1, keepdim=True)
        StandardDeviation = Input_tensor.std(-1, keepdim=True)
        OutPut_tensor = self.Weight_tensor * (Input_tensor - Mean) / (
                StandardDeviation + self.Hyper_MinimumConstant) + self.Bias_tensor
        return OutPut_tensor
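
# ↓ A minimal, illustrative sanity check for LayerNorm (example values only). Note: this implementation uses the
# sample standard deviation via .std(), so its output differs very slightly from torch.nn.LayerNorm, which uses
# the biased variance.
'''
ExampleLayerNorm_class = LayerNorm(8)
ExampleInput_tensor = torch.randn(2, 4, 8)
ExampleOutput_tensor = ExampleLayerNorm_class(ExampleInput_tensor)
print(ExampleOutput_tensor.mean(-1))  # approximately 0 at every position (Weight=1 and Bias=0 at initialization)
print(ExampleOutput_tensor.std(-1))   # approximately 1 at every position
'''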


# ↓ Residual connection
class SublayerConnection(nn.Module):
    def __init__(self, Hyper_ModelDimension_int, Hyper_DropoutRate_float):
        super(SublayerConnection, self).__init__()
        print('Initial-SublayerConnection')
        # Implements the sub-layer connection LayerNorm(x + Sublayer(x)); note the code below applies the norm first: x + Dropout(Sublayer(LayerNorm(x)))
        self.LayerNorm_class = LayerNorm(Hyper_ModelDimension_int)
        # Instantiate a Dropout module
        self.Dropout_class = nn.Dropout(Hyper_DropoutRate_float)

    def forward(self, Input_tensor, InputSubLayer):  # add the input tensor to the result of passing it through InputSubLayer
        print('Forward-SublayerConnection')
        # 1. First apply layer normalization
        # 2. Then pass through the sub-layer
        # 3. Then apply dropout
        # 4. Finally add the original input tensor (the residual connection)
        OutPut_tensor = Input_tensor + self.Dropout_class(InputSubLayer(self.LayerNorm_class(Input_tensor)))
        return OutPut_tensor


# ↓ ————————————————————————————————————Define the EncoderLayer model
class EncoderLayer(nn.Module):
    """The encoder layer is made up of self-attention and a feed-forward network (defined below).
    Each Encoder layer has two blocks (called sub-layers in the tutorial)"""

    #  Arguments accepted by EncoderLayer:
    #  Hyper_ModelDimension_int: the model dimension
    #  copy.deepcopy(Copy_MultiHeadedAttention): the multi-head attention module
    #  copy.deepcopy(Copy_PositionwiseFeedForward): the position-wise feed-forward module
    #  Hyper_DropoutRate_float: the dropout rate

    def __init__(self, Hyper_ModelDimension_int, Multi_HeadedAttention_class, PositionwiseFeedForward_class,
                 Hyper_DropoutRate_float):
        super(EncoderLayer, self).__init__()
        print('Initial-EncoderLayer')
        # ↓ Multi-head attention module
        self.Multi_HeadedAttention_class = Multi_HeadedAttention_class
        # ↓ Feed-forward module
        self.PositionwiseFeedForward_class = PositionwiseFeedForward_class
        # Instantiate and clone 2 residual-connection & normalization layers; a SublayerConnection's forward takes two arguments (Input_tensor, Sublayer)
        self.AllSublayerList_class = CloneLayer_function(
            SublayerConnection(Hyper_ModelDimension_int, Hyper_DropoutRate_float), 2)
        # ↓ The model dimension (hyperparameter)
        self.Hyper_ModelDimension_int = Hyper_ModelDimension_int

    def forward(self, Input_tensor, mask):  # Input_tensor and mask are the inputs
        print('Forward-EncoderLayer')
        """Follow Figure 1 (left) for connections."""
        SubInput_tensor = self.AllSublayerList_class[0](Input_tensor,
                                                        lambda Input_tensor: self.Multi_HeadedAttention_class(
                                                            Input_tensor, Input_tensor, Input_tensor, mask))
        OutPut_tensor = self.AllSublayerList_class[1](SubInput_tensor, self.PositionwiseFeedForward_class)
        return OutPut_tensor


# ↓ ————————————————————————————————————Build the Decoder stack
# ↓ The Decoder model
class Decoder(nn.Module):
    """Generic N layer decoder with masking."""

    def __init__(self, DecodeLayer_class, N):
        super(Decoder, self).__init__()
        print('Initial-Decoder')
        # ↓ Clone N DecodeLayers
        self.AllDecodeLayersList_class = CloneLayer_function(DecodeLayer_class, N)
        # ↓ Every layer of the Decoder stack applies residual connections and Layer Normalization
        self.LayerNorm_class = LayerNorm(DecodeLayer_class.Hyper_ModelDimension_int)

        # ↓ Just like the encoder, N cloned layers; each layer additionally receives the encoder memory.

    def forward(self, Input_tensor, memory, src_mask, tgt_mask):
        print('Forward-Decoder')
        # ↓ Loop the tensor through the stacked Decoder layers
        for Decode_Layer in self.AllDecodeLayersList_class:
            Input_tensor = Decode_Layer(Input_tensor, memory, src_mask, tgt_mask)
        # ↓ Normalize
        OutPut_tensor = self.LayerNorm_class(Input_tensor)
        return OutPut_tensor


# ↓ The Decoder layer
class DecoderLayer(nn.Module):
    """The decoder layer is made up of self-attn, src-attn and a feed-forward network (defined below)"""

    # Arguments accepted by DecoderLayer:
    # Hyper_ModelDimension_int: the model dimension
    # copy.deepcopy(Copy_MultiHeadedAttention): the multi-head self-attention module
    # copy.deepcopy(Copy_MultiHeadedAttention): src_attn, an identical copy of the multi-head attention module
    # copy.deepcopy(Copy_PositionwiseFeedForward): the position-wise feed-forward module
    # Hyper_DropoutRate_float: the dropout rate

    def __init__(self, Hyper_ModelDimension_int, MultiHeadedAttention_class, src_attn, PositionwiseFeedForward_class,
                 Hyper_DropoutRate_float):
        super(DecoderLayer, self).__init__()
        print('Initial-DecoderLayer')
        # ↓ The model dimension (hyperparameter)
        self.Hyper_ModelDimension_int = Hyper_ModelDimension_int
        # ↓ Multi-head self-attention module
        self.MultiHeadedAttention_class = MultiHeadedAttention_class
        # ↓ Multi-head source (encoder-decoder) attention module
        self.src_attn = src_attn
        # ↓ Feed-forward module
        self.PositionwiseFeedForward_class = PositionwiseFeedForward_class
        # ↓ Residual connection & normalization (3 of them)
        self.AllSublayerList_class = CloneLayer_function(
            SublayerConnection(Hyper_ModelDimension_int, Hyper_DropoutRate_float), 3)

    def forward(self, Input_tensor, MemoryFromEncode_tensor, src_mask, tgt_mask):
        print('Forward-DecoderLayer')
        """Follow Figure 1 (right) for connections."""
        # self-attention → residual connection & normalization →
        # source attention (over the encoder memory) → residual connection & normalization →
        # feed-forward layer → residual connection & normalization
        Input_tensor = self.AllSublayerList_class[0](Input_tensor,
                                                     lambda Input_tensor: self.MultiHeadedAttention_class(Input_tensor,
                                                                                                          Input_tensor,
                                                                                                          Input_tensor,
                                                                                                          tgt_mask))
        Input_tensor = self.AllSublayerList_class[1](Input_tensor, lambda Input_tensor: self.src_attn(Input_tensor,
                                                                                                      MemoryFromEncode_tensor,
                                                                                                      MemoryFromEncode_tensor,
                                                                                                      src_mask))
        return self.AllSublayerList_class[2](Input_tensor, self.PositionwiseFeedForward_class)


#  At decoding time step t the Decoder may only use the inputs from steps 1..t, never step t+1 or later, so we need a function that produces this mask matrix
def CreateSubSequentMask_function(AttentionShape_matrix):
    """Mask out subsequent positions."""
    AttentionShape = (1, AttentionShape_matrix, AttentionShape_matrix)
    # np.triu builds an upper-triangular matrix; k gives the diagonal offset: entries on and below it are 0, entries above it are 1
    SubSequentMask_np = np.triu(np.ones(AttentionShape), k=1).astype('uint8')
    '''
    For example, the 5*5 triangular matrix:
    [0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0]
    '''
    # Using NumPy broadcasting, test whether each entry == 0, returning a matrix with True where the entry is 0 and False elsewhere
    SubSequentMask_torch = torch.from_numpy(SubSequentMask_np) == 0
    '''
    For example, the matrix above becomes:
    [ True, False, False, False, False],
    [ True,  True, False, False, False],
    [ True,  True,  True, False, False],
    [ True,  True,  True,  True, False],
    [ True,  True,  True,  True,  True]
    '''

    return SubSequentMask_torch


'''
Plot the triangular mask
plt.figure(figsize=(5, 5))
plt.imshow(CreateSubSequentMask_function(20)[0])
None
'''
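
# ↓ A minimal, illustrative check of the mask itself (example only, not part of the original listing): printing the
# boolean mask for a length-5 sequence reproduces the pattern shown in the comments above.
'''
print(CreateSubSequentMask_function(5))
# shape (1, 5, 5); row t is True only for positions <= t, i.e. step t may attend to steps 1..t but not to the future
'''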


# ↓ ————————————————————————————————————The Attention module works on four dimensions (batch, heads, sequence length (sentence length), head dimension)
# ↓ The attention computation (query, key, value): its inputs are Query, Key, Value and a Mask, and its output is a tensor
def Attention_function(query, key, value, mask=None, dropout=None):
    # The inputs have dimensions (batch, heads, sequence length (sentence length), head dimension)
    #  Steps of the attention computation:
    #  (1) multiply the query by the (transposed) key
    #  (2) divide by the square root of the key dimension
    #  (3) normalize the result with softmax
    #  (4) multiply the weights by Value to produce OutPut_tensor

    # ↓ The key/query dimension d_k
    KeyDimension_int = query.size(-1)
    # ↓ Query times the transposed key: query(B, H, T, d_k) * key(B, H, d_k, T)
    Scores = torch.matmul(query, key.transpose(-2, -1))
    # ↓ Divide by the square root of the key dimension
    Scores = Scores / math.sqrt(KeyDimension_int)
    # ↓ If a mask is supplied, replace the Scores at positions where the mask is 0 with a very small number (-1e9)
    if mask is not None:
        Scores = Scores.masked_fill(mask == 0, -1e9)
    # ↓ Normalize the Scores with softmax to obtain the attention weights
    Product_Attention = TorchFunction_class.softmax(Scores, dim=-1)
    # ↓ Apply dropout
    if dropout is not None:
        Product_Attention = dropout(Product_Attention)
    # ↓ Multiply the attention weights by Value to produce the final OutPut_tensor
    OutPut_tensor = torch.matmul(Product_Attention, value)
    return OutPut_tensor, Product_Attention
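
# ↓ A minimal, illustrative shape check for Attention_function (example tensors only), using the
# (batch, heads, time, head_size) layout produced inside MultiHeadedAttention below.
'''
ExampleQuery_tensor = torch.randn(2, 8, 10, 64)
ExampleKey_tensor = torch.randn(2, 8, 10, 64)
ExampleValue_tensor = torch.randn(2, 8, 10, 64)
ExampleOutput_tensor, ExampleAttention_tensor = Attention_function(ExampleQuery_tensor, ExampleKey_tensor, ExampleValue_tensor)
print(ExampleOutput_tensor.shape)     # torch.Size([2, 8, 10, 64])
print(ExampleAttention_tensor.shape)  # torch.Size([2, 8, 10, 10]); each row of weights sums to 1
'''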


# ↓ ————————————————————————————————————The "multi-head" mechanism
class MultiHeadedAttention(nn.Module):
    def __init__(self, Hyper_HeadNumber_int, Hyper_ModelDimension_int, Hyper_DropoutRate=0.1):
        """考虑模型大小和头的数量。"""
        super(MultiHeadedAttention, self).__init__()
        print('Initial-MultiHeadedAttention')
        # At initialization, specify the number of heads h (hyperparameter) and the model dimension Hyper_ModelDimension_int
        assert Hyper_ModelDimension_int % Hyper_HeadNumber_int == 0  # assert: the model dimension must be divisible by the number of heads
        # ↓ The size of one head (512 / 8 = 64)
        self.HeadSize_int = Hyper_ModelDimension_int // Hyper_HeadNumber_int
        # ↓ The number of heads
        self.Hyper_HeadNumber_int = Hyper_HeadNumber_int
        # ↓ Clone 4 linear layers of shape ModelDimension × ModelDimension, used for the Q/K/V (and output) linear projections
        self.AlllinearsList_class = CloneLayer_function(nn.Linear(Hyper_ModelDimension_int, Hyper_ModelDimension_int),
                                                        4)
        self.Product_Attention = None
        # ↓ Instantiate a Dropout module
        self.Hyper_DropoutRate = nn.Dropout(p=Hyper_DropoutRate)

    def forward(self, InputQuery_tensor, InputKey_tensor, InputValue_tensor, mask=None):
        print('Forward-MultiHeadedAttention')
        # InputQuery_tensor has three dimensions: (batch (number of sentences), sequence length (sentence length), features (word vector))

        # ↓ "Implement the multi-head attention model." The mask has shape (batch, 1, time); since every head uses the same mask, unsqueeze(1) turns it into (batch, 1, 1, time)
        if mask is not None:
            mask = mask.unsqueeze(1)  # insert an extra dimension at position 1
        # ↓ InputQuery_tensor has three dimensions; here we take the first one: the batch size
        OneBatcheNumber_int = InputQuery_tensor.size(0)
        # ↓ 1) Reshape this batch: d_model => h x d_k (via a list comprehension)
        # Several steps happen here:
        # zip(self.linears, (query, key, value)) pairs (self.linears[0], self.linears[1], self.linears[2]) with (query, key, value) and iterates over them.
        # Q/K/V + self.linears [one layer each] →
        # Q/K/V shape (OneBatcheNumber, SentenceLength, ModelDimension) + view →
        # Q/K/V shape (OneBatcheNumber, SentenceLength, 8, 64) + transpose(1, 2) →
        # Q/K/V shape (OneBatcheNumber, 8, SentenceLength, 64)
        InputQuery_tensor, InputKey_tensor, InputValue_tensor = \
            [linear(InputQKV_tensor).view
             (OneBatcheNumber_int, -1, self.Hyper_HeadNumber_int, self.HeadSize_int).transpose(1, 2)
             # (OneBatcheNumber, time, 8, 64) → transpose → (OneBatcheNumber, 8, time, 64)
             #  ↑ iterate over the cloned ModelDimension × ModelDimension linear layers and project Q/K/V through them (the 4th clone is used for the output projection below)
             for linear, InputQKV_tensor in
             zip(self.AlllinearsList_class, (InputQuery_tensor, InputKey_tensor, InputValue_tensor))]
        # ↓ 2) Apply Attention_function to all of them → Score (batch, 8, time, 64), Product_Attention (batch, 8, time, time)
        Score, self.Product_Attention = Attention_function(InputQuery_tensor, InputKey_tensor, InputValue_tensor,
                                                           mask=mask,
                                                           dropout=self.Hyper_DropoutRate)
        # ↓ 3) Finally, concatenate the per-head results of Attention_function back together
        Score = Score.transpose(1, 2).contiguous().view(OneBatcheNumber_int, -1,
                                                        self.Hyper_HeadNumber_int * self.HeadSize_int)
        return self.AlllinearsList_class[-1](Score)
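
# ↓ A minimal, illustrative usage sketch for MultiHeadedAttention (example values only): self-attention with
# 8 heads on a random batch preserves the (batch, time, d_model) shape.
'''
ExampleMultiHeadedAttention_class = MultiHeadedAttention(8, 512)
ExampleInput_tensor = torch.randn(2, 10, 512)  # (batch, time, d_model)
print(ExampleMultiHeadedAttention_class(ExampleInput_tensor, ExampleInput_tensor, ExampleInput_tensor).shape)  # torch.Size([2, 10, 512])
'''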


# ↓ ————————————————————————————————————Position-wise feed-forward network
# ↓ Each block also contains a fully connected feed-forward network.
# It consists of two linear transformations with a ReLU activation in between.
class PositionwiseFeedForward(nn.Module):
    """Implements FFN equation."""

    def __init__(self, Hyper_ModelDimension_int, Hyper_HiddenNodeNumber_int, Hyper_DropoutRate_float=0.1):
        super(PositionwiseFeedForward, self).__init__()
        print('Initial-PositionwiseFeedForward')
        # ↓ The feed-forward network consists of two linear transformations
        self.FeedForwardLayerOne = nn.Linear(Hyper_ModelDimension_int, Hyper_HiddenNodeNumber_int)  # (512 → 2048)
        self.FeedForwardLayerTwo = nn.Linear(Hyper_HiddenNodeNumber_int, Hyper_ModelDimension_int)
        self.dropout = nn.Dropout(Hyper_DropoutRate_float)

    def forward(self, Input_tensor):
        print('Forward-PositionwiseFeedForward')
        # ↓ First linear transformation
        LayerOnePassed_tensor = self.FeedForwardLayerOne(Input_tensor)
        # ↓ ReLU activation
        ReLUed_tensor = TorchFunction_class.relu(LayerOnePassed_tensor)
        # ↓ Dropout
        DropOuted_tensor = self.dropout(ReLUed_tensor)
        # ↓ Second linear transformation
        Output_tensor = self.FeedForwardLayerTwo(DropOuted_tensor)
        return Output_tensor
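
# ↓ A minimal, illustrative shape check for the position-wise feed-forward network (example values only):
# because the FFN is applied to each position independently, the input shape is preserved.
'''
ExamplePositionwiseFeedForward_class = PositionwiseFeedForward(512, 2048)
ExampleInput_tensor = torch.randn(2, 10, 512)
print(ExamplePositionwiseFeedForward_class(ExampleInput_tensor).shape)  # torch.Size([2, 10, 512])
'''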


# ↓ Embeddings and Softmax
class Embeddings(nn.Module):
    def __init__(self, Hyper_ModelDimension_int, Hyper_AllVocabNumber_int):
        super(Embeddings, self).__init__()
        print('Initial-Embeddings → ', end='')
        print('EmbeddingsSize:' + str(Hyper_ModelDimension_int) + '*' + str(Hyper_AllVocabNumber_int))
        # ↓ The dimension of the word vectors
        self.Hyper_ModelDimension_int = Hyper_ModelDimension_int
        # ↓ Create the embedding lookup matrix: Hyper_AllVocabNumber_int word vectors of dimension Hyper_ModelDimension_int
        self.EmbeddingRequireMatrix_tensor = nn.Embedding(Hyper_AllVocabNumber_int, Hyper_ModelDimension_int)

    def forward(self, InputRequireNumberInt_variable):
        print('Forward-Embeddings')
        # ↓ Square root of ModelDimension
        ModelDimensionSquareRoot = math.sqrt(self.Hyper_ModelDimension_int)
        # ↓ Multiply the embedding lookup by the square root of ModelDimension
        print(InputRequireNumberInt_variable.shape)
        Output_tensor = self.EmbeddingRequireMatrix_tensor(InputRequireNumberInt_variable) * ModelDimensionSquareRoot
        return Output_tensor
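
# ↓ A minimal, illustrative usage sketch for Embeddings (example values only): a batch of token ids is mapped
# to word vectors and scaled by sqrt(d_model).
'''
ExampleEmbeddings_class = Embeddings(512, 10)        # vocabulary of 10 ids, 512-dimensional vectors
ExampleIds_tensor = torch.tensor([[1, 2, 3, 4]])     # (batch=1, time=4)
print(ExampleEmbeddings_class(ExampleIds_tensor).shape)  # torch.Size([1, 4, 512])
'''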

    # ↓ ————————————————————————————————————Positional encoding


# ↓ Positional Encoding
class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, Hyper_ModelDimension_int, Hyper_DropoutRate, Hyper_MaxSentenceLength_int=5000):
        super(PositionalEncoding, self).__init__()
        print('Initial-PositionalEncoding')
        # ↓ Instantiate a Dropout module
        self.Dropout = nn.Dropout(p=Hyper_DropoutRate)

        # ↓ Create the PositionalEncoding matrix (max sentence length × word-vector dimension)
        PositionalEncodingDimension_torch = torch.zeros(Hyper_MaxSentenceLength_int, Hyper_ModelDimension_int)
        # ↓ The position indices 0..max length - 1, as a column vector
        ModelDimensionIndex_int = torch.arange(0.0, Hyper_MaxSentenceLength_int).unsqueeze(1)
        # ↓ The PositionalEncoding formula: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
        PositionalEncodingDimension_torch[:, 0::2] = torch.sin(ModelDimensionIndex_int * torch.exp(
            torch.arange(0.0, Hyper_ModelDimension_int, 2) * -(math.log(10000.0) / Hyper_ModelDimension_int)))  # ↓ even columns
        PositionalEncodingDimension_torch[:, 1::2] = torch.cos(ModelDimensionIndex_int * torch.exp(
            torch.arange(0.0, Hyper_ModelDimension_int, 2) * -(math.log(10000.0) / Hyper_ModelDimension_int)))  # ↓ odd columns
        # ↓ Add a batch dimension
        PositionalEncodingDimension_torch = PositionalEncodingDimension_torch.unsqueeze(0)
        # ↓ register_buffer stores values that are saved with the model but are not trainable parameters
        self.register_buffer('PositionalEncoding', PositionalEncodingDimension_torch)

    def forward(self, Input_tensor):
        print('Forward-PositionalEncoding')
        Input_tensor = Input_tensor + Variable(self.PositionalEncoding[:, :Input_tensor.size(1)],
                                               requires_grad=False)
        return self.Dropout(Input_tensor)


# ↓ The (commented-out) plot below shows the sinusoid added at each position; the frequency and offset of the wave differ for each dimension.
'''
plt.figure(figsize=(15, 5))
pe = PositionalEncoding(20, 0)
y = pe.forward(Variable(torch.zeros(1, 100, 20)))
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d" % p for p in [4, 5, 6, 7]])
None
'''
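
# ↓ A minimal, illustrative numeric check of the encoding (example values only): with d_model = 20,
# column 4 of the matrix should equal sin(pos / 10000^(4/20)).
'''
ExamplePositionalEncoding_class = PositionalEncoding(20, 0)
ExampleExpected_float = math.sin(3 / (10000 ** (4 / 20)))
print(float(ExamplePositionalEncoding_class.PositionalEncoding[0, 3, 4]), ExampleExpected_float)  # the two values match
'''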


# ↓ ————————————————————————————————————Build a complete model————————————————————————————————————
# ↓ Model-construction helper
def CreateTransformerModel_function(Hyper_AllSourceVocabNumber_int, Hyper_AllTargetVocabNumber_int,
                                    Hyper_NetLayer_int=6,
                                    Hyper_ModelDimension_int=512,
                                    Hyper_HiddenNodeNumber_int=2048,
                                    Hyper_HeadNumber_int=8, Hyper_DropoutRate_float=0.1):
    """Helper: Construct a model from hyperparameters.从超参数构造模型。"""

    Copy_MultiHeadedAttention = MultiHeadedAttention(Hyper_HeadNumber_int, Hyper_ModelDimension_int)
    Copy_PositionwiseFeedForward = PositionwiseFeedForward(Hyper_ModelDimension_int, Hyper_HiddenNodeNumber_int,
                                                           Hyper_DropoutRate_float)
    Copy_PositionalEncoding = PositionalEncoding(Hyper_ModelDimension_int, Hyper_DropoutRate_float)
    model = EncoderDecoder(

        Encoder(EncoderLayer(Hyper_ModelDimension_int, copy.deepcopy(Copy_MultiHeadedAttention),
                             copy.deepcopy(Copy_PositionwiseFeedForward), Hyper_DropoutRate_float), Hyper_NetLayer_int),
        Decoder(DecoderLayer(Hyper_ModelDimension_int, copy.deepcopy(Copy_MultiHeadedAttention),
                             copy.deepcopy(Copy_MultiHeadedAttention), copy.deepcopy(Copy_PositionwiseFeedForward),
                             Hyper_DropoutRate_float), Hyper_NetLayer_int),
        nn.Sequential(Embeddings(Hyper_ModelDimension_int, Hyper_AllSourceVocabNumber_int),
                      copy.deepcopy(Copy_PositionalEncoding)),
        nn.Sequential(Embeddings(Hyper_ModelDimension_int, Hyper_AllTargetVocabNumber_int),
                      copy.deepcopy(Copy_PositionalEncoding)),
        #  ↓ Map the model's hidden state to output-word probabilities
        Generator(Hyper_ModelDimension_int, Hyper_AllTargetVocabNumber_int)
    )

    # ↓ Initialize the parameters with Xavier (Glorot) uniform initialization; this is very important
    for Parameters in model.parameters():
        if Parameters.dim() > 1:
            nn.init.xavier_uniform_(Parameters)
    return model


# ↓ Build the model
# ↓ Small example model.
tmp_model = CreateTransformerModel_function(10, 10, 2)  # src_vocab=10, tgt_vocab=10, Hyper_NetLayer_int=2
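
# ↓ A minimal, illustrative inspection of the toy model built above (example only).
'''
print(sum(Parameters.numel() for Parameters in tmp_model.parameters()))  # total number of trainable parameters
print(tmp_model.Generator_class)  # the final Linear(512 → 10) layer; log_softmax is applied in its forward()
'''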


# ↓ ————————————————————————————————————Training
# ↓ Batches and Masking
class Batch:
    """Object for holding a batch of data with mask during training."""

    def __init__(self, InputSourceBatch_variable, InputTargetBatch_variable=None, pad=0):
        print('Initial-Batch →', end='')
        print('Input:' + str(InputSourceBatch_variable.shape) + str(
            InputTargetBatch_variable.shape))  # e.g. (30, 10): 30 is the batch size and 10 is the longest sentence length
        # The input is one batch that contains multiple sentences
        self.InputSourceBatch_variable = InputSourceBatch_variable
        self.InputSourceBatchMask_variable = (InputSourceBatch_variable != pad).unsqueeze(-2)
        if InputTargetBatch_variable is not None:
            self.InputTargetBatchIn_variable = InputTargetBatch_variable[:, :-1]
            self.InputTargetBatchOut_variable = InputTargetBatch_variable[:, 1:]
            self.InputTargetBatchMask_variable = self.make_std_mask(self.InputTargetBatchIn_variable, pad)
            self.ntokens = (self.InputTargetBatchOut_variable != pad).data.sum()

    @staticmethod  # a static method can be called without instantiating the class, e.g. Batch.make_std_mask()
    def make_std_mask(InputTargetBatch_tensor, pad):
        print('Running-make_std_mask_function')
        """创建一个遮罩来隐藏填充和将来的单词。"""
        InputTargetBatchMask_variable = (InputTargetBatch_tensor != pad).unsqueeze(-2)
        InputTargetBatchMask_variable = InputTargetBatchMask_variable & Variable(
            CreateSubSequentMask_function(InputTargetBatch_tensor.size(-1)).type_as(InputTargetBatchMask_variable.data))
        return InputTargetBatchMask_variable
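
# ↓ A minimal, illustrative usage sketch for Batch (example values only): one padded sentence and its masks.
'''
ExampleData_tensor = torch.tensor([[1, 3, 4, 2, 0, 0]])       # one sentence, padded with 0
ExampleBatch_class = Batch(ExampleData_tensor, ExampleData_tensor, 0)
print(ExampleBatch_class.InputSourceBatchMask_variable.shape)  # (1, 1, 6): True where the token is not padding
print(ExampleBatch_class.InputTargetBatchMask_variable.shape)  # (1, 5, 5): padding mask AND subsequent mask
'''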


# ↓ Training Loop
def RunEpoch_function(AllData_Iterator, GlobalModel, LossComput_class):
    """Standard Training and Logging Function"""
    StartTime_float = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    for BatchIndex, OneBatch_tensor in enumerate(AllData_Iterator):
        # Feed the data through the global model to get its output (self, src, tgt, src_mask, tgt_mask):
        GlobalModelOutput_tensor = GlobalModel.forward(OneBatch_tensor.InputSourceBatch_variable,
                                                       OneBatch_tensor.InputTargetBatchIn_variable,
                                                       OneBatch_tensor.InputSourceBatchMask_variable,
                                                       OneBatch_tensor.InputTargetBatchMask_variable)
        loss = LossComput_class(GlobalModelOutput_tensor, OneBatch_tensor.InputTargetBatchOut_variable,
                                OneBatch_tensor.ntokens)
        total_loss += loss
        total_tokens += OneBatch_tensor.ntokens
        tokens += OneBatch_tensor.ntokens
        if BatchIndex % 50 == 1:
            elapsed = time.time() - StartTime_float
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                  (BatchIndex, loss / OneBatch_tensor.ntokens, tokens / elapsed))
            StartTime_float = time.time()
            tokens = 0
    return total_loss / total_tokens


# ↓ Training Data and Batching
global max_src_in_batch, max_tgt_in_batch


def batch_size_fn(new, count, sofar):
    "Keep augmenting batch and calculate total number of tokens + padding."
    global max_src_in_batch, max_tgt_in_batch
    if count == 1:
        max_src_in_batch = 0
        max_tgt_in_batch = 0
    max_src_in_batch = max(max_src_in_batch, len(new.src))
    max_tgt_in_batch = max(max_tgt_in_batch, len(new.trg) + 2)
    src_elements = count * max_src_in_batch
    tgt_elements = count * max_tgt_in_batch
    return max(src_elements, tgt_elements)


# ↓ Optimizer
class Optimizer:
    "Optim wrapper that implements rate."

    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return self.factor * \
               (self.model_size ** (-0.5) *
                min(step ** (-0.5), step * self.warmup ** (-1.5)))


def get_std_opt(model):
    return Optimizer(model.src_embed[0].d_model, 2, 4000,
                     torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))


'''
# ↓ Example of the curves of this model for different model sizes and for optimization hyperparameters.
# ↓ Three settings of the lrate hyperparameters.
opts = [Optimizer(512, 1, 4000, None),
        Optimizer(512, 1, 8000, None),
        Optimizer(256, 1, 4000, None)]
plt.plot(np.arange(1, 20000), [[opt.rate(i) for opt in opts] for i in range(1, 20000)])
plt.legend(["512:4000", "512:8000", "256:4000"])
None
'''
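
# ↓ A minimal, illustrative numeric check of the warm-up schedule (example values only): the learning rate
# factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5) peaks at step == warmup.
'''
ExampleOptimizer_class = Optimizer(512, 1, 4000, None)
print(ExampleOptimizer_class.rate(1))        # very small at the first step
print(ExampleOptimizer_class.rate(4000))     # maximum: 512^-0.5 * 4000^-0.5 ≈ 7.0e-4
print(ExampleOptimizer_class.rate(40000))    # decays as step^-0.5 afterwards
'''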


# ↓ Regularization
# ↓ Label Smoothing
class LabelSmoothing(nn.Module):
    "Implement label smoothing."

    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        print('Initial-LabelSmoothing')
        self.criterion = nn.KLDivLoss(reduction='sum')
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        print('Forward-LabelSmoothing')
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, Variable(true_dist, requires_grad=False))
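
# ↓ A minimal, illustrative sketch of the target distribution LabelSmoothing builds (example values only):
# with smoothing=0.4 and a 5-word vocabulary, 0.6 goes to the gold token and 0.4 is spread over the other
# non-padding tokens.
'''
ExampleCriterion_class = LabelSmoothing(size=5, padding_idx=0, smoothing=0.4)
ExamplePrediction_tensor = torch.log(torch.tensor([[1e-9, 0.2, 0.7, 0.1, 1e-9]]))
ExampleLoss = ExampleCriterion_class(Variable(ExamplePrediction_tensor), Variable(torch.LongTensor([2])))
print(ExampleCriterion_class.true_dist)  # roughly [[0.0000, 0.1333, 0.6000, 0.1333, 0.1333]]
'''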


############################################################ Example 1: the copy task
# Given a random set of input symbols from a small vocabulary, the goal is to generate those same symbols back.
# Create some data (this is a generator): batch=30, nbatches=20
def CreateFakeData_function(Hyper_AllSourceVocabNumber_int, OneBatchNumber_int, AllBatchNumber_int):
    print('Running:CreateFakeData_function')
    # OneBatchNumber_int sentences per batch; build one batch of encoded sequences at a time
    for BatchNumber_int in range(AllBatchNumber_int):
        # Random integer token ids in the range 1 to Hyper_AllSourceVocabNumber_int - 1, with size = (sentences per batch, sentence length 10)
        SentenceData_torch = torch.from_numpy(
            np.random.randint(1, Hyper_AllSourceVocabNumber_int, size=(OneBatchNumber_int, 10))).long()
        SentenceData_torch[:, 0] = 1
        SourceData_variable = Variable(SentenceData_torch, requires_grad=False).cuda()
        TargetData_variable = Variable(SentenceData_torch, requires_grad=False).cuda()
        # Instantiate a Batch object
        CurrentBatchData_class = Batch(SourceData_variable, TargetData_variable, 0)
        print('Output:CurrentBatchData_class:')
        yield CurrentBatchData_class


# Loss Computation
class SimpleLossCompute:
    "A simple loss compute and train function."

    def __init__(self, generator, criterion, opt=None):
        print('Initial-SimpleLossCompute')
        self.generator = generator
        self.criterion = criterion
        self.opt = opt

    def __call__(self, x, y, norm):
        print('Call-SimpleLossCompute')
        x = self.generator(x)
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
                              y.contiguous().view(-1)) / norm
        loss.backward()
        if self.opt is not None:
            self.opt.step()
            self.opt.optimizer.zero_grad()

        OutputLoss = loss.item() * norm

        return OutputLoss


# Greedy Decoding
# Train the simple copy task.
# Set the hyperparameters: a source/target vocabulary of 10 token ids
Hyper_AllSourceVocabNumber_int, Hyper_AllTargetVocabNumber_int = 10, 10
Hyper_NetLayer_int = 6
criterion = LabelSmoothing(size=Hyper_AllSourceVocabNumber_int, padding_idx=0, smoothing=0.0)
# Instantiate the model with the 10-id source/target vocabularies and Hyper_NetLayer_int (6) stacked layers
CopyTestModel_class = CreateTransformerModel_function(Hyper_AllSourceVocabNumber_int, Hyper_AllTargetVocabNumber_int,
                                                      Hyper_NetLayer_int).cuda()
# Define the optimizer
CopyTestModelOptimizer = Optimizer(CopyTestModel_class.TargetEmbedding_class[0].Hyper_ModelDimension_int, 1, 400,
                                   torch.optim.Adam(CopyTestModel_class.parameters(), lr=0, betas=(0.9, 0.98),
                                                    eps=1e-9))
# Train for the specified number of epochs
for Epoch in range(1):
    # Training and evaluation use different modes; call model.train() before training (enables Dropout)
    CopyTestModel_class.train()
    # A generator of 20 batches of 30 sentences each, drawn from the 10-id vocabulary, yielding one batch at a time
    AllBatchData_Iterator = CreateFakeData_function(Hyper_AllSourceVocabNumber_int, 30, 20)
    # Compute the loss
    LossCompute = SimpleLossCompute(CopyTestModel_class.Generator_class, criterion, CopyTestModelOptimizer)
    # RunEpoch_function trains one epoch of data; it takes (AllData_Iterator, GlobalModel, loss_compute)
    RunEpoch_function(AllBatchData_Iterator, CopyTestModel_class, LossCompute)

    # model.eval(): evaluation mode (disables Dropout)
    CopyTestModel_class.eval()
    print(RunEpoch_function(CreateFakeData_function(Hyper_AllSourceVocabNumber_int, 30, 5), CopyTestModel_class,
                            SimpleLossCompute(CopyTestModel_class.Generator_class, criterion, None)))
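
# ↓ The "Greedy Decoding" step named above is not shown in this listing. Below is a minimal sketch adapted from the
# Annotated Transformer's greedy_decode to the variable names used here; treat it as an illustrative assumption,
# not as part of the original code.
'''
def GreedyDecode_function(Model, Source_variable, SourceMask_variable, MaxLength_int, StartSymbol_int):
    Memory_variable = Model.Encoder_function(Source_variable, SourceMask_variable)
    Decoded_tensor = torch.ones(1, 1).fill_(StartSymbol_int).type_as(Source_variable.data)
    for _ in range(MaxLength_int - 1):
        Output_tensor = Model.Decoder_function(Memory_variable, SourceMask_variable, Variable(Decoded_tensor),
                                               Variable(CreateSubSequentMask_function(Decoded_tensor.size(1)).type_as(Source_variable.data)))
        Probability_tensor = Model.Generator_class(Output_tensor[:, -1])
        _, NextWord_tensor = torch.max(Probability_tensor, dim=1)
        Decoded_tensor = torch.cat(
            [Decoded_tensor, torch.ones(1, 1).type_as(Source_variable.data).fill_(NextWord_tensor.item())], dim=1)
    return Decoded_tensor

CopyTestModel_class.eval()
ExampleSource_variable = Variable(torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 1]])).cuda()
ExampleSourceMask_variable = Variable(torch.ones(1, 1, 10)).cuda()
print(GreedyDecode_function(CopyTestModel_class, ExampleSource_variable, ExampleSourceMask_variable, 10, 1))
'''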


Conclusion

That is the end of the code walkthrough. This is my first blog post, so omissions are inevitable; it is mainly a record of my own learning notes. Rewriting the code taught me a great deal, and I hope reading this post is equally useful to you. Feel free to repost, but please credit the source.
