A Detailed Walkthrough of the Transformer Code

In "Attention Is All You Need", Google proposed a new model architecture called the Transformer. Many articles explain the general idea through diagrams, but diagrams alone make it hard to grasp the essentials; working from the code is a good way to get there. The original source code uses many abbreviated variable names and a wide variety of data types, which can still be hard for beginners to follow. After studying the source code, I rewrote it, added a type suffix to every variable name, and added extensive comments, both to deepen my own understanding and in the hope of helping other beginners. My knowledge is limited and mistakes are inevitable; corrections in the comments are welcome.

Related Resources

1. "Attention Is All You Need" (paper):
https://arxiv.org/abs/1706.03762
2. Transformer model explained:
https://blog.csdn.net/u012526436/article/details/86295971 (Chinese)
https://jalammar.github.io/illustrated-transformer/ (English)
3. Source code:
https://github.com/zingp/NLP/blob/master/P006TheAnnotatedTransformer/model_transformer.ipynb

Complete Code

Suggested learning path: illustrated guide (Chinese) → illustrated guide (English) → the original paper → this annotated walkthrough → the source code

"""

Copyright (C) 2019-2020 Dynamo
***************************************************
Author               : Dynamo 
Versions             : 1.0 
Time                 : 2020.1
File_name            : Transformer 代码详细解析


***************************************************
# ↓ 导入pytorch相关模块
import torch
import torch.nn as nn  # ↓ torch的主模块
import torch.nn.functional as TorchFunction_class
from torch.autograd import Variable
# ↓ 导入科学计算、画图等其它模块
import numpy as np
import matplotlib.pyplot as plt
import math, copy, time
import seaborn

seaborn.set_context(context="talk")


# ↓ ————————————————————————————————————Define the Encoder-Decoder class structure————————————————————————————————————

# ↓ The top-level EncoderDecoder model
class EncoderDecoder(nn.Module):
    # ↓ Initialize the network; instantiation takes 5 module arguments: encoder, decoder, source-embedding, target-embedding, and generator modules
    def __init__(self, Encoder_class, Decoder_class, SourceEmbedding_class, TargetEmbedding_class, Generator_class):
        super(EncoderDecoder, self).__init__()
        print('Initial-EncoderDecoder')
        # The Encoder stack
        self.Encoder_class = Encoder_class
        # The Decoder stack
        self.Decoder_class = Decoder_class
        # Embedding modules for the source and target languages
        self.SourceEmbedding_class = SourceEmbedding_class
        self.TargetEmbedding_class = TargetEmbedding_class
        # The generator produces the word at the current time step from the Decoder's hidden state:
        # the hidden state is fed through a fully connected layer whose output size is the vocabulary size,
        # followed by a softmax that turns the scores into probabilities.
        self.Generator_class = Generator_class

    # ↓ The encoding step
    def Encoder_function(self, src, src_mask):
        print('Forward-Encoder_function → ', end='')
        print('InputData:', type(src), type(src_mask), end='')
        OutPut_tensor = self.Encoder_class(self.SourceEmbedding_class(src), src_mask)
        print('OutPutData:', type(OutPut_tensor))
        print(OutPut_tensor)
        return OutPut_tensor

    # ↓ The decoding step; the decoder has one extra (context) attention sub-layer compared with the encoder
    def Decoder_function(self, memory, src_mask, tgt, tgt_mask):
        print('Forward-Decoder_function → ', end='')
        print('InputData:', type(memory), type(src_mask), type(tgt), type(tgt_mask))
        OutPut_tensor = self.Decoder_class(self.TargetEmbedding_class(tgt), memory, src_mask, tgt_mask)
        return OutPut_tensor

    # ↓ Forward pass (first call the encoding step on the input, then call the decoding step)
    # OneBatch_tensor.InputSourceBatch_variable, OneBatch_tensor.InputTargetBatchIn_variable,
    # OneBatch_tensor.InputSourceBatchMask_tensor, OneBatch_tensor.InputTargetBatchMask_tensor
    def forward(self, OneBatchSourceBatch_variable, OneBatchTargetBatchIn_variable, OneBatchSourceBatchMask_variable,
                OneBatchTargetBatchMask_variable):
        print('Forward-EncoderDecoder → ', end='')
        print('InputData:', type(OneBatchSourceBatch_variable), type(OneBatchTargetBatchIn_variable),
              type(OneBatchSourceBatchMask_variable),
              type(OneBatchTargetBatchMask_variable))
        # (OneBatchSourceBatch, OneBatchSourceBatchMask) → Encoder → MemoryEncoded_variable
        MemoryEncoded_variable = self.Encoder_function(OneBatchSourceBatch_variable, OneBatchSourceBatchMask_variable)
        # (MemoryEncoded_variable,SourceBatchMask,BatchTarget,TargetBatchMask) → Decoder → Decoded_variable
        Decoded_variable = self.Decoder_function(MemoryEncoded_variable, OneBatchSourceBatchMask_variable,
                                                 OneBatchTargetBatchIn_variable, OneBatchTargetBatchMask_variable)

        return Decoded_variable


# ↓ ————————————————————————————————————Define the standard linear + softmax generation step————————————————————————————————————
class Generator(nn.Module):
    """定义标准的线性+softmax生成步骤,这是在8. Embeddings和Softmax中"""

    def __init__(self, Hyper_ModelDimension_int, Hyper_AllVocabNumber_int):
        super(Generator, self).__init__()
        print('Initial-Generator')
        # ↓ Define a single linear layer (input: the word-vector/model dimension, output: the vocabulary size)
        self.GeneratorLayer = nn.Linear(Hyper_ModelDimension_int, Hyper_AllVocabNumber_int)

    def forward(self, Input_tensor):
        print('Forward-Generator')
        # ↓ Apply log-softmax to this layer's output
        Output_tensor = TorchFunction_class.log_softmax(self.GeneratorLayer(Input_tensor), dim=-1)
        return Output_tensor

    # ↓ ————————————————————————————————————Define the Encoder and Decoder stacks————————————————————————————————————


# ↓ Layer-cloning helper
def CloneLayer_function(module, N):
    # ↓ Produce N identical (deep-copied) layers.
    return nn.ModuleList([copy.deepcopy(module) for Layer in range(N)])


# ↓ ————————————————————————————————————Define the Encoder model
class Encoder(nn.Module):
    """The core encoder is a stack of N identical layers"""

    def __init__(self, EncodeLayer_class, N):
        super(Encoder, self).__init__()
        print('Initial-Encoder')
        # ↓ Build the Encoder as a stack of N layers from the given layer module
        self.AllEncoderLayersList_class = CloneLayer_function(EncodeLayer_class, N)  # clone N copies of EncodeLayer
        self.LayerNorm_class = LayerNorm(
            EncodeLayer_class.Hyper_ModelDimension_int)  # layer normalization (takes the EncodeLayer's ModelDimension)

    def forward(self, Input_tensor, mask):
        print('Forward-Encoder')
        # ↓ Loop the input through the N EncoderLayers
        for SubEncoderLayer_class in self.AllEncoderLayersList_class:
            Input_tensor = SubEncoderLayer_class(Input_tensor, mask)
        # ↓ Layer normalization
        OutPut_tensor = self.LayerNorm_class(Input_tensor)
        return OutPut_tensor


# ↓ ————————————————————————————————————Define residual connections and layer normalization
# ↓ Layer normalization
class LayerNorm(nn.Module):
    def __init__(self, Hyper_ModelDimension_int, Hyper_MinimumConstant=1e-6):
        super(LayerNorm, self).__init__()
        print('Initial-LayerNorm')
        # ↓ A tensor of ones with shape Hyper_ModelDimension_int; a learnable gain that restores scale after normalization
        self.Weight_tensor = nn.Parameter(torch.ones(Hyper_ModelDimension_int))
        # ↓ A tensor of zeros with shape Hyper_ModelDimension_int; a learnable bias that restores shift after normalization
        self.Bias_tensor = nn.Parameter(torch.zeros(Hyper_ModelDimension_int))
        # ↓ A tiny constant to avoid division by zero
        self.Hyper_MinimumConstant = Hyper_MinimumConstant

    def forward(self, Input_tensor):
        print('Forward-LayerNorm')
        # 1. Mean over the last (feature) dimension
        # 2. Standard deviation over the last (feature) dimension
        # 3. Element-wise multiplication (not a dot product)
        Mean = Input_tensor.mean(-1, keepdim=True)
        StandardDeviation = Input_tensor.std(-1, keepdim=True)
        OutPut_tensor = self.Weight_tensor * (Input_tensor - Mean) / (
                StandardDeviation + self.Hyper_MinimumConstant) + self.Bias_tensor
        return OutPut_tensor
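
# ↓ A minimal, illustrative sanity check for LayerNorm (example values only). Note: this implementation uses the
# sample standard deviation via .std(), so its output differs very slightly from torch.nn.LayerNorm, which uses
# the biased variance.
'''
ExampleLayerNorm_class = LayerNorm(8)
ExampleInput_tensor = torch.randn(2, 4, 8)
ExampleOutput_tensor = ExampleLayerNorm_class(ExampleInput_tensor)
print(ExampleOutput_tensor.mean(-1))  # approximately 0 at every position (Weight=1 and Bias=0 at initialization)
print(ExampleOutput_tensor.std(-1))   # approximately 1 at every position
'''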


# ↓ Residual connection
class SublayerConnection(nn.Module):
    def __init__(self, Hyper_ModelDimension_int, Hyper_DropoutRate_float):
        super(SublayerConnection, self).__init__()
        print('Initial-SublayerConnection')
        # Implements the sub-layer connection LayerNorm(x + Sublayer(x)); note the code below applies the norm first: x + Dropout(Sublayer(LayerNorm(x)))
        self.LayerNorm_class = LayerNorm(Hyper_ModelDimension_int)
        # Instantiate a Dropout module
        self.Dropout_class = nn.Dropout(Hyper_DropoutRate_float)

    def forward(self, Input_tensor, InputSubLayer):  # add the input tensor to the result of passing it through InputSubLayer
        print('Forward-SublayerConnection')
        # 1. First apply layer normalization
        # 2. Then pass through the sub-layer
        # 3. Then apply dropout
        # 4. Finally add the original input tensor (the residual connection)
        OutPut_tensor = Input_tensor + self.Dropout_class(InputSubLayer(self.LayerNorm_class(Input_tensor)))
        return OutPut_tensor


# ↓ ————————————————————————————————————Define the EncoderLayer model
class EncoderLayer(nn.Module):
    """The encoder layer is made up of self-attention and a feed-forward network (defined below).
    Each Encoder layer has two blocks (called sub-layers in the tutorial)"""

    #  Arguments accepted by EncoderLayer:
    #  Hyper_ModelDimension_int: the model dimension
    #  copy.deepcopy(Copy_MultiHeadedAttention): the multi-head attention module
    #  copy.deepcopy(Copy_PositionwiseFeedForward): the position-wise feed-forward module
    #  Hyper_DropoutRate_float: the dropout rate

    def __init__(self, Hyper_ModelDimension_int, Multi_HeadedAttention_class, PositionwiseFeedForward_class,
                 Hyper_DropoutRate_float):
        super(EncoderLayer, self).__init__()
        print('Initial-EncoderLayer')
        # ↓ Multi-head attention module
        self.Multi_HeadedAttention_class = Multi_HeadedAttention_class
        # ↓ Feed-forward module
        self.PositionwiseFeedForward_class = PositionwiseFeedForward_class
        # Instantiate and clone 2 residual-connection & normalization layers; a SublayerConnection's forward takes two arguments (Input_tensor, Sublayer)
        self.AllSublayerList_class = CloneLayer_function(
            SublayerConnection(Hyper_ModelDimension_int, Hyper_DropoutRate_float), 2)
        # ↓ The model dimension (hyperparameter)
        self.Hyper_ModelDimension_int = Hyper_ModelDimension_int

    def forward(self, Input_tensor, mask):  # Input_tensor and mask are the inputs
        print('Forward-EncoderLayer')
        """Follow Figure 1 (left) for connections."""
        SubInput_tensor = self.AllSublayerList_class[0](Input_tensor,
                                                        lambda Input_tensor: self.Multi_HeadedAttention_class(
                                                            Input_tensor, Input_tensor, Input_tensor, mask))
        OutPut_tensor = self.AllSublayerList_class[1](SubInput_tensor, self.PositionwiseFeedForward_class)
        return OutPut_tensor


# ↓ ————————————————————————————————————Build the Decoder stack
# ↓ The Decoder model
class Decoder(nn.Module):
    """Generic N layer decoder with masking."""

    def __init__(self, DecodeLayer_class, N):
        super(Decoder, self).__init__()
        print('Initial-Decoder')
        # ↓ Clone N DecodeLayers
        self.AllDecodeLayersList_class = CloneLayer_function(DecodeLayer_class, N)
        # ↓ Every layer of the Decoder stack applies residual connections and Layer Normalization
        self.LayerNorm_class = LayerNorm(DecodeLayer_class.Hyper_ModelDimension_int)

        # ↓ Just like the encoder, N cloned layers; each layer additionally receives the encoder memory.

    def forward(self, Input_tensor, memory, src_mask, tgt_mask):
        print('Forward-Decoder')
        # ↓ Loop the tensor through the stacked Decoder layers
        for Decode_Layer in self.AllDecodeLayersList_class:
            Input_tensor = Decode_Layer(Input_tensor, memory, src_mask, tgt_mask)
        # ↓ Normalize
        OutPut_tensor = self.LayerNorm_class(Input_tensor)
        return OutPut_tensor


# ↓ The Decoder layer
class DecoderLayer(nn.Module):
    """The decoder layer is made up of self-attn, src-attn and a feed-forward network (defined below)"""

    # Arguments accepted by DecoderLayer:
    # Hyper_ModelDimension_int: the model dimension
    # copy.deepcopy(Copy_MultiHeadedAttention): the multi-head self-attention module
    # copy.deepcopy(Copy_MultiHeadedAttention): src_attn, an identical copy of the multi-head attention module
    # copy.deepcopy(Copy_PositionwiseFeedForward): the position-wise feed-forward module
    # Hyper_DropoutRate_float: the dropout rate

    def __init__(self, Hyper_ModelDimension_int, MultiHeadedAttention_class, src_attn, PositionwiseFeedForward_class,
                 Hyper_DropoutRate_float):
        super(DecoderLayer, self).__init__()
        print('Initial-DecoderLayer')
        # ↓ The model dimension (hyperparameter)
        self.Hyper_ModelDimension_int = Hyper_ModelDimension_int
        # ↓ Multi-head self-attention module
        self.MultiHeadedAttention_class = MultiHeadedAttention_class
        # ↓ Multi-head source (encoder-decoder) attention module
        self.src_attn = src_attn
        # ↓ Feed-forward module
        self.PositionwiseFeedForward_class = PositionwiseFeedForward_class
        # ↓ Residual connection & normalization (3 of them)
        self.AllSublayerList_class = CloneLayer_function(
            SublayerConnection(Hyper_ModelDimension_int, Hyper_DropoutRate_float), 3)

    def forward(self, Input_tensor, MemoryFromEncode_tensor, src_mask, tgt_mask):
        print('Forward-DecoderLayer')
        """Follow Figure 1 (right) for connections."""
        # self-attention → residual connection & normalization →
        # source attention (over the encoder memory) → residual connection & normalization →
        # feed-forward layer → residual connection & normalization
        Input_tensor = self.AllSublayerList_class[0](Input_tensor,
                                                     lambda Input_tensor: self.MultiHeadedAttention_class(Input_tensor,
                                                                                                          Input_tensor,
                                                                                                          Input_tensor,
                                                                                                          tgt_mask))
        Input_tensor = self.AllSublayerList_class[1](Input_tensor, lambda Input_tensor: self.src_attn(Input_tensor,
                                                                                                      MemoryFromEncode_tensor,
                                                                                                      MemoryFromEncode_tensor,
                                                                                                      src_mask))
        return self.AllSublayerList_class[2](Input_tensor, self.PositionwiseFeedForward_class)


#  At decoding time step t the Decoder may only use the inputs from steps 1..t, never step t+1 or later, so we need a function that produces this mask matrix
def CreateSubSequentMask_function(AttentionShape_matrix):
    """Mask out subsequent positions."""
    AttentionShape = (1, AttentionShape_matrix, AttentionShape_matrix)
    # np.triu builds an upper-triangular matrix; k gives the diagonal offset: entries on and below it are 0, entries above it are 1
    SubSequentMask_np = np.triu(np.ones(AttentionShape), k=1).astype('uint8')
    '''
    For example, the 5*5 triangular matrix:
    [0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0]
    '''
    # Using NumPy broadcasting, test whether each entry == 0, returning a matrix with True where the entry is 0 and False elsewhere
    SubSequentMask_torch = torch.from_numpy(SubSequentMask_np) == 0
    '''
    For example, the matrix above becomes:
    [ True, False, False, False, False],
    [ True,  True, False, False, False],
    [ True,  True,  True, False, False],
    [ True,  True,  True,  True, False],
    [ True,  True,  True,  True,  True]
    '''

    return SubSequentMask_torch


'''
Plot the triangular mask
plt.figure(figsize=(5, 5))
plt.imshow(CreateSubSequentMask_function(20)[0])
None
'''
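
# ↓ A minimal, illustrative check of the mask itself (example only, not part of the original listing): printing the
# boolean mask for a length-5 sequence reproduces the pattern shown in the comments above.
'''
print(CreateSubSequentMask_function(5))
# shape (1, 5, 5); row t is True only for positions <= t, i.e. step t may attend to steps 1..t but not to the future
'''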


# ↓ ————————————————————————————————————The Attention module works on four dimensions (batch, heads, sequence length (sentence length), head dimension)
# ↓ The attention computation (query, key, value): its inputs are Query, Key, Value and a Mask, and its output is a tensor
def Attention_function(query, key, value, mask=None, dropout=None):
    # The inputs have dimensions (batch, heads, sequence length (sentence length), head dimension)
    #  Steps of the attention computation:
    #  (1) multiply the query by the (transposed) key
    #  (2) divide by the square root of the key dimension
    #  (3) normalize the result with softmax
    #  (4) multiply the weights by Value to produce OutPut_tensor

    # ↓ The key/query dimension d_k
    KeyDimension_int = query.size(-1)
    # ↓ Query times the transposed key: query(B, H, T, d_k) * key(B, H, d_k, T)
    Scores = torch.matmul(query, key.transpose(-2, -1))
    # ↓ Divide by the square root of the key dimension
    Scores = Scores / math.sqrt(KeyDimension_int)
    # ↓ If a mask is supplied, replace the Scores at positions where the mask is 0 with a very small number (-1e9)
    if mask is not None:
        Scores = Scores.masked_fill(mask == 0, -1e9)
    # ↓ Normalize the Scores with softmax to obtain the attention weights
    Product_Attention = TorchFunction_class.softmax(Scores, dim=-1)
    # ↓ Apply dropout
    if dropout is not None:
        Product_Attention = dropout(Product_Attention)
    # ↓ Multiply the attention weights by Value to produce the final OutPut_tensor
    OutPut_tensor = torch.matmul(Product_Attention, value)
    return OutPut_tensor, Product_Attention
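
# ↓ A minimal, illustrative shape check for Attention_function (example tensors only), using the
# (batch, heads, time, head_size) layout produced inside MultiHeadedAttention below.
'''
ExampleQuery_tensor = torch.randn(2, 8, 10, 64)
ExampleKey_tensor = torch.randn(2, 8, 10, 64)
ExampleValue_tensor = torch.randn(2, 8, 10, 64)
ExampleOutput_tensor, ExampleAttention_tensor = Attention_function(ExampleQuery_tensor, ExampleKey_tensor, ExampleValue_tensor)
print(ExampleOutput_tensor.shape)     # torch.Size([2, 8, 10, 64])
print(ExampleAttention_tensor.shape)  # torch.Size([2, 8, 10, 10]); each row of weights sums to 1
'''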


# ↓ ————————————————————————————————————The "multi-head" mechanism
class MultiHeadedAttention(nn.Module):
    def __init__(self, Hyper_HeadNumber_int, Hyper_ModelDimension_int, Hyper_DropoutRate=0.1):
        """考虑模型大小和头的数量。"""
        super(MultiHeadedAttention, self).__init__()
        print('Initial-MultiHeadedAttention')
        # At initialization, specify the number of heads h (hyperparameter) and the model dimension Hyper_ModelDimension_int
        assert Hyper_ModelDimension_int % Hyper_HeadNumber_int == 0  # assert: the model dimension must be divisible by the number of heads
        # ↓ The size of one head (512 / 8 = 64)
        self.HeadSize_int = Hyper_ModelDimension_int // Hyper_HeadNumber_int
        # ↓ The number of heads
        self.Hyper_HeadNumber_int = Hyper_HeadNumber_int
        # ↓ Clone 4 linear layers of shape ModelDimension × ModelDimension, used for the Q/K/V (and output) linear projections
        self.AlllinearsList_class = CloneLayer_function(nn.Linear(Hyper_ModelDimension_int, Hyper_ModelDimension_int),
                                                        4)
        self.Product_Attention = None
        # ↓ Instantiate a Dropout module
        self.Hyper_DropoutRate = nn.Dropout(p=Hyper_DropoutRate)

    def forward(self, InputQuery_tensor, InputKey_tensor, InputValue_tensor, mask=None):
        print('Forward-MultiHeadedAttention')
        # InputQuery_tensor has three dimensions: (batch (number of sentences), sequence length (sentence length), features (word vector))

        # ↓ "Implement the multi-head attention model." The mask has shape (batch, 1, time); since every head uses the same mask, unsqueeze(1) turns it into (batch, 1, 1, time)
        if mask is not None:
            mask = mask.unsqueeze(1)  # insert an extra dimension at position 1
        # ↓ InputQuery_tensor has three dimensions; here we take the first one: the batch size
        OneBatcheNumber_int = InputQuery_tensor.size(0)
        # ↓ 1) Reshape this batch: d_model => h x d_k (via a list comprehension)
        # Several steps happen here:
        # zip(self.linears, (query, key, value)) pairs (self.linears[0], self.linears[1], self.linears[2]) with (query, key, value) and iterates over them.
        # Q/K/V + self.linears [one layer each] →
        # Q/K/V shape (OneBatcheNumber, SentenceLength, ModelDimension) + view →
        # Q/K/V shape (OneBatcheNumber, SentenceLength, 8, 64) + transpose(1, 2) →
        # Q/K/V shape (OneBatcheNumber, 8, SentenceLength, 64)
        InputQuery_tensor, InputKey_tensor, InputValue_tensor = \
            [linear(InputQKV_tensor).view
             (OneBatcheNumber_int, -1, self.Hyper_HeadNumber_int, self.HeadSize_int).transpose(1, 2)
             # (OneBatcheNumber, time, 8, 64) → transpose → (OneBatcheNumber, 8, time, 64)
             #  ↑ iterate over the cloned ModelDimension × ModelDimension linear layers and project Q/K/V through them (the 4th clone is used for the output projection below)
             for linear, InputQKV_tensor in
             zip(self.AlllinearsList_class, (InputQuery_tensor, InputKey_tensor, InputValue_tensor))]
        # ↓ 2) Apply Attention_function to all of them → Score (batch, 8, time, 64), Product_Attention (batch, 8, time, time)
        Score, self.Product_Attention = Attention_function(InputQuery_tensor, InputKey_tensor, InputValue_tensor,
                                                           mask=mask,
                                                           dropout=self.Hyper_DropoutRate)
        # ↓ 3) Finally, concatenate the per-head results of Attention_function back together
        Score = Score.transpose(1, 2).contiguous().view(OneBatcheNumber_int, -1,
                                                        self.Hyper_HeadNumber_int * self.HeadSize_int)
        return self.AlllinearsList_class[-1](Score)
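
# ↓ A minimal, illustrative usage sketch for MultiHeadedAttention (example values only): self-attention with
# 8 heads on a random batch preserves the (batch, time, d_model) shape.
'''
ExampleMultiHeadedAttention_class = MultiHeadedAttention(8, 512)
ExampleInput_tensor = torch.randn(2, 10, 512)  # (batch, time, d_model)
print(ExampleMultiHeadedAttention_class(ExampleInput_tensor, ExampleInput_tensor, ExampleInput_tensor).shape)  # torch.Size([2, 10, 512])
'''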


# ↓ ————————————————————————————————————Position-wise feed-forward network
# ↓ Each block also contains a fully connected feed-forward network.
# It consists of two linear transformations with a ReLU activation in between.
class PositionwiseFeedForward(nn.Module):
    """Implements FFN equation."""

    def __init__(self, Hyper_ModelDimension_int, Hyper_HiddenNodeNumber_int, Hyper_DropoutRate_float=0.1):
        super(PositionwiseFeedForward, self).__init__()
        print('Initial-PositionwiseFeedForward')
        # ↓ The feed-forward network consists of two linear transformations
        self.FeedForwardLayerOne = nn.Linear(Hyper_ModelDimension_int, Hyper_HiddenNodeNumber_int)  # (512 → 2048)
        self.FeedForwardLayerTwo = nn.Linear(Hyper_HiddenNodeNumber_int, Hyper_ModelDimension_int)
        self.dropout = nn.Dropout(Hyper_DropoutRate_float)

    def forward(self, Input_tensor):
        print('Forward-PositionwiseFeedForward')
        # ↓ First linear transformation
        LayerOnePassed_tensor = self.FeedForwardLayerOne(Input_tensor)
        # ↓ ReLU activation
        ReLUed_tensor = TorchFunction_class.relu(LayerOnePassed_tensor)
        # ↓ Dropout
        DropOuted_tensor = self.dropout(ReLUed_tensor)
        # ↓ Second linear transformation
        Output_tensor = self.FeedForwardLayerTwo(DropOuted_tensor)
        return Output_tensor
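
# ↓ A minimal, illustrative shape check for the position-wise feed-forward network (example values only):
# because the FFN is applied to each position independently, the input shape is preserved.
'''
ExamplePositionwiseFeedForward_class = PositionwiseFeedForward(512, 2048)
ExampleInput_tensor = torch.randn(2, 10, 512)
print(ExamplePositionwiseFeedForward_class(ExampleInput_tensor).shape)  # torch.Size([2, 10, 512])
'''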


# ↓ Embeddings and Softmax
class Embeddings(nn.Module):
    def __init__(self, Hyper_ModelDimension_int, Hyper_AllVocabNumber_int):
        super(Embeddings, self).__init__()
        print('Initial-Embeddings → ', end='')
        print('EmbeddingsSize:' + str(Hyper_ModelDimension_int) + '*' + str(Hyper_AllVocabNumber_int))
        # ↓ The dimension of the word vectors
        self.Hyper_ModelDimension_int = Hyper_ModelDimension_int
        # ↓ Create the embedding lookup matrix: Hyper_AllVocabNumber_int word vectors of dimension Hyper_ModelDimension_int
        self.EmbeddingRequireMatrix_tensor = nn.Embedding(Hyper_AllVocabNumber_int, Hyper_ModelDimension_int)

    def forward(self, InputRequireNumberInt_variable):
        print('Forward-Embeddings')
        # ↓ Square root of ModelDimension
        ModelDimensionSquareRoot = math.sqrt(self.Hyper_ModelDimension_int)
        # ↓ Multiply the embedding lookup by the square root of ModelDimension
        print(InputRequireNumberInt_variable.shape)
        Output_tensor = self.EmbeddingRequireMatrix_tensor(InputRequireNumberInt_variable) * ModelDimensionSquareRoot
        return Output_tensor
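
# ↓ A minimal, illustrative usage sketch for Embeddings (example values only): a batch of token ids is mapped
# to word vectors and scaled by sqrt(d_model).
'''
ExampleEmbeddings_class = Embeddings(512, 10)        # vocabulary of 10 ids, 512-dimensional vectors
ExampleIds_tensor = torch.tensor([[1, 2, 3, 4]])     # (batch=1, time=4)
print(ExampleEmbeddings_class(ExampleIds_tensor).shape)  # torch.Size([1, 4, 512])
'''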

    # ↓ ————————————————————————————————————Positional encoding


# ↓ Positional Encoding
class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, Hyper_ModelDimension_int, Hyper_DropoutRate, Hyper_MaxSentenceLength_int=5000):
        super(PositionalEncoding, self).__init__()
        print('Initial-PositionalEncoding')
        # ↓ Instantiate a Dropout module
        self.Dropout = nn.Dropout(p=Hyper_DropoutRate)

        # ↓ Create the PositionalEncoding matrix (max sentence length × word-vector dimension)
        PositionalEncodingDimension_torch = torch.zeros(Hyper_MaxSentenceLength_int, Hyper_ModelDimension_int)
        # ↓ The position indices 0..max length - 1, as a column vector
        ModelDimensionIndex_int = torch.arange(0.0, Hyper_MaxSentenceLength_int).unsqueeze(1)
        # ↓ The PositionalEncoding formula: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
        PositionalEncodingDimension_torch[:, 0::2] = torch.sin(ModelDimensionIndex_int * torch.exp(
            torch.arange(0.0, Hyper_ModelDimension_int, 2) * -(math.log(10000.0) / Hyper_ModelDimension_int)))  # ↓ even columns
        PositionalEncodingDimension_torch[:, 1::2] = torch.cos(ModelDimensionIndex_int * torch.exp(
            torch.arange(0.0, Hyper_ModelDimension_int, 2) * -(math.log(10000.0) / Hyper_ModelDimension_int)))  # ↓ odd columns
        # ↓ Add a batch dimension
        PositionalEncodingDimension_torch = PositionalEncodingDimension_torch.unsqueeze(0)
        # ↓ register_buffer stores values that are saved with the model but are not trainable parameters
        self.register_buffer('PositionalEncoding', PositionalEncodingDimension_torch)

    def forward(self, Input_tensor):
        print('Forward-PositionalEncoding')
        Input_tensor = Input_tensor + Variable(self.PositionalEncoding[:, :Input_tensor.size(1)],
                                               requires_grad=False)
        return self.Dropout(Input_tensor)


# ↓ The (commented-out) plot below shows the sinusoid added at each position; the frequency and offset of the wave differ for each dimension.
'''
plt.figure(figsize=(15, 5))
pe = PositionalEncoding(20, 0)
y = pe.forward(Variable(torch.zeros(1, 100, 20)))
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d" % p for p in [4, 5, 6, 7]])
None
'''
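
# ↓ A minimal, illustrative numeric check of the encoding (example values only): with d_model = 20,
# column 4 of the matrix should equal sin(pos / 10000^(4/20)).
'''
ExamplePositionalEncoding_class = PositionalEncoding(20, 0)
ExampleExpected_float = math.sin(3 / (10000 ** (4 / 20)))
print(float(ExamplePositionalEncoding_class.PositionalEncoding[0, 3, 4]), ExampleExpected_float)  # the two values match
'''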


# ↓ ————————————————————————————————————Build a complete model————————————————————————————————————
# ↓ Model-construction helper
def CreateTransformerModel_function(Hyper_AllSourceVocabNumber_int, Hyper_AllTargetVocabNumber_int,
                                    Hyper_NetLayer_int=6,
                                    Hyper_ModelDimension_int=512,
                                    Hyper_HiddenNodeNumber_int=2048,
                                    Hyper_HeadNumber_int=8, Hyper_DropoutRate_float=0.1):
    """Helper: Construct a model from hyperparameters.从超参数构造模型。"""

    Copy_MultiHeadedAttention = MultiHeadedAttention(Hyper_HeadNumber_int, Hyper_ModelDimension_int)
    Copy_PositionwiseFeedForward = PositionwiseFeedForward(Hyper_ModelDimension_int, Hyper_HiddenNodeNumber_int,
                                                           Hyper_DropoutRate_float)
    Copy_PositionalEncoding = PositionalEncoding(Hyper_ModelDimension_int, Hyper_DropoutRate_float)
    model = EncoderDecoder(

        Encoder(EncoderLayer(Hyper_ModelDimension_int, copy.deepcopy(Copy_MultiHeadedAttention),
                             copy.deepcopy(Copy_PositionwiseFeedForward), Hyper_DropoutRate_float), Hyper_NetLayer_int),
        Decoder(DecoderLayer(Hyper_ModelDimension_int, copy.deepcopy(Copy_MultiHeadedAttention),
                             copy.deepcopy(Copy_MultiHeadedAttention), copy.deepcopy(Copy_PositionwiseFeedForward),
                             Hyper_DropoutRate_float), Hyper_NetLayer_int),
        nn.Sequential(Embeddings(Hyper_ModelDimension_int, Hyper_AllSourceVocabNumber_int),
                      copy.deepcopy(Copy_PositionalEncoding)),
        nn.Sequential(Embeddings(Hyper_ModelDimension_int, Hyper_AllTargetVocabNumber_int),
                      copy.deepcopy(Copy_PositionalEncoding)),
        #  ↓ Map the model's hidden state to output-word probabilities
        Generator(Hyper_ModelDimension_int, Hyper_AllTargetVocabNumber_int)
    )

    # ↓ Initialize the parameters with Xavier (Glorot) uniform initialization; this is very important
    for Parameters in model.parameters():
        if Parameters.dim() > 1:
            nn.init.xavier_uniform_(Parameters)
    return model


# ↓ Build the model
# ↓ Small example model.
tmp_model = CreateTransformerModel_function(10, 10, 2)  # src_vocab=10, tgt_vocab=10, Hyper_NetLayer_int=2
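
# ↓ A minimal, illustrative inspection of the toy model built above (example only).
'''
print(sum(Parameters.numel() for Parameters in tmp_model.parameters()))  # total number of trainable parameters
print(tmp_model.Generator_class)  # the final Linear(512 → 10) layer; log_softmax is applied in its forward()
'''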


# ↓ ————————————————————————————————————Training
# ↓ Batches and Masking
class Batch:
    """Object for holding a batch of data with mask during training."""

    def __init__(self, InputSourceBatch_variable, InputTargetBatch_variable=None, pad=0):
        print('Initial-Batch →', end='')
        print('Input:' + str(InputSourceBatch_variable.shape) + str(
            InputTargetBatch_variable.shape))  # e.g. (30, 10): 30 is the batch size and 10 is the longest sentence length
        # The input is one batch that contains multiple sentences
        self.InputSourceBatch_variable = InputSourceBatch_variable
        self.InputSourceBatchMask_variable = (InputSourceBatch_variable != pad).unsqueeze(-2)
        if InputTargetBatch_variable is not None:
            self.InputTargetBatchIn_variable = InputTargetBatch_variable[:, :-1]
            self.InputTargetBatchOut_variable = InputTargetBatch_variable[:, 1:]
            self.InputTargetBatchMask_variable = self.make_std_mask(self.InputTargetBatchIn_variable, pad)
            self.ntokens = (self.InputTargetBatchOut_variable != pad).data.sum()

    @staticmethod  # a static method can be called without instantiating the class, e.g. Batch.make_std_mask()
    def make_std_mask(InputTargetBatch_tensor, pad):
        print('Running-make_std_mask_function')
        """创建一个遮罩来隐藏填充和将来的单词。"""
        InputTargetBatchMask_variable = (InputTargetBatch_tensor != pad).unsqueeze(-2)
        InputTargetBatchMask_variable = InputTargetBatchMask_variable & Variable(
            CreateSubSequentMask_function(InputTargetBatch_tensor.size(-1)).type_as(InputTargetBatchMask_variable.data))
        return InputTargetBatchMask_variable
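
# ↓ A minimal, illustrative usage sketch for Batch (example values only): one padded sentence and its masks.
'''
ExampleData_tensor = torch.tensor([[1, 3, 4, 2, 0, 0]])       # one sentence, padded with 0
ExampleBatch_class = Batch(ExampleData_tensor, ExampleData_tensor, 0)
print(ExampleBatch_class.InputSourceBatchMask_variable.shape)  # (1, 1, 6): True where the token is not padding
print(ExampleBatch_class.InputTargetBatchMask_variable.shape)  # (1, 5, 5): padding mask AND subsequent mask
'''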


# ↓ Training Loop
def RunEpoch_function(AllData_Iterator, GlobalModel, LossComput_class):
    """Standard Training and Logging Function"""
    StartTime_float = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    for BatchIndex, OneBatch_tensor in enumerate(AllData_Iterator):
        # Feed the data through the global model to get its output (self, src, tgt, src_mask, tgt_mask):
        GlobalModelOutput_tensor = GlobalModel.forward(OneBatch_tensor.InputSourceBatch_variable,
                                                       OneBatch_tensor.InputTargetBatchIn_variable,
                                                       OneBatch_tensor.InputSourceBatchMask_variable,
                                                       OneBatch_tensor.InputTargetBatchMask_variable)
        loss = LossComput_class(GlobalModelOutput_tensor, OneBatch_tensor.InputTargetBatchOut_variable,
                                OneBatch_tensor.ntokens)
        total_loss += loss
        total_tokens += OneBatch_tensor.ntokens
        tokens += OneBatch_tensor.ntokens
        if BatchIndex % 50 == 1:
            elapsed = time.time() - StartTime_float
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                  (BatchIndex, loss / OneBatch_tensor.ntokens, tokens / elapsed))
            StartTime_float = time.time()
            tokens = 0
    return total_loss / total_tokens


# ↓ Training Data and Batching
global max_src_in_batch, max_tgt_in_batch


def batch_size_fn(new, count, sofar):
    "Keep augmenting batch and calculate total number of tokens + padding."
    global max_src_in_batch, max_tgt_in_batch
    if count == 1:
        max_src_in_batch = 0
        max_tgt_in_batch = 0
    max_src_in_batch = max(max_src_in_batch, len(new.src))
    max_tgt_in_batch = max(max_tgt_in_batch, len(new.trg) + 2)
    src_elements = count * max_src_in_batch
    tgt_elements = count * max_tgt_in_batch
    return max(src_elements, tgt_elements)


# ↓ Optimizer
class Optimizer:
    "Optim wrapper that implements rate."

    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return self.factor * \
               (self.model_size ** (-0.5) *
                min(step ** (-0.5), step * self.warmup ** (-1.5)))


def get_std_opt(model):
    return Optimizer(model.src_embed[0].d_model, 2, 4000,
                     torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))


'''
# ↓ Example of the curves of this model for different model sizes and for optimization hyperparameters.
# ↓ Three settings of the lrate hyperparameters.
opts = [Optimizer(512, 1, 4000, None),
        Optimizer(512, 1, 8000, None),
        Optimizer(256, 1, 4000, None)]
plt.plot(np.arange(1, 20000), [[opt.rate(i) for opt in opts] for i in range(1, 20000)])
plt.legend(["512:4000", "512:8000", "256:4000"])
None
'''
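
# ↓ A minimal, illustrative numeric check of the warm-up schedule (example values only): the learning rate
# factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5) peaks at step == warmup.
'''
ExampleOptimizer_class = Optimizer(512, 1, 4000, None)
print(ExampleOptimizer_class.rate(1))        # very small at the first step
print(ExampleOptimizer_class.rate(4000))     # maximum: 512^-0.5 * 4000^-0.5 ≈ 7.0e-4
print(ExampleOptimizer_class.rate(40000))    # decays as step^-0.5 afterwards
'''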


# ↓ Regularization
# ↓ Label Smoothing
class LabelSmoothing(nn.Module):
    "Implement label smoothing."

    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        print('Initial-LabelSmoothing')
        self.criterion = nn.KLDivLoss(reduction='sum')
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        print('Forward-LabelSmoothing')
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, Variable(true_dist, requires_grad=False))
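
# ↓ A minimal, illustrative sketch of the target distribution LabelSmoothing builds (example values only):
# with smoothing=0.4 and a 5-word vocabulary, 0.6 goes to the gold token and 0.4 is spread over the other
# non-padding tokens.
'''
ExampleCriterion_class = LabelSmoothing(size=5, padding_idx=0, smoothing=0.4)
ExamplePrediction_tensor = torch.log(torch.tensor([[1e-9, 0.2, 0.7, 0.1, 1e-9]]))
ExampleLoss = ExampleCriterion_class(Variable(ExamplePrediction_tensor), Variable(torch.LongTensor([2])))
print(ExampleCriterion_class.true_dist)  # roughly [[0.0000, 0.1333, 0.6000, 0.1333, 0.1333]]
'''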


############################################################ Example 1: the copy task
# Given a random set of input symbols from a small vocabulary, the goal is to generate those same symbols back.
# Create some data (this is a generator): batch=30, nbatches=20
def CreateFakeData_function(Hyper_AllSourceVocabNumber_int, OneBatchNumber_int, AllBatchNumber_int):
    print('Running:CreateFakeData_function')
    # OneBatchNumber_int sentences per batch; build one batch of encoded sequences at a time
    for BatchNumber_int in range(AllBatchNumber_int):
        # Random integer token ids in the range 1 to Hyper_AllSourceVocabNumber_int - 1, with size = (sentences per batch, sentence length 10)
        SentenceData_torch = torch.from_numpy(
            np.random.randint(1, Hyper_AllSourceVocabNumber_int, size=(OneBatchNumber_int, 10))).long()
        SentenceData_torch[:, 0] = 1
        SourceData_variable = Variable(SentenceData_torch, requires_grad=False).cuda()
        TargetData_variable = Variable(SentenceData_torch, requires_grad=False).cuda()
        # Instantiate a Batch object
        CurrentBatchData_class = Batch(SourceData_variable, TargetData_variable, 0)
        print('Output:CurrentBatchData_class:')
        yield CurrentBatchData_class


# Loss Computation
class SimpleLossCompute:
    "A simple loss compute and train function."

    def __init__(self, generator, criterion, opt=None):
        print('Initial-SimpleLossCompute')
        self.generator = generator
        self.criterion = criterion
        self.opt = opt

    def __call__(self, x, y, norm):
        print('Call-SimpleLossCompute')
        x = self.generator(x)
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
                              y.contiguous().view(-1)) / norm
        loss.backward()
        if self.opt is not None:
            self.opt.step()
            self.opt.optimizer.zero_grad()

        OutputLoss = loss.item() * norm

        return OutputLoss


# Greedy Decoding
# Train the simple copy task.
# Set the hyperparameters: a source/target vocabulary of 10 token ids
Hyper_AllSourceVocabNumber_int, Hyper_AllTargetVocabNumber_int = 10, 10
Hyper_NetLayer_int = 6
criterion = LabelSmoothing(size=Hyper_AllSourceVocabNumber_int, padding_idx=0, smoothing=0.0)
# Instantiate the model with the 10-id source/target vocabularies and Hyper_NetLayer_int (6) stacked layers
CopyTestModel_class = CreateTransformerModel_function(Hyper_AllSourceVocabNumber_int, Hyper_AllTargetVocabNumber_int,
                                                      Hyper_NetLayer_int).cuda()
# Define the optimizer
CopyTestModelOptimizer = Optimizer(CopyTestModel_class.TargetEmbedding_class[0].Hyper_ModelDimension_int, 1, 400,
                                   torch.optim.Adam(CopyTestModel_class.parameters(), lr=0, betas=(0.9, 0.98),
                                                    eps=1e-9))
# Train for the specified number of epochs
for Epoch in range(1):
    # Training and evaluation use different modes; call model.train() before training (enables Dropout)
    CopyTestModel_class.train()
    # A generator of 20 batches of 30 sentences each, drawn from the 10-id vocabulary, yielding one batch at a time
    AllBatchData_Iterator = CreateFakeData_function(Hyper_AllSourceVocabNumber_int, 30, 20)
    # Compute the loss
    LossCompute = SimpleLossCompute(CopyTestModel_class.Generator_class, criterion, CopyTestModelOptimizer)
    # RunEpoch_function trains one epoch of data; it takes (AllData_Iterator, GlobalModel, loss_compute)
    RunEpoch_function(AllBatchData_Iterator, CopyTestModel_class, LossCompute)

    # model.eval(): evaluation mode (disables Dropout)
    CopyTestModel_class.eval()
    print(RunEpoch_function(CreateFakeData_function(Hyper_AllSourceVocabNumber_int, 30, 5), CopyTestModel_class,
                            SimpleLossCompute(CopyTestModel_class.Generator_class, criterion, None)))
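
# ↓ The "Greedy Decoding" step named above is not shown in this listing. Below is a minimal sketch adapted from the
# Annotated Transformer's greedy_decode to the variable names used here; treat it as an illustrative assumption,
# not as part of the original code.
'''
def GreedyDecode_function(Model, Source_variable, SourceMask_variable, MaxLength_int, StartSymbol_int):
    Memory_variable = Model.Encoder_function(Source_variable, SourceMask_variable)
    Decoded_tensor = torch.ones(1, 1).fill_(StartSymbol_int).type_as(Source_variable.data)
    for _ in range(MaxLength_int - 1):
        Output_tensor = Model.Decoder_function(Memory_variable, SourceMask_variable, Variable(Decoded_tensor),
                                               Variable(CreateSubSequentMask_function(Decoded_tensor.size(1)).type_as(Source_variable.data)))
        Probability_tensor = Model.Generator_class(Output_tensor[:, -1])
        _, NextWord_tensor = torch.max(Probability_tensor, dim=1)
        Decoded_tensor = torch.cat(
            [Decoded_tensor, torch.ones(1, 1).type_as(Source_variable.data).fill_(NextWord_tensor.item())], dim=1)
    return Decoded_tensor

CopyTestModel_class.eval()
ExampleSource_variable = Variable(torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 1]])).cuda()
ExampleSourceMask_variable = Variable(torch.ones(1, 1, 10)).cuda()
print(GreedyDecode_function(CopyTestModel_class, ExampleSource_variable, ExampleSourceMask_variable, 10, 1))
'''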


Conclusion

That is the end of the code walkthrough. This is my first blog post, so omissions are inevitable; it is mainly a record of my own learning notes. Rewriting the code taught me a great deal, and I hope reading this post is equally useful to you. Feel free to repost, but please credit the source.
