LLM Fundamentals: Implementing a Transformer from Scratch (3)

LLM Fundamentals: Implementing a Transformer from Scratch (1)

LLM Fundamentals: Implementing a Transformer from Scratch (2)


1. Preface

The previous two posts covered the Transformer's Embedding, Tokenizer, Attention, and Position Encoding.
This post continues with the remaining components of the Transformer.

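2. Normalization

2.1 Layer Normalization

LayerNorm is a normalization method proposed for sequence data. It normalizes along the feature (layer) dimension: each token's d-dimensional activation vector is normalized on its own, independently of the other positions in the sequence and of the other samples in the batch.

For each token, LayerNorm computes the mean and variance over all d activations of the layer and uses them to normalize those activations.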

\mu = \frac{1}{d}\sum _{i=1}^{d}x_{i}

\sigma = \sqrt{\frac{1}{d}\sum _{i=1}^{d}(x_{i} - \mu )^{2}}

The normalized activation is:

y = \frac{x - \mu }{\sqrt{\sigma ^{2} +\varepsilon }}\gamma +\beta

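Here γ and β are trainable model parameters. γ is the scale parameter, so the new distribution has variance γ²; β is the shift parameter, so the new distribution has mean β. ε is a small constant added to the variance so that the denominator is never 0.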
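As a quick sanity check of the formulas above, the following minimal sketch normalizes one small vector by hand and compares the result with PyTorch's built-in `torch.nn.functional.layer_norm`. The input values and the choice `eps = 1e-5` are illustrative assumptions, and γ = 1, β = 0 are used so that only the normalization step is exercised.

```python
import torch
import torch.nn.functional as F

# Illustrative input: one token with d = 4 features (values are arbitrary).
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
eps = 1e-5  # assumed value; it equals the default eps of torch.nn.LayerNorm

mu = x.mean()                                  # (1/d) * sum(x_i) = 2.5
var = ((x - mu) ** 2).mean()                   # biased variance sigma^2 = 1.25
y_manual = (x - mu) / torch.sqrt(var + eps)    # gamma = 1, beta = 0

# Reference: PyTorch's functional layer norm over the last dimension.
y_torch = F.layer_norm(x, normalized_shape=(4,), eps=eps)

print(y_manual)
print(y_torch)
```

Both tensors should agree up to floating-point error, and the normalized vector has mean 0 by construction.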

2.2 LayerNorm code implementation

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self,num_features,eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.eps = eps

    def forward(self,x):
        """

            Args:
                x (Tensor): (batch_size, seq_length, d_model)

            Returns:
                Tensor: (batch_size, seq_length, d_model)
        """
        mean = x.mean(dim=-1,keepdim=True)
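        # biased (population) variance over the feature dimension
        var = x.var(dim=-1,keepdim=True,unbiased=False)
        # eps is added to the variance inside the square root, as torch.nn.LayerNorm does
        normalized_x = (x - mean) / torch.sqrt(var + self.eps)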
        return self.gamma * normalized_x + self.beta

if __name__ == '__main__':
    batch_size = 2
    seqlen = 3
    hidden_dim = 4

    # create a random tensor
    x = torch.randn(batch_size,seqlen,hidden_dim)
    print(x)

    # create the LayerNorm module
    layer_norm  = LayerNorm(num_features=hidden_dim)
    output_tensor = layer_norm(x)
    print("output after layer norm:\n",output_tensor)

    torch_layer_norm = torch.nn.LayerNorm(normalized_shape=hidden_dim)
    torch_output_tensor = torch_layer_norm(x)
    print("output after torch layer norm:\n",torch_output_tensor)

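3. Residual connection

A residual connection (also called a skip connection; a layer wrapped this way is called a residual block) is simple: the input of a layer is added back to that layer's output.

Let x be the input of a network layer that contains a nonlinear activation function, and denote that layer by F(x). As a formula:

y = x + F(x)

A minimal implementation in code: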

x = x + layer(x)
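The one-liner above can be wrapped in a small module. The sketch below is an illustration rather than code from this series: the class name `Residual` and the toy `nn.Linear` sublayer are assumptions. It also makes visible that the wrapped layer must preserve the shape of its input, otherwise the addition fails.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Hypothetical helper: computes x + layer(x) for a shape-preserving layer."""
    def __init__(self, layer: nn.Module) -> None:
        super().__init__()
        self.layer = layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity path lets the input (and its gradient) bypass the layer.
        return x + self.layer(x)

d_model = 4
linear = nn.Linear(d_model, d_model)
# Zero the toy layer so that F(x) = 0 and the block reduces to the identity.
nn.init.zeros_(linear.weight)
nn.init.zeros_(linear.bias)

block = Residual(linear)
x = torch.randn(2, 3, d_model)
y = block(x)
print(torch.allclose(x, y))
```

With the sublayer zeroed out, the block returns its input unchanged, which illustrates the identity path that a residual connection adds around a layer.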

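4. Feed-forward network

4.1 Position-wise Feed Forward

The Position-wise Feed Forward network (FFN) is simply a fully connected feed-forward network. Its purpose is to add nonlinearity and strengthen the model's representational capacity.

It is a simple two-layer fully connected network. Rather than collapsing the whole embedded sequence into a single vector, it processes the embedding at each position independently, using the same weights at every position, which is why it is called a position-wise feed-forward layer. It can also be viewed as a one-dimensional convolution with kernel size 1.

The idea is to project the input into an intermediate space of size d_ff (typically larger than d_model) and then project it back to the input dimension d_model.

The FFN formula is:

FFN(x) = f(xW_{1} + b_{1})W_{2} + b_{2}

This formula describes the vector transformation inside the FFN, where f is a nonlinear activation function.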
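To make the position-wise property and the kernel-size-1 convolution view concrete, here is a minimal sketch. It assumes ReLU for f and small illustrative sizes; the helper name `ffn` is made up for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff = 4, 8              # illustrative sizes
ff1 = nn.Linear(d_model, d_ff)
ff2 = nn.Linear(d_ff, d_model)

def ffn(x: torch.Tensor) -> torch.Tensor:
    # FFN(x) = f(x W1 + b1) W2 + b2 with f = ReLU (an assumption for this sketch)
    return ff2(F.relu(ff1(x)))

x = torch.randn(2, 3, d_model)    # (batch_size, seq_length, d_model)
out = ffn(x)

# Position-wise: applying the FFN to one position alone gives the same result
# as that position's slice of the full output.
out_pos1 = ffn(x[:, 1, :])

# The same computation expressed as two Conv1d layers with kernel size 1.
conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)
with torch.no_grad():
    conv1.weight.copy_(ff1.weight.unsqueeze(-1))   # (d_ff, d_model, 1)
    conv1.bias.copy_(ff1.bias)
    conv2.weight.copy_(ff2.weight.unsqueeze(-1))   # (d_model, d_ff, 1)
    conv2.bias.copy_(ff2.bias)

# Conv1d expects (batch, channels, length), so transpose in and out.
out_conv = conv2(F.relu(conv1(x.transpose(1, 2)))).transpose(1, 2)

print(torch.allclose(out[:, 1, :], out_pos1, atol=1e-5))
print(torch.allclose(out, out_conv, atol=1e-5))
```

Both comparisons should hold up to floating-point error: no information flows between positions inside the FFN, so mixing across positions is left entirely to the attention sublayer.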

4.2 FFN code implementation

from torch import nn,Tensor
from torch.nn import functional as F

class PositonWiseFeedForward(nn.Module):
    def __init__(self,d_model:int ,d_ff: int ,dropout: float=0.1) -> None:
        '''

        :param d_model:  dimension of embeddings
        :param d_ff: dimension of feed-forward network
        :param dropout: dropout ratio
        '''
        super().__init__()
        self.ff1 = nn.Linear(d_model,d_ff)
        self.ff2 = nn.Linear(d_ff,d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self,x: Tensor) -> Tensor:
        '''

        :param x:  (batch_size, seq_length, d_model) output from attention
        :return: (batch_size, seq_length, d_model)
        '''
        return self.ff2(self.dropout(F.relu(self.ff1(x))))

5. Transformer Encoder Block

As shown in the figure, the encoder consists of N stacked encoder blocks. We implement the pieces in turn.

from torch import nn,Tensor
# import the modules implemented earlier in this series
from llm_base.attention.MultiHeadAttention1 import MultiHeadAttention
from llm_base.layer_norm.normal_layernorm import LayerNorm
from llm_base.ffn.PositionWiseFeedForward import PositonWiseFeedForward

from typing import *


class EncoderBlock(nn.Module):
    def __init__(self,
                 d_model: int,
                 n_heads: int,
                 d_ff: int,
                 dropout: float,
                 norm_first: bool = False):
        '''

        :param d_model: dimension of embeddings
        :param n_heads: number of heads
        :param d_ff: dimension of inner feed-forward network
        :param dropout:dropout ratio
        :param norm_first : if True, layer norm is done prior to attention and feedforward operations(Pre-Norm).
                Otherwise it's done after(Post-Norm). Default to False.
        '''
        super().__init__()
        self.norm_first = norm_first

        self.attention = MultiHeadAttention(d_model,n_heads,dropout)
        self.norm1 = LayerNorm(d_model)

        self.ff = PositonWiseFeedForward(d_model,d_ff,dropout)
        self.norm2 = LayerNorm(d_model)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    # self attention sub layer
    def _self_attention_sub_layer(self,x: Tensor, attn_mask: Tensor, keep_attentions: bool) -> Tensor:
        x = self.attention(x,x,x,attn_mask,keep_attentions)
        return self.dropout1(x)

    # ffn sub layer
    def _ffn_sub_layer(self,x: Tensor) -> Tensor:
        x = self.ff(x)
        return self.dropout2(x)

    def forward(self,src: Tensor,src_mask: Optional[Tensor] = None,keep_attentions: bool = False) -> Tensor:
        '''

        :param src: (batch_size, seq_length, d_model)
        :param src_mask: (batch_size,  1, seq_length)
        :param keep_attentions: whether to keep attention weights or not. Defaults to False.
        :return: (batch_size, seq_length, d_model) output of encoder block
        '''
        # pass through multi-head attention
        # src (batch_size, seq_length, d_model)
        # attn_score (batch_size, n_heads, seq_length, k_length)
        x = src
        
        # post LN or pre LN
        if self.norm_first:
            # pre LN
            x = x + self._self_attention_sub_layer(self.norm1(x),src_mask,keep_attentions)
            x = x + self._ffn_sub_layer(self.norm2(x))
        
        else:
            x = self.norm1(x + self._self_attention_sub_layer(x,src_mask,keep_attentions))
            x = self.norm2(x + self._ffn_sub_layer(x))
        
        return x
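The encoder itself is then just N such blocks applied one after another. Because the `MultiHeadAttention` module from the earlier posts is not reproduced here, the sketch below uses PyTorch's built-in `nn.TransformerEncoderLayer` as a stand-in for `EncoderBlock`. The hyperparameter values are illustrative assumptions, and dropout is set to 0 only to keep the forward pass deterministic.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_ff, n_layers = 16, 4, 32, 3   # illustrative values

# Stand-in for the EncoderBlock above (Post-Norm, since norm_first=False).
layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=n_heads,
    dim_feedforward=d_ff,
    dropout=0.0,          # assumption: no dropout, for a deterministic sketch
    batch_first=True,     # inputs are (batch_size, seq_length, d_model)
    norm_first=False,
)
# nn.TransformerEncoder copies the layer num_layers times and applies the copies in order.
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

src = torch.randn(2, 5, d_model)
out = encoder(src)
print(out.shape)
```

Every block maps (batch_size, seq_length, d_model) to the same shape, which is what allows an arbitrary number of blocks to be stacked.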


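5.1 Post Norm vs. Pre Norm

Difference in the formulas

The Pre Norm and Post Norm update rules are, respectively:

x_{t+1} = x_{t} + F_{t}(\mathrm{Norm}(x_{t}))

x_{t+1} = \mathrm{Norm}(x_{t} + F_{t}(x_{t}))

Difference in large models

Post-LN: the normalization scheme used in the original Transformer. The input first passes through the sublayer (for example self-attention or the feed-forward network), the sublayer output is added to the residual input, and the sum is then passed through Layer Normalization.

Pre-LN: the input is layer-normalized first and then passed to the sublayer, and the residual addition happens outside the normalization. This ordering may be more stable when training deeper networks, because normalized inputs help mitigate vanishing and exploding gradients during training.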
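The two orderings can be compared side by side in a small self-contained sketch. It is not the `EncoderBlock` of this series: the class name `ToyBlock` is hypothetical, it relies on PyTorch's `nn.MultiheadAttention` and `nn.LayerNorm`, and dropout is omitted to keep the output deterministic.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Hypothetical encoder block supporting both Pre-LN and Post-LN."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int, norm_first: bool) -> None:
        super().__init__()
        self.norm_first = norm_first
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def _sa(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: query, key and value are all x; keep only the output tensor.
        return self.attn(x, x, x, need_weights=False)[0]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.norm_first:
            # Pre-LN: normalize, apply the sublayer, then add the residual.
            x = x + self._sa(self.norm1(x))
            x = x + self.ff(self.norm2(x))
        else:
            # Post-LN: apply the sublayer, add the residual, then normalize.
            x = self.norm1(x + self._sa(x))
            x = self.norm2(x + self.ff(x))
        return x

torch.manual_seed(0)
x = torch.randn(2, 5, 16)
pre = ToyBlock(16, 4, 32, norm_first=True)(x)
post = ToyBlock(16, 4, 32, norm_first=False)(x)

# Post-LN ends with a LayerNorm, so at initialization (gamma = 1, beta = 0)
# every output token has a mean close to zero. Pre-LN leaves the residual
# stream unnormalized, so its per-token mean is generally not zero.
print(post.mean(dim=-1).abs().max())
print(pre.mean(dim=-1).abs().max())
```

The only difference between the two branches is where the normalization sits relative to the residual addition, which is exactly the difference between the two formulas.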

5.2 Why Pre Norm performs worse than Post Norm
