1. Introduction
The Transformer model was first proposed by a Google team in the 2017 paper "Attention Is All You Need" and has fundamentally reshaped the landscape of natural language processing. Compared with traditional RNN and CNN models, the Transformer offers the following core advantages:
- Parallel computation: overcomes the sequential-dependency bottleneck of RNNs
- Long-range dependency modeling: captures global relationships through self-attention
- Scalability: well suited to building very large pre-trained models
2. Overall Architecture
2.1 Architecture Overview
The Transformer uses the classic encoder-decoder structure, consisting of N identical encoder layers and N identical decoder layers (N = 6 in the original paper).
```python
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, N=6,
                 heads=8, dropout=0.1):
        super().__init__()
        self.encoder = Encoder(src_vocab, d_model, N, heads, dropout)
        self.decoder = Decoder(tgt_vocab, d_model, N, heads, dropout)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt, src_mask, tgt_mask):
        # Encode the source, decode the target against it, then project to vocabulary logits
        e_out = self.encoder(src, src_mask)
        d_out = self.decoder(tgt, e_out, src_mask, tgt_mask)
        return self.out(d_out)
```
2.2 Encoder vs. Decoder Comparison

| Component | Encoder | Decoder |
| --- | --- | --- |
| Attention mechanism | Self-attention | Masked self-attention + encoder-decoder attention |
| Feed-forward network | Position-wise FFN | Position-wise FFN |
| Positional encoding | Sinusoidal positional encoding | Sinusoidal positional encoding |
| Number of layers | N (typically 6) | N (typically 6) |
3. Core Components in Detail
3.1 Self-Attention Mechanism
3.1.1 Computation
- Project the input into three matrices: Q (Query), K (Key), and V (Value)
- Compute the attention scores: $Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$
```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, d_k, mask=None, dropout=None):
    # Scaled dot-product attention
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # block masked positions before softmax
    scores = F.softmax(scores, dim=-1)
    if dropout is not None:
        scores = dropout(scores)
    output = torch.matmul(scores, v)
    return output
```
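The `mask` argument is what turns this into the decoder's masked self-attention: positions where the mask is 0 are set to -1e9 before the softmax and so receive essentially zero weight. A minimal sketch of how such a causal (subsequent-position) mask could be built, assuming the convention that 1 means "may attend" and 0 means "blocked" (`causal_mask` is an illustrative helper, not part of the code above):
```python
import torch

def causal_mask(seq_len):
    # Lower-triangular matrix: position i may only attend to positions <= i
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.uint8)).unsqueeze(0)

# Example: a length-5 target sequence
mask = causal_mask(5)   # shape (1, 5, 5), broadcastable over the batch
print(mask[0])          # 1s on and below the diagonal, 0s above
```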
3.1.2 Multi-Head Attention
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, heads, d_model, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.d_k = d_model // heads
        self.h = heads
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        bs = q.size(0)
        q = self.q_linear(q).view(bs, -1, self.h, self.d_k).transpose(1, 2)  # (batch, heads, seq, d_k)
        k = self.k_linear(k).view(bs, -1, self.h, self.d_k).transpose(1, 2)
        v = self.v_linear(v).view(bs, -1, self.h, self.d_k).transpose(1, 2)
        if mask is not None:
            mask = mask.unsqueeze(1)  # broadcast the same mask over every head
        scores = attention(q, k, v, self.d_k, mask, self.dropout)
        concat = scores.transpose(1, 2).contiguous().view(bs, -1, self.d_model)  # merge heads
        return self.out(concat)
```
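A quick shape check may help here: multi-head attention preserves the (batch, seq_len, d_model) shape of its input. A hypothetical usage sketch:
```python
import torch

mha = MultiHeadAttention(heads=8, d_model=512)
x = torch.randn(2, 10, 512)   # (batch=2, seq_len=10, d_model=512)
out = mha(x, x, x)            # self-attention: q = k = v = x
print(out.shape)              # torch.Size([2, 10, 512])
```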
3.2 Positional Encoding
Sine/cosine implementation:
$$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$$
$$PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$$
```python
class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_len=200):
        super().__init__()
        self.d_model = d_model
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # Scale the embeddings, then add the (non-trainable) positional encodings
        x = x * math.sqrt(self.d_model)
        return x + self.pe[:, :x.size(1)]
```
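A brief usage sketch with illustrative sizes, assuming a standard nn.Embedding in front: the positional encoder is applied to the token embeddings before they enter the first encoder layer.
```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000
embed = nn.Embedding(vocab_size, d_model)
pos_enc = PositionalEncoder(d_model)

tokens = torch.randint(0, vocab_size, (2, 20))  # (batch=2, seq_len=20)
x = pos_enc(embed(tokens))                       # (2, 20, 512): embeddings + positions
```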
3.3 Feed-Forward Network
Position-wise feed-forward network:
$$FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$$
```python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=2048, dropout=0.1):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Expand to d_ff with ReLU (plus dropout), then project back to d_model
        x = self.dropout(F.relu(self.linear_1(x)))
        x = self.linear_2(x)
        return x
```
4. Mathematical Principles in Detail
4.1 Self-Attention Formula Derivation
Given an input matrix $X \in \mathbb{R}^{n \times d}$, compute:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
$$Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$
where $W^Q, W^K \in \mathbb{R}^{d \times d_k}$ and $W^V \in \mathbb{R}^{d \times d_v}$.
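To make the shapes concrete, a small worked sketch (dimensions chosen purely for illustration): with n = 3, d = 4, and d_k = d_v = 2, each row of the attention weights sums to 1 and the output lies in $\mathbb{R}^{n \times d_v}$.
```python
import torch
import torch.nn.functional as F

n, d, d_k = 3, 4, 2
X = torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d_k) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v           # each (n, d_k)
weights = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)
print(weights.sum(dim=-1))                    # each row sums to 1
print((weights @ V).shape)                    # torch.Size([3, 2]) = (n, d_v)
```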
4.2 Gradient Flow Analysis
Residual connections keep the gradient path clear; each sublayer output is wrapped as:
$$LayerNorm(x + Sublayer(x))$$
5. Complete Implementation Code
```python
# Encoder layer implementation (pre-norm variant: normalize first, apply the sublayer, add the residual)
class EncoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = nn.LayerNorm(d_model)
        self.norm_2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(heads, d_model)
        self.ff = FeedForward(d_model)
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Self-attention sublayer with residual connection
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn(x2, x2, x2, mask))
        # Feed-forward sublayer with residual connection
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.ff(x2))
        return x
```
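For symmetry with the comparison table in 2.2, here is a sketch of what the matching decoder layer could look like under the same conventions (this class is an illustration, not a verbatim part of the implementation): it adds a masked self-attention sublayer and an encoder-decoder attention sublayer over the encoder output.
```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Sketch of a decoder layer: masked self-attention,
    encoder-decoder attention, then a feed-forward sublayer."""
    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = nn.LayerNorm(d_model)
        self.norm_2 = nn.LayerNorm(d_model)
        self.norm_3 = nn.LayerNorm(d_model)
        self.attn_1 = MultiHeadAttention(heads, d_model)  # masked self-attention
        self.attn_2 = MultiHeadAttention(heads, d_model)  # encoder-decoder attention
        self.ff = FeedForward(d_model)
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.dropout_3 = nn.Dropout(dropout)

    def forward(self, x, e_out, src_mask, tgt_mask):
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn_1(x2, x2, x2, tgt_mask))
        x2 = self.norm_2(x)
        # Queries come from the decoder, keys/values from the encoder output
        x = x + self.dropout_2(self.attn_2(x2, e_out, e_out, src_mask))
        x2 = self.norm_3(x)
        x = x + self.dropout_3(self.ff(x2))
        return x
```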
6. Training and Inference
6.1 Training Procedure
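A minimal sketch of one standard sequence-to-sequence training step, assuming cross-entropy loss with teacher forcing, Encoder/Decoder stacks as referenced in section 2.1, the hypothetical `causal_mask` helper from 3.1.1, and illustrative hyperparameters (padding id 0, Adam settings):
```python
import torch
import torch.nn.functional as F

model = Transformer(src_vocab=10000, tgt_vocab=10000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98), eps=1e-9)

def train_step(src, tgt, src_mask):
    # Teacher forcing: feed tgt[:, :-1], predict tgt[:, 1:]
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
    tgt_mask = causal_mask(tgt_in.size(1))
    logits = model(src, tgt_in, src_mask, tgt_mask)   # (batch, seq, tgt_vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt_out.reshape(-1), ignore_index=0)  # assume 0 = padding id
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```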
6.2 Inference Optimization Techniques
- Beam search
- Length penalty
- Temperature sampling (see the sketch after this list)
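As an illustration of the last item, a minimal sketch of autoregressive decoding with temperature sampling; the function and its parameters are hypothetical, with `sos_id`/`eos_id` standing for whatever start/end token ids the vocabulary uses, and `causal_mask` being the illustrative helper from 3.1.1:
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_decode(model, src, src_mask, sos_id, eos_id, max_len=50, temperature=0.8):
    # Start from the start-of-sequence token and grow the output one token at a time
    tgt = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
    for _ in range(max_len):
        tgt_mask = causal_mask(tgt.size(1))
        logits = model(src, tgt, src_mask, tgt_mask)[:, -1]  # logits for the next token
        probs = F.softmax(logits / temperature, dim=-1)      # temperature < 1 sharpens, > 1 flattens
        next_token = torch.multinomial(probs, num_samples=1)
        tgt = torch.cat([tgt, next_token], dim=1)
        if (next_token == eos_id).all():
            break
    return tgt
```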
7. Applications and Extensions
- BERT: encoder-only
- GPT: decoder-only
- Transformer-XL: handles longer sequences
- Vision Transformer: computer vision applications
8. Summary
With an architecture built entirely on attention, the Transformer removes the fundamental limitations of traditional sequence models. Its core innovations include:
- Parallelizable self-attention computation
- The positional encoding scheme
- Residual connections and layer normalization
- A stackable, modular design
As the era of large models unfolds, the Transformer architecture continues to evolve and to push the boundaries of AI.