[LLM Study Notes] Transformer Basics

License: CC 4.0 BY-SA

Transformer Basics

Transformer Model Architecture

Main components: Encoder, Decoder, Generator.

Encoder

Consists of $N$ EncoderLayer networks with identical structure but separate parameters.
$\textbf{In}: [batch\_sz, seq\_len, d_{model}]$ , $\textbf{Out}: [batch\_sz, seq\_len, d_{model}]$

EncoderLayer: consists of one self-attention Multi-Head Attention sub-network, one Position-wise Feed-Forward sub-network, and the Residual Connection and Layer Normalization that connect the sub-networks.
$\textbf{In}: [batch\_sz, seq\_len, d_{model}], \textbf{Out}: [batch\_sz, seq\_len, d_{model}]$

Self-attention Multi-Head Attention network: Q, K, and V all come from the previous layer (the Input Embedding or the preceding EncoderLayer).
$\textbf{In}: [batch\_sz, seq\_len, d_{model}], \textbf{Out}: [batch\_sz, seq\_len, d_{model}]$

Decoder

Consists of $N$ DecoderLayer networks with identical structure but separate parameters.
$\textbf{In}: [batch\_sz, seq\_len, d_{model}], \textbf{Out}: [batch\_sz, seq\_len, d_{model}]$

DecoderLayer: consists of one self-attention Masked Multi-Head Attention sub-network, one (Encoder-Decoder) attention Multi-Head Attention sub-network, one Position-wise Feed-Forward sub-network, and the Residual Connection and Layer Normalization that connect the sub-networks.
$\textbf{In}: [batch\_sz, seq\_len, d_{model}], \textbf{Out}: [batch\_sz, seq\_len, d_{model}]$

Self-attention Masked Multi-Head Attention network: Q, K, and V all come from the previous layer (the Output Embedding or the preceding DecoderLayer). "Masked" means that a mask of shape $[1, seq\_len, seq\_len]$ hides the subsequent positions, so the prediction for the next position depends only on the current and earlier positions.
$\textbf{In}: [batch\_sz, seq\_len, d_{model}], \textbf{Out}: [batch\_sz, seq\_len, d_{model}]$
(Encoder-Decoder) attention Multi-Head Attention network: Q comes from the previous sub-network (Masked Multi-Head Attention); K and V come from the Encoder output, memory.
$\textbf{In}: [batch\_sz, seq\_len, d_{model}], \textbf{Out}: [batch\_sz, seq\_len, d_{model}]$
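The shapes above use a single $seq\_len$ for simplicity. When the source and target sequences differ in length, the output of the Encoder-Decoder attention follows the length of Q (the target side). Below is a minimal NumPy sketch of that data flow. It is single-head and leaves out the learned projections; the sizes and the names `tgt_len` and `src_len` are assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # [batch_sz, tgt_len, src_len]
    return softmax(scores) @ V                          # [batch_sz, tgt_len, d_model]

rng = np.random.default_rng(0)
batch_sz, tgt_len, src_len, d_model = 2, 3, 7, 16  # hypothetical sizes

x = rng.normal(size=(batch_sz, tgt_len, d_model))       # output of the masked self-attention sub-network
memory = rng.normal(size=(batch_sz, src_len, d_model))  # Encoder output

# Q comes from the decoder side; K and V both come from memory.
out = attention(x, memory, memory)
print(out.shape)
```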

Generator

Consists of a linear network $[\text{In}: d_{model}, \text{Out}:vocab\_sz]$ followed by a Softmax operation.
$\mathrm{softmax}(\mathrm{Linear}(x))=\mathrm{softmax}(xA^T+b)$
Following sequence order, the generator outputs the predicted probabilities for only the next position at each step.
$\textbf{In}: [batch\_sz, d_{model}], \textbf{Out}: [batch\_sz, vocab\_sz]$
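The generator formula can be sketched in NumPy as follows. The sizes, the random weights, and the greedy `argmax` step are assumptions added for illustration; they are not part of the notes.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def generator(x, A, b):
    """x: [batch_sz, d_model], A: [vocab_sz, d_model], b: [vocab_sz]."""
    return softmax(x @ A.T + b)  # [batch_sz, vocab_sz]

rng = np.random.default_rng(0)
batch_sz, d_model, vocab_sz = 2, 8, 11  # hypothetical sizes
x = rng.normal(size=(batch_sz, d_model))   # decoder output at the last position
A = rng.normal(size=(vocab_sz, d_model))
b = np.zeros(vocab_sz)

probs = generator(x, A, b)
next_token = probs.argmax(axis=-1)  # greedy choice of the next token id (one possible decoding rule)
print(probs.shape, next_token.shape)
```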

※ Multi-Head Attention

Scaled Dot-Product Attention:
$\pmb{\mathrm{Attention}(Q,K,V) = \mathrm{softmax}(\frac{QK^{\top}}{\sqrt{d_k}})V}$
Shape changes:

Input:
- $Q\ [batch\_sz,h,seq\_len,d_k]$
- $K\ [batch\_sz,h,seq\_len,d_k]$ , $K^{\top}\ [batch\_sz,h,d_k,seq\_len]$
- $V\ [batch\_sz,h,seq\_len,d_k]$
$QK^{\top}\ [batch\_sz,h,seq\_len,seq\_len]$
$\frac{QK^{\top}}{\sqrt{d_k}}$ and the Mask operation: shape unchanged, $[batch\_sz,h,seq\_len,seq\_len]$
$\mathrm{softmax}(\frac{QK^{\top}}{\sqrt{d_k}})$ : Softmax over the last dimension, shape unchanged, $[batch\_sz,h,seq\_len,seq\_len]$
$\mathrm{softmax}(\frac{QK^{\top}}{\sqrt{d_k}})V$ : $[batch\_sz,h,seq\_len,d_k]$
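The shape bookkeeping above can be checked with a short NumPy sketch. The sizes are arbitrary assumptions, and masking is left out here.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    K_t = np.swapaxes(K, -1, -2)          # [batch_sz, h, d_k, seq_len]
    scores = Q @ K_t / np.sqrt(d_k)       # [batch_sz, h, seq_len, seq_len]
    weights = softmax(scores, axis=-1)    # softmax over the last dimension, same shape
    return weights @ V, weights           # [batch_sz, h, seq_len, d_k]

rng = np.random.default_rng(0)
batch_sz, h, seq_len, d_k = 2, 4, 5, 16  # hypothetical sizes
Q = rng.normal(size=(batch_sz, h, seq_len, d_k))
K = rng.normal(size=(batch_sz, h, seq_len, d_k))
V = rng.normal(size=(batch_sz, h, seq_len, d_k))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```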

Full formula (following FlashAttention), where $\tau$ is the softmax scaling factor (typically $1/\sqrt{d}$, matching $1/\sqrt{d_k}$ above), $N$ is the sequence length, and $d$ is the head dimension:
$\begin{aligned} & S=\tau QK^{\top}\in\mathbb{R}^{N\times N}\\ & S^{\text{masked}}=\text{MASK}(S)\in\mathbb{R}^{N\times N}\\ & P=\text{softmax}(S^{\text{masked}})\in\mathbb{R}^{N\times N}\\ & P^{\text{dropped}}=\text{dropout}(P, p_{drop})\\ & \text{Attention}(Q,K,V)=O=P^{\text{dropped}}V\in\mathbb{R}^{N\times d} \end{aligned}$
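The five steps of the full formula map onto code one-to-one. The sketch below works on a single head with $Q,K,V\in\mathbb{R}^{N\times d}$ and assumes $\tau=1/\sqrt{d}$, a boolean mask where `True` means "keep", and inverted dropout (kept entries scaled by $1/(1-p_{drop})$). The function name `attention_full` is hypothetical.

```python
import numpy as np

def attention_full(Q, K, V, mask=None, p_drop=0.0, rng=None):
    N, d = Q.shape
    tau = 1.0 / np.sqrt(d)                    # assumed scaling factor
    S = tau * (Q @ K.T)                       # S: [N, N]
    if mask is not None:
        S = np.where(mask, S, -1e9)           # S_masked: masked entries get a very negative value
    S = S - S.max(axis=-1, keepdims=True)     # stability shift, does not change softmax
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)     # P: [N, N]
    if p_drop > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(P.shape) >= p_drop  # each entry kept with probability 1 - p_drop
        P = P * keep / (1.0 - p_drop)         # P_dropped
    return P @ V                              # O: [N, d]

rng = np.random.default_rng(0)
N, d = 4, 8  # hypothetical sizes
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
causal = np.tril(np.ones((N, N), dtype=bool))  # lower-triangular causal mask

O = attention_full(Q, K, V, mask=causal)       # p_drop = 0, as at inference time
print(O.shape)
```

With the causal mask, the first row can only attend to position 0, so `O[0]` equals `V[0]`.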

Multi-Head Attention mechanism:
$\begin{aligned} MultiHeadAttn(Q,K,V) &= Concat(head_1, ..., head_h)W^O\\ \mathrm{where}\ head_i &= Attention(QW^Q_i, KW^K_i, VW^V_i) \end{aligned}$

where $W^Q_i\in\mathbb{R}^{d_{model}\times d_k}, W^K_i\in\mathbb{R}^{d_{model}\times d_k}, W^V_i\in\mathbb{R}^{d_{model}\times d_v}, W^O\in\mathbb{R}^{hd_v\times d_{model}}$
In the implementation, $W^Q=(W^Q_1,...,W^Q_h)$ , $W^K=(W^K_1,...,W^K_h)$ , $W^V=(W^V_1,...,W^V_h)$ , and $W^O$ are realized as 4 linear networks of size $[\text{In}: d_{model}, \text{Out}:d_{model}]$ , with $d_k=d_v=d_{model}/h$


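Shape changes:

Input: $X\ [batch\_sz, seq\_len, d_{model}]$
Linear projections: $X\ [batch\_sz, seq\_len, d_{model}]$ → $XW^Q, XW^K, XW^V\ [batch\_sz, seq\_len, d_{model}]$
Split into heads: $[batch\_sz, seq\_len, d_{model}]$ → $Q,K,V\ [batch\_sz,h,seq\_len,d_k]$
Attention: $Q,K,V\ [batch\_sz,h,seq\_len,d_k]$ → $\mathrm{Attention}(Q,K,V)\ [batch\_sz,h,seq\_len,d_k]$
Concatenate heads: $Concat(head_1, ..., head_h)$ : $[batch\_sz,h,seq\_len,d_k]$ → $[batch\_sz, seq\_len, h\cdot d_k] = [batch\_sz, seq\_len, d_{model}]$
Output: $MultiHeadAttn(Q,K,V) = Concat(head_1, ..., head_h)W^O\ [batch\_sz, seq\_len, d_{model}]$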
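A compact NumPy sketch of multi-head self-attention shows where the head split and the concatenation happen. It assumes bias-free projections, random weights, and no mask; the helper names `split_heads` and `merge_heads` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(x, h):
    # [batch_sz, seq_len, d_model] -> [batch_sz, seq_len, h, d_k] -> [batch_sz, h, seq_len, d_k]
    b, s, d_model = x.shape
    return x.reshape(b, s, h, d_model // h).transpose(0, 2, 1, 3)

def merge_heads(x):
    # [batch_sz, h, seq_len, d_k] -> [batch_sz, seq_len, h * d_k]  (the Concat step)
    b, h, s, d_k = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, s, h * d_k)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # Project with the full d_model x d_model matrices, then split into h heads.
    Q, K, V = (split_heads(X @ W, h) for W in (W_q, W_k, W_v))
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # [batch_sz, h, seq_len, seq_len]
    heads = softmax(scores) @ V                         # [batch_sz, h, seq_len, d_k]
    return merge_heads(heads) @ W_o                     # [batch_sz, seq_len, d_model]

rng = np.random.default_rng(0)
batch_sz, seq_len, d_model, h = 2, 5, 32, 4  # hypothetical sizes, d_model divisible by h
X = rng.normal(size=(batch_sz, seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

Y = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(Y.shape)
```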

Position-wise Feed-Forward

$\mathrm{FFN}(x)=\mathrm{Linear}_2(\mathrm{ReLU}(\mathrm{Linear}_1(x)))=\max(0, xW_1 + b_1) W_2 + b_2$

$\mathrm{Linear}_1(x)$ : $[\text{In}:d_{model},\ \text{Out}:d_{ff}]$
$\mathrm{Linear}_2(x)$ : $[\text{In}:d_{ff},\ \text{Out}:d_{model}]$
$\textbf{In}: [batch\_sz, seq\_len, d_{model}], \textbf{Out}: [batch\_sz, seq\_len, d_{model}]$
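"Position-wise" means that the same two linear maps are applied to every position independently. The NumPy sketch below illustrates this with assumed sizes and random weights.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # max(0, x W1 + b1) W2 + b2, applied to the last dimension of x
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
batch_sz, seq_len, d_model, d_ff = 2, 5, 16, 64  # hypothetical sizes
x = rng.normal(size=(batch_sz, seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

y = ffn(x, W1, b1, W2, b2)                # [batch_sz, seq_len, d_model]
y_first = ffn(x[:, :1], W1, b1, W2, b2)   # feeding only position 0 gives the same result for position 0
print(y.shape)
```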

Add&Norm

In the paper (post-Norm):
$\mathrm{SublayerConnection}(X)= \mathrm{LayerNorm}(X +\mathrm{Sublayer}(X))$

In the AnnotatedTransformer implementation (pre-Norm):
$\mathrm{SublayerConnection}(X)= X+\mathrm{Sublayer}(\mathrm{LayerNorm}(X))$

$\textbf{In}: [batch\_sz, seq\_len, d_{model}], \textbf{Out}: [batch\_sz, seq\_len, d_{model}]$

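where:

$\mathrm{Sublayer}\in\{\mathrm{MultiHeadAttn},\mathrm{FFN}\}$
Layer Normalization $\mathrm{LayerNorm}(X)$ : normalizes over the last dimension of tensor $X$ (the $d_{model}$ dimension, which represents each sample), i.e. over each vector $x=X[b,pos,:]\in\mathbb{R}^{d_{model}}$ .
$\mathrm{Norm}(x)=\frac{x-E(x)}{SD(x)+\epsilon}*\gamma+\beta$ , where $E(x)$ is the mean (expectation), $SD(x)$ is the standard deviation, $\gamma,\beta\in\mathbb{R}^{d_{model}}$ are learnable parameters applied element-wise, and $\epsilon$ is a very small scalar added to the denominator for numerical stability (to avoid division by 0).
Residual Connection: $y=x+\mathcal{F}(x)$
Note: for the difference between pre-Norm and post-Norm, see: 【重新了解Transformer模型系列_1】PostNorm/PreNorm的差别 - 知乎 (Zhihu)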
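The two variants differ only in where the normalization sits relative to the residual addition. The NumPy sketch below assumes a toy `sublayer`, omits dropout, and uses the population standard deviation (`np.std` with its default `ddof=0`); a given framework may use a different estimator.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize each vector x[b, pos, :] over the last (d_model) dimension.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)   # population std (ddof=0), an assumption of this sketch
    return gamma * (x - mean) / (std + eps) + beta

def post_norm(x, sublayer, gamma, beta):
    # Paper: LayerNorm(X + Sublayer(X))
    return layer_norm(x + sublayer(x), gamma, beta)

def pre_norm(x, sublayer, gamma, beta):
    # Pre-Norm: X + Sublayer(LayerNorm(X))
    return x + sublayer(layer_norm(x, gamma, beta))

rng = np.random.default_rng(0)
batch_sz, seq_len, d_model = 2, 3, 8  # hypothetical sizes
x = rng.normal(size=(batch_sz, seq_len, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)
sublayer = lambda t: 0.5 * t  # toy stand-in for MultiHeadAttn or FFN

y_post = post_norm(x, sublayer, gamma, beta)
y_pre = pre_norm(x, sublayer, gamma, beta)
print(y_post.shape, y_pre.shape)
```

With post-Norm, every output vector is normalized; with pre-Norm, the unnormalized input `x` passes straight through the residual path.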

Token Embedding

A lookup table with $vocab\_sz$ entries and embedding dimension $d_{model}$ .
$\textbf{In}: [batch\_sz, seq\_len], \textbf{Out}: [batch\_sz, seq\_len, d_{model}]$
$\mathrm{Embedding}(x) = \mathrm{lut}(x)\cdot\sqrt{d_{model}}$
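In NumPy terms, the lookup is plain integer indexing into a weight matrix. The sizes and token ids below are made up for illustration.

```python
import numpy as np

def embed(tokens, lut):
    """tokens: int array [batch_sz, seq_len]; lut: [vocab_sz, d_model]."""
    d_model = lut.shape[1]
    return lut[tokens] * np.sqrt(d_model)  # [batch_sz, seq_len, d_model]

rng = np.random.default_rng(0)
vocab_sz, d_model = 10, 4  # hypothetical sizes
lut = rng.normal(size=(vocab_sz, d_model))

tokens = np.array([[1, 5, 2],
                   [0, 9, 9]])  # [batch_sz=2, seq_len=3]
emb = embed(tokens, lut)
print(emb.shape)
```

Identical token ids map to identical vectors, which is why position information has to be added separately.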

Positional Encoding

Used to inject the position of each token in the sequence into its embedding:
$\begin{aligned} &PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})\\ &PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}}) \end{aligned}$

$\mathrm{PE}(X)=X+ P,\ \text{where}\ (p_{(b,pos,i)})=P,\ p_{(b,pos,i)} = PE_{(pos,i)}$

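where $X,P\in\mathbb{R}^{batch\_sz\times seq\_len\times d_{model}}$ , i.e. $X$ and $P$ are tensors of shape $[batch\_sz,seq\_len,d_{model}]$ ; $p_{(b,pos,i)}$ is the element of $P$ at the corresponding position, $pos$ is the position of the token in the sequence of length $seq\_len$ , and $i$ is the index along the $d_{model}$ dimension.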
$\textbf{In}: [batch\_sz, seq\_len, d_{model}], \textbf{Out}: [batch\_sz, seq\_len, d_{model}]$
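The sinusoidal table can be built once and broadcast over the batch. The NumPy sketch below assumes an even $d_{model}$; the sizes are arbitrary.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]              # [seq_len, 1]
    two_i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2i: [1, d_model/2]
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                    # PE(pos, 2i+1)
    return pe

batch_sz, seq_len, d_model = 2, 6, 8  # hypothetical sizes, d_model assumed even
X = np.zeros((batch_sz, seq_len, d_model))
P = positional_encoding(seq_len, d_model)          # [seq_len, d_model]
out = X + P                                        # broadcast over batch: [batch_sz, seq_len, d_model]
print(out.shape)
```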

Subsequent Mask

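Also called the "Causal Attention Mask" (the term used in "FlashAttention"). It is used in the Decoder's attention network to hide information after the position being predicted, so the prediction relies only on that position and the ones before it.
The mask is applied to the matrix $QK^T/\sqrt{d_k}$ . It is a lower-triangular matrix including the diagonal, which keeps the entries whose $Q$ sequence index $i$ is greater than or equal to the $K^T$ sequence index $j$ . Entries where the mask is 0 (the strict upper triangle) are replaced with a very large negative value (e.g. -1e9), so their weights become approximately 0 after softmax.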
$\text{shape}: [1,seq\_len, seq\_len]$
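A small NumPy sketch of building and applying such a mask follows. The all-zero scores are an assumption chosen so that the resulting weights are easy to read off.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def subsequent_mask(seq_len):
    # Lower-triangular boolean mask including the diagonal, shape [1, seq_len, seq_len].
    return np.tril(np.ones((1, seq_len, seq_len), dtype=bool))

seq_len = 4
mask = subsequent_mask(seq_len)

scores = np.zeros((2, 3, seq_len, seq_len))  # stand-in for QK^T / sqrt(d_k): [batch_sz, h, seq_len, seq_len]
masked = np.where(mask, scores, -1e9)        # mask broadcasts over batch_sz and h
weights = softmax(masked)

print(mask[0].astype(int))
print(weights[0, 0].round(2))
```

Row $i$ of `weights` spreads its probability only over positions $0..i$; for example, row 1 is `[0.5, 0.5, 0, 0]` here.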