Attention is all you need
Transformer
LayerNorm(x + Sublayer(x))
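A minimal PyTorch sketch of this post-norm residual wrapper (the module and variable names are my own; the paper only gives the formula):

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    # Post-norm residual wrapper: output = LayerNorm(x + Sublayer(x)).
    # `sublayer` is any module mapping (batch, seq_len, d_model) -> (batch, seq_len, d_model).
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        return self.norm(x + sublayer(x))
```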
Transformer pseudocode (organized from the paper)
Input: Inputs; Output: Outputs
X = Positional_Encoding(Input_Embedding(Inputs))
X = LayerNorm(X + Multi-Head_Attention(X))
X = LayerNorm(X + Feed_Forward(X))
Y = Positional_Encoding(Output_Embedding(Outputs))
Y = LayerNorm(Y + Masked_Multi-Head_Attention(Y))
Y = LayerNorm(Y + Multi-Head_Attention(Y_Q, X_K, X_V))
Y = LayerNorm(Y + Feed_Forward(Y))
Y = Linear(Y)
Output Probabilities = Softmax(Y)
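A hedged PyTorch sketch of one encoder block following the pseudocode above, using the built-in nn.MultiheadAttention for the attention sublayer; the defaults d_model = 512, h = 8, d_ff = 2048 follow the base model, everything else (class name, layout) is my own choice:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # One encoder block: self-attention sublayer + feed-forward sublayer,
    # each wrapped as LayerNorm(x + Sublayer(x)).
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.self_attn(x, x, x)   # Q, K, V all come from X
        x = self.norm1(x + attn_out)            # X = LayerNorm(X + Multi-Head_Attention(X))
        x = self.norm2(x + self.ff(x))          # X = LayerNorm(X + Feed_Forward(X))
        return x

x = torch.randn(2, 10, 512)        # (batch, sequence length, d_model)
print(EncoderLayer()(x).shape)     # torch.Size([2, 10, 512])
```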
Scaled Dot-Product Attention
$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$
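A minimal implementation sketch of this formula (the function name and the mask convention are my own; the mask argument is what the decoder's masked attention would use):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)   # (..., T_q, T_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))        # hide disallowed positions
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
```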
Multi-Head Attention
$MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O$
where $head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$
$W_i^Q ∈ R^{d_{model} \times d_k}$
$W_i^K ∈ R^{d_{model} \times d_k}$
$W_i^V ∈ R^{d_{model} \times d_v}$
$W^O ∈ R^{hd_v \times d_{model}}$
In this work we employ h = 8 parallel attention layers, or heads.
For each of these we use $d_k = d_v = d_{model}/h = 64$.
Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality.
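A sketch of multi-head attention along these lines. Fusing the per-head $W_i^Q$, $W_i^K$, $W_i^V$ into single $d_{model} \times d_{model}$ projections is an implementation convenience, mathematically equivalent to the per-head form above; `w_o` plays the role of $W^O$:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    # h parallel heads, each of width d_k = d_v = d_model / h (64 when d_model = 512, h = 8).
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # W^Q for all heads, fused
        self.w_k = nn.Linear(d_model, d_model)   # W^K for all heads, fused
        self.w_v = nn.Linear(d_model, d_model)   # W^V for all heads, fused
        self.w_o = nn.Linear(d_model, d_model)   # W^O, shape (h * d_v, d_model)

    def _split(self, x: torch.Tensor) -> torch.Tensor:
        # (B, T, d_model) -> (B, h, T, d_k)
        B, T, _ = x.shape
        return x.view(B, T, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        Q, K, V = self._split(self.w_q(q)), self._split(self.w_k(k)), self._split(self.w_v(v))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)    # scaled dot products per head
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = torch.softmax(scores, dim=-1) @ V                 # (B, h, T, d_k)
        B, _, T, _ = heads.shape
        # Concat(head_1, ..., head_h) W^O
        return self.w_o(heads.transpose(1, 2).contiguous().view(B, T, self.h * self.d_k))
```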
Position-wise Feed-Forward Networks
$FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$
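A sketch, assuming the base-model sizes from the paper ($d_{model} = 512$, inner width $d_{ff} = 2048$); the same two linear layers are applied at every position:

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied identically to each position.
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_2(torch.relu(self.w_1(x)))   # ReLU implements the max(0, ·)
```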
Positional Encoding
$PE_{(pos, 2i)} = sin(pos/10000^{2i/d_{model}}) $
$PE_{(pos, 2i+1)} = cos(pos/10000^{2i/d_{model}}) $
where pos is the position and i is the dimension.
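A sketch that builds the whole sinusoidal table at once (assumes an even $d_{model}$; the function name is my own):

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)           # (max_len, 1)
    inv_freq = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                  # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe   # added to the token embeddings before the first layer

print(positional_encoding(100, 512).shape)   # torch.Size([100, 512])
```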
Transformer pseudocode (rewritten)
Input: Inputs; Output: Outputs
X = Positional_Encoding(Input_Embedding(Inputs))
Q_X, K_X, V_X = X
X = LayerNorm(X + Multi-Head_Attention(Q_X, K_X, V_X))
X = LayerNorm(X + Feed_Forward(X))
Q_X, K_X, V_X = X
Y = Positional_Encoding(Output_Embedding(Outputs))
Q_Y, K_Y, V_Y = Y
Y = LayerNorm(Y + Masked_Multi-Head_Attention(Q_Y, K_Y, V_Y))
Q_Y, K_Y, V_Y = Y
Y = LayerNorm(Y + Multi-Head_Attention(Q_Y, K_X, V_X))
Y = LayerNorm(Y + Feed_Forward(Y))
Y = Linear(Y)
Output Probabilities = Softmax(Y)
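A hedged sketch of one decoder block matching the pseudocode above, again using nn.MultiheadAttention; note that the cross-attention call passes the decoder state as queries and the encoder output as keys and values:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    # Masked self-attention, then cross-attention (Q from the decoder, K and V from the
    # encoder output), then feed-forward; each wrapped as LayerNorm(y + Sublayer(y)).
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor, x_enc: torch.Tensor) -> torch.Tensor:
        T = y.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=y.device), diagonal=1)
        self_out, _ = self.self_attn(y, y, y, attn_mask=causal)   # hide future positions
        y = self.norm1(y + self_out)
        cross_out, _ = self.cross_attn(y, x_enc, x_enc)           # Q_Y, K_X, V_X
        y = self.norm2(y + cross_out)
        return self.norm3(y + self.ff(y))
```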
Hardware and Schedule
We trained our models on one machine with 8 NVIDIA P100 GPUs.
We trained the base models for a total of 100,000 steps or 12 hours.
The big models were trained for 300,000 steps (3.5 days).
Optimizer
We used the Adam optimizer with $β_1 = 0.9$, $β_2 = 0.98$ and $ε = 10^{−9}$.
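In PyTorch this corresponds roughly to the configuration below; the learning-rate value is only a placeholder, since the paper varies it with a warmup schedule that these notes do not cover, and `model` stands in for the full Transformer:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)   # stand-in for the full Transformer
# β1 = 0.9, β2 = 0.98, ε = 1e-9 as quoted above; lr here is a placeholder.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-9)
```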
Regularization
Residual Dropout ($P_{drop} = 0.1$ for the base model)
Label Smoothing ($ε_{ls} = 0.1$)
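A sketch of label smoothing using PyTorch's built-in option on CrossEntropyLoss; the vocabulary size and batch shape below are made up for illustration:

```python
import torch
import torch.nn as nn

# Label smoothing spreads a little probability mass (here 0.1) from the gold token
# over the rest of the vocabulary instead of training against a hard one-hot target.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 32000)            # (positions in the batch, vocabulary size)
targets = torch.randint(0, 32000, (8,))   # gold token ids
loss = criterion(logits, targets)
print(loss.item())
```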