Transformer Summary

This article walks through the architecture and working principles of the Transformer model, including the core components (multi-head attention, positional encoding, and the position-wise feed-forward network), along with the training setup and optimization strategy.


Attention is all you need

Transformer

[Figure: the Transformer model architecture (encoder and decoder stacks)]

Each sub-layer is wrapped in a residual connection followed by layer normalization: $LayerNorm(x + Sublayer(x))$
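A minimal NumPy sketch of this "Add & Norm" wrapper (function names are illustrative; the learnable gain and bias of full layer normalization are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize the last dimension to zero mean and unit variance
    # (the learnable scale and bias of full LayerNorm are left out here).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # "Add & Norm": apply the sub-layer, add the residual input, then normalize.
    return layer_norm(x + sublayer(x))
```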

Transformer pseudocode (summarized)
Input: Inputs    Output: Outputs

X = Positional_Encoding(Input_Embedding(Inputs))
X = LayerNorm(X + Multi-Head_Attention(X))
X = LayerNorm(X + Feed_Forward(X))

Y = Positional_Encoding(Output_Embedding(Outputs))
Y = LayerNorm(Y + Masked_Multi-Head_Attention(Y))
Y = LayerNorm(Y + Multi-Head_Attention(Q_Y, K_X, V_X))
Y = LayerNorm(Y + Feed_Forward(Y))

Y = Linear(Y)
Output Probabilities = Softmax(Y)

[Figure: Scaled Dot-Product Attention and Multi-Head Attention]

Scaled Dot-Product Attention

$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$
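A minimal NumPy sketch of this formula; the optional `mask` argument anticipates the masked attention used in the decoder, and all names here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (..., seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # blocked positions get ~zero weight
    return softmax(scores) @ V                       # (..., seq_q, d_v)
```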

Multi-Head Attention

$MultiHead(Q, K, V) = Concat(head_1, \dots, head_h)W^O$

where $head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$

$W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$
$W_i^K \in \mathbb{R}^{d_{model} \times d_k}$
$W_i^V \in \mathbb{R}^{d_{model} \times d_v}$
$W^O \in \mathbb{R}^{hd_v \times d_{model}}$

In this work we employ h = 8 parallel attention layers, or heads.
For each of these we use $d_k = d_v = d_{model}/h = 64$.
Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality.
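A shape-focused NumPy sketch of multi-head attention with these dimensions, reusing the `scaled_dot_product_attention` sketch above; the randomly initialized weights are purely illustrative:

```python
import numpy as np

d_model, h = 512, 8
d_k = d_v = d_model // h                      # 512 / 8 = 64

rng = np.random.default_rng(0)
W_Q = rng.normal(size=(h, d_model, d_k))      # one projection per head
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_v))
W_O = rng.normal(size=(h * d_v, d_model))     # output projection W^O

def multi_head_attention(Q, K, V):
    # Q, K, V: (seq, d_model). Project per head, attend, concatenate, project back.
    heads = [scaled_dot_product_attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i])
             for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O   # (seq, h*d_v) @ (h*d_v, d_model)
```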

Position-wise Feed-Forward Networks

$FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$
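A minimal NumPy sketch of this position-wise network, using the paper's inner dimension $d_{ff} = 2048$; the random weights are illustrative only:

```python
import numpy as np

d_model, d_ff = 512, 2048

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    # Applied to each position separately and identically: ReLU(x W1 + b1) W2 + b2.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```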

Positional Encoding

$PE_{(pos, 2i)} = sin(pos/10000^{2i/d_{model}}) $

$PE_{(pos, 2i+1)} = cos(pos/10000^{2i/d_{model}}) $

where pos is the position and i is the dimension.
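A minimal NumPy sketch of the sinusoidal encoding; it returns a `(max_len, d_model)` table that is added to the embeddings (d_model is assumed even):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]   # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```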

Rewriting the Transformer pseudocode (making Q, K, V explicit)

Input: Inputs    Output: Outputs

X = Positional_Encoding(Input_Embedding(Inputs))

Q_X, K_X, V_X = X

X = LayerNorm(X + Multi-Head_Attention(Q_X, K_X, V_X))

X = LayerNorm(X + Feed_Forward(X))

Q_X, K_X, V_X = X

Y = Positional_Encoding(Output_Embedding(Outputs))

Q_Y, K_Y, V_Y = Y

Y = LayerNorm(Y + Masked_Multi-Head_Attention(Q_Y, K_Y, V_Y))

Q_Y, K_Y, V_Y = Y

Y = LayerNorm(Y + Multi-Head_Attention(Q_Y, K_X, V_X))

Y = LayerNorm(Y + Feed_Forward(Y))

Y = Linear(Y)
Output Probabilities = Softmax(Y)
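Tying the pseudocode together, here is a sketch of one encoder layer and one decoder layer, reusing the `residual_sublayer`, `multi_head_attention` and `feed_forward` sketches from the sections above; the paper stacks N = 6 of each, and the embeddings, the causal mask, and the final Linear + Softmax are left out:

```python
def encoder_layer(X):
    # Self-attention: Q, K and V all come from the encoder input X.
    X = residual_sublayer(X, lambda x: multi_head_attention(x, x, x))
    return residual_sublayer(X, feed_forward)

def decoder_layer(Y, X_enc):
    # Masked self-attention over the target (the causal mask is omitted in this sketch).
    Y = residual_sublayer(Y, lambda y: multi_head_attention(y, y, y))
    # Cross-attention: queries from the decoder, keys and values from the encoder output.
    Y = residual_sublayer(Y, lambda y: multi_head_attention(y, X_enc, X_enc))
    return residual_sublayer(Y, feed_forward)
```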

Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs.
We trained the base models for a total of 100,000 steps or 12 hours.
The big models were trained for 300,000 steps (3.5 days).

Optimizer

We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$.
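The paper pairs Adam with a warmup learning-rate schedule, $lrate = d_{model}^{-0.5} \cdot min(step^{-0.5}, step \cdot warmup\_steps^{-1.5})$ with $warmup\_steps = 4000$; a minimal Python version (step counts from 1):

```python
def lrate(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps steps, then decay proportional to 1/sqrt(step).
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```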

Regularization

Residual Dropout ($P_{drop}$ = 0.1 for the base model)
Label Smoothing ($\epsilon_{ls}$ = 0.1)
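A minimal sketch of one common label-smoothing formulation with the paper's $\epsilon_{ls} = 0.1$; the helper name and the exact way the smoothing mass is spread are illustrative assumptions:

```python
import numpy as np

def smooth_labels(targets, vocab_size, eps=0.1):
    # targets: integer token indices of shape (n,).
    # Each row puts 1 - eps on the true token and spreads eps over the other tokens.
    dist = np.full((len(targets), vocab_size), eps / (vocab_size - 1))
    dist[np.arange(len(targets)), targets] = 1.0 - eps
    return dist
```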
