Attention is all you need
Transformer
LayerNorm(x + Sublayer(x))
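A minimal PyTorch sketch of this post-norm residual wrapper (the module and variable names are my own; the paper only gives the formula):

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    # Post-norm residual wrapper: output = LayerNorm(x + Sublayer(x)).
    # `sublayer` is any module mapping (batch, seq_len, d_model) -> (batch, seq_len, d_model).
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        return self.norm(x + sublayer(x))
```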
Transformer pseudocode (organized from the paper)
Input: Inputs; Output: Outputs
X = Positional_Encoding(Input_Embedding(Inputs))
X = LayerNorm(X + Multi-Head_Attention(X))
X = LayerNorm(X + Feed_Forward(X))
Y = Positional_Encoding(Output_Embedding(Outputs))
Y = LayerNorm(Y + Masked_Multi-Head_Attention(Y))
Y = LayerNorm(Y + Multi-Head_Attention(Y_Q, X_K, X_V))
Y = LayerNorm(Y + Feed_Forward(Y))
Y = Linear(Y)
Output Probabilities = Softmax(Y)
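A hedged PyTorch sketch of one encoder block following the pseudocode above, using the built-in nn.MultiheadAttention for the attention sublayer; the defaults d_model = 512, h = 8, d_ff = 2048 follow the base model, everything else (class name, layout) is my own choice:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # One encoder block: self-attention sublayer + feed-forward sublayer,
    # each wrapped as LayerNorm(x + Sublayer(x)).
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.self_attn(x, x, x)   # Q, K, V all come from X
        x = self.norm1(x + attn_out)            # X = LayerNorm(X + Multi-Head_Attention(X))
        x = self.norm2(x + self.ff(x))          # X = LayerNorm(X + Feed_Forward(X))
        return x

x = torch.randn(2, 10, 512)        # (batch, sequence length, d_model)
print(EncoderLayer()(x).shape)     # torch.Size([2, 10, 512])
```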
Scaled Dot-Product Attention
$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$
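A minimal implementation sketch of this formula (the function name and the mask convention are my own; the mask argument is what the decoder's masked attention would use):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)   # (..., T_q, T_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))        # hide disallowed positions
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
```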
Multi-Head Attention
$MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O$
where $head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$
$W_i^Q ∈ R^{d_{model} \times d_k}$
$W_i^K ∈ R^{d_{model} \times d_k}$
$W_i^V ∈ R^{d_{model} \times d_v}$
$W^O ∈ R^{hd_v \times d_{model}}$
In this work we employ h = 8 parallel attention layers, or heads.
For each of these we use $d_k = d_v = d_{model}/h = 64$.
Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality.
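A sketch of multi-head attention along these lines. Fusing the per-head $W_i^Q$, $W_i^K$, $W_i^V$ into single $d_{model} \times d_{model}$ projections is an implementation convenience, mathematically equivalent to the per-head form above; `w_o` plays the role of $W^O$:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    # h parallel heads, each of width d_k = d_v = d_model / h (64 when d_model = 512, h = 8).
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # W^Q for all heads, fused
        self.w_k = nn.Linear(d_model, d_model)   # W^K for all heads, fused
        self.w_v = nn.Linear(d_model, d_model)   # W^V for all heads, fused
        self.w_o = nn.Linear(d_model, d_model)   # W^O, shape (h * d_v, d_model)

    def _split(self, x: torch.Tensor) -> torch.Tensor:
        # (B, T, d_model) -> (B, h, T, d_k)
        B, T, _ = x.shape
        return x.view(B, T, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        Q, K, V = self._split(self.w_q(q)), self._split(self.w_k(k)), self._split(self.w_v(v))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)    # scaled dot products per head
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = torch.softmax(scores, dim=-1) @ V                 # (B, h, T, d_k)
        B, _, T, _ = heads.shape
        # Concat(head_1, ..., head_h) W^O
        return self.w_o(heads.transpose(1, 2).contiguous().view(B, T, self.h * self.d_k))
```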
Position-wise Feed-Forward Networks
$FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$
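A sketch, assuming the base-model sizes from the paper ($d_{model} = 512$, inner width $d_{ff} = 2048$); the same two linear layers are applied at every position:

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied identically to each position.
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_2(torch.relu(self.w_1(x)))   # ReLU implements the max(0, ·)
```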
Positional Encoding
$PE_{(pos, 2i)} = sin(pos/10000^{2i/d_{model}}) $
$PE_{(pos, 2i+1)} = cos(pos/10000^{2i/d_{model}}) $
where pos is the position and i is the dimension.
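A sketch that builds the whole sinusoidal table at once (assumes an even $d_{model}$; the function name is my own):

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)           # (max_len, 1)
    inv_freq = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                  # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe   # added to the token embeddings before the first layer

print(positional_encoding(100, 512).shape)   # torch.Size([100, 512])
```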
Transformer pseudocode (rewritten)
Input: Inputs; Output: Outputs
X = Positional_Encoding(Input_Embedding(Inputs))
Q_X, K_X, V_X = X
X = LayerNorm(X + Multi-Head_Attention(Q_X, K_X, V_X))
X = LayerNorm(X + Feed_Forward(X))
Q_X, K_X, V_X = X
Y = Positional_Encoding(Output_Embedding(Outputs))
Q_Y, K_Y, V_Y = Y
Y = LayerNorm(Y + Masked_Multi-Head_Attention(Q_Y, K_Y, V_Y))
Q_Y, K_Y, V_Y = Y
Y = LayerNorm(Y + Multi-Head_Attention(Q_Y, K_X, V_X))
Y = LayerNorm(Y + Feed_Forward(Y))
Y = Linear(Y)
Output Probabilities = Softmax(Y)
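A hedged sketch of one decoder block matching the pseudocode above, again using nn.MultiheadAttention; note that the cross-attention call passes the decoder state as queries and the encoder output as keys and values:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    # Masked self-attention, then cross-attention (Q from the decoder, K and V from the
    # encoder output), then feed-forward; each wrapped as LayerNorm(y + Sublayer(y)).
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor, x_enc: torch.Tensor) -> torch.Tensor:
        T = y.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=y.device), diagonal=1)
        self_out, _ = self.self_attn(y, y, y, attn_mask=causal)   # hide future positions
        y = self.norm1(y + self_out)
        cross_out, _ = self.cross_attn(y, x_enc, x_enc)           # Q_Y, K_X, V_X
        y = self.norm2(y + cross_out)
        return self.norm3(y + self.ff(y))
```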
Hardware and Schedule
We trained our models on one machine with 8 NVIDIA P100 GPUs.
We trained the base models for a total of 100,000 steps or 12 hours.
The big models were trained for 300,000 steps (3.5 days).
Optimizer
We used the Adam optimizer with $β_1 = 0.9$, $β_2 = 0.98$ and $ε = 10^{−9}$.
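In PyTorch this corresponds roughly to the configuration below; the learning-rate value is only a placeholder, since the paper varies it with a warmup schedule that these notes do not cover, and `model` stands in for the full Transformer:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)   # stand-in for the full Transformer
# β1 = 0.9, β2 = 0.98, ε = 1e-9 as quoted above; lr here is a placeholder.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-9)
```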
Regularization
Residual Dropout ($P_{drop} = 0.1$ for the base model)
Label Smoothing ($ε_{ls} = 0.1$)
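A sketch of label smoothing using PyTorch's built-in option on CrossEntropyLoss; the vocabulary size and batch shape below are made up for illustration:

```python
import torch
import torch.nn as nn

# Label smoothing spreads a little probability mass (here 0.1) from the gold token
# over the rest of the vocabulary instead of training against a hard one-hot target.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 32000)            # (positions in the batch, vocabulary size)
targets = torch.randint(0, 32000, (8,))   # gold token ids
loss = criterion(logits, targets)
print(loss.item())
```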