1. Transformer Model Architecture
- Input dimension: 2
- Dimension per attention head: 4
- Hidden layer dimension: 5
- Output dimension: 2
- Number of attention heads: 3 (multi-head attention)
- Number of layers: 1 (a single Transformer encoder layer)
- Activation function: ReLU

(The feed-forward sublayer, which would use the hidden dimension and the ReLU activation, is omitted from this walkthrough for brevity.)
2. Initial Parameters
- Input sequence:
$$\mathbf{x} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$
(2 time steps, with 2 features per time step)
- Embedding weight matrix (the identity, so the input is used directly):
$$\mathbf{W_{emb}} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
- Attention weight matrices (the embedded input serves as the source of the queries, keys, and values; all three heads share these values in this example):
  - Query weight matrix:
$$\mathbf{W_Q} = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 & 0.8 \end{bmatrix}$$
  - Key weight matrix:
$$\mathbf{W_K} = \begin{bmatrix} 0.2 & 0.3 & 0.4 & 0.5 \\ 0.6 & 0.7 & 0.8 & 0.9 \end{bmatrix}$$
  - Value weight matrix:
$$\mathbf{W_V} = \begin{bmatrix} 0.3 & 0.4 & 0.5 & 0.6 \\ 0.7 & 0.8 & 0.9 & 1.0 \end{bmatrix}$$
- Output projection matrix (for the final linear transformation; it maps the 12-dimensional concatenation of the three heads' outputs to the 2-dimensional output by selecting the first two components):
$$\mathbf{W_O} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ \vdots & \vdots \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{12 \times 2}$$
- Target output:
$$\mathbf{y} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
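The setup above can be sketched in NumPy; the variable names are local to this example, and the embedding is the identity so `e` simply equals `x`:

```python
import numpy as np

# Input sequence: 2 time steps, 2 features each
x = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Embedding weight matrix is the identity
W_emb = np.eye(2)

# Per-head projections (2 -> 4); all three heads share these values
W_Q = np.array([[0.1, 0.2, 0.3, 0.4],
                [0.5, 0.6, 0.7, 0.8]])
W_K = np.array([[0.2, 0.3, 0.4, 0.5],
                [0.6, 0.7, 0.8, 0.9]])
W_V = np.array([[0.3, 0.4, 0.5, 0.6],
                [0.7, 0.8, 0.9, 1.0]])

# Output projection (12 -> 2): keeps the first two concatenated features
W_O = np.zeros((12, 2))
W_O[0, 0] = W_O[1, 1] = 1.0

# Target output
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# Embedding step: identity, so e == x
e = x @ W_emb
```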
3. Forward Pass
- Embedding (in this example the embedding weight matrix is the identity, so the embedding equals the input):
$$\mathbf{e} = \mathbf{x} \mathbf{W_{emb}} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$
- Compute queries, keys, and values (separately for each attention head):
- Head 1:
$$\mathbf{Q_1} = \mathbf{e} \mathbf{W_{Q1}} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 & 0.8 \end{bmatrix} = \begin{bmatrix} 1.1 & 1.4 & 1.7 & 2.0 \\ 2.3 & 3.0 & 3.7 & 4.4 \end{bmatrix}$$
$$\mathbf{K_1} = \mathbf{e} \mathbf{W_{K1}} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 0.2 & 0.3 & 0.4 & 0.5 \\ 0.6 & 0.7 & 0.8 & 0.9 \end{bmatrix} = \begin{bmatrix} 1.4 & 1.7 & 2.0 & 2.3 \\ 3.0 & 3.7 & 4.4 & 5.1 \end{bmatrix}$$
$$\mathbf{V_1} = \mathbf{e} \mathbf{W_{V1}} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 0.3 & 0.4 & 0.5 & 0.6 \\ 0.7 & 0.8 & 0.9 & 1.0 \end{bmatrix} = \begin{bmatrix} 1.7 & 2.0 & 2.3 & 2.6 \\ 3.7 & 4.4 & 5.1 & 5.8 \end{bmatrix}$$
- Head 2 (same weights as head 1):
$$\mathbf{Q_2} = \mathbf{e} \mathbf{W_{Q2}} = \mathbf{Q_1}, \quad \mathbf{K_2} = \mathbf{e} \mathbf{W_{K2}} = \mathbf{K_1}, \quad \mathbf{V_2} = \mathbf{e} \mathbf{W_{V2}} = \mathbf{V_1}$$
- Head 3 (same weights as head 1):
$$\mathbf{Q_3} = \mathbf{e} \mathbf{W_{Q3}} = \mathbf{Q_1}, \quad \mathbf{K_3} = \mathbf{e} \mathbf{W_{K3}} = \mathbf{K_1}, \quad \mathbf{V_3} = \mathbf{e} \mathbf{W_{V3}} = \mathbf{V_1}$$
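A quick NumPy check of the projections for head 1, recomputed from the weight matrices defined in Section 2 (hand calculations are easy to slip on, so it is worth verifying numerically):

```python
import numpy as np

e = np.array([[1.0, 2.0], [3.0, 4.0]])
W_Q = np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])
W_K = np.array([[0.2, 0.3, 0.4, 0.5], [0.6, 0.7, 0.8, 0.9]])
W_V = np.array([[0.3, 0.4, 0.5, 0.6], [0.7, 0.8, 0.9, 1.0]])

Q1 = e @ W_Q  # [[1.1, 1.4, 1.7, 2.0], [2.3, 3.0, 3.7, 4.4]]
K1 = e @ W_K  # [[1.4, 1.7, 2.0, 2.3], [3.0, 3.7, 4.4, 5.1]]
V1 = e @ W_V  # [[1.7, 2.0, 2.3, 2.6], [3.7, 4.4, 5.1, 5.8]]
```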
- Compute attention scores (dot-product attention):
- Head 1:
$$\text{Attention Scores}_1 = \mathbf{Q_1} \mathbf{K_1}^T = \begin{bmatrix} 1.1 & 1.4 & 1.7 & 2.0 \\ 2.3 & 3.0 & 3.7 & 4.4 \end{bmatrix} \begin{bmatrix} 1.4 & 3.0 \\ 1.7 & 3.7 \\ 2.0 & 4.4 \\ 2.3 & 5.1 \end{bmatrix} = \begin{bmatrix} 11.92 & 26.16 \\ 25.84 & 56.72 \end{bmatrix}$$
- Heads 2 and 3 (identical weights, hence identical scores):
$$\text{Attention Scores}_2 = \text{Attention Scores}_3 = \text{Attention Scores}_1$$
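The score matrix follows directly from the projections above:

```python
import numpy as np

Q1 = np.array([[1.1, 1.4, 1.7, 2.0], [2.3, 3.0, 3.7, 4.4]])
K1 = np.array([[1.4, 1.7, 2.0, 2.3], [3.0, 3.7, 4.4, 5.1]])

# Dot-product attention scores: (2, 4) @ (4, 2) -> (2, 2)
scores = Q1 @ K1.T
```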
- Compute attention weights (via softmax):
In full scaled dot-product attention, the scores would first be divided by $\sqrt{d_k} = 2$ and then passed through a row-wise softmax. To simplify the hand calculation, we approximate the attention weights as a uniform distribution (i.e., we skip the softmax):
$$\text{Attention Weights}_1 \approx \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}$$
$$\text{Attention Weights}_2 \approx \text{Attention Weights}_1$$
$$\text{Attention Weights}_3 \approx \text{Attention Weights}_1$$
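For comparison, here is what the full scaled softmax would give for these scores. The true weights are heavily skewed toward the second position, so the uniform 0.5/0.5 values above are a deliberate simplification for hand calculation, not an approximation of the actual softmax output:

```python
import numpy as np

scores = np.array([[11.92, 26.16], [25.84, 56.72]])
d_k = 4  # dimension per attention head

# Scaled dot-product attention: divide by sqrt(d_k), then row-wise softmax
scaled = scores / np.sqrt(d_k)
shifted = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
weights = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
```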
- Compute attention outputs:
- Head 1:
$$\text{Attention Output}_1 = \text{Attention Weights}_1 \mathbf{V_1} = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix} \begin{bmatrix} 1.7 & 2.0 & 2.3 & 2.6 \\ 3.7 & 4.4 & 5.1 & 5.8 \end{bmatrix} = \begin{bmatrix} 2.7 & 3.2 & 3.7 & 4.2 \\ 2.7 & 3.2 & 3.7 & 4.2 \end{bmatrix}$$
- Head 2:
$$\text{Attention Output}_2 = \text{Attention Output}_1$$
- Head 3:
$$\text{Attention Output}_3 = \text{Attention Output}_1$$
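With uniform weights, each output row is simply the mean of the value rows:

```python
import numpy as np

weights = np.full((2, 2), 0.5)  # the uniform approximation from above
V1 = np.array([[1.7, 2.0, 2.3, 2.6], [3.7, 4.4, 5.1, 5.8]])

out1 = weights @ V1  # each row is the average of V1's two rows
```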
- Concatenation and linear transformation of the multi-head attention output:
Concatenate the three heads' outputs (concat), producing a $2 \times 12$ matrix:
$$\text{Multi-Head Attention Output} = \begin{bmatrix} \text{Attention Output}_1 & \text{Attention Output}_2 & \text{Attention Output}_3 \end{bmatrix} = \begin{bmatrix} 2.7 & 3.2 & 3.7 & 4.2 & 2.7 & 3.2 & 3.7 & 4.2 & 2.7 & 3.2 & 3.7 & 4.2 \\ 2.7 & 3.2 & 3.7 & 4.2 & 2.7 & 3.2 & 3.7 & 4.2 & 2.7 & 3.2 & 3.7 & 4.2 \end{bmatrix}$$
Apply the linear transformation ($\mathbf{W_O}$ selects the first two columns of the concatenation):
$$\mathbf{\hat{y}} = \text{Multi-Head Attention Output} \cdot \mathbf{W_O} = \begin{bmatrix} 2.7 & 3.2 \\ 2.7 & 3.2 \end{bmatrix}$$
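The concatenation and output projection, recomputed from the head output derived above:

```python
import numpy as np

head_out = np.array([[2.7, 3.2, 3.7, 4.2], [2.7, 3.2, 3.7, 4.2]])
concat = np.concatenate([head_out] * 3, axis=-1)  # 3 identical heads -> (2, 12)

W_O = np.zeros((12, 2))          # output projection:
W_O[0, 0] = W_O[1, 1] = 1.0      # keeps the first two concatenated features

y_hat = concat @ W_O
```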
4. Computing the Loss
Using the mean squared error (MSE) loss:
$$L = \frac{1}{2} \sum (\hat{y} - y)^2 = \frac{1}{2} \left[ (2.7 - 1)^2 + (3.2 - 0)^2 + (2.7 - 0)^2 + (3.2 - 1)^2 \right]$$
$$L = \frac{1}{2} \left[ 2.89 + 10.24 + 7.29 + 4.84 \right] = \frac{1}{2} \times 25.26 = 12.63$$
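The loss computation in NumPy, using the prediction and target from above:

```python
import numpy as np

y_hat = np.array([[2.7, 3.2], [2.7, 3.2]])
y = np.array([[1.0, 0.0], [0.0, 1.0]])

# Half sum-of-squared-errors, as used in the hand calculation
L = 0.5 * np.sum((y_hat - y) ** 2)
```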
5. Backpropagation
For simplicity, we compute only the gradient of the loss with respect to the model output $\hat{y}$ and apply one gradient step to it directly, without propagating back through the individual weight matrices.
- Gradient of the loss with respect to the output:
$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y = \begin{bmatrix} 2.7 - 1 & 3.2 - 0 \\ 2.7 - 0 & 3.2 - 1 \end{bmatrix} = \begin{bmatrix} 1.7 & 3.2 \\ 2.7 & 2.2 \end{bmatrix}$$
- Update the output (gradient descent with learning rate 0.1):
$$\hat{y}_{\text{new}} = \hat{y} - 0.1 \times \frac{\partial L}{\partial \hat{y}} = \begin{bmatrix} 2.7 & 3.2 \\ 2.7 & 3.2 \end{bmatrix} - 0.1 \times \begin{bmatrix} 1.7 & 3.2 \\ 2.7 & 2.2 \end{bmatrix} = \begin{bmatrix} 2.53 & 2.88 \\ 2.43 & 2.98 \end{bmatrix}$$
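The gradient and update step, which follow from the fact that for $L = \frac{1}{2}\sum(\hat{y}-y)^2$ the gradient with respect to $\hat{y}$ is simply $\hat{y} - y$:

```python
import numpy as np

y_hat = np.array([[2.7, 3.2], [2.7, 3.2]])
y = np.array([[1.0, 0.0], [0.0, 1.0]])

grad = y_hat - y                 # dL/d(y_hat) for the half-SSE loss
y_hat_new = y_hat - 0.1 * grad   # one gradient-descent step, lr = 0.1
```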
6. Summary
This hand-calculated example walked through the forward pass, loss computation, and backpropagation of a Transformer with three attention heads of dimension 4 and a hidden dimension of 5. To keep the arithmetic tractable, we approximated the attention computation by replacing the softmax with uniform weights. Real Transformer computations are far more involved, but the underlying principles are the same, and this example illustrates the basic mechanics of how a Transformer operates and how it is trained.