Please credit the source when reposting.
Problem
The usual scalar form of the backpropagation formulas is easy to follow in a derivation, but unfriendly to implement in code.
So here backpropagation is written out in vectorized form. Specifically, the goals are:
- Fully vectorized notation: no $\sum$ appears anywhere, only matrix arithmetic and element-wise operations
- An L2 regularization term is included
- Everything is written directly for a whole batch, not sample by sample
- Multi-class cross-entropy loss
Notes
- The point of these requirements is that the formulas then translate directly into vectorized code
- In frameworks such as PyTorch and NumPy, vectorized computation is far faster than accumulating with loops
- This post borrows from 矩阵求导术(上) (strongly recommended!) and follows its notation, but it is self-contained if you have not read that article
Network architecture
- The network structure must be defined precisely and stated clearly
- The input data are $X, Y$
- The network has three layers in total:
$$fc_1: X \rightarrow F_1, \qquad relu: F_1 \rightarrow G_1, \qquad fc_2: G_1 \rightarrow F_2$$
- The input dimension is $D$, the hidden dimension is $H$, and the output dimension is $C$, so the parameter shapes are
$$W_1 \in \mathbb{R}^{H \times D}, \quad W_2 \in \mathbb{R}^{C \times H}, \quad b_1 \in \mathbb{R}^{H \times 1}, \quad b_2 \in \mathbb{R}^{C \times 1}$$
- Writing $N$ for the batch size, the input, hidden activations, and class labels have shapes
$$X \in \mathbb{R}^{D \times N}, \quad F_1, G_1 \in \mathbb{R}^{H \times N}, \quad F_2, Y \in \mathbb{R}^{C \times N}$$
Here $Y$ is one-hot encoded.
- I use the column-vector convention throughout; if you prefer row vectors, see the handwritten notes at the end of the post
Forward propagation
$$F_1 = W_1 X + b_1 \mathbf{1}_{1 \times N} \tag{1}$$
$$G_1 = \max(0, F_1) \tag{2}$$
$$F_2 = W_2 G_1 + b_2 \mathbf{1}_{1 \times N} \tag{3}$$
where $\mathbf{1}_{a \times b}$ denotes the all-ones matrix of shape $a \times b$.
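Equations (1)-(3) translate line by line into NumPy. A minimal sketch (the function name `forward` and the test shapes are my own, following the conventions above):

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Forward pass in the column-vector convention.

    X: (D, N), W1: (H, D), b1: (H, 1), W2: (C, H), b2: (C, 1).
    b1 @ 1_{1xN} tiles the bias across the batch, exactly as
    b_1 1_{1xN} does in equations (1) and (3).
    """
    N = X.shape[1]
    ones_1N = np.ones((1, N))
    F1 = W1 @ X + b1 @ ones_1N   # eq (1)
    G1 = np.maximum(0.0, F1)     # eq (2), element-wise ReLU
    F2 = W2 @ G1 + b2 @ ones_1N  # eq (3)
    return F1, G1, F2
```

In practice one would let broadcasting (`W1 @ X + b1`) replace the explicit `@ ones_1N`, but the explicit product mirrors the formulas exactly.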
Let $\mathcal{L}_1$ denote the empirical loss term:
$$\mathcal{L}_1 = [\ln (\mathbf{1}_{1\times C}e^{F_2}) - \mathbf{1}_{1\times C}(F_2 \odot Y)]\, \mathbf{1}_{N\times 1} \tag{4}$$
Here "$\odot$" denotes element-wise multiplication, which requires the two operands to have the same shape.
Throughout this post, the exponential $e^{\cdot}$ and the logarithm $\ln(\cdot)$ are applied element-wise.
The averaging step $\frac{1}{N}\mathcal{L}_1$ is deferred until later, to keep the derivatives easier to write.
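Equation (4) can be checked directly against the familiar per-sample cross-entropy. A minimal sketch (the helper name `empirical_loss` is mine):

```python
import numpy as np

def empirical_loss(F2, Y):
    """Equation (4): L1 = [ln(1_{1xC} e^{F2}) - 1_{1xC}(F2 ⊙ Y)] 1_{Nx1}.

    F2: (C, N) scores, Y: (C, N) one-hot labels. The 1/N averaging is
    deferred to equation (6), as in the text.
    """
    C, N = F2.shape
    ones_1C = np.ones((1, C))
    # (1, N) row: ln of column sums, minus the score of the true class.
    row = np.log(ones_1C @ np.exp(F2)) - ones_1C @ (F2 * Y)
    return (row @ np.ones((N, 1))).item()
```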
A brief justification of equation (4): for sample $i$,
$$\mathcal{L}_1^i = -\ln \frac{e^{{F_2}_{y_i i}}}{\sum_{y=1}^{C} e^{{F_2}_{y i}}} = \ln\Big(\sum_{y=1}^{C} e^{{F_2}_{y i}}\Big) - {F_2}_{y_i i} = \ln(\mathbf{1}_{1\times C}e^{{F_2}_{\cdot i}}) - \mathbf{1}_{1\times C}({F_2}_{\cdot i} \odot Y_{\cdot i})$$
where $y_i \in \{1, 2, \ldots, C\}$ denotes the class label of sample $i$. Once you note that $Y_{\cdot i}$ is one-hot, the derivation follows easily.
Since $\mathcal{L}_1 = \sum_{i=1}^{N}\mathcal{L}_1^i$, equation (4) follows; this is easy to verify for yourself.
[
- An aside, safe to skip:
Writing $\hat{Y} \in \mathbb{R}^{C \times N}$ for the network output after softmax,
$$\hat{Y} = e^{F_2} \odot \frac{1}{\mathbf{1}_{C \times C} e^{F_2}}$$
where $\frac{1}{A}$ denotes the element-wise reciprocal of $A$. Then $\mathcal{L}_1$ can also be written as
$$\mathcal{L}_1 = -\ln(\mathbf{1}_{1\times C} (\hat{Y} \odot Y))\, \mathbf{1}_{N \times 1}$$
This form avoids using $F_2$ directly, but its equivalence to equation (4) is harder to see, and it is less convenient for the derivatives below.
These two expressions are, however, easy to assemble into a computation graph, which makes them handy for implementation.
The rest of the derivation sticks with equation (4).
]
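The aside's two formulas are easy to check numerically against equation (4); a throwaway sketch (the function name is mine):

```python
import numpy as np

def loss_via_softmax(F2, Y):
    """Aside form: Ŷ = e^{F2} ⊙ (1 / (1_{CxC} e^{F2})), then
    L1 = -ln(1_{1xC}(Ŷ ⊙ Y)) 1_{Nx1}.

    1_{CxC} @ e^{F2} puts each column sum in every row, so the
    element-wise reciprocal normalises each column (softmax per sample).
    """
    C, N = F2.shape
    E = np.exp(F2)
    Y_hat = E * (1.0 / (np.ones((C, C)) @ E))
    return (-np.log(np.ones((1, C)) @ (Y_hat * Y)) @ np.ones((N, 1))).item()
```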
Let $\mathcal{L}_2$ denote the regularization loss, with $\mathcal{L}_{21}$ the part for $W_1$ and $\mathcal{L}_{22}$ the part for $W_2$:
$$\mathcal{L}_2 = \mathcal{L}_{21} + \mathcal{L}_{22} = \mathbf{1}_{1\times H}(W_1 \odot W_1)\,\mathbf{1}_{D\times 1} + \mathbf{1}_{1\times C}(W_2 \odot W_2)\,\mathbf{1}_{H\times 1} \tag{5}$$
$$\mathcal{L} = \frac{1}{N}\mathcal{L}_1 + \lambda \mathcal{L}_2 \tag{6}$$
Equations (1)-(6) fully describe the vectorized forward pass.
The next section gives the vectorized backward pass; the climax is coming, get ready.
Backpropagation
First, two quick pieces of background.
- Jacobian identification: if $df = tr(A\,dX)$, then
$$\frac{\partial f}{\partial X} = A^T$$
In other words, to obtain $\frac{\partial f}{\partial X}$ it suffices to massage $df$ into the form $tr(A\,dX)$; the derivative can then be read off as $A^T$.
- Some properties of the trace. These are important; the tricks below are used repeatedly:
  - Exchanging matrix and element-wise products: $tr\{A^T(B \odot C)\} = tr\{(A \odot B)^T C\}$
  - Cyclic property: $tr\{AB\} = tr\{BA\}$
  - Invariance under transposition: $tr\{A\} = tr\{A^T\}$
  - Moving the differential inside the trace: $d\,tr\{A\} = tr\{dA\}$
Also, recall that $\frac{1}{A}$ denotes the element-wise reciprocal of $A$.
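Before trusting these identities in a long derivation, they are easy to verify numerically; a throwaway NumPy check (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(42)

# Exchange rule: tr{A^T (B ⊙ C)} = tr{(A ⊙ B)^T C}, for same-shaped A, B, C.
# Both sides equal the triple element-wise sum of A, B, C.
A, B, C = (rng.standard_normal((3, 4)) for _ in range(3))
assert np.isclose(np.trace(A.T @ (B * C)), np.trace((A * B).T @ C))

# Cyclic property: tr{PQ} = tr{QP}.
P, Q = rng.standard_normal((3, 5)), rng.standard_normal((5, 3))
assert np.isclose(np.trace(P @ Q), np.trace(Q @ P))

# Invariance under transposition: tr{M} = tr{M^T}.
M = rng.standard_normal((4, 4))
assert np.isclose(np.trace(M), np.trace(M.T))
```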
Let's begin.
First, sketch the computation graph.
Empirical loss term
$$d\mathcal{L} = \frac{1}{N}\,d\mathcal{L}_1 + \lambda\, d\mathcal{L}_2 \tag{7}$$
$$\begin{aligned}
d\mathcal{L}_1 &= \big[d[\ln (\mathbf{1}_{1\times C}e^{F_2})] - \mathbf{1}_{1\times C}(dF_2 \odot Y)\big]\,\mathbf{1}_{N\times 1} \\
&= \Big[d[\mathbf{1}_{1\times C}e^{F_2}] \odot \frac{1}{\mathbf{1}_{1\times C}e^{F_2}} - \mathbf{1}_{1\times C}(dF_2 \odot Y)\Big]\,\mathbf{1}_{N\times 1} \\
&= \Big[(\mathbf{1}_{1\times C}(e^{F_2}\odot dF_2)) \odot \frac{1}{\mathbf{1}_{1\times C}e^{F_2}} - \mathbf{1}_{1\times C}(dF_2 \odot Y)\Big]\,\mathbf{1}_{N\times 1} \\
&= tr\Big\{\Big[(\mathbf{1}_{1\times C}(e^{F_2}\odot dF_2)) \odot \frac{1}{\mathbf{1}_{1\times C}e^{F_2}} - \mathbf{1}_{1\times C}(dF_2 \odot Y)\Big]\,\mathbf{1}_{N\times 1}\Big\} \\
&= tr\Big\{\Big[(\mathbf{1}_{1\times C}(e^{F_2}\odot dF_2)) \odot \frac{1}{\mathbf{1}_{1\times C}e^{F_2}}\Big]\,\mathbf{1}_{N\times 1}\Big\} - tr\{\mathbf{1}_{1\times C}(dF_2 \odot Y)\,\mathbf{1}_{N\times 1}\} \\
&= tr\Big\{\mathbf{1}_{N\times 1}\Big[(\mathbf{1}_{1\times C}(e^{F_2}\odot dF_2)) \odot \frac{1}{\mathbf{1}_{1\times C}e^{F_2}}\Big]\Big\} - tr\{\mathbf{1}_{N\times C}(dF_2 \odot Y)\} \\
&= tr\Big\{\big[\mathbf{1}_{1\times N}\odot(\mathbf{1}_{1\times C}(e^{F_2}\odot dF_2))\big]^T \frac{1}{\mathbf{1}_{1\times C}e^{F_2}}\Big\} - tr\{(\mathbf{1}_{C\times N}\odot dF_2)^T Y\} \\
&= tr\Big\{\big[\mathbf{1}_{1\times C}(e^{F_2}\odot dF_2)\big]^T \frac{1}{\mathbf{1}_{1\times C}e^{F_2}}\Big\} - tr\{dF_2^T Y\} \\
&= tr\Big\{(dF_2^T \odot {e^{F_2}}^T)\Big(\mathbf{1}_{C\times 1}\frac{1}{\mathbf{1}_{1\times C}e^{F_2}}\Big)\Big\} - tr\{dF_2^T Y\} \\
&= tr\Big\{dF_2^T\,\Big[e^{F_2} \odot \Big(\mathbf{1}_{C\times 1}\frac{1}{\mathbf{1}_{1\times C}e^{F_2}}\Big)\Big]\Big\} - tr\{dF_2^T Y\} \\
&= tr\Big\{\Big[e^{F_2} \odot \Big(\mathbf{1}_{C\times 1}\frac{1}{\mathbf{1}_{1\times C}e^{F_2}}\Big) - Y\Big]^T dF_2\Big\}
\end{aligned}$$
After this chain of manipulations, Jacobian identification gives
$$\frac{\partial \mathcal{L}_1}{\partial F_2} = e^{F_2} \odot \Big(\mathbf{1}_{C\times 1}\frac{1}{\mathbf{1}_{1\times C}e^{F_2}}\Big) - Y \tag{8}$$
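Equation (8) is just "column-wise softmax minus the one-hot labels", computed for the whole batch at once. A sketch with a finite-difference check (the function name is mine; a numerically stable implementation would subtract the column max before exponentiating):

```python
import numpy as np

def dL1_dF2(F2, Y):
    """Equation (8): e^{F2} ⊙ (1_{Cx1} (1 / (1_{1xC} e^{F2}))) - Y.

    1_{1xC} e^{F2} is the (1, N) row of column sums; left-multiplying
    its element-wise reciprocal by 1_{Cx1} tiles it to (C, N).
    """
    C = F2.shape[0]
    E = np.exp(F2)
    return E * (np.ones((C, 1)) @ (1.0 / (np.ones((1, C)) @ E))) - Y
```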
Next, we backpropagate. From equation (3),
$$\begin{aligned}
d\mathcal{L}_1 &= tr\Big\{\frac{\partial \mathcal{L}_1}{\partial F_2}^T dF_2\Big\} \\
&= tr\Big\{\frac{\partial \mathcal{L}_1}{\partial F_2}^T (dW_2\, G_1 + W_2\, dG_1 + db_2\, \mathbf{1}_{1\times N})\Big\} \\
&= tr\Big\{G_1 \frac{\partial \mathcal{L}_1}{\partial F_2}^T dW_2\Big\} + tr\Big\{\frac{\partial \mathcal{L}_1}{\partial F_2}^T W_2\, dG_1\Big\} + tr\Big\{\mathbf{1}_{1\times N} \frac{\partial \mathcal{L}_1}{\partial F_2}^T db_2\Big\}
\end{aligned}$$
Jacobian identification gives
$$\frac{\partial \mathcal{L}_1}{\partial W_2} = \frac{\partial \mathcal{L}_1}{\partial F_2} G_1^T \tag{9}$$
$$\frac{\partial \mathcal{L}_1}{\partial b_2} = \frac{\partial \mathcal{L}_1}{\partial F_2} \mathbf{1}_{N\times 1} \tag{10}$$
$$\frac{\partial \mathcal{L}_1}{\partial G_1} = W_2^T \frac{\partial \mathcal{L}_1}{\partial F_2} \tag{11}$$
Equations (9) and (10) give the update formulas of the empirical loss term for $W_2$ and $b_2$.
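Equations (9)-(11) in code (a sketch; `dF2` stands for $\partial\mathcal{L}_1/\partial F_2$ from equation (8), and the function name is mine):

```python
import numpy as np

def backprop_fc2(dF2, G1, W2):
    """Equations (9)-(11): given dF2 = ∂L1/∂F2 of shape (C, N) and the
    cached activations, return gradients for W2 (C, H), b2 (C, 1), G1 (H, N)."""
    N = dF2.shape[1]
    dW2 = dF2 @ G1.T             # eq (9)
    db2 = dF2 @ np.ones((N, 1))  # eq (10): right-multiplying by 1_{Nx1} sums over the batch
    dG1 = W2.T @ dF2             # eq (11)
    return dW2, db2, dG1
```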
We continue backpropagating. From equation (2),
$$\begin{aligned}
d\mathcal{L}_1 &= tr\Big\{\frac{\partial \mathcal{L}_1}{\partial G_1}^T dG_1\Big\} + \ldots \text{(terms without $G_1$)} \\
&= tr\Big\{\frac{\partial \mathcal{L}_1}{\partial G_1}^T [\mathbb{I}(F_1 > 0) \odot dF_1]\Big\} + \ldots \\
&= tr\Big\{\Big[\frac{\partial \mathcal{L}_1}{\partial G_1} \odot \mathbb{I}(F_1 > 0)\Big]^T dF_1\Big\} + \ldots
\end{aligned}$$
Jacobian identification gives
$$\frac{\partial \mathcal{L}_1}{\partial F_1} = \frac{\partial \mathcal{L}_1}{\partial G_1} \odot \mathbb{I}(F_1 > 0) \tag{12}$$
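Equation (12) is the familiar ReLU mask, written without loops; a one-line sketch (the function name is mine):

```python
import numpy as np

def relu_backward(dG1, F1):
    """Equation (12): ∂L1/∂F1 = ∂L1/∂G1 ⊙ I(F1 > 0).

    (F1 > 0) is the boolean indicator matrix; the element-wise product
    zeroes the gradient wherever the ReLU was inactive.
    """
    return dG1 * (F1 > 0)
```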
From equation (1),
$$\begin{aligned}
d\mathcal{L}_1 &= tr\Big\{\frac{\partial \mathcal{L}_1}{\partial F_1}^T dF_1\Big\} + \ldots \text{(terms without $F_1$)} \\
&= tr\Big\{\frac{\partial \mathcal{L}_1}{\partial F_1}^T (dW_1\, X + db_1\, \mathbf{1}_{1\times N})\Big\} + \ldots \\
&= tr\Big\{X \frac{\partial \mathcal{L}_1}{\partial F_1}^T dW_1\Big\} + tr\Big\{\mathbf{1}_{1\times N}\frac{\partial \mathcal{L}_1}{\partial F_1}^T db_1\Big\} + \ldots
\end{aligned}$$
Jacobian identification gives
$$\frac{\partial \mathcal{L}_1}{\partial W_1} = \frac{\partial \mathcal{L}_1}{\partial F_1} X^T \tag{13}$$
$$\frac{\partial \mathcal{L}_1}{\partial b_1} = \frac{\partial \mathcal{L}_1}{\partial F_1} \mathbf{1}_{N\times 1} \tag{14}$$
Equations (13) and (14) give the update formulas of the empirical loss term for $W_1$ and $b_1$. Note the similarity between these two formulas and the $W_2, b_2$ updates (9) and (10): the backpropagation formulas for deeper networks can be obtained by the same induction.
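Chaining equations (8)-(14) gives the complete backward pass for the empirical term, which can be verified against finite differences. A minimal sketch (function and variable names are mine):

```python
import numpy as np

def backward_L1(X, Y, F1, G1, F2, W2):
    """Gradients of L1 (no 1/N factor, no regularizer) via eqs (8)-(14).
    Returns gradients in the same shapes as the parameters."""
    C, N = F2.shape
    E = np.exp(F2)
    dF2 = E * (np.ones((C, 1)) @ (1.0 / (np.ones((1, C)) @ E))) - Y  # eq (8)
    dW2 = dF2 @ G1.T                                                 # eq (9)
    db2 = dF2 @ np.ones((N, 1))                                      # eq (10)
    dG1 = W2.T @ dF2                                                 # eq (11)
    dF1 = dG1 * (F1 > 0)                                             # eq (12)
    dW1 = dF1 @ X.T                                                  # eq (13)
    db1 = dF1 @ np.ones((N, 1))                                      # eq (14)
    return dW1, db1, dW2, db2
```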
Regularization loss term
For $\mathcal{L}_2$, keeping only the terms that involve $W_1$ (the $\mathcal{L}_{22}$ part does not):
$$\begin{aligned}
d\mathcal{L}_{2} &= \mathbf{1}_{1\times H}(W_1 \odot dW_1)\,\mathbf{1}_{D\times 1} + \mathbf{1}_{1\times H}(dW_1 \odot W_1)\,\mathbf{1}_{D\times 1} + \ldots \\
&= 2\,tr\{\mathbf{1}_{1\times H}(W_1 \odot dW_1)\,\mathbf{1}_{D\times 1}\} + \ldots \\
&= 2\,tr\{\mathbf{1}_{D\times H}(W_1 \odot dW_1)\} + \ldots \\
&= 2\,tr\{(\mathbf{1}_{H\times D} \odot W_1)^T dW_1\} + \ldots \\
&= tr\{2W_1^T dW_1\} + \ldots
\end{aligned}$$
Jacobian identification gives
$$\frac{\partial \mathcal{L}_2}{\partial W_1} = 2W_1 \tag{15}$$
and similarly
$$\frac{\partial \mathcal{L}_2}{\partial W_2} = 2W_2 \tag{16}$$
As expected, the effect of the L2 term on the parameters agrees with the scalar derivation.
The vectorized derivation for an L1 regularization term is similar.
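Equations (15)-(16) can be double-checked the same way; a quick finite-difference sketch for the $W_1$ term (shapes and names are mine):

```python
import numpy as np

# L21 = 1_{1xH} (W1 ⊙ W1) 1_{Dx1} is just the sum of squared entries,
# so its gradient should be 2*W1 entry by entry, matching equation (15).
rng = np.random.default_rng(7)
H, D = 4, 3
W1 = rng.standard_normal((H, D))
l21 = lambda W: (np.ones((1, H)) @ (W * W) @ np.ones((D, 1))).item()

eps = 1e-6
num = np.zeros_like(W1)
for i in range(H):
    for j in range(D):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (l21(Wp) - l21(Wm)) / (2 * eps)
assert np.allclose(num, 2 * W1, atol=1e-5)
```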
Postscript
- This took a whole day to write…
- Matrix calculus is an extremely powerful tool, and the expressions it produces translate directly into code
- The same method can derive the parameter updates for the SVM hinge loss, logistic regression, linear regression, and so on; these make good exercises (an interviewer once asked me to derive the least-squares gradient; I tried the vectorized form, could not finish it, and, annoyed, grudgingly wrote out the scalar version)
- The approach above always keeps a scalar-by-matrix derivative; backpropagation is then the process of repeatedly expanding the matrix differential
- 矩阵求导术(下) gives a technique for matrix-by-matrix derivatives, which seems to yield another form of the chain rule. One question I still have: when the final output is a scalar (e.g., a loss), is the method of 矩阵求导术(上) more convenient, since the method in (下) first flattens the matrices and then reshapes them back?
- I also wrote a row-vector version by hand, for reference