Model Structure of a Neural Network
$$
\begin{gathered}
a_1^{(2)} = \sigma\big(W_{10}^{(1)} x_0 + W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3\big) \\
a_2^{(2)} = \sigma\big(W_{20}^{(1)} x_0 + W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3\big) \\
a_3^{(2)} = \sigma\big(W_{30}^{(1)} x_0 + W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3\big)
\end{gathered}
$$

$$
a_1^{(3)} = \sigma\big(W_{10}^{(2)} a_0^{(2)} + W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)}\big)
$$
Basic Notation
- $\boldsymbol{W}^{(l)}$ denotes the parameter matrix of layer $l$;
- $\boldsymbol{z}^{(l)}$ denotes the input of layer $l$ before the activation function;
- $\sigma$ denotes the activation function;
- $\boldsymbol{a}^{(l)}$ denotes the output of layer $l$, with $\boldsymbol{a}^{(l)} = \sigma(\boldsymbol{z}^{(l)})$; it can also be viewed as the input to layer $l+1$;
- $\boldsymbol{a}^{(L)}$ denotes the final output of the network. When computing the $\text{Loss}$, $\boldsymbol{a}^{(L)}$ generally still needs to pass through a $\text{softmax}$ before the loss can be evaluated.
How Data Flows Through the Network
$$
\begin{aligned}
\boldsymbol{a}^{(1)} &= \boldsymbol{x} \\
\boldsymbol{z}^{(2)} &= \boldsymbol{W}^{(1)}\boldsymbol{a}^{(1)} \\
\boldsymbol{a}^{(2)} &= \sigma(\boldsymbol{z}^{(2)}) \\
\boldsymbol{z}^{(3)} &= \boldsymbol{W}^{(2)}\boldsymbol{a}^{(2)} \\
\boldsymbol{a}^{(3)} &= \sigma(\boldsymbol{z}^{(3)}) \\
\boldsymbol{z}^{(4)} &= \boldsymbol{W}^{(3)}\boldsymbol{a}^{(3)} \\
&\;\;\vdots \\
\boldsymbol{z}^{(L)} &= \boldsymbol{W}^{(L-1)}\boldsymbol{a}^{(L-1)} \\
\boldsymbol{a}^{(L)} &= \sigma(\boldsymbol{z}^{(L)})
\end{aligned}
$$
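The flow above is straightforward to sketch in NumPy. The snippet below is a minimal illustration only, assuming sigmoid activations and omitting biases (the layer sizes and names are invented for the example; a bias can be folded in via $x_0 = 1$ as in the equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Forward flow: z^(l+1) = W^(l) a^(l), a^(l+1) = sigma(z^(l+1)).

    `weights` is the list [W^(1), ..., W^(L-1)].  Returns the per-layer
    pre-activations z and activations a, which backpropagation reuses.
    """
    a = x
    zs, activations = [], [a]          # a^(1) = x
    for W in weights:
        z = W @ a                      # z^(l+1) = W^(l) a^(l)
        a = sigmoid(z)                 # a^(l+1) = sigma(z^(l+1))
        zs.append(z)
        activations.append(a)
    return zs, activations

# A toy 3-4-2 network: W1 maps R^3 -> R^4, W2 maps R^4 -> R^2.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
zs, acts = forward(np.array([1.0, 0.5, -0.5]), [W1, W2])
print(acts[-1].shape)   # (2,)
```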
Loss Function
$$
\begin{aligned}
\text{Loss} &= -\boldsymbol{y}^{T} \ln \hat{\boldsymbol{y}} \\
&= -\sum_{i=1}^{K} y_i \ln \hat{y}_i
\end{aligned}
$$
where $\boldsymbol{y} \in \{0,1\}^{K}$ is the one-hot label and $\boldsymbol{\hat{y}}$ is the network's prediction.
The $\text{softmax}$ Function
$$
\hat{y}_k = \text{softmax}(\boldsymbol{a}^{(L)})_k = \frac{e^{a_k^{(L)}}}{\sum_{i=1}^{K} e^{a_i^{(L)}}}
$$
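A small sketch of the two definitions above. The max-subtraction trick is an added assumption for numerical stability; it cancels between numerator and denominator, so the result is unchanged:

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax: subtract max(a) before exponentiating."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def cross_entropy(y, y_hat):
    """Loss = -y^T ln(y_hat) for a one-hot label y."""
    return -np.sum(y * np.log(y_hat))

a_L = np.array([2.0, 1.0, 0.1])        # final-layer output a^(L)
y = np.array([1.0, 0.0, 0.0])          # one-hot label
y_hat = softmax(a_L)
print(np.isclose(y_hat.sum(), 1.0))    # True: probabilities sum to 1
print(cross_entropy(y, y_hat))         # equals -ln(y_hat[0]) for this y
```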
Numerator Layout vs. Denominator Layout
Matrix calculus commonly uses one of two notational conventions: numerator layout and denominator layout. They differ in whether the derivative of a scalar with respect to a vector is written as a column vector or as a row vector. For the partial derivative of a scalar with respect to a vector: given an $M$-dimensional vector $\boldsymbol{x} \in \mathbb{R}^{M}$ and a function $y = f(\boldsymbol{x}) \in \mathbb{R}$, the partial derivative of $y$ with respect to $\boldsymbol{x}$ is
$$
\begin{aligned}
\text{denominator layout: } \frac{\partial y}{\partial \boldsymbol{x}} &= \left[\frac{\partial y}{\partial x_{1}}, \cdots, \frac{\partial y}{\partial x_{M}}\right]^{T} \in \mathbb{R}^{M \times 1}, \\
\text{numerator layout: } \frac{\partial y}{\partial \boldsymbol{x}} &= \left[\frac{\partial y}{\partial x_{1}}, \cdots, \frac{\partial y}{\partial x_{M}}\right] \in \mathbb{R}^{1 \times M}.
\end{aligned}
$$
In denominator layout, $\frac{\partial y}{\partial \boldsymbol{x}}$ is a column vector, while in numerator layout it is a row vector. For the partial derivative of a vector with respect to a scalar: given a scalar $x \in \mathbb{R}$ and a function $\boldsymbol{y} = f(x) \in \mathbb{R}^{N}$, the partial derivative of $\boldsymbol{y}$ with respect to $x$ is
$$
\begin{aligned}
\text{denominator layout: } \frac{\partial \boldsymbol{y}}{\partial x} &= \left[\frac{\partial y_{1}}{\partial x}, \cdots, \frac{\partial y_{N}}{\partial x}\right] \in \mathbb{R}^{1 \times N}, \\
\text{numerator layout: } \frac{\partial \boldsymbol{y}}{\partial x} &= \left[\frac{\partial y_{1}}{\partial x}, \cdots, \frac{\partial y_{N}}{\partial x}\right]^{T} \in \mathbb{R}^{N \times 1}.
\end{aligned}
$$
In denominator layout, $\frac{\partial \boldsymbol{y}}{\partial x}$ is a row vector, while in numerator layout $\frac{\partial \boldsymbol{y}}{\partial x}$ is a column vector.
Because computing $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{W}^{(l)}}$ involves differentiation with respect to a matrix, which is rather tedious, we first compute the partial derivative of $\mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ with respect to each individual element of the parameter matrix.
The Backpropagation Algorithm
Without loss of generality, we compute the partial derivatives with respect to the parameters $\boldsymbol{W}^{(l)}$ and $\boldsymbol{b}^{(l)}$ of layer $l$. We use vectors and matrices to express the partial derivatives of multivariate functions, written in numerator layout. By the chain rule,
$$
\begin{aligned}
\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial w_{ij}^{(l)}} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}}\frac{\partial \boldsymbol{z}^{(l)}}{\partial w_{ij}^{(l)}}, \\
\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{b}^{(l)}} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}}\frac{\partial \boldsymbol{z}^{(l)}}{\partial \boldsymbol{b}^{(l)}}.
\end{aligned}
$$
The first factor in both formulas is the partial derivative of the objective function with respect to the neurons $\boldsymbol{z}^{(l)}$ of layer $l$. It is called the error term, and it can be computed once and then reused. Thus we only need to compute three partial derivatives: $\frac{\partial \boldsymbol{z}^{(l)}}{\partial w_{ij}^{(l)}}$, $\frac{\partial \boldsymbol{z}^{(l)}}{\partial \boldsymbol{b}^{(l)}}$, and $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}}$.
(1) Computing the partial derivative $\frac{\partial \boldsymbol{z}^{(l)}}{\partial w_{ij}^{(l)}}$. Since $\boldsymbol{z}^{(l)} = \boldsymbol{W}^{(l)} \boldsymbol{a}^{(l-1)} + \boldsymbol{b}^{(l)}$, the partial derivative is
$$
\begin{aligned}
\frac{\partial \boldsymbol{z}^{(l)}}{\partial w_{ij}^{(l)}} &= \Big[\frac{\partial z_{1}^{(l)}}{\partial w_{ij}^{(l)}}, \cdots, \frac{\partial z_{i}^{(l)}}{\partial w_{ij}^{(l)}}, \cdots, \frac{\partial z_{M_{l}}^{(l)}}{\partial w_{ij}^{(l)}}\Big]^{T} \\
&= \Big[0, \cdots, \frac{\partial\big(\boldsymbol{w}_{i:}^{(l)} \boldsymbol{a}^{(l-1)} + b_{i}^{(l)}\big)}{\partial w_{ij}^{(l)}}, \cdots, 0\Big]^{T} \\
&= \big[0, \ldots, a_{j}^{(l-1)}, \ldots, 0\big]^{T}
\end{aligned}
$$
where $\boldsymbol{w}_{i:}^{(l)}$ denotes the $i$-th row of the weight matrix $\boldsymbol{W}^{(l)}$.
(2) Computing the partial derivative $\frac{\partial \boldsymbol{z}^{(l)}}{\partial \boldsymbol{b}^{(l)}}$. Since the functional relationship between $\boldsymbol{z}^{(l)}$ and $\boldsymbol{b}^{(l)}$ is $\boldsymbol{z}^{(l)} = \boldsymbol{W}^{(l)} \boldsymbol{a}^{(l-1)} + \boldsymbol{b}^{(l)}$, the partial derivative is
$$
\frac{\partial \boldsymbol{z}^{(l)}}{\partial \boldsymbol{b}^{(l)}} = \boldsymbol{I}_{M_{l}} \in \mathbb{R}^{M_{l} \times M_{l}},
$$
the $M_{l} \times M_{l}$ identity matrix.
(3) Computing the partial derivative $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}}$. This derivative captures how the neurons of layer $l$ affect the final loss, i.e., how sensitive the final loss is to the neurons of layer $l$. It is therefore called the error term of layer $l$ and is denoted $\delta^{(l)}$:
$$
\delta^{(l)} \triangleq \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}} \in \mathbb{R}^{1 \times M_{l}}
$$
The error term $\delta^{(l)}$ also indirectly reflects how much each neuron contributes to the network's capability, thereby giving a reasonably good solution to the Credit Assignment Problem (CAP).
From $\boldsymbol{z}^{(l+1)} = \boldsymbol{W}^{(l+1)} \boldsymbol{a}^{(l)} + \boldsymbol{b}^{(l+1)}$ we have (in numerator layout)

$$
\frac{\partial \boldsymbol{z}^{(l+1)}}{\partial \boldsymbol{a}^{(l)}} = \boldsymbol{W}^{(l+1)} \in \mathbb{R}^{M_{l+1} \times M_{l}}
$$
From $\boldsymbol{a}^{(l)} = \sigma_{l}(\boldsymbol{z}^{(l)})$, where $\sigma_{l}(\cdot)$ is the element-wise activation function of layer $l$, we have
$$
\begin{aligned}
\frac{\partial \boldsymbol{a}^{(l)}}{\partial \boldsymbol{z}^{(l)}} &= \frac{\partial \sigma_{l}\big(\boldsymbol{z}^{(l)}\big)}{\partial \boldsymbol{z}^{(l)}} \\
&= \begin{bmatrix}
\frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{1}}{\partial z_{1}^{(l)}} & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{1}}{\partial z_{2}^{(l)}} & \cdots & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{1}}{\partial z_{M_{l}}^{(l)}} \\
\frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{2}}{\partial z_{1}^{(l)}} & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{2}}{\partial z_{2}^{(l)}} & \cdots & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{2}}{\partial z_{M_{l}}^{(l)}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{M_{l}}}{\partial z_{1}^{(l)}} & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{M_{l}}}{\partial z_{2}^{(l)}} & \cdots & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{M_{l}}}{\partial z_{M_{l}}^{(l)}}
\end{bmatrix} \\
&= \begin{bmatrix}
\frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{1}}{\partial z_{1}^{(l)}} & 0 & \cdots & 0 \\
0 & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{2}}{\partial z_{2}^{(l)}} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{M_{l}}}{\partial z_{M_{l}}^{(l)}}
\end{bmatrix} \\
&= \operatorname{diag}\big(\sigma_{l}'(\boldsymbol{z}^{(l)})\big) \in \mathbb{R}^{M_{l} \times M_{l}}
\end{aligned}
$$
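The diagonal structure above can be checked numerically. The sketch below assumes a sigmoid activation and compares a finite-difference Jacobian against $\operatorname{diag}(\sigma'(\boldsymbol{z}))$ at an arbitrary sample point:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Central-difference Jacobian of the element-wise sigmoid.
z = np.array([0.5, -1.0, 2.0])
eps = 1e-6
J = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J[:, j] = (sigmoid(z + dz) - sigmoid(z - dz)) / (2 * eps)

# Each output depends only on its own input, so the off-diagonal
# entries vanish and J equals diag(sigmoid'(z)).
analytic = np.diag(sigmoid(z) * (1 - sigmoid(z)))   # sigma' = sigma(1 - sigma)
print(np.allclose(J, analytic, atol=1e-8))          # True
```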
Therefore, by the chain rule, the error term of layer $l$ is
$$
\begin{aligned}
1 \times M_{l} &= (1 \times M_{l+1})(M_{l+1} \times M_{l})(M_{l} \times M_{l}) \\
\delta^{(l)} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}} \\
&= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l+1)}} \, \frac{\partial \boldsymbol{z}^{(l+1)}}{\partial \boldsymbol{a}^{(l)}} \, \frac{\partial \boldsymbol{a}^{(l)}}{\partial \boldsymbol{z}^{(l)}} \\
&= \delta^{(l+1)}\, \boldsymbol{W}^{(l+1)}\, \operatorname{diag}\big(\sigma'(\boldsymbol{z}^{(l)})\big)
\end{aligned}
$$
The formula $\delta^{(l)} = \delta^{(l+1)}\,\boldsymbol{W}^{(l+1)}\,\operatorname{diag}\big(\sigma'(\boldsymbol{z}^{(l)})\big)$ shows that the error term of layer $l$ can be computed from the error term of layer $l+1$; this is the backward propagation of errors (Back Propagation, BP). Its meaning: the error term (or sensitivity) of a neuron in layer $l$ is the weighted sum of the error terms of all layer-$(l+1)$ neurons connected to it, multiplied by the gradient of that neuron's activation function. Since obtaining $\delta^{(l)}$ requires $\delta^{(l+1)}$, which in turn requires $\delta^{(l+2)}$, the computation propagates backwards in this fashion (much like recursion) until the last term $\delta^{(L)}$, which is the derivative of the loss with respect to the output (final) layer. Hence:
$$
\begin{aligned}
1 \times M_{l} &= (1 \times M_{l+1})(M_{l+1} \times M_{l})(M_{l} \times M_{l}) \\
\delta^{(l)} &= \delta^{(l+1)}\, \boldsymbol{W}^{(l+1)}\, \operatorname{diag}\big(\sigma'(\boldsymbol{z}^{(l)})\big) \\
&= \delta^{(l+2)}\, \boldsymbol{W}^{(l+2)}\, \operatorname{diag}\big(\sigma'(\boldsymbol{z}^{(l+1)})\big)\, \boldsymbol{W}^{(l+1)}\, \operatorname{diag}\big(\sigma'(\boldsymbol{z}^{(l)})\big) \\
&= \delta^{(L)} \prod_{k=L}^{l+1} \boldsymbol{W}^{(k)}\, \operatorname{diag}\big(\sigma'(\boldsymbol{z}^{(k-1)})\big)
\end{aligned}
$$
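The recursion $\delta^{(l)} = \delta^{(l+1)}\,\boldsymbol{W}^{(l+1)}\,\operatorname{diag}(\sigma'(\boldsymbol{z}^{(l)}))$ can be sketched as follows. This is a toy illustration, assuming sigmoid activations and the convention $\boldsymbol{z}^{(l)} = \boldsymbol{W}^{(l)}\boldsymbol{a}^{(l-1)} + \boldsymbol{b}^{(l)}$; multiplying by the diagonal matrix is implemented as element-wise scaling:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backward_deltas(delta_L, Ws, zs):
    """Propagate the row-vector error terms backwards:
    delta^(l) = delta^(l+1) W^(l+1) diag(sigma'(z^(l))).

    Ws = {l: W^(l)} and zs = {l: z^(l)}.  Returns {l: delta^(l)}
    for l = L, L-1, ..., 2.
    """
    L = max(zs)
    deltas = {L: delta_L}
    for l in range(L - 1, 1, -1):
        # right-multiplying by diag(sigma'(z)) scales each component
        deltas[l] = (deltas[l + 1] @ Ws[l + 1]) * sigmoid_prime(zs[l])
    return deltas

# Toy shape check for a 3-layer net with M_2 = 4, M_3 = 2 (L = 3).
rng = np.random.default_rng(1)
Ws = {2: rng.normal(size=(4, 3)), 3: rng.normal(size=(2, 4))}
zs = {2: rng.normal(size=4), 3: rng.normal(size=2)}
deltas = backward_deltas(rng.normal(size=2), Ws, zs)
print(deltas[2].shape)   # (4,) -- one entry per neuron in layer 2
```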
Differentiating the last layer: $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(L)}}$
The different forms of $\delta^{(L)}$ are as follows:
(1)
$$
\begin{aligned}
\delta^{(L)} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial \big(-\sum_{i=1}^{K} y_i \ln \hat{y}_i\big)}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial \big(-\sum_{i=1}^{K} y_i \ln \big( \text{softmax}(\boldsymbol{a}^{(L)}) \big)_i\big)}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial \big(-\sum_{i=1}^{K} y_i \ln \big( \text{softmax}( \sigma(\boldsymbol{z}^{(L)})) \big)_i\big)}{\partial \boldsymbol{z}^{(L)}}
\end{aligned}
$$
(2)
$$
\begin{aligned}
\delta^{(L)} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial \big(-\boldsymbol{y}^{T} \ln \hat{\boldsymbol{y}}\big)}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial \big(-\boldsymbol{y}^{T} \ln \text{softmax}(\boldsymbol{a}^{(L)})\big)}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial \big(-\boldsymbol{y}^{T} \ln \text{softmax}( \sigma(\boldsymbol{z}^{(L)}))\big)}{\partial \boldsymbol{z}^{(L)}}
\end{aligned}
$$
(3)
$$
\delta^{(L)} = \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(L)}} = \frac{\partial \mathcal{L}}{\partial \boldsymbol{\hat{y}}} \frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{a}^{(L)}} \frac{\partial \boldsymbol{a}^{(L)}}{\partial \boldsymbol{z}^{(L)}}
$$
| Term | Size | Numerator / Denominator |
|---|---|---|
| $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\hat{y}}}$ | $1 \times K$ | scalar / vector |
| $\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{a}^{(L)}}$ | $K \times K$ | vector / vector |
| $\frac{\partial \boldsymbol{a}^{(L)}}{\partial \boldsymbol{z}^{(L)}}$ | $K \times K$ | vector / vector |
Computing $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\hat{y}}}$
$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \boldsymbol{\hat{y}}} &= \frac{\partial \big(-\boldsymbol{y}^{T} \ln \boldsymbol{\hat{y}}\big)}{\partial \boldsymbol{\hat{y}}} \\
&= -\left[ \frac{y_1}{\hat{y}_1}, \frac{y_2}{\hat{y}_2}, \cdots, \frac{y_K}{\hat{y}_K} \right]
\end{aligned}
$$
Computing $\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{a}^{(L)}}$
$$
\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{a}^{(L)}} =
\begin{bmatrix}
\frac{\partial \hat{y}_{1}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{1}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{1}}{\partial a_{K}^{(L)}} \\
\frac{\partial \hat{y}_{2}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{2}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{2}}{\partial a_{K}^{(L)}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \hat{y}_{K}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{K}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{K}}{\partial a_{K}^{(L)}}
\end{bmatrix}
$$
If $j = i$:

$$
\begin{aligned}
\frac{\partial \hat{y}_{j}}{\partial a_{i}^{(L)}} &= \frac{\partial}{\partial a_{j}^{(L)}} \left(\frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}}\right) \\
&= \frac{\big(e^{a_{j}^{(L)}}\big)^{\prime} \cdot \sum_{k} e^{a_{k}^{(L)}} - e^{a_{j}^{(L)}} \cdot e^{a_{j}^{(L)}}}{\big(\sum_{k} e^{a_{k}^{(L)}}\big)^{2}} \\
&= \frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}} - \frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}} \cdot \frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}} = \hat{y}_{j}\big(1 - \hat{y}_{j}\big)
\end{aligned}
$$
If $j \neq i$:

$$
\begin{aligned}
\frac{\partial \hat{y}_{j}}{\partial a_{i}^{(L)}} &= \frac{\partial}{\partial a_{i}^{(L)}} \left(\frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}}\right) \\
&= \frac{0 \cdot \sum_{k} e^{a_{k}^{(L)}} - e^{a_{i}^{(L)}} \cdot e^{a_{j}^{(L)}}}{\big(\sum_{k} e^{a_{k}^{(L)}}\big)^{2}} \\
&= -\frac{e^{a_{i}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}} \cdot \frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}} = -\hat{y}_{i} \hat{y}_{j}
\end{aligned}
$$
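The two cases combine into the compact matrix form $J = \operatorname{diag}(\boldsymbol{\hat{y}}) - \boldsymbol{\hat{y}}\boldsymbol{\hat{y}}^{T}$, whose $(j,i)$ entry is $\hat{y}_j(\mathbb{1}[j=i] - \hat{y}_i)$. A quick numerical check (the sample vector is arbitrary):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([0.3, -1.2, 2.0, 0.5])
y_hat = softmax(a)

# Closed form combining the j == i and j != i cases.
analytic = np.diag(y_hat) - np.outer(y_hat, y_hat)

# Central-difference check, column i = d(y_hat)/d(a_i).
eps = 1e-6
numeric = np.zeros((4, 4))
for i in range(4):
    da = np.zeros(4)
    da[i] = eps
    numeric[:, i] = (softmax(a + da) - softmax(a - da)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))   # True
```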
Computing $\frac{\partial \boldsymbol{a}^{(L)}}{\partial \boldsymbol{z}^{(L)}}$
Since this factor belongs to the activation function, i.e., $\boldsymbol{a}^{(L)} = \sigma(\boldsymbol{z}^{(L)})$, and the activation is applied element-wise, we have

$$
\frac{\partial \boldsymbol{a}^{(L)}}{\partial \boldsymbol{z}^{(L)}} = \operatorname{diag}\big(\sigma'(\boldsymbol{z}^{(L)})\big)
$$
The final result for $\delta^{(L)}$
The full expression is then
$$
\begin{aligned}
(1 \times M_{L}) &= (1 \times M_{L})(M_{L} \times M_{L})(M_{L} \times M_{L}) \\
\delta^{(L)} &= \frac{\partial \mathcal{L}}{\partial \boldsymbol{\hat{y}}} \frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{a}^{(L)}} \frac{\partial \boldsymbol{a}^{(L)}}{\partial \boldsymbol{z}^{(L)}} \\
&= -\left[ \frac{y_1}{\hat{y}_1}, \frac{y_2}{\hat{y}_2}, \cdots, \frac{y_K}{\hat{y}_K} \right]
\begin{bmatrix}
\frac{\partial \hat{y}_{1}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{1}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{1}}{\partial a_{K}^{(L)}} \\
\frac{\partial \hat{y}_{2}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{2}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{2}}{\partial a_{K}^{(L)}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \hat{y}_{K}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{K}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{K}}{\partial a_{K}^{(L)}}
\end{bmatrix}
\operatorname{diag}\big(\sigma'(\boldsymbol{z}^{(L)})\big)
\end{aligned}
$$
The final result for $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial w_{ij}^{(l)}}$
Therefore, combining all the results above yields the final derivative $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial w_{ij}^{(l)}}$:
$$
\begin{aligned}
(1 \times 1) &= (1 \times M_{l})(M_{l} \times 1) \\
\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial w_{ij}^{(l)}} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}}\frac{\partial \boldsymbol{z}^{(l)}}{\partial w_{ij}^{(l)}} \\
&= \delta^{(L)} \left( \prod_{k=L}^{l+1} \boldsymbol{W}^{(k)}\, \operatorname{diag}\big(\sigma'(\boldsymbol{z}^{(k-1)})\big) \right) \big[0, \ldots, a_{j}^{(l-1)}, \ldots, 0\big]^{T} \\
&= \delta_{i}^{(l)}\, a_{j}^{(l-1)}
\end{aligned}
$$
The expression above shows that when computing $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial w_{ij}^{(l)}}$, the closer the layer is to the output (the closer $l$ is to $L$), the less computation the derivative requires; moreover, the gradients of parameters near the input layer reuse quantities computed near the output layer. This is backpropagation.
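As a sanity check, the whole derivation can be verified against finite differences on a tiny two-matrix network. This sketch assumes sigmoid activations with softmax applied after the last activation, matching the setup above; all sizes and sample values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def loss(W1, W2, x, y):
    a2 = sigmoid(W1 @ x)                  # hidden layer
    y_hat = softmax(sigmoid(W2 @ a2))     # softmax after the last activation
    return -np.sum(y * np.log(y_hat))

rng = np.random.default_rng(2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 4))
x = rng.normal(size=3)
y = np.array([0.0, 1.0, 0.0])

# Analytic gradients via the error terms (row-vector / numerator layout).
z2 = W1 @ x; a2 = sigmoid(z2)
z3 = W2 @ a2; a3 = sigmoid(z3)
y_hat = softmax(a3)
dL_dyhat = -y / y_hat                                # dL/d(y_hat)
J_softmax = np.diag(y_hat) - np.outer(y_hat, y_hat)  # d(y_hat)/d(a^(L))
delta3 = (dL_dyhat @ J_softmax) * (a3 * (1 - a3))    # delta^(3)
delta2 = (delta3 @ W2) * (a2 * (1 - a2))             # delta^(2)
grad_W2 = np.outer(delta3, a2)   # dL/dW^(2)_{ij} = delta_i^(3) a_j^(2)
grad_W1 = np.outer(delta2, x)    # dL/dW^(1)_{ij} = delta_i^(2) x_j

# Central-difference check on one entry of each weight matrix.
eps = 1e-6
def num_grad(W, i, j, which):
    Wp, Wm = W.copy(), W.copy()
    Wp[i, j] += eps; Wm[i, j] -= eps
    if which == 1:
        return (loss(Wp, W2, x, y) - loss(Wm, W2, x, y)) / (2 * eps)
    return (loss(W1, Wp, x, y) - loss(W1, Wm, x, y)) / (2 * eps)

print(np.isclose(grad_W1[2, 1], num_grad(W1, 2, 1, 1), atol=1e-6))   # True
print(np.isclose(grad_W2[0, 3], num_grad(W2, 0, 3, 2), atol=1e-6))   # True
```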
References
- [1] 《神经网络与深度学习》
- [2] 《李宏毅人工智能》