Backpropagation in Deep Neural Networks and the Derivative of the Softmax Function

Model structure of the neural network


$$
\begin{gathered}
a_1^{(2)} = \sigma\big(\mathbf{W}_{10}^{(1)} x_0 + \mathbf{W}_{11}^{(1)} x_1 + \mathbf{W}_{12}^{(1)} x_2 + \mathbf{W}_{13}^{(1)} x_3\big) \\
a_2^{(2)} = \sigma\big(\mathbf{W}_{20}^{(1)} x_0 + \mathbf{W}_{21}^{(1)} x_1 + \mathbf{W}_{22}^{(1)} x_2 + \mathbf{W}_{23}^{(1)} x_3\big) \\
a_3^{(2)} = \sigma\big(\mathbf{W}_{30}^{(1)} x_0 + \mathbf{W}_{31}^{(1)} x_1 + \mathbf{W}_{32}^{(1)} x_2 + \mathbf{W}_{33}^{(1)} x_3\big)
\end{gathered}
$$


$$
a_0^{(3)} = \sigma\big(\mathbf{W}_{10}^{(2)} a_0^{(2)} + \mathbf{W}_{11}^{(2)} a_1^{(2)} + \mathbf{W}_{12}^{(2)} a_2^{(2)} + \mathbf{W}_{13}^{(2)} a_3^{(2)}\big)
$$
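To make the two layer equations above concrete, here is a minimal NumPy sketch (the input and weight values are made up, and treating $x_0$ and $a_0^{(2)}$ as bias units fixed to $1$ is an assumption about the figure, not something stated above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -1.2, 0.3])    # x_0, x_1, x_2, x_3 (x_0 assumed to be a bias unit)
W1 = np.random.randn(3, 4)             # W^(1): rows produce a_1^(2), a_2^(2), a_3^(2)
a2 = sigmoid(W1 @ x)                   # layer-2 activations

a2_full = np.concatenate(([1.0], a2))  # prepend a_0^(2) = 1 (assumed bias unit)
W2 = np.random.randn(1, 4)             # W^(2): single output unit
a3 = sigmoid(W2 @ a2_full)             # a_0^(3), the network output
print(a2, a3)
```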

Notation

$\boldsymbol{W}^{(l)}$ denotes the weight matrix of layer $l$;

$\boldsymbol{z}^{(l)}$ denotes the input of layer $l$ before the activation function;

$\sigma$ denotes the activation function;

$\boldsymbol{a}^{(l)}$ denotes the output of layer $l$, $\boldsymbol{a}^{(l)} = \sigma(\boldsymbol{z}^{(l)})$, which can also be viewed as the input of layer $l+1$;

$\boldsymbol{a}^{(L)}$ denotes the final output of the network.

When computing the $\text{Loss}$, $\boldsymbol{a}^{(L)}$ generally still has to pass through a $\text{softmax}$ operation before the loss can be evaluated.

Data flow through the network

$$
\begin{aligned}
\boldsymbol{a}^{(1)} &= \boldsymbol{x} \\
\boldsymbol{z}^{(2)} &= \mathbf{W}^{(1)} \boldsymbol{a}^{(1)} \\
\boldsymbol{a}^{(2)} &= \sigma(\boldsymbol{z}^{(2)}) \\
\boldsymbol{z}^{(3)} &= \mathbf{W}^{(2)} \boldsymbol{a}^{(2)} \\
\boldsymbol{a}^{(3)} &= \sigma(\boldsymbol{z}^{(3)}) \\
\boldsymbol{z}^{(4)} &= \mathbf{W}^{(3)} \boldsymbol{a}^{(3)} \\
&\;\,\vdots \\
\boldsymbol{z}^{(L)} &= \mathbf{W}^{(L-1)} \boldsymbol{a}^{(L-1)} \\
\boldsymbol{a}^{(L)} &= \sigma(\boldsymbol{z}^{(L)})
\end{aligned}
$$
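A compact sketch of this forward flow, assuming the sigmoid as $\sigma$ and made-up layer sizes; the cached $\boldsymbol{z}^{(l)}$ and $\boldsymbol{a}^{(l)}$ are exactly what backpropagation reuses later:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Run a^(1)=x, then z = W a, a = sigma(z) for each weight matrix in order of application.
    The z's and a's are cached because backpropagation reuses them."""
    a = x
    activations, pre_activations = [a], []
    for W in weights:                 # W^(1), W^(2), ... in the order they are applied
        z = W @ a
        a = sigmoid(z)
        pre_activations.append(z)
        activations.append(a)
    return activations, pre_activations

# Example with made-up layer sizes 4 -> 3 -> 2:
rng = np.random.default_rng(0)
acts, zs = forward(rng.normal(size=4), [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))])
print([a.shape for a in acts])
```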

Loss function

$$
\begin{aligned}
\text{Loss} &= -\boldsymbol{y}^{T} \ln \hat{\boldsymbol{y}} \\
&= -\sum_{i=1}^{K} y_i \ln \hat{y}_i
\end{aligned}
$$
where $\boldsymbol{y} \in \{0,1\}^{K}$ is the (one-hot) label and $\hat{\boldsymbol{y}}$ is the prediction of the network.

The softmax function

$$
\hat{y}_k = \text{softmax}(\boldsymbol{a}^{(L)})_k = \frac{e^{a_k^{(L)}}}{\sum_{i=1}^{K} e^{a_i^{(L)}}}
$$
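A small sketch of the softmax and the cross-entropy loss defined above; subtracting $\max_i a_i^{(L)}$ before exponentiating is a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))          # shift for numerical stability
    return e / e.sum()

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))  # Loss = -sum_i y_i ln(y_hat_i)

a_L = np.array([2.0, 1.0, 0.1])        # made-up logits
y = np.array([1.0, 0.0, 0.0])          # one-hot label
print(softmax(a_L), cross_entropy(y, softmax(a_L)))
```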

Numerator layout and denominator layout

Matrix calculus is usually written in one of two notational conventions: numerator layout and denominator layout. They differ in whether the derivative of a scalar with respect to a vector is written as a column vector or as a row vector. Derivative of a scalar with respect to a vector: for an $M$-dimensional vector $\boldsymbol{x} \in \mathbb{R}^{M}$ and a function $y = f(\boldsymbol{x}) \in \mathbb{R}$, the partial derivative of $y$ with respect to $\boldsymbol{x}$ is

$$
\begin{aligned}
\text{denominator layout: } \frac{\partial y}{\partial \boldsymbol{x}} &= \left[\frac{\partial y}{\partial x_{1}}, \cdots, \frac{\partial y}{\partial x_{M}}\right]^{T} \in \mathbb{R}^{M \times 1}, \\
\text{numerator layout: } \frac{\partial y}{\partial \boldsymbol{x}} &= \left[\frac{\partial y}{\partial x_{1}}, \cdots, \frac{\partial y}{\partial x_{M}}\right] \in \mathbb{R}^{1 \times M}.
\end{aligned}
$$

In denominator layout $\frac{\partial y}{\partial \boldsymbol{x}}$ is a column vector, whereas in numerator layout it is a row vector. Derivative of a vector with respect to a scalar: for a scalar $x \in \mathbb{R}$ and a function $\boldsymbol{y} = f(x) \in \mathbb{R}^{N}$, the partial derivative of $\boldsymbol{y}$ with respect to $x$ is

$$
\begin{aligned}
\text{denominator layout: } \frac{\partial \boldsymbol{y}}{\partial x} &= \left[\frac{\partial y_{1}}{\partial x}, \cdots, \frac{\partial y_{N}}{\partial x}\right] \in \mathbb{R}^{1 \times N}, \\
\text{numerator layout: } \frac{\partial \boldsymbol{y}}{\partial x} &= \left[\frac{\partial y_{1}}{\partial x}, \cdots, \frac{\partial y_{N}}{\partial x}\right]^{T} \in \mathbb{R}^{N \times 1}.
\end{aligned}
$$
In denominator layout $\frac{\partial \boldsymbol{y}}{\partial x}$ is a row vector, whereas in numerator layout $\frac{\partial \boldsymbol{y}}{\partial x}$ is a column vector.
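As a tiny self-contained illustration of the two conventions (using my own example $y = \boldsymbol{x}^{T}\boldsymbol{x}$): the partial derivatives are the same numbers in both layouts; only their arrangement as a row or a column differs.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
grad = 2 * x                               # dy/dx_i for y = x^T x

numerator_layout = grad.reshape(1, -1)     # row vector, shape (1, M)
denominator_layout = grad.reshape(-1, 1)   # column vector, shape (M, 1)
assert np.array_equal(numerator_layout, denominator_layout.T)
print(numerator_layout.shape, denominator_layout.shape)
```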

Because $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{W}^{(l)}}$ involves differentiating a vector with respect to a matrix, which is quite cumbersome, we first compute the partial derivative of $\mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ with respect to each individual element of the weight matrix.

The backpropagation algorithm

Without loss of generality, we compute the partial derivatives with respect to the parameters $\boldsymbol{W}^{(l)}$ and $\boldsymbol{b}^{(l)}$ of layer $l$. Partial derivatives of multivariate functions are written here as vectors or matrices, in numerator layout. By the chain rule,
$$
\begin{aligned}
\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial w_{ij}^{(l)}} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}} \frac{\partial \boldsymbol{z}^{(l)}}{\partial w_{ij}^{(l)}}, \\
\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{b}^{(l)}} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}} \frac{\partial \boldsymbol{z}^{(l)}}{\partial \boldsymbol{b}^{(l)}}.
\end{aligned}
$$

The first factor in both formulas above is the partial derivative of the objective with respect to the neurons $\boldsymbol{z}^{(l)}$ of layer $l$; it is called the error term and only needs to be computed once. Thus we only need to compute three partial derivatives: $\frac{\partial \boldsymbol{z}^{(l)}}{\partial w_{ij}^{(l)}}$, $\frac{\partial \boldsymbol{z}^{(l)}}{\partial \boldsymbol{b}^{(l)}}$, and $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}}$.

(1) Computing the partial derivative $\frac{\partial \boldsymbol{z}^{(l)}}{\partial w_{ij}^{(l)}}$. Since $\boldsymbol{z}^{(l)} = \boldsymbol{W}^{(l)} \boldsymbol{a}^{(l-1)} + \boldsymbol{b}^{(l)}$, the partial derivative is

$$
\begin{aligned}
\frac{\partial \boldsymbol{z}^{(l)}}{\partial w_{ij}^{(l)}}
&= \Big[\frac{\partial z_{1}^{(l)}}{\partial w_{ij}^{(l)}}, \cdots, \frac{\partial z_{i}^{(l)}}{\partial w_{ij}^{(l)}}, \cdots, \frac{\partial z_{M_{l}}^{(l)}}{\partial w_{ij}^{(l)}}\Big]^{T} \\
&= \Big[0, \cdots, \frac{\partial\big(\boldsymbol{w}_{i:}^{(l)} \boldsymbol{a}^{(l-1)} + b_{i}^{(l)}\big)}{\partial w_{ij}^{(l)}}, \cdots, 0\Big]^{T} \\
&= \Big[0, \ldots, a_{j}^{(l-1)}, \ldots, 0\Big]^{T}
\end{aligned}
$$

where $\boldsymbol{w}_{i:}^{(l)}$ is the $i$-th row of the weight matrix $\boldsymbol{W}^{(l)}$.
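A quick finite-difference check of this selector-vector structure (the sizes and values below are made up): perturbing $w_{ij}^{(l)}$ only moves $z_i^{(l)}$, and by approximately $a_j^{(l-1)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))               # W^(l), illustrative shape
b = rng.normal(size=3)                    # b^(l)
a_prev = rng.normal(size=4)               # a^(l-1)
i, j, eps = 1, 2, 1e-6

W_pert = W.copy()
W_pert[i, j] += eps
dz = ((W_pert @ a_prev + b) - (W @ a_prev + b)) / eps
print(dz)                                  # ~ [0, a_prev[j], 0] for i = 1
print(a_prev[j])
```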

(2) Computing the partial derivative $\frac{\partial \boldsymbol{z}^{(l)}}{\partial \boldsymbol{b}^{(l)}}$. Because the functional relation between $\boldsymbol{z}^{(l)}$ and $\boldsymbol{b}^{(l)}$ is $\boldsymbol{z}^{(l)} = \boldsymbol{W}^{(l)} \boldsymbol{a}^{(l-1)} + \boldsymbol{b}^{(l)}$, the partial derivative is

$$
\frac{\partial \boldsymbol{z}^{(l)}}{\partial \boldsymbol{b}^{(l)}} = \boldsymbol{I}_{M_{l}} \in \mathbb{R}^{M_{l} \times M_{l}},
$$

i.e. the $M_{l} \times M_{l}$ identity matrix.

(3) Computing the partial derivative $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}}$. This partial derivative expresses how the neurons of layer $l$ affect the final loss, or equivalently how sensitive the final loss is to the neurons of layer $l$. It is therefore usually called the error term of layer $l$ and denoted $\delta^{(l)}$:

$$
\delta^{(l)} \triangleq \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}} \in \mathbb{R}^{M_{l}}
$$

The error term $\delta^{(l)}$ also indirectly reflects how much each neuron contributes to the network's output, and thus gives a reasonable answer to the credit assignment problem (CAP).
From $\boldsymbol{z}^{(l+1)} = \boldsymbol{W}^{(l+1)} \boldsymbol{a}^{(l)} + \boldsymbol{b}^{(l+1)}$ we have (in the numerator layout used here)

$$
\frac{\partial \boldsymbol{z}^{(l+1)}}{\partial \boldsymbol{a}^{(l)}} = \boldsymbol{W}^{(l+1)} \in \mathbb{R}^{M_{l+1} \times M_{l}}
$$

From $\boldsymbol{a}^{(l)} = \sigma_{l}\left(\boldsymbol{z}^{(l)}\right)$, where $\sigma_{l}(\cdot)$ is the element-wise activation function of layer $l$, we also have

$$
\begin{aligned}
\frac{\partial \boldsymbol{a}^{(l)}}{\partial \boldsymbol{z}^{(l)}}
&= \frac{\partial \sigma_{l}\left(\boldsymbol{z}^{(l)}\right)}{\partial \boldsymbol{z}^{(l)}} \\
&= \begin{bmatrix}
\frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{1}}{\partial z_{1}^{(l)}} & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{1}}{\partial z_{2}^{(l)}} & \cdots & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{1}}{\partial z_{M_l}^{(l)}} \\
\frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{2}}{\partial z_{1}^{(l)}} & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{2}}{\partial z_{2}^{(l)}} & \cdots & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{2}}{\partial z_{M_l}^{(l)}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{M_l}}{\partial z_{1}^{(l)}} & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{M_l}}{\partial z_{2}^{(l)}} & \cdots & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{M_l}}{\partial z_{M_l}^{(l)}}
\end{bmatrix} \\
&= \begin{bmatrix}
\frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{1}}{\partial z_{1}^{(l)}} & 0 & \cdots & 0 \\
0 & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{2}}{\partial z_{2}^{(l)}} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\partial \sigma_{l}(\boldsymbol{z}^{(l)})_{M_l}}{\partial z_{M_l}^{(l)}}
\end{bmatrix} \\
&= \text{diag}\big(\sigma_{l}'(\boldsymbol{z}^{(l)})\big) \in \mathbb{R}^{M_{l} \times M_{l}}
\end{aligned}
$$

Therefore, by the chain rule, the error term of layer $l$ is

$$
\begin{aligned}
(1 \times M_{l}) &= (1 \times M_{l+1})(M_{l+1} \times M_{l})(M_{l} \times M_{l}) \\
\delta^{(l)} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}} \\
&= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l+1)}} \cdot \frac{\partial \boldsymbol{z}^{(l+1)}}{\partial \boldsymbol{a}^{(l)}} \cdot \frac{\partial \boldsymbol{a}^{(l)}}{\partial \boldsymbol{z}^{(l)}} \\
&= \delta^{(l+1)} \cdot \boldsymbol{W}^{(l+1)} \cdot \text{diag}\big(\sigma'(\boldsymbol{z}^{(l)})\big)
\end{aligned}
$$

From $\delta^{(l)} = \delta^{(l+1)} \cdot \boldsymbol{W}^{(l+1)} \cdot \text{diag}(\sigma'(\boldsymbol{z}^{(l)}))$ we can see that the error term of layer $l$ can be computed from the error term of layer $l+1$; this is the back propagation (BP) of errors. The meaning of the backpropagation algorithm is: the error term (or sensitivity) of a neuron in layer $l$ is the weighted sum of the error terms of the layer-$(l+1)$ neurons connected to it, multiplied by the gradient of that neuron's activation function. Since obtaining $\delta^{(l)}$ requires $\delta^{(l+1)}$, and obtaining $\delta^{(l+1)}$ in turn requires $\delta^{(l+2)}$, the propagation continues backwards in this way (much like a recursion) until the last term $\delta^{(L)}$, which is the derivative of the loss with respect to the output layer (i.e. the last layer). We therefore have:

$$
\begin{aligned}
(1 \times M_{l}) &= (1 \times M_{l+1})(M_{l+1} \times M_{l})(M_{l} \times M_{l}) \\
\delta^{(l)} &= \delta^{(l+1)} \cdot \boldsymbol{W}^{(l+1)} \cdot \text{diag}\big(\sigma'(\boldsymbol{z}^{(l)})\big) \\
&= \delta^{(l+2)} \cdot \boldsymbol{W}^{(l+2)} \cdot \text{diag}\big(\sigma'(\boldsymbol{z}^{(l+1)})\big) \cdot \boldsymbol{W}^{(l+1)} \cdot \text{diag}\big(\sigma'(\boldsymbol{z}^{(l)})\big) \\
&= \delta^{(L)} \prod_{i=L}^{l+1} \boldsymbol{W}^{(i)} \cdot \text{diag}\big(\sigma'(\boldsymbol{z}^{(i-1)})\big)
\end{aligned}
$$
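A minimal sketch of this recursion, assuming the sigmoid as $\sigma$ and using dictionaries keyed by the layer index, following this section's convention $\boldsymbol{z}^{(l)} = \boldsymbol{W}^{(l)} \boldsymbol{a}^{(l-1)} + \boldsymbol{b}^{(l)}$; multiplying by $\text{diag}(\sigma'(\boldsymbol{z}^{(l)}))$ on the right becomes an element-wise product:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_deltas(delta_L, weights, zs, L):
    """delta_L: row-vector error term of the output layer, shape (M_L,).
    weights[l] is W^(l) with z^(l) = W^(l) a^(l-1) + b^(l); zs[l] is z^(l).
    Returns delta^(l) for l = L, L-1, ..., 2."""
    deltas = {L: delta_L}
    for l in range(L - 1, 1, -1):
        # delta^(l) = delta^(l+1) W^(l+1) diag(sigma'(z^(l)))
        deltas[l] = (deltas[l + 1] @ weights[l + 1]) * sigmoid_prime(zs[l])
    return deltas
```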

Differentiating the last layer: $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(L)}}$

[Figure: neural network structure diagram]

$\delta^{(L)}$ can be written in the following equivalent forms:

Form (1):

$$
\begin{aligned}
\delta^{(L)} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial\big(-\sum_{i=1}^{K} y_i \ln \hat{y}_i\big)}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial\big(-\sum_{i=1}^{K} y_i \ln\left(\text{softmax}(\boldsymbol{a}^{(L)})\right)_i\big)}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial\big(-\sum_{i=1}^{K} y_i \ln\left(\text{softmax}(\sigma(\boldsymbol{z}^{(L)}))\right)_i\big)}{\partial \boldsymbol{z}^{(L)}}
\end{aligned}
$$


Form (2):

$$
\begin{aligned}
\delta^{(L)} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial\big(-\boldsymbol{y}^{T} \ln \hat{\boldsymbol{y}}\big)}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial\big(-\boldsymbol{y}^{T} \ln \text{softmax}(\boldsymbol{a}^{(L)})\big)}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial\big(-\boldsymbol{y}^{T} \ln \text{softmax}(\sigma(\boldsymbol{z}^{(L)}))\big)}{\partial \boldsymbol{z}^{(L)}}
\end{aligned}
$$


Form (3):

$$
\begin{aligned}
\delta^{(L)} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(L)}} \\
&= \frac{\partial \mathcal{L}}{\partial \hat{\boldsymbol{y}}} \frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{a}^{(L)}} \frac{\partial \boldsymbol{a}^{(L)}}{\partial \boldsymbol{z}^{(L)}}
\end{aligned}
$$

| Factor | Shape | Numerator / denominator |
| --- | --- | --- |
| $\frac{\partial \mathcal{L}}{\partial \hat{\boldsymbol{y}}}$ | $1 \times K$ | scalar / vector |
| $\frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{a}^{(L)}}$ | $K \times K$ | vector / vector |
| $\frac{\partial \boldsymbol{a}^{(L)}}{\partial \boldsymbol{z}^{(L)}}$ | $K \times K$ | vector / vector |

Computing $\frac{\partial \mathcal{L}}{\partial \hat{\boldsymbol{y}}}$

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \hat{\boldsymbol{y}}} &= \frac{\partial\big(-\boldsymbol{y}^{T} \ln \hat{\boldsymbol{y}}\big)}{\partial \hat{\boldsymbol{y}}} \\
&= \left[-y_1 \frac{1}{\hat{y}_1}, -y_2 \frac{1}{\hat{y}_2}, \cdots, -y_K \frac{1}{\hat{y}_K}\right]
\end{aligned}
$$

Computing $\frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{a}^{(L)}}$

$$
\frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{a}^{(L)}} =
\begin{bmatrix}
\frac{\partial \hat{y}_{1}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{1}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{1}}{\partial a_{K}^{(L)}} \\
\frac{\partial \hat{y}_{2}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{2}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{2}}{\partial a_{K}^{(L)}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \hat{y}_{K}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{K}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{K}}{\partial a_{K}^{(L)}}
\end{bmatrix}
$$

If $j = i$:
$$
\begin{aligned}
\frac{\partial \hat{y}_{j}}{\partial a_{i}^{(L)}} 
&= \frac{\partial}{\partial a_{i}^{(L)}}\left(\frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}}\right) \\
&= \frac{\partial}{\partial a_{j}^{(L)}}\left(\frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}}\right) \\
&= \frac{\big(e^{a_{j}^{(L)}}\big)' \cdot \sum_{k} e^{a_{k}^{(L)}} - e^{a_{j}^{(L)}} \cdot e^{a_{j}^{(L)}}}{\big(\sum_{k} e^{a_{k}^{(L)}}\big)^{2}} \\
&= \frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}} - \frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}} \cdot \frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}} = \hat{y}_{j}\big(1 - \hat{y}_{j}\big)
\end{aligned}
$$

If $j \neq i$:
$$
\begin{aligned}
\frac{\partial \hat{y}_{j}}{\partial a_{i}^{(L)}} 
&= \frac{\partial}{\partial a_{i}^{(L)}}\left(\frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}}\right) \\
&= \frac{0 \cdot \sum_{k} e^{a_{k}^{(L)}} - e^{a_{i}^{(L)}} \cdot e^{a_{j}^{(L)}}}{\big(\sum_{k} e^{a_{k}^{(L)}}\big)^{2}} \\
&= -\frac{e^{a_{i}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}} \cdot \frac{e^{a_{j}^{(L)}}}{\sum_{k} e^{a_{k}^{(L)}}} = -\hat{y}_{i} \hat{y}_{j}
\end{aligned}
$$
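The two cases can be collected into one Jacobian, $\frac{\partial \hat{y}_j}{\partial a_i^{(L)}} = \hat{y}_j(\delta_{ij} - \hat{y}_i)$. Below is a short sketch (made-up numbers) that builds this matrix and checks it against finite differences:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    """J[j, i] = d y_hat_j / d a_i = y_hat_j * (delta_ij - y_hat_i)."""
    y_hat = softmax(a)
    return np.diag(y_hat) - np.outer(y_hat, y_hat)

a = np.array([0.3, -1.0, 2.0])
eps = 1e-6
J_num = np.column_stack([(softmax(a + eps * np.eye(3)[i]) - softmax(a)) / eps
                         for i in range(3)])   # column i holds d y_hat / d a_i
print(np.allclose(softmax_jacobian(a), J_num, atol=1e-4))
```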

Computing $\frac{\partial \boldsymbol{a}^{(L)}}{\partial \boldsymbol{z}^{(L)}}$

This factor comes from the activation function, i.e.

$$
\boldsymbol{a}^{(L)} = \sigma(\boldsymbol{z}^{(L)}),
$$

and since the activation function is applied element-wise, we have

$$
\frac{\partial \boldsymbol{a}^{(L)}}{\partial \boldsymbol{z}^{(L)}} = \text{diag}\big(\sigma'(\boldsymbol{z}^{(L)})\big)
$$

The final expression for $\delta^{(L)}$

Putting the pieces together, the full expression is
$$
\begin{aligned}
(1 \times M_{L}) &= (1 \times M_{L})(M_{L} \times M_{L})(M_{L} \times M_{L}) \\
\delta^{(L)} &= \frac{\partial \mathcal{L}}{\partial \hat{\boldsymbol{y}}} \frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{a}^{(L)}} \frac{\partial \boldsymbol{a}^{(L)}}{\partial \boldsymbol{z}^{(L)}} \\
&= \left[-y_1 \frac{1}{\hat{y}_1}, -y_2 \frac{1}{\hat{y}_2}, \cdots, -y_K \frac{1}{\hat{y}_K}\right]
\begin{bmatrix}
\frac{\partial \hat{y}_{1}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{1}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{1}}{\partial a_{K}^{(L)}} \\
\frac{\partial \hat{y}_{2}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{2}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{2}}{\partial a_{K}^{(L)}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \hat{y}_{K}}{\partial a_{1}^{(L)}} & \frac{\partial \hat{y}_{K}}{\partial a_{2}^{(L)}} & \cdots & \frac{\partial \hat{y}_{K}}{\partial a_{K}^{(L)}}
\end{bmatrix}
\text{diag}\big(\sigma'(\boldsymbol{z}^{(L)})\big)
\end{aligned}
$$
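For a one-hot $\boldsymbol{y}$, multiplying the first two factors gives entry $i$ equal to $\hat{y}_i \sum_j y_j - y_i = \hat{y}_i - y_i$, so in this section's setup ($\hat{\boldsymbol{y}} = \text{softmax}(\sigma(\boldsymbol{z}^{(L)}))$) the error term reduces to $\delta^{(L)} = (\hat{\boldsymbol{y}} - \boldsymbol{y})^{T}\, \text{diag}(\sigma'(\boldsymbol{z}^{(L)}))$. A sketch with a finite-difference check (sigmoid as $\sigma$, made-up numbers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def delta_output(y, z_L):
    """delta^(L) = (dL/dy_hat)(dy_hat/da^(L)) diag(sigma'(z^(L))) for y_hat = softmax(sigma(z^(L)));
    with a one-hot y the first two factors reduce to (y_hat - y)."""
    y_hat = softmax(sigmoid(z_L))
    s = sigmoid(z_L)
    return (y_hat - y) * s * (1.0 - s)

z_L = np.array([0.2, -0.4, 1.1])
y = np.array([0.0, 1.0, 0.0])
loss = lambda z: -np.sum(y * np.log(softmax(sigmoid(z))))
eps = 1e-6
num = np.array([(loss(z_L + eps * np.eye(3)[i]) - loss(z_L)) / eps for i in range(3)])
print(np.allclose(delta_output(y, z_L), num, atol=1e-4))
```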

The final result for $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial w_{ij}^{(l)}}$

Therefore, collecting all of the results above, we obtain the final derivative $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial w_{ij}^{(l)}}$:

$$
\begin{aligned}
(1 \times 1) &= (1 \times M_{l})(M_{l} \times 1) \\
\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial w_{ij}^{(l)}} &= \frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial \boldsymbol{z}^{(l)}} \frac{\partial \boldsymbol{z}^{(l)}}{\partial w_{ij}^{(l)}} \\
&= \delta^{(L)} \left(\prod_{k=L}^{l+1} \boldsymbol{W}^{(k)} \cdot \text{diag}\big(\sigma'(\boldsymbol{z}^{(k-1)})\big)\right) \Big[0, \ldots, a_{j}^{(l-1)}, \ldots, 0\Big]^{T}
\end{aligned}
$$

From this expression we can see that when computing $\frac{\partial \mathcal{L}(\boldsymbol{y}, \hat{\boldsymbol{y}})}{\partial w_{ij}^{(l)}}$, the closer a layer is to the last layer (i.e. the closer $l$ is to $L$), the less computation the derivative requires; moreover, the gradients of parameters close to the input layer reuse quantities already computed close to the output layer. This is precisely backpropagation.
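To close the loop, here is a minimal end-to-end sketch (sigmoid activations, biases omitted for brevity, made-up sizes) that computes $\frac{\partial \mathcal{L}}{\partial w_{ij}^{(l)}} = \delta_i^{(l)} a_j^{(l-1)}$ via the error-term recursion above and checks one entry against a finite difference:

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def softmax(a):
    e = np.exp(a - np.max(a)); return e / e.sum()

rng = np.random.default_rng(1)
L = 4                                            # number of layers
sizes = {1: 5, 2: 4, 3: 4, 4: 3}                 # M_1 ... M_L (illustrative)
W = {l: rng.normal(size=(sizes[l], sizes[l - 1])) for l in range(2, L + 1)}
x = rng.normal(size=sizes[1])
y = np.eye(sizes[L])[0]                          # one-hot label

def forward(W):
    zs, acts = {}, {1: x}                        # a^(1) = x
    for l in range(2, L + 1):                    # z^(l) = W^(l) a^(l-1), a^(l) = sigma(z^(l))
        zs[l] = W[l] @ acts[l - 1]
        acts[l] = sigmoid(zs[l])
    return zs, acts

def loss(W):
    _, acts = forward(W)
    return -np.sum(y * np.log(softmax(acts[L])))

zs, acts = forward(W)
sp = {l: sigmoid(zs[l]) * (1 - sigmoid(zs[l])) for l in zs}        # sigma'(z^(l))
delta = {L: (softmax(acts[L]) - y) * sp[L]}                        # delta^(L)
for l in range(L - 1, 1, -1):
    delta[l] = (delta[l + 1] @ W[l + 1]) * sp[l]                   # recursion from above
grads = {l: np.outer(delta[l], acts[l - 1]) for l in range(2, L + 1)}

# Finite-difference check of a single w_ij^(l):
l, i, j, eps = 3, 1, 2, 1e-6
W_pert = {k: v.copy() for k, v in W.items()}
W_pert[l][i, j] += eps
print(grads[l][i, j], (loss(W_pert) - loss(W)) / eps)              # should agree closely
```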
