Explanation and Derivation of the Backpropagation (BP) Algorithm
The given neural network structure:
Known conditions:
- $\mathbf{a}^{(j)} = f\left(\mathbf{z}^{(j)}\right)$
- $\mathbf{z}^{(j)} = \mathbf{W}^{(j)}\mathbf{a}^{(j-1)} + \mathbf{b}^{(j)}$, where $\theta^{(j)} = \left\{\mathbf{W}^{(j)}, \mathbf{b}^{(j)}\right\}$
For the network above, if we want $\frac{\partial l}{\partial \theta^{(j)}}$, we can connect $l$ and $\theta^{(j)}$ through $\mathbf{z}^{(j)}$:

$$\frac{\partial l}{\partial \theta^{(j)}} = \frac{\partial l}{\partial \mathbf{z}^{(j)}} \cdot \frac{\partial \mathbf{z}^{(j)}}{\partial \theta^{(j)}}$$

The connection between $l$ and $\mathbf{z}^{(j)}$ can in turn be established through $\mathbf{z}^{(j+1)}$:

$$\frac{\partial l}{\partial \mathbf{z}^{(j)}} = \frac{\partial l}{\partial \mathbf{z}^{(j+1)}} \cdot \frac{\partial \mathbf{z}^{(j+1)}}{\partial \mathbf{z}^{(j)}} = \frac{\partial l}{\partial \mathbf{z}^{(j+1)}} \cdot \frac{\partial \mathbf{z}^{(j+1)}}{\partial \mathbf{a}^{(j)}} \cdot \frac{\partial \mathbf{a}^{(j)}}{\partial \mathbf{z}^{(j)}}$$

Combining the two, we obtain (by the chain rule)

$$\frac{\partial l}{\partial \theta^{(j)}} = \frac{\partial l}{\partial \mathbf{z}^{(j+1)}} \cdot \frac{\partial \mathbf{z}^{(j+1)}}{\partial \mathbf{a}^{(j)}} \cdot \frac{\partial \mathbf{a}^{(j)}}{\partial \mathbf{z}^{(j)}} \cdot \frac{\partial \mathbf{z}^{(j)}}{\partial \theta^{(j)}}$$

and this factorization can be iterated layer by layer, all the way back through the network.
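The chain-rule factorization above can be checked numerically. The sketch below uses a hypothetical one-neuron-per-layer network with tanh activations and a squared-error loss (these choices are assumptions for illustration, not from the original): the product of the four local derivatives is compared against a central finite difference.

```python
# Scalar sanity check of the chain rule (hypothetical toy network:
# tanh activation, squared-error loss -- assumptions for illustration).
import math

def forward(w1, b1, w2, b2, x, y):
    z1 = w1 * x + b1                     # z^(1) = W^(1) a^(0) + b^(1)
    a1 = math.tanh(z1)                   # a^(1) = f(z^(1))
    z2 = w2 * a1 + b2                    # z^(2) = W^(2) a^(1) + b^(2)
    return z1, a1, z2, (z2 - y) ** 2     # loss l

w1, b1, w2, b2, x, y = 0.5, 0.1, -0.3, 0.2, 1.5, 1.0
z1, a1, z2, l = forward(w1, b1, w2, b2, x, y)

# Chain rule: dl/dw1 = dl/dz2 * dz2/da1 * da1/dz1 * dz1/dw1
dl_dz2 = 2 * (z2 - y)
dz2_da1 = w2
da1_dz1 = 1 - math.tanh(z1) ** 2         # tanh'(z) = 1 - tanh(z)^2
dz1_dw1 = x
grad = dl_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Central finite-difference check
eps = 1e-6
l_plus = forward(w1 + eps, b1, w2, b2, x, y)[3]
l_minus = forward(w1 - eps, b1, w2, b2, x, y)[3]
grad_fd = (l_plus - l_minus) / (2 * eps)
assert abs(grad - grad_fd) < 1e-6
```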
Now let us look closely at the expression below:
Here $\frac{\partial \mathbf{z}^{(j+1)}}{\partial \mathbf{a}^{(j)}} = \mathbf{W}^{(j+1)}$, while $\frac{\partial \mathbf{a}^{(j)}}{\partial \mathbf{z}^{(j)}} = f'\left(\mathbf{z}^{(j)}\right)$. Substituting these two identities into the expression above yields a new form:
So what do $\frac{\partial l}{\partial \mathbf{W}^{(j)}}$ and $\frac{\partial l}{\partial \mathbf{b}^{(j)}}$ look like?
Now let us analyze the backpropagation process for a somewhat more complex network structure:
Known conditions:
- $l = l(h)$
- $\begin{aligned} h &= f\left(w_{1,1}^{(3)} a_1^{(2)} + w_{2,1}^{(3)} a_2^{(2)}\right) \\ &= f\left(w_{1,1}^{(3)} f\left(z_1^{(2)}\right) + w_{2,1}^{(3)} f\left(z_2^{(2)}\right)\right) \\ &= f\left(w_{1,1}^{(3)} f\left(w_{1,1}^{(2)} f\left(z_1^{(1)}\right)\right) + w_{2,1}^{(3)} f\left(w_{2,1}^{(2)} f\left(z_1^{(1)}\right)\right)\right) \end{aligned}$
Now let $g_1\left(z_1^{(1)}\right) = w_{1,1}^{(3)} f\left(w_{1,1}^{(2)} f\left(z_1^{(1)}\right)\right)$ and $g_2\left(z_1^{(1)}\right) = w_{2,1}^{(3)} f\left(w_{2,1}^{(2)} f\left(z_1^{(1)}\right)\right)$, so that the expression for $h$ becomes:
- $h = f\left(g_1\left(z_1^{(1)}\right) + g_2\left(z_1^{(1)}\right)\right)$
Next, we compute $\frac{\partial h}{\partial z_1^{(1)}}$ and simplify:
- $\begin{aligned} \frac{\partial h}{\partial z_1^{(1)}} &= \frac{\partial h}{\partial g_1} \cdot \frac{\partial g_1}{\partial z_1^{(1)}} + \frac{\partial h}{\partial g_2} \cdot \frac{\partial g_2}{\partial z_1^{(1)}} \\ &= \frac{\partial h}{\partial z_1^{(2)}}\, w_{1,1}^{(2)} f'\left(z_1^{(1)}\right) + \frac{\partial h}{\partial z_2^{(2)}}\, w_{2,1}^{(2)} f'\left(z_1^{(1)}\right) \\ &= \left[\frac{\partial h}{\partial z_1^{(2)}}\, w_{1,1}^{(2)} + \frac{\partial h}{\partial z_2^{(2)}}\, w_{2,1}^{(2)}\right] f'\left(z_1^{(1)}\right) \end{aligned}$

  (using $\frac{\partial h}{\partial g_k} \cdot \frac{\partial g_k}{\partial z_k^{(2)}} = \frac{\partial h}{\partial z_k^{(2)}}$ and $\frac{\partial z_k^{(2)}}{\partial z_1^{(1)}} = w_{k,1}^{(2)} f'\left(z_1^{(1)}\right)$)
- This yields the recurrence $\delta_1^{(1)} = \left[\delta_1^{(2)} w_{1,1}^{(2)} + \delta_2^{(2)} w_{2,1}^{(2)}\right] f'\left(z_1^{(1)}\right)$, where $\delta_i^{(j)} \equiv \frac{\partial h}{\partial z_i^{(j)}}$.
Finally, from the above we obtain $\frac{\partial h}{\partial w_1^{(1)}}$ and $\frac{\partial h}{\partial b_1^{(1)}}$ as follows:
- $\frac{\partial h}{\partial w_1^{(1)}} = \frac{\partial h}{\partial z_1^{(1)}} \frac{\partial z_1^{(1)}}{\partial w_1^{(1)}} = \delta_1^{(1)} a^{(0)} = \delta_1^{(1)} x_1$
- $\frac{\partial h}{\partial b_1^{(1)}} = \frac{\partial h}{\partial z_1^{(1)}} \frac{\partial z_1^{(1)}}{\partial b_1^{(1)}} = \delta_1^{(1)}$
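The small network above can be worked through in code. This sketch assumes tanh activations and hypothetical weight values (neither is specified in the original): it computes $\delta^{(3)} \to \delta_k^{(2)} \to \delta_1^{(1)}$ backwards exactly as derived, then checks $\frac{\partial h}{\partial w_1^{(1)}} = \delta_1^{(1)} x_1$ against a finite difference.

```python
# The derivation's network: one input, one neuron in layer 1, two in
# layer 2, scalar output h. Activation f = tanh and all parameter
# values are hypothetical choices for illustration.
import math

f = math.tanh
fp = lambda z: 1 - math.tanh(z) ** 2    # f'(z) for tanh

x = 0.7
w1, b1 = 0.4, 0.1                       # layer 1: z_1^(1) = w1*x + b1
w2 = [0.3, -0.5]; b2 = [0.2, 0.05]      # layer 2: z_k^(2) = w2[k]*a1 + b2[k]
w3 = [0.6, 0.9]                         # output: z^(3) = sum_k w3[k]*a_k^(2)

def forward(w1):
    z1 = w1 * x + b1; a1 = f(z1)
    z2 = [w2[k] * a1 + b2[k] for k in range(2)]
    a2 = [f(z) for z in z2]
    z3 = w3[0] * a2[0] + w3[1] * a2[1]
    return z1, z2, z3, f(z3)            # h = f(z^(3))

z1, z2, z3, h = forward(w1)

# Backward pass: delta^(3) -> delta_k^(2) -> delta_1^(1)
d3 = fp(z3)                                        # dh/dz^(3)
d2 = [fp(z2[k]) * w3[k] * d3 for k in range(2)]    # delta_k^(2)
d1 = fp(z1) * (w2[0] * d2[0] + w2[1] * d2[1])      # delta_1^(1)

dh_dw1 = d1 * x        # dh/dw^(1) = delta_1^(1) * a^(0)
dh_db1 = d1            # dh/db^(1) = delta_1^(1)

# Finite-difference check of dh/dw^(1)
eps = 1e-6
num = (forward(w1 + eps)[3] - forward(w1 - eps)[3]) / (2 * eps)
assert abs(dh_dw1 - num) < 1e-6
```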
By generalizing the relation between $\delta^{(j)}$ and $\delta^{(j+1)}$, we arrive at the single most important BP formula:
- $\delta_i^{(j)} = f'\left(z_i^{(j)}\right) \cdot \left[\sum_{k=1}^{N_{j+1}} w_{k,i}^{(j+1)} \delta_k^{(j+1)}\right]$
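In matrix form, the sum over $k$ above is just $\left(\mathbf{W}^{(j+1)}\right)^{T} \boldsymbol{\delta}^{(j+1)}$, followed by an elementwise product with $f'\left(\mathbf{z}^{(j)}\right)$. A minimal sketch (layer sizes and random values are hypothetical):

```python
# Vectorized form of the key BP recurrence:
#   delta^(j) = f'(z^(j)) * (W^(j+1)^T @ delta^(j+1))
import numpy as np

def backprop_delta(W_next, delta_next, z_j, f_prime):
    """delta^(j) from delta^(j+1) -- the recurrence in matrix form."""
    return f_prime(z_j) * (W_next.T @ delta_next)

rng = np.random.default_rng(0)
W_next = rng.normal(size=(3, 4))        # 3 neurons in layer j+1, 4 in layer j
delta_next = rng.normal(size=3)         # delta^(j+1)
z_j = rng.normal(size=4)                # pre-activations of layer j
f_prime = lambda z: 1 - np.tanh(z) ** 2 # tanh derivative (assumed activation)

delta_j = backprop_delta(W_next, delta_next, z_j, f_prime)

# Elementwise check against the summed form of the formula, for i = 2:
i = 2
manual = f_prime(z_j[i]) * sum(W_next[k, i] * delta_next[k] for k in range(3))
assert abs(delta_j[i] - manual) < 1e-12
```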
As illustrated in the figure:
Here $w_{k,i}^{(j+1)}$ is simply the stored weight value, $\delta_k^{(j+1)}$ is obtained by backpropagation from $\delta^{(j+2)}$, and $f'\left(z_i^{(j)}\right)$ is computed by plugging $z_i^{(j)}$ into the derivative of layer $j$'s activation function. Below are several common activation functions and their derivatives:
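The common activation/derivative pairs referred to above (sigmoid, tanh, ReLU are the usual choices; the original's own list did not survive extraction) can be sketched as plain functions:

```python
# Common activation functions and their derivatives.
import math

def sigmoid(z):   return 1.0 / (1.0 + math.exp(-z))
def d_sigmoid(z): return sigmoid(z) * (1.0 - sigmoid(z))   # s(z)(1 - s(z))

def tanh(z):   return math.tanh(z)
def d_tanh(z): return 1.0 - math.tanh(z) ** 2              # 1 - tanh(z)^2

def relu(z):   return max(0.0, z)
def d_relu(z): return 1.0 if z > 0 else 0.0                # subgradient at 0 taken as 0

# Quick finite-difference check of sigmoid's derivative at z = 0.3
eps = 1e-6
fd = (sigmoid(0.3 + eps) - sigmoid(0.3 - eps)) / (2 * eps)
assert abs(fd - d_sigmoid(0.3)) < 1e-6
```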
But why do we need the BP algorithm in the first place?
Explanation:
- Without BP, computing the gradient for any one layer would require re-walking the chain of derivatives through every layer after it, and this would be repeated for the parameters of every single neuron: the computational cost explodes.
- With BP, we work backwards layer by layer: compute and store $\delta^{(j+1)}$ for a layer, use it to obtain the parameter gradients of that layer, then use it again to compute $\delta^{(j)}$ for the previous layer, iterating toward the input.
- In essence, BP is dynamic programming. The core idea: save results already computed, reuse them in the next step, and exploit the recurrence between them, thereby saving an enormous amount of computation.
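The dynamic-programming view above can be sketched end to end: one backward sweep computes and caches $\delta^{(j)}$ per layer, so every parameter gradient reuses it instead of re-walking the whole chain. Layer sizes, the tanh activation, and the squared-error loss are assumptions for illustration.

```python
# Minimal full-network BP sketch (hypothetical 2-3-1 network, tanh
# activations, loss l = 0.5 * ||a^(L) - y||^2 -- all assumed).
import numpy as np

def forward(Ws, bs, x):
    a, zs, acts = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = np.tanh(z)
        zs.append(z); acts.append(a)
    return zs, acts

def backward(Ws, zs, acts, y):
    # Output layer: delta^(L) = dl/dz^(L)
    delta = (acts[-1] - y) * (1 - np.tanh(zs[-1]) ** 2)
    grads = []
    for j in range(len(Ws) - 1, -1, -1):
        grads.append((np.outer(delta, acts[j]), delta))  # (dl/dW, dl/db)
        if j > 0:  # the cached delta is reused for the previous layer
            delta = (Ws[j].T @ delta) * (1 - np.tanh(zs[j - 1]) ** 2)
    return grads[::-1]

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [rng.normal(size=3), rng.normal(size=1)]
x, y = rng.normal(size=2), np.array([0.5])

zs, acts = forward(Ws, bs, x)
grads = backward(Ws, zs, acts, y)

# Finite-difference check of one first-layer weight gradient
eps = 1e-6
Wp = [W.copy() for W in Ws]; Wp[0][1, 0] += eps
Wm = [W.copy() for W in Ws]; Wm[0][1, 0] -= eps
lp = 0.5 * np.sum((forward(Wp, bs, x)[1][-1] - y) ** 2)
lm = 0.5 * np.sum((forward(Wm, bs, x)[1][-1] - y) ** 2)
assert abs(grads[0][0][1, 0] - (lp - lm) / (2 * eps)) < 1e-5
```

Note that without the cached `delta`, the gradient of `Ws[0]` would have to redo the layer-2 chain-rule products from scratch, which is exactly the redundant work BP eliminates.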