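I have read many derivations of back propagation (BP), and none of them felt concise or intuitive enough, so here is my own summary. A multilayer perceptron is simply a fully connected network. Let the input to layer $l$ be $x^{(l)}$; the linear output of the fully connected layer is then $z^{(l)}=W^{(l)}x^{(l)}+b^{(l)}$.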
The superscript $(l)$ always denotes layer $l$. On the way to layer $l+1$, the output of layer $l$ first passes through an activation function $f$, and then

$$z^{(l+1)}=W^{(l+1)}x^{(l+1)}+b^{(l+1)}$$
Expanding this formula element by element, and using the relation $x^{(l+1)}=f(z^{(l)})$ between $x^{(l+1)}$ and $z^{(l)}$, it becomes

$$z_j^{(l+1)}=\sum_i W_{ji}^{(l+1)}f(z_i^{(l)})+b_j^{(l+1)}$$
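The element-wise formula maps directly onto a few lines of code. The following is a minimal pure-Python sketch, assuming a sigmoid for $f$; the function names (`linear`, `forward_layer`) and the layer sizes are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic activation, an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-z))

def linear(W, x, b):
    """z_j = sum_i W[j][i] * x[i] + b[j] for one fully connected layer."""
    return [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]

def forward_layer(W_next, b_next, z_prev, f=sigmoid):
    """Compute z^{(l+1)} from z^{(l)}: apply f, then the linear map."""
    x_next = [f(z_i) for z_i in z_prev]  # x^{(l+1)} = f(z^{(l)})
    return linear(W_next, x_next, b_next)

# Hypothetical sizes: layer l has 2 nodes, layer l+1 has 3 nodes.
z_l = [0.0, 0.0]                                  # sigmoid(0) = 0.5
W_next = [[1.0, 2.0], [0.0, 0.0], [-1.0, 1.0]]    # row j holds W_{j,:}
b_next = [0.5, 0.1, 0.0]
z_next = forward_layer(W_next, b_next, z_l)
```

Row `j` of `W_next` holds the weights $W_{ji}^{(l+1)}$ for all `i`, which matches the index order in the sum.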
Suppose layer $l+1$ is the output layer. The overall cost function is then

$$J(W,b)=L[f(z^{(l+1)}),y]+\Omega(W)$$

where $L$ is the loss and $\Omega(W)$ is the regularization term.
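As one concrete instance of this decomposition, here is a minimal Python sketch. It assumes a squared-error loss for $L$ and an L2 penalty $\frac{\lambda}{2}\sum W^2$ for $\Omega(W)$; the function names and the numbers are hypothetical.

```python
def squared_error(output, target):
    """L = 1/2 * sum_k (o_k - y_k)^2, one assumed choice for L."""
    return 0.5 * sum((o - y) ** 2 for o, y in zip(output, target))

def l2_penalty(weights, lam):
    """Omega(W) = lam/2 * sum of squared weights over all layers (biases excluded)."""
    return 0.5 * lam * sum(w * w for W in weights for row in W for w in row)

def cost(output, target, weights, lam):
    """J(W, b) = L[f(z^{(l+1)}), y] + Omega(W)."""
    return squared_error(output, target) + l2_penalty(weights, lam)

# Hypothetical numbers for illustration only.
output = [0.8, 0.2]                      # f(z^{(l+1)})
target = [1.0, 0.0]                      # y
weights = [[[1.0, -1.0], [2.0, 0.0]]]    # a list with one 2x2 weight matrix
J = cost(output, target, weights, lam=0.1)
```

Only the weights enter the penalty; the biases are left unregularized, matching the form $\Omega(W)$.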
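Back propagation computes two partial derivatives for every node of every layer (layer $i$, say): $\frac{\partial J(W,b)}{\partial W^{(i)}}$ and $\frac{\partial J(W,b)}{\partial b^{(i)}}$. The point is to show that the derivatives can be passed backward from layer to layer without running dry along the way. Before differentiating, let me state the loose formulas above more carefully. Still assume that the output of layer $l+1$ goes through the activation $f$ and then straight into the loss, and define layer $l$ to have $S_l$ nodes and layer $l+1$ to have $S_{l+1}$ nodes. For node $i$ of layer $l$ and the loss $J(W,b)$, the relationship is the one shown in the figure: $z_i^{(l)}$ passes through $f$, fans out to all $S_{l+1}$ nodes $z_j^{(l+1)}$ of layer $l+1$, and reaches $J(W,b)$ through each of them.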
Now take the partial derivatives with respect to the parameters of layer $l+1$ (leaving aside $\Omega(W)$, which depends on $W$ directly and can be differentiated separately). Here $z_i^{(l+1)}$ is the linear output of node $i$ in layer $l+1$, and $W_{ij}^{(l+1)}$ is the weight that node $i$ in layer $l+1$ applies to the activated output (after $f$, e.g. a sigmoid) of node $j$ in the previous layer, that is, to $x_j^{(l+1)}$:

$$\begin{aligned}\frac{\partial J(W,b)}{\partial W_{ij}^{(l+1)}}&=\frac{\partial J(W,b)}{\partial z_i^{(l+1)}}\frac{\partial z_i^{(l+1)}}{\partial W_{ij}^{(l+1)}}\\&=\frac{\partial J(W,b)}{\partial z_i^{(l+1)}}x_j^{(l+1)}\end{aligned}$$
The factor $\frac{\partial J(W,b)}{\partial z_i^{(l+1)}}$ depends on the concrete form of $J$, so define

$$\delta_i^{(l+1)}=\frac{\partial J(W,b)}{\partial z_i^{(l+1)}}$$
Then

$$\begin{aligned}\frac{\partial J(W,b)}{\partial W_{ij}^{(l+1)}}&=\frac{\partial J(W,b)}{\partial z_i^{(l+1)}}\frac{\partial z_i^{(l+1)}}{\partial W_{ij}^{(l+1)}}\\&=\delta_i^{(l+1)}x_j^{(l+1)}\end{aligned}$$
The bias follows the same pattern, with $\frac{\partial z_i^{(l+1)}}{\partial b_i^{(l+1)}}=1$:

$$\begin{aligned}\frac{\partial J(W,b)}{\partial b_i^{(l+1)}}&=\frac{\partial J(W,b)}{\partial z_i^{(l+1)}}\frac{\partial z_i^{(l+1)}}{\partial b_i^{(l+1)}}\\&=\delta_i^{(l+1)}\end{aligned}$$
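Once $\delta^{(l+1)}$ is known, both gradients are cheap to form: since $z_i=\sum_j W_{ij}x_j+b_i$, the weight gradient is the outer product of $\delta$ with the layer input, and the bias gradient is $\delta$ itself. A minimal Python sketch, with a hypothetical helper name `layer_gradients` and made-up values:

```python
def layer_gradients(delta, x):
    """Return (dW, db) with dW[i][j] = delta[i] * x[j] and db[i] = delta[i]."""
    dW = [[d_i * x_j for x_j in x] for d_i in delta]
    db = list(delta)
    return dW, db

# Hypothetical values: 2 nodes in layer l+1, 3 inputs coming from layer l.
delta = [0.5, -1.0]        # delta_i^{(l+1)}
x = [1.0, 2.0, 3.0]        # x_j^{(l+1)} = f(z_j^{(l)})
dW, db = layer_gradients(delta, x)
```

The result `dW` has the same shape as the weight matrix (one row per node of layer $l+1$), so it can be subtracted from the weights element by element in a gradient step.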
With these two in hand, we also want to know what happens one layer further back. For $W$, for example:

$$\begin{aligned}\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}&=\frac{\partial J(W,b)}{\partial z_i^{(l)}}\frac{\partial z_i^{(l)}}{\partial W_{ij}^{(l)}}\\&=\delta_i^{(l)}x_j^{(l)}\end{aligned}$$
The remaining question is how to compute $\delta_i^{(l)}$, and this is where the figure comes in. $\delta_i^{(l)}$ can be viewed as the residual that the loss function produces at node $i$ of layer $l$. As the figure makes clear, $z_i^{(l)}$ influences $J$ only through the $S_{l+1}$ nodes of layer $l+1$, so the chain rule sums over them:

$$\begin{aligned}\delta_i^{(l)}&=\frac{\partial J(W,b)}{\partial z_i^{(l)}}\\&=\sum_{j=1}^{S_{l+1}}\left[\frac{\partial J(W,b)}{\partial z_j^{(l+1)}}\frac{\partial z_j^{(l+1)}}{\partial z_i^{(l)}}\right]\\&=\sum_{j=1}^{S_{l+1}}\left[\delta_j^{(l+1)}W_{ji}^{(l+1)}f'(z_i^{(l)})\right]\end{aligned}$$
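The recurrence can be checked numerically. The sketch below builds a tiny 2-3-2 network in pure Python, computes $\delta$ with the recurrence, and compares one weight gradient against a central finite difference. It assumes sigmoid activations, a squared-error loss, and no regularization; all names, sizes, and numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def linear(W, x, b):
    return [sum(w * v for w, v in zip(row, x)) + b_j for row, b_j in zip(W, b)]

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = linear(W1, x, b1)
    z2 = linear(W2, [sigmoid(z) for z in z1], b2)
    return z1, z2

def cost(params, x, y):
    """J = 1/2 * sum_k (sigmoid(z2_k) - y_k)^2, without a regularizer."""
    _, z2 = forward(params, x)
    return 0.5 * sum((sigmoid(z) - t) ** 2 for z, t in zip(z2, y))

def backward_delta(delta_next, W_next, z):
    """delta_i^{(l)} = sum_j delta_j^{(l+1)} * W_ji^{(l+1)} * f'(z_i^{(l)})."""
    return [sum(d_j * row[i] for d_j, row in zip(delta_next, W_next))
            * sigmoid_prime(z_i)
            for i, z_i in enumerate(z)]

def backprop(params, x, y):
    W1, b1, W2, b2 = params
    z1, z2 = forward(params, x)
    # Output layer: dJ/dz for squared error behind a sigmoid.
    delta2 = [(sigmoid(z) - t) * sigmoid_prime(z) for z, t in zip(z2, y)]
    delta1 = backward_delta(delta2, W2, z1)
    dW1 = [[d * v for v in x] for d in delta1]  # dW1[i][j] = delta1[i] * x[j]
    return delta1, delta2, dW1

# Hypothetical 2-3-2 network and a single training pair.
x, y = [0.5, -1.0], [1.0, 0.0]
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.1, 0.2], [-0.4, 0.5, 0.1]]
b2 = [0.05, -0.05]
params = (W1, b1, W2, b2)
delta1, delta2, dW1 = backprop(params, x, y)

# Central finite difference on W1[1][0]; params holds a reference to W1.
eps = 1e-6
W1[1][0] += eps
J_plus = cost(params, x, y)
W1[1][0] -= 2 * eps
J_minus = cost(params, x, y)
W1[1][0] += eps  # restore the original weight
numeric = (J_plus - J_minus) / (2 * eps)
```

If the recurrence is implemented correctly, `numeric` and `dW1[1][0]` agree up to the finite-difference error. The same check works for any other weight or bias.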
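$\delta_i^{(l)}$ is the heart of BP: it shows that the error can be passed backward layer by layer, and it is really just a recurrence relation. A few important points to add: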
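- Expanding $\delta_i^{(l)}$ by one more layer introduces a chain of multiplied factors, which can make the error explode or vanish. ResNet's improvement follows from this observation: a shortcut connection carries an earlier layer's signal directly to a later layer, which worked very well.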
- The activation function can be sigmoid or ReLU, and the loss $L$ can be squared error or cross-entropy. Generally, squared error suits continuous outputs, while cross-entropy is better suited to classification problems such as binary classification. Note, however, that squared error should not be paired with a sigmoid or softmax activation in the last layer. Taking sigmoid as the example, the expression below then contains the derivative of the sigmoid, which is very small once the unit saturates, so learning becomes very slow:

  $$\delta_j^{(l+1)}=\frac{\partial J(W,b)}{\partial z_j^{(l+1)}}$$

  More concretely, for binary classification with MSE:

  $$\begin{aligned}J(W,b)&=\frac{(\sigma(z_j^{(l+1)})-y)^2}{2}\\\frac{\partial J(W,b)}{\partial z_j^{(l+1)}}&=(\sigma(z_j^{(l+1)})-y)\,\sigma(z_j^{(l+1)})\,(1-\sigma(z_j^{(l+1)}))\end{aligned}$$

  With cross-entropy, on the other hand, suppose there are $m$ classes indexed by $k$ and the label is one-hot (exactly one $y_k$ is 1, the rest are 0):

  $$\begin{aligned}J(W,b)&=-\sum_{k=1}^m y_k\ln\sigma(z_k^{(l+1)})\\\frac{\partial J(W,b)}{\partial z_k^{(l+1)}}&=y_k\,(\sigma(z_k^{(l+1)})-1)\end{aligned}$$
  Clearly cross-entropy descends faster, since the factor $\sigma(1-\sigma)$ no longer appears in the gradient.
- Finally, here is the complete $J(W,b)$, written with the binary cross-entropy loss:

  $$\begin{aligned}J(W,b)&=L[f(z^{(l+1)}),y]+\Omega(W)\\&=-\frac{1}{m}\sum_{i=1}^m\left[y_i\ln o_i+(1-y_i)\ln(1-o_i)\right]+\frac{\lambda}{2}\sum_{l=1}^{N-1}\sum_{i=1}^{S_l}\sum_{j=1}^{S_{l+1}}\left(W_{ji}^{(l+1)}\right)^2\end{aligned}$$

  Here $m$ is the number of training samples, $o_i$ is the network output for sample $i$, and $N$ is the number of layers, so the last sum runs over every weight between layers $l$ and $l+1$.
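Two claims from the list above can be checked with a few lines of Python. The sketch assumes a single sigmoid output unit with a binary label, and compares the output-layer $\delta$ under MSE with the one under binary cross-entropy when the unit is saturated and wrong. It also prints an upper bound on the product of sigmoid derivatives across several layers (weights ignored). The function names and the test point are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def delta_mse(z, y):
    """dJ/dz for J = (sigmoid(z) - y)^2 / 2."""
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)

def delta_cross_entropy(z, y):
    """dJ/dz for J = -[y ln s + (1 - y) ln(1 - s)], with s = sigmoid(z)."""
    return sigmoid(z) - y

# A confidently wrong output: target 1, pre-activation strongly negative.
z, y = -6.0, 1.0
d_mse = delta_mse(z, y)
d_ce = delta_cross_entropy(z, y)
print("MSE delta:", d_mse, "cross-entropy delta:", d_ce)

# sigmoid'(z) is at most 0.25 (reached at z = 0), so n sigmoid layers
# contribute at most 0.25**n to the product of activation derivatives.
chain = [0.25 ** n for n in (1, 5, 10)]
print("bound after 1, 5, 10 layers:", chain)
```

At this point the cross-entropy $\delta$ stays close to $-1$, while the MSE $\delta$ is scaled down by $\sigma(1-\sigma)$ and is smaller by more than two orders of magnitude. That is the slow-learning effect described above.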
For an English-language reference, see: http://neuralnetworksanddeeplearning.com/chap2.html