Deep Learning Fundamentals - 6.1 Convolutional Neural Networks

The Mathematical Definition of Convolution

$$h(x)=(f*g)(x) = \int_{-\infty}^{\infty} f(t)g(x-t)dt \tag{1}$$

Convolution is closely related to the Fourier transform. The property that the product of the Fourier transforms of two functions equals the Fourier transform of their convolution simplifies the treatment of many problems in Fourier analysis.

Discrete definition:

$$h(x) = (f*g)(x) = \sum^{\infty}_{t=-\infty} f(t)g(x-t) \tag{2}$$

Forward Computation of Convolution in Convolutional Neural Networks

A One-Dimensional Convolution Example

Suppose we have two dice $f$ and $g$. How do we compute the probability that the two rolled numbers sum to 4?

Case 1: $f(1)g(3)$, $1+3=4$
Case 2: $f(2)g(2)$, $2+2=4$
Case 3: $f(3)g(1)$, $3+1=4$

Therefore, the probability that the two dice sum to 4 is:

$$h(4) = f(1)g(3)+f(2)g(2)+f(3)g(1) = f(1)g(4-1) + f(2)g(4-2) + f(3)g(4-3)$$

This matches the definition of convolution; written in the standard form of Equation 2, it is:

$$h(4)=(f*g)(4)=\sum_{t=1}^{3}f(t)g(4-t)$$
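The dice example can be checked numerically. Below is a minimal sketch (assuming NumPy and fair six-sided dice; the variable names are illustrative): the distribution of the sum of two dice is the discrete convolution of their probability mass functions.

```python
import numpy as np

f = np.full(6, 1/6)     # P(die f shows 1..6)
g = np.full(6, 1/6)     # P(die g shows 1..6)

h = np.convolve(f, g)   # h[k] = P(sum == k + 2); length 11 covers sums 2..12

print(h[4 - 2])         # probability that the sum equals 4 -> 3/36 ≈ 0.0833
```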

Single-Input, Single-Output 2D Convolution

2D convolution is generally used in image processing. When convolving over a 2D image, if we denote the image (Image) by $I$ and the convolution kernel (Kernel) by $K$, the convolution value at pixel $(i,j)$ of the output image is:

$$h(i,j) = (I*K)(i,j)=\sum_m \sum_n I(m,n)K(i-m,j-n) \tag{3}$$

As you can see, this is consistent with Equation 2 in the one-dimensional case. By the commutativity of convolution, Equation 3 can be written equivalently as:

$$h(i,j) = (I*K)(i,j)=\sum_m \sum_n I(i-m,j-n)K(m,n) \tag{4}$$

Equation 4 holds because the kernel has been flipped. In neural networks, what is usually implemented is a cross-correlation function, which is almost the same as convolution except that the kernel is not flipped:

$$h(i,j) = (I*K)(i,j)=\sum_m \sum_n I(i+m,j+n)K(m,n) \tag{5}$$

In image processing, the autocorrelation and cross-correlation functions are defined as follows:

  • Autocorrelation: for a function $f(t)$, $h=f(t) \star f(-t)$, where $\star$ denotes convolution
  • Cross-correlation: for two functions $f(t)$ and $g(t)$, $h=f(t) \star g(-t)$

Cross-correlation slides the two sequences against each other and multiplies them, with neither sequence flipped. Convolution is also a sliding multiplication, but one of the sequences must be flipped before the multiplication. So, strictly speaking, what machine learning implements is the cross-correlation function, not convolution in its original mathematical sense. For simplicity, however, we also call Equation 5 convolution. This is where the name comes from.

Conclusions:
  • The convolution operation we implement is not convolution in its original mathematical sense, but an engineering convolution, which we simply call convolution
  • The convolution kernel is not flipped when the operation is implemented (see the sketch below)
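A minimal sketch of this "engineering convolution" follows, assuming NumPy; the function name `conv2d` and the toy image/kernel are illustrative only. It slides the kernel over the image and multiplies without flipping, i.e. a valid cross-correlation with stride 1.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (what CNNs call convolution), stride 1."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply the kernel against the window at (i, j) and sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # a 4x4 toy "image"
kernel = np.array([[1., 0.], [0., -1.]])           # a 2x2 kernel
print(conv2d(image, kernel))                       # 3x3 output map
```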

Single-Input, Multi-Output (Dimension-Raising) Convolution

The original input is a single-channel image, but we can apply multiple convolution kernels to it separately and obtain multiple feature outputs.
For example, a 4x4 image processed by two kernels in parallel produces two 2x2 output images. During training, the two kernels learn different features.

Multi-Input, Single-Output (Dimension-Reducing) Convolution

An image is usually in color and has red, green, and blue channels. We have two options for handling it:

  • Convert it to grayscale, so each pixel has only one value, and 2D convolution can be used
  • Use one filter per channel, processing the red, green, and blue information separately

Obviously, the second approach can learn more features from the image, and this leads to 3D convolution: one convolution kernel contains three sub-kernels of identical size, for example 5x5 each, so the kernel becomes a 3x5x5 volume, hence the name 3D convolution.

We can process this color image with a single convolution kernel, but that kernel must contain three filters, one per channel.

Although the input image has multiple channels, i.e. it is three-dimensional, after computation with the same number of filters the per-channel results are added into a single channel, i.e. 2D data, which is why this is called dimension reduction. This simplifies the computation on multi-channel data, but it also loses the color information that the multiple channels carry.
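A minimal sketch of multi-input, single-output convolution, assuming NumPy; the helper names `xcorr2d` and `multi_in_single_out` and the random toy data are illustrative. Each channel is cross-correlated with its own filter, and the per-channel results are summed into one 2D feature map.

```python
import numpy as np

def xcorr2d(img, k):
    """Valid 2D cross-correlation, stride 1."""
    kh, kw = k.shape
    H, W = img.shape
    return np.array([[np.sum(img[i:i+kh, j:j+kw] * k)
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])

def multi_in_single_out(image_chw, kernel_chw):
    """Sum the per-channel cross-correlations into one 2D feature map."""
    return sum(xcorr2d(image_chw[c], kernel_chw[c])
               for c in range(image_chw.shape[0]))

rgb = np.random.rand(3, 5, 5)                 # 3-channel 5x5 toy image
kernel = np.random.rand(3, 3, 3)              # one kernel with three 3x3 sub-kernels
print(multi_in_single_out(rgb, kernel).shape) # (3, 3): a single output channel
```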

Backpropagation of Convolution in Convolutional Neural Networks

Training the Convolution Layer

As with a fully connected layer, training a convolution layer requires the error matrix passed back from the next layer (the one closer to the output), from which we compute:

  • the gradient of this layer's weight matrix
  • the error matrix that this layer needs to pass back to the previous layer

In the description that follows, we assume that the error matrix passed back from the next layer has already been obtained and has already been propagated through the derivative of the activation function.

Computing the Backpropagated Gradient Matrix

Forward formula:

$$Z = W*A+b \tag{0}$$

where $W$ is the convolution kernel, $*$ denotes the convolution (cross-correlation) operation, $A$ is the input to the current layer, $b$ is the bias (not shown), and $Z$ is the output of the current layer before the activation function is applied.
Expanded element by element, this gives the following formulas:

$$z_{11} = w_{11} \cdot a_{11} + w_{12} \cdot a_{12} + w_{21} \cdot a_{21} + w_{22} \cdot a_{22} + b \tag{1}$$
$$z_{12} = w_{11} \cdot a_{12} + w_{12} \cdot a_{13} + w_{21} \cdot a_{22} + w_{22} \cdot a_{23} + b \tag{2}$$
$$z_{21} = w_{11} \cdot a_{21} + w_{12} \cdot a_{22} + w_{21} \cdot a_{31} + w_{22} \cdot a_{32} + b \tag{3}$$
$$z_{22} = w_{11} \cdot a_{22} + w_{12} \cdot a_{23} + w_{21} \cdot a_{32} + w_{22} \cdot a_{33} + b \tag{4}$$
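Below is a minimal sketch (assuming NumPy; `A`, `W`, `b` are illustrative toy values) that reproduces Equations 1-4 numerically: a 3x3 input and a 2x2 kernel give a 2x2 output whose entries follow the expansions above.

```python
import numpy as np

A = np.arange(1, 10, dtype=float).reshape(3, 3)   # a11..a33
W = np.array([[1., 2.], [3., 4.]])                # w11, w12, w21, w22
b = 0.5

Z = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # e.g. Z[0, 0] = w11*a11 + w12*a12 + w21*a21 + w22*a22 + b  (Equation 1)
        Z[i, j] = np.sum(A[i:i+2, j:j+2] * W) + b

print(Z)
```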

Gradient of the loss function $J$ with respect to $a_{11}$:

$$\frac{\partial J}{\partial a_{11}}=\frac{\partial J}{\partial z_{11}} \frac{\partial z_{11}}{\partial a_{11}}=\delta_{z11}\cdot w_{11} \tag{5}$$

In the equation above, $\delta_{z11}$ is the gradient passed back from the rear of the network to unit $z_{11}$ of this layer.

To find the gradient of $J$ with respect to $a_{12}$, look at the forward formulas: $a_{12}$ contributes to both $z_{11}$ and $z_{12}$, so the two partial derivatives must be added:

$$\frac{\partial J}{\partial a_{12}}=\frac{\partial J}{\partial z_{11}} \frac{\partial z_{11}}{\partial a_{12}}+\frac{\partial J}{\partial z_{12}} \frac{\partial z_{12}}{\partial a_{12}}=\delta_{z11} \cdot w_{12}+\delta_{z12} \cdot w_{11} \tag{6}$$

The most complicated case is the gradient of $a_{22}$: the forward formulas show that every output receives a contribution from $a_{22}$, so:

$$\frac{\partial J}{\partial a_{22}}=\frac{\partial J}{\partial z_{11}} \frac{\partial z_{11}}{\partial a_{22}}+\frac{\partial J}{\partial z_{12}} \frac{\partial z_{12}}{\partial a_{22}}+\frac{\partial J}{\partial z_{21}} \frac{\partial z_{21}}{\partial a_{22}}+\frac{\partial J}{\partial z_{22}} \frac{\partial z_{22}}{\partial a_{22}}$$
$$=\delta_{z11} \cdot w_{22} + \delta_{z12} \cdot w_{21} + \delta_{z21} \cdot w_{12} + \delta_{z22} \cdot w_{11} \tag{7}$$

The gradients of all the other $a$ values can be obtained in the same way.

Looking at the order of the $w$'s in Equation 7, it is as if the original kernel had been rotated by 180 degrees and then convolved with the incoming error terms; this yields the error term for every element. Equations 5 and 6 look incomplete only because those elements sit in the corners, which is the same phenomenon as padding in the forward convolution. Therefore, if we zero-pad the incoming error matrix Delta-In and convolve it with the 180-degree-rotated kernel, we obtain the outgoing error matrix Delta-Out.
Finally, this can be unified into one concise formula:

$$\delta_{out} = \delta_{in} * W^{rot180} \tag{8}$$

This error matrix can then continue to be passed back to the previous layer.

When the weights are 3x3, $\delta_{in}$ needs padding=2, i.e. two rings of zeros, so that after convolution with the weights, $\delta_{out}$ has the correct size.
When the weights are 5x5, $\delta_{in}$ needs padding=4, i.e. four rings of zeros, so that after convolution with the weights, $\delta_{out}$ has the correct size.
In general, when the weights are NxN, $\delta_{in}$ needs padding=N-1, i.e. N-1 rings of zeros.
For example:

Forward pass with stride=1: $A^{(10 \times 8)} * W^{(5 \times 5)} = Z^{(6 \times 4)}$

Backward pass: $\delta_z^{(6 \times 4)}$ with padding of 4 becomes $\delta_z^{(14 \times 12)}$

Then: $\delta_z^{(14 \times 12)} * W^{rot180(5 \times 5)} = \delta_a^{(10 \times 8)}$
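A minimal sketch of Equation 8 on the example sizes above, assuming NumPy; `xcorr2d` and the random data are illustrative. The 6x4 error map is zero-padded by N-1=4, the 5x5 kernel is rotated by 180 degrees, and the cross-correlation recovers a 10x8 error matrix, the shape of the input A.

```python
import numpy as np

def xcorr2d(img, k):
    """Valid 2D cross-correlation, stride 1."""
    kh, kw = k.shape
    H, W = img.shape
    return np.array([[np.sum(img[i:i+kh, j:j+kw] * k)
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])

delta_in = np.random.rand(6, 4)      # error from the next layer, same shape as Z
W = np.random.rand(5, 5)             # the 5x5 kernel used in the forward pass

padded = np.pad(delta_in, 5 - 1)     # add N-1 = 4 rings of zeros -> 14x12
W_rot180 = np.rot90(W, 2)            # rotate the kernel by 180 degrees
delta_out = xcorr2d(padded, W_rot180)

print(delta_out.shape)               # (10, 8), the shape of the input A
```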

Restoring the Gradient Matrix When the Stride Is Not 1

Let us first compare the convolution results for stride=1 and stride=2; the difference between the two is the gray part of the intermediate result figure. If, during backpropagation, the incoming error matrix has the 2x2 shape produced by stride=2, we only need to fill in a cross of zeros to turn it into a 3x3 error matrix, and then the stride-1 algorithm can be used.

By analogy, when the stride is 3, a double-width cross of zeros must be inserted. So, when the stride of the current convolution layer is S (S>1):

Get the shape of the error matrix passed back from the next layer, say $M \times N$.
Initialize an $(M \cdot S) \times (N \cdot S)$ zero matrix.
Place the values of the first row of the incoming error matrix at positions 0, S, 2S, 3S, ... of row 0 of the zero matrix.
Then place the values of the second row at positions 0, S, 2S, 3S, ... of row S, and so on (a code sketch follows the examples below).

With stride 2, a concrete example looks like this:

$$\begin{bmatrix} \delta_{11} & 0 & \delta_{12} & 0 & \delta_{13} \\ 0 & 0 & 0 & 0 & 0 \\ \delta_{21} & 0 & \delta_{22} & 0 & \delta_{23} \end{bmatrix}$$

With stride 3, it looks like this:

$$\begin{bmatrix} \delta_{11} & 0 & 0 & \delta_{12} & 0 & 0 & \delta_{13} \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \delta_{21} & 0 & 0 & \delta_{22} & 0 & 0 & \delta_{23} \end{bmatrix}$$
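A minimal sketch of the stride restoration step, assuming NumPy; the function name `restore_stride` is illustrative, and the expanded shape here is sized to match the example matrices above (entries S cells apart, with extra zero rows/columns appended afterwards if a particular layer needs them).

```python
import numpy as np

def restore_stride(delta, S):
    """Spread delta (M x N) onto a zero matrix so entries sit S cells apart."""
    M, N = delta.shape
    expanded = np.zeros(((M - 1) * S + 1, (N - 1) * S + 1))
    expanded[::S, ::S] = delta          # rows 0, S, 2S...; columns 0, S, 2S...
    return expanded

delta = np.array([[1., 2., 3.],
                  [4., 5., 6.]])
print(restore_stride(delta, 2))         # 3x5 matrix with a cross of zeros inserted
```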

Gradient Computation with Multiple Convolution Kernels

Having multiple convolution kernels means having multiple output channels.
Forward formulas:

$$z_{111} = w_{111} \cdot a_{11} + w_{112} \cdot a_{12} + w_{121} \cdot a_{21} + w_{122} \cdot a_{22}$$
$$z_{112} = w_{111} \cdot a_{12} + w_{112} \cdot a_{13} + w_{121} \cdot a_{22} + w_{122} \cdot a_{23}$$
$$z_{121} = w_{111} \cdot a_{21} + w_{112} \cdot a_{22} + w_{121} \cdot a_{31} + w_{122} \cdot a_{32}$$
$$z_{122} = w_{111} \cdot a_{22} + w_{112} \cdot a_{23} + w_{121} \cdot a_{32} + w_{122} \cdot a_{33}$$

$$z_{211} = w_{211} \cdot a_{11} + w_{212} \cdot a_{12} + w_{221} \cdot a_{21} + w_{222} \cdot a_{22}$$
$$z_{212} = w_{211} \cdot a_{12} + w_{212} \cdot a_{13} + w_{221} \cdot a_{22} + w_{222} \cdot a_{23}$$
$$z_{221} = w_{211} \cdot a_{21} + w_{212} \cdot a_{22} + w_{221} \cdot a_{31} + w_{222} \cdot a_{32}$$
$$z_{222} = w_{211} \cdot a_{22} + w_{212} \cdot a_{23} + w_{221} \cdot a_{32} + w_{222} \cdot a_{33}$$

Gradient of $J$ with respect to $a_{22}$:

$$\frac{\partial J}{\partial a_{22}}=\frac{\partial J}{\partial Z_{1}} \frac{\partial Z_{1}}{\partial a_{22}}+\frac{\partial J}{\partial Z_{2}} \frac{\partial Z_{2}}{\partial a_{22}}$$
$$=\frac{\partial J}{\partial z_{111}} \frac{\partial z_{111}}{\partial a_{22}}+\frac{\partial J}{\partial z_{112}} \frac{\partial z_{112}}{\partial a_{22}}+\frac{\partial J}{\partial z_{121}} \frac{\partial z_{121}}{\partial a_{22}}+\frac{\partial J}{\partial z_{122}} \frac{\partial z_{122}}{\partial a_{22}}$$
$$+\frac{\partial J}{\partial z_{211}} \frac{\partial z_{211}}{\partial a_{22}}+\frac{\partial J}{\partial z_{212}} \frac{\partial z_{212}}{\partial a_{22}}+\frac{\partial J}{\partial z_{221}} \frac{\partial z_{221}}{\partial a_{22}}+\frac{\partial J}{\partial z_{222}} \frac{\partial z_{222}}{\partial a_{22}}$$
$$=(\delta_{z111} \cdot w_{122} + \delta_{z112} \cdot w_{121} + \delta_{z121} \cdot w_{112} + \delta_{z122} \cdot w_{111})$$
$$+(\delta_{z211} \cdot w_{222} + \delta_{z212} \cdot w_{221} + \delta_{z221} \cdot w_{212} + \delta_{z222} \cdot w_{211})$$
$$=\delta_{z1} * W_1^{rot180} + \delta_{z2} * W_2^{rot180}$$

Therefore, similar to Equation 8, we first zero-pad $\delta_{in}$, then convolve each channel with its corresponding rotated kernel, and finally add the results to obtain the gradient matrix that needs to be passed back:

$$\delta_{out} = \sum_m \delta_{in_m} * W^{rot180}_m \tag{9}$$
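A minimal sketch of Equation 9, assuming NumPy; `xcorr2d` and the random toy shapes are illustrative. Each output channel's padded error map is convolved with its own 180-degree-rotated kernel, and the results are summed.

```python
import numpy as np

def xcorr2d(img, k):
    """Valid 2D cross-correlation, stride 1."""
    kh, kw = k.shape
    H, W = img.shape
    return np.array([[np.sum(img[i:i+kh, j:j+kw] * k)
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])

deltas_in = np.random.rand(2, 2, 2)   # two output channels, each a 2x2 error map
kernels = np.random.rand(2, 2, 2)     # two 2x2 kernels

delta_out = sum(xcorr2d(np.pad(deltas_in[m], 1), np.rot90(kernels[m], 2))
                for m in range(2))    # pad by N-1 = 1, rotate, sum over kernels
print(delta_out.shape)                # (3, 3), the shape of the 3x3 input A
```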

Gradient Computation with Multiple Input Channels

When the input consists of multiple channels (layers), each channel must have its own corresponding filter (sub-kernel).
So the forward formulas are:

$$z_{11} = w_{111} \cdot a_{111} + w_{112} \cdot a_{112} + w_{121} \cdot a_{121} + w_{122} \cdot a_{122} + w_{211} \cdot a_{211} + w_{212} \cdot a_{212} + w_{221} \cdot a_{221} + w_{222} \cdot a_{222} \tag{10}$$
$$z_{12} = w_{111} \cdot a_{112} + w_{112} \cdot a_{113} + w_{121} \cdot a_{122} + w_{122} \cdot a_{123} + w_{211} \cdot a_{212} + w_{212} \cdot a_{213} + w_{221} \cdot a_{222} + w_{222} \cdot a_{223} \tag{11}$$
$$z_{21} = w_{111} \cdot a_{121} + w_{112} \cdot a_{122} + w_{121} \cdot a_{131} + w_{122} \cdot a_{132} + w_{211} \cdot a_{221} + w_{212} \cdot a_{222} + w_{221} \cdot a_{231} + w_{222} \cdot a_{232} \tag{12}$$
$$z_{22} = w_{111} \cdot a_{122} + w_{112} \cdot a_{123} + w_{121} \cdot a_{132} + w_{122} \cdot a_{133} + w_{211} \cdot a_{222} + w_{212} \cdot a_{223} + w_{221} \cdot a_{232} + w_{222} \cdot a_{233} \tag{13}$$

In the most complicated case, the gradient of $J$ with respect to $a_{122}$:

$$\frac{\partial J}{\partial a_{122}}=\frac{\partial J}{\partial z_{11}}\frac{\partial z_{11}}{\partial a_{122}} + \frac{\partial J}{\partial z_{12}}\frac{\partial z_{12}}{\partial a_{122}} + \frac{\partial J}{\partial z_{21}}\frac{\partial z_{21}}{\partial a_{122}} + \frac{\partial J}{\partial z_{22}}\frac{\partial z_{22}}{\partial a_{122}}$$
$$=\delta_{z11} \cdot w_{122} + \delta_{z12} \cdot w_{121} + \delta_{z21} \cdot w_{112} + \delta_{z22} \cdot w_{111}$$

Generalizing, we obtain:

$$\delta_{out1} = \delta_{in} * W_1^{rot180} \tag{14}$$

Similarly, the gradient of $J$ with respect to $a_{222}$:

$$\frac{\partial J}{\partial a_{222}}=\frac{\partial J}{\partial z_{11}}\frac{\partial z_{11}}{\partial a_{222}} + \frac{\partial J}{\partial z_{12}}\frac{\partial z_{12}}{\partial a_{222}} + \frac{\partial J}{\partial z_{21}}\frac{\partial z_{21}}{\partial a_{222}} + \frac{\partial J}{\partial z_{22}}\frac{\partial z_{22}}{\partial a_{222}}$$
$$=\delta_{z11} \cdot w_{222} + \delta_{z12} \cdot w_{221} + \delta_{z21} \cdot w_{212} + \delta_{z22} \cdot w_{211}$$

Generalizing, we obtain:

$$\delta_{out2} = \delta_{in} * W_2^{rot180} \tag{15}$$

Gradient of the Weights (Convolution Kernel)

To find the gradient of $J$ with respect to $w_{11}$, note from the forward formulas that $w_{11}$ contributes to every $z$, so:

$$\frac{\partial J}{\partial w_{11}} = \frac{\partial J}{\partial z_{11}}\frac{\partial z_{11}}{\partial w_{11}} + \frac{\partial J}{\partial z_{12}}\frac{\partial z_{12}}{\partial w_{11}} + \frac{\partial J}{\partial z_{21}}\frac{\partial z_{21}}{\partial w_{11}} + \frac{\partial J}{\partial z_{22}}\frac{\partial z_{22}}{\partial w_{11}}$$
$$=\delta_{z11} \cdot a_{11} + \delta_{z12} \cdot a_{12} + \delta_{z21} \cdot a_{21} + \delta_{z22} \cdot a_{22} \tag{16}$$

The same holds for $w_{12}$:

$$\frac{\partial J}{\partial w_{12}} = \frac{\partial J}{\partial z_{11}}\frac{\partial z_{11}}{\partial w_{12}} + \frac{\partial J}{\partial z_{12}}\frac{\partial z_{12}}{\partial w_{12}} + \frac{\partial J}{\partial z_{21}}\frac{\partial z_{21}}{\partial w_{12}} + \frac{\partial J}{\partial z_{22}}\frac{\partial z_{22}}{\partial w_{12}}$$
$$=\delta_{z11} \cdot a_{12} + \delta_{z12} \cdot a_{13} + \delta_{z21} \cdot a_{22} + \delta_{z22} \cdot a_{23} \tag{17}$$

Observing Equations 16 and 17, this is itself a standard convolution (cross-correlation) operation.
Summarized into one formula:

$$\delta_w = A * \delta_{in} \tag{18}$$
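A minimal sketch of Equation 18, assuming NumPy; `xcorr2d` and the toy values are illustrative. The kernel gradient is the cross-correlation of the layer input A with the incoming error matrix.

```python
import numpy as np

def xcorr2d(img, k):
    """Valid 2D cross-correlation, stride 1."""
    kh, kw = k.shape
    H, W = img.shape
    return np.array([[np.sum(img[i:i+kh, j:j+kw] * k)
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])

A = np.arange(1, 10, dtype=float).reshape(3, 3)   # 3x3 layer input
delta_in = np.random.rand(2, 2)                   # 2x2 error from the next layer

dW = xcorr2d(A, delta_in)                         # 2x2 kernel gradient
# dW[0, 0] == d11*a11 + d12*a12 + d21*a21 + d22*a22, matching Equation 16
print(dW)
```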

Gradient of the Bias

From forward formulas 1, 2, 3, and 4, we get:

$$\frac{\partial J}{\partial b} = \frac{\partial J}{\partial z_{11}}\frac{\partial z_{11}}{\partial b} + \frac{\partial J}{\partial z_{12}}\frac{\partial z_{12}}{\partial b} + \frac{\partial J}{\partial z_{21}}\frac{\partial z_{21}}{\partial b} + \frac{\partial J}{\partial z_{22}}\frac{\partial z_{22}}{\partial b}$$
$$=\delta_{z11} + \delta_{z12} + \delta_{z21} + \delta_{z22} \tag{19}$$

So:

$$\delta_b = \sum_{i,j} \delta_{in}(i,j) \tag{20}$$

i.e. the bias gradient is the sum of all elements of $\delta_{in}$. Each convolution kernel $W$ may contain multiple filters (also called sub-kernels), but a kernel has only one bias, no matter how many sub-kernels it has.
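A minimal sketch of Equation 20, assuming NumPy; the variable names are illustrative. The bias gradient is one scalar per kernel: the sum of all elements of the incoming error matrix.

```python
import numpy as np

delta_in = np.random.rand(2, 2)   # error map of one output channel
db = np.sum(delta_in)             # single bias gradient for this kernel
print(db)
```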

Pooling in Convolutional Neural Networks

Concept

Pooling is also known as down-sampling or sub-sampling.

There are two kinds: max pooling and mean (average) pooling.

Training the Pooling Layer

Suppose the 2x2 matrix $[[1,2],[3,4]]$ is the residual passed back from the next layer. Then:

For max pooling, each residual value is passed back to the position that held the maximum value during the forward pass, and the residuals at the other three positions of that window are 0.
For mean pooling, each residual value is spread evenly over the original 4 positions of its window.

Max Pooling

Forward formula:

$$w = \max(a, b, c, d)$$

Backward formulas (assuming $b$ is the maximum value in the input window):

$$\frac{\partial w}{\partial a} = 0, \quad \frac{\partial w}{\partial b} = 1, \quad \frac{\partial w}{\partial c} = 0, \quad \frac{\partial w}{\partial d} = 0$$

Since $a$, $c$, and $d$ make no contribution to $w$, their partial derivatives are naturally 0; only $b$ contributes, with a partial derivative of 1.

$$\delta_a = \frac{\partial J}{\partial a} = \frac{\partial J}{\partial w} \frac{\partial w}{\partial a} = 0$$

$$\delta_b = \frac{\partial J}{\partial b} = \frac{\partial J}{\partial w} \frac{\partial w}{\partial b} = \delta_w \cdot 1 = \delta_w$$

$$\delta_c = \frac{\partial J}{\partial c} = \frac{\partial J}{\partial w} \frac{\partial w}{\partial c} = 0$$

$$\delta_d = \frac{\partial J}{\partial d} = \frac{\partial J}{\partial w} \frac{\partial w}{\partial d} = 0$$
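A minimal sketch of max-pooling backpropagation over a single 2x2 window, assuming NumPy; the toy window and `delta_w` are illustrative. The incoming residual goes to the position of the forward-pass maximum; the other positions receive 0.

```python
import numpy as np

window = np.array([[1., 5.],
                   [3., 2.]])      # forward-pass input window (the maximum is 5)
delta_w = 0.7                      # residual passed back for this pooled output

grad = np.zeros_like(window)
grad[np.unravel_index(np.argmax(window), window.shape)] = delta_w
print(grad)                        # [[0. , 0.7], [0. , 0. ]]
```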

Mean Pooling

Forward formula:

$$w = \frac{1}{4}(a+b+c+d)$$

Backward formulas:

$$\frac{\partial w}{\partial a} = \frac{1}{4}, \quad \frac{\partial w}{\partial b} = \frac{1}{4}, \quad \frac{\partial w}{\partial c} = \frac{1}{4}, \quad \frac{\partial w}{\partial d} = \frac{1}{4}$$

Since $a$, $b$, $c$, and $d$ each contribute equally to $w$, every partial derivative is $\frac{1}{4}$.

$$\delta_a = \frac{\partial J}{\partial a} = \frac{\partial J}{\partial w} \frac{\partial w}{\partial a} = \frac{1}{4}\delta_w$$

$$\delta_b = \frac{\partial J}{\partial b} = \frac{\partial J}{\partial w} \frac{\partial w}{\partial b} = \frac{1}{4}\delta_w$$

$$\delta_c = \frac{\partial J}{\partial c} = \frac{\partial J}{\partial w} \frac{\partial w}{\partial c} = \frac{1}{4}\delta_w$$

$$\delta_d = \frac{\partial J}{\partial d} = \frac{\partial J}{\partial w} \frac{\partial w}{\partial d} = \frac{1}{4}\delta_w$$
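A minimal sketch of mean-pooling backpropagation over a single 2x2 window, assuming NumPy; the value of `delta_w` is illustrative. The incoming residual is spread evenly over the four input positions.

```python
import numpy as np

delta_w = 0.8                              # residual for this pooled output
grad = np.full((2, 2), delta_w / 4.0)      # each input position gets delta_w / 4
print(grad)                                # [[0.2, 0.2], [0.2, 0.2]]
```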

Pooling Summary

Whether max pooling or mean pooling is used, there are no parameters to learn. Therefore, during the training of a convolutional network, all the pooling layer has to do is pass the error terms backward; it has no weight gradients of its own to compute.

https://github.com/microsoft/ai-edu/blob/master/B-教学案例与实践/B6-神经网络基本原理简明教程/17 第八步 - 卷积神经网络.md
