Mathematical Definition of Convolution
$$h(x)=(f*g)(x) = \int_{-\infty}^{\infty} f(t)g(x-t)dt \tag{1}$$
Convolution is closely related to the Fourier transform. The fact that the product of the Fourier transforms of two functions equals the Fourier transform of their convolution simplifies many problems in Fourier analysis.
Discrete Definition
Forward Computation of Convolution in Convolutional Neural Networks
$$h(x) = (f*g)(x) = \sum^{\infty}_{t=-\infty} f(t)g(x-t) \tag{2}$$
Example of One-Dimensional Convolution

Suppose we have two dice, $f$ and $g$. How do we compute the probability that the sum of the two rolls is 4?
Case 1: $f(1)g(3)$, since $1+3=4$

Case 2: $f(2)g(2)$, since $2+2=4$

Case 3: $f(3)g(1)$, since $3+1=4$
Therefore, the probability that the two dice sum to 4 is:

$$h(4) = f(1)g(3)+f(2)g(2)+f(3)g(1)$$

$$=f(1)g(4-1) + f(2)g(4-2) + f(3)g(4-3)$$
This matches the definition of convolution; written in standard form it is exactly formula (2):

$$h(4)=(f*g)(4)=\sum _{t=1}^{3}f(t)g(4-t)$$
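As a quick check of formula (2), here is a minimal sketch (assuming NumPy is available; the variable names are ours, not from the original text) that computes the distribution of the sum of two dice with `np.convolve`:

```python
import numpy as np

# Probability distributions of the two dice: P(face = 1..6) = 1/6 each.
f = np.full(6, 1.0 / 6.0)
g = np.full(6, 1.0 / 6.0)

# np.convolve implements the discrete convolution of formula (2).
# h[k] is the probability that the two dice sum to k + 2.
h = np.convolve(f, g)

print(h[4 - 2])   # P(sum = 4) = 3/36 ≈ 0.0833, matching h(4) above
```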
Two-Dimensional Convolution with a Single Input and a Single Output

Two-dimensional convolution is generally used in image processing. When convolving over a two-dimensional image, if we abbreviate the image as $I$ and the convolution kernel as $K$, then the convolution value at pixel $(i,j)$ of the target image is:
$$h(i,j) = (I*K)(i,j)=\sum_m \sum_n I(m,n)K(i-m,j-n) \tag{3}$$
As can be seen, this is consistent with formula (2) in the one-dimensional case. By the commutativity of convolution, formula (3) can be written equivalently as:
$$h(i,j) = (I*K)(i,j)=\sum_m \sum_n I(i-m,j-n)K(m,n) \tag{4}$$
Formula (4) holds because the kernel has been flipped. In neural networks, what is usually implemented is a cross-correlation function, which is almost identical to convolution except that the kernel is not flipped:
$$h(i,j) = (I*K)(i,j)=\sum_m \sum_n I(i+m,j+n)K(m,n) \tag{5}$$
In image processing, the autocorrelation and cross-correlation functions are defined as follows:

- Autocorrelation: given a function $f(t)$, $h=f(t) \star f(-t)$, where $\star$ denotes convolution
- Cross-correlation: given two functions $f(t)$ and $g(t)$, $h=f(t) \star g(-t)$
The cross-correlation operation multiplies two sequences as one slides over the other, with neither sequence flipped. Convolution is also a sliding multiplication, but one of the sequences must be flipped before multiplying. So, mathematically speaking, what machine learning implements is cross-correlation, not convolution in its original sense. For simplicity, however, we also call formula (5) convolution; that is where the term comes from.
Conclusions:

- The convolution we implement is not convolution in its original mathematical sense but an engineering version of it, which we simply call convolution
- When implementing the convolution operation, the kernel is not flipped
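To make the distinction concrete, here is a minimal sketch (assuming NumPy; `corr2d` and `conv2d` are illustrative helpers, not functions from the text) showing that the "convolution" of formula (5) is a sliding multiply without flipping, while the mathematical convolution is the same operation after rotating the kernel 180 degrees:

```python
import numpy as np

def corr2d(image, kernel):
    """Valid cross-correlation: slide the kernel over the image without flipping it."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv2d(image, kernel):
    """Mathematical convolution = cross-correlation with a 180-degree rotated kernel."""
    return corr2d(image, np.rot90(kernel, 2))

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
print(corr2d(image, kernel))   # what a CNN layer actually computes (formula 5)
print(conv2d(image, kernel))   # convolution in the original mathematical sense
```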
Dimension-Raising Convolution with a Single Input and Multiple Outputs

The original input is a single-channel image, but we can apply multiple convolution kernels to it independently and obtain multiple feature outputs.

A 4x4 image processed by two convolution kernels in parallel yields two 2x2 output images. During training, the two kernels learn different features.
Dimension-Reducing Convolution with Multiple Inputs and a Single Output

An image is usually in color, with red, green, and blue channels. There are two options for handling it:

- Convert it to grayscale, so each pixel has only one value, and two-dimensional convolution can be used
- Use one kernel per channel, processing the red, green, and blue information separately
Clearly the second approach can learn more features from the image, which leads to three-dimensional convolution: a single kernel contains three sub-kernels of identical size, say 5x5 each, so the kernel becomes a 3x5x5 volume, hence the name three-dimensional convolution.

We can process this color image with a single convolution kernel, but that kernel must contain three filters.

Although the input image has multiple channels, i.e., is three-dimensional, after computation with the corresponding number of filters the results are summed into a single channel, i.e., two-dimensional data, hence "dimension-reducing". This certainly simplifies the computation over multi-channel data, but it also loses the color information carried by the separate channels.
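A minimal sketch of this dimension-reducing convolution (assuming NumPy; the helper names and the random toy data are ours): three input channels, one 3x5x5 kernel, and the three per-channel results summed into a single feature map.

```python
import numpy as np

def corr2d(x, k):
    """Valid cross-correlation of a single channel with a single sub-kernel."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def corr2d_multi_in(x, k, b=0.0):
    """x: (C, H, W) input; k: (C, kh, kw) kernel with one sub-kernel per channel."""
    return sum(corr2d(x[c], k[c]) for c in range(x.shape[0])) + b

rgb = np.random.rand(3, 8, 8)      # a toy 3-channel "image"
kernel = np.random.rand(3, 5, 5)   # one kernel containing three 5x5 sub-kernels
z = corr2d_multi_in(rgb, kernel)
print(z.shape)                     # (4, 4): the three channels collapse into one
```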
Backward Propagation of Convolution in Convolutional Neural Networks

Training the Convolutional Layer

As with a fully connected layer, training a convolutional layer requires the error matrix passed back from the following layer, from which we compute:

- the error term (gradient) of this layer's weight matrix
- the error matrix this layer must pass back to the preceding layer

In the discussion below, we assume that the error matrix returned from the following layer has already been obtained and has already been propagated back through the activation function.
Computing the Gradient Matrix for Backpropagation

Forward formula:

$$Z = W*A+b \tag{0}$$

Here $W$ is the convolution kernel, $*$ denotes the convolution (cross-correlation) operation, $A$ is the input of the current layer, $b$ is the bias (not drawn in the figure), and $Z$ is the output of the current layer before the activation function is applied.

Expanded element by element, this gives the following formulas:
$$z_{11} = w_{11} \cdot a_{11} + w_{12} \cdot a_{12} + w_{21} \cdot a_{21} + w_{22} \cdot a_{22} + b \tag{1}$$

$$z_{12} = w_{11} \cdot a_{12} + w_{12} \cdot a_{13} + w_{21} \cdot a_{22} + w_{22} \cdot a_{23} + b \tag{2}$$

$$z_{21} = w_{11} \cdot a_{21} + w_{12} \cdot a_{22} + w_{21} \cdot a_{31} + w_{22} \cdot a_{32} + b \tag{3}$$

$$z_{22} = w_{11} \cdot a_{22} + w_{12} \cdot a_{23} + w_{21} \cdot a_{32} + w_{22} \cdot a_{33} + b \tag{4}$$
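The four forward formulas above can be reproduced numerically with a minimal sketch (assuming NumPy and a 3x3 input with a 2x2 kernel, as in the derivation; the concrete values are ours):

```python
import numpy as np

A = np.arange(1.0, 10.0).reshape(3, 3)   # a11..a33
W = np.array([[0.1, 0.2],
              [0.3, 0.4]])               # w11, w12, w21, w22
b = 0.5

Z = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # each output element is the sum of the element-wise product of W
        # with the 2x2 window of A that it covers, plus the bias
        Z[i, j] = np.sum(A[i:i + 2, j:j + 2] * W) + b

print(Z[0, 0])   # = w11*a11 + w12*a12 + w21*a21 + w22*a22 + b, i.e. formula (1)
```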
To find the gradient of the loss function $J$ with respect to $a_{11}$:

$$\frac{\partial J}{\partial a_{11}}=\frac{\partial J}{\partial z_{11}} \frac{\partial z_{11}}{\partial a_{11}}=\delta_{z11}\cdot w_{11} \tag{5}$$

In the formula above, $\delta_{z11}$ is the gradient passed back from the rear of the network to the $z_{11}$ unit of this layer.
To find the gradient of $J$ with respect to $a_{12}$, look first at the forward formulas: $a_{12}$ contributes to both $z_{11}$ and $z_{12}$, so the two partial derivatives must be added:

$$\frac{\partial J}{\partial a_{12}}=\frac{\partial J}{\partial z_{11}} \frac{\partial z_{11}}{\partial a_{12}}+\frac{\partial J}{\partial z_{12}} \frac{\partial z_{12}}{\partial a_{12}}=\delta_{z11} \cdot w_{12}+\delta_{z12} \cdot w_{11} \tag{6}$$
The most involved case is the gradient with respect to $a_{22}$, because the forward formulas show that $a_{22}$ contributes to every output:

$$\frac{\partial J}{\partial a_{22}}=\frac{\partial J}{\partial z_{11}} \frac{\partial z_{11}}{\partial a_{22}}+\frac{\partial J}{\partial z_{12}} \frac{\partial z_{12}}{\partial a_{22}}+\frac{\partial J}{\partial z_{21}} \frac{\partial z_{21}}{\partial a_{22}}+\frac{\partial J}{\partial z_{22}} \frac{\partial z_{22}}{\partial a_{22}}$$

$$=\delta_{z11} \cdot w_{22} + \delta_{z12} \cdot w_{21} + \delta_{z21} \cdot w_{12} + \delta_{z22} \cdot w_{11} \tag{7}$$

The gradients of all the other $a$ elements follow in the same way.
Looking at the order of the $w$ terms in formula (7), it is as if the original kernel were rotated 180 degrees and then convolved with the incoming error terms, which yields the error term for every element. Formulas (5) and (6) look incomplete only because those elements sit in a corner; this is the same phenomenon as padding in the forward convolution. Therefore, applying zero padding to the incoming error matrix Delta-In and then convolving it with the 180-degree rotated kernel gives the outgoing error matrix Delta-Out.

This can finally be unified into one concise formula:

$$\delta_{out} = \delta_{in} * W^{rot180} \tag{8}$$

This error matrix can then continue to be passed back to the preceding layer.
When the weights are 3x3, $\delta_{in}$ needs padding=2, i.e., 2 rings of zeros, so that after convolving with the weights it yields a $\delta_{out}$ of the correct size.

When the weights are 5x5, $\delta_{in}$ needs padding=4, i.e., 4 rings of zeros, so that after convolving with the weights it yields a $\delta_{out}$ of the correct size.

By analogy: when the weights are NxN, $\delta_{in}$ needs padding=N-1, i.e., N-1 rings of zeros.
For example:

Forward, with stride=1: $A^{(10 \times 8)}*W^{(5 \times 5)}=Z^{(6 \times 4)}$

Backward: $\delta_z^{(6 \times 4)} + 4 \ padding = \delta_z^{(14 \times 12)}$

Then: $\delta_z^{(14 \times 12)} * W^{rot180(5 \times 5)}= \delta_a^{(10 \times 8)}$
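Here is a minimal sketch of formula (8) for stride 1 (assuming NumPy and a square KxK kernel; `backward_delta` is an illustrative helper, not code from the original repository): pad the incoming error map with K-1 rings of zeros, then cross-correlate it with the 180-degree rotated kernel.

```python
import numpy as np

def corr2d(x, k):
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def backward_delta(delta_in, W):
    K = W.shape[0]                          # assume a square K x K kernel
    padded = np.pad(delta_in, K - 1)        # add K-1 rings of zeros
    return corr2d(padded, np.rot90(W, 2))   # "full" convolution with rot180(W)

delta_z = np.random.rand(6, 4)              # error map from a 10x8 input and 5x5 kernel
W = np.random.rand(5, 5)
delta_a = backward_delta(delta_z, W)
print(delta_a.shape)                        # (10, 8), the same shape as the input A
```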
Restoring the Gradient Matrix When the Stride Is Not 1

Let us first compare the convolution results with stride=1 and stride=2; the difference between them is the gray part of the middle result image. If the error matrix passed in during backpropagation has the 2x2 shape produced with stride=2, we only need to fill in a cross of zeros, turning it into a 3x3 error matrix, and the stride-1 algorithm can then be used.

By analogy, when the stride is 3, a double-line cross must be filled in. So, when the current convolutional layer's stride is S (S>1):
- Get the shape of the error matrix returned from the following layer, say $M \times N$
- Initialize a zero matrix of size $(M \cdot S) \times (N \cdot S)$
- Place the values of the first row of the incoming error matrix into row 0 of the zero matrix at positions 0, S, 2S, 3S, ...
- Then place the values of the second row of the error matrix into row S of the zero matrix at positions 0, S, 2S, 3S, ...
- ...

With stride 2, an example looks like this:

$$\begin{bmatrix} \delta_{11} & 0 & \delta_{12} & 0 & \delta_{13} \\ 0 & 0 & 0 & 0 & 0 \\ \delta_{21} & 0 & \delta_{22} & 0 & \delta_{23} \end{bmatrix}$$

With stride 3, an example looks like this:

$$\begin{bmatrix} \delta_{11} & 0 & 0 & \delta_{12} & 0 & 0 & \delta_{13} \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \delta_{21} & 0 & 0 & \delta_{22} & 0 & 0 & \delta_{23} \end{bmatrix}$$
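A minimal sketch of this restoration step (assuming NumPy; `dilate_delta` is an illustrative name): the incoming error values are scattered onto a zero matrix at positions 0, S, 2S, ..., after which the stride-1 backward formula can be applied.

```python
import numpy as np

def dilate_delta(delta_in, stride):
    m, n = delta_in.shape
    out = np.zeros((m * stride, n * stride))   # the (M*S) x (N*S) zero matrix above;
    out[::stride, ::stride] = delta_in         # trailing all-zero rows/columns can be
    return out                                 # trimmed to match the stride-1 output size

delta = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
print(dilate_delta(delta, 2))   # reproduces the stride-2 pattern shown above
```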
Gradient Computation with Multiple Convolution Kernels

Having multiple convolution kernels means having multiple output channels.

Forward formulas:
$$z_{111} = w_{111} \cdot a_{11} + w_{112} \cdot a_{12} + w_{121} \cdot a_{21} + w_{122} \cdot a_{22}$$

$$z_{112} = w_{111} \cdot a_{12} + w_{112} \cdot a_{13} + w_{121} \cdot a_{22} + w_{122} \cdot a_{23}$$

$$z_{121} = w_{111} \cdot a_{21} + w_{112} \cdot a_{22} + w_{121} \cdot a_{31} + w_{122} \cdot a_{32}$$

$$z_{122} = w_{111} \cdot a_{22} + w_{112} \cdot a_{23} + w_{121} \cdot a_{32} + w_{122} \cdot a_{33}$$

$$z_{211} = w_{211} \cdot a_{11} + w_{212} \cdot a_{12} + w_{221} \cdot a_{21} + w_{222} \cdot a_{22}$$

$$z_{212} = w_{211} \cdot a_{12} + w_{212} \cdot a_{13} + w_{221} \cdot a_{22} + w_{222} \cdot a_{23}$$

$$z_{221} = w_{211} \cdot a_{21} + w_{212} \cdot a_{22} + w_{221} \cdot a_{31} + w_{222} \cdot a_{32}$$

$$z_{222} = w_{211} \cdot a_{22} + w_{212} \cdot a_{23} + w_{221} \cdot a_{32} + w_{222} \cdot a_{33}$$
Find the gradient of $J$ with respect to $a_{22}$:

$$\frac{\partial J}{\partial a_{22}}=\frac{\partial J}{\partial Z_{1}} \frac{\partial Z_{1}}{\partial a_{22}}+\frac{\partial J}{\partial Z_{2}} \frac{\partial Z_{2}}{\partial a_{22}}$$

$$=\frac{\partial J}{\partial z_{111}} \frac{\partial z_{111}}{\partial a_{22}}+\frac{\partial J}{\partial z_{112}} \frac{\partial z_{112}}{\partial a_{22}}+\frac{\partial J}{\partial z_{121}} \frac{\partial z_{121}}{\partial a_{22}}+\frac{\partial J}{\partial z_{122}} \frac{\partial z_{122}}{\partial a_{22}}$$

$$+\frac{\partial J}{\partial z_{211}} \frac{\partial z_{211}}{\partial a_{22}}+\frac{\partial J}{\partial z_{212}} \frac{\partial z_{212}}{\partial a_{22}}+\frac{\partial J}{\partial z_{221}} \frac{\partial z_{221}}{\partial a_{22}}+\frac{\partial J}{\partial z_{222}} \frac{\partial z_{222}}{\partial a_{22}}$$

$$=(\delta_{z111} \cdot w_{122} + \delta_{z112} \cdot w_{121} + \delta_{z121} \cdot w_{112} + \delta_{z122} \cdot w_{111})$$

$$+(\delta_{z211} \cdot w_{222} + \delta_{z212} \cdot w_{221} + \delta_{z221} \cdot w_{212} + \delta_{z222} \cdot w_{211})$$

$$=\delta_{z1} * W_1^{rot180} + \delta_{z2} * W_2^{rot180}$$

Therefore, similarly to formula (8), we first pad $\delta_{in}$, then convolve each channel with its corresponding rotated kernel, and finally add the results together to obtain the gradient matrix to be passed back:

$$\delta_{out} = \sum_m \delta_{in_m} * W^{rot180}_m \tag{9}$$
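A minimal sketch of formula (9) (assuming NumPy; the helper names are ours): each output channel's error map is padded, convolved with its own rotated kernel, and the results are summed.

```python
import numpy as np

def corr2d(x, k):
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def backward_delta_multi_kernel(delta_in, W):
    """delta_in: (M, h, w) error maps, one per kernel; W: (M, K, K) kernels."""
    K = W.shape[-1]
    return sum(corr2d(np.pad(delta_in[m], K - 1), np.rot90(W[m], 2))
               for m in range(W.shape[0]))

delta = np.random.rand(2, 2, 2)   # two 2x2 error maps (two output channels)
W = np.random.rand(2, 2, 2)       # two 2x2 kernels
print(backward_delta_multi_kernel(delta, W).shape)   # (3, 3), the input size
```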
Gradient Computation with Multiple Inputs

When the input consists of multiple channels (layers), each input channel must have its own sub-kernel.

So the forward formulas are:

$$z_{11} = w_{111} \cdot a_{111} + w_{112} \cdot a_{112} + w_{121} \cdot a_{121} + w_{122} \cdot a_{122}$$
$$+ w_{211} \cdot a_{211} + w_{212} \cdot a_{212} + w_{221} \cdot a_{221} + w_{222} \cdot a_{222} \tag{10}$$

$$z_{12} = w_{111} \cdot a_{112} + w_{112} \cdot a_{113} + w_{121} \cdot a_{122} + w_{122} \cdot a_{123}$$
$$+ w_{211} \cdot a_{212} + w_{212} \cdot a_{213} + w_{221} \cdot a_{222} + w_{222} \cdot a_{223} \tag{11}$$

$$z_{21} = w_{111} \cdot a_{121} + w_{112} \cdot a_{122} + w_{121} \cdot a_{131} + w_{122} \cdot a_{132}$$
$$+ w_{211} \cdot a_{221} + w_{212} \cdot a_{222} + w_{221} \cdot a_{231} + w_{222} \cdot a_{232} \tag{12}$$

$$z_{22} = w_{111} \cdot a_{122} + w_{112} \cdot a_{123} + w_{121} \cdot a_{132} + w_{122} \cdot a_{133}$$
$$+ w_{211} \cdot a_{222} + w_{212} \cdot a_{223} + w_{221} \cdot a_{232} + w_{222} \cdot a_{233} \tag{13}$$
For the most involved case, find the gradient of $J$ with respect to $a_{122}$:

$$\frac{\partial J}{\partial a_{122}}=\frac{\partial J}{\partial z_{11}}\frac{\partial z_{11}}{\partial a_{122}} + \frac{\partial J}{\partial z_{12}}\frac{\partial z_{12}}{\partial a_{122}} + \frac{\partial J}{\partial z_{21}}\frac{\partial z_{21}}{\partial a_{122}} + \frac{\partial J}{\partial z_{22}}\frac{\partial z_{22}}{\partial a_{122}}$$

$$=\delta_{z11} \cdot w_{122} + \delta_{z12} \cdot w_{121} + \delta_{z21} \cdot w_{112} + \delta_{z22} \cdot w_{111}$$

Generalizing, we obtain:

$$\delta_{out1} = \delta_{in} * W_1^{rot180} \tag{14}$$
Similarly, for the most involved case in the second channel, find the gradient of $J$ with respect to $a_{222}$:

$$\frac{\partial J}{\partial a_{222}}=\frac{\partial J}{\partial z_{11}}\frac{\partial z_{11}}{\partial a_{222}} + \frac{\partial J}{\partial z_{12}}\frac{\partial z_{12}}{\partial a_{222}} + \frac{\partial J}{\partial z_{21}}\frac{\partial z_{21}}{\partial a_{222}} + \frac{\partial J}{\partial z_{22}}\frac{\partial z_{22}}{\partial a_{222}}$$

$$=\delta_{z11} \cdot w_{222} + \delta_{z12} \cdot w_{221} + \delta_{z21} \cdot w_{212} + \delta_{z22} \cdot w_{211}$$

Generalizing, we obtain:

$$\delta_{out2} = \delta_{in} * W_2^{rot180} \tag{15}$$
Gradient Computation for the Weights (Convolution Kernel)

To find the gradient of $J$ with respect to $w_{11}$, we can see from the forward formulas that $w_{11}$ contributes to every $z$, so:

$$\frac{\partial J}{\partial w_{11}} = \frac{\partial J}{\partial z_{11}}\frac{\partial z_{11}}{\partial w_{11}} + \frac{\partial J}{\partial z_{12}}\frac{\partial z_{12}}{\partial w_{11}} + \frac{\partial J}{\partial z_{21}}\frac{\partial z_{21}}{\partial w_{11}} + \frac{\partial J}{\partial z_{22}}\frac{\partial z_{22}}{\partial w_{11}}$$

$$=\delta_{z11} \cdot a_{11} + \delta_{z12} \cdot a_{12} + \delta_{z21} \cdot a_{21} + \delta_{z22} \cdot a_{22} \tag{9}$$
The same holds for $w_{12}$:

$$\frac{\partial J}{\partial w_{12}} = \frac{\partial J}{\partial z_{11}}\frac{\partial z_{11}}{\partial w_{12}} + \frac{\partial J}{\partial z_{12}}\frac{\partial z_{12}}{\partial w_{12}} + \frac{\partial J}{\partial z_{21}}\frac{\partial z_{21}}{\partial w_{12}} + \frac{\partial J}{\partial z_{22}}\frac{\partial z_{22}}{\partial w_{12}}$$

$$=\delta_{z11} \cdot a_{12} + \delta_{z12} \cdot a_{13} + \delta_{z21} \cdot a_{22} + \delta_{z22} \cdot a_{23} \tag{10}$$
Looking at the two formulas above, the weight gradient computation is itself a standard convolution (cross-correlation) operation.

Summarized in one formula:

$$\delta_w = A * \delta_{in} \tag{11}$$
Gradient Computation for the Bias

From the forward formulas (1), (2), (3), and (4), we obtain:

$$\frac{\partial J}{\partial b} = \frac{\partial J}{\partial z_{11}}\frac{\partial z_{11}}{\partial b} + \frac{\partial J}{\partial z_{12}}\frac{\partial z_{12}}{\partial b} + \frac{\partial J}{\partial z_{21}}\frac{\partial z_{21}}{\partial b} + \frac{\partial J}{\partial z_{22}}\frac{\partial z_{22}}{\partial b}$$

$$=\delta_{z11} + \delta_{z12} + \delta_{z21} + \delta_{z22} \tag{12}$$

Therefore:

$$\delta_b = \delta_{in} \tag{13}$$

Each convolution kernel $W$ may have multiple filters (sub-kernels), but a kernel has only one bias, no matter how many sub-kernels it contains.
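A minimal sketch of formulas (11)-(13) (assuming NumPy and the 3x3 input / 2x2 kernel case above; variable names are ours): the kernel gradient is the cross-correlation of the layer input A with the incoming error map, and the bias gradient is the sum of all elements of the error map.

```python
import numpy as np

def corr2d(x, k):
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

A = np.random.rand(3, 3)          # layer input
delta_in = np.random.rand(2, 2)   # incoming error map, same shape as Z

dW = corr2d(A, delta_in)          # formula (11): (2, 2), same shape as the kernel W
db = np.sum(delta_in)             # formulas (12)/(13): one scalar bias gradient per kernel
print(dW.shape, db)
```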
Pooling in Convolutional Neural Networks

Concept

Pooling is also called downsampling or sub-sampling.

There are two kinds: max pooling and mean/average pooling.
Training the Pooling Layer

Suppose that in a 2x2 map, $[[1,2],[3,4]]$ is the residual (error) passed back from the following layer; then:

For max pooling, the residual is passed back to the position that held the maximum in the forward pass, and the residuals at the other three positions are 0.

For mean pooling, the residual is spread evenly over the original 4 positions.
Max Pooling

Forward formula:

$$w = max(a,b,c,d)$$

Backward formulas (assuming the maximum in the input layer is $b$):

$${\partial w \over \partial a} = 0, \quad {\partial w \over \partial b} = 1, \quad {\partial w \over \partial c} = 0, \quad {\partial w \over \partial d} = 0$$
Since $a$, $c$, and $d$ make no contribution to $w$, their partial derivatives are naturally 0; only $b$ contributes, with partial derivative 1.

$$\delta_a = {\partial J \over \partial a} = {\partial J \over \partial w} {\partial w \over \partial a} = 0$$

$$\delta_b = {\partial J \over \partial b} = {\partial J \over \partial w} {\partial w \over \partial b} = \delta_w \cdot 1 = \delta_w$$

$$\delta_c = {\partial J \over \partial c} = {\partial J \over \partial w} {\partial w \over \partial c} = 0$$

$$\delta_d = {\partial J \over \partial d} = {\partial J \over \partial w} {\partial w \over \partial d} = 0$$
Mean Pooling

Forward formula:

$$w = \frac{1}{4}(a+b+c+d)$$

Backward formulas:

$${\partial w \over \partial a} = \frac{1}{4}, \quad {\partial w \over \partial b} = \frac{1}{4}, \quad {\partial w \over \partial c} = \frac{1}{4}, \quad {\partial w \over \partial d} = \frac{1}{4}$$
Since $a$, $b$, $c$, and $d$ each contribute equally to $w$, every partial derivative is $\frac{1}{4}$.

$$\delta_a = {\partial J \over \partial a} = {\partial J \over \partial w} {\partial w \over \partial a} = \frac{1}{4}\delta_w$$

$$\delta_b = {\partial J \over \partial b} = {\partial J \over \partial w} {\partial w \over \partial b} = \frac{1}{4}\delta_w$$

$$\delta_c = {\partial J \over \partial c} = {\partial J \over \partial w} {\partial w \over \partial c} = \frac{1}{4}\delta_w$$

$$\delta_d = {\partial J \over \partial d} = {\partial J \over \partial w} {\partial w \over \partial d} = \frac{1}{4}\delta_w$$
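A minimal sketch of both backward rules for a single 2x2 pooling window (assuming NumPy; the helper names are ours): max pooling routes the whole incoming error to the position of the forward-pass maximum, and mean pooling spreads it evenly.

```python
import numpy as np

def max_pool_backward(window, delta_w):
    grad = np.zeros_like(window)
    idx = np.unravel_index(np.argmax(window), window.shape)
    grad[idx] = delta_w                  # only the position of the maximum receives the error
    return grad

def mean_pool_backward(window, delta_w):
    return np.full_like(window, delta_w / window.size)   # spread the error evenly

window = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
print(max_pool_backward(window, 0.5))    # [[0, 0], [0, 0.5]]
print(mean_pool_backward(window, 0.5))   # [[0.125, 0.125], [0.125, 0.125]]
```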
Pooling Summary

Whether max pooling or mean pooling is used, there are no parameters to learn, so during the training of a convolutional network the pooling layer only needs to pass the error terms backward; it has no gradients of its own to compute.
https://github.com/microsoft/ai-edu/blob/master/B-教学案例与实践/B6-神经网络基本原理简明教程/17 第八步 - 卷积神经网络.md