VGG-16 Formula Derivation
VGG-16 consists of 13 convolution layers, 5 pooling layers, and 3 fully connected layers. Dropout and L2 regularization are applied to the first two fully connected layers to prevent overfitting, and the network is trained by minimizing the cross-entropy loss with mini-batch gradient descent plus Momentum. The notation used below:
- $n^l$ — number of nodes (convolution kernels) in layer $l$;
- $k_{p,q}^l$ — convolution kernel connecting channel $p$ of layer $l$ to channel $q$ of layer $l-1$;
- $b_p^l$ — bias of node (channel) $p$ in layer $l$;
- $W^l$ — weights of fully connected layer $l$;
- $z^l$ — forward input of layer $l$, before the activation function;
- $a^l$ — forward output of layer $l$, after the activation function;
Forward Propagation
The convolution operation in layer $l$ (with $3\times 3$ kernels):

$$z_{p}^{l}(i,j)=\sum_{q=1}^{n^{l-1}}\sum_{u=-1}^{1}\sum_{v=-1}^{1}a_{q}^{l-1}(i-u,j-v)\,k_{p,q}^{l}(u,v)+b_{p}^{l}$$

$$a_{p}^{l}(i,j)=\mathrm{ReLU}\left( z_{p}^{l}(i,j) \right)$$
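As a sanity check, the formula above can be transcribed directly into numpy loops. This is a minimal sketch of my own, not the author's implementation; it assumes channel-first arrays and zero padding of 1 (as VGG-16 uses) so that $a^{l-1}(i-u, j-v)$ is defined at the borders.

```python
import numpy as np

def conv_forward(a_prev, k, b):
    """Direct transcription of the layer-l convolution formula.

    a_prev: (n_prev, H, W) activations a^{l-1}
    k:      (n_l, n_prev, 3, 3) kernels k^l_{p,q}, axes 2-3 index (u, v) in {-1, 0, 1}
    b:      (n_l,) biases b^l_p
    """
    n_prev, H, W = a_prev.shape
    n_l = k.shape[0]
    a_pad = np.pad(a_prev, ((0, 0), (1, 1), (1, 1)))  # zero padding of 1
    z = np.zeros((n_l, H, W))
    for p in range(n_l):
        for i in range(H):
            for j in range(W):
                s = 0.0
                for q in range(n_prev):
                    for u in (-1, 0, 1):
                        for v in (-1, 0, 1):
                            # the +1 offsets compensate for the padding
                            s += a_pad[q, i - u + 1, j - v + 1] * k[p, q, u + 1, v + 1]
                z[p, i, j] = s + b[p]
    a = np.maximum(z, 0.0)  # a^l = ReLU(z^l)
    return z, a
```

Vectorized implementations express the same sum as a matrix multiply (im2col), but the loop form matches the formula term by term.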
The max-pooling operation in layer $l$:

$$z_{p}^{l}(i,j)=\max_{u,v\in \{0,1\}} a_{p}^{l-1}(2i-u,2j-v)$$
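The $2\times 2$, stride-2 max pool can be sketched in numpy with a reshape trick. Note the indices $2i-u$, $2j-v$ assume 1-based indexing; in 0-based code the window for output $(i,j)$ is $\{2i, 2i+1\}\times\{2j, 2j+1\}$.

```python
import numpy as np

def maxpool_forward(a_prev):
    """2x2 max pooling with stride 2 over a channel-first (n, H, W) array."""
    n, H, W = a_prev.shape
    # group each 2x2 window into its own pair of axes, then take the max
    return a_prev.reshape(n, H // 2, 2, W // 2, 2).max(axis=(2, 4))
```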
After the first 18 convolution and pooling layers, a $7\times 7\times 512$ feature map is obtained. It is flattened into a 25,088-dimensional vector so it can serve as input to the fully connected layers; denote this flattening operation $F$ and its output $a^{18}$:

$$a^{18}=F\left(\left\{z_p^{18}\right\}_{p=1,2,\cdots ,512}\right)$$
The first two fully connected layers use dropout with keep probability $d$. The connectivity of the nodes in layer $l$ is represented by $r^l$, whose entries follow a Bernoulli distribution:

$$r^{l}\sim \mathrm{Bernoulli}(d)$$
The forward pass is then:

$$\begin{aligned} \tilde{a}^{l}&=r^{l}\odot a^{l} \\ z^{l+1}&=W^{l+1}\tilde{a}^{l}+b^{l+1} \\ a^{l+1}&=\mathrm{ReLU}(z^{l+1}) \end{aligned}$$
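The three equations above can be sketched as a single helper. The function name is mine, and it implements vanilla dropout without the inverted $1/d$ rescaling, exactly as the formulas state.

```python
import numpy as np

def dropout_fc_forward(a, W_next, b_next, d, rng):
    """Forward pass through a dropped-out fully connected layer."""
    r = rng.binomial(1, d, size=a.shape)  # r^l ~ Bernoulli(d), one mask entry per node
    a_tilde = r * a                       # \tilde{a}^l = r^l ⊙ a^l
    z_next = W_next @ a_tilde + b_next    # z^{l+1} = W^{l+1} \tilde{a}^l + b^{l+1}
    a_next = np.maximum(z_next, 0.0)      # a^{l+1} = ReLU(z^{l+1})
    return r, z_next, a_next
```

The mask $r^l$ must be kept for the backward pass, since it reappears in the error formula below.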
where $\odot$ denotes the Hadamard product, i.e. element-wise multiplication.
The activation function of the output layer is softmax:

$$a_{i}^{L}=\mathrm{softmax}(z_{i}^{L})=\frac{e^{z_{i}^{L}}}{\sum_{k=1}^{n^{L}}e^{z_{k}^{L}}}$$
Cross-entropy is used as the loss function:

$$L=-\sum_{i=1}^{n^{L}}y_{i}\log a_{i}^{L}$$
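A numerically stable sketch of the output layer and loss: subtracting $\max_k z_k$ before exponentiating leaves the softmax unchanged but prevents overflow.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy(a, y):
    return -np.sum(y * np.log(a))  # L = -sum_i y_i log a_i^L
```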
Backward Propagation
Introduce the intermediate variable $\delta^l$, the error of layer $l$, defined as the gradient of the loss with respect to the forward input $z^l$ of layer $l$, i.e. $\delta^l=\frac{\partial L}{\partial z^l}$.
The partial derivatives of the softmax function are computed as follows. When $i=j$:

$$\frac{\partial }{\partial z_{j}}\left( \frac{e^{z_{j}}}{\sum_{k=1}^{n}e^{z_{k}}} \right)=\frac{e^{z_{j}}\sum_{k=1}^{n}e^{z_{k}}-\left( e^{z_{j}} \right)^{2}}{\left( \sum_{k=1}^{n}e^{z_{k}} \right)^{2}}=a_{j}\left( 1-a_{j} \right)$$
When $i\ne j$:

$$\frac{\partial }{\partial z_{j}}\left( \frac{e^{z_{i}}}{\sum_{k=1}^{n}e^{z_{k}}} \right)=\frac{-e^{z_{i}}e^{z_{j}}}{\left( \sum_{k=1}^{n}e^{z_{k}} \right)^{2}}=-a_{i}a_{j}$$
The error of node $j$ in the output layer is:

$$\begin{aligned} \delta _{j}^{L}&=\frac{\partial L}{\partial z_{j}^{L}} \\ &=\sum_{i=1}^{n^{L}}\frac{\partial L}{\partial a_{i}^{L}}\frac{\partial a_{i}^{L}}{\partial z_{j}^{L}} \\ &=\frac{\partial L}{\partial a_{j}^{L}}\frac{\partial a_{j}^{L}}{\partial z_{j}^{L}}+\sum_{i\ne j}\frac{\partial L}{\partial a_{i}^{L}}\frac{\partial a_{i}^{L}}{\partial z_{j}^{L}} \\ &=-\frac{y_{j}}{a_{j}^{L}}a_{j}^{L}(1-a_{j}^{L})+\sum_{i\ne j}-\frac{y_{i}}{a_{i}^{L}}(-a_{i}^{L}a_{j}^{L}) \\ &=-y_{j}(1-a_{j}^{L})+a_{j}^{L}\sum_{i\ne j}y_{i} \\ &=a_{j}^{L}-y_{j} \end{aligned}$$

where the last step uses $\sum_{i}y_{i}=1$ for the one-hot label $y$.
In vector form, the back-propagated error of the output layer is:

$$\delta^{L}=a^{L}-y$$
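The identity $\delta^L=a^L-y$ is easy to verify numerically against a central finite difference of the loss; this check is my own illustration, not part of the original derivation.

```python
import numpy as np

def loss_fn(z, y):
    a = np.exp(z - z.max()); a = a / a.sum()  # softmax
    return -np.sum(y * np.log(a))             # cross-entropy

z = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])                 # one-hot label

a = np.exp(z - z.max()); a = a / a.sum()
delta_analytic = a - y                        # delta^L = a^L - y

# central finite difference of L with respect to each z_j
eps = 1e-6
delta_numeric = np.array([(loss_fn(z + eps * e, y) - loss_fn(z - eps * e, y)) / (2 * eps)
                          for e in np.eye(3)])
```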
The back-propagated error of node $j$ in hidden (fully connected) layer $l$ is:

$$\begin{aligned} \delta _{j}^{l}&=\frac{\partial L}{\partial z_{j}^{l}} \\ &=\sum_{i=1}^{n^{l+1}}\frac{\partial L}{\partial z_{i}^{l+1}}\frac{\partial z_{i}^{l+1}}{\partial \tilde{a}_{j}^{l}}\frac{\partial \tilde{a}_{j}^{l}}{\partial a_{j}^{l}}\frac{\partial a_{j}^{l}}{\partial z_{j}^{l}} \\ &=\sum_{i=1}^{n^{l+1}}\delta_{i}^{l+1}W_{i,j}^{l+1}r_{j}^{l}\,\mathrm{ReLU}'(z_{j}^{l}) \\ &=\left( W_{:,j}^{l+1} \right)^{T}\delta^{l+1}r_{j}^{l}\,\mathrm{ReLU}'(z_{j}^{l}) \end{aligned}$$
Therefore the back-propagated error of fully connected layer $l$ is:

$$\delta^{l}=\left( W^{l+1} \right)^{T}\delta^{l+1}\odot r^{l}\odot \mathrm{ReLU}'(z^{l})$$
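One back-propagation step through a dropped-out FC layer, as a sketch (the helper name is mine, not from a library):

```python
import numpy as np

def fc_backward_error(delta_next, W_next, r, z):
    """delta^l = (W^{l+1})^T delta^{l+1} ⊙ r^l ⊙ ReLU'(z^l)."""
    relu_grad = (z > 0).astype(float)  # ReLU'(z) is 1 for z > 0, else 0
    return (W_next.T @ delta_next) * r * relu_grad
```

Note how a dropped node ($r^l_j = 0$) receives zero error, consistent with contributing nothing on the forward pass.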
The error back-propagated from the fully connected layers to the pooling layer is:

$$\delta^{18}=F^{-1}\left( \left( W^{19} \right)^{T}\delta^{19} \right)$$

where $\delta^{18}$ is a $7\times 7\times 512$ tensor ($F^{-1}$ simply reverses the flattening $F$).
When deriving the back-propagated error of convolution layer $l$ from $\delta^{l+1}$ of pooling layer $l+1$, max pooling requires upsampling: each element in every channel of $\delta^{l+1}$ is placed at the position that held the maximum during the forward pass, and all other positions are set to zero:

$$\begin{aligned} \delta _{p}^{l}&=\frac{\partial L}{\partial z_{p}^{l}} \\ &=\frac{\partial L}{\partial a_{p}^{l}}\frac{\partial a_{p}^{l}}{\partial z_{p}^{l}} \\ &=\mathrm{upsample}(\delta _{p}^{l+1})\odot \mathrm{ReLU}'(z_{p}^{l}) \end{aligned}$$
Therefore, the formula for computing the convolution-layer error from the pooling-layer error is:

$$\delta^{l}=\mathrm{upsample}(\delta^{l+1})\odot \mathrm{ReLU}'(z^{l})$$
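A sketch of the upsample step: each pooled-layer error is routed back to the argmax position of its $2\times 2$ window, and everything else stays zero. Ties go to the first maximum, which is an assumption of this sketch.

```python
import numpy as np

def upsample_max(delta_pool, a):
    """Scatter delta^{l+1} of shape (n, H/2, W/2) back into the (n, H, W) shape
    of a^l, placing each value at the forward-pass argmax of its 2x2 window."""
    n, H, W = a.shape
    out = np.zeros_like(a)
    for p in range(n):
        for i in range(H // 2):
            for j in range(W // 2):
                win = a[p, 2*i:2*i+2, 2*j:2*j+2]
                u, v = np.unravel_index(np.argmax(win), (2, 2))
                out[p, 2*i + u, 2*j + v] = delta_pool[p, i, j]
    return out
```

The convolution-layer error is then `upsample_max(delta_pool, a) * (z > 0)`.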
To derive the back-propagated error of convolution layer $l$ from $\delta^{l+1}$ of convolution layer $l+1$, start from the forward pass:

$$z_{p}^{l+1}=\sum_{q=1}^{n^{l}}a_{q}^{l}*k_{p,q}^{l+1}+b_{p}^{l+1}$$
$$\delta _{q}^{l}=\frac{\partial L}{\partial z_{q}^{l}}=\sum_{p=1}^{n^{l+1}}\frac{\partial L}{\partial z_{p}^{l+1}}\frac{\partial z_{p}^{l+1}}{\partial a_{q}^{l}}\frac{\partial a_{q}^{l}}{\partial z_{q}^{l}}$$
$$\frac{\partial L}{\partial z_{p}^{l+1}}\frac{\partial z_{p}^{l+1}}{\partial a_{q}^{l}}=\delta _{p}^{l+1}*\mathrm{rot180}(k_{p,q}^{l+1})$$
Therefore,

$$\delta _{q}^{l}=\frac{\partial L}{\partial z_{q}^{l}}=\left[ \sum_{p=1}^{n^{l+1}}\delta _{p}^{l+1}*\mathrm{rot180}(k_{p,q}^{l+1}) \right]\odot \mathrm{ReLU}'(z_{q}^{l})$$
When layer $l$ is a pooling layer, $\mathrm{ReLU}'(z_{q}^{l})=1$.
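A sketch of the rot180 back-propagation step. Here `conv_same` is my own minimal helper implementing the same zero-padded $3\times 3$ convolution used in the forward pass, so the identity-kernel case reduces to a pass-through.

```python
import numpy as np

def conv_same(x, g):
    """(x * g)(i, j) = sum over u, v in {-1, 0, 1} of x(i-u, j-v) g(u, v),
    with zero padding of 1 ('same' output size)."""
    H, W = x.shape
    xp = np.pad(x, 1)
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for u in (-1, 0, 1):
                for v in (-1, 0, 1):
                    out[i, j] += xp[i - u + 1, j - v + 1] * g[u + 1, v + 1]
    return out

def conv_backward_error(delta_next, k_next, z):
    """delta^l_q = [sum_p delta^{l+1}_p * rot180(k^{l+1}_{p,q})] ⊙ ReLU'(z^l_q)."""
    n_next, n_l = k_next.shape[:2]
    delta = np.zeros_like(z)
    for q in range(n_l):
        acc = np.zeros(z.shape[1:])
        for p in range(n_next):
            acc += conv_same(delta_next[p], np.rot90(k_next[p, q], 2))  # rot180
        delta[q] = acc * (z[q] > 0)
    return delta
```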
Gradient Computation
Given the back-propagated error $\delta^l$ of fully connected layer $l$, the gradients $\frac{\partial L}{\partial W^l}$ and $\frac{\partial L}{\partial b^l}$ are computed as:

$$\frac{\partial L}{\partial W^{l}}=\frac{\partial L}{\partial z^{l}}\frac{\partial z^{l}}{\partial W^{l}}=\delta^{l}\left( a^{l-1} \right)^{T}$$
$$\frac{\partial L}{\partial b^{l}}=\frac{\partial L}{\partial z^{l}}\frac{\partial z^{l}}{\partial b^{l}}=\delta^{l}$$
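The two gradient formulas, spelled out on toy numbers of my own choosing:

```python
import numpy as np

delta = np.array([0.1, -0.2])       # delta^l for a 2-node layer
a_prev = np.array([1.0, 2.0, 3.0])  # a^{l-1} from a 3-node layer

dW = np.outer(delta, a_prev)  # dL/dW^l = delta^l (a^{l-1})^T, shape (2, 3)
db = delta                    # dL/db^l = delta^l
```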
After the average gradient is obtained, the L2 regularization term with coefficient $\gamma$ is added (only the weights on connections kept by dropout are regularized):

$$\frac{\partial L}{\partial W^{l}}=\frac{\partial L}{\partial W^{l}}+\gamma \left( r^{l}\left( r^{l-1} \right)^{T} \right)W^{l}$$
Given the back-propagated error $\delta^l$ of a convolution layer, the gradients $\frac{\partial L}{\partial k_{p,q}^l}$ and $\frac{\partial L}{\partial b_p^l}$ are computed as:

$$\frac{\partial L}{\partial k_{p,q}^{l}}=\frac{\partial L}{\partial z_{p}^{l}}\frac{\partial z_{p}^{l}}{\partial k_{p,q}^{l}}=\delta _{p}^{l}*a_{q}^{l-1}$$
$$\frac{\partial L}{\partial b_{p}^{l}}=\frac{\partial L}{\partial z_{p}^{l}}\frac{\partial z_{p}^{l}}{\partial b_{p}^{l}}=\sum_{i=1}^{n^{l}}\sum_{j=1}^{n^{l}}\delta _{p}^{l}(i,j)$$
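The kernel gradient is itself a correlation between the layer error and the previous activations. A direct loop sketch of my own, assuming the same zero padding of 1 as the forward pass:

```python
import numpy as np

def conv_kernel_grad(delta_p, a_prev_q):
    """dL/dk^l_{p,q}(u, v) = sum over i, j of delta^l_p(i, j) a^{l-1}_q(i-u, j-v)."""
    H, W = delta_p.shape
    ap = np.pad(a_prev_q, 1)  # same zero padding as the forward convolution
    dk = np.zeros((3, 3))
    for u in (-1, 0, 1):
        for v in (-1, 0, 1):
            for i in range(H):
                for j in range(W):
                    dk[u + 1, v + 1] += delta_p[i, j] * ap[i - u + 1, j - v + 1]
    return dk

def conv_bias_grad(delta_p):
    """dL/db^l_p = sum of delta^l_p over all spatial positions."""
    return delta_p.sum()
```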
Parameter Update
After the average gradient of a batch is obtained, mini-batch gradient descent with Momentum is used to update the parameters:
$$\begin{aligned} v_{dk}&=\beta v_{dk}+(1-\beta )\frac{\partial L}{\partial k^{l}} \\ v_{db}&=\beta v_{db}+(1-\beta )\frac{\partial L}{\partial b^{l}} \\ v_{dW}&=\beta v_{dW}+(1-\beta )\frac{\partial L}{\partial W^{l}} \end{aligned}$$
The parameters are then updated:

$$\begin{aligned} k^{l}&=k^{l}-\alpha v_{dk^{l}} \\ b^{l}&=b^{l}-\alpha v_{db^{l}} \\ W^{l}&=W^{l}-\alpha v_{dW^{l}} \end{aligned}$$
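The Momentum update can be sketched generically and applied identically to $k^l$, $b^l$, and $W^l$; the default $\alpha$ and $\beta$ values here are placeholders, not the paper's settings.

```python
import numpy as np

def momentum_step(param, v, grad, alpha=0.01, beta=0.9):
    """v <- beta v + (1 - beta) grad;  param <- param - alpha v."""
    v = beta * v + (1 - beta) * grad
    return param - alpha * v, v
```

Each parameter tensor keeps its own velocity buffer `v`, carried across batches.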
[1]. Very Deep Convolutional Networks for Large-Scale Image Recognition
[2]. Dropout: A Simple Way to Prevent Neural Networks from Overfitting
[3]. Convolutional Neural Network (CNN) Backpropagation Algorithm