VGG-16 Forward and Backward Propagation: Formula Derivation

VGG-16 has 13 convolutional layers, 5 pooling layers, and 3 fully connected layers. Dropout and L2 regularization are applied to the first two fully connected layers to prevent overfitting, and the network is trained with mini-batch gradient descent plus Momentum, using the cross-entropy as the objective loss.

  • $n^l$ — number of nodes (convolution kernels) in layer $l$;

  • $k_{p,q}^l$ — convolution kernel connecting channel $p$ of layer $l$ to channel $q$ of layer $l-1$;

  • $b_p^l$ — bias of node (channel) $p$ in layer $l$;

  • $W^l$ — weights of fully connected layer $l$;

  • $z^l$ — forward input of layer $l$ before the activation function;

  • $a^l$ — forward output of layer $l$ after the activation function.

Forward Propagation

l l l层卷积操作公式:
z p l ( i , j ) = ∑ q = 1 n l − 1 ∑ u = − 1 1 ∑ v = − 1 1 a q l − 1 ( i − u , j − v ) k p , q l ( u , v ) + b p l a p l ( i , j ) = R e L U ( z p l ( i , j ) ) z_{p}^{l}(i,j)=\sum\limits_{q=1}^{{{n}^{l-1}}}{\sum\limits_{u=-1}^{1}{\sum\limits_{v=-1}^{1}{a_{q}^{l-1}(i-u,j-v)k_{p,q}^{l}(u,v)}}}+b_{p}^{l} \\ a_{p}^{l}(i,j)=ReLU\left( z_{p}^{l}(i,j) \right) zpl(i,j)=q=1nl1u=11v=11aql1(iu,jv)kp,ql(u,v)+bplapl(i,j)=ReLU(zpl(i,j))
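A minimal NumPy sketch of this forward step, assuming VGG's $3\times3$ kernels with zero padding of 1 and a channel-first layout; the function name and shapes are illustrative. The explicit loops mirror the summation indices above; a real implementation would vectorize this.

```python
import numpy as np

def conv_forward(a_prev, k, b):
    """Convolution layer forward pass, following the formula above.

    a_prev: (C_in, H, W)        activations a^{l-1}
    k:      (C_out, C_in, 3, 3) kernels; array index u+1 for u in {-1, 0, 1}
    b:      (C_out,)            biases b^l_p
    returns (z^l, a^l)
    """
    c_in, h, w = a_prev.shape
    c_out = k.shape[0]
    # zero-pad by 1 so that a^{l-1}(i-u, j-v) is defined at the borders
    a_pad = np.pad(a_prev, ((0, 0), (1, 1), (1, 1)))
    z = np.zeros((c_out, h, w))
    for p in range(c_out):
        for i in range(h):
            for j in range(w):
                s = 0.0
                for q in range(c_in):
                    for u in (-1, 0, 1):
                        for v in (-1, 0, 1):
                            # the +1 offsets compensate for the padding
                            s += a_pad[q, i - u + 1, j - v + 1] * k[p, q, u + 1, v + 1]
                z[p, i, j] = s + b[p]
    a = np.maximum(z, 0.0)  # a^l = ReLU(z^l)
    return z, a
```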
l l l层最大池化公式:
z p l ( i , j ) = max ⁡ ( a p l − 1 ( 2 i − u , 2 j − v ) ) u , v ∈ { 0 , 1 } z_{p}^{l}(i,j)=\max \left( a_{p}^{l-1}(2i-u,2j-v) \right)u,v\in \left\{ 0,1 \right\} zpl(i,j)=max(apl1(2iu,2jv))u,v{0,1}
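A matching sketch of the pooling step: reshaping each channel into $2\times2$ blocks and taking the block maximum implements the $u,v\in\{0,1\}$ window above.

```python
import numpy as np

def maxpool_forward(a_prev):
    """2x2 max pooling with stride 2; a_prev: (C, H, W) with even H, W."""
    c, h, w = a_prev.shape
    # each output (i, j) is the max over the window {2i, 2i+1} x {2j, 2j+1}
    return a_prev.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))
```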
After the first 18 layers of convolution and pooling, a $7\times7\times512$ feature map is obtained; it must be flattened into a 25,088-dimensional vector to serve as the input to the fully connected layers. Denoting the flattening operation by $F$, its output is $a^{18}$:

$$a^{18}=F\left(\left\{z_p^{18}\right\}_{p=1,2,\cdots,512}\right)$$
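The flattening $F$ and its inverse $F^{-1}$ (needed later in backpropagation) are plain reshapes; a channel-first layout is assumed here for consistency with the other sketches.

```python
import numpy as np

z18 = np.zeros((512, 7, 7))        # the 7x7x512 feature maps, channel-first
a18 = z18.reshape(-1)              # F: flatten to a 25,088-vector
assert a18.size == 512 * 7 * 7 == 25088
delta18 = a18.reshape(512, 7, 7)   # F^{-1}: undo the flattening
```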
The first two fully connected layers use dropout with parameter $d$. The connectivity of the nodes in layer $l$ is represented by $r^l$, whose entries follow a Bernoulli distribution (each node is kept with probability $d$):

$$r^{l}\sim \mathrm{Bernoulli}(d)$$

The forward pass is:

$$\tilde{a}^{l}=r^{l}\odot a^{l}$$

$$z^{l+1}=W^{l+1}\tilde{a}^{l}+b^{l+1}$$

$$a^{l+1}=\mathrm{ReLU}(z^{l+1})$$
where $\odot$ denotes the Hadamard product, i.e., element-wise multiplication.
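A sketch of this dropout forward pass, with $d$ as the retain probability; as in the formulas above, no inverted scaling is applied (the original dropout paper instead rescales weights at test time).

```python
import numpy as np

def dropout_forward(a, d, rng):
    """a~^l = r^l ⊙ a^l with r^l ~ Bernoulli(d); returns the mask for backprop."""
    r = (rng.random(a.shape) < d).astype(a.dtype)  # r^l: 1 keeps a node, 0 drops it
    return r * a, r

# usage: a_tilde, r = dropout_forward(a, d=0.5, rng=np.random.default_rng(0))
```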

The activation function of the output layer is softmax:

$$a_{i}^{L}=\mathrm{softmax}(z_{i}^{L})=\frac{e^{z_{i}^{L}}}{\sum_{k=1}^{n^{L}}e^{z_{k}^{L}}}$$
The cross-entropy is used as the loss function:

$$L=-\sum_{i=1}^{n^{L}}y_{i}\log a_{i}^{L}$$
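A sketch of both; subtracting $\max(z)$ before exponentiating is a standard stability trick that leaves the softmax unchanged, and the small epsilon guards against $\log 0$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def cross_entropy(a, y):
    """L = -sum_i y_i log a_i^L for a one-hot label y."""
    return -np.sum(y * np.log(a + 1e-12))
```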

Backpropagation

Introduce the intermediate variable $\delta^l$, the error at layer $l$, defined as the gradient of the loss with respect to the forward input $z^l$ of layer $l$, i.e., $\delta^l=\frac{\partial L}{\partial z^l}$.

The partial derivatives of the softmax function are:

When $i=j$,

$$\frac{\partial}{\partial z_{j}}\left(\frac{e^{z_{j}}}{\sum_{k=1}^{n}e^{z_{k}}}\right)=\frac{e^{z_{j}}\sum_{k=1}^{n}e^{z_{k}}-\left(e^{z_{j}}\right)^{2}}{\left(\sum_{k=1}^{n}e^{z_{k}}\right)^{2}}=a_{j}\left(1-a_{j}\right)$$

When $i\ne j$,

$$\frac{\partial}{\partial z_{j}}\left(\frac{e^{z_{i}}}{\sum_{k=1}^{n}e^{z_{k}}}\right)=\frac{-e^{z_{i}}e^{z_{j}}}{\left(\sum_{k=1}^{n}e^{z_{k}}\right)^{2}}=-a_{i}a_{j}$$
The error at node $j$ of the output layer is:

$$\begin{aligned}\delta_{j}^{L}&=\frac{\partial L}{\partial z_{j}^{L}}\\&=\sum_{i=1}^{n^{L}}\frac{\partial L}{\partial a_{i}^{L}}\frac{\partial a_{i}^{L}}{\partial z_{j}^{L}}\\&=\frac{\partial L}{\partial a_{j}^{L}}\frac{\partial a_{j}^{L}}{\partial z_{j}^{L}}+\sum_{i\ne j}\frac{\partial L}{\partial a_{i}^{L}}\frac{\partial a_{i}^{L}}{\partial z_{j}^{L}}\\&=-\frac{y_{j}}{a_{j}^{L}}a_{j}^{L}(1-a_{j}^{L})+\sum_{i\ne j}-\frac{y_{i}}{a_{i}^{L}}(-a_{i}^{L}a_{j}^{L})\\&=-y_{j}(1-a_{j}^{L})+a_{j}^{L}\sum_{i\ne j}y_{i}\\&=a_{j}^{L}-y_{j}\end{aligned}$$

where the final step uses $\sum_{i}y_{i}=1$ for a one-hot label, so $\sum_{i\ne j}y_{i}=1-y_{j}$. In vector form, the backpropagation error of the output layer is:

$$\delta^{L}=a^{L}-y$$
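This clean result is easy to verify numerically; a small finite-difference check with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)
y = np.zeros(5)
y[2] = 1.0                                   # one-hot label

def loss(z):
    a = np.exp(z - z.max())
    a /= a.sum()
    return -np.sum(y * np.log(a))

a = np.exp(z - z.max()); a /= a.sum()
analytic = a - y                             # delta^L from the derivation
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(5)[j]) - loss(z - eps * np.eye(5)[j])) / (2 * eps)
    for j in range(5)
])
assert np.allclose(analytic, numeric, atol=1e-6)
```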
l l l个隐藏层第 j j j个节点的反向传播误差为:
δ j l = ∂ L ∂ z j l = ∑ i = 1 n l + 1 ∂ L ∂ z i l + 1 ∂ z i l + 1 ∂ a ~ j l ∂ a ~ j l ∂ a j l ∂ a j l ∂ z j l = ∑ i = 1 n l + 1 δ i l + 1 W i , j l + 1 r j l R e L U ( z j l ) ′ = ( W : , j l + 1 ) T δ l + 1 r j l R e L U ( z j l ) ′ \begin{aligned} \delta _{j}^{l}&=\frac{\partial L}{\partial z_{j}^{l}}\\ &=\sum\limits_{i=1}^{{{n}^{l+1}}}{\frac{\partial L}{\partial z_{i}^{l+1}}\frac{\partial z_{i}^{l+1}}{\partial \tilde{a}_{j}^{l}}\frac{\partial \tilde{a}_{j}^{l}}{\partial a_{j}^{l}}\frac{\partial a_{j}^{l}}{\partial z_{j}^{l}}}\\ &=\sum_{i=1}^{n^{l+1}}{\delta^{l+1}_iW^{l+1}_{i,j}r^l_jReLU(z^l_j)'}\\ &={{\left( W_{:,j}^{l+1} \right)}^{T}}{{\delta }^{l+1}}r_{j}^{l}ReLU(z_{j}^{l}{)}' \end{aligned} δjl=zjlL=i=1nl+1zil+1La~jlzil+1ajla~jlzjlajl=i=1nl+1δil+1Wi,jl+1rjlReLU(zjl)=(W:,jl+1)Tδl+1rjlReLU(zjl)
Therefore, the backpropagation error of fully connected layer $l$ is:

$$\delta^{l}=\left(W^{l+1}\right)^{T}\delta^{l+1}\odot r^{l}\odot \mathrm{ReLU}'(z^{l})$$
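A sketch of this step, with `r` being the dropout mask saved from the forward pass (all ones for a layer without dropout):

```python
import numpy as np

def fc_backward_delta(delta_next, W_next, r, z):
    """delta^l = (W^{l+1})^T delta^{l+1} ⊙ r^l ⊙ ReLU'(z^l)."""
    relu_grad = (z > 0).astype(z.dtype)   # ReLU'(z) = 1 for z > 0, else 0
    return (W_next.T @ delta_next) * r * relu_grad
```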
The backpropagation error from the fully connected layers to the pooling layer is:

$$\delta^{18}=F^{-1}\left(\left(W^{19}\right)^{T}\delta^{19}\right)$$

where $\delta^{18}$ is a $7\times7\times512$ tensor.

When deriving the backpropagation error of convolutional layer $l$ from $\delta^{l+1}$ of pooling layer $l+1$, for max pooling we upsample $\delta^{l+1}$: within each channel, every error value is placed at the position that held the window maximum during the forward pass, and all other positions are set to 0:

$$\begin{aligned}\delta_{p}^{l}&=\frac{\partial L}{\partial z_{p}^{l}}\\&=\frac{\partial L}{\partial a_{p}^{l}}\frac{\partial a_{p}^{l}}{\partial z_{p}^{l}}\\&=\mathrm{upsample}(\delta_{p}^{l+1})\odot \mathrm{ReLU}'(z_{p}^{l})\end{aligned}$$

Therefore, the formula for computing a convolutional layer's backpropagation error from the pooling-layer error is:

$$\delta^{l}=\mathrm{upsample}(\delta^{l+1})\odot \mathrm{ReLU}'(z^{l})$$
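A sketch of the upsampling for $2\times2$ max pooling: broadcasting the pooled maxima back over their windows and comparing with the input recovers the argmax positions (ties, rare with real-valued activations, would receive the error at every tied position).

```python
import numpy as np

def maxpool_upsample(delta_next, a_prev, z_next):
    """upsample(delta^{l+1}): route each pooled error back to the window argmax.

    delta_next: (C, H/2, W/2) errors at the pooling layer
    a_prev:     (C, H, W)     input to the pooling layer (a^l)
    z_next:     (C, H/2, W/2) pooled outputs from the forward pass
    Multiply the result by ReLU'(z^l) to obtain delta^l per the formula above.
    """
    z_up = z_next.repeat(2, axis=1).repeat(2, axis=2)   # maxima over 2x2 windows
    d_up = delta_next.repeat(2, axis=1).repeat(2, axis=2)
    return d_up * (a_prev == z_up)                      # nonzero only at argmaxes
```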
Deriving the backpropagation error of convolutional layer $l$ from $\delta^{l+1}$ of convolutional layer $l+1$, starting from the forward relation:

$$z_{p}^{l+1}=\sum_{q=1}^{n^{l}}a_{q}^{l}*k_{p,q}^{l+1}+b_{p}^{l+1}$$

$$\delta_{q}^{l}=\frac{\partial L}{\partial z_{q}^{l}}=\sum_{p=1}^{n^{l+1}}\frac{\partial L}{\partial z_{p}^{l+1}}\frac{\partial z_{p}^{l+1}}{\partial a_{q}^{l}}\frac{\partial a_{q}^{l}}{\partial z_{q}^{l}}$$

$$\frac{\partial L}{\partial z_{p}^{l+1}}\frac{\partial z_{p}^{l+1}}{\partial a_{q}^{l}}=\delta_{p}^{l+1}*\mathrm{rot180}(k_{p,q}^{l+1})$$

Therefore,

$$\delta_{q}^{l}=\frac{\partial L}{\partial z_{q}^{l}}=\left[\sum_{p=1}^{n^{l+1}}\delta_{p}^{l+1}*\mathrm{rot180}(k_{p,q}^{l+1})\right]\odot \mathrm{ReLU}'(z_{q}^{l})$$

When layer $l$ is a pooling layer, $\mathrm{ReLU}'(z_{q}^{l})=1$, since pooling applies no activation.
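A sketch of this step; note that convolving with $\mathrm{rot180}(k)$ is the same as cross-correlating with $k$ itself, which is what the sliding window below computes:

```python
import numpy as np

def conv_backward_delta(delta_next, k_next, z):
    """delta^l_q = [sum_p delta^{l+1}_p * rot180(k^{l+1}_{p,q})] ⊙ ReLU'(z^l_q).

    delta_next: (C_out, H, W)        errors of layer l+1
    k_next:     (C_out, C_in, 3, 3)  kernels of layer l+1
    z:          (C_in, H, W)         pre-activations of layer l
    For a pooling layer l, skip the final ReLU' factor (it is 1).
    """
    c_out, c_in = k_next.shape[:2]
    h, w = z.shape[1:]
    d_pad = np.pad(delta_next, ((0, 0), (1, 1), (1, 1)))  # 'same' padding
    delta = np.zeros_like(z)
    for q in range(c_in):
        for p in range(c_out):
            for i in range(h):
                for j in range(w):
                    # cross-correlation with k == convolution with rot180(k)
                    delta[q, i, j] += np.sum(d_pad[p, i:i + 3, j:j + 3] * k_next[p, q])
    return delta * (z > 0)  # multiply by ReLU'(z^l)
```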

Gradient Computation

Given the backpropagation error $\delta^l$ of fully connected layer $l$, compute the gradients $\frac{\partial L}{\partial W^l}$ and $\frac{\partial L}{\partial b^l}$:

$$\frac{\partial L}{\partial W^{l}}=\frac{\partial L}{\partial z^{l}}\frac{\partial z^{l}}{\partial W^{l}}=\delta^{l}\left(a^{l-1}\right)^{T}$$

$$\frac{\partial L}{\partial b^{l}}=\frac{\partial L}{\partial z^{l}}\frac{\partial z^{l}}{\partial b^{l}}=\delta^{l}$$

(For a layer whose input passed through dropout, $a^{l-1}$ here denotes the masked activation $\tilde{a}^{l-1}$.)
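Both gradients follow directly from $z^{l}=W^{l}a^{l-1}+b^{l}$; a one-line sketch:

```python
import numpy as np

def fc_grads(delta, a_prev):
    """dL/dW^l = delta^l (a^{l-1})^T (an outer product) and dL/db^l = delta^l."""
    return np.outer(delta, a_prev), delta
```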

After the average gradient is obtained, the L2 regularization term with coefficient $\gamma$ is added; since dropped connections did not participate in the forward pass, the term is masked element-wise by the dropout connectivity $r^{l}\left(r^{l-1}\right)^{T}$:

$$\frac{\partial L}{\partial W^{l}}=\frac{\partial L}{\partial W^{l}}+\gamma\left(r^{l}\left(r^{l-1}\right)^{T}\right)\odot W^{l}$$
Given the backpropagation error $\delta^l$ of a convolutional layer, compute the gradients $\frac{\partial L}{\partial k_{p,q}^l}$ and $\frac{\partial L}{\partial b_p^l}$:

$$\frac{\partial L}{\partial k_{p,q}^{l}}=\frac{\partial L}{\partial z_{p}^{l}}\frac{\partial z_{p}^{l}}{\partial k_{p,q}^{l}}=\delta_{p}^{l}*a_{q}^{l-1}$$

$$\frac{\partial L}{\partial b_{p}^{l}}=\frac{\partial L}{\partial z_{p}^{l}}\frac{\partial z_{p}^{l}}{\partial b_{p}^{l}}=\sum_{i}\sum_{j}\delta_{p}^{l}(i,j)$$

where the sum runs over all spatial positions $(i,j)$ of the error map.
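A sketch of both gradients, consistent with the padded forward convolution above: the kernel gradient slides $\delta_{p}^{l}$ over the padded $a_{q}^{l-1}$, and the bias gradient sums the error map.

```python
import numpy as np

def conv_grads(delta, a_prev):
    """dL/dk^l_{p,q} = delta^l_p * a^{l-1}_q; dL/db^l_p = sum_{i,j} delta^l_p(i,j).

    delta:  (C_out, H, W)  errors of layer l
    a_prev: (C_in, H, W)   activations of layer l-1
    returns dk: (C_out, C_in, 3, 3), db: (C_out,)
    """
    c_out, c_in = delta.shape[0], a_prev.shape[0]
    h, w = delta.shape[1:]
    a_pad = np.pad(a_prev, ((0, 0), (1, 1), (1, 1)))  # same padding as forward
    dk = np.zeros((c_out, c_in, 3, 3))
    for p in range(c_out):
        for q in range(c_in):
            for iu in range(3):          # iu = u + 1, u in {-1, 0, 1}
                for iv in range(3):
                    dk[p, q, iu, iv] = np.sum(
                        delta[p] * a_pad[q, 2 - iu:2 - iu + h, 2 - iv:2 - iv + w])
    db = delta.sum(axis=(1, 2))          # sum over all spatial positions
    return dk, db
```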

Parameter Update

After the average gradients over a batch are obtained, the parameters are updated using mini-batch gradient descent with Momentum:

$$\begin{aligned}v_{dk}&=\beta v_{dk}+(1-\beta)\frac{\partial L}{\partial k^{l}}\\v_{db}&=\beta v_{db}+(1-\beta)\frac{\partial L}{\partial b^{l}}\\v_{dW}&=\beta v_{dW}+(1-\beta)\frac{\partial L}{\partial W^{l}}\end{aligned}$$

The parameters are then updated:

$$\begin{aligned}k^{l}&=k^{l}-\alpha v_{dk^{l}}\\b^{l}&=b^{l}-\alpha v_{db^{l}}\\W^{l}&=W^{l}-\alpha v_{dW^{l}}\end{aligned}$$
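A sketch of one update step; the $\alpha$ and $\beta$ values are illustrative defaults, not prescribed by the text.

```python
import numpy as np

def momentum_step(param, grad, v, alpha=0.01, beta=0.9):
    """v = beta*v + (1-beta)*grad, then param = param - alpha*v."""
    v = beta * v + (1 - beta) * grad
    return param - alpha * v, v

# usage: W, v_dW = momentum_step(W, dW, v_dW)
```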

