Convolutional Neural Network Fundamentals 2

1. Computing the Error

Take a three-layer neural network as an example, shown in the figure below:
[Figure: a three-layer fully connected network with input nodes $x_1$, $x_2$, hidden nodes $\sigma_1$, $\sigma_2$, $\sigma_3$, and output nodes $y_1$, $y_2$]

Here the first layer is the input layer, with two nodes $x_1$ and $x_2$; the middle layer is the hidden layer, with three nodes $\sigma_1$, $\sigma_2$ and $\sigma_3$; the last layer is the output layer, with two nodes $y_1$ and $y_2$. In addition, $\omega_{(11)}^{(1)}$ denotes a weight: in its superscript $(1)$, the 1 gives the layer index; in its subscript $(11)$, the first 1 is the index of the node in the previous layer and the second 1 is the index of the node in the current layer. Finally, $b$ denotes a bias.

Specifically, the output-layer values $y_1$ and $y_2$ are computed as follows:

$$
\begin{aligned}
y_1 &= \omega_{(11)}^{(2)}\cdot \sigma_{1}\left(x_1\cdot\omega_{(11)}^{(1)}+x_2\cdot\omega_{(21)}^{(1)}+b_1^{(1)}\right) \\
    &+ \omega_{(21)}^{(2)}\cdot \sigma_{2}\left(x_1\cdot\omega_{(12)}^{(1)}+x_2\cdot\omega_{(22)}^{(1)}+b_2^{(1)}\right) \\
    &+ \omega_{(31)}^{(2)}\cdot \sigma_{3}\left(x_1\cdot\omega_{(13)}^{(1)}+x_2\cdot\omega_{(23)}^{(1)}+b_3^{(1)}\right) \\
    &+ b_1^{(2)}
\end{aligned}\tag{1}
$$

$$
\begin{aligned}
y_2 &= \omega_{(12)}^{(2)}\cdot \sigma_{1}\left(x_1\cdot\omega_{(11)}^{(1)}+x_2\cdot\omega_{(21)}^{(1)}+b_1^{(1)}\right) \\
    &+ \omega_{(22)}^{(2)}\cdot \sigma_{2}\left(x_1\cdot\omega_{(12)}^{(1)}+x_2\cdot\omega_{(22)}^{(1)}+b_2^{(1)}\right) \\
    &+ \omega_{(32)}^{(2)}\cdot \sigma_{3}\left(x_1\cdot\omega_{(13)}^{(1)}+x_2\cdot\omega_{(23)}^{(1)}+b_3^{(1)}\right) \\
    &+ b_2^{(2)}
\end{aligned}\tag{2}
$$
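To make Eqs. (1)–(2) concrete, here is a minimal NumPy sketch of the same forward pass written in matrix form. The numeric weight values and the choice of a sigmoid for the hidden activations $\sigma_i$ are assumptions for illustration only; the article does not fix them.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
# Shapes follow the 2-3-2 network above: W1 is 2x3, W2 is 3x2.
x  = np.array([0.5, 0.8])                 # inputs x1, x2
W1 = np.array([[0.1, 0.2, 0.3],           # row i, column j = w_(ij)^(1)
               [0.4, 0.5, 0.6]])
b1 = np.array([0.1, 0.1, 0.1])            # hidden-layer biases b_j^(1)
W2 = np.array([[0.7, 0.8],                # row i, column j = w_(ij)^(2)
               [0.9, 1.0],
               [1.1, 1.2]])
b2 = np.array([0.2, 0.2])                 # output-layer biases b_j^(2)

def sigma(z):
    """Assumed sigmoid activation for the hidden nodes
    (the article leaves sigma unspecified)."""
    return 1.0 / (1.0 + np.exp(-z))

a = sigma(x @ W1 + b1)                    # hidden activations a1, a2, a3
y = a @ W2 + b2                           # outputs y1, y2, matching Eqs. (1)-(2)
print(y)
```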

After $y_1$ and $y_2$ are obtained, the final outputs $O_1$ and $O_2$ are produced by the Softmax function (after Softmax, the probabilities of all output nodes sum to 1). They are computed as follows:

$$O_1=\frac{e^{y_1}}{e^{y_1}+e^{y_2}}\tag{3}$$

$$O_2=\frac{e^{y_2}}{e^{y_1}+e^{y_2}}\tag{4}$$
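A small sketch of the Softmax step in Eqs. (3)–(4). The input values are made up, and subtracting the maximum is a common numerical-stability detail not mentioned in the text (it does not change the result, since Softmax is shift-invariant).

```python
import numpy as np

def softmax(y):
    """Softmax over the output nodes, matching Eqs. (3)-(4)."""
    e = np.exp(y - np.max(y))   # max-subtraction for numerical stability
    return e / e.sum()

O = softmax(np.array([1.3, 0.4]))   # hypothetical y1, y2 values
print(O, O.sum())                   # the probabilities sum to 1
```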

Next comes the error calculation, i.e. the cross-entropy loss, which has the following two forms:

  1. For multi-class problems (Softmax output, where all output probabilities sum to 1):

$$H=-\sum_{i} O_i^\ast \log(O_i)\tag{5}$$

  2. For binary-classification problems (Sigmoid output, where the output nodes are independent of each other):

$$H=-\frac{1}{N}\sum_{i=1}^{N}\left[O_i^\ast \log(O_i)+(1-O_i^\ast)\log(1-O_i)\right]\tag{6}$$

where $O_i^\ast$ is the ground-truth label, $O_i$ is the predicted value, and $\log$ defaults to base $e$, i.e. $\ln$.
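Both forms are short to write down in code. Below is a minimal sketch using the same notation ($O^\ast$ for the true labels, $O$ for the predictions); the concrete label and prediction values are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(O_true, O_pred):
    """Multi-class form, Eq. (5): H = -sum_i O*_i * ln(O_i)."""
    return -np.sum(O_true * np.log(O_pred))

def binary_cross_entropy(O_true, O_pred):
    """Per-node binary form, Eq. (6), averaged over the N output nodes."""
    return -np.mean(O_true * np.log(O_pred) + (1 - O_true) * np.log(1 - O_pred))

# Hypothetical one-hot label and predicted probabilities, for illustration only.
O_true = np.array([1.0, 0.0])
O_pred = np.array([0.7, 0.3])
print(softmax_cross_entropy(O_true, O_pred))   # -ln(0.7) ~= 0.357
print(binary_cross_entropy(O_true, O_pred))
```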

Since this example has two output classes with Softmax outputs, its error follows the multi-class form (5) and is computed as:

$$Loss=-\left(O_1^\ast \log(O_1)+O_2^\ast \log(O_2)\right)\tag{7}$$

2. Backpropagation of the Error

Take the update of $\omega_{(11)}^{(2)}$ as an example, i.e. we need the gradient (partial derivative) of $Loss$ with respect to it. To simplify the calculation, Eq. (1) is first rewritten as Eq. (8):

$$y_1=\omega_{(11)}^{(2)}\cdot a_1+\omega_{(21)}^{(2)}\cdot a_2+\omega_{(31)}^{(2)}\cdot a_3+b_1^{(2)}\tag{8}$$
where $a_1$, $a_2$, $a_3$ denote the hidden-layer activations (the $\sigma_i(\cdot)$ terms in Eq. (1)). The gradient (partial derivative) is then computed as follows:

$$
\begin{aligned}
\frac{\partial Loss}{\partial \omega_{(11)}^{(2)}}
&= \frac{\partial Loss}{\partial y_1}\cdot \frac{\partial y_1}{\partial \omega_{(11)}^{(2)}} \\
&= \left(\frac{\partial Loss}{\partial O_1}\cdot \frac{\partial O_1}{\partial y_1}+\frac{\partial Loss}{\partial O_2}\cdot \frac{\partial O_2}{\partial y_1}\right)\cdot \frac{\partial y_1}{\partial \omega_{(11)}^{(2)}} \\
&= \left[\left(-O_1^\ast\cdot \frac{1}{O_1}\right)O_1(1-O_1)+\left(-O_2^\ast\cdot \frac{1}{O_2}\right)O_1(O_1-1)\right]\cdot a_1 \\
&= \left[-O_1^\ast\cdot \frac{1}{O_1}\cdot O_1\cdot O_2-O_2^\ast\cdot \frac{1}{O_2}\cdot\left(-O_1\cdot O_2\right)\right]\cdot a_1 \\
&= \left(O_2^\ast\cdot O_1-O_1^\ast\cdot O_2\right)\cdot a_1
\end{aligned}\tag{9}
$$
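As a sanity check on Eq. (9), the closed-form gradient can be compared against a numerical finite-difference estimate. The sketch below assumes hypothetical values for the hidden activations $a_i$, the second-layer weights and a one-hot label; the two printed numbers should agree up to floating-point error.

```python
import numpy as np

# Hypothetical values, chosen only to check Eq. (9) numerically.
a      = np.array([0.3, 0.6, 0.9])           # hidden activations a1, a2, a3
W2     = np.array([[0.7, 0.8],
                   [0.9, 1.0],
                   [1.1, 1.2]])              # second-layer weights w_(ij)^(2)
b2     = np.array([0.2, 0.2])
O_true = np.array([1.0, 0.0])                # one-hot label O1*, O2*

def loss(W):
    y = a @ W + b2
    O = np.exp(y) / np.exp(y).sum()          # softmax, Eqs. (3)-(4)
    return -np.sum(O_true * np.log(O))       # cross entropy, Eq. (7)

# Analytic gradient from Eq. (9): (O2*·O1 - O1*·O2)·a1
y = a @ W2 + b2
O = np.exp(y) / np.exp(y).sum()
analytic = (O_true[1] * O[0] - O_true[0] * O[1]) * a[0]

# Numerical gradient of the loss w.r.t. w_(11)^(2), by central differences
eps = 1e-6
Wp, Wm = W2.copy(), W2.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)

print(analytic, numeric)                     # the two values should agree
```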

3. Updating the Weights

Once the gradient is obtained, the weight can be updated as follows:

$$\omega_{(11)}^{(2)}(new)=\omega_{(11)}^{(2)}(old)-\text{learning rate}\cdot \text{gradient}\tag{10}$$

where $\omega_{(11)}^{(2)}(new)$ is the new weight value, $\omega_{(11)}^{(2)}(old)$ is the old weight value, $\text{learning rate}$ is the configured learning rate, and $\text{gradient}$ is the gradient, i.e. $\text{gradient}=\frac{\partial Loss}{\partial \omega_{(11)}^{(2)}}$.
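A one-line illustration of Eq. (10); the numeric values are made up.

```python
# Plain gradient-descent update of a single weight, Eq. (10).
learning_rate = 0.01
w11_2_old = 0.7                                   # hypothetical old w_(11)^(2)
gradient = -0.12                                  # hypothetical dLoss/dw_(11)^(2)
w11_2_new = w11_2_old - learning_rate * gradient
print(w11_2_new)                                  # 0.7012
```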

In practice it is usually impossible to load all the data into memory at once (nor is the compute sufficient), so training has to proceed in batches. The difference between batch training and training on the whole dataset at once is illustrated below:
[Figure: mini-batch training vs. whole-dataset training]
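A minimal sketch of how mini-batch training is usually organized. The `forward_backward` helper, `params`, `X` and `Y` are hypothetical placeholders (the article defines no concrete training code); only the batching and per-batch update pattern is the point here.

```python
import numpy as np

def train(X, Y, params, forward_backward, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch gradient descent: one parameter update per batch."""
    n = len(X)
    for epoch in range(epochs):
        order = np.random.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grads = forward_backward(params, X[idx], Y[idx])  # avg. batch gradient
            for k in params:                              # plain SGD step, Eq. (10)
                params[k] -= lr * grads[k]
    return params
```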

To make the network converge faster (i.e. to speed up optimization over mini-batches), several common optimizers are introduced below; a small NumPy sketch of their update rules follows the list.

  1. SGD optimizer (Stochastic Gradient Descent)
    $$\omega_{t+1}=\omega_{t}-\alpha \cdot g(\omega_t)\tag{11}$$
    where $\alpha$ is the learning rate and $g(\omega_t)$ is the gradient of the loss with respect to the parameter $\omega_t$ at step $t$. Drawbacks: this algorithm is easily disturbed by sample noise and may get stuck in a local optimum.

  2. SGD+Momentum (stochastic gradient descent with momentum)
    $$\nu_t=\eta \cdot \nu_{t-1}+\alpha \cdot g(\omega_t)\tag{12}$$
    $$\omega_{t+1}=\omega_t-\nu_t\tag{13}$$
    where $\alpha$ is the learning rate, $g(\omega_t)$ is the gradient of the loss with respect to $\omega_t$ at step $t$, and $\eta$ (typically $0.9$) is the momentum coefficient.

  3. Adagrad optimizer (adaptive learning rate)
    $$s_t=s_{t-1}+g(\omega_t)\cdot g(\omega_t)\tag{14}$$
    $$\omega_{t+1}=\omega_t-\frac{\alpha}{\sqrt{s_t+\varepsilon}}\cdot g(\omega_t)\tag{15}$$
    where $\alpha$ is the learning rate, $g(\omega_t)$ is the gradient of the loss with respect to $\omega_t$ at step $t$, and $\varepsilon$ (typically $10^{-7}$) prevents division by zero. Note that the effective learning rate may decay too quickly, which can cause training to stop before the network has converged.

  4. RMSProp optimizer (adaptive learning rate)
    $$s_t=\eta \cdot s_{t-1}+(1-\eta)\cdot g(\omega_t)\cdot g(\omega_t)\tag{16}$$
    $$\omega_{t+1}=\omega_t-\frac{\alpha}{\sqrt{s_t+\varepsilon}}\cdot g(\omega_t)\tag{17}$$
    where $\alpha$ is the learning rate, $g(\omega_t)$ is the gradient of the loss with respect to $\omega_t$ at step $t$, $\eta$ (typically $0.9$) is the momentum coefficient, and $\varepsilon$ (typically $10^{-7}$) prevents division by zero.

  5. Adam optimizer (adaptive learning rate)
    $$m_t=\beta_1 \cdot m_{t-1}+(1-\beta_1)\cdot g(\omega_t)\tag{18}$$
    $$\nu_t=\beta_2 \cdot \nu_{t-1}+(1-\beta_2)\cdot g(\omega_t)\cdot g(\omega_t)\tag{19}$$
    $$\hat{m}_t=\frac{m_t}{1-\beta_1^t}\tag{20}$$
    $$\hat{\nu}_t=\frac{\nu_t}{1-\beta_2^t}\tag{21}$$
    $$\omega_{t+1}=\omega_t-\frac{\alpha}{\sqrt{\hat{\nu}_t+\varepsilon}}\cdot \hat{m}_t\tag{22}$$
    where $\alpha$ is the learning rate, $g(\omega_t)$ is the gradient of the loss with respect to $\omega_t$ at step $t$, $\beta_1$ (typically $0.9$) and $\beta_2$ (typically $0.999$) control the decay rates, and $\varepsilon$ (typically $10^{-7}$) prevents division by zero.
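As referenced above, here is a minimal NumPy sketch of the five update rules. The shapes of the steps follow Eqs. (11)–(22); the learning-rate defaults are assumptions, while the remaining hyper-parameter defaults follow the values quoted in the text.

```python
import numpy as np

# One-step update rules for the optimizers above, applied to a parameter
# array w with gradient g.  Each function returns the new parameters plus
# any updated state it maintains.

def sgd(w, g, lr=0.01):
    """Eq. (11)."""
    return w - lr * g

def momentum(w, g, v, lr=0.01, eta=0.9):
    """Eqs. (12)-(13); v is the running velocity."""
    v = eta * v + lr * g
    return w - v, v

def adagrad(w, g, s, lr=0.01, eps=1e-7):
    """Eqs. (14)-(15); s accumulates squared gradients."""
    s = s + g * g
    return w - lr / np.sqrt(s + eps) * g, s

def rmsprop(w, g, s, lr=0.01, eta=0.9, eps=1e-7):
    """Eqs. (16)-(17); s is an exponential moving average of squared gradients."""
    s = eta * s + (1 - eta) * g * g
    return w - lr / np.sqrt(s + eps) * g, s

def adam(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-7):
    """Eqs. (18)-(22); t is the step count starting at 1 (bias correction)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr / np.sqrt(v_hat + eps) * m_hat, m, v
```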
