优化梯度下降算法

本文概述了深度学习中关键的优化问题,如归一化输入以防止梯度消失或爆炸,合理的权重初始化策略,以及常用的优化算法(如小批量梯度、指数加权平均、动量、RMSprop和Adam),还包括学习率衰减和避免局部最优解的方法。
摘要由CSDN通过智能技术生成

Optimization problem

speed up the training of your neural network

Normalizing inputs

  1. subtract mean

μ = 1 m ∑ i = 1 m x ( i ) x : = x − μ \mu =\frac{1}{m}\sum _{i=1}^{m}x^{(i)}\\ x:=x-\mu μ=m1i=1mx(i)x:=xμ

  1. normalize variance

σ 2 = 1 m ∑ i = 1 m ( x ( i ) ) 2 x / = σ \sigma ^2=\frac{1}{m}\sum_{i=1}^m(x^{(i)})^2\\ x/=\sigma σ2=m1i=1m(x(i))2x/=σ

vanishing/exploding gradients

y = w [ l ] w [ l − 1 ] . . . w [ 2 ] w [ 1 ] x w [ l ] > I → ( w [ l ] ) L → ∞ w [ l ] < I → ( w [ l ] ) L → 0 y=w^{[l]}w^{[l-1]}...w^{[2]}w^{[1]}x\\ w^{[l]}>I\rightarrow (w^{[l]})^L\rightarrow\infty \\w^{[l]}<I\rightarrow (w^{[l]})^L\rightarrow0 y=w[l]w[l1]...w[2]w[1]xw[l]>I(w[l])Lw[l]<I(w[l])L0

weight initialize

v a r ( w ) = 1 n ( l − 1 ) w [ l ] = n p . r a n d o m . r a n d n ( s h a p e ) ∗ n p . s q r t ( 1 n ( l − 1 ) ) var(w)=\frac{1}{n^{(l-1)}}\\ w^{[l]}=np.random.randn(shape)*np.sqrt(\frac{1}{n^{(l-1)}}) var(w)=n(l1)1w[l]=np.random.randn(shape)np.sqrt(n(l1)1)

gradient check

Numerical approximation

f ( θ ) = θ 3 f ′ ( θ ) = f ( θ + ε ) − f ( θ − ε ) 2 ε f(\theta)=\theta^3\\ f'(\theta)=\frac{f(\theta+\varepsilon)-f(\theta-\varepsilon)}{2\varepsilon} f(θ)=θ3f(θ)=2εf(θ+ε)f(θε)

grad check

d θ a p p r o x [ i ] = J ( θ 1 , . . . θ i + ε . . . ) − J ( θ 1 , . . . θ i − ε . . . ) 2 ε = d θ [ i ] c h e c k : ∥ d θ a p p r o x − d θ ∥ 2 ∥ d θ a p p r o x ∥ 2 + ∥ d θ ∥ 2 < 1 0 − 7 d\theta_{approx}[i]=\frac{J(\theta_1,...\theta_i+\varepsilon...)-J(\theta_1,...\theta_i-\varepsilon...)}{2\varepsilon}=d\theta[i]\\ check:\frac{\Vert d\theta_{approx}-d\theta\Vert_2}{\Vert d\theta_{approx}\Vert_2+\Vert d\theta\Vert_2}<10^{-7} dθapprox[i]=2εJ(θ1,...θi+ε...)J(θ1,...θiε...)=dθ[i]check:dθapprox2+dθ2dθapproxdθ2<107

Optimize algorithm

mini-bach gradient

[ x ( 1 ) . . . x ( m ) ] → [ x { 1 } . . . x { m / u } ] ( a n      e p o c h : F o r w a r d      p r o p      o n      x { t } : z [ l ] = w [ l ] X { t } + b [ l ] A [ l ] = g [ l ] ( z [ l ] ) J { t } = 1 1000 ∑ i = 1 l L ( y ^ ( i ) , y ( i ) ) + λ 2 ∗ s i z e ∑ l ∥ w [ l ] ∥ F 2 B a c k w a r d      p r o p [x^{(1)}...x^{(m)}]\rightarrow [x^{\{1\}}...x^{\{m/u\}}]\\ (an\;\;epoch:Forward\;\;prop\;\;on\;\;x^{\{t\}}:\\ z^{[l]}=w^{[l]}X^{\{t\}}+b^{[l]}\\ A^{[l]}=g^{[l]}(z^{[l]})\\ J^{\{t\}}=\frac{1}{1000}\sum_{i=1}^l\mathcal{L}(\hat y^{(i)},y^{(i)})+\frac{\lambda}{2*size}\sum_l\Vert w^{[l]}\Vert_F^2\\ Backward\;\;prop [x(1)...x(m)][x{1}...x{m/u}](anepoch:Forwardproponx{t}:z[l]=w[l]X{t}+b[l]A[l]=g[l](z[l])J{t}=10001i=1lL(y^(i),y(i))+2sizeλlw[l]F2Backwardprop

mini-batch size

size = m -> Batch gradient descent <- small train set (<2000)

size = 1 -> stochastic gradient descent

typical mini-batch size (62,128,256…)

exponential weighted averages

$$
v_\theta = 0\
\theta_t\rightarrow v_\theta:=\beta v_{\theta-1}+(1-\beta)\theta_\theta\

$$

Bias correction

1 1 − β → v t 1 − β t \frac{1}{1-\beta}\rightarrow\frac{v_t}{1-\beta^t} 1β11βtvt

Momentum

V d w = β V d w + ( 1 − β ) d w V d b = β V d b + ( 1 − β ) d b w : = w − α V d w V_{dw}=\beta V_{dw}+(1-\beta)dw\\ V_{db}=\beta V_{db}+(1-\beta)db\\ w:=w-\alpha V_{dw} Vdw=βVdw+(1β)dwVdb=βVdb+(1β)dbw:=wαVdw

RMSprop

S d w = β 2 S d w + ( 1 − β 2 ) d w 2 S d b = β 2 S d b + ( 1 − β 2 ) d b 2 w : = w − α d w S d w + ε S_{dw}=\beta_2 S_{dw}+(1-\beta_2)dw^2\\ S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2\\ w:=w-\alpha \frac{dw}{\sqrt S_{dw}+\varepsilon}\\ Sdw=β2Sdw+(1β2)dw2Sdb=β2Sdb+(1β2)db2w:=wαS dw+εdw

Adam algorithm

V d w = 0 , S d w = 0 V d w = β 1 V d w + ( 1 − β 1 ) d w V d b = β 1 V d b + ( 1 − β 1 ) d b S d w = β 2 S d w + ( 1 − β 2 ) d w 2 S d b = β 2 S d b + ( 1 − β 2 ) d b 2 V d w c o r r e c t = v d w 1 − β 1 t S d w c o r r e c t = s d w 1 − β 2 t W : = W − α V d w c o r r e c t S d w c o r r e c t + ε β 1 : 0.9 , β 2 : 0.999 V_{dw}=0,S_{dw}=0\\ V_{dw}=\beta_1 V_{dw}+(1-\beta_1)dw\\V_{db}=\beta_1 V_{db}+(1-\beta_1)db\\ S_{dw}=\beta_2 S_{dw}+(1-\beta_2)dw^2\\S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2\\ V_{dw}^{correct}=\frac{v_{dw}}{1-\beta_1^t}\\S_{dw}^{correct}=\frac{s_{dw}}{1-\beta_2^t}\\ W:=W-\alpha \frac{V_{dw}^{correct}}{\sqrt{S_{dw}^{correct}}+\varepsilon}\\ \beta_1:0.9,\beta_2:0.999 Vdw=0,Sdw=0Vdw=β1Vdw+(1β1)dwVdb=β1Vdb+(1β1)dbSdw=β2Sdw+(1β2)dw2Sdb=β2Sdb+(1β2)db2Vdwcorrect=1β1tvdwSdwcorrect=1β2tsdwW:=WαSdwcorrect +εVdwcorrectβ1:0.9,β2:0.999

Learning rate decay

α = 1 1 + d e c a y R a t e ∗ e p o c h N u m b e r α 0 α = k e p o c h N u m α 0 \alpha=\frac{1}{1+decayRate*epochNumber}\alpha_0\\ \alpha=\frac{k}{\sqrt{epochNum}}\alpha_0 α=1+decayRateepochNumber1α0α=epochNum kα0

Local optima

在这里插入图片描述
在这里插入图片描述

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值