Optimization problem
speed up the training of your neural network
Normalizing inputs
- subtract mean
$$
\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \\
x := x - \mu
$$
- normalize variance
$$
\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)})^2 \\
x := x / \sigma
$$
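The two normalization steps can be sketched in NumPy as follows. This is a minimal sketch, assuming the data is stored with one column per example; the function name `normalize_inputs` and the small `eps` guard against division by zero are illustrative additions, not part of the notes.

```python
import numpy as np

def normalize_inputs(X, eps=1e-8):
    """Normalize X of shape (n_features, m), one column per example."""
    mu = X.mean(axis=1, keepdims=True)           # per-feature mean
    X_centered = X - mu                          # subtract mean
    sigma2 = (X_centered ** 2).mean(axis=1, keepdims=True)  # per-feature variance
    sigma = np.sqrt(sigma2)
    # eps (an assumption) avoids dividing by zero for constant features
    return X_centered / (sigma + eps), mu, sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(2, 1000))
X_norm, mu, sigma = normalize_inputs(X_train)
```

The returned `mu` and `sigma` would normally be reused to normalize the test set, so that both sets go through the same transformation.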
vanishing/exploding gradients
$$
y = w^{[l]} w^{[l-1]} \cdots w^{[2]} w^{[1]} x \\
w^{[l]} > I \rightarrow (w^{[l]})^L \rightarrow \infty \\
w^{[l]} < I \rightarrow (w^{[l]})^L \rightarrow 0
$$
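A quick way to see the effect is to collapse every layer to the same scalar weight and drop the activations. This is a simplifying assumption for illustration only; the helper name `deep_linear_output` is hypothetical.

```python
def deep_linear_output(w, x, num_layers):
    """Apply the same scalar weight w once per layer: y = w^L * x."""
    y = x
    for _ in range(num_layers):
        y = w * y
    return y

# Weights slightly above 1 blow up, weights below 1 shrink towards 0.
exploding = deep_linear_output(1.5, 1.0, num_layers=50)
vanishing = deep_linear_output(0.5, 1.0, num_layers=50)
print(exploding, vanishing)
```

With 50 layers the first value is in the hundreds of millions and the second is below 1e-15, which is why gradients computed through such a stack become unusable.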
Weight initialization
$$
\mathrm{var}(w) = \frac{1}{n^{[l-1]}} \\
w^{[l]} = \texttt{np.random.randn(shape)} * \texttt{np.sqrt}\left(\frac{1}{n^{[l-1]}}\right)
$$
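As a runnable version of the formula above, the sketch below draws one weight matrix and scales it by the square root of 1 over the number of inputs to the layer. The function name `init_weights` and the layer sizes are assumptions made for the example.

```python
import numpy as np

def init_weights(n_prev, n_curr, rng):
    """Weights of shape (n_curr, n_prev) with variance 1 / n_prev."""
    return rng.standard_normal((n_curr, n_prev)) * np.sqrt(1.0 / n_prev)

rng = np.random.default_rng(0)
W = init_weights(n_prev=500, n_curr=400, rng=rng)
print(W.var())  # close to 1 / 500
```

For ReLU layers a variance of 2 / n_prev is a common alternative; only the scaling factor inside `np.sqrt` would change.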
gradient check
Numerical approximation
$$
f(\theta) = \theta^3 \\
f'(\theta) \approx \frac{f(\theta+\varepsilon) - f(\theta-\varepsilon)}{2\varepsilon}
$$
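The two-sided difference can be checked numerically on the same cubic. The evaluation point 1.0 and the step 0.01 are illustrative choices, not values from the notes.

```python
def f(theta):
    return theta ** 3

def two_sided_derivative(func, theta, eps=0.01):
    """Central difference: (f(theta + eps) - f(theta - eps)) / (2 * eps)."""
    return (func(theta + eps) - func(theta - eps)) / (2 * eps)

approx = two_sided_derivative(f, 1.0)
exact = 3 * 1.0 ** 2          # analytic derivative 3 * theta^2
print(approx, exact)
```

For this function the error of the two-sided estimate is on the order of eps squared, whereas a one-sided difference would be off by roughly eps times a constant.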
grad check
$$
d\theta_{approx}[i] = \frac{J(\theta_1, \dots, \theta_i+\varepsilon, \dots) - J(\theta_1, \dots, \theta_i-\varepsilon, \dots)}{2\varepsilon} \approx d\theta[i] \\
\text{check: } \frac{\Vert d\theta_{approx} - d\theta \Vert_2}{\Vert d\theta_{approx}\Vert_2 + \Vert d\theta\Vert_2} < 10^{-7}
$$
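The check can be sketched for any cost that takes a flat parameter vector. In this sketch `grad_check`, the toy cost, and the step `eps=1e-5` are assumptions; in a real network the parameters of all layers would first be reshaped into one vector.

```python
import numpy as np

def grad_check(J, grad_J, theta, eps=1e-5):
    """Relative difference between analytic and numerical gradients."""
    d_theta = grad_J(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps            # nudge only component i
        minus[i] -= eps
        d_theta_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(d_theta_approx - d_theta)
    den = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return num / den

def J(theta):                     # toy cost: sum of cubes
    return np.sum(theta ** 3)

def good_grad(theta):             # correct gradient
    return 3 * theta ** 2

def buggy_grad(theta):            # deliberately wrong gradient
    return 2 * theta ** 2

theta = np.array([1.0, -2.0, 0.5])
print(grad_check(J, good_grad, theta), grad_check(J, buggy_grad, theta))
```

A correct gradient gives a ratio far below the 1e-7 threshold, while the buggy one gives a ratio many orders of magnitude larger.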
Optimization algorithms
Mini-batch gradient descent
Split the training set into mini-batches, where u is the mini-batch size (1000 in the cost below):
$$
[x^{(1)} \dots x^{(m)}] \rightarrow [X^{\{1\}} \dots X^{\{m/u\}}]
$$
One epoch is a single pass over all mini-batches. For each mini-batch t:
$$
\text{Forward prop on } X^{\{t\}}: \\
Z^{[l]} = w^{[l]} A^{[l-1]} + b^{[l]}, \quad A^{[0]} = X^{\{t\}} \\
A^{[l]} = g^{[l]}(Z^{[l]}) \\
J^{\{t\}} = \frac{1}{1000}\sum_{i=1}^{1000} \mathcal{L}(\hat y^{(i)}, y^{(i)}) + \frac{\lambda}{2 \cdot 1000}\sum_l \Vert w^{[l]} \Vert_F^2 \\
\text{Backward prop to compute the gradients of } J^{\{t\}}
$$
mini-batch size
- size = m: batch gradient descent, suitable for a small training set (fewer than about 2000 examples)
- size = 1: stochastic gradient descent
- typical mini-batch sizes: 64, 128, 256, …
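Splitting a training set into mini-batches might look like the sketch below. Shuffling before the split is an assumption (the notes do not mention it), and `make_mini_batches` is a hypothetical helper name. Setting `batch_size` to m or to 1 recovers batch and stochastic gradient descent respectively.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size, rng):
    """Shuffle the columns of X and Y together, then cut them into batches."""
    m = X.shape[1]
    perm = rng.permutation(m)               # same shuffle for inputs and labels
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, s:s + batch_size], Y[:, s:s + batch_size])
            for s in range(0, m, batch_size)]

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 2500))
Y = rng.integers(0, 2, size=(1, 2500))
batches = make_mini_batches(X, Y, batch_size=1000, rng=rng)
print([xb.shape[1] for xb, _ in batches])   # last batch is smaller
```

One epoch then corresponds to looping once over `batches` and running forward prop, the cost, and backward prop on each pair.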
Exponentially weighted averages
$$
v_0 = 0 \\
\theta_t \rightarrow v_t := \beta v_{t-1} + (1-\beta)\theta_t
$$
Bias correction
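The average covers roughly the last 1/(1-β) values. Because it starts from zero, early values are biased towards zero; dividing by 1-β^t corrects this:
$$
v_t \rightarrow \frac{v_t}{1-\beta^t}
$$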
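The running average and its bias correction can be sketched in a few lines of plain Python. The function name `ewa` and the constant input series are illustrative assumptions, chosen so the effect of the correction is easy to see.

```python
def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average, starting from v = 0."""
    v = 0.0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        # dividing by (1 - beta^t) removes the bias from the zero start
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

series = [10.0] * 5
raw = ewa(series, bias_correction=False)
corrected = ewa(series)
print(raw[0], corrected[0])
```

Without correction the first average is about a tenth of the true value; with correction every entry matches the constant input.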
Momentum
$$
V_{dw} = \beta V_{dw} + (1-\beta)\,dw \\
V_{db} = \beta V_{db} + (1-\beta)\,db \\
w := w - \alpha V_{dw}
$$
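One momentum update for a single parameter array could be written as below. The toy objective, the learning rate 0.1, and the helper name `momentum_step` are assumptions for illustration; the bias term would be updated the same way with its own velocity.

```python
import numpy as np

def momentum_step(w, dw, v_dw, alpha=0.1, beta=0.9):
    """Update the velocity, then move w along the velocity."""
    v_dw = beta * v_dw + (1 - beta) * dw
    w = w - alpha * v_dw
    return w, v_dw

# Toy objective J(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v_dw = np.zeros_like(w)
for _ in range(200):
    w, v_dw = momentum_step(w, dw=w, v_dw=v_dw)
print(np.linalg.norm(w))  # close to 0, the minimum of the toy objective
```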
RMSprop
$$
S_{dw} = \beta_2 S_{dw} + (1-\beta_2)\,dw^2 \\
S_{db} = \beta_2 S_{db} + (1-\beta_2)\,db^2 \\
w := w - \alpha \frac{dw}{\sqrt{S_{dw}} + \varepsilon}
$$
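A single RMSprop step, sketched under the assumption that the square is taken element-wise. The defaults for `alpha`, `beta2`, and `eps` and the name `rmsprop_step` are illustrative choices.

```python
import numpy as np

def rmsprop_step(w, dw, s_dw, alpha=0.01, beta2=0.999, eps=1e-8):
    """Scale each gradient component by a running RMS of its history."""
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2   # element-wise square
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)
    return w, s_dw

# Two coordinates whose gradients differ by a factor of 100.
w0 = np.zeros(2)
dw = np.array([0.1, 10.0])
w1, s1 = rmsprop_step(w0, dw, s_dw=np.zeros(2))
print(w0 - w1)  # both coordinates move by nearly the same amount
```

The division damps directions with large gradients and boosts directions with small ones, which is what allows a larger learning rate without oscillation.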
Adam algorithm
$$
V_{dw} = 0, \quad S_{dw} = 0 \\
V_{dw} = \beta_1 V_{dw} + (1-\beta_1)\,dw \\
V_{db} = \beta_1 V_{db} + (1-\beta_1)\,db \\
S_{dw} = \beta_2 S_{dw} + (1-\beta_2)\,dw^2 \\
S_{db} = \beta_2 S_{db} + (1-\beta_2)\,db^2 \\
V_{dw}^{correct} = \frac{V_{dw}}{1-\beta_1^t} \\
S_{dw}^{correct} = \frac{S_{dw}}{1-\beta_2^t} \\
w := w - \alpha \frac{V_{dw}^{correct}}{\sqrt{S_{dw}^{correct}} + \varepsilon} \\
\beta_1: 0.9, \quad \beta_2: 0.999
$$
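Putting the momentum term, the RMSprop term, and bias correction together gives the following sketch of one Adam step. The values of `alpha` and `eps`, the toy objective, and the name `adam_step` are assumptions; the beta defaults follow the values listed above.

```python
import numpy as np

def adam_step(w, dw, v_dw, s_dw, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    v_dw = beta1 * v_dw + (1 - beta1) * dw          # momentum term
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2     # RMSprop term
    v_hat = v_dw / (1 - beta1 ** t)                 # bias-corrected
    s_hat = s_dw / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v_dw, s_dw

# Toy objective J(w) = 0.5 * ||w||^2, whose gradient is w itself.
w_start = np.array([1.0, -2.0])
w = w_start.copy()
v_dw, s_dw = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 101):
    w, v_dw, s_dw = adam_step(w, w, v_dw, s_dw, t, alpha=0.01)
print(np.linalg.norm(w_start), np.linalg.norm(w))
```

On the very first step the corrected ratio is close to the sign of the gradient, so each coordinate moves by about `alpha` regardless of the gradient's scale.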
Learning rate decay
$$
\alpha = \frac{1}{1 + \text{decayRate} \cdot \text{epochNum}}\,\alpha_0 \\
\alpha = \frac{k}{\sqrt{\text{epochNum}}}\,\alpha_0
$$
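Both schedules are one-liners in code. The function names and the sample values (initial rate 0.2, decay rate 1) are illustrative assumptions.

```python
import math

def inverse_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

def sqrt_decay(alpha0, k, epoch_num):
    """alpha = k / sqrt(epoch_num) * alpha0, for epoch_num >= 1"""
    return k / math.sqrt(epoch_num) * alpha0

# Learning rate over the first four epochs with alpha0 = 0.2, decay_rate = 1.
schedule = [inverse_decay(0.2, 1.0, epoch) for epoch in range(1, 5)]
print(schedule)
```

With these sample values the rate halves after the first epoch and keeps shrinking more slowly afterwards.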
Local optima