- Batch vs. mini-batch gradient descent
(1) Split the training set into 5000 mini-batches of 1000 examples each.
for t = 1, …, 5000:
Forward prop on $X^{\{t\}}$:
$Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$
$A^{[1]} = g^{[1]}(Z^{[1]})$
…
$A^{[L]} = g^{[L]}(Z^{[L]})$
(2) Compute cost:
$J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2 \cdot 1000} \sum_{l} \left\lVert W^{[l]} \right\rVert_F^2$
(3) Backprop to compute the gradients of $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$), then update:
$W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$
$b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$
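As a concrete sketch of this loop, the code below runs mini-batch gradient descent for a single-layer (logistic-regression) model in NumPy; the L-layer forward and backward passes are collapsed into their one-layer versions so the example stays self-contained, and the function and variable names are illustrative rather than from the course.

```python
import numpy as np

def minibatch_gradient_descent(X, Y, batch_size=1000, alpha=0.01, epochs=10):
    """Mini-batch gradient descent for logistic regression (illustrative sketch).

    X: shape (n_features, m); Y: shape (1, m) with 0/1 labels.
    """
    n, m = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    costs = []
    for epoch in range(epochs):
        # Shuffle once per epoch so the mini-batches differ across epochs.
        perm = np.random.permutation(m)
        X_shuf, Y_shuf = X[:, perm], Y[:, perm]
        for t in range(0, m, batch_size):
            X_t = X_shuf[:, t:t + batch_size]   # X^{t}
            Y_t = Y_shuf[:, t:t + batch_size]   # Y^{t}
            mb = X_t.shape[1]
            # Forward prop on the mini-batch (single sigmoid layer).
            Z = w.T @ X_t + b
            A = 1.0 / (1.0 + np.exp(-Z))
            # Cost J^{t}, averaged over the examples in this mini-batch.
            cost = -np.mean(Y_t * np.log(A + 1e-8) + (1 - Y_t) * np.log(1 - A + 1e-8))
            costs.append(cost)
            # Backprop: gradients of J^{t} with respect to w and b.
            dZ = A - Y_t
            dw = X_t @ dZ.T / mb
            db = np.sum(dZ) / mb
            # Gradient-descent update.
            w -= alpha * dw
            b -= alpha * db
    return w, b, costs

# Example usage (hypothetical data); with m = 5,000,000 and batch_size = 1000
# the inner loop runs over 5000 mini-batches per epoch:
# X = np.random.randn(10, 5_000_000)
# Y = (np.random.rand(1, 5_000_000) > 0.5).astype(float)
# w, b, costs = minibatch_gradient_descent(X, Y, epochs=1)
```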
- Choosing mini-batch size
(1) If mini-batch size = m (the size of the training set): batch gradient descent. (If the training set is large, each iteration takes a long time.)
(2) If mini-batch size = 1: stochastic gradient descent; every example is its own mini-batch. (The updates are noisy, and the parameters end up oscillating around the minimum rather than converging to it.)
(3) Choose something in between (a mini-batch size that is neither too big nor too small).
- Some guidelines for choosing your mini-batch size:
(1) If the training set is small (m ≤ 2000): use batch gradient descent.
(2) Typical mini-batch sizes: 64, 128, 256, 512. (Sizes that are powers of 2, i.e. $2^n$, reportedly make the code run faster.)
(3) Make sure each mini-batch $X^{\{t\}}, Y^{\{t\}}$ fits in CPU/GPU memory.
- Exponentially weighted moving averages
$V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$
$\beta = 0.9$: averages over roughly $\frac{1}{1-\beta} = 10$ days' temperature.
$\beta = 0.98$: averages over roughly $\frac{1}{1-\beta} = 50$ days' temperature.
- Bias correction in exponentially weighted averages
$V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$
Bias-corrected estimate: $\dfrac{V_t}{1 - \beta^t}$
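A small NumPy sketch of the exponentially weighted average and its bias-corrected estimate $V_t / (1 - \beta^t)$; the daily-temperature series at the bottom is made up purely for illustration.

```python
import numpy as np

def ewma(theta, beta=0.9, bias_correct=True):
    """Exponentially weighted moving average of a 1-D sequence theta.

    With beta = 0.9 this averages over roughly 1/(1-beta) = 10 values;
    with beta = 0.98, roughly 50 values.
    """
    v = 0.0
    out = []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x                            # V_t = beta*V_{t-1} + (1-beta)*theta_t
        out.append(v / (1 - beta ** t) if bias_correct else v)   # bias correction: V_t / (1 - beta^t)
    return np.array(out)

# Example: smooth a noisy (synthetic) daily-temperature series.
temps = 20 + 5 * np.sin(np.linspace(0, 6, 100)) + np.random.randn(100)
smoothed = ewma(temps, beta=0.9)
```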
- Gradient descent with momentum:
(1) Compute $dW, db$ on the current mini-batch.
$V_{dW} = \beta V_{dW} + (1 - \beta)\,dW$
$V_{db} = \beta V_{db} + (1 - \beta)\,db$
(the same recurrence $V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$, applied to the gradients)
(2) Update $W, b$:
$W = W - \alpha V_{dW}$
$b = b - \alpha V_{db}$
This damps the oscillations of gradient descent in the vertical direction while letting it move faster in the horizontal direction, so it reaches the minimum more quickly.
(3) Implementation details:
$V_{dW} = 0,\; V_{db} = 0$
On iteration t:
Compute $dW, db$ on the current mini-batch.
$V_{dW} = \beta V_{dW} + (1 - \beta)\,dW$
$V_{db} = \beta V_{db} + (1 - \beta)\,db$
$W = W - \alpha V_{dW},\quad b = b - \alpha V_{db}$
Hyperparameters: $\alpha$, $\beta$; $\beta = 0.9$ is a common default.
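A minimal sketch of one momentum step as a NumPy function, assuming dW and db come from backprop on the current mini-batch and that vdW and vdb were initialized to zeros with the same shapes as W and b.

```python
import numpy as np

def momentum_step(W, b, dW, db, vdW, vdb, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update."""
    vdW = beta * vdW + (1 - beta) * dW   # V_dW = beta*V_dW + (1-beta)*dW
    vdb = beta * vdb + (1 - beta) * db   # V_db = beta*V_db + (1-beta)*db
    W = W - alpha * vdW                  # W = W - alpha*V_dW
    b = b - alpha * vdb                  # b = b - alpha*V_db
    return W, b, vdW, vdb
```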
(4) RMSprop (root mean square prop)
On iteration t:
Compute $dW, db$ on the current mini-batch.
$S_{dW} = \beta S_{dW} + (1 - \beta)\,(dW)^2$ (the square $(dW)^2$ is element-wise)
$S_{db} = \beta S_{db} + (1 - \beta)\,(db)^2$
Update:
$W = W - \alpha \dfrac{dW}{\sqrt{S_{dW}} + \varepsilon}$
$b = b - \alpha \dfrac{db}{\sqrt{S_{db}} + \varepsilon}$
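The same kind of sketch for one RMSprop step; sdW and sdb are assumed to start at zeros, and eps keeps the denominator from blowing up when the second-moment estimate is tiny.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, sdW, sdb, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update; the squares are element-wise."""
    sdW = beta * sdW + (1 - beta) * dW ** 2       # S_dW = beta*S_dW + (1-beta)*dW^2
    sdb = beta * sdb + (1 - beta) * db ** 2       # S_db = beta*S_db + (1-beta)*db^2
    W = W - alpha * dW / (np.sqrt(sdW) + eps)     # divide by sqrt(S_dW) + eps
    b = b - alpha * db / (np.sqrt(sdb) + eps)
    return W, b, sdW, sdb
```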
(5) Adam optimization algorithm
$V_{dW} = 0,\; S_{dW} = 0,\; V_{db} = 0,\; S_{db} = 0$
On iteration t:
Compute $dW, db$ using the current mini-batch (the mini-batch gradient).
“momentum”:
$V_{dW} = \beta_1 V_{dW} + (1 - \beta_1)\,dW,\quad V_{db} = \beta_1 V_{db} + (1 - \beta_1)\,db$
“RMSprop”:
$S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\,(dW)^2,\quad S_{db} = \beta_2 S_{db} + (1 - \beta_2)\,(db)^2$
Bias-corrected:
$V_{dW}^{\text{corrected}} = \dfrac{V_{dW}}{1 - \beta_1^t},\quad V_{db}^{\text{corrected}} = \dfrac{V_{db}}{1 - \beta_1^t}$
$S_{dW}^{\text{corrected}} = \dfrac{S_{dW}}{1 - \beta_2^t},\quad S_{db}^{\text{corrected}} = \dfrac{S_{db}}{1 - \beta_2^t}$
$W = W - \alpha \dfrac{V_{dW}^{\text{corrected}}}{\sqrt{S_{dW}^{\text{corrected}}} + \varepsilon}$
$b = b - \alpha \dfrac{V_{db}^{\text{corrected}}}{\sqrt{S_{db}^{\text{corrected}}} + \varepsilon}$
Hyperparameter choices:
$\alpha$: needs to be tuned
$\beta_1$: 0.9 (for the moving average of $dW$)
$\beta_2$: 0.999 (for the moving average of $(dW)^2$, element-wise)
$\varepsilon$: $10^{-8}$
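Putting the pieces together, a sketch of one Adam step combining the momentum and RMSprop moments with bias correction; t is the iteration counter starting at 1, and all moment buffers start at zeros.

```python
import numpy as np

def adam_step(W, b, dW, db, vdW, vdb, sdW, sdb, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on iteration t (t >= 1)."""
    # "Momentum": first-moment estimates.
    vdW = beta1 * vdW + (1 - beta1) * dW
    vdb = beta1 * vdb + (1 - beta1) * db
    # "RMSprop": second-moment estimates (element-wise squares).
    sdW = beta2 * sdW + (1 - beta2) * dW ** 2
    sdb = beta2 * sdb + (1 - beta2) * db ** 2
    # Bias-corrected estimates.
    vdW_c, vdb_c = vdW / (1 - beta1 ** t), vdb / (1 - beta1 ** t)
    sdW_c, sdb_c = sdW / (1 - beta2 ** t), sdb / (1 - beta2 ** t)
    # Parameter update.
    W = W - alpha * vdW_c / (np.sqrt(sdW_c) + eps)
    b = b - alpha * vdb_c / (np.sqrt(sdb_c) + eps)
    return W, b, vdW, vdb, sdW, sdb
```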
- Learning rate decay
epoch: one full pass over the training set
$\alpha = \dfrac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \cdot \alpha_0$
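A one-line sketch of this decay schedule; the numbers in the comment are just an illustrative choice of $\alpha_0$ and decay rate.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# Example: alpha0 = 0.2, decay_rate = 1.0 -> 0.1, 0.067, 0.05, ... over epochs 1, 2, 3, ...
```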
- Local optima in neural networks