Preface
Personal study notes.
1. Local Minima and Saddle Point
Determining whether a critical point is a local minimum or a saddle point
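At a critical point, where the gradient is (close to) zero, the eigenvalues of the Hessian distinguish the two cases: all positive eigenvalues mean a local minimum, while mixed positive and negative eigenvalues mean a saddle point. Below is a minimal sketch of this check on a toy loss; the function, the finite-difference Hessian, and the step size are my own illustrative choices.

```python
import numpy as np

# Toy loss: L(w1, w2) = w1^2 - w2^2 has a critical point at the origin.
def loss(w):
    return w[0] ** 2 - w[1] ** 2

def numerical_hessian(f, w, eps=1e-3):
    """Finite-difference Hessian of f at point w."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            w_pp = w.copy(); w_pp[i] += eps; w_pp[j] += eps
            w_pm = w.copy(); w_pm[i] += eps; w_pm[j] -= eps
            w_mp = w.copy(); w_mp[i] -= eps; w_mp[j] += eps
            w_mm = w.copy(); w_mm[i] -= eps; w_mm[j] -= eps
            H[i, j] = (f(w_pp) - f(w_pm) - f(w_mp) + f(w_mm)) / (4 * eps ** 2)
    return H

w = np.zeros(2)                                   # gradient is zero here
eigvals = np.linalg.eigvalsh(numerical_hessian(loss, w))
print(eigvals)                                    # ~[-2, 2]
if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
else:
    print("saddle point")                         # mixed signs -> saddle point
```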
2. Batch and Momentum
2.1 Why use batch
2.1.1 Small batch vs Large batch
(1) A large batch sometimes does not need more time to compute its gradient, because the GPU processes the examples in a batch in parallel (unless the batch size is too large).
(2) A small batch sometimes needs more time for one epoch (more time to see all the data once), as shown below.
(3) Yet sometimes the opposite holds: a small batch can optimize better on the training data.
ONE OF THE REASONS:
With a full batch, once the loss reaches a local minimum or a saddle point the update stops and the parameters can no longer move. With small batches, every batch is different: if the current batch gets stuck, the next batch may still be able to keep moving, as sketched below.
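A tiny illustration of this reason, with made-up data and a one-parameter model: at the minimum of the full-batch loss, the full-batch gradient is exactly zero, but the gradient computed on each single example is not, so a small-batch update can still move the parameter.

```python
import numpy as np

# Two training examples for a 1-parameter model y = w * x with squared loss.
xs = np.array([1.0, 1.0])
ys = np.array([1.0, 3.0])

def grad(w, x, y):
    # d/dw (w*x - y)^2 = 2 * (w*x - y) * x
    return 2 * (w * x - y) * x

w = 2.0                                   # minimum of the averaged (full-batch) loss
print(np.mean(grad(w, xs, ys)))           # 0.0  -> full batch is stuck here
print(grad(w, xs[0], ys[0]))              # 2.0  -> batch {example 0} still moves
print(grad(w, xs[1], ys[1]))              # -2.0 -> batch {example 1} still moves
```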
2.1.2 Sometimes small batch is better on testing data
ONE OF THE REASONS:
Sharp minima are likely to trap a large batch but not a small batch (some people believe this explanation, some do not).
2.1.3 A short summary
2.2 What is momentum
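The momentum update keeps a movement vector m that accumulates past gradients instead of following only the current one: m_t = λ·m_{t-1} − η·g_{t-1} and θ_t = θ_{t-1} + m_t. Below is a minimal sketch on a toy quadratic loss; the loss and the values of η and λ are illustrative assumptions.

```python
import numpy as np

def grad(theta):
    return 2 * theta            # gradient of the toy loss L(theta) = theta^2

theta = np.array([5.0])
m = np.zeros_like(theta)        # accumulated movement (momentum)
eta, lam = 0.1, 0.9             # learning rate and momentum coefficient (assumed values)

for step in range(200):
    g = grad(theta)
    m = lam * m - eta * g       # new movement = momentum of last step - lr * gradient
    theta = theta + m           # move by the accumulated direction, not just -eta*g
print(theta)                    # ends close to the minimum at 0
```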
2.3 Short summary
3. Learning rate
3.1 lr cannot be one-size-fits-all
If the gradient along some direction is consistently small (the surface is flat there), the learning rate should be turned up; if the gradient is large (the surface is steep), the learning rate should be turned down. See the sketch after the list of methods below.
(1) Root Mean Square
(2) RMSProp
(3) Adam: RMSProp + Momentum
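A minimal sketch of the three adaptive rules on one parameter; the toy loss and all hyperparameter values are illustrative assumptions. Adagrad divides the learning rate by the root mean square of all past gradients, RMSProp replaces that plain average with an exponential moving average weighted by α, and Adam combines the RMSProp denominator with a momentum numerator plus bias correction.

```python
import numpy as np

eta, alpha, eps = 0.1, 0.9, 1e-8      # illustrative hyperparameters
beta1, beta2 = 0.9, 0.999

def grad(theta):
    return 2 * theta                  # gradient of the toy loss L(theta) = theta^2

# (1) Root Mean Square (Adagrad-style): divide by the RMS of all past gradients.
theta, sum_sq = 5.0, 0.0
for t in range(100):
    g = grad(theta)
    sum_sq += g ** 2
    sigma = np.sqrt(sum_sq / (t + 1))            # root mean square of g_0 ... g_t
    theta -= eta / (sigma + eps) * g
print(theta)                                     # moves toward the minimum at 0

# (2) RMSProp: exponential moving average, so recent gradients matter more.
theta, sq = 5.0, 0.0
for t in range(100):
    g = grad(theta)
    sq = alpha * sq + (1 - alpha) * g ** 2
    theta -= eta / (np.sqrt(sq) + eps) * g
print(theta)                                     # moves toward the minimum at 0

# (3) Adam: RMSProp denominator + momentum numerator, with bias correction.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 101):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)                                     # moves toward the minimum at 0
```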
3.2 Learning Rate Scheduling
(1) Learning Rate Decay
(2) Warm up
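A minimal sketch of the two schedules; the peak rate, the step counts, and the cosine shape of the decay are illustrative assumptions. Decay lets η shrink as training approaches the end, while warm up first raises η from a small value to its peak before decaying.

```python
import numpy as np

def lr_decay(step, eta_max=1e-3, total_steps=10_000):
    """Learning rate decay: eta shrinks smoothly toward 0 (cosine decay here)."""
    return eta_max * 0.5 * (1 + np.cos(np.pi * step / total_steps))

def lr_warmup_then_decay(step, eta_max=1e-3, warmup_steps=1_000, total_steps=10_000):
    """Warm up: raise eta linearly first, then decay as above."""
    if step < warmup_steps:
        return eta_max * step / warmup_steps
    return lr_decay(step - warmup_steps, eta_max, total_steps - warmup_steps)

for s in (0, 500, 1_000, 5_000, 10_000):
    print(s, lr_warmup_then_decay(s))   # rises until step 1000, then decays to 0
```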
4. Possible impact of Loss
A rough understanding of softmax:
Squashing y (which can take any value) into the range between 0 and 1, so that it can be compared with the label ŷ, which is 0 or 1.
Why Cross-entropy?
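A small numpy sketch of both points, with made-up logits and a one-hot label: softmax turns arbitrary logits into values between 0 and 1 that sum to 1, and when the prediction is badly wrong, cross-entropy still produces a large gradient on the logits while MSE produces an almost-zero one, i.e. MSE leaves training stuck on a flat region of the error surface.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

y_hat = np.array([1.0, 0.0, 0.0])       # one-hot label: class 0 is correct
z = np.array([-10.0, 10.0, 0.0])        # badly wrong logits (class 1 favored)
y = softmax(z)
print(y)                                # every entry between 0 and 1, summing to 1

cross_entropy = -np.sum(y_hat * np.log(y + 1e-12))
mse = np.sum((y - y_hat) ** 2)

# Gradients with respect to the logits z:
grad_ce = y - y_hat                     # softmax + cross-entropy
J = np.diag(y) - np.outer(y, y)         # Jacobian of softmax, dy_i/dz_k
grad_mse = J @ (2 * (y - y_hat))        # softmax + MSE

print(grad_ce)    # entries near -1 and 1: large gradient, easy to move away
print(grad_mse)   # every entry tiny (~1e-4 or smaller): flat region, hard to move
```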
5. Batch Normalization
5.1 Why batch normalization
One form of feature normalization:
It makes little difference whether the normalization is applied to z (before the activation) or to a (after the activation). When the activation is sigmoid, normalizing z is recommended. As below:
After x goes through feature normalization, z1, z2, z3 and the following a1, a2, a3 all become coupled: once z1 changes, everything downstream changes as well, so all of these values have to be considered together. Because the whole dataset is too large to consider at once, the normalization is instead computed over one batch, as sketched below.
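A minimal sketch of batch normalization at one layer during training; the batch size, layer width, and the γ/β initialization are illustrative assumptions. μ and σ are computed over the current batch only, each z is standardized with them, and the learnable γ and β can rescale and shift the result.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4))          # pre-activation z for a batch of 8 examples, 4 neurons
gamma = np.ones(4)                   # learnable scale, initialized to 1
beta = np.zeros(4)                   # learnable shift, initialized to 0
eps = 1e-5

mu = Z.mean(axis=0)                  # mean over the batch, per neuron
sigma = Z.std(axis=0)                # standard deviation over the batch, per neuron
Z_tilde = (Z - mu) / (sigma + eps)   # standardized z: mean ~0, std ~1 within this batch
Z_hat = gamma * Z_tilde + beta       # network can undo the normalization if that helps

a = 1 / (1 + np.exp(-Z_hat))         # sigmoid activation applied after normalizing z
print(Z_tilde.mean(axis=0), Z_tilde.std(axis=0))
```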
5.2 Testing problem
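At test time there may be no batch from which to compute μ and σ, so frameworks such as PyTorch keep a moving average of the batch statistics during training and use those fixed values at inference. A rough sketch of that idea; the momentum value p of the moving average is an illustrative assumption.

```python
import numpy as np

p = 0.9                                    # momentum of the moving average (assumed value)
mu_bar, sigma_bar = np.zeros(4), np.ones(4)

def train_step(Z, mu_bar, sigma_bar):
    mu, sigma = Z.mean(axis=0), Z.std(axis=0)
    # Update the running statistics with the current batch statistics.
    mu_bar = p * mu_bar + (1 - p) * mu
    sigma_bar = p * sigma_bar + (1 - p) * sigma
    return (Z - mu) / (sigma + 1e-5), mu_bar, sigma_bar

def test_step(Z, mu_bar, sigma_bar):
    # No batch statistics at inference: use the running averages instead.
    return (Z - mu_bar) / (sigma_bar + 1e-5)

rng = np.random.default_rng(1)
for _ in range(100):
    Z_tilde, mu_bar, sigma_bar = train_step(rng.normal(size=(8, 4)), mu_bar, sigma_bar)
print(test_step(rng.normal(size=(1, 4)), mu_bar, sigma_bar))   # works even for one example
```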
5.3 Other normalization
A short summary
Batch normalization changes the landscape of the error surface.