1. Reasons for failure
The gradient is close to zero: the parameters are stuck at a critical point (a stationary point).
To judge whether a critical point is a saddle point (or a local minimum), compute the Hessian matrix H.
Sometimes all eigenvalues of H are positive (a local minimum); sometimes some are positive and some are negative (a saddle point).
Saddle point
λ is an eigenvalue of H, and u is the corresponding eigenvector of H.
If training is stuck at a saddle point, find a negative eigenvalue λ of H and its corresponding eigenvector u; updating the parameters along the direction of u leads to a lower loss. (This is not the best solution in practice; better methods come later.)
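A minimal numpy sketch of this idea on a toy loss with a saddle point (in practice the full Hessian is far too expensive to compute for a real network; the `loss`, `gradient` and `hessian` functions here are made-up illustrations):

```python
import numpy as np

# Toy loss with a saddle point at the origin: L(w) = w0^2 - w1^2
def loss(w):
    return w[0] ** 2 - w[1] ** 2

def gradient(w):
    return np.array([2 * w[0], -2 * w[1]])

def hessian(w):
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])    # constant for this toy loss

w = np.zeros(2)                       # gradient is exactly zero here: a critical point
H = hessian(w)
eigvals, eigvecs = np.linalg.eigh(H)  # H is symmetric; eigh returns sorted eigenvalues

if np.all(eigvals > 0):
    print("local minimum")
else:
    print("saddle point")
    u = eigvecs[:, np.argmin(eigvals)]  # eigenvector of a negative eigenvalue
    w_new = w + 0.1 * u                 # step along u
    print(loss(w), "->", loss(w_new))   # the loss goes down: 0.0 -> -0.01
```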
2. Optimization with Batch
1 epoch = seeing all the batches once; shuffle (randomly reorder) the data after each epoch.
Small Batch vs. Large Batch
A larger batch size does not require a longer time to compute the gradient (because of parallel computation on the GPU).
A smaller batch requires a longer time for one epoch (it needs more parameter updates to go through all the data).
However, experiments show that the large-batch model performs worse: the optimization fails.
The reason is that a smaller batch is more "flexible": the noise in its gradient directions helps training escape bad critical points.
Summary: ![small batch vs. large batch](https://img-blog.csdnimg.cn/20210712164456424.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWJhaWRvdWRvdQ==,size_16,color_FFFFFF,t_70)
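A minimal sketch of the batch/epoch bookkeeping described above, using a toy least-squares problem in numpy (the data, batch size and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                       # toy data
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

w = np.zeros(10)
batch_size, lr = 32, 0.01

for epoch in range(5):
    idx = rng.permutation(len(X))                     # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]             # one batch -> one parameter update
        g = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient computed on this batch only
        w -= lr * g
    print(f"epoch {epoch}: mse = {np.mean((X @ w - y) ** 2):.4f}")
```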
3. Learning Rate cannot be one-size-fits-all
Different parameters need different learning rates.
The denominator $\sigma_i^t$ in the update $\theta_i^{t+1} = \theta_i^t - \dfrac{\eta}{\sigma_i^t} g_i^t$ is parameter dependent.
How should $\sigma_i^t$ be computed?
……
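The notes leave the formula for $\sigma_i^t$ open; one common choice (an Adagrad-style root mean square of all past gradients) is sketched below on a toy loss whose two parameters have very different curvature. The toy loss and constants are assumptions for illustration only.

```python
import numpy as np

# Parameter-dependent learning rate: each parameter i gets its own sigma_i^t.
# Here sigma_i^t is the root mean square of that parameter's past gradients
# (the Adagrad choice); the toy loss L(w) = 100*w0^2 + w1^2 is illustrative.
def toy_grad(w):
    return np.array([200 * w[0], 2 * w[1]])

w = np.array([1.0, 1.0])
sum_sq = np.zeros_like(w)
eta = 0.1

for t in range(50):
    g = toy_grad(w)
    sum_sq += g ** 2
    sigma = np.sqrt(sum_sq / (t + 1)) + 1e-8   # per-parameter RMS of past gradients
    w -= (eta / sigma) * g                     # steep direction -> large sigma -> smaller step

print(w)   # both parameters make comparable progress despite a 100x curvature gap
```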
We also want the learning rate to adjust automatically over time during training, which leads to the following method.
RMSProp
Recent gradients have a larger influence, and past gradients have less influence.
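A minimal sketch of the RMSProp update, where $\sigma_i^t = \sqrt{\alpha (\sigma_i^{t-1})^2 + (1-\alpha)(g_i^t)^2}$ is an exponential moving average of squared gradients; the toy loss and hyperparameter values are illustrative assumptions:

```python
import numpy as np

def rmsprop_step(w, g, sigma_sq, eta=0.1, alpha=0.9, eps=1e-8):
    """One RMSProp update: sigma_sq is an exponential moving average of g**2,
    so recent gradients dominate and older ones are down-weighted geometrically."""
    sigma_sq = alpha * sigma_sq + (1 - alpha) * g ** 2
    w = w - eta * g / (np.sqrt(sigma_sq) + eps)
    return w, sigma_sq

# Toy usage: minimize L(w) = w0^2 + 10*w1^2
w = np.array([3.0, 3.0])
sigma_sq = np.zeros_like(w)
for _ in range(200):
    g = np.array([2 * w[0], 20 * w[1]])
    w, sigma_sq = rmsprop_step(w, g, sigma_sq)
print(w)   # both coordinates end up near the minimum at (0, 0)
```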
Learning Rate Scheduling
1. Learning Rate Decay: the learning rate shrinks as training proceeds.
2. Warm Up: the learning rate first increases and then decreases (both are sketched below).
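A small sketch of the two schedules named above; the exact shapes and constants here are illustrative assumptions, not the lecture's specific schedules:

```python
def lr_decay(step, eta_max=0.1, decay=0.01):
    # Learning Rate Decay: eta shrinks as the step count grows
    return eta_max / (1 + decay * step)

def lr_warmup(step, eta_max=0.1, warmup_steps=100, decay=0.01):
    # Warm Up: start small, ramp up to eta_max, then decay
    if step < warmup_steps:
        return eta_max * (step + 1) / warmup_steps
    return eta_max / (1 + decay * (step - warmup_steps))

for s in (0, 50, 100, 500, 1000):
    print(s, round(lr_decay(s), 4), round(lr_warmup(s), 4))
```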
4. Summary of Optimization
Vanilla gradient descent, $\theta_i^{t+1} = \theta_i^t - \eta\, g_i^t$,
is improved in several ways: the fixed $\eta$ becomes a scheduled $\eta^t$ and is divided by the parameter-dependent $\sigma_i^t$ (e.g. from RMSProp), giving $\theta_i^{t+1} = \theta_i^t - \dfrac{\eta^t}{\sigma_i^t}\, g_i^t$.
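To show how these pieces fit together, here is a short sketch that combines a decaying $\eta^t$ with the RMSProp $\sigma_i^t$ from the earlier example; the toy loss and all constants are illustrative assumptions:

```python
import numpy as np

# Combined update: w_i <- w_i - (eta_t / sigma_i_t) * g_i_t
w = np.array([3.0, 3.0])
sigma_sq = np.zeros_like(w)
alpha, eta_max, decay, eps = 0.9, 0.1, 0.01, 1e-8

for t in range(200):
    g = np.array([2 * w[0], 20 * w[1]])                   # gradient of the toy loss w0^2 + 10*w1^2
    eta_t = eta_max / (1 + decay * t)                     # learning rate scheduling (decay)
    sigma_sq = alpha * sigma_sq + (1 - alpha) * g ** 2    # RMSProp estimate of sigma^2
    w -= eta_t * g / (np.sqrt(sigma_sq) + eps)            # parameter-dependent step size

print(w)   # ends near the minimum at (0, 0)
```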