Lecture 7: Training Neural Networks
Optimization
SGD
- Cons
- Very slow progress along shallow dimensions, jitter along steep directions.
- Local minima and saddle points; saddle points are much more common in high dimensions.
- Gradients come from minibatches, so they can be noisy!
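A minimal numpy sketch of the vanilla SGD update for reference (function name and default learning rate are illustrative, not from the lecture):

```python
import numpy as np

def sgd_step(x, dx, learning_rate=1e-2):
    # Vanilla SGD: step along the negative (minibatch) gradient.
    return x - learning_rate * dx
```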
SGD + Momentum
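A sketch of one common SGD+Momentum formulation (rho plays the role of friction; names and defaults are illustrative):

```python
def sgd_momentum_step(x, dx, v, learning_rate=1e-2, rho=0.9):
    # Build up a "velocity" as a running mean of gradients,
    # then step along the velocity instead of the raw gradient.
    v = rho * v + dx
    x = x - learning_rate * v
    return x, v
```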
Nesterov Momentum
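A sketch of Nesterov momentum in the common change-of-variables form, where dx is the gradient evaluated at the current (look-ahead) parameters; here the learning rate is folded into the velocity (names and defaults are illustrative):

```python
def nesterov_momentum_step(x, dx, v, learning_rate=1e-2, rho=0.9):
    # "Look-ahead" momentum, rewritten so x stays the variable we track.
    v_prev = v
    v = rho * v - learning_rate * dx
    x = x + (-rho * v_prev + (1 + rho) * v)
    return x, v
```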
AdaGrad
- The step size becomes smaller and smaller because grad_squared only ever increases.
- Updates are damped in the steep, oscillating dimensions and relatively amplified in the shallow ones.
- Not commonly used in practice: progress becomes very slow and training can get stuck.
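A minimal AdaGrad sketch (names and defaults are illustrative):

```python
import numpy as np

def adagrad_step(x, dx, grad_squared, learning_rate=1e-2, eps=1e-7):
    # grad_squared accumulates dx**2 per dimension and never shrinks,
    # so the effective step size keeps decaying.
    grad_squared = grad_squared + dx * dx
    x = x - learning_rate * dx / (np.sqrt(grad_squared) + eps)
    return x, grad_squared
```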
RMSProp
- decay_rate: commonly 0.9 or 0.99
- Solves AdaGrad's problem of slowing down movement in every dimension: the decaying average only damps the dimensions whose recent gradients are large.
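A minimal RMSProp sketch (defaults are illustrative):

```python
import numpy as np

def rmsprop_step(x, dx, grad_squared, learning_rate=1e-3, decay_rate=0.9, eps=1e-7):
    # Leaky accumulation: old squared gradients decay away, so steps are only
    # damped in dimensions whose *recent* gradients are large.
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x = x - learning_rate * dx / (np.sqrt(grad_squared) + eps)
    return x, grad_squared
```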
Adam
- Sort of like RMSProp with momentum
- Bias correction keeps the very first steps from being very large (the moment estimates are initialized at zero).
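A minimal Adam sketch showing both moments and the bias correction (defaults are illustrative; t is assumed to start at 1):

```python
import numpy as np

def adam_step(x, dx, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (momentum-like) and second moment (RMSProp-like).
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx * dx
    # Bias correction: m and v start at zero and are biased low early on;
    # correcting them keeps the first steps from being huge.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```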
Learning rate decay
- Common with SGD (and SGD+Momentum), less common with Adam.
- Plot the loss curve first, then decide whether decay is needed.
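Sketches of a few common decay schedules (the specific constants are illustrative, not from the lecture):

```python
import numpy as np

def step_decay(lr0, epoch, drop=0.5, every=30):
    # e.g. halve the learning rate every 30 epochs
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.05):
    return lr0 * np.exp(-k * epoch)

def inverse_decay(lr0, epoch, k=0.05):
    # 1/t-style decay
    return lr0 / (1 + k * epoch)
```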
Second-order Optimization
- Quasi-Newton methods (BFGS most popular):
- Instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
- L-BFGS (Limited memory BFGS):
- Does not form/store the full inverse Hessian.
- Usually works very well in full-batch, deterministic mode, i.e. if you have a single, deterministic f(x), then L-BFGS will probably work very nicely.
- Does not transfer very well to the mini-batch setting and gives bad results. Adapting L-BFGS to the large-scale, stochastic setting is an active area of research.
Model Ensembles
- Train multiple independent models
- At test time average their results
- Enjoy 2% extra performance
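A minimal sketch of test-time averaging; the predict_proba interface and 2D (N x classes) output shape are assumptions, not part of the lecture:

```python
import numpy as np

def ensemble_predict(models, x):
    # Average class probabilities over independently trained models,
    # then take the argmax per example.
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return np.argmax(probs, axis=1)
```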
Tips and Tricks
- Instead of training independent models, use multiple snapshots of a single model during training!
Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017
Improve single-model performance
Regularization
- Add a term to the loss (e.g. L2 weight regularization)
- Dropout (two interpretations)
- Forces the network to have a redundant representation; Prevents co-adaptation of features
- Dropout is training a large ensemble of models (that share parameters).
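A minimal sketch of inverted dropout (scale by 1/p at train time so test time is a plain forward pass); p is the keep probability:

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    # At train time, drop each unit with probability 1-p and rescale by 1/p.
    # At test time, return the input unchanged.
    if not train:
        return x
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask
```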
- Data augmentation (see the sketch after this list)
- Horizontal Flips
- Random crops and scales
- Color Jitter
- Translation
- Rotation
- Stretching
- Shearing
- Lens distortions
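A minimal sketch of random crops plus horizontal flips (the crop size and H x W x C image layout are assumptions):

```python
import numpy as np

def random_crop_and_flip(img, crop=224):
    # img: H x W x C array; take a random crop and flip horizontally half the time.
    h, w = img.shape[:2]
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:
        out = out[:, ::-1]  # horizontal flip
    return out
```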
- DropConnect
Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013
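A rough sketch of the idea: mask individual weights instead of activations. The 1/p rescaling here is an inverted-dropout-style simplification for illustration, not the paper's exact test-time procedure:

```python
import numpy as np

def dropconnect_linear(x, W, b, p=0.5, train=True):
    # Randomly zero individual *weights* (not activations) at train time.
    if train:
        W = W * (np.random.rand(*W.shape) < p) / p
    return x @ W + b
```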
- Fractional Max Pooling
Graham, “Fractional Max Pooling”, arXiv 2014
- Stochastic Depth
Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016