Lecture 7: Training Neural Networks
Optimization
SGD
- Cons
- Very slow progress along shallow dimensions, jitter along steep directions.
- Local minima and saddle points; saddle points are much more common in high dimensions.
- Gradients come from minibatches, so they can be noisy!
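A minimal numpy sketch of the vanilla SGD update for reference (function name and default learning rate are illustrative, not from the lecture):

```python
import numpy as np

def sgd_step(x, dx, learning_rate=1e-2):
    # Vanilla SGD: step along the negative (minibatch) gradient.
    return x - learning_rate * dx
```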
SGD + Momentum
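A sketch of one common SGD+Momentum formulation (rho plays the role of friction; names and defaults are illustrative):

```python
def sgd_momentum_step(x, dx, v, learning_rate=1e-2, rho=0.9):
    # Build up a "velocity" as a running mean of gradients,
    # then step along the velocity instead of the raw gradient.
    v = rho * v + dx
    x = x - learning_rate * v
    return x, v
```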
Nesterov Momentum
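A sketch of Nesterov momentum in the common change-of-variables form, where dx is the gradient evaluated at the current (look-ahead) parameters; here the learning rate is folded into the velocity (names and defaults are illustrative):

```python
def nesterov_momentum_step(x, dx, v, learning_rate=1e-2, rho=0.9):
    # "Look-ahead" momentum, rewritten so x stays the variable we track.
    v_prev = v
    v = rho * v - learning_rate * dx
    x = x + (-rho * v_prev + (1 + rho) * v)
    return x, v
```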
AdaGrad
- The step size becomes smaller and smaller because grad_squared only ever increases.
- Updates are damped in the steep, oscillating dimensions and relatively amplified in the shallow ones.
- Not commonly used in practice: progress becomes very slow and training can get stuck.
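A minimal AdaGrad sketch (names and defaults are illustrative):

```python
import numpy as np

def adagrad_step(x, dx, grad_squared, learning_rate=1e-2, eps=1e-7):
    # grad_squared accumulates dx**2 per dimension and never shrinks,
    # so the effective step size keeps decaying.
    grad_squared = grad_squared + dx * dx
    x = x - learning_rate * dx / (np.sqrt(grad_squared) + eps)
    return x, grad_squared
```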
RMSProp
- decay_rate: commonly 0.9 or 0.99
- Solves AdaGrad's problem of slowing down movement in every dimension: the decaying average only damps the dimensions whose recent gradients are large.
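A minimal RMSProp sketch (defaults are illustrative):

```python
import numpy as np

def rmsprop_step(x, dx, grad_squared, learning_rate=1e-3, decay_rate=0.9, eps=1e-7):
    # Leaky accumulation: old squared gradients decay away, so steps are only
    # damped in dimensions whose *recent* gradients are large.
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x = x - learning_rate * dx / (np.sqrt(grad_squared) + eps)
    return x, grad_squared
```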
Adam
- Sort of like RMSProp with momentum
- Bias correction keeps the very first steps from being very large (the moment estimates are initialized at zero).
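A minimal Adam sketch showing both moments and the bias correction (defaults are illustrative; t is assumed to start at 1):

```python
import numpy as np

def adam_step(x, dx, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (momentum-like) and second moment (RMSProp-like).
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx * dx
    # Bias correction: m and v start at zero and are biased low early on;
    # correcting them keeps the first steps from being huge.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```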
Learning rate decay
- Common with SGD (and SGD+Momentum), less common with Adam.
- Plot the loss curve first, then decide whether decay is needed.
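Sketches of a few common decay schedules (the specific constants are illustrative, not from the lecture):

```python
import numpy as np

def step_decay(lr0, epoch, drop=0.5, every=30):
    # e.g. halve the learning rate every 30 epochs
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.05):
    return lr0 * np.exp(-k * epoch)

def inverse_decay(lr0, epoch, k=0.05):
    # 1/t-style decay
    return lr0 / (1 + k * epoch)
```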
Second-order Optimization
- Quasi-Newton methods (BFGS most popular):
- Instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
- L-BFGS (Limited memory BFGS):
- Does not form/store the full inverse Hessian.
- Usually works very well in full-batch, deterministic mode, i.e. if you have a single, deterministic f(x), then L-BFGS will probably work very nicely.
- Does not transfer very well to the mini-batch setting and gives bad results. Adapting L-BFGS to the large-scale, stochastic setting is an active area of research.
Model Ensembles
- Train multiple independent models
- At test time average their results
- Enjoy 2% extra performance
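A minimal sketch of test-time averaging; the predict_proba interface and 2D (N x classes) output shape are assumptions, not part of the lecture:

```python
import numpy as np

def ensemble_predict(models, x):
    # Average class probabilities over independently trained models,
    # then take the argmax per example.
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return np.argmax(probs, axis=1)
```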
Tips and Tricks
- Instead of training independent models, use multiple snapshots of a single model during training!
Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017
Improve single-model performance
Regularization
- Add a term to the loss (e.g. L2 weight regularization)
- Dropout (two interpretations)
- Forces the network to have a redundant representation; Prevents co-adaptation of features
- Dropout is training a large ensemble of models (that share parameters).
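A minimal sketch of inverted dropout (scale by 1/p at train time so test time is a plain forward pass); p is the keep probability:

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    # At train time, drop each unit with probability 1-p and rescale by 1/p.
    # At test time, return the input unchanged.
    if not train:
        return x
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask
```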
- Data augmentation (see the sketch after this list)
- Horizontal Flips
- Random crops and scales
- Color Jitter
- Translation
- Rotation
- Stretching
- Shearing
- Lens distortions
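A minimal sketch of random crops plus horizontal flips (the crop size and H x W x C image layout are assumptions):

```python
import numpy as np

def random_crop_and_flip(img, crop=224):
    # img: H x W x C array; take a random crop and flip horizontally half the time.
    h, w = img.shape[:2]
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:
        out = out[:, ::-1]  # horizontal flip
    return out
```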
- DropConnect
Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013
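A rough sketch of the idea: mask individual weights instead of activations. The 1/p rescaling here is an inverted-dropout-style simplification for illustration, not the paper's exact test-time procedure:

```python
import numpy as np

def dropconnect_linear(x, W, b, p=0.5, train=True):
    # Randomly zero individual *weights* (not activations) at train time.
    if train:
        W = W * (np.random.rand(*W.shape) < p) / p
    return x @ W + b
```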
- Fractional Max Pooling
Graham, “Fractional Max Pooling”, arXiv 2014
- Stochastic Depth
Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016