cs231n notes - Lecture 7: An Introduction to and Comparison of Optimization Methods

Lecture-7 Training Neural Networks

Optimization

SGD
  • Cons
    1. Very slow progress along the shallow dimension, jitter along the steep direction.
    2. Local minima and saddle points. Saddle points are much more common in high dimensions.
    3. Gradients come from minibatches, so they can be noisy!
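Below is a minimal NumPy sketch of the vanilla SGD update on an illustrative toy quadratic with one shallow and one steep dimension (the toy loss, learning rate, and iteration count are made up for illustration, not taken from the lecture):

```python
import numpy as np

# Toy loss f(x) = 0.05*x0^2 + 5*x1^2: x0 is the shallow dimension, x1 the steep one.
grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])

x = np.array([-5.0, 2.0])
learning_rate = 0.15
for _ in range(50):
    x -= learning_rate * grad(x)   # vanilla SGD: step along the negative gradient
print(x)  # x[1] flips sign every step early on (jitter); x[0] shrinks very slowly
```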
SGD + Momentum

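A minimal sketch of the SGD+Momentum update, on the same illustrative toy quadratic as the SGD sketch above (rho is the usual momentum/"friction" coefficient; the values are illustrative):

```python
import numpy as np

# Same toy loss as in the SGD sketch: f(x) = 0.05*x0^2 + 5*x1^2.
grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])

x = np.array([-5.0, 2.0])
v = np.zeros_like(x)
learning_rate, rho = 0.02, 0.9
for _ in range(200):
    v = rho * v + grad(x)      # velocity: decaying running sum of gradients
    x -= learning_rate * v     # step along the velocity instead of the raw gradient
print(x)  # both dimensions make steady progress toward the origin
```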

Nesterov Momentum

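A minimal sketch of Nesterov momentum in the rearranged form commonly shown in the course notes, where the gradient is evaluated at the current point and a "look-ahead" correction enters the position update (same illustrative toy loss as above):

```python
import numpy as np

# Same toy loss: f(x) = 0.05*x0^2 + 5*x1^2.
grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])

x = np.array([-5.0, 2.0])
v = np.zeros_like(x)
learning_rate, rho = 0.02, 0.9
for _ in range(200):
    old_v = v
    v = rho * v - learning_rate * grad(x)   # velocity update
    x += -rho * old_v + (1 + rho) * v       # position update with look-ahead correction
print(x)
```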

AdaGrad

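A minimal sketch of the AdaGrad update on the same illustrative toy loss (the 1e-7 term just avoids division by zero). Note how grad_squared only ever grows, which is what the bullets below refer to:

```python
import numpy as np

# Same toy loss: f(x) = 0.05*x0^2 + 5*x1^2.
grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])

x = np.array([-5.0, 2.0])
grad_squared = np.zeros_like(x)
learning_rate = 1.0
for _ in range(200):
    dx = grad(x)
    grad_squared += dx * dx                                   # monotonically growing accumulator
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)  # per-dimension scaled step
print(x)
```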

  • The effective step size becomes smaller and smaller because grad_squared only ever increases.
  • The update is damped in the dimension whose gradient is large (the oscillating one) and relatively boosted in the flat dimension.
  • Not commonly used for deep networks: progress becomes very slow and it gets stuck easily.
RMSProp

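A minimal sketch of the RMSProp update: the same idea as AdaGrad, except the squared-gradient accumulator is a leaky (exponentially decaying) average controlled by decay_rate (toy loss and constants are illustrative):

```python
import numpy as np

# Same toy loss: f(x) = 0.05*x0^2 + 5*x1^2.
grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])

x = np.array([-5.0, 2.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate = 0.1, 0.99
for _ in range(500):
    dx = grad(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx  # leaky average
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
print(x)
```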

  • decay_rate: commonly 0.9 or 0.99
  • Solves AdaGrad's problem of slowing the movement down in every dimension: the leaky accumulator only damps the dimensions whose gradients stay large, instead of shrinking the step size forever.
Adam

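A minimal sketch of the full Adam update with bias correction, again on the illustrative toy loss; beta1 and beta2 are set to commonly used defaults, while the learning rate and iteration count are just for the toy example:

```python
import numpy as np

# Same toy loss: f(x) = 0.05*x0^2 + 5*x1^2.
grad = lambda x: np.array([0.1 * x[0], 10.0 * x[1]])

x = np.array([-5.0, 2.0])
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
learning_rate, beta1, beta2 = 1e-2, 0.9, 0.999
for t in range(1, 1001):
    dx = grad(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx          # momentum-like term
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # RMSProp-like term
    first_unbias = first_moment / (1 - beta1 ** t)                  # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)                # bias correction
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
print(x)
```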

  • Sort of like RMSProp with momentum
  • Bias correction is used to avoid taking a huge step at the very beginning, when the first and second moment estimates are still close to zero.

Learning rate decay

  • Common with SGD (and SGD+Momentum), but less common with Adam.
  • Plot the loss curve first and decide whether decay is actually needed.
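Two common schedules as a sketch (the constants are illustrative): step decay drops the rate by a factor every few epochs, exponential decay shrinks it continuously:

```python
import math

base_lr = 1e-1  # illustrative initial learning rate

def step_decay(epoch, drop=0.5, epochs_per_drop=20):
    # e.g. halve the learning rate every 20 epochs
    return base_lr * (drop ** (epoch // epochs_per_drop))

def exponential_decay(epoch, k=0.05):
    # lr = lr0 * exp(-k * t)
    return base_lr * math.exp(-k * epoch)

print(step_decay(45), exponential_decay(45))
```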

Second-order Optimization

  • Quasi-Newton methods (BFGS is the most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
  • L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian.
  • Usually works very well in full-batch, deterministic mode, i.e. if you have a single deterministic f(x), then L-BFGS will probably work very nicely.
  • Does not transfer well to the mini-batch setting and tends to give bad results. Adapting L-BFGS to the large-scale, stochastic setting is an active area of research.
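As a concrete illustration of the full-batch, deterministic case, here is a sketch using SciPy's L-BFGS implementation on a simple quadratic (using SciPy here is my own choice, not something from the lecture):

```python
import numpy as np
from scipy.optimize import minimize

# A single, deterministic objective: f(x) = 0.05*x0^2 + 5*x1^2.
def f(x):
    return 0.05 * x[0] ** 2 + 5.0 * x[1] ** 2

def grad_f(x):
    return np.array([0.1 * x[0], 10.0 * x[1]])

result = minimize(f, x0=np.array([-5.0, 2.0]), jac=grad_f, method="L-BFGS-B")
print(result.x)  # very close to the optimum at the origin after a handful of iterations
```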

Model Ensembles

  1. Train multiple independent models
  2. At test time average their results

Enjoy 2% extra performance
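A minimal sketch of test-time averaging; the model list and the predict_proba interface are hypothetical stand-ins for however your framework exposes class probabilities:

```python
import numpy as np

def ensemble_predict(models, X):
    # Average the class probabilities of independently trained models,
    # then take the argmax of the averaged distribution.
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1)
```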

Tips and Tricks
  • Instead of training independent models, use multiple snapshots of a single model taken during training! (A rough sketch follows the references below.)

Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017
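A rough sketch of the snapshot idea: cycle the learning rate (a cosine schedule here) and save a copy of the model at the end of each cycle; train_one_epoch, model, and the constants are all hypothetical placeholders, not code from the papers above:

```python
import copy
import math

base_lr, cycle_len, num_epochs = 0.1, 10, 50   # hypothetical settings
snapshots = []

for epoch in range(num_epochs):
    # Cosine schedule within each cycle: the rate restarts high, then anneals toward 0.
    t = (epoch % cycle_len) / cycle_len
    lr = 0.5 * base_lr * (1 + math.cos(math.pi * t))
    train_one_epoch(model, lr)                 # hypothetical training routine
    if (epoch + 1) % cycle_len == 0:
        snapshots.append(copy.deepcopy(model)) # keep this snapshot for the test-time ensemble
```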

Improve single-model performance

Regularization
  • Add a regularization term to the loss: L = (1/N) Σ_i L_i + λ R(W), e.g. L2 regularization with R(W) = Σ W².

  • Dropout (two explanations; a minimal sketch follows at the end of this section)

    • Forces the network to have a redundant representation; prevents co-adaptation of features.
    • Dropout trains a large ensemble of models (that share parameters).
  • Data augmentation

    • Horizontal Flips
    • Random crops and scales
    • Color Jitter
    • Translation
    • Rotation
    • Stretching
    • Shearing
    • Lens distortions
  • DropConnect

Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013

  • Fractional Max Pooling

Graham, “Fractional Max Pooling”, arXiv 2014

  • Stochastic Depth

Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
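Referring back to the Dropout bullet above, here is a minimal sketch of inverted dropout for a single hidden layer: at training time randomly zero units and rescale by 1/p, so that no extra scaling is needed at test time (shapes and values are illustrative):

```python
import numpy as np

p = 0.5  # probability of keeping a unit

def dropout_forward(h, train=True):
    # h: activations of a hidden layer, e.g. shape (batch_size, hidden_dim)
    if not train:
        return h                               # test time: identity, thanks to the 1/p rescaling
    mask = (np.random.rand(*h.shape) < p) / p  # inverted dropout mask
    return h * mask

h = np.random.randn(4, 10)
print(dropout_forward(h, train=True))
```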
