CS231n
Lecture 7: Training Neural Networks, Part 2
Optimization
SGD
w -= lr * grad
- Loss function has high condition number: ratio of largest to smallest singular value of the Hessian matrix is large ⇒ Very slow progress along shallow dimension, jitter along steep direction
- Easily gets stuck in local minima or saddle points and cannot escape
- Sensitive to noise from minibatch gradients
SGD + Momentum
v = rho * v + grad
x -= lr * v
- Build up “velocity” as a running mean of gradients
- Rho gives “friction”; typically rho=0.9 or 0.99
- Helps escape saddle points and local minima, and smooths out minibatch noise
Nesterov Momentum
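A minimal sketch of the update, assuming grad_at(x) is a hypothetical helper returning the gradient of the loss at x; the gradient is evaluated at the "look-ahead" point rather than the current position:

v = rho * v - lr * grad_at(x + rho * v)   # velocity built from the look-ahead gradient
x += v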
AdaGrad
grad_squared += grad * grad                       # accumulate per-dimension sum of squared gradients
x -= lr * grad / (sqrt(grad_squared) + epsilon)
Added element-wise scaling of the gradient based on the historical sum of squares in each dimension
Q: What happens with AdaGrad?
A: Per-dimension scaling of the step: progress along steep directions (large accumulated gradients) is damped, progress along flat directions is accelerated
Q2: What happens to the step size over long time?
A: It shrinks toward zero, because the accumulated sum of squared gradients only grows
RMSProp
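A minimal sketch, following the AdaGrad update above but with a leaky (decaying) average of squared gradients instead of a raw sum; decay_rate is typically around 0.9:

grad_squared = decay_rate * grad_squared + (1 - decay_rate) * grad * grad
x -= lr * grad / (sqrt(grad_squared) + epsilon)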
Adam
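A minimal sketch of the full update (momentum plus RMSProp-style scaling plus bias correction), with t counting steps from 1; m1 and m2 are the first and second moment estimates referenced in the Q&A below:

m1 = beta1 * m1 + (1 - beta1) * grad          # first moment: running mean of gradients
m2 = beta2 * m2 + (1 - beta2) * grad * grad   # second moment: running mean of squared gradients
m1_hat = m1 / (1 - beta1 ** t)                # bias correction for the zero initialization
m2_hat = m2 / (1 - beta2 ** t)
x -= lr * m1_hat / (sqrt(m2_hat) + epsilon)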
Q: What happens at first timestep?
A: At the first timestep m1 = m2 = 0, so without bias correction the update is roughly x -= lr * (1 - beta1) * grad / (sqrt((1 - beta2) * grad * grad) + epsilon), i.e. the gradient scaled by its own per-dimension magnitude (much like a first AdaGrad step), which can be a very large step; this is why Adam adds bias correction.
Q: Which one of these learning rates is best to use?
A: All of them: start with a large learning rate and decay it over time (learning rate decay, e.g. step decay); decay is commonly used with SGD+Momentum and less often with Adam
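A minimal sketch of step decay, one common schedule (the factor of 10 every 30 epochs is illustrative):

lr = base_lr * (0.1 ** (epoch // 30))   # drop the learning rate by 10x every 30 epochs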
beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
In practice, Adam is a good default choice in most cases
Model Ensembles
- Train multiple independent models
- At test time average their results
⇒ typically ~2% extra performance
Instead of training independent models, use multiple snapshots of a single model during training!
Polyak averaging: Instead of using actual parameter vector, keep a moving average of the parameter vector and use that at test time
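A minimal sketch of such a moving average of the weights (the 0.995 decay is illustrative):

# after every optimization step, update an exponential moving average of the parameters
w_ema = 0.995 * w_ema + 0.005 * w   # evaluate with w_ema instead of w at test time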
These methods seem to be rarely used in practice.
How to improve single-model performance? Regularization (e.g. Dropout)
Regularization
- Data Augmentation: Horizontal Flips, Random crops and scales, Color Jitter, … (see the sketch after this list)
- DropConnect
- Fractional Max Pooling
- Stochastic Depth
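A minimal sketch of train-time horizontal flips and random crops on an image array of shape (H, W, 3), assuming NumPy; the crop size and function name are illustrative:

import numpy as np

def augment(img, crop=224):
    # random horizontal flip with probability 0.5
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]
    # random crop of size crop x crop
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    return img[top:top + crop, left:left + crop, :]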
Dropout
- Forces the network to have a redundant representation;
- Prevents co-adaptation of features
- Effectively trains a large ensemble of models (that share parameters)
At test time, multiply by dropout probability
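A minimal sketch for one layer's activations h with keep probability p, assuming NumPy; in practice the "inverted dropout" variant divides by p at train time instead, so that test time needs no extra scaling:

# train time: drop each unit with probability 1 - p
mask = np.random.rand(*h.shape) < p
h = h * mask

# test time: no dropping, but scale by p to match the expected train-time activations
h = h * p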
Transfer Learning
it’s the norm, not an exception
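A minimal sketch of the standard recipe (take a pretrained CNN, reinitialize the last layer, freeze everything else, train only the new layer), assuming PyTorch and torchvision; exact API details (e.g. the pretrained flag) may differ across versions:

import torch
import torch.nn as nn
import torchvision

# CNN pretrained on ImageNet
model = torchvision.models.resnet18(pretrained=True)

# freeze all pretrained weights
for param in model.parameters():
    param.requires_grad = False

# replace the final classifier with a fresh layer for the new dataset
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 = number of classes in the new task

# only the new layer's parameters are updated
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)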