CS231n
Lecture 7: Training Neural Networks, Part 2
Optimization
SGD
w -= lr * grad
- Loss function has high condition number: ratio of largest to smallest singular value of the Hessian matrix is large ⇒ Very slow progress along shallow dimension, jitter along steep direction
- Easily gets stuck in local minima or saddle points and cannot escape
- Sensitive to noise from minibatch gradients
SGD + Momentum
v = rho * v + grad
x -= lr * v
- Build up “velocity” as a running mean of gradients
- Rho gives “friction”; typically rho=0.9 or 0.99
- Helps escape saddle points and local minima, and smooths out minibatch noise
Nesterov Momentum
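A minimal sketch of the update, assuming grad_at(x) is a hypothetical helper returning the gradient of the loss at x; the gradient is evaluated at the "look-ahead" point rather than the current position:

v = rho * v - lr * grad_at(x + rho * v)   # velocity built from the look-ahead gradient
x += v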
AdaGrad
grad_squared += grad * grad                       # accumulate per-dimension sum of squared gradients
x -= lr * grad / (sqrt(grad_squared) + epsilon)
Added element-wise scaling of the gradient based on the historical sum of squares in each dimension
Q: What happens with AdaGrad?
A: Per-dimension scaling of the step: progress along steep directions (large accumulated gradients) is damped, progress along flat directions is accelerated
Q2: What happens to the step size over long time?
A: It shrinks toward zero, because the accumulated sum of squared gradients only grows
RMSProp
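A minimal sketch, following the AdaGrad update above but with a leaky (decaying) average of squared gradients instead of a raw sum; decay_rate is typically around 0.9:

grad_squared = decay_rate * grad_squared + (1 - decay_rate) * grad * grad
x -= lr * grad / (sqrt(grad_squared) + epsilon)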
Adam
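A minimal sketch of the full update (momentum plus RMSProp-style scaling plus bias correction), with t counting steps from 1; m1 and m2 are the first and second moment estimates referenced in the Q&A below:

m1 = beta1 * m1 + (1 - beta1) * grad          # first moment: running mean of gradients
m2 = beta2 * m2 + (1 - beta2) * grad * grad   # second moment: running mean of squared gradients
m1_hat = m1 / (1 - beta1 ** t)                # bias correction for the zero initialization
m2_hat = m2 / (1 - beta2 ** t)
x -= lr * m1_hat / (sqrt(m2_hat) + epsilon)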
Q: What happens at first timestep?
A: At the first timestep m1 = m2 = 0, so without bias correction the update is roughly x -= lr * (1 - beta1) * grad / (sqrt((1 - beta2) * grad * grad) + epsilon), i.e. the gradient scaled by its own per-dimension magnitude (much like a first AdaGrad step), which can be a very large step; this is why Adam adds bias correction.
Q: Which one of these learning rates is best to use?
A: All of them: start with a large learning rate and decay it over time (learning rate decay, e.g. step decay); decay is commonly used with SGD+Momentum and less often with Adam
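A minimal sketch of step decay, one common schedule (the factor of 10 every 30 epochs is illustrative):

lr = base_lr * (0.1 ** (epoch // 30))   # drop the learning rate by 10x every 30 epochs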
beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
In practice, Adam is a good default choice in most cases
Model Ensembles
- Train multiple independent models
- At test time average their results
⇒ typically ~2% extra performance
Instead of training independent models, use multiple snapshots of a single model during training!
Polyak averaging: Instead of using actual parameter vector, keep a moving average of the parameter vector and use that at test time
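A minimal sketch of such a moving average of the weights (the 0.995 decay is illustrative):

# after every optimization step, update an exponential moving average of the parameters
w_ema = 0.995 * w_ema + 0.005 * w   # evaluate with w_ema instead of w at test time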
These methods seem to be rarely used in practice.
How to improve single-model performance? Regularization (e.g. Dropout)
Regularization
- Data Augmentation: Horizontal Flips, Random crops and scales, Color Jitter, … (see the sketch after this list)
- DropConnect
- Fractional Max Pooling
- Stochastic Depth
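A minimal sketch of train-time horizontal flips and random crops on an image array of shape (H, W, 3), assuming NumPy; the crop size and function name are illustrative:

import numpy as np

def augment(img, crop=224):
    # random horizontal flip with probability 0.5
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]
    # random crop of size crop x crop
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    return img[top:top + crop, left:left + crop, :]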
Dropout
- Forces the network to have a redundant representation;
- Prevents co-adaptation of features
- Effectively trains a large ensemble of models (that share parameters)
At test time, multiply by dropout probability
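A minimal sketch for one layer's activations h with keep probability p, assuming NumPy; in practice the "inverted dropout" variant divides by p at train time instead, so that test time needs no extra scaling:

# train time: drop each unit with probability 1 - p
mask = np.random.rand(*h.shape) < p
h = h * mask

# test time: no dropping, but scale by p to match the expected train-time activations
h = h * p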
Transfer Learning
it’s the norm, not an exception
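A minimal sketch of the standard recipe (take a pretrained CNN, reinitialize the last layer, freeze everything else, train only the new layer), assuming PyTorch and torchvision; exact API details (e.g. the pretrained flag) may differ across versions:

import torch
import torch.nn as nn
import torchvision

# CNN pretrained on ImageNet
model = torchvision.models.resnet18(pretrained=True)

# freeze all pretrained weights
for param in model.parameters():
    param.requires_grad = False

# replace the final classifier with a fresh layer for the new dataset
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 = number of classes in the new task

# only the new layer's parameters are updated
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)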