Lecture 6: Training Neural Networks, Part I

CS231n

Review

Recapping earlier lectures: we have covered backpropagation for training neural networks and the architecture of CNNs, so a CNN can be trained with backpropagation. Concretely, each training iteration is (see the minimal sketch after this list):
1. Sample a mini-batch of data
2. Forward-propagate it through the network to compute the loss
3. Backpropagate the gradient of the loss
4. Update the parameters using the gradients
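
A minimal self-contained numpy sketch of these four steps, using a linear softmax classifier in place of a CNN purely for illustration:

```python
import numpy as np

def sample_minibatch(X, y, batch_size):
    idx = np.random.choice(X.shape[0], batch_size)      # 1. sample a mini-batch
    return X[idx], y[idx]

def softmax_loss_and_grad(W, X, y):
    scores = X.dot(W)                                    # 2. forward pass for the loss
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    N = X.shape[0]
    loss = -np.log(probs[np.arange(N), y]).mean()
    dscores = probs
    dscores[np.arange(N), y] -= 1                        # 3. backpropagate the gradient
    dW = X.T.dot(dscores) / N
    return loss, dW

def train(X, y, num_classes, learning_rate=1e-3, num_iters=1000, batch_size=256):
    W = 0.001 * np.random.randn(X.shape[1], num_classes)
    for _ in range(num_iters):
        Xb, yb = sample_minibatch(X, y, batch_size)
        loss, dW = softmax_loss_and_grad(W, Xb, yb)
        W -= learning_rate * dW                          # 4. gradient-descent parameter update
    return W
```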

Training Neural Networks

Activation Functions

sigmoid, leaky ReLU, tanh, maxout, ReLU, ELU, …
Traditional neural networks used the sigmoid, but it has serious problems:
1. Saturated neurons kill the gradient (the same holds for tanh)
2. Its outputs are not zero-centered
3. exp() is somewhat expensive to compute
ReLU, by contrast, has clear advantages:
1. It does not saturate (in the positive region)
2. It is very cheap to compute
3. It converges much faster in practice
4. It is more plausible from a neurobiological point of view
Minor flaws: its output is not zero-centered, and it does not activate for $x < 0$
There are also leaky ReLU and other variants; they are used less often, so we only sketch them briefly in the code below.
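
For reference, a minimal numpy sketch of these activations (the leaky-ReLU slope alpha = 0.01 is an assumed default, not specified in the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # saturates for large |x|; output in (0, 1), not zero-centered

def tanh(x):
    return np.tanh(x)                      # zero-centered, but still saturates

def relu(x):
    return np.maximum(0, x)                # cheap; no saturation for x > 0, but no gradient for x < 0

def leaky_relu(x, alpha=0.01):             # alpha = 0.01 is an assumed default slope
    return np.where(x > 0, x, alpha * x)
```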

Data Preprocessing

Zero-centering and normalization: $y = \dfrac{x - \mu}{\sigma}$
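
A sketch of per-feature zero-centering and normalization in numpy, assuming X is an [N x D] data matrix:

```python
import numpy as np

def preprocess(X, eps=1e-8):
    mu = X.mean(axis=0)                    # per-feature mean
    sigma = X.std(axis=0)                  # per-feature standard deviation
    return (X - mu) / (sigma + eps)        # eps guards against constant features
```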

Weight Initialization

Plain Gaussian initialization with small random numbers does not work for deep networks.
Xavier initialization: $w \sim N(\mu, \sigma^2)$ with $w \in \mathbb{R}^{fan_{in} \times fan_{out}}$, $\mu = 0$, $\sigma = \frac{1}{\sqrt{fan_{in}}}$; it does not work well with ReLU.
He et al.: $w \sim N(\mu, \sigma^2)$ with $w \in \mathbb{R}^{fan_{in} \times fan_{out}}$, $\mu = 0$, $\sigma = \sqrt{\frac{2}{fan_{in}}}$.
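
In numpy the two recipes amount to the following sketch, where fan_in and fan_out are the layer's input and output dimensions:

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Xavier: std = 1/sqrt(fan_in); suited to tanh-like units, too small for ReLU
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

def he_init(fan_in, fan_out):
    # He et al.: std = sqrt(2/fan_in); compensates for ReLU zeroing half its inputs
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```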

Batch Normalization

With batch normalization, plain Gaussian initialization works fine.

$\hat{x} = \dfrac{x - E(x)}{\sqrt{Var(x)}}$

It is usually inserted between the CONV/FC layer and the ReLU, i.e. [CONV + BN + ReLU + pool] or [FC + BN + ReLU + pool].
Problem: do we necessarily want a unit gaussian input to a tanh layer?
A: Most of the probability mass of $N(0, 1)$ lies in $[-3, 3]$, and since $\tanh(3) = 0.995$ and $\tanh(2) = 0.964$, the gradient is already close to vanishing there; it is therefore better to shrink $\sigma$ further.
In actual use, the batch statistics are recomputed at every iteration, and a learned scale and shift are applied:

$\hat{x} = \dfrac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$
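
A sketch of the training-time forward pass for one mini-batch (gamma and beta are the learned scale and shift; the running statistics used at test time are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: [N, D] mini-batch; normalize each feature over the batch dimension
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # approximately unit-gaussian per feature
    return gamma * x_hat + beta             # learned scale and shift
```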

Benefits

  • Improves gradient flow through the network
  • Allows higher learning rates
  • Reduces the strong dependence on initialization
  • Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
Babysitting the Learning Process
  1. Preprocess the data
  2. Choose the architecture
  3. Double check that the loss is reasonable (see the sanity-check sketch after this list)
  4. Try training. Make sure that you can overfit a very small portion of the training data; start with small regularization and find a learning rate that makes the loss go down
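
For example, a softmax classifier over C classes with near-zero weights and no regularization should start at a loss of about $-\log(1/C)$; a self-contained sketch of that check on synthetic data:

```python
import numpy as np

# Sanity check: with near-zero weights and reg = 0, the softmax loss over
# C classes should start near -log(1/C) (~2.303 for C = 10).
N, D, C = 100, 3072, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = 0.0001 * np.random.randn(D, C)                       # small random initialization

scores = X.dot(W)
scores -= scores.max(axis=1, keepdims=True)              # numerical stability
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(N), y]).mean()
print('initial loss %.3f, expected ~%.3f' % (loss, -np.log(1.0 / C)))
```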
Hyperparameter Optimization

Cross-validate in stages: first a coarse search over wide ranges, then a finer search around the best results.
If the cost is ever > 3 * original cost, break out early
It's best to optimize hyperparameters such as the learning rate in log space (see the random-search sketch below).
Q: But this best cross-validation result is worrying. Why?
A: A big gap between training accuracy and testing accuracy means we are overfitting.
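
A sketch of the coarse random search in log space; the sampling ranges and the `train_and_eval` helper are assumptions for illustration, not part of the lecture:

```python
import numpy as np

results = []
for _ in range(100):
    lr  = 10 ** np.random.uniform(-6, -1)     # learning rate sampled log-uniformly
    reg = 10 ** np.random.uniform(-5, 0)      # regularization strength, also log-uniform
    val_acc = train_and_eval(lr=lr, reg=reg, num_epochs=1)   # hypothetical helper; short, coarse runs first
    results.append((val_acc, lr, reg))
results.sort(reverse=True)                     # then zoom in around the best region for the fine stage
```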
