deeplearning course-02-01: Practical aspects of deep learning

simonNada

Article link: https://blog.csdn.net/monkey3233/article/details/79568198

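I. Setting up your machine learning application

1. Train/Dev/Test sets

Make sure the dev and test sets come from the same distribution (and ideally the training set as well).
The data is usually split into a training set, a development (dev) set, and a test set.
Typical proportions:

| Dataset size | train | dev | test |
| --- | --- | --- | --- |
| Small dataset | 60% | 20% | 20% |
| Large dataset | 98% | 1% | 1% |
| Very large dataset | 99.5% | 0.25% | 0.25% |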
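
As a rough illustration of the split described above, here is a minimal sketch. It assumes examples are stored as rows of a NumPy array; the function name `split_dataset` and its parameters are hypothetical and not part of the course material.

```python
import numpy as np

def split_dataset(X, y, dev_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into train/dev/test by the given fractions."""
    m = X.shape[0]                         # number of examples (rows)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)              # one shared shuffle for X and y
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    n_train = m - n_dev - n_test
    train_idx = perm[:n_train]
    dev_idx = perm[n_train:n_train + n_dev]
    test_idx = perm[n_train + n_dev:]
    return ((X[train_idx], y[train_idx]),
            (X[dev_idx], y[dev_idx]),
            (X[test_idx], y[test_idx]))

# Small-dataset case: 60% / 20% / 20%
X = np.arange(200).reshape(100, 2)
y = np.arange(100)
train, dev, test = split_dataset(X, y)
print(train[0].shape, dev[0].shape, test[0].shape)
```

For a large dataset the same function could be called with, for example, `dev_frac=0.01, test_frac=0.01`. Shuffling before cutting is one simple way to keep all three sets on the same distribution.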

2. Bias and Variance

Bias and variance:
- High bias: underfitting
- High variance: overfitting
How to tell them apart:
- The baseline for comparison is the human-level error rate
- Train set error high, dev set error about equally high: high bias
- Train set error low, dev set error high: high variance (overfitting)
- Train set error high, dev set error much higher: both high bias and high variance
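
The rules above can be sketched as a small helper. This is only an illustration: the function name `diagnose`, the `base_err` argument (standing in for human-level error), and the 2% tolerance are assumptions chosen for the example, not values from the notes.

```python
def diagnose(train_err, dev_err, base_err=0.0, tol=0.02):
    """Return a rough bias/variance label from error rates in [0, 1]."""
    high_bias = (train_err - base_err) > tol       # train error far above the baseline
    high_variance = (dev_err - train_err) > tol    # large gap from train to dev
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"

# Illustrative error rates (train, dev)
print(diagnose(0.01, 0.11))    # high variance
print(diagnose(0.15, 0.16))    # high bias
print(diagnose(0.15, 0.30))    # high bias and high variance
print(diagnose(0.005, 0.01))   # low bias and low variance
```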

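3. Basic recipe for machine learning

Start by training, with the goal of bringing the train error down close to the human error rate.
If the model shows high bias (underfitting), try a bigger network or a longer training time.
Once performance on the training set is good, evaluate on the dev set.
If the dev set result is poor, the model has high variance; options include training on more data, regularization, or a more suitable network architecture.
Iterate until the model has both low bias and low variance.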

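II. Regularizing your neural network

1. Regularization

The purpose of regularization is to reduce overfitting, which typically appears when the network is large and the data is insufficient.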

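2. Why regularization reduces overfitting

Regularization:
- weakens the influence of some neurons, which effectively yields a smaller network and therefore less overfitting;
- keeps W small, so Z is small too; each unit then operates in the near-linear region of its activation (for example tanh), the network behaves more like a linear one, and overfitting is reduced.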
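
To make the "W stays small" point concrete, here is a minimal sketch of L2 regularization using the standard form $\frac{\lambda}{2m}\sum_l \|W^{[l]}\|_F^2$. The helper names `l2_penalty` and `l2_update` and the numbers are assumptions for illustration only.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """L2 term added to the cost: (lambd / (2 m)) * sum of squared entries of every W."""
    return (lambd / (2.0 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_update(W, dW, lambd, m, learning_rate):
    """One gradient step; the extra (lambd / m) * W term shrinks W (weight decay)."""
    return W - learning_rate * (dW + (lambd / m) * W)

W = np.array([[1.0, -2.0], [3.0, 0.5]])
dW = np.zeros_like(W)          # pretend the data gradient is zero
penalty = l2_penalty([W], lambd=0.7, m=10)
W_new = l2_update(W, dW, lambd=0.7, m=10, learning_rate=0.1)
print(penalty)
print(np.linalg.norm(W_new) < np.linalg.norm(W))   # the weights shrink
```

Even with a zero data gradient the update multiplies W by a factor slightly below 1, which is why L2 regularization is also called weight decay.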

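3. Dropout Regularization

Randomly drop a fraction of the neurons during training.
Note that dropping neurons lowers the expected value of the activations, so divide by keep_prob to compensate (inverted dropout).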

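    import numpy as np

    # Inverted dropout on the layer-3 activations a3; keep_prob is in (0, 1]
    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean keep mask
    a3 = np.multiply(a3, d3)                                   # zero out dropped units
    a3 /= keep_prob                                            # restore the expected value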
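
The snippet above can be wrapped into a reusable function. The sketch below is one possible arrangement, not the course's reference code: the name `dropout_forward` and the `training` flag are assumptions. It also shows that dividing by `keep_prob` keeps the mean activation roughly unchanged, and that nothing is dropped at test time.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Inverted dropout: returns (activations, mask); mask is None when nothing is dropped."""
    if not training or keep_prob >= 1.0:
        return a, None                        # test time: use all units, no scaling
    mask = rng.random(a.shape) < keep_prob    # keep each unit with probability keep_prob
    return a * mask / keep_prob, mask         # rescale so the expectation matches a

rng = np.random.default_rng(0)
a = np.ones((50, 2000))
a_drop, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
a_test, _ = dropout_forward(a, keep_prob=0.8, rng=rng, training=False)
print(a.mean(), a_drop.mean())    # the two means should be close
```

The returned mask would be reused in the backward pass so that the same units are dropped for the gradients.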

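4. Understanding dropout

Dropout randomly disables some of a unit's inputs, so the unit cannot rely on any single input feature and has to spread its weights across nearly all of them. Spreading the weights out shrinks their squared norm, which gives a regularizing effect similar to L2 regularization.
For layers with many parameters (more prone to overfitting), use a higher drop rate (lower keep_prob); for simple layers, use a lower drop rate.
With dropout the cost function J is no longer well defined, so the plot of cost against iterations becomes unreliable. Ng's approach is to train without dropout first (keep_prob = 1), confirm that the cost decreases, and then turn dropout on.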

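5. Other regularization methods

Data augmentation: enlarge the dataset, for example by flipping images horizontally, rotating them, or distorting character images.
Early stopping: stop training once the dev set error stops improving.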
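
As an illustration of early stopping, the sketch below picks the epoch to keep from a list of dev set errors. The function name `early_stopping_epoch`, the `patience` rule, and the error values are all assumptions made for the example; the notes do not specify a stopping criterion.

```python
def early_stopping_epoch(dev_errors, patience=2):
    """Return the index of the epoch whose parameters would be kept."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(dev_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err   # new best dev error
        elif epoch - best_epoch >= patience:
            break                               # no improvement for `patience` epochs
    return best_epoch

# Made-up dev errors per epoch: they improve, then start rising
dev_errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.21, 0.16]
print(early_stopping_epoch(dev_errors))   # 3
```

Training stops at epoch 5, so the later value 0.16 is never seen. That is the trade-off of this rule: it saves compute but may stop before a later improvement.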

III. Setting up your optimization problem

1. Normalizing inputs

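- Subtract the mean, then divide by the standard deviation, using the statistics computed on the training set for the dev and test sets as well.
- Normalization speeds up gradient descent. Without it the cost surface is elongated: progress along the flat directions is slow and many iterations may be needed to reach the optimum. After normalization, gradient descent can take larger steps toward the optimum.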
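
A minimal sketch of input normalization, assuming examples are stored as rows. The function names and the small `eps` guard against zero variance are assumptions for illustration.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps     # eps avoids division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply the training-set statistics to any split."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
# Two features on very different scales
X_train = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(1000, 2))
X_test = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(200, 2))

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)   # same mu and sigma as for training
print(X_train_n.mean(axis=0), X_train_n.std(axis=0))
```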

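2. Vanishing/Exploding gradients

In a very deep network, activations and gradients involve the product of many weight matrices, so they can grow or shrink exponentially with depth. For example, with linear activations and zero biases: $\hat{y} = W^{[L]}W^{[L-1]}W^{[L-2]}\cdots W^{[1]}x$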
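
A small numerical illustration of this effect, under the simplifying assumptions of linear activations, zero biases, and every layer's weight matrix equal to a scaled identity (the helper name `output_norm` is hypothetical):

```python
import numpy as np

def output_norm(scale, depth, n=4):
    """Norm of the output after `depth` linear layers with W = scale * I."""
    W = scale * np.eye(n)
    a = np.ones(n)
    for _ in range(depth):
        a = W @ a                 # one linear layer, no bias, no nonlinearity
    return np.linalg.norm(a)

for scale in (1.5, 0.5):
    print(scale, output_norm(scale, depth=50))
```

With 50 layers, a scale of 1.5 gives a norm on the order of 1e9 (exploding), while a scale of 0.5 gives a norm on the order of 1e-15 (vanishing). Gradients are affected in the same way because they pass through the same matrices.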

3. Weight Initialization for Deep Networks

A common initialization (suited to ReLU, also known as He initialization):

$W^{[l]} = np.random.randn(shape) * np.sqrt(\frac{2}{n^{[l-1]}})$

For tanh, Xavier initialization:

$W^{[l]} = np.random.randn(shape) * np.sqrt(\frac{1}{n^{[l-1]}})$

A variant proposed by Yoshua Bengio and colleagues:

$W^{[l]} = np.random.randn(shape) * np.sqrt(\frac{2}{n^{[l-1]}+n^{[l]}})$

Note that the samples come from `np.random.randn` (standard normal), not `np.random.rand` (uniform on [0, 1)), so the weights are centered at zero.
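
The three scaling rules above can be sketched in one helper. The function name `init_weights` and the method labels `"he"`, `"xavier"`, and `"bengio"` are hypothetical names that simply follow the wording of these notes.

```python
import numpy as np

def init_weights(n_prev, n_curr, method="he", rng=None):
    """Return (W, b) for a layer with n_prev inputs and n_curr units."""
    rng = np.random.default_rng() if rng is None else rng
    if method == "he":            # ReLU: variance 2 / n_prev
        var = 2.0 / n_prev
    elif method == "xavier":      # tanh: variance 1 / n_prev
        var = 1.0 / n_prev
    elif method == "bengio":      # variance 2 / (n_prev + n_curr)
        var = 2.0 / (n_prev + n_curr)
    else:
        raise ValueError(f"unknown method: {method}")
    W = rng.standard_normal((n_curr, n_prev)) * np.sqrt(var)  # zero-mean Gaussian
    b = np.zeros((n_curr, 1))                                 # zero biases are fine
    return W, b

W, b = init_weights(500, 400, method="he", rng=np.random.default_rng(0))
print(W.shape, W.std(), np.sqrt(2.0 / 500))   # empirical vs. target std
```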

4. Numerical approximation of gradients

Use the two-sided (centered) difference, whose error is $O(\epsilon^2)$, to approximate the derivative $g(\theta)$:

$g(\theta) \approx \frac{f(\theta + \epsilon)-f(\theta - \epsilon)}{2\epsilon}$
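
A quick worked check of the formula, using $f(\theta) = \theta^3$ at $\theta = 1$ with $\epsilon = 0.01$ (the true derivative is 3). The function names are assumptions for illustration; the one-sided version is included only for comparison.

```python
def two_sided_derivative(f, theta, eps=1e-2):
    """Centered difference: error shrinks like eps**2."""
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)

def one_sided_derivative(f, theta, eps=1e-2):
    """Forward difference: error shrinks only like eps."""
    return (f(theta + eps) - f(theta)) / eps

def cube(t):
    return t ** 3

two_sided = two_sided_derivative(cube, 1.0)
one_sided = one_sided_derivative(cube, 1.0)
print(two_sided)   # about 3.0001
print(one_sided)   # about 3.0301
```

The two-sided estimate is off by about 0.0001 while the one-sided estimate is off by about 0.03, which is why gradient checking uses the two-sided form.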

5. Gradient checking

For each $i$:

$d\theta_{approx}[i] = \frac{J(\theta_1,\theta_2,\ldots,\theta_i+\epsilon,\ldots)- J(\theta_1,\theta_2,\ldots,\theta_i-\epsilon,\ldots)}{2\epsilon}$

Check:

$distance = \frac{||d\theta_{approx}-d\theta||_2}{||d\theta_{approx}||_2 + ||d\theta||_2}$

With $\epsilon = 10^{-7}$:

- $distance$ around $10^{-7}$ or smaller: great
- $distance$ around $10^{-5}$: check the components carefully
- $distance$ around $10^{-3}$ or larger: very likely a bug
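
The procedure above can be sketched as follows, assuming all parameters have already been flattened into a 1-D vector `theta`. The function name `gradient_check` and the toy cost $J(\theta) = \sum_i \theta_i^2$ are assumptions for illustration, not the assignment's code.

```python
import numpy as np

def gradient_check(J, dtheta, theta, eps=1e-7):
    """Relative distance between the analytic gradient dtheta and a numerical estimate."""
    dtheta_approx = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        plus = theta.astype(float).copy()
        minus = theta.astype(float).copy()
        plus[i] += eps                      # nudge only component i
        minus[i] -= eps
        dtheta_approx[i] = (J(plus) - J(minus)) / (2.0 * eps)
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator

def J(theta):
    return np.sum(theta ** 2)               # toy cost; its gradient is 2 * theta

theta = np.array([1.0, -2.0, 3.0])
good = gradient_check(J, 2.0 * theta, theta)   # correct gradient
bad = gradient_check(J, 3.0 * theta, theta)    # deliberately wrong gradient
print(good, bad)
```

The correct gradient gives a very small distance, while the wrong one gives a distance near 0.2. The sketch does not guard against a zero denominator, which would occur if both gradients were exactly zero.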

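6. Gradient checking implementation notes

Do not run gradient checking during training; use it only for debugging.
If the $db$ entries of $d\theta$ are far from the approximation while the $dw$ entries are close, the bug is probably in the computation of $db$.
Remember to include the regularization term when computing the cost and its gradient.
Gradient checking does not work together with dropout; turn dropout off (keep_prob = 1.0) while checking.
After random initialization, it may work better to run a few training iterations first and then run gradient checking.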

Assignment summary

Parameter initialization:
- The weights $W^{[l]}$ should be initialized randomly to break symmetry.
- It is however okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly.
- Initializing weights to very large random values does not work well.
- Initializing with small random values should do better. The important question is: how small should these random values be?
- Different initializations lead to different results.
- Random initialization is used to break symmetry and make sure different hidden units can learn different things.
- Don't initialize to values that are too large.
- He initialization works well for networks with ReLU activations.
Effects of regularization:
- Regularization will help you reduce overfitting.
- Regularization will drive your weights to lower values.
- L2 regularization and Dropout are two very effective regularization techniques.
