Deep Learning Specialization 2: Hyperparameter tuning, Regularization and Optimization - Note 1
These notes simply organize and consolidate the content of the course videos and assignments by topic; nothing here is original.
1. Setting up your Machine Learning Application
- Train/Dev/Test Sets: split the data into subsets drawn from the same distribution; the train set is used for training, the dev set for tuning hyperparameters, the test set for the final evaluation.
- Concept, Bayes error: the lowest error rate any model can reach, bounded by how accurately the dataset is labeled.
- High Bias (underfitting)
  - more hidden layers
  - more hidden units
- High Variance (overfitting)
  - more data
  - regularization
2. Regularizing your neural network
2.1 L2-Regularization
L2 regularization is the usual choice. It suppresses excessive variance (weight decay): it keeps individual weights from becoming extremely large, i.e. it prevents the model from relying too heavily on a few features.
$$\text{cost} + \frac{\lambda}{2m} \sum_{l=1}^{L} ||W^{[l]}||_F^2$$
The code itself is simple (here for a 3-layer network):
```python
L2_regularization_cost = (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) * lambd / (2 * m)
```
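The backward pass changes accordingly: each weight gradient picks up an extra $\frac{\lambda}{m} W^{[l]}$ term (the derivative of the regularization term above). A minimal sketch, assuming dW1..dW3 already hold the un-regularized gradients:

```python
# d/dW of lambd/(2*m) * sum(W**2) is (lambd/m) * W, added to each layer's gradient
dW1 = dW1 + (lambd / m) * W1
dW2 = dW2 + (lambd / m) * W2
dW3 = dW3 + (lambd / m) * W3
```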
2.2 Dropout Regularization
Apart from the input and output layers, randomly drop hidden units in each layer so that the network cannot depend too heavily on any single, very useful feature. The net effect is again that weights tend toward smaller values (weight decay).
No formula here.
The implementation follows a fixed recipe. The final scaling step (inverted dropout) brings the expected value of the activations back to the no-dropout level, so the prediction stage is unaffected.
```python
# Forward propagation
# Step 1: create the dropout mask
D1 = np.random.rand(A1.shape[0], A1.shape[1])
D1 = D1 < keep_prob
# Step 2: shut down some neurons of A1
A1 = A1 * D1
# Step 3: scale the values of the neurons that haven't been shut down (inverted dropout)
A1 = A1 / keep_prob

# Backward propagation: apply the same mask, then the same scaling
dA1 = dA1 * D1
dA1 = dA1 / keep_prob
```
Notes (quoted from the assignment):
- Dropout is a regularization technique.
- You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
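To see the "same expected value" point concretely, here is a small numpy check (my own sketch, not from the assignment): after masking and dividing by keep_prob, the mean activation is roughly unchanged.

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.5
A = np.random.rand(100, 1000)             # stand-in activations

D = np.random.rand(*A.shape) < keep_prob  # dropout mask
A_drop = (A * D) / keep_prob              # inverted dropout

print(A.mean(), A_drop.mean())            # the two means are close
```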
2.3 Others
Data Augmentation: create more data by transforming what you already have; images can be flipped horizontally, for example. For ordinary (non-image) data this usually isn't an option.
Early Stopping: use the train/dev losses to decide when to stop; once the dev loss stops decreasing you can stop. The instructor is not a fan of this approach.
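A minimal patience-based sketch of early stopping; train_one_epoch, eval_dev_loss, and model are hypothetical placeholders, not course code:

```python
import copy

max_epochs = 100
best_dev_loss = float("inf")
patience, bad_epochs = 5, 0

for epoch in range(max_epochs):
    train_one_epoch(model)               # hypothetical training step
    dev_loss = eval_dev_loss(model)      # hypothetical dev-set evaluation
    if dev_loss < best_dev_loss:
        best_dev_loss = dev_loss
        best_params = copy.deepcopy(model.params)  # remember the best weights so far
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:       # dev loss stopped improving
            break
```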
| model | train accuracy | test accuracy |
| --- | --- | --- |
| 3-layer NN without regularization | 95% | 91.5% |
| 3-layer NN with L2-regularization | 94% | 93% |
| 3-layer NN with dropout | 93% | 95% |
3. Setting up your optimization problem
3.1 Normalizing Input
$$\frac{x - \mu}{\sigma}$$
Gradient descent probes every direction with the same step size, so when the feature scales differ a lot across dimensions, learning slows down.
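A small numpy sketch, assuming X_train and X_test are laid out as (n_features, m) like in the course; the key point is that μ and σ are computed on the training set only:

```python
import numpy as np

# X_train, X_test: assumed arrays of shape (n_features, m)
mu = X_train.mean(axis=1, keepdims=True)
sigma = X_train.std(axis=1, keepdims=True)

X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma   # reuse the training-set statistics on the test set
```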
3.2 Vanishing / Exploding gradients
This cannot be fixed at the root, but choosing the weight-initialization scheme carefully reduces the risk.
3.3 Weight Initialization for Deep Networks
Zero initialization: fails to "break symmetry"
- $W^{[l]}$ should be initialized randomly to break symmetry
- It is OK to initialize the bias $b^{[l]}$ to zeros
Random initialization: breaks symmetry
- Initializing weights to very large random values does not work well.
- Small random values do better.
He Initialization: for ReLU (Xavier is the analogous scheme for tanh, using $1/n$ instead of $2/n$)
$$Var(w_i) = \frac{2}{n}$$
$$W^{[l]} = \mathcal{N}(0, 1) * \sqrt{\frac{2}{n^{[l-1]}}}$$
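In code this is the usual randn-times-scaling pattern; a minimal sketch, assuming layer_dims holds the layer sizes:

```python
import numpy as np

def initialize_parameters_he(layer_dims):
    """He initialization: standard-normal weights scaled by sqrt(2 / n of the previous layer)."""
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    * np.sqrt(2.0 / layer_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters
```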
| Model | Train accuracy | Problem/Comment |
| --- | --- | --- |
| 3-layer NN with zeros initialization | 50% | fails to break symmetry |
| 3-layer NN with large random initialization | 83% | too large weights |
| 3-layer NN with He initialization | 99% | recommended method |
3.4 Gradient Check
The basic idea:
$$\frac{f(\theta+\epsilon) - f(\theta - \epsilon)}{2\epsilon} \approx g(\theta)$$
Procedure: flatten all the $W^{[i]}$ and $b^{[i]}$ into a single vector $\theta$, so the cost takes the form $J(\theta) = J(\theta_1, \theta_2, ...)$.
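A sketch of that flattening step, assuming parameters is the usual dict of W1, b1, W2, b2, ... (a simplified stand-in for the assignment's parameter-flattening helper):

```python
import numpy as np

def dictionary_to_vector(parameters):
    """Flatten every W[l] and b[l] into one long column vector theta (consistent key order)."""
    vectors = [parameters[key].reshape(-1, 1) for key in sorted(parameters.keys())]
    return np.concatenate(vectors, axis=0)
```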
$$d\theta_\text{approx}[i] = \frac{J(\theta_1, \theta_2, .., \theta_i + \epsilon, ...) - J(\theta_1, \theta_2, .., \theta_i - \epsilon, ...)}{2\epsilon} \approx d\theta[i] = \frac{\partial J}{\partial \theta_i}$$

$$\text{Check: } \frac{||d\theta_\text{approx} - d\theta||_2}{||d\theta_\text{approx}||_2 + ||d\theta||_2} < 10^{-7}$$
It's quite intuitive.
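A compact sketch of the whole check, assuming J is a callable that returns the cost for a flattened parameter vector theta, and grad is the flattened backprop gradient (both stand-ins for the assignment's flattening plumbing):

```python
import numpy as np

def gradient_check(J, theta, grad, epsilon=1e-7):
    """Compare the analytic gradient `grad` with a two-sided finite-difference estimate."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        theta_plus = np.copy(theta)
        theta_plus[i] += epsilon
        theta_minus = np.copy(theta)
        theta_minus[i] -= epsilon
        grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)

    numerator = np.linalg.norm(grad_approx - grad)
    denominator = np.linalg.norm(grad_approx) + np.linalg.norm(grad)
    return numerator / denominator   # should be below ~1e-7 if backprop is correct
```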