Deep Learning Specialization 2: Hyperparameter tuning, Regularization and Optimization - Note 1
These notes simply organize and consolidate the content of the course videos and assignments by topic; nothing here is original.
1. Setting up your Machine Learning Application
- Train/Dev/Test Sets: split the data into subsets drawn from the same distribution; the train set is used for training, the dev set for tuning hyperparameters, the test set for the final evaluation.
- Concept, Bayes error: the lowest error rate any model can reach, bounded by how accurately the dataset is labeled.
- High Bias (underfitting)
  - more hidden layers
  - more hidden units
- High Variance (overfitting)
  - more data
  - regularization
2. Regularizing your neural network
2.1 L2-Regularization
L2 regularization is the usual choice. It suppresses excessive variance (weight decay): it keeps individual weights from becoming extremely large, i.e. it prevents the model from relying too heavily on a few features.
$$\text{cost} + \frac{\lambda}{2m} \sum_{l=1}^{L} ||W^{[l]}||_F^2$$
The code itself is simple (here for a 3-layer network):
```python
L2_regularization_cost = (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) * lambd / (2 * m)
```
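The backward pass changes accordingly: each weight gradient picks up an extra $\frac{\lambda}{m} W^{[l]}$ term (the derivative of the regularization term above). A minimal sketch, assuming dW1..dW3 already hold the un-regularized gradients:

```python
# d/dW of lambd/(2*m) * sum(W**2) is (lambd/m) * W, added to each layer's gradient
dW1 = dW1 + (lambd / m) * W1
dW2 = dW2 + (lambd / m) * W2
dW3 = dW3 + (lambd / m) * W3
```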
2.2 Dropout Regularization
Apart from the input and output layers, randomly drop hidden units in each layer so that the network cannot depend too heavily on any single, very useful feature. The net effect is again that weights tend toward smaller values (weight decay).
No formula here.
The implementation follows a fixed recipe. The final scaling step (inverted dropout) brings the expected value of the activations back to the no-dropout level, so the prediction stage is unaffected.
```python
# Forward propagation
# Step 1: create the dropout mask
D1 = np.random.rand(A1.shape[0], A1.shape[1])
D1 = D1 < keep_prob
# Step 2: shut down some neurons of A1
A1 = A1 * D1
# Step 3: scale the values of the neurons that haven't been shut down (inverted dropout)
A1 = A1 / keep_prob

# Backward propagation: apply the same mask, then the same scaling
dA1 = dA1 * D1
dA1 = dA1 / keep_prob
```
Notes (quoted from the assignment):
- Dropout is a regularization technique.
- You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
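To see the "same expected value" point concretely, here is a small numpy check (my own sketch, not from the assignment): after masking and dividing by keep_prob, the mean activation is roughly unchanged.

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.5
A = np.random.rand(100, 1000)             # stand-in activations

D = np.random.rand(*A.shape) < keep_prob  # dropout mask
A_drop = (A * D) / keep_prob              # inverted dropout

print(A.mean(), A_drop.mean())            # the two means are close
```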
2.3 Others
Data Augmentation: create more data by transforming what you already have; images can be flipped horizontally, for example. For ordinary (non-image) data this usually isn't an option.
Early Stopping: use the train/dev losses to decide when to stop; once the dev loss stops decreasing you can stop. The instructor is not a fan of this approach.
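A minimal patience-based sketch of early stopping; train_one_epoch, eval_dev_loss, and model are hypothetical placeholders, not course code:

```python
import copy

max_epochs = 100
best_dev_loss = float("inf")
patience, bad_epochs = 5, 0

for epoch in range(max_epochs):
    train_one_epoch(model)               # hypothetical training step
    dev_loss = eval_dev_loss(model)      # hypothetical dev-set evaluation
    if dev_loss < best_dev_loss:
        best_dev_loss = dev_loss
        best_params = copy.deepcopy(model.params)  # remember the best weights so far
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:       # dev loss stopped improving
            break
```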
| model | train accuracy | test accuracy |
| --- | --- | --- |
| 3-layer NN without regularization | 95% | 91.5% |
| 3-layer NN with L2-regularization | 94% | 93% |
| 3-layer NN with dropout | 93% | 95% |
3. Setting up your optimization problem
3.1 Normalizing Input
$$\frac{x - \mu}{\sigma}$$
Gradient descent probes every direction with the same step size, so when the feature scales differ a lot across dimensions, learning slows down.
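A small numpy sketch, assuming X_train and X_test are laid out as (n_features, m) like in the course; the key point is that μ and σ are computed on the training set only:

```python
import numpy as np

# X_train, X_test: assumed arrays of shape (n_features, m)
mu = X_train.mean(axis=1, keepdims=True)
sigma = X_train.std(axis=1, keepdims=True)

X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma   # reuse the training-set statistics on the test set
```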
3.2 Vanishing / Exploding gradients
This cannot be fixed at the root, but choosing the weight-initialization scheme carefully reduces the risk.
3.3 Weight Initialization for Deep Networks
Zero initialization: fails to "break symmetry"
- $W^{[l]}$ should be initialized randomly to break symmetry
- It is OK to initialize the bias $b^{[l]}$ to zeros
Random initialization: breaks symmetry
- Initializing weights to very large random values does not work well.
- Small random values do better.
He Initialization: for ReLU (Xavier is the analogous scheme for tanh, using $1/n$ instead of $2/n$)
$$Var(w_i) = \frac{2}{n}$$
$$W^{[l]} = \mathcal{N}(0, 1) * \sqrt{\frac{2}{n^{[l-1]}}}$$
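In code this is the usual randn-times-scaling pattern; a minimal sketch, assuming layer_dims holds the layer sizes:

```python
import numpy as np

def initialize_parameters_he(layer_dims):
    """He initialization: standard-normal weights scaled by sqrt(2 / n of the previous layer)."""
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    * np.sqrt(2.0 / layer_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters
```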
| Model | Train accuracy | Problem/Comment |
| --- | --- | --- |
| 3-layer NN with zeros initialization | 50% | fails to break symmetry |
| 3-layer NN with large random initialization | 83% | too large weights |
| 3-layer NN with He initialization | 99% | recommended method |
3.4 Gradient Check
The basic idea:
$$\frac{f(\theta+\epsilon) - f(\theta - \epsilon)}{2\epsilon} \approx g(\theta)$$
Procedure: flatten all the $W^{[i]}$ and $b^{[i]}$ into a single vector $\theta$, so the cost takes the form $J(\theta) = J(\theta_1, \theta_2, ...)$.
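A sketch of that flattening step, assuming parameters is the usual dict of W1, b1, W2, b2, ... (a simplified stand-in for the assignment's parameter-flattening helper):

```python
import numpy as np

def dictionary_to_vector(parameters):
    """Flatten every W[l] and b[l] into one long column vector theta (consistent key order)."""
    vectors = [parameters[key].reshape(-1, 1) for key in sorted(parameters.keys())]
    return np.concatenate(vectors, axis=0)
```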
$$d\theta_\text{approx}[i] = \frac{J(\theta_1, \theta_2, .., \theta_i + \epsilon, ...) - J(\theta_1, \theta_2, .., \theta_i - \epsilon, ...)}{2\epsilon} \approx d\theta[i] = \frac{\partial J}{\partial \theta_i}$$

$$\text{Check: } \frac{||d\theta_\text{approx} - d\theta||_2}{||d\theta_\text{approx}||_2 + ||d\theta||_2} < 10^{-7}$$
It's quite intuitive.
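A compact sketch of the whole check, assuming J is a callable that returns the cost for a flattened parameter vector theta, and grad is the flattened backprop gradient (both stand-ins for the assignment's flattening plumbing):

```python
import numpy as np

def gradient_check(J, theta, grad, epsilon=1e-7):
    """Compare the analytic gradient `grad` with a two-sided finite-difference estimate."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        theta_plus = np.copy(theta)
        theta_plus[i] += epsilon
        theta_minus = np.copy(theta)
        theta_minus[i] -= epsilon
        grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)

    numerator = np.linalg.norm(grad_approx - grad)
    denominator = np.linalg.norm(grad_approx) + np.linalg.norm(grad)
    return numerator / denominator   # should be below ~1e-7 if backprop is correct
```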