Deep Learning Specialization 2: Hyperparameter tuning, Regularization and Optimization - Note 1

These notes simply organize and consolidate the content of the course videos and assignments by topic; nothing here is original.

1. Setting up your Machine Learning Application

  1. Train/Dev/Test Sets: split the dataset, with all splits drawn from the same distribution; the train set is used for training, the dev set for tuning hyperparameters, and the test set for the final evaluation.
  2. Concept: Bayes error is the lowest error rate achievable on the task, often approximated by human-level performance (e.g., the accuracy of the human labels in the dataset).
  3. High Bias (underfitting)
    • more hidden layers
    • more hidden units
  4. High Variance (overfitting)
    • more data
    • regularization

2. Regularizing your neural network

2.1 L2-Regularization

L2 regularization is the usual choice. It suppresses overly large weights (weight decay), i.e., it discourages individual weights from becoming extreme, which would mean the model relies too heavily on a few features.
$\text{cost} + \frac{\lambda}{2m} \sum_{l=1}^{L} \|W^{[l]}\|_F^2$
The code is actually quite simple:

# sum of squared weights over all layers, scaled by lambda / (2 * m)
L2_regularization_cost = (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) * lambd / (2 * m)
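
L2 regularization also changes the backward pass: each weight gradient picks up an extra (lambd / m) * W term from differentiating the penalty. A rough sketch, assuming dW1 ... dW3 were first computed from the cross-entropy part alone:

# add the derivative of the L2 term to each weight gradient
dW1 = dW1 + (lambd / m) * W1
dW2 = dW2 + (lambd / m) * W2
dW3 = dW3 + (lambd / m) * W3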

2.2 Dropout Regularization

For every layer except the input and output layers, randomly drop hidden units so the network cannot depend too heavily on any single very informative feature. The effect is again that the weights tend toward smaller values (weight decay).

There is no formula for this one.

Note the standard pattern in the implementation (inverted dropout): the final scaling step brings the expected activation back to its no-dropout level, which removes any effect on the prediction stage.

# Forward pass
# Step 1: create the dropout mask
D1 = np.random.rand(A1.shape[0], A1.shape[1])
D1 = D1 < keep_prob
# Step 2: shut down some neurons of A1
A1 = A1 * D1
# Step 3: scale the value of neurons that haven't been shut down
A1 = A1 / keep_prob
# Backward pass: apply the same mask, then the same scaling
dA1 = dA1 * D1
dA1 = dA1 / keep_prob

Notes (quoted from the assignment):

  1. Dropout is a regularization technique.
  2. You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
  3. Apply dropout both during forward and backward propagation.
  4. During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
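
Putting these pieces together, here is a minimal self-contained sketch of inverted dropout for a 2-layer network (the function name, the training flag, and the ReLU/sigmoid choices are illustrative assumptions, not the assignment's exact code):

import numpy as np

def forward_with_dropout(X, W1, b1, W2, b2, keep_prob=0.5, training=True):
    # Hidden layer: linear -> ReLU
    Z1 = np.dot(W1, X) + b1
    A1 = np.maximum(0, Z1)
    D1 = None
    if training:
        # Inverted dropout: mask, shut neurons down, rescale by keep_prob
        D1 = np.random.rand(A1.shape[0], A1.shape[1]) < keep_prob
        A1 = A1 * D1 / keep_prob
    # Output layer: linear -> sigmoid (no dropout on the output layer)
    Z2 = np.dot(W2, A1) + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))
    return A2, D1

At test time the function is called with training=False, so the mask is skipped entirely; thanks to the /keep_prob scaling during training, no further adjustment is needed.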

2.3 Others

Data Augmentation: add more transformed copies of the data; images, for example, can be flipped horizontally, but for general (non-image) data this is usually much harder.

Early Stopping: use the train/dev set loss to decide when to stop training; once the dev set loss stops decreasing, you can stop. The instructor is not a big fan of this approach.
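
A rough sketch of such a loop, assuming hypothetical callables train_one_epoch and dev_loss (not from the course code):

def train_with_early_stopping(train_one_epoch, dev_loss, parameters, max_epochs=100, patience=5):
    # Stop once the dev loss has not improved for `patience` consecutive epochs
    best_loss, best_parameters, wait = float("inf"), parameters, 0
    for epoch in range(max_epochs):
        parameters = train_one_epoch(parameters)   # one pass over the training set (hypothetical helper)
        current = dev_loss(parameters)             # loss on the dev set (hypothetical helper)
        if current < best_loss:
            best_loss, best_parameters, wait = current, parameters, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_parameters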

model                                 train accuracy    test accuracy
3-layer NN without regularization     95%               91.5%
3-layer NN with L2-regularization     94%               93%
3-layer NN with dropout               93%               95%

3. Setting up your optimization problem

3.1 Normalizing Input

$\frac{x - \mu}{\sigma}$
Gradient descent probes every direction with the same step size, so when the scales of different feature dimensions differ widely, learning slows down.
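
A minimal sketch, assuming X_train and X_test are placeholder arrays of shape (number of features, number of examples), as in the course convention; the key point is that mu and sigma come from the training set only and are reused on the test set:

import numpy as np

# placeholder data: 3 features, 100 / 20 examples (illustrative only)
X_train = np.random.randn(3, 100) * 10 + 5
X_test = np.random.randn(3, 20) * 10 + 5

# statistics computed on the training set only
mu = np.mean(X_train, axis=1, keepdims=True)
sigma = np.std(X_train, axis=1, keepdims=True)

# the same mu and sigma are applied to both splits
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma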

3.2 Vanishing / Exploding gradients

This cannot be solved at the root, but choosing the weight-initialization scheme carefully can reduce the risk.

3.3 Weight Initialization for Deep Networks

Zero initialization: fails to “break symmetry”

  • $W^{[l]}$ should be initialized randomly to break symmetry
  • It is OK to initialize the bias $b^{[l]}$ to zeros

Random initialization: break symmetry

  • Initializing weights to very large random values does not work well.
  • Small random values do better.

He Initialization: used with ReLU (Xavier initialization is the analogous scheme for tanh)
$Var(w_i) = \frac{2}{n}$

$W^{[l]} = \mathcal{N}(0, 1) \cdot \sqrt{\frac{2}{n^{[l-1]}}}$
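
As a sketch, He initialization can be implemented as below (layers_dims holds the layer sizes; this mirrors the pattern used in the assignment, but the exact function body here is an assumption):

import numpy as np

def initialize_parameters_he(layers_dims):
    # layers_dims = [n_x, n_h1, ..., n_y]
    parameters = {}
    L = len(layers_dims) - 1                      # number of parameterized layers
    for l in range(1, L + 1):
        # standard normal weights scaled by sqrt(2 / n^{[l-1]})
        parameters["W" + str(l)] = (np.random.randn(layers_dims[l], layers_dims[l - 1])
                                    * np.sqrt(2.0 / layers_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

parameters = initialize_parameters_he([2, 4, 1])   # e.g. 2 inputs, 4 hidden units, 1 output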

Model                                          Train accuracy    Problem/Comment
3-layer NN with zeros initialization           50%               fails to break symmetry
3-layer NN with large random initialization    83%               too large weights
3-layer NN with He initialization              99%               recommended method

3.4 Gradient Check

Basic idea:
$\frac{f(\theta+\epsilon) - f(\theta - \epsilon)}{2\epsilon} \approx g(\theta)$
Method: flatten all the $W^{[l]}$ and $b^{[l]}$ into one big vector $\theta$, so the cost becomes $J(\theta) = J(\theta_1, \theta_2, \ldots)$.
$d\theta_\text{approx}[i] = \frac{J(\theta_1, \theta_2, \ldots, \theta_i + \epsilon, \ldots) - J(\theta_1, \theta_2, \ldots, \theta_i - \epsilon, \ldots)}{2\epsilon} \approx d\theta[i] = \frac{\partial J}{\partial \theta_i}$

$\text{Check: } \frac{\|d\theta_\text{approx} - d\theta\|_2}{\|d\theta_\text{approx}\|_2 + \|d\theta\|_2} < 10^{-7}$

It is still fairly intuitive.
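
A minimal sketch of the whole procedure on a toy cost function (the function names and the quadratic example are assumptions, not the assignment's code):

import numpy as np

def gradient_check(J, grad, theta, epsilon=1e-7):
    # Two-sided numerical approximation of each partial derivative
    grad_approx = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        theta_plus, theta_minus = np.copy(theta), np.copy(theta)
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    # Relative difference between analytic and numerical gradients
    grad_analytic = grad(theta)
    numerator = np.linalg.norm(grad_analytic - grad_approx)
    denominator = np.linalg.norm(grad_analytic) + np.linalg.norm(grad_approx)
    return numerator / denominator

# Toy check: J(theta) = sum(theta_i^2), so dJ/dtheta_i = 2 * theta_i
theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(lambda t: np.sum(t ** 2), lambda t: 2 * t, theta))   # far below 1e-7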
