Andrew Ng. Deep Learning Series - C2 Improving Deep Neural Networks - W1 Practical Aspects of Deep Learning


Learning objectives

  • Recall that different types of initializations lead to different results.
  • Recognize the importance of initialization in complex neural networks.
  • Recognize the difference between train/dev/test sets.
  • Diagnose the bias and variance issues in your model.
  • Learn when and how to use regularization methods such as dropout or L2 regularization.
  • Understand experimental issues in deep learning such as vanishing or exploding gradients and learn how to deal with them.
  • Use gradient checking to verify the correctness of your backpropagation implementation.

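1.Setting up your machine learning application

1.1.Train/dev (validation)/test sets

In practice there are many hyperparameters to choose: the number of layers, the number of units in each layer, the learning rate, the activation functions, the regularization scheme, and so on. There is currently no method that computes these settings directly. You have to iterate through the loop "idea -> code -> experiment -> idea ..." to test each idea and parameter setting, as in the figure below:
[Figure: the idea -> code -> experiment cycle]
We split the complete data set into three parts.
Training set: used to fit the model.
Cross-validation (development) set: also called the dev set, which is the name used in this course. Used to validate hyperparameters and to compare the performance of different settings.
Test set: used to check the final performance of the algorithm (for example, how well it generalizes), so that the evaluation is not biased by the tuning process.
One possible split of the three sets is 6:2:2. If your data set has 1 million examples or more, about 10,000 examples each for the dev set and the test set may be enough to evaluate the classifier.
Note: make sure the dev set and the test set come from the same distribution.
The test set exists to give you an unbiased estimate of the performance of the network you finally choose. If you do not need an unbiased estimate, it is fine to have no test set.
Unbiased estimation means estimating a population parameter from a sample statistic without systematic error: if the expected value of the estimator equals the true value of the parameter, the estimator is called unbiased. Unbiasedness is one criterion for judging the quality of an estimator. In practice it means that over many repetitions the average of the estimates is close to the true value. Unbiased estimators are commonly used in test-score statistics.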
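The split described in this section can be sketched in a few lines of NumPy. This is only an illustration: the function name `split_train_dev_test`, the column-per-example layout, and the toy data are assumptions made for the example, not part of the course code. Shuffling once and slicing the same permutation keeps the dev and test sets on the same distribution.

```python
import numpy as np

def split_train_dev_test(X, Y, dev_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle examples (stored as columns) and split them into train/dev/test."""
    m = X.shape[1]                          # number of examples
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)               # one shuffle shared by all three sets
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]
    return ((X[:, train_idx], Y[:, train_idx]),
            (X[:, dev_idx], Y[:, dev_idx]),
            (X[:, test_idx], Y[:, test_idx]))

# Toy data: 10 examples with 2 features each (hypothetical values)
X = np.arange(20, dtype=float).reshape(2, 10)
Y = (np.arange(10) % 2).reshape(1, 10)
(train_X, train_Y), (dev_X, dev_Y), (test_X, test_Y) = split_train_dev_test(X, Y)
print(train_X.shape, dev_X.shape, test_X.shape)
```

For a very large data set, the same function could be called with much smaller fractions, for example `dev_frac=0.01, test_frac=0.01`.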

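1.2.The bias-variance dilemma

Bias/Variance
The "bias-variance dilemma" is also called the bias-variance trade-off. High bias usually means underfitting. High variance usually means overfitting.

| | low bias & high variance | high bias & low variance | high bias & high variance | low bias & low variance |
|---|---|---|---|---|
| Train set error | 1% | 15% | 15% | 0.5% |
| Dev set error | 11% | 16% | 30% | 1% |
| Characteristic | overfitting | underfitting | overfitting + underfitting | ideal |

Note the third case: a model can have high bias and high variance at the same time.
These readings assume that the Bayes error is close to 0%, i.e. the human-level error (also called the optimal error) is about 0% and a person can separate positive and negative examples almost perfectly.
Use the training-set error to check for high bias, and the gap between the dev-set error and the training-set error to check for high variance.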
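The readings in the table above can be written as a small rule of thumb. The sketch below is an assumption-laden illustration: the function `diagnose` and the tolerance `tol=0.05` are hypothetical choices made here, and the course does not prescribe a fixed threshold.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.05):
    """Return (high_bias, high_variance) from error rates given as fractions.

    high_bias:     training error is far above the Bayes (optimal) error
    high_variance: dev error is far above the training error
    `tol` is an arbitrary illustrative threshold, not a value from the course.
    """
    high_bias = (train_err - bayes_err) > tol
    high_variance = (dev_err - train_err) > tol
    return high_bias, high_variance

# The four columns of the table above
cases = {
    "low bias & high variance": (0.01, 0.11),
    "high bias & low variance": (0.15, 0.16),
    "high bias & high variance": (0.15, 0.30),
    "low bias & low variance": (0.005, 0.01),
}
for name, (train_err, dev_err) in cases.items():
    print(name, diagnose(train_err, dev_err))
```

If the Bayes error is not close to 0% (for example, blurry images that people also misclassify), passing a larger `bayes_err` changes what counts as high bias.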

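1.3.Basic recipe for machine learning

Basic Recipe for Machine Learning
[Figure: the basic recipe for machine learning]
Compared with earlier machine learning, deep learning lets us train a bigger network (which reduces bias) and collect far more data (which reduces variance). Combined with appropriate regularization, this usually makes it possible to address bias and variance without trading one against the other.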

2.Regularizing your neural network

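2.1.Introduction to norms

For each norm: formula, description, and typical use.

L0: $\|x\|_0 = \#(i \mid x_i \neq 0)$
The number of non-zero elements of the vector x.
If the L0 norm is used for regularization, most elements of w are driven to 0, so w becomes sparse. Minimizing the L0 norm searches for the smallest set of useful features.
Use: the problem is NP-hard, so it is rarely used.

L1: $\|x\|_1 = \sum_i |x_i|$
The sum of the absolute values of the elements of x.
Lasso regularization, a convex approximation of L0.
The L1 distance is also called the Manhattan distance. L1 regularization corresponds to the Manhattan distance from the parameter w to the origin (0,0).
The solution of an L1-regularized problem is sparse, so the L1 norm is also called the sparsity-inducing operator. L1 can make the features sparse and remove uninformative ones. For example, when classifying users' movie preferences with 100 features per user, perhaps only a dozen are useful for the classification and most of the others, such as height and weight, are not. The L1 norm can filter them out.
Because the L1 norm is not a smooth function, L1 optimization was hard at first, but with more computing power and many convex optimization algorithms it became practical.
Use: common in machine learning.

L2: $\|x\|_2 = \sqrt{\sum_i x_i^2}$
The square root of the sum of the squared elements of x.
Associated with "ridge regression" and "weight decay". The L2 norm is also called the Euclidean norm (the distance it measures is the Euclidean distance). Its matrix counterpart is the Frobenius norm. A small L2 norm makes every element of w small and close to 0, but unlike the L1 norm it does not push them to exactly 0.
Use: common in deep learning.

L∞: $\|x\|_\infty = \max_i |x_i|$
Measures the largest element of the vector: the largest absolute value among the elements of x.
The Chebyshev distance. It measures distance the way a king moves in chess: the number of moves needed to get from one square to another.
Use: not known to the author.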
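The four norms above can be checked numerically with NumPy. The vector `x` is an arbitrary example chosen here. `np.count_nonzero` is used for L0 because it states the definition directly.

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])

l0 = np.count_nonzero(x)               # number of non-zero elements
l1 = np.linalg.norm(x, ord=1)          # sum of absolute values
l2 = np.linalg.norm(x, ord=2)          # square root of the sum of squares
linf = np.linalg.norm(x, ord=np.inf)   # largest absolute value

print(l0, l1, l2, linf)
```

For this vector the values are 2, 7, 5 and 4 respectively.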

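2.2.Regularization

L2 Regularization

Regularized logistic regression:
$$J(w,b)=\frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}\|w\|_2^2$$

Regularization term for a neural network:
$$\frac{\lambda}{2m}\sum_{l=1}^{L}\|W^{[l]}\|^2=\frac{\lambda}{2m}\sum_{l=1}^{L}\sum_{i=1}^{n^{[l-1]}}\sum_{j=1}^{n^{[l]}}\left(w_{ij}^{[l]}\right)^2$$
$$\|w\|_2^2=\sum_{j=1}^{n_x}w_j^2=w^T w;\quad w\in\mathbb{R}^{n_x},\ b\in\mathbb{R}$$
$\lambda$: the regularization parameter. The larger λ is, the more w as a whole is pushed toward 0. It is a hyperparameter: try a range of values on the cross-validation (dev) set and keep the best one.
$\frac{1}{2m}$: a constant chosen by convention. The 2 in the denominator cancels the factor 2 that appears when the squared term is differentiated. Scaling the penalty by a constant does not change the nature of the minimization.
$\|w\|_2^2$: the squared L2 norm of the parameters. For a weight matrix the corresponding quantity is the squared Frobenius norm. This penalty is called L2 regularization, also known as weight decay.
Why not regularize b?
Answer: w is usually a very high-dimensional parameter vector, especially when variance is high, whereas b is a single number, so regularizing b makes little difference. [My interpretation: b is a constant term and does not cause high variance. High variance comes from overly large parameters on the high-order features of x. The constant is not the cause, so there is no need to constrain it.]
Note: lambda is a reserved word in Python, so code usually names the variable lambd.
L1 Regularization
$$\frac{\lambda}{m}\|w\|_1=\frac{\lambda}{m}\sum_{i=1}^{n_x}|w_i|$$
With L1 regularization w becomes sparse, which some people believe helps compress the model. Andrew Ng's view is that L1 does little for model compression in practice. Others point out that the solution of the loss function can become unstable.
L2 Regularization Backpropagation
The gradient after adding the regularization penalty is:
$$dW^{[l]}=(\text{original term from backpropagation})+\frac{\lambda}{m}W^{[l]}$$
and the update rule stays the same: $W^{[l]}:=W^{[l]}-\alpha\,dW^{[l]}$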
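The penalty and the modified update can be sketched as follows. The helper names `l2_penalty` and `l2_update` and the numbers are hypothetical, chosen for illustration. Setting the backpropagation term to zero isolates the effect of the penalty: each step multiplies W by $(1-\alpha\lambda/m)$, which is why the method is called weight decay.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """lambda / (2m) times the sum of squared entries of every weight matrix."""
    return lambd / (2 * m) * sum(np.sum(np.square(W)) for W in weights)

def l2_update(W, dW_backprop, lambd, m, alpha):
    """One gradient step where dW = (backprop term) + (lambda / m) * W."""
    dW = dW_backprop + lambd / m * W
    return W - alpha * dW

W = np.array([[1.0, -2.0], [0.5, 0.0]])
penalty = l2_penalty([W], lambd=0.7, m=10)

# With a zero backprop gradient only the decay term acts on W
W_new = l2_update(W, np.zeros_like(W), lambd=0.7, m=10, alpha=0.1)
decay_factor = 1 - 0.1 * 0.7 / 10
print(penalty, W_new, decay_factor)
```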

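2.3.Why does regularization reduce overfitting?

[Figure]
High variance: the network has too many layers and too many units.
Suppose we add L2 regularization and set $\lambda$ very large. After training, most entries of the weight matrices W will be close to 0.

[Figure: the network without L2 regularization (left) and with L2 regularization (right)]

With L2 regularization the parameters of many units are very close to 0, which is like removing the influence of those units. With fewer effective units the network behaves like a smaller one, and its state moves from high variance toward high bias. If λ is chosen well, the network ends up in the "just right" regime.
Another explanation of why L2 regularization prevents overfitting:
[Figure]
Take tanh or sigmoid as the activation. With the L2 penalty, w is pushed toward 0. Since z=w*x+b and a=tanh(z), z is also close to 0, and near 0 the activation function is almost linear. A linear function stays linear no matter how many times it is stacked in a network (a high-bias regime). The network output therefore becomes closer to linear and stops fitting points that lie far from the decision boundary, which prevents overfitting.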
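Both halves of this argument can be checked numerically: tanh is close to the identity for small z, and stacking linear layers gives another linear map. The sketch below uses arbitrary example values and no biases to keep it short.

```python
import numpy as np

# 1) Near zero, tanh(z) is almost the linear function z
z_small = np.linspace(-0.1, 0.1, 5)
z_large = np.linspace(-3.0, 3.0, 5)
err_small = np.max(np.abs(np.tanh(z_small) - z_small))
err_large = np.max(np.abs(np.tanh(z_large) - z_large))
print(err_small, err_large)   # tiny for small z, large once tanh saturates

# 2) Two stacked linear layers collapse into a single linear layer
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal((3, 1))
stacked = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(stacked, collapsed))
```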

2.4.Dropout Regularization

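Random-drop regularization, usually called dropout regularization.
[Figure: the full network (left) and the same network with some units dropped (right)]
On every iteration dropout randomly deactivates some units. As in the right-hand figure, this is like training a smaller network, which has a regularizing effect.
Inverted dropout, an example:
Network with layer l=3; keep_prob=0.8 (each unit is kept with probability 0.8, i.e. dropped with probability 0.2)
D3 = np.random.rand(A3.shape[0], A3.shape[1]) < keep_prob  # mask: each entry is True (keep) with probability 0.8, False (drop) with probability 0.2
A3 = np.multiply(A3, D3)  # units where the mask is False get activation 0, i.e. they are deactivated
A3 = A3 / keep_prob  # about 20% of A3 was dropped; dividing by keep_prob keeps the expected value (mean output) of A3 unchanged
Z4 = np.dot(W4, A3) + b4  # because A3 was divided by keep_prob, the expected value of Z4 is unchanged
At prediction time:
Set keep_prob to 1, i.e. no units are dropped, and no rescaling is needed.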
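A runnable version of the example, with a check that the rescaling keeps the mean activation roughly unchanged. The helper name `inverted_dropout`, the layer shape, and the constant activations are assumptions made for the illustration.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Apply inverted dropout to activations A; return the result and the mask."""
    D = rng.random(A.shape) < keep_prob   # True = keep, False = drop
    A_dropped = A * D / keep_prob         # zero the dropped units, rescale the rest
    return A_dropped, D

rng = np.random.default_rng(1)
A3 = np.ones((50, 2000))                  # hypothetical layer-3 activations
A3_dropped, D3 = inverted_dropout(A3, keep_prob=0.8, rng=rng)

print(D3.mean())          # fraction of units kept, close to 0.8
print(A3_dropped.mean())  # close to the original mean of 1.0
```

At prediction time the function is simply not called, which matches setting keep_prob to 1.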

2.5.Understanding Dropout

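Intuition: Can’t rely on any one feature, so have to spread out weights.
For example, several units in the previous layer all feed one unit in the next layer. Because any of the previous-layer units may be dropped at random, the next-layer unit cannot depend on one particular input. It has to keep the weight on each input small and use their combined output. Dropout therefore tends to shrink the sum of squared weights, much like L2 regularization. Dropout is of course not the same as L2 regularization, but it can achieve a similar effect.
[Figure: a network with a different keep_prob for each layer]
One feature of dropout is that different layers can use different keep_prob values. In the network above, layers that overfit easily get keep_prob=0.5 and layers that do not get keep_prob=1.

In theory dropout can also be applied to the input layer x, but in practice this is rarely done. If it is, keep_prob is set close to 1, for example 0.9.
The drawback of per-layer keep_prob values is that they add hyperparameters, so more cross-validation is needed to find a good set.
A common compromise is to apply dropout only to the layers that tend to overfit, or to use one keep_prob for every layer.
In computer vision the input feature space is very large, often with more features than examples (n > m), so overfitting is common and dropout is used very widely.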

2.6.Other regularization methods

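Data augmentation
- Enlarging the data set acts like a regularizer because it reduces overfitting
- Horizontal flips
- Random distortions
Early stopping
- Stop gradient descent before the cost function has been fully minimized
- Because the optimization of the cost function is deliberately left unfinished, the method mixes two concerns (minimizing the cost and avoiding overfitting), which makes the problem harder to reason about. The author notes that it has largely been replaced by L2 regularization.
- Advantage: a single run of gradient descent passes through small, medium and large values of w, so a suitable w can be found without the many runs needed to search over $\lambda$ for L2 regularization.
[Figure]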
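Early stopping can be sketched as a rule applied to the dev-set error recorded after each epoch. The function `early_stopping_epoch`, the `patience` parameter, and the error values below are hypothetical and only illustrate the idea; the notes do not specify a stopping criterion.

```python
def early_stopping_epoch(dev_errors, patience=2):
    """Return (best_epoch, best_error), stopping once the dev error has not
    improved for `patience` consecutive epochs."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(dev_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err      # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                  # no improvement for too long
    return best_epoch, best_err

# Hypothetical dev-set errors per epoch: improves, then starts to rise
dev_errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.21, 0.16]
best_epoch, best_err = early_stopping_epoch(dev_errors, patience=2)
print(best_epoch, best_err)
```

With these numbers training stops at epoch 5 and keeps the parameters from epoch 3. The later value 0.16 is never reached, which illustrates the cost of stopping before the optimization has finished.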

3.Setting up your optimization problem

3.1.Normalizing inputs

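Normalizing the input data speeds up training.
$$\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)},\qquad \sigma^2=\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)^2$$
[Figure: the data distribution at three stages of normalization]
Left: the distribution of the un-normalized data.
Middle: the distribution after the mean $\mu$ of each feature has been subtracted from $x_1, x_2$.
Right: the distribution after also dividing by the standard deviation, $x := \frac{x-\mu}{\sigma}$.
[Figure: the cost function over the parameter space]
Left: the parameter space without normalization.
Right: the parameter space after normalization.
Once all input features are normalized, the parameters for different features have similar ranges. This reduces oscillation during learning and reaches the minimum faster, which is why training speeds up.
Note: the training, dev and test sets must be normalized in the same way, using the same $\mu$ and $\sigma$ computed on the training set.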
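A minimal sketch of this procedure, assuming examples are stored as columns. The helper names `fit_normalizer` and `normalize`, the small `eps` guard against division by zero, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)   # uses 1/m, as in the formula above
    return mu, sigma + eps                       # eps avoids dividing by zero

def normalize(X, mu, sigma):
    """Apply a previously fitted normalization to any data set."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
loc, scale = [[5.0], [-3.0]], [[10.0], [0.1]]    # two features on very different scales
X_train = rng.normal(loc=loc, scale=scale, size=(2, 1000))
X_test = rng.normal(loc=loc, scale=scale, size=(2, 200))

mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
X_test_norm = normalize(X_test, mu, sigma)       # same mu and sigma as training

print(X_train_norm.mean(axis=1), X_train_norm.std(axis=1))
```

The test set is not exactly zero-mean after this step, which is expected: it is mapped with the training statistics so that both sets go through the same transformation.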

3.2.Vanishing/Exploding gradients

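To simplify the problem, assume a multi-layer network with a single unit in each layer and only a weight w (no b), as in the figure below:
[Figure: a deep network with one unit per layer]
$a_1=g(a_0 w_1),\ a_2=g(a_1 w_2),\ a_3=g(a_2 w_3),\ a_4=g(a_3 w_4),\ a_5=g(a_4 w_5),\ \hat{y}=g(a_5 w_6)$
$$\frac{\partial L(\hat{y},y)}{\partial w_1}=\frac{\partial L}{\partial a_5}\cdot\frac{\partial a_5}{\partial a_4}\cdot\frac{\partial a_4}{\partial a_3}\cdot\frac{\partial a_3}{\partial a_2}\cdot\frac{\partial a_2}{\partial a_1}\cdot\frac{\partial a_1}{\partial w_1}$$
With sigmoid or tanh activations the derivative of a=g(z) is usually less than 1, so the product shrinks as the number of layers grows. With a linear activation, a1=a0*w1 and so on, giving $\hat{y}=w_6 w_5 w_4 w_3 w_2 w_1 a_0$ and therefore $\frac{\partial \hat{y}}{\partial w_1}=w_6 w_5 w_4 w_3 w_2 a_0$. If the weights are initialized above 1 this gradient is amplified exponentially with depth, and if they are below 1 it shrinks exponentially. These are the exploding and vanishing gradient problems.
[Figure: learning speed of each layer in a 4-layer network]
The figure above shows the speed of gradient descent in the different layers of a 4-layer network. The closer a layer is to the output, the faster it learns, and the further from the output, the slower it learns.
Summary: in a deep network the learning speed differs greatly between layers. Layers near the output learn well, while layers near the input learn very slowly. Sometimes, even after long training, the weights of the first few layers are almost the same as their random initial values. The author's view is that vanishing and exploding gradients are rooted in the backpropagation training rule itself, an inherent weakness. As an aside, the author adds that Hinton proposed capsules in order to move away from backpropagation altogether, and that it would be a revolution if this were widely adopted.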
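The exponential effect is easy to see numerically in the linear one-unit-per-layer chain. The sketch below multiplies identical weights, a simplification chosen here for illustration; `gradient_factor` is a hypothetical helper name.

```python
import numpy as np

def gradient_factor(w, depth):
    """Product of `depth` identical weights: the factor that scales the gradient
    in a linear chain with one unit per layer."""
    return float(np.prod(np.full(depth, w)))

shrinking = gradient_factor(0.9, 50)   # weights slightly below 1
stable = gradient_factor(1.0, 50)      # weights exactly 1
growing = gradient_factor(1.1, 50)     # weights slightly above 1

print(shrinking, stable, growing)
```

After 50 layers, weights of 0.9 leave roughly half a percent of the signal while weights of 1.1 amplify it by more than a hundred, even though both are close to 1.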

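3.3.Weight initialization for deep networks

Weight Initialization for Deep Networks
The idea is as follows. The input data has already been brought onto a similar distribution by normalizing the inputs. The weights are drawn from a zero-mean, unit-variance distribution and then scaled by the layer's input size, so that the output z is neither too large (far above 1) nor too small (too close to 0) and has a similar distribution in every layer. This does not fully solve exploding and vanishing gradients, but it helps.
[Figure: a single neuron with n inputs]
For a single neuron, $z=w_1x_1+w_2x_2+\dots+w_nx_n$. As n grows we want each $w_i$ to get smaller so that z stays in a reasonable range.

| Activation / method | Initialization |
|---|---|
| tanh | np.random.randn(shape) * np.sqrt(1 / n[l-1]) |
| ReLU | np.random.randn(shape) * np.sqrt(2 / n[l-1]) |
| Other variant | np.random.randn(shape) * np.sqrt(2 / (n[l-1] + n[l])) |

Here n[l-1] is the number of units in the previous layer and n[l] the number of units in the current layer.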
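The effect of the scaling factor on z can be observed directly. This is a sketch with arbitrary layer sizes (`n_prev`, `n_units`) chosen for the example. It compares unscaled standard-normal weights with the $\sqrt{1/n^{[l-1]}}$ scaling from the table.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_units = 1000, 500                         # hypothetical layer sizes

x = rng.standard_normal((n_prev, 1))                # normalized input
W_unscaled = rng.standard_normal((n_units, n_prev))
W_scaled = W_unscaled * np.sqrt(1.0 / n_prev)       # scale by sqrt(1 / n[l-1])

z_unscaled = W_unscaled @ x
z_scaled = W_scaled @ x

print(z_unscaled.std())   # grows like sqrt(n_prev), about 30 here
print(z_scaled.std())     # stays close to 1
```

Without scaling, the spread of z grows with the square root of the number of inputs, which would push tanh or sigmoid units into saturation. With scaling, z stays on the same order as the inputs.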

3.4.Numerical approximation of gradients

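Numerical approximation of gradients, used for gradient checking.
[Figure]
The derivative, written in two-sided form:

$$f'(\theta)=\lim_{\epsilon\to 0}\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$$

To approximate the gradient, take $f(\theta)=\theta^3$ and substitute $\theta=1$, $\epsilon=0.01$ into the formula above. This gives $approx=3.0001\approx 3$, while the exact derivative is $f'(\theta)=3\theta^2=3$. The approximation is very close to the true derivative.
The approximation used here is the two-sided difference $\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$. It can be shown that its error relative to the true derivative is of order $O(\epsilon^2)$. For the one-sided difference $\frac{f(\theta+\epsilon)-f(\theta)}{\epsilon}$ the error is of order $O(\epsilon)$. With $\epsilon=0.01$ or smaller, the two-sided error is much smaller.
So the two-sided difference is used when approximating gradients.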
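The worked example can be reproduced in a few lines. The function names `two_sided`, `one_sided` and `cube` are chosen here for illustration.

```python
def two_sided(f, theta, eps):
    """Two-sided (central) difference, error of order eps**2."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

def one_sided(f, theta, eps):
    """One-sided (forward) difference, error of order eps."""
    return (f(theta + eps) - f(theta)) / eps

def cube(theta):
    return theta ** 3          # exact derivative is 3 * theta**2

approx_two = two_sided(cube, 1.0, 0.01)   # about 3.0001
approx_one = one_sided(cube, 1.0, 0.01)   # about 3.0301
print(approx_two, approx_one)
```

The exact derivative at $\theta=1$ is 3, so the two-sided error is about 0.0001 and the one-sided error about 0.03, matching the $O(\epsilon^2)$ and $O(\epsilon)$ orders.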

3.5.Gradient checking

Concatenate the parameters of every layer, $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$, in order into one big vector θ.
The cost function becomes $J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = J(\theta) = J(\theta_1, \theta_2, \dots)$
Concatenate $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ in the same order into one big vector dθ, so that dθ.shape = θ.shape.
The Euclidean distance is used to compare $d\theta_{approx}$ with dθ:

$$error=\frac{\|d\theta_{approx}-d\theta\|_2}{\|d\theta_{approx}\|_2+\|d\theta\|_2}$$

When the entries of $d\theta_{approx}$ and dθ are large, their difference is large too. Dividing by $\|d\theta_{approx}\|_2+\|d\theta\|_2$ removes this dependence on scale.
An error on the order of $10^{-7}$ is usually taken to mean that the gradient check passed.
Procedure:
1) Initialize all parameters $W^{[l]}$ and $b^{[l]}$ (the biases may start at 0).
2) Concatenate all parameters end to end and reshape them into one big vector θ.
3) For each i:
$$d\theta_{approx}[i]=\frac{J(\theta_1,\dots,\theta_i+\epsilon,\dots)-J(\theta_1,\dots,\theta_i-\epsilon,\dots)}{2\epsilon}$$

Note: go through θ from the start and repeat this for every component. Store the results in a vector $d\theta_{approx}$ with the same shape as θ. Each entry of $d\theta_{approx}$ holds the approximate partial derivative with respect to the corresponding parameter.
4) Run backpropagation as usual and concatenate the returned $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ in order into one big vector dθ.
5) Compare $d\theta_{approx}$ with dθ using the error formula above.
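The procedure can be tried end to end on a model small enough to verify by hand. The sketch below swaps the neural network for a linear model with a mean-squared-error cost, an assumption made here so the example stays self-contained. The names `cost`, `grad` and `gradient_check` are hypothetical. A deliberately wrong gradient (doubled) shows what a failed check looks like.

```python
import numpy as np

def cost(theta, X, y):
    """Mean squared error of a linear model; theta is a flat parameter vector."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)

def grad(theta, X, y):
    """Analytic gradient of `cost`: the quantity being verified."""
    return X.T @ (X @ theta - y) / len(y)

def gradient_check(cost_fn, grad_fn, theta, epsilon=1e-7):
    """Relative error between the analytic and the two-sided numerical gradient."""
    d_theta = grad_fn(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += epsilon                     # perturb only component i
        minus[i] -= epsilon
        d_theta_approx[i] = (cost_fn(plus) - cost_fn(minus)) / (2 * epsilon)
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
theta = rng.standard_normal(3)

error = gradient_check(lambda t: cost(t, X, y), lambda t: grad(t, X, y), theta)
buggy = gradient_check(lambda t: cost(t, X, y), lambda t: 2 * grad(t, X, y), theta)
print(error, buggy)
```

The correct gradient gives a very small relative error, while the doubled gradient gives an error near 1/3, far above any reasonable threshold.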

3.6.Gradient Checking Implementation Notes

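  • Don’t use gradient checking in training, only to debug. Once the check confirms that the gradient computation is correct, turn it off and then start training.
  • If the check reports a large error, compare $d\theta_{approx}[i]$ with dθ[i] component by component to find the layer where the large differences begin, then inspect the gradient code for that layer. Gradient checking tells you where to start looking for the bug.
  • Don’t forget the regularization term. If you use regularization, include it in the cost and gradients that are being checked.
  • Gradient checking does not work with dropout. Set keep_prob to 1 before running the check.
  • Run at random initialization; perhaps again after some training.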

4.Practice Questions

With the inverted dropout technique, at test time:
You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training.

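5.Parameter initialization (programming assignment 1: Initialization Parameters)

5.1.Zero initialization

Initializing W and b to 0 in every layer works very badly. The cost does not decrease at all, and the algorithm does no better than random guessing. In general, initializing all weights to 0 means the network fails to "break symmetry". Every unit in a layer learns the same thing, so it is like training a network with $n^{[l]}=1$ in every layer (a single unit per layer). Such a network is no more powerful than a linear classifier such as logistic regression.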

What you should remember:
- The weights $W^{[l]}$ should be initialized randomly to break symmetry.
- It is however okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly.
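The symmetry problem can be observed on a tiny network. The sketch below is not the assignment's model: it assumes a 2-layer network with a tanh hidden layer, a sigmoid output and zero biases, and `grads_W1` is a hypothetical helper. With zero weights every hidden unit receives the same (here zero) gradient, while small random weights give each unit a different gradient.

```python
import numpy as np

def grads_W1(W1, W2, X, Y):
    """dW1 for a 2-layer net (tanh hidden layer, sigmoid output, zero biases)."""
    m = X.shape[1]
    A1 = np.tanh(W1 @ X)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1)))
    dZ2 = A2 - Y                               # cross-entropy loss with sigmoid output
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)       # backprop through tanh
    return dZ1 @ X.T / m

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))                # 3 features, 8 examples
Y = (rng.random((1, 8)) > 0.5).astype(float)

dW1_zero = grads_W1(np.zeros((4, 3)), np.zeros((1, 4)), X, Y)
dW1_rand = grads_W1(rng.standard_normal((4, 3)) * 0.1,
                    rng.standard_normal((1, 4)) * 0.1, X, Y)

# Each row of dW1 is the gradient for one hidden unit
same_rows_zero = np.allclose(dW1_zero, dW1_zero[0])
same_rows_rand = np.allclose(dW1_rand, dW1_rand[0])
print(same_rows_zero, same_rows_rand)
```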

5.2.Random initialization

Observations:
- The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when $\log(a^{[3]})=\log(0)$, the loss goes to infinity.
- Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm.
- If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.


In summary:
- Initializing weights to very large random values does not work well.
- Hopefully initializing with small random values does better. The important question is: how small should these random values be? Let’s find out in the next part!

5.3. He initialization

# GRADED FUNCTION: initialize_parameters_he

import numpy as np

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layers_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1 # integer representing the number of layers

    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
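        # He initialization: scale standard-normal weights by sqrt(2 / size of previous layer)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        # Biases can start at zero; the random weights already break symmetry
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))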
        ### END CODE HERE ###

    return parameters

Finally, try “He Initialization”; this is named for the first author of He et al., 2015. (If you have heard of “Xavier initialization”, this is similar except Xavier initialization uses a scaling factor for the weights $W^{[l]}$ of sqrt(1./layers_dims[l-1]) where He initialization would use sqrt(2./layers_dims[l-1]).)

Exercise: Implement the following function to initialize your parameters with He initialization.

Hint: This function is similar to the previous initialize_parameters_random(...). The only difference is that instead of multiplying np.random.randn(..,..) by 10, you will multiply it by $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$, which is what He initialization recommends for layers with a ReLU activation.

Observations:
- The model with He initialization separates the blue and the red dots very well in a small number of iterations.

Summary

You have seen three different types of initializations. For the same number of iterations and same hyperparameters the comparison is:

| Model | Train accuracy | Problem/Comment |
|---|---|---|
| 3-layer NN with zeros initialization | 50% | fails to break symmetry |
| 3-layer NN with large random initialization | 83% | too large weights |
| 3-layer NN with He initialization | 99% | recommended method |

What you should remember from this notebook:
- Different initializations lead to different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Don’t initialize to values that are too large
- He initialization works well for networks with ReLU activations.

6.Regularization (programming assignment 2: Regularization)

6.1.Non-regularized

The train accuracy is 94.8% while the test accuracy is 91.5%. This is the baseline model (you will observe the impact of regularization on this model). Run the following code to plot the decision boundary of your model.
[Figure: decision boundary of the non-regularized model]
The non-regularized model is obviously overfitting the training set. It is fitting the noisy points! Let’s now look at two techniques to reduce overfitting.

6.2.L2 Regularization

$$J_{regularized} = \underbrace{-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log\left(a^{[L](i)}\right)+(1-y^{(i)})\log\left(1-a^{[L](i)}\right)\right)}_{\text{cross-entropy cost}} + \underbrace{\frac{1}{m}\frac{\lambda}{2}\sum_{l}\sum_{k}\sum_{j}W_{k,j}^{[l]2}}_{\text{L2 regularization cost}} \tag{2}$$

Exercise: Implement compute_cost_with_regularization() which computes the cost given by formula (2). To calculate $\sum_{k}\sum_{j}W_{k,j}^{[l]2}$, use :

np.sum(np.square(Wl))

Note that you have to do this for $W^{[1]}$, $W^{[2]}$ and $W^{[3]}$, then sum the three terms and multiply by $\frac{1}{m}\frac{\lambda}{2}$.

    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) * lambd / (2 * m)
    ### END CODE HERE ###

Exercise: Implement the changes needed in backward propagation to take into account regularization. The changes only concern dW1, dW2 and dW3. For each, you have to add the regularization term’s gradient ($\frac{d}{dW}\left(\frac{1}{2}\frac{\lambda}{m}W^2\right)=\frac{\lambda}{m}W$).

On the train set:
Accuracy: 0.938388625592
On the test set:
Accuracy: 0.93

Observations:
- The value of $\lambda$ is a hyperparameter that you can tune using a dev set.
- L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it is also possible to “oversmooth”, resulting in a model with high bias.

What is L2-regularization actually doing?:

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.


What you should remember – the implications of L2-regularization on:
- The cost computation:
    - A regularization term is added to the cost
- The backpropagation function:
    - There are extra terms in the gradients with respect to weight matrices
- Weights end up smaller (“weight decay”):
    - Weights are pushed to smaller values.

6.3.Dropout

Dropout is a widely used regularization technique that is specific to deep learning. It randomly shuts down some neurons in each iteration.
Forward propagation with dropout

    ### START CODE HERE ### (approx. 4 lines)
    # Steps 1-4 below correspond to the Steps 1-4 described above.
    # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = np.random.rand(A1.shape[0], A1.shape[1])
    # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    D1 = D1 < keep_prob
    # Step 3: shut down some neurons of A1
    A1 = A1 * D1
    # Step 4: scale the value of neurons that haven't been shut down
    A1 = A1 / keep_prob
    ### END CODE HERE ###

Backward propagation with dropout

    ### START CODE HERE ### (≈ 2 lines of code)
    # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 * D2
    # Step 2: Scale the value of neurons that haven't been shut down
    dA2 = dA2 / keep_prob
    ### END CODE HERE ###

On the train set:
Accuracy: 0.928909952607
On the test set:
Accuracy: 0.95

Note:
- A common mistake when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only in training.
- Deep learning frameworks like TensorFlow, PaddlePaddle, Keras or Caffe come with a dropout layer implementation. Don’t stress - you will soon learn some of these frameworks.


What you should remember about dropout:
- Dropout is a regularization technique.
- You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.

6.4.Summary

Here are the results of our three models:

| Model | Train accuracy | Test accuracy |
|---|---|---|
| 3-layer NN without regularization | 95% | 91.5% |
| 3-layer NN with L2-regularization | 94% | 93% |
| 3-layer NN with dropout | 93% | 95% |


What we want you to remember from this notebook:
- Regularization will help you reduce overfitting.
- Regularization will drive your weights to lower values.
- L2 regularization and Dropout are two very effective regularization techniques.

7.Gradient Checking (programming assignment 3: Gradient Checking)

Backpropagation computes the gradients $\frac{\partial J}{\partial \theta}$, where $\theta$ denotes the parameters of the model. $J$ is computed using forward propagation and your loss function.

Because forward propagation is relatively easy to implement, you’re confident you got that right, and so you’re almost 100% sure that you’re computing the cost $J$ correctly. Thus, you can use your code for computing $J$ to verify the code for computing $\frac{\partial J}{\partial \theta}$.

Let’s look back at the definition of a derivative (or gradient):

$$\frac{\partial J}{\partial \theta}=\lim_{\varepsilon\to 0}\frac{J(\theta+\varepsilon)-J(\theta-\varepsilon)}{2\varepsilon}\tag{1}$$

If you’re not familiar with the “$\lim_{\varepsilon\to 0}$” notation, it’s just a way of saying “when $\varepsilon$ is really really small.”

We know the following:

  • $\frac{\partial J}{\partial \theta}$ is what you want to make sure you’re computing correctly.
  • You can compute $J(\theta+\varepsilon)$ and $J(\theta-\varepsilon)$ (in the case that $\theta$ is a real number), since you’re confident your implementation for $J$ is correct.

Let’s use equation (1) and a small value for $\varepsilon$ to convince your CEO that your code for computing $\frac{\partial J}{\partial \theta}$ is correct!
Exercise: Implement gradient_check_n().

Instructions: Here is pseudo-code that will help you implement the gradient check.

For each i in num_parameters:
- To compute J_plus[i]:
    1. Set $\theta^{+}$ to np.copy(parameters_values)
    2. Set $\theta^{+}_i$ to $\theta^{+}_i+\varepsilon$
    3. Calculate $J^{+}_i$ using forward_propagation_n(x, y, vector_to_dictionary($\theta^{+}$)).
- To compute J_minus[i]: do the same thing with $\theta^{-}$
- Compute $gradapprox[i]=\frac{J^{+}_i-J^{-}_i}{2\varepsilon}$

Thus, you get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to parameters_values[i]. You can now compare this gradapprox vector to the gradients vector from backpropagation. Just like for the 1D case (Steps 1’, 2’, 3’), compute:

$$difference=\frac{\|grad-gradapprox\|_2}{\|grad\|_2+\|gradapprox\|_2}\tag{3}$$

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters
    X -- input datapoint, of shape (input size, 1)
    Y -- true "label"
    epsilon -- tiny shift to the input to compute approximated gradient with formula (1)

    Returns:
    difference -- difference (3) between the approximated gradient and the backward propagation gradient
    """

    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox
    for i in range(num_parameters):

        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because forward_propagation_n returns two values but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)                                      # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                                # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))                                   # Step 3
        ### END CODE HERE ###

        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                      # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                               # Step 2        
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))                                  # Step 3
        ### END CODE HERE ###

        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i])/(2*epsilon)
        ### END CODE HERE ###

    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad -  gradapprox)                                          # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)                    # Step 2'
    difference = numerator/ denominator                                        # Step 3'
    ### END CODE HERE ###

    if difference > 2e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference


What you should remember from this notebook:
- Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
- Gradient checking is slow, so we don’t run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.
