  • Recall that different types of initialization lead to different results.
  • Recognize the importance of initialization in complex neural networks.
  • Recognize the difference between train/dev/test sets.
  • Diagnose the bias and variance issues in your model.
  • Learn when and how to use regularization methods such as dropout or L2 regularization.
  • Understand experimental issues in deep learning such as vanishing or exploding gradients, and learn how to deal with them.
  • Use gradient checking to verify the correctness of your backpropagation implementation.



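Training set: used by the algorithm to learn.
Cross-validation set (development set): also called the dev set for short, which is the name used in this course. It is used to tune hyperparameters and to compare the performance of the algorithm under different settings.
Test set: used to check the final performance of the algorithm (for example, its generalization ability), so that the evaluation of the algorithm is not biased.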


The "bias-variance dilemma" (also called the bias-variance tradeoff): high bias usually means underfitting, and high variance usually means overfitting.

| | low bias & high variance | high bias & low variance | high bias & high variance | low bias & low variance |
|---|---|---|---|---|
| Train set error | 1% | 15% | 15% | 0.5% |
| Dev set error | 11% | 16% | 30% | 1% |
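
The four cases in the table can be reproduced with a small helper. This is only a sketch: the function name `diagnose`, the `base_error` argument, and the 5% `tolerance` are assumptions made for illustration, not thresholds given in the course.

```python
def diagnose(train_error, dev_error, base_error=0.0, tolerance=0.05):
    """Label a model from its train/dev errors (all values are fractions).

    `base_error` (the best achievable error) and `tolerance` are
    hypothetical knobs chosen for this sketch.
    """
    high_bias = (train_error - base_error) > tolerance      # poor fit on the training data
    high_variance = (dev_error - train_error) > tolerance   # poor generalization to the dev data
    bias = "high bias" if high_bias else "low bias"
    variance = "high variance" if high_variance else "low variance"
    return f"{bias} & {variance}"


# The four (train error, dev error) pairs from the table
for train, dev in [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]:
    print(train, dev, "->", diagnose(train, dev))
```

In practice the tolerance depends on the task and on how close the achievable error is to 0%, so the numbers here should be read as placeholders.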



Basic Recipe for Machine Learning

2.Regularizing your neural network


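L0: $\|x\|_0 = \#\{i \mid x_i \neq 0\}$
L1: $\|x\|_1 = \sum_i |x_i|$
Lasso regularization; it is the convex approximation of L0.
L2: $\|x\|_2 = \sqrt{\sum_i x_i^2}$
As a penalty it is also known as ridge regression or weight decay. The L2 norm is also called the Euclidean norm (the distance it measures is the Euclidean distance); its matrix counterpart is the Frobenius norm. A small L2 norm makes every element of w small and close to 0, but unlike the L1 norm it does not drive them to exactly 0. It is the norm most commonly used in deep learning.
L∞: $\|x\|_\infty = \max_i |x_i|$
The Chebyshev distance: the distance it measures is like the number of moves a king in chess needs to get from one square to another. I am not aware of its typical application areas.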
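
As a quick check of these definitions, the four norms can be computed with `np.linalg.norm`. The example vector is arbitrary and chosen only so the results are easy to verify by hand.

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])

l0 = np.linalg.norm(x, ord=0)         # number of non-zero entries
l1 = np.linalg.norm(x, ord=1)         # sum of absolute values
l2 = np.linalg.norm(x, ord=2)         # Euclidean length
linf = np.linalg.norm(x, ord=np.inf)  # largest absolute value

print(l0, l1, l2, linf)
```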


L2 Regularization

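Regularized logistic regression: $J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2$

Regularization term for a neural network: $\frac{\lambda}{2m}\sum_{l=1}^{L}\|w^{[l]}\|^2 = \frac{\lambda}{2m}\sum_{l=1}^{L}\sum_{i=1}^{n^{[l-1]}}\sum_{j=1}^{n^{[l]}}\left(w_{ij}^{[l]}\right)^2$
$\|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$; $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$
$\lambda$: the regularization parameter. The larger $\lambda$ is, the more w as a whole tends toward 0. It is a hyperparameter: use the dev set, try a range of values, and pick the best one.
$2m$: just a scaling convention. The 2 cancels the factor of 2 produced by differentiating the square, and $\frac{1}{m}$ matches the averaging of the loss. Adding $\frac{1}{2m}$ only rescales $\lambda$ and does not otherwise affect the result of the minimization.
$\|w\|_2^2$: the squared L2 norm of the parameters; for a weight matrix it is called the (squared) Frobenius norm. L2 regularization is also called weight decay.
L1 Regularization
$\frac{\lambda}{m}\|w\|_1 = \frac{\lambda}{m}\sum_{i=1}^{n_x}|w_i|$
L2 Regularization Backpropagation
With the regularization penalty added, the partial derivative becomes: $dW^{[l]} = (\text{original backpropagation term}) + \frac{\lambda}{m}W^{[l]}$
The update rule itself is unchanged: $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$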
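
To see why the penalty is called weight decay, the sketch below applies one update with the extra $\frac{\lambda}{m}W$ term and compares it with first shrinking W by the factor $(1 - \frac{\alpha\lambda}{m})$ and then taking the plain gradient step. The function name `l2_update` and the random stand-in for the backpropagation gradient are assumptions for illustration only.

```python
import numpy as np

def l2_update(W, dW_backprop, lambd, m, alpha):
    """One gradient step on W with the L2 penalty gradient (lambd / m) * W added."""
    dW = dW_backprop + (lambd / m) * W
    return W - alpha * dW


rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
dW_backprop = rng.standard_normal((3, 4))   # stand-in for the unregularized gradient
lambd, m, alpha = 0.7, 50, 0.1

W_new = l2_update(W, dW_backprop, lambd, m, alpha)
# The same step written as "shrink W first, then apply the plain gradient"
W_decay = (1 - alpha * lambd / m) * W - alpha * dW_backprop
print(np.allclose(W_new, W_decay))
```

The two forms agree because $W - \alpha(dW_{backprop} + \frac{\lambda}{m}W) = (1 - \frac{\alpha\lambda}{m})W - \alpha\, dW_{backprop}$.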


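High variance: the network has too many layers and too many neurons.
After adding L2 regularization, suppose we set $\lambda$ to a very large value. After training, most entries of the weight matrices W will then tend toward 0.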


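With L2 regularization, the parameters of many neurons end up very close to 0, which is roughly equivalent to removing the influence of those neurons. With fewer effective neurons the whole network behaves like a smaller one, and its state moves from high variance toward high bias. If $\lambda$ is chosen well, the network can be tuned to the "just right" state.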

2.4.Dropout Regularization

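Inverted dropout example:
Network with layer l=3; keep_prob=0.8 (each node is kept with probability 0.8, i.e. the drop rate is 0.2)
import numpy as np
D3 = np.random.rand(A3.shape[0], A3.shape[1]) < keep_prob  # mask: roughly 20% of the entries are False and will be dropped
A3 = np.multiply(A3, D3)  # dropped nodes get activation 0, i.e. they are deactivated
A3 = A3 / keep_prob  # about 20% of A3 was zeroed; dividing by keep_prob keeps the expected value of A3 unchanged
Z4 = np.dot(W4, A3) + b4  # because A3 was rescaled by 1/keep_prob, the expected value of Z4 is not changed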
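
The claim that dividing by keep_prob preserves the expected activation can be checked numerically. The following is a minimal sketch: the helper name `inverted_dropout`, the all-ones toy activations, and the fixed seed are assumptions chosen to make the effect easy to see.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Zero each entry of A with probability 1 - keep_prob, then rescale the survivors."""
    D = rng.random(A.shape) < keep_prob   # boolean keep-mask
    return (A * D) / keep_prob, D


rng = np.random.default_rng(1)            # seeded only to make the sketch repeatable
A3 = np.ones((100, 1000))                 # toy activations, all equal to 1
A3_dropped, D3 = inverted_dropout(A3, keep_prob=0.8, rng=rng)

print("fraction kept:", D3.mean())              # close to 0.8
print("mean activation:", A3_dropped.mean())    # close to the original mean of 1.0
```

Each surviving entry becomes 1 / 0.8 = 1.25, which compensates for the entries that were zeroed.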

2.5.Understanding Dropout

Intuition:Can’t rely on any one feature,so have to spread out weights.
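For example, when several neurons in the previous layer all feed into the same neuron in the next layer, any of them may be dropped at random, so the next-layer neuron cannot rely on one particular input. It has to give every input a smaller weight and use their combined output. Dropout therefore tends to shrink the sum of squared weights on the previous layer's outputs, which is similar to L2 regularization. Dropout is of course not the same as L2 regularization, but it can achieve a similar effect.

In computer vision the input feature space is very large, often with more features than training examples (n > m, where m is the number of examples), so overfitting is common and dropout is used very widely.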

2.6.Other regularization methods

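Data augmentation
- Enlarging the dataset acts like a regularizer, because it reduces overfitting
- Horizontal flips
- Random distortions
Early stopping
- Stop training early, before the cost function is fully optimized
- Because the optimization of the cost function is not carried to completion, this method mixes optimization with regularization and is harder to reason about. It has now largely been replaced by L2 regularization.
- Advantage: it needs only a small number of optimization runs. Training passes through small, medium, and large w, so you can roughly find a suitable w without searching over many values of $\lambda$ as L2 regularization requires.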

3.Setting up your optimization problem

3.1.Normalizing inputs

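$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)} - \mu\right)^2$
Middle panel of the figure: the distribution of $x_1, x_2$ after the mean $\mu$ of each feature has been subtracted.
Right panel of the figure: the distribution of $x = \frac{x - \mu}{\sigma}$ (dividing by the standard deviation $\sigma$, not by $\sigma^2$, is what gives each feature unit variance).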
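
A minimal sketch of input normalization follows, assuming data of shape (n_features, m). The helper names `fit_normalizer` and `normalize`, the small `eps` guard, and the synthetic data are assumptions for illustration. The statistics are computed on the training set and then reused unchanged on the test set, so that both sets go through the same transformation.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Return per-feature mean and standard deviation; X has shape (n_features, m)."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma + eps            # eps (an assumption) guards against constant features

def normalize(X, mu, sigma):
    """Subtract the mean and divide by the standard deviation, feature by feature."""
    return (X - mu) / sigma


rng = np.random.default_rng(0)
loc, scale = [[5.0], [-2.0]], [[3.0], [0.5]]          # two features on different scales
X_train = rng.normal(loc=loc, scale=scale, size=(2, 1000))
X_test = rng.normal(loc=loc, scale=scale, size=(2, 200))

mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
X_test_norm = normalize(X_test, mu, sigma)            # reuse the training statistics

print(X_train_norm.mean(axis=1), X_train_norm.std(axis=1))
```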

3.2.Vanishing/Exploding gradients

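$a_1 = g(a_0 w_1),\ a_2 = g(a_1 w_2),\ a_3 = g(a_2 w_3),\ a_4 = g(a_3 w_4),\ a_5 = g(a_4 w_5),\ \hat{y} = g(a_5 w_6)$
$\frac{\partial L(\hat{y}, y)}{\partial w_1} = \frac{\partial L}{\partial a_5}\cdot\frac{\partial a_5}{\partial a_4}\cdot\frac{\partial a_4}{\partial a_3}\cdot\frac{\partial a_3}{\partial a_2}\cdot\frac{\partial a_2}{\partial a_1}\cdot\frac{\partial a_1}{\partial w_1}$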
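
The chain of factors above explains both failure modes. As a simplified illustration, assume a linear activation g(z) = z and the same scalar weight w in every layer, so that each factor $\frac{\partial a_k}{\partial a_{k-1}}$ equals w. The function name `chain_factor` and the depth of 50 are assumptions for this sketch.

```python
def chain_factor(w, depth):
    """Product of `depth` identical factors da_k/da_{k-1} = w for g(z) = z."""
    factor = 1.0
    for _ in range(depth):
        factor *= w
    return factor


for w in (0.5, 1.0, 1.5):
    print(f"w = {w}: product over 50 layers = {chain_factor(w, 50):.3e}")
```

A weight slightly below 1 makes the product shrink toward 0 (vanishing gradient), while a weight slightly above 1 makes it grow without bound (exploding gradient).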


Weight Initialization for Deep Networks
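The idea behind weight initialization is this: since the inputs have already been brought to similar distributions by input normalization, draw the weights from a distribution with mean 0 and variance 1 and then scale them, so that the output z is neither too large (far above 1) nor too small (too close to 0) and every layer sees a similar distribution. This does not fully solve the exploding and vanishing gradient problem, but it mitigates part of it.
For a single neuron, $z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$. As n grows, we want each $w_i$ to become smaller so that the output z stays in a reasonable range.

Initialization for tanh: np.random.randn(shape) * np.sqrt(1 / n[l-1])
Initialization for ReLU: np.random.randn(shape) * np.sqrt(2 / n[l-1])
Another initialization: scale by $\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$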
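
The three scaling factors can be gathered into one helper. This is a sketch under assumptions: the function name `init_weights` and the string labels (in particular `"other"` for the third variant) are hypothetical, and the check at the end only illustrates that z stays on the order of 1 for normalized inputs.

```python
import numpy as np

def init_weights(n_curr, n_prev, activation="relu", rng=None):
    """Draw an (n_curr, n_prev) weight matrix scaled for the given activation."""
    rng = np.random.default_rng() if rng is None else rng
    if activation == "relu":
        scale = np.sqrt(2.0 / n_prev)
    elif activation == "tanh":
        scale = np.sqrt(1.0 / n_prev)
    elif activation == "other":          # hypothetical label for the third variant
        scale = np.sqrt(2.0 / (n_prev + n_curr))
    else:
        raise ValueError(f"unknown activation: {activation}")
    return rng.standard_normal((n_curr, n_prev)) * scale


rng = np.random.default_rng(0)
n_prev = 1000
x = rng.standard_normal((n_prev, 1))          # normalized inputs
W = init_weights(500, n_prev, "tanh", rng)
z = W @ x
print("std of z:", z.std())                   # stays on the order of 1
```

Without the scaling, the standard deviation of z would grow like the square root of n_prev, which is what pushes tanh or sigmoid units into saturation.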

3.4.Numerical approximation of gradients


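$f'(\theta) = \lim_{\epsilon \to 0}\frac{f(\theta+\epsilon) - f(\theta-\epsilon)}{2\epsilon}$

To approximate a gradient, take $f(\theta) = \theta^3$ and substitute $\theta = 1$, $\epsilon = 0.01$ into the formula above. This gives $approx = 3.0001 \approx 3$ (the exact derivative is $f'(\theta) = 3\theta^2 = 3$), so the approximation is very close to the correct derivative.
The approximation here uses the two-sided difference $\frac{f(\theta+\epsilon) - f(\theta-\epsilon)}{2\epsilon}$. It can be shown that its error relative to the true derivative is of order $O(\epsilon^2)$, whereas the one-sided difference $\frac{f(\theta+\epsilon) - f(\theta)}{\epsilon}$ has an error of order $O(\epsilon)$. With $\epsilon = 0.01$ or smaller, the two-sided error is therefore much smaller.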
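
The worked example can be reproduced directly. The helper names `two_sided` and `one_sided` are chosen for this sketch.

```python
def f(theta):
    return theta ** 3

def two_sided(f, theta, eps):
    """Central difference: error of order eps**2."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

def one_sided(f, theta, eps):
    """Forward difference: error of order eps."""
    return (f(theta + eps) - f(theta)) / eps


theta, eps = 1.0, 0.01
exact = 3 * theta ** 2
print("two-sided error:", abs(two_sided(f, theta, eps) - exact))   # about eps**2
print("one-sided error:", abs(one_sided(f, theta, eps) - exact))   # about 3 * eps
```

For this cubic the two-sided estimate is exactly $3\theta^2 + \epsilon^2$ and the one-sided estimate is $3\theta^2 + 3\theta\epsilon + \epsilon^2$, which matches the $O(\epsilon^2)$ versus $O(\epsilon)$ behaviour described above.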

3.5.Gradient checking

Concatenate the parameters of every layer, $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$, in order into one large vector $\theta$.
Cost function: $J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = J(\theta_1, \theta_2, \dots, \theta_{end})$
Concatenate $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ in the same order into one large vector $d\theta$, so that dθ.shape = θ.shape.
The check between $d\theta_{approx}$ and $d\theta$ can be computed with the Euclidean distance.

$error = \frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}$

If the entries of $d\theta_{approx}$ and $d\theta$ are large, their difference will be large as well. To remove this dependence on scale, $\|d\theta_{approx}\|_2 + \|d\theta\|_2$ is used as the denominator.
An error on the order of $10^{-7}$ is usually taken to mean that the gradient check has passed.
1) Initialize the parameters: all $W^{[l]}$, and $b^{[l]} = 0$.
2) Concatenate all the parameters end to end and reshape them into one large vector $\theta$.
3) for each i:
$d\theta_{approx}[i] = \frac{J(\theta_1, \dots, \theta_i + \epsilon, \dots) - J(\theta_1, \dots, \theta_i - \epsilon, \dots)}{2\epsilon}$

Note: go through $\theta$ from the beginning and perform this operation once for each $\theta_i$. Store the results in a large vector $d\theta_{approx}$ with the same shape as $\theta$. Each entry of $d\theta_{approx}$ holds the approximate partial derivative with respect to the corresponding parameter.
4) Run backpropagation as usual. The returned $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ are also concatenated in order into one large vector $d\theta$.
5) Compare $d\theta_{approx}$ with $d\theta$ using the error measure described above.
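
Steps 3 to 5 can be sketched for any cost that takes a flat parameter vector. The function name `gradient_check`, the toy quadratic cost, and the deliberately wrong gradient are assumptions for illustration; a real network would first flatten its parameters and gradients as in steps 2 and 4.

```python
import numpy as np

def gradient_check(J, theta, d_theta, eps=1e-7):
    """Relative error between d_theta and a two-sided numerical gradient of J at theta."""
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += eps                 # nudge only the i-th parameter
        theta_minus[i] -= eps
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator


# Toy cost with a known gradient: J(theta) = sum(theta**2), dJ/dtheta = 2 * theta
def J(theta):
    return np.sum(theta ** 2)

theta = np.array([0.5, -1.2, 2.0])

print(gradient_check(J, theta, 2 * theta))   # tiny: the analytic gradient is correct
print(gradient_check(J, theta, 3 * theta))   # large: a deliberately wrong gradient
```

The loop costs two evaluations of J per parameter, which is why the check is used for debugging rather than during training.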

3.6.Gradient Checking Implementation Notes

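  • Don't use gradient checking during training; use it only to debug. Once gradient checking has confirmed that the whole gradient computation is correct, turn it off before starting training.
  • If gradient checking reports a large error, compare $d\theta_{approx}[i]$ with $d\theta[i]$ entry by entry to find the layer where large discrepancies begin, then carefully inspect the gradient code for that layer. Gradient checking helps you decide where to start looking for the bug.
  • Don't forget the regularization term. If you use regularization, include it in the computation when running gradient checking.
  • Gradient checking does not work together with dropout. Set keep_prob to 1 before running gradient checking.
  • Run at random initialization; perhaps again after some training.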

4.Practice Questions

With the inverted dropout technique, at test time:
You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training

5.Parameter initialization (Programming assignment 1 - Initialization Parameters)

5.1.Zero initialization

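Initializing W and b of every layer to 0 performs very badly. The cost does not decrease at all, and the algorithm does no better than random guessing. In general, initializing all the weights to 0 means the network fails to "break symmetry". Every neuron in a layer then learns the same thing, so you are effectively training a network with $n^{[l]} = 1$ in every layer (a single neuron per layer). Such a network is no more powerful than a linear classifier such as logistic regression.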

What you should remember:
- The weights $W^{[l]}$ should be initialized randomly to break symmetry.
- It is however okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly.
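
The symmetry problem can be seen on a tiny network. The sketch below is not the assignment's model: the two-layer tanh/sigmoid architecture, the function name `train_two_layer`, the toy labels, and the step count are all assumptions chosen to keep the example short.

```python
import numpy as np

def train_two_layer(W1, W2, X, Y, steps=100, alpha=0.5):
    """A few gradient-descent steps on a tiny tanh -> sigmoid network (biases start at 0)."""
    m = X.shape[1]
    b1 = np.zeros((W1.shape[0], 1))
    b2 = np.zeros((1, 1))
    for _ in range(steps):
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
        # Layer 2 is updated only after dZ1 has been computed from the old W2
        W2 = W2 - alpha * (dZ2 @ A1.T) / m
        b2 = b2 - alpha * dZ2.sum(axis=1, keepdims=True) / m
        W1 = W1 - alpha * (dZ1 @ X.T) / m
        b1 = b1 - alpha * dZ1.sum(axis=1, keepdims=True) / m
    return W1, W2


rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1] * X[1:2] > 0).astype(float)      # toy XOR-like labels

W1_zero, _ = train_two_layer(np.zeros((4, 2)), np.zeros((1, 4)), X, Y)
W1_rand, _ = train_two_layer(rng.standard_normal((4, 2)) * 0.1,
                             rng.standard_normal((1, 4)) * 0.1, X, Y)

print("zero init, rows of W1 identical:", np.allclose(W1_zero, W1_zero[0]))
print("random init, rows of W1 identical:", np.allclose(W1_rand, W1_rand[0]))
```

With zero weights every hidden unit receives the same (here zero) gradient, so the rows of W1 never diverge; with small random weights each unit follows its own path.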

5.2.Random initialization

- The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when $\log(a^{[3]}) = \log(0)$, the loss goes to infinity.
- Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm.
- If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.

In summary:
- Initializing weights to very large random values does not work well.
- Hopefully initializing with small random values does better. The important question is: how small should these random values be? Let's find out in the next part!

5.3. He initialization

# GRADED FUNCTION: initialize_parameters_he

import numpy as np

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layers_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    parameters = {}
    L = len(layers_dims) - 1 # integer representing the number of layers

    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        # He initialization: scale standard-normal weights by sqrt(2 / size of the previous layer)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2 / layers_dims[l-1])
        # Biases can start at zero; the random weights already break symmetry
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###

    return parameters

Finally, try “He Initialization”; this is named for the first author of He et al., 2015. (If you have heard of “Xavier initialization”, this is similar except Xavier initialization uses a scaling factor for the weights $W^{[l]}$ of sqrt(1./layers_dims[l-1]) where He initialization would use sqrt(2./layers_dims[l-1]).)

Exercise: Implement the following function to initialize your parameters with He initialization.

Hint: This function is similar to the previous initialize_parameters_random(...). The only difference is that instead of multiplying np.random.randn(..,..) by 10, you will multiply it by $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$, which is what He initialization recommends for layers with a ReLU activation.

- The model with He initialization separates the blue and the red dots very well in a small number of iterations.


You have seen three different types of initializations. For the same number of iterations and same hyperparameters the comparison is:

| **Model** | **Train accuracy** | **Problem/Comment** |
|---|---|---|
| 3-layer NN with zeros initialization | 50% | fails to break symmetry |
| 3-layer NN with large random initialization | 83% | too large weights |
| 3-layer NN with He initialization | 99% | recommended method |

What you should remember from this notebook:
- Different initializations lead to different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Don’t initialize to values that are too large
- He initialization works well for networks with ReLU activations.



The train accuracy is 94.8% while the test accuracy is 91.5%. This is the baseline model (you will observe the impact of regularization on this model). Run the following code to plot the decision boundary of your model.
The non-regularized model is obviously overfitting the training set. It is fitting the noisy points! Let's now look at two techniques to reduce overfitting.

6.2.L2 Regularization

$$J_{regularized} = \underbrace{-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log\left(a^{[L](i)}\right) + (1 - y^{(i)})\log\left(1 - a^{[L](i)}\right)\right)}_{\text{cross-entropy cost}} + \underbrace{\frac{1}{m}\frac{\lambda}{2}\sum_{l}\sum_{k}\sum_{j} W_{k,j}^{[l]2}}_{\text{L2 regularization cost}} \tag{2}$$

Exercise: Implement compute_cost_with_regularization() which computes the cost given by formula (2). To calculate $\sum_{k}\sum_{j} W_{k,j}^{[l]2}$, use: np.sum(np.square(Wl))

Note that you have to do this for $W^{[1]}$, $W^{[2]}$ and $W^{[3]}$, then sum the three terms and multiply by $\frac{1}{m}\frac{\lambda}{2}$.

    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) * lambd / (2 * m)
    ### END CODE HERE ###
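
For context, here is a self-contained sketch of the whole cost in formula (2). The signature is an assumption made for this example (it takes a list of weight matrices rather than the assignment's parameters dictionary), and the toy values are chosen so the penalty is easy to verify by hand.

```python
import numpy as np

def compute_cost_with_regularization(A_last, Y, weights, lambd):
    """Cross-entropy cost plus (lambd / (2 * m)) * sum of squared weights.

    `weights` is a list of weight matrices; this signature is an assumption
    made for the sketch and differs from the assignment's.
    """
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(A_last) + (1 - Y) * np.log(1 - A_last)) / m
    l2_cost = lambd / (2 * m) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy + l2_cost


Y = np.array([[1.0, 0.0, 1.0]])
A3 = np.array([[0.9, 0.2, 0.7]])
weights = [np.ones((2, 3)), np.ones((1, 2))]          # 8 entries, each squared to 1

cost_plain = compute_cost_with_regularization(A3, Y, weights, lambd=0.0)
cost_l2 = compute_cost_with_regularization(A3, Y, weights, lambd=0.6)
print(cost_plain, cost_l2)    # the second is larger by 0.6 / (2 * 3) * 8 = 0.8
```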

Exercise: Implement the changes needed in backward propagation to take into account regularization. The changes only concern dW1, dW2 and dW3. For each, you have to add the regularization term’s gradient ( $\frac{d}{dW}\left(\frac{1}{2}\frac{\lambda}{m}W^2\right) = \frac{\lambda}{m}W$ ).

On the train set:
Accuracy: 0.938388625592
On the test set:
Accuracy: 0.93


- The value of $\lambda$ is a hyperparameter that you can tune using a dev set.
- L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it is also possible to “oversmooth”, resulting in a model with high bias.

What is L2-regularization actually doing?:

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.

What you should remember – the implications of L2-regularization on:
- The cost computation:
- A regularization term is added to the cost
- The backpropagation function:
- There are extra terms in the gradients with respect to weight matrices
- Weights end up smaller (“weight decay”):
- Weights are pushed to smaller values.


Dropout is a widely used regularization technique that is specific to deep learning. It randomly shuts down some neurons in each iteration.
Forward propagation with dropout

    ### START CODE HERE ### (approx. 4 lines)
    # Steps 1-4 below correspond to the Steps 1-4 described above.
    # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = np.random.rand(A1.shape[0], A1.shape[1])
    # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    D1 = D1 < keep_prob
    # Step 3: shut down some neurons of A1
    A1 = A1 * D1
    # Step 4: scale the value of neurons that haven't been shut down
    A1 = A1 / keep_prob
    ### END CODE HERE ###

Backward propagation with dropout

    ### START CODE HERE ### (≈ 2 lines of code)
    # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 * D2
    # Step 2: Scale the value of neurons that haven't been shut down
    dA2 = dA2 / keep_prob
    ### END CODE HERE ###


On the train set:
Accuracy: 0.928909952607
On the test set:
Accuracy: 0.95


- A common mistake when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only in training.
- Deep learning frameworks like tensorflow, PaddlePaddle, keras or caffe come with a dropout layer implementation. Don’t stress - you will soon learn some of these frameworks.

What you should remember about dropout:
- Dropout is a regularization technique.
- You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.


Here are the results of our three models:

| **model** | **train accuracy** | **test accuracy** |
|---|---|---|
| 3-layer NN without regularization | 95% | 91.5% |
| 3-layer NN with L2-regularization | 94% | 93% |
| 3-layer NN with dropout | 93% | 95% |

What we want you to remember from this notebook:
- Regularization will help you reduce overfitting.
- Regularization will drive your weights to lower values.
- L2 regularization and Dropout are two very effective regularization techniques.

Gradient Checking (Programming assignment 3 - Gradient Checking)


Backpropagation computes the gradients $\frac{\partial J}{\partial \theta}$, where $\theta$ denotes the parameters of the model. $J$ is computed using forward propagation and your loss function.

Because forward propagation is relatively easy to implement, you’re confident you got that right, and so you’re almost 100% sure that you’re computing the cost $J$ correctly. Thus, you can use your code for computing $J$ to verify the code for computing $\frac{\partial J}{\partial \theta}$.

Let’s look back at the definition of a derivative (or gradient):

$$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0}\frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon} \tag{1}$$

If you’re not familiar with the “$\lim_{\varepsilon \to 0}$” notation, it’s just a way of saying “when $\varepsilon$ is really really small.”

We know the following:

  • $\frac{\partial J}{\partial \theta}$ is what you want to make sure you’re computing correctly.
  • You can compute $J(\theta + \varepsilon)$ and $J(\theta - \varepsilon)$ (in the case that $\theta$ is a real number), since you’re confident your implementation for $J$ is correct.

Let’s use equation (1) and a small value for $\varepsilon$ to convince your CEO that your code for computing $\frac{\partial J}{\partial \theta}$ is correct!
Exercise: Implement gradient_check_n().

Instructions: Here is pseudo-code that will help you implement the gradient check.

For each i in num_parameters:
- To compute J_plus[i]:
1. Set $\theta^{+}$ to np.copy(parameters_values)
2. Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
3. Calculate $J^{+}_i$ using forward_propagation_n(x, y, vector_to_dictionary($\theta^{+}$)).
- To compute J_minus[i]: do the same thing with $\theta^{-}$
- Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2\varepsilon}$

Thus, you get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to parameters_values[i]. You can now compare this gradapprox vector to the gradients vector from backpropagation. Just like for the 1D case (Steps 1’, 2’, 3’), compute:

$$difference = \frac{\|grad - gradapprox\|_2}{\|grad\|_2 + \|gradapprox\|_2} \tag{3}$$

import numpy as np

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n.
    Relies on the assignment helpers dictionary_to_vector, gradients_to_vector, vector_to_dictionary
    and forward_propagation_n, which are not shown here.

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters.
    X -- input datapoint, of shape (input size, 1)
    Y -- true "label"
    epsilon -- tiny shift to the input to compute approximated gradient with formula (1)

    Returns:
    difference -- difference (3) between the approximated gradient and the backward propagation gradient
    """

    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox
    for i in range(num_parameters):

        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because forward_propagation_n returns two values but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)                                      # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                                # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))                                   # Step 3
        ### END CODE HERE ###

        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                      # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                               # Step 2        
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))                                  # Step 3
        ### END CODE HERE ###

        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i])/(2*epsilon)
        ### END CODE HERE ###

    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad -  gradapprox)                                          # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)                    # Step 2'
    difference = numerator/ denominator                                        # Step 3'
    ### END CODE HERE ###

    if difference > 2e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference

What you should remember from this notebook:
- Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
- Gradient checking is slow, so we don’t run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.

