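[Deep Learning] Andrew Ng Deep Learning, Course 2 (Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization), Week 1 Practical Aspects of Deep Learning, Programming Assignment (Part 2): Regularization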

Article link: https://blog.csdn.net/passer__jw767/article/details/122674944

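This article explains why regularization matters in deep learning models and how to apply it, covering L2 regularization and dropout. Adding a regularization term or randomly deactivating neurons during training helps keep the model from overfitting the training data and improves prediction performance on the test set. In practice, L2 regularization modifies the cost function so that the entries of the weight matrices become smaller, while dropout randomly shuts down some neurons in each iteration to improve generalization. In the experiments below, test accuracy rises from 91.5% without regularization to 93% with L2 regularization and 95% with dropout.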


Reminder: if a snippet you copied fails to run, I may have pasted the wrong code. Check the full code in Section 6, which is guaranteed to be correct! 😃

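Video link: [Andrew Ng Deep Learning, Course 2 (Chinese and English subtitles): Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization](https://www.bilibili.com/video/BV1V441127zE?p=14)
Reference links: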

Resource download link (from reference link 2):

Materials used in this article
data.mat is named 9.mat after download and must be renamed to data.mat by hand.

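0. Assignment Goals

Deep learning models have so much flexibility and capacity that overfitting can become a serious problem when the training set is not large enough. Such a model does well on the training set, but the learned network does not generalize to new examples it has never seen.
In this assignment you will learn to:

Use regularization in your deep learning models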

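1. Packages to Import

testCases can be downloaded from item 3 of Section 1 of this article.
For the other packages, see Section 1 of the previous article (installation with Anaconda).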

# import packages
import numpy as np
import matplotlib.pyplot as plt
from reg_utils import sigmoid, relu, plot_decision_boundary, initialize_parameters, load_2D_dataset, predict_dec
from reg_utils import compute_cost, predict, forward_propagation, backward_propagation, update_parameters
import sklearn
import sklearn.datasets
import scipy.io
from testCases import *

plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

It is enough that this runs without errors. You may see some warnings, but they have no effect.

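2. Problem Statement (from reference link 2)

Suppose you are an AI expert and need to design a model that recommends where on the football field the goalkeeper should kick the ball so that your team's players are more likely to win it. Put simply, this is a binary classification problem: either your team gets the ball or the opponent does. Let's look at the figure:
[Figure: the football field]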

3. Dataset

Load the dataset (add the following two lines to the code given above):

train_X, train_Y, test_X, test_Y = load_2D_dataset(is_plot=True)
plt.show()

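The result is as follows:
[Figure: scatter plot of the training set]
Each dot is a position where the ball could land. Blue means a player from your team wins the ball, and red means an opposing player wins it.
Our goal: use a model to draw a boundary that identifies the positions where our players can win the ball.
Dataset analysis: the dataset is a little noisy, but a diagonal line separating the upper-left half (blue) from the lower-right half (red) looks like it would work well.
You will first try a non-regularized model. Then you will learn how to regularize it and decide which model to choose for the French Football Corporation's problem.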

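4. Learning Process

4.1 Non-regularized model

You will use the following neural network (already implemented for you below). The model can be used:

In regularization mode: set the lambd input to a non-zero value. We use "lambd" instead of "lambda" because "lambda" is a reserved keyword in Python.
In dropout mode: set keep_prob to a value less than 1.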
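
The naming constraint above can be confirmed with the standard-library `keyword` module. This is only a quick check and not part of the assignment; the variable names are my own.

```python
import keyword

# "lambda" is reserved by the Python grammar, so it cannot be used
# as a parameter name such as model(..., lambda=0.7).
is_lambda_reserved = keyword.iskeyword("lambda")

# "lambd" is an ordinary identifier and is therefore safe to use.
is_lambd_reserved = keyword.iskeyword("lambd")

print(is_lambda_reserved, is_lambd_reserved)
```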

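You will first try the model without any regularization. Then you will implement:

L2 regularization: the functions "compute_cost_with_regularization()" and "backward_propagation_with_regularization()"
Dropout: the functions "forward_propagation_with_dropout()" and "backward_propagation_with_dropout()"

In each part, you will run the model with the appropriate inputs so that it calls the functions you wrote. Take a look at the model code below, which should look familiar.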

def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- If True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob - probability of keeping a neuron active during drop-out, scalar.
    
    Returns:
    parameters -- parameters learned by the model. They can then be used to predict.
    """
        
    grads = {}
    costs = []                            # to keep track of the cost
    m = X.shape[1]                        # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]
    
    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)
        
        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)
            
        # Backward propagation.
        assert(lambd == 0 or keep_prob == 1)    # it is possible to use both L2 regularization and dropout, 
                                            # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)
        
        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)
        
        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)
    
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()
    
    return parameters

Checking the cost

Use the code below to view the cost (when running it, comment out the functions you have not written yet).

parameters = model(train_X, train_Y)
print("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

The result is as follows:

Cost after iteration 0: 0.6557412523481002
Cost after iteration 10000: 0.16329987525724213
Cost after iteration 20000: 0.13851642423257488

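(Note: if the plot shown by plt.show() also contains the dataset points, set the is_plot argument to False in the dataset-loading code at the beginning.)
[Figure: cost curve of the non-regularized model]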

On the training set:
Accuracy: 0.9478672985781991
On the test set:
Accuracy: 0.915

Viewing the decision boundary

plt.title("Model without regularization")
axes = plt.gca()
axes.set_xlim([-0.75, 0.40])
axes.set_ylim([-0.75, 0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

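The result is as follows:
[Figure: decision boundary of the model without regularization]
The non-regularized model is clearly overfitting the training set. It fits the noisy points: the red dots inside the blue region and the blue dots inside the red region. Now let's look at two techniques for reducing overfitting.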

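4.2 L2 Regularization

The standard way to avoid overfitting is L2 regularization. It consists of appropriately modifying the cost function, from
$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[3](i)}\right) + (1-y^{(i)}) \log\left(1-a^{[3](i)}\right) \right) \tag{1}$$
to:
$$J_{regularized} = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[3](i)}\right) + (1-y^{(i)}) \log\left(1-a^{[3](i)}\right) \right) + \frac{1}{m} \frac{\lambda}{2} \sum_{l} \sum_{k} \sum_{j} W_{k,j}^{[l]2} \tag{2}$$
Let's modify the cost function and observe the results.

Exercise: complete the functions compute_cost_with_regularization() and backward_propagation_with_regularization()

Exercise 1: Complete compute_cost_with_regularization(), using formula (2) above to compute the cost.
Hint: to compute $\sum_{k} \sum_{j} W_{k,j}^{[l]2}$, you can use np.sum(np.square(Wl)). Note that you have to do this for $W^{[1]}$, $W^{[2]}$ and $W^{[3]}$, then add the three terms and multiply by (1/m)*(λ/2).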

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.
    
    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model
    
    Returns:
    cost - value of the regularized loss function (formula (2))
    """

The completed function should look like this:

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.

    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model

    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y)

    L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) / (2 * m)

    cost = cross_entropy_cost + L2_regularization_cost

    return cost
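
To make the regularization term concrete, here is a minimal sketch that evaluates only the L2 part of formula (2) on toy weights. The helper name `l2_penalty` and the toy matrices are hypothetical and chosen so the result can be verified by hand; they are not part of reg_utils.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """Return (lambd / (2 * m)) times the sum of squared entries of all weight matrices."""
    return lambd / (2 * m) * sum(np.sum(np.square(W)) for W in weights)

# Hypothetical toy weights, small enough to check by hand.
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])   # squared entries sum to 1 + 4 + 9 + 16 = 30
W2 = np.array([[1.0, -1.0]])              # squared entries sum to 2

penalty = l2_penalty([W1, W2], lambd=0.5, m=4)
print(penalty)  # 0.5 / (2 * 4) * 32 = 2.0
```

In compute_cost_with_regularization() above, the same quantity is computed for W1, W2 and W3 and added to the cross-entropy cost.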

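Because you changed the cost function, you must change backward propagation as well. All gradients have to be computed with respect to this new cost.

Exercise 2: Modify backward propagation to take regularization into account. The changes only concern dW1, dW2 and dW3. For each of them, add the gradient of the regularization term, $\frac{d}{dW}\left(\frac{1}{2}\frac{\lambda}{m}W^2\right) = \frac{\lambda}{m}W$.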
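
If you want to convince yourself that the extra gradient term really is λ*W/m, a finite-difference check is a quick way to do it. The sketch below is my own illustration: `penalty` and `numerical_grad` are hypothetical helper names, and the values of lambd and m are arbitrary.

```python
import numpy as np

def penalty(W, lambd, m):
    """L2 regularization term for a single weight matrix."""
    return lambd / (2 * m) * np.sum(np.square(W))

def numerical_grad(f, W, eps=1e-6):
    """Central-difference estimate of df/dW, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus = W.copy()
        W_plus[idx] += eps
        W_minus = W.copy()
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
lambd, m = 0.7, 5                      # arbitrary illustration values

analytic = lambd / m * W               # the term added to each dW
numeric = numerical_grad(lambda w: penalty(w, lambd, m), W)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)                        # should be tiny (rounding error only)
```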

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.
    
    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

The answer, with explanations, is as follows:

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.

    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

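    # Backward propagation. If this part is unfamiliar, review the Course 1, Week 3 lectures/notes.
    # The last step of forward propagation uses the sigmoid function, so here we take the derivative of sigmoid
    # and apply the chain rule.
    # In Andrew Ng's notation: "dA3" stands for dL(A3,Y)/dA3 and "dZ3" stands for dL(A3,Y)/dZ3
    # "dA3" = dL(A3,Y)/dA3 = -(Y/A3) + (1-Y)/(1-A3)
    # "dZ3" = dL(A3,Y)/dA3 * dA3/dZ3 = dL(A3,Y)/dA3 * A3(1-A3) = A3 - Y
    dZ3 = A3 - Y
    # In forward propagation, Z3 = np.dot(W3, A2) + b3
    # "dW3" = dL(A3,Y)/dW3 = "dZ3" * dZ3/dW3 = "dZ3" * A2
    # The hint (which already works out the derivative for us) says every dW needs the regularization gradient, so we add the lambda * W / m term from the hint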

    dW3 = (lambd * W3) / m + 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

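    # The rest follows the same pattern
    # By the chain rule, "dA2" = "dZ3" * dZ3/dA2
    # In forward propagation, Z3 = np.dot(W3, A2) + b3
    # so dZ3/dA2 = W3
    # and therefore "dA2" = "dZ3" * W3
    dA2 = np.dot(W3.T, dZ3)
    # In forward propagation, A2 = relu(Z2)
    # "dZ2" = "dA2" * dA2/dZ2 = "dA2" * np.int64(g(z) > 0); the expression in parentheses is a boolean test, explained below:
    # relu is g(z) = max(0,z), and its derivative is: g'(z)=0 if z<0; g'(z)=1 if z>0; undefined at z=0 (you may define it as 0 or 1 yourself)
    # So the derivative of relu can be read off from the boolean test "g(z) > 0": since g(z)=max(0,z), g(z)>0 holds exactly when z>0, where g'(z)=1; otherwise the test gives 0.
    # Following this reasoning, we can use "g(z) > 0" to compute the derivative of relu
    # np.int64(a): converts the values of a to integers (True -> 1, False -> 0)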
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = (lambd * W2) / m + 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = (lambd * W1) / m + 1. / m * np.dot(dZ1, X.T)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

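In my opinion, backward propagation is the harder part of implementing L2 regularization. The key point is that each dW still needs the regularization gradient λ*W/m added. The way the ReLU derivative is computed is also a neat trick; if you cannot follow it, it is fine to simply remember it.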
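
The ReLU trick is easier to see on a tiny array. The sketch below uses made-up values for Z and dA; it writes the mask as `(A > 0).astype(np.int64)`, which I am assuming is interchangeable with the `np.int64(A > 0)` used in the function above.

```python
import numpy as np

# Hypothetical pre-activations, including negative, zero and positive entries.
Z = np.array([[-2.0, 0.0, 3.0],
              [1.5, -0.5, 2.0]])
A = np.maximum(0, Z)                 # ReLU forward pass: A = max(0, Z)

dA = np.full_like(Z, 10.0)           # made-up upstream gradient

# A > 0 is True exactly where Z > 0, i.e. where the ReLU slope is 1.
mask = (A > 0).astype(np.int64)
dZ = np.multiply(dA, mask)           # gradient passes only through active units

print(mask)
print(dZ)
```

Entries where Z is negative or zero get a gradient of 0, and the rest pass dA through unchanged.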

Checking the cost

You can copy and paste the following code to see how the cost changes.

parameters = model(train_X, train_Y, lambd=0.7)
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

The result is as follows:
[Figure: cost curve of the model with L2 regularization]

Cost after iteration 0: 0.6455597362665152
Cost after iteration 10000: 0.22243930284965185
Cost after iteration 20000: 0.217197697773333
On the train set:
Accuracy: 0.9383886255924171
On the test set:
Accuracy: 0.93

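Congratulations! Test-set accuracy has increased to 93%!
The training set is now fit less closely than before (training accuracy dropped from about 94.8% to about 93.8%). Next, let's see what the decision boundary looks like.

Viewing the decision boundary

You can copy and paste the following code to view the decision boundary.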

plt.title("Model with L2-regularization")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

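The result is as follows:
[Figure: decision boundary of the model with L2 regularization]
Here is a summary of the results with L2 regularization.

Observations:

The value of λ is a hyperparameter that you can tune using a dev set.
L2 regularization makes your decision boundary smoother. If λ is too large, the boundary can become "oversmoothed", giving a model with high bias (a high-bias model fails to fit the data well).

What does L2 regularization actually do?
L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. By penalizing the squared values of the weights in the cost function, it drives all the weights toward smaller values, because large weights make the cost too high. This leads to a smoother model, in which the output changes more slowly as the input changes.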

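What you should remember about L2 regularization:

Cost computation: a regularization term is added to the cost function.
Backward propagation: the gradients with respect to the weight matrices contain an extra term.
Weights end up smaller ("weight decay"): the weights are pushed toward smaller values.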
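
The name "weight decay" comes from rewriting the gradient-descent update: adding λ*W/m to dW is the same as first shrinking W by the factor (1 - αλ/m) and then taking the usual step. Below is a minimal sketch of that equivalence. `dW_data` is a random stand-in for the cross-entropy gradient, and m = 200 is a hypothetical number of examples.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
dW_data = rng.standard_normal((3, 4))    # stand-in for the unregularized gradient
learning_rate, lambd, m = 0.3, 0.7, 200  # m is a hypothetical example count

# Update as written in backward_propagation_with_regularization + update_parameters.
dW = dW_data + lambd / m * W
W_step = W - learning_rate * dW

# The same update rewritten as "decay the weights, then step".
decay = 1 - learning_rate * lambd / m
W_decay = decay * W - learning_rate * dW_data

same = np.allclose(W_step, W_decay)
print(decay, same)
```

With these values the decay factor is slightly below 1, so every iteration pulls the weights a little toward zero before the data gradient is applied.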

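4.3 Dropout

Finally, dropout is a widely used regularization technique that is specific to deep learning. In each iteration it randomly shuts down some neurons. The two figures below illustrate the idea (the author of reference link 2 provides download links for the original figures):
Figure 1: Dropout on the second hidden layer.
In each iteration, you shut down each neuron of a layer with probability $1 - keep\_prob$, or keep it with probability $keep\_prob$ (50% here). The dropped neurons do not take part in forward or backward propagation for that iteration.

Figure 2: Dropout on the first and third hidden layers.
$1^{st}$ layer: on average 40% of the nodes are shut down. $3^{rd}$ layer: on average 20% of the nodes are shut down.
When you shut some neurons down, you actually modify your model. The idea behind dropout is that in each iteration you train a different model that uses only a subset of your neurons. With dropout, your neurons become less sensitive to the activation of any one other specific neuron, because that other neuron might be shut down at any time.

Exercise: complete forward_propagation_with_dropout() and backward_propagation_with_dropout()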

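Exercise 1: Implement forward propagation with dropout. You are using a 3-layer neural network and will add dropout to the first and second hidden layers. We will not apply dropout to the input layer or the output layer.
Instructions: You want to shut down some neurons in the first and second layers. To do that, carry out the following four steps:

Step 1: In the lecture, we created a variable $d^{[1]}$ with the same shape as $a^{[1]}$, using np.random.rand() to get random numbers between 0 and 1. Here you will use a vectorized implementation, so create a random matrix $D^{[1]} = [d^{[1](1)} \; d^{[1](2)} \; \dots \; d^{[1](m)}]$ with the same shape as $A^{[1]}$.
Step 2: Set each entry of $D^{[1]}$ to 0 with probability $1 - keep\_prob$ and to 1 with probability $keep\_prob$ by thresholding appropriately (entries below $keep\_prob$ become 1, and all other entries become 0).
Hint: to set the entries of a matrix X to 1 (if the entry is less than 0.5) or 0 (otherwise), you can do X = (X < 0.5). Note that 0 and 1 are equivalent to False and True respectively.
Step 3: Set $A^{[1]}$ to $A^{[1]} * D^{[1]}$. You can think of $D^{[1]}$ as a mask: when it is multiplied with another matrix, the shut-down nodes (value 0) drop out of the computation, because 0 times anything is 0.
Step 4: Divide $A^{[1]}$ by $keep\_prob$. This scaling ensures that the cost still has the same expected value as without dropout. The technique is called inverted dropout.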
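
The four steps above can be sketched on a standalone activation matrix. This is an illustration under my own assumptions rather than the assignment's function: `inverted_dropout` is a hypothetical helper, it uses NumPy's `default_rng` generator instead of the global `np.random.rand`, and the all-ones input is chosen so the effect on the mean is easy to read.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Apply the four inverted-dropout steps to activations A."""
    D = rng.random(A.shape)      # Step 1: uniform random numbers in [0, 1)
    D = D < keep_prob            # Step 2: True (1) with probability keep_prob
    A = A * D                    # Step 3: shut down the masked-out neurons
    A = A / keep_prob            # Step 4: rescale to keep the expected value
    return A, D

rng = np.random.default_rng(0)
A1 = np.ones((20, 1000))         # hypothetical activations, all equal to 1
A1_drop, D1 = inverted_dropout(A1, keep_prob=0.5, rng=rng)

kept_fraction = D1.mean()        # should be close to keep_prob
mean_activation = A1_drop.mean() # should stay close to the original mean of 1
print(kept_fraction, mean_activation)
```

Each surviving entry becomes 2 and each dropped entry becomes 0, so the average stays near 1, which is the point of Step 4.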

Complete the following function.
We fix the random seed with np.random.seed(1).

def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1, number of examples)
    cache -- tuple, information stored for computing the backward propagation
    """

The completed function is as follows:

def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1, number of examples)
    cache -- tuple, information stored for computing the backward propagation
    """
    np.random.seed(1)

    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    # Step 1: create a random matrix D^[1] with the same shape as A^[1]
    D1 = np.random.rand(A1.shape[0], A1.shape[1])
    # Step 2: set each entry of D^[1] to 0 with probability 1 - keep_prob and to 1 with probability keep_prob
    D1 = D1 < keep_prob
    # Step 3: set A^[1] to A^[1] * D^[1]
    A1 = A1 * D1
    # Step 4: divide A^[1] by keep_prob
    A1 /= keep_prob

    # The second layer is handled exactly like the first
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])
    D2 = D2 < keep_prob
    A2 = A2 * D2
    A2 /= keep_prob

    # The third (output) layer does not use dropout, so the masking steps are not needed
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)

    return A3, cache
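
One side effect of calling np.random.seed(1) inside the function is worth knowing: every call draws exactly the same masks, which makes the results reproducible but means the same neurons are dropped in every iteration. The sketch below contrasts this with a generator created once and passed in. Both helper names are hypothetical, and switching approaches would change the numbers reported in this post, so treat this as an illustration only.

```python
import numpy as np

def mask_with_reseed(shape, keep_prob):
    """Mimics the function above: the global seed is reset on every call."""
    np.random.seed(1)
    return np.random.rand(*shape) < keep_prob

def mask_with_generator(shape, keep_prob, rng):
    """Alternative: draw from a generator that keeps its state between calls."""
    return rng.random(shape) < keep_prob

a = mask_with_reseed((20, 5), 0.86)
b = mask_with_reseed((20, 5), 0.86)
identical_when_reseeded = np.array_equal(a, b)      # same mask every call

rng = np.random.default_rng(1)
c = mask_with_generator((20, 5), 0.5, rng)
d = mask_with_generator((20, 5), 0.5, rng)
identical_with_generator = np.array_equal(c, d)     # masks differ between calls

print(identical_when_reseeded, identical_with_generator)
```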

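Exercise 2: Implement backward propagation with dropout. As before, you are training a 3-layer neural network. Add dropout to the first and second hidden layers, using the masks $D^{[1]}$ and $D^{[2]}$ stored in the cache.
Instructions: Backpropagation with dropout is actually quite simple. You only need to carry out these two steps:

During forward propagation you shut down some neurons by computing A1 * D1. In backpropagation, you have to shut down the same neurons by applying the same mask: dA1 * D1.
During forward propagation you divided A1 by $keep\_prob$. In backpropagation you therefore have to divide dA1 by $keep\_prob$ as well: if $A^{[1]}$ is scaled by $keep\_prob$, then its derivative $dA^{[1]}$ is scaled by the same $keep\_prob$.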
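
The two backward steps can be checked numerically on a toy example. In the sketch below the "loss" is just a weighted sum of the dropped-out activations; the `weights` array, the shapes and keep_prob = 0.8 are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
A = rng.standard_normal((4, 3))
D = rng.random(A.shape) < keep_prob       # mask saved from the forward pass
weights = rng.standard_normal((4, 3))     # hypothetical downstream sensitivities

def loss(A_in):
    """Toy loss: weighted sum of the activations after inverted dropout."""
    A_drop = A_in * D / keep_prob
    return np.sum(weights * A_drop)

# Backward pass: apply the same mask and the same scaling to the gradient.
dA_drop = weights                         # d(loss) / d(A_drop)
dA = dA_drop * D / keep_prob

# Central-difference check of d(loss) / dA.
eps = 1e-6
numeric = np.zeros_like(A)
for idx in np.ndindex(A.shape):
    A_plus = A.copy()
    A_plus[idx] += eps
    A_minus = A.copy()
    A_minus[idx] -= eps
    numeric[idx] = (loss(A_plus) - loss(A_minus)) / (2 * eps)

max_diff = np.max(np.abs(dA - numeric))
print(max_diff)
```

The gradient is zero for every neuron that was shut down in the forward pass, and the analytic and numerical gradients agree up to rounding.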

Complete the following function:

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

The completed function is as follows:

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    # In backward propagation, shut down the same neurons that were shut down in forward propagation
    dA2 = dA2 * D2
    # Scale by the same factor
    dA2 /= keep_prob
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    # Same as for the second layer
    dA1 = np.dot(W2.T, dZ2)
    dA1 = dA1 * D1
    dA1 /= keep_prob
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

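Checking the cost

Let's run the model with $keep\_prob = 0.86$. This means that in each iteration, each neuron in layers 1 and 2 is shut down with probability 14%. Inside model(), forward_propagation_with_dropout(...) and backward_propagation_with_dropout(...) are called for the forward and backward passes respectively.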

parameters = model(train_X, train_Y, keep_prob=0.86, learning_rate=0.3)

print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

The result is as follows:

Cost after iteration 0: 0.6543912405149825
H:\DeepLearning_wed\course2\week1\reg_utils.py:121: RuntimeWarning: divide by zero encountered in log
  logprobs = np.multiply(-np.log(a3),Y) + np.multiply(-np.log(1 - a3), 1 - Y)
H:\DeepLearning_wed\course2\week1\reg_utils.py:121: RuntimeWarning: invalid value encountered in multiply
  logprobs = np.multiply(-np.log(a3),Y) + np.multiply(-np.log(1 - a3), 1 - Y)
Cost after iteration 10000: 0.0610169865749056
Cost after iteration 20000: 0.060582435798513114
On the train set:
Accuracy: 0.9289099526066351
On the test set:
Accuracy: 0.95

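[Figure: cost curve of the model with dropout]
Dropout works even better! Test accuracy has increased again, to 95%. Your model no longer overfits the training set and does better on the test set. The French football team will be grateful to you!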
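
The RuntimeWarning lines in the output above show np.log being evaluated at 0 in the logprobs line of reg_utils.py, which happens when an output a3 (or 1 - a3) reaches exactly 0. One common guard is to clip the predictions before taking the logarithm. The function below is my own sketch, not the compute_cost from reg_utils, and the eps value is an arbitrary choice.

```python
import numpy as np

def cross_entropy_clipped(a3, Y, eps=1e-12):
    """Cross-entropy cost with predictions clipped away from exactly 0 and 1."""
    m = Y.shape[1]
    a3 = np.clip(a3, eps, 1 - eps)    # keeps both log arguments strictly positive
    logprobs = -(Y * np.log(a3) + (1 - Y) * np.log(1 - a3))
    return np.sum(logprobs) / m

# Hypothetical saturated outputs that would otherwise trigger log(0).
a3 = np.array([[1.0, 0.0, 0.9]])
Y = np.array([[1.0, 0.0, 1.0]])
cost = cross_entropy_clipped(a3, Y)
print(cost)
```

The first two examples are predicted perfectly and contribute almost nothing, so the cost is essentially -log(0.9) / 3 and no warning is raised.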

Viewing the decision boundary

You can copy the following code to view the decision boundary:

plt.title("Model with dropout")
axes = plt.gca()
axes.set_xlim([-0.75, 0.40])
axes.set_ylim([-0.75, 0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

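[Figure: decision boundary of the model with dropout]

Notes:

A common mistake when using dropout is to apply it in both training and testing. You should use dropout only during training.
Deep learning frameworks such as tensorflow, PaddlePaddle, keras and caffe come with a dropout implementation. Don't stress if this is not clear yet; we will learn some of these frameworks soon.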
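
A simple way to avoid the train/test mistake described above is to make the mode explicit in the code. The sketch below is a hypothetical `dropout_layer` helper written for this note; it is not the API of any of the frameworks mentioned.

```python
import numpy as np

def dropout_layer(A, keep_prob, training, rng):
    """Apply inverted dropout during training; return A unchanged otherwise."""
    if not training or keep_prob >= 1:
        return A
    D = rng.random(A.shape) < keep_prob   # keep each unit with probability keep_prob
    return A * D / keep_prob

rng = np.random.default_rng(0)
A = np.full((3, 4), 2.0)                  # made-up activations

train_out = dropout_layer(A, 0.5, training=True, rng=rng)
test_out = dropout_layer(A, 0.5, training=False, rng=rng)
print(train_out)
print(test_out)
```

In training mode each entry is either dropped (0) or rescaled (4), while in test mode the activations pass through untouched.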

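What we should remember about dropout:

Dropout is a regularization technique.
Use dropout only during training, not at test time.
Apply dropout in both forward and backward propagation.
During training, divide each dropout layer by $keep\_prob$ to keep the same expected value for the activations. For example, if $keep\_prob$ is 0.5, then on average half of the nodes are shut down, so the output is scaled by 0.5 because only the remaining half contribute. Dividing by 0.5 is the same as multiplying by 2, so the output has the same expected value again. You can check that this also works for values of $keep\_prob$ other than 0.5.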
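
Here is one way to run that check for several values of keep_prob. It is a small Monte Carlo sketch: the activation value 3.0, the sample size and the chosen keep_prob values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 3.0                 # a single made-up activation value
n = 200_000             # number of simulated dropout draws

results = {}
for keep_prob in (0.3, 0.5, 0.86):
    d = rng.random(n) < keep_prob                # 1 with probability keep_prob
    results[keep_prob] = float(np.mean(a * d / keep_prob))

for keep_prob, mean_value in results.items():
    print(keep_prob, mean_value)
```

For every keep_prob the average of a * d / keep_prob stays close to the original activation a, which is what dividing by keep_prob is meant to achieve.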

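5. Summary

Here is a comparison of the three models, using the accuracies reported above:
3-layer network without regularization: training accuracy 94.8%, test accuracy 91.5%
3-layer network with L2 regularization: training accuracy 93.8%, test accuracy 93%
3-layer network with dropout: training accuracy 92.9%, test accuracy 95%
Note that regularization hurts training-set performance! This is because it limits the network's ability to overfit the training set. But since it ultimately gives better test accuracy, it is helping your network.
Congratulations on finishing this assignment! :-)
What we should remember from this article: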

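Regularization helps reduce overfitting.
Regularization drives your weights to lower values.
L2 regularization and dropout are two very effective regularization techniques.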

6. Source Code

import numpy as np
import matplotlib.pyplot as plt
from course2.week1.reg_utils import sigmoid, relu, plot_decision_boundary, initialize_parameters, load_2D_dataset, predict_dec
from course2.week1.reg_utils import compute_cost, predict, forward_propagation, backward_propagation, update_parameters
import sklearn
import sklearn.datasets
import scipy.io
from course2.week1.testCases import *

# plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
# plt.rcParams['image.interpolation'] = 'nearest'
# plt.rcParams['image.cmap'] = 'gray'

train_X, train_Y, test_X, test_Y = load_2D_dataset(is_plot=False)
# plt.show()


def model(X, Y, learning_rate=0.3, num_iterations=30000, print_cost=True, lambd=0, keep_prob=1):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- If True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob - probability of keeping a neuron active during drop-out, scalar.

    Returns:
    parameters -- parameters learned by the model. They can then be used to predict.
    """

    grads = {}
    costs = []  # to keep track of the cost
    m = X.shape[1]  # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]

    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)

        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)

        # Backward propagation.
        assert (lambd == 0 or keep_prob == 1)  # it is possible to use both L2 regularization and dropout,
        # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters


def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.

    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model

    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y)

    L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) / (2 * m)

    cost = cross_entropy_cost + L2_regularization_cost

    return cost


def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.

    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

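    # Backward propagation. If this part is unfamiliar, review the Course 1, Week 3 lectures/notes.
    # The last step of forward propagation uses the sigmoid function, so here we take the derivative of sigmoid
    # and apply the chain rule.
    # In Andrew Ng's notation: "dA3" stands for dL(A3,Y)/dA3 and "dZ3" stands for dL(A3,Y)/dZ3
    # "dA3" = dL(A3,Y)/dA3 = -(Y/A3) + (1-Y)/(1-A3)
    # "dZ3" = dL(A3,Y)/dA3 * dA3/dZ3 = dL(A3,Y)/dA3 * A3(1-A3) = A3 - Y
    dZ3 = A3 - Y
    # In forward propagation, Z3 = np.dot(W3, A2) + b3
    # "dW3" = dL(A3,Y)/dW3 = "dZ3" * dZ3/dW3 = "dZ3" * A2
    # The hint (which already works out the derivative for us) says every dW needs the regularization gradient, so we add the lambda * W / m term from the hint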

    dW3 = (lambd * W3) / m + 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

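    # The rest follows the same pattern
    # By the chain rule, "dA2" = "dZ3" * dZ3/dA2
    # In forward propagation, Z3 = np.dot(W3, A2) + b3
    # so dZ3/dA2 = W3
    # and therefore "dA2" = "dZ3" * W3
    dA2 = np.dot(W3.T, dZ3)
    # In forward propagation, A2 = relu(Z2)
    # "dZ2" = "dA2" * dA2/dZ2 = "dA2" * np.int64(g(z) > 0); the expression in parentheses is a boolean test, explained below:
    # relu is g(z) = max(0,z), and its derivative is: g'(z)=0 if z<0; g'(z)=1 if z>0; undefined at z=0 (you may define it as 0 or 1 yourself)
    # So the derivative of relu can be read off from the boolean test "g(z) > 0": since g(z)=max(0,z), g(z)>0 holds exactly when z>0, where g'(z)=1; otherwise the test gives 0.
    # Following this reasoning, we can use "g(z) > 0" to compute the derivative of relu
    # np.int64(a): converts the values of a to integers (True -> 1, False -> 0)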
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = (lambd * W2) / m + 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = (lambd * W1) / m + 1. / m * np.dot(dZ1, X.T)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients


def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1, number of examples)
    cache -- tuple, information stored for computing the backward propagation
    """
    np.random.seed(1)

    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    # Step 1: create a random matrix D^[1] with the same shape as A^[1]
    D1 = np.random.rand(A1.shape[0], A1.shape[1])
    # Step 2: set each entry of D^[1] to 0 with probability 1 - keep_prob and to 1 with probability keep_prob
    D1 = D1 < keep_prob
    # Step 3: set A^[1] to A^[1] * D^[1]
    A1 = A1 * D1
    # Step 4: divide A^[1] by keep_prob
    A1 /= keep_prob

    # The second layer is handled exactly like the first
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])
    D2 = D2 < keep_prob
    A2 = A2 * D2
    A2 /= keep_prob

    # The third (output) layer does not use dropout, so the masking steps are not needed
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)

    return A3, cache

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    # In backward propagation, shut down the same neurons that were shut down in forward propagation
    dA2 = dA2 * D2
    # Scale by the same factor
    dA2 /= keep_prob
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    # Same as for the second layer
    dA1 = np.dot(W2.T, dZ2)
    dA1 = dA1 * D1
    dA1 /= keep_prob
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients


# # Model without regularization: training and prediction
# parameters = model(train_X, train_Y)
# print("On the training set:")
# predictions_train = predict(train_X, train_Y, parameters)
# print("On the test set:")
# predictions_test = predict(test_X, test_Y, parameters)
# # Model without regularization: plotting
# plt.title("Model without regularization")
# axes = plt.gca()
# axes.set_xlim([-0.75, 0.40])
# axes.set_ylim([-0.75, 0.65])
# plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)


# # Model with L2-regularization: training and prediction
# parameters = model(train_X, train_Y, lambd=0.7)
# print("On the train set:")
# predictions_train = predict(train_X, train_Y, parameters)
# print("On the test set:")
# predictions_test = predict(test_X, test_Y, parameters)
# # Model with L2-regularization: plotting
# plt.title("Model with L2-regularization")
# axes = plt.gca()
# axes.set_xlim([-0.75,0.40])
# axes.set_ylim([-0.75,0.65])
# plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

# Model with dropout: training and prediction
parameters = model(train_X, train_Y, keep_prob=0.86, learning_rate=0.3)
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
# Model with dropout: plotting
plt.title("Model with dropout")
axes = plt.gca()
axes.set_xlim([-0.75, 0.40])
axes.set_ylim([-0.75, 0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)