Andrew Ng's "Neural Networks and Deep Learning", Week 4 Programming Assignment: Building a Deep Neural Network

※※※※※ Previous post: [Classifying 2D Data with a Single-Hidden-Layer Neural Network] ※※※※※ Next post: [Deep Neural Network Application: Cat or Not] ※※※※※


  In the previous tutorial we trained a two-layer neural network (with a single hidden layer). In this post, we will learn to build a deep neural network with an arbitrary number of layers and implement all the functions required to construct it.

Skills you will have mastered after this post:
   ∙ Use non-linear units such as ReLU to improve your model
   ∙ Build a deeper neural network (with more than one hidden layer)
   ∙ Implement an easy-to-use neural network class

  Materials used in this post: [click to download], extraction code: hwwc. Please download them before you start and unzip the files into the same directory as your code; make sure that directory contains the files dnn_utils.py, testCases.py, and lr_utils.py.

[Notation]

   ∙ Superscript $[l]$ denotes a quantity associated with the $l^{th}$ layer.
    - Example: $a^{[L]}$ is the activation of the $L^{th}$ layer; $W^{[L]}$ and $b^{[L]}$ are the parameters of the $L^{th}$ layer.
   ∙ Superscript $(i)$ denotes a quantity associated with the $i^{th}$ example.
    - Example: $x^{(i)}$ is the $i^{th}$ training example.
   ∙ Subscript $i$ denotes the $i^{th}$ entry of a vector.
    - Example: $a_{i}^{[l]}$ denotes the $i^{th}$ entry of the activations of layer $l$.

1 Packages

  Before we begin, we need to import a few packages:

import numpy as np
import h5py
import matplotlib.pyplot as plt
import testCases                                                        # see the resource package
from dnn_utils import sigmoid, sigmoid_backward, relu, relu_backward    # see the resource package
import lr_utils                                                         # see the resource package

  To make your results match mine, you need to fix the random seed.

np.random.seed(1)

2 Outline of Building a Deep Neural Network

  To build the deep neural network, we need to implement several "helper functions". These helper functions will be used in the next post, [Deep Neural Network Application: Image Classification], to build a two-layer neural network and an L-layer neural network.

  The workflow for building the deep neural network is as follows:

   ∙ Initialize the parameters for a two-layer network and for an $L$-layer network.

   ∙ Implement the forward propagation module (shown in purple in the figure below).
    - Complete the LINEAR part of a layer's forward propagation step (resulting in $Z^{[l]}$).
    - Use the ACTIVATION functions provided (relu / sigmoid).
    - Combine the previous two steps into a new [LINEAR -> ACTIVATION] forward function.
    - Stack the [LINEAR -> RELU] forward function L-1 times (for layers 1 through L-1) and add a [LINEAR -> SIGMOID] at the end (for the final layer). This gives a new L_model_forward function.

   ∙ Compute the loss.

   ∙ Implement the backward propagation module (shown in red in the figure below).
    - Complete the LINEAR part of a layer's backward propagation step.
    - Use the gradients of the ACTIVATION functions provided (relu_backward / sigmoid_backward).
    - Combine the previous two steps into a new [LINEAR -> ACTIVATION] backward function.
    - Stack [LINEAR -> RELU] backward L-1 times and add [LINEAR -> SIGMOID] backward in a new L_model_backward function.

   ∙ Finally, update the parameters.

(Figure: outline of the forward propagation module, in purple, and the backward propagation module, in red)

[Note]: For every forward function there is a corresponding backward function. This is why at every step of the forward propagation module some values are stored in a cache; in the backward propagation module, the cached values are used to compute the gradients.
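
  As a tiny, self-contained illustration of this pattern (a made-up scaling "layer", not one of the assignment's functions), here is how a forward/backward pair can share values through a cache:

import numpy as np

# Toy illustration of the cache pattern: the forward step stores whatever its
# backward counterpart will need, and the backward step reads it from the cache.

def scale_forward(x, w):
    out = w * x
    cache = (x, w)              # keep the inputs for the backward pass
    return out, cache

def scale_backward(dout, cache):
    x, w = cache                # values stored during the forward pass
    dx = dout * w               # gradient with respect to the input
    dw = np.sum(dout * x)       # gradient with respect to the parameter
    return dx, dw

x = np.array([1.0, 2.0, 3.0])
out, cache = scale_forward(x, w=0.5)
dx, dw = scale_backward(np.ones_like(out), cache)
print(dx, dw)                   # [0.5 0.5 0.5] 6.0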

3 Initialization

  We first write two helper functions to initialize the parameters of the model. The first function initializes the parameters of a two-layer model; the second generalizes the initialization to an $L$-layer model.

3.1 Initializing a Two-Layer Network

[Instructions]

   ∙ The model's structure is: LINEAR -> RELU -> LINEAR -> SIGMOID.
   ∙ Initialize the weight matrices randomly; make sure the dimensions are correct and use np.random.randn(shape) * 0.01.
   ∙ Initialize the biases to zero; use np.zeros(shape).

[Code]

# GRADED FUNCTION: initialize_parameters

def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer

    Returns:
    parameters -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """

    np.random.seed(1)

    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))

    assert (W1.shape == (n_h, n_x))
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters

  With the initialization written, let's test it:

[Test]

print("============== Testing initialize_parameters ==============")
parameters = initialize_parameters(3, 2, 1)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

[Result]

============== Testing initialize_parameters ==============
W1 = [[ 0.01624345 -0.00611756 -0.00528172]
 [-0.01072969  0.00865408 -0.02301539]]
b1 = [[0.]
 [0.]]
W2 = [[ 0.01744812 -0.00761207]]
b2 = [[0.]]

3.2 Initializing an L-Layer Network

  Initializing a deeper L-layer neural network is more involved because there are many more weight matrices and bias vectors. After completing initialize_parameters_deep, you should check that the dimensions of adjacent layers match. Recall that $n^{[l]}$ is the number of units in layer $l$. For example, if the input $X$ has shape $(12288, 209)$ (so $m = 209$ examples), then:

(Figure: table of layer shapes. For $l = 1, \dots, L$: $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$, $b^{[l]}$ has shape $(n^{[l]}, 1)$, and both $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$ and $A^{[l]}$ have shape $(n^{[l]}, 209)$.)

  When we compute $WX + b$ in Python, broadcasting is used: $b$ has shape $(n^{[l]}, 1)$, and adding it to $WX$, which has shape $(n^{[l]}, m)$, adds $b$ to every column of $WX$.
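
  A small NumPy snippet makes this broadcasting concrete; the shapes below are illustrative choices, not tied to the assignment data:

import numpy as np

# Illustrative shapes: layer l has 3 units, layer l-1 has 4 units, m = 5 examples
n_l, n_prev, m = 3, 4, 5
W = np.random.randn(n_l, n_prev)
X = np.random.randn(n_prev, m)
b = np.random.randn(n_l, 1)

Z = np.dot(W, X) + b                                        # b is broadcast across the m columns
print(Z.shape)                                              # (3, 5)

# Broadcasting is equivalent to explicitly tiling b across the columns:
print(np.allclose(Z, np.dot(W, X) + np.tile(b, (1, m))))    # True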

[Instructions]

   ∙ The model's structure is [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID. That is, the first $L-1$ layers use ReLU as the activation function, and the last layer uses a sigmoid activation for the output.
   ∙ Initialize the weight matrices randomly; use np.random.randn(shape) * 0.01.
   ∙ Initialize the biases to zero; use np.zeros(shape).
   ∙ We store $n^{[l]}$, the number of units in each layer, in the variable layer_dims. For example, the "2D data classification model" of the previous post had layer_dims = [2, 4, 1]: each example has 2 features, the single hidden layer has 4 hidden units, and the output layer has 1 output unit. Accordingly, W1 has shape (4, 2), b1 has shape (4, 1), W2 has shape (1, 4), and b2 has shape (1, 1). Now generalize this to $L$ layers.

[Code]

# GRADED FUNCTION: initialize_parameters_deep

def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """

    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)  # number of layers in the network, including the input layer

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        
        assert (parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l - 1]))
        assert (parameters['b' + str(l)].shape == (layer_dims[l], 1))

    return parameters

  Let's test it:

[Test]

# Test initialize_parameters_deep
print("============== Testing initialize_parameters_deep ==============")
layers_dims = [5, 4, 3]
parameters = initialize_parameters_deep(layers_dims)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

[Result]

============== Testing initialize_parameters_deep ==============
W1 = [[ 0.01788628  0.0043651   0.00096497 -0.01863493 -0.00277388]
 [-0.00354759 -0.00082741 -0.00627001 -0.00043818 -0.00477218]
 [-0.01313865  0.00884622  0.00881318  0.01709573  0.00050034]
 [-0.00404677 -0.0054536  -0.01546477  0.00982367 -0.01101068]]
b1 = [[0.]
 [0.]
 [0.]
 [0.]]
W2 = [[-0.01185047 -0.0020565   0.01486148  0.00236716]
 [-0.01023785 -0.00712993  0.00625245 -0.00160513]
 [-0.00768836 -0.00230031  0.00745056  0.01976111]]
b2 = [[0.]
 [0.]
 [0.]]

  We have now written the parameter-initialization functions for both the two-layer and the multi-layer network; next we build the forward propagation functions.

4 Forward Propagation Module

  We first implement some basic functions to use later when building the model. Complete three functions, in this order:

   ∙ LINEAR
   ∙ LINEAR -> ACTIVATION, where the activation is either ReLU or sigmoid
   ∙ [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID (the whole model)

4.1 Linear Forward

  The linear forward module (vectorized over all the examples) computes the following equation: $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$, where $A^{[0]} = X$.

  The linear part of forward propagation is computed as follows:

[Code]

# GRADED FUNCTION: linear_forward

def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter
    cache -- a python dictionary containing "A", "W" and "b" ; stored for computing the backward pass efficiently
    """

    Z = np.dot(W, A) + b

    assert (Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)

    return Z, cache

  Let's test the linear part:

[Test]

# Test linear_forward
print("============== Testing linear_forward ==============")
A, W, b = testCases.linear_forward_test_case()
Z, linear_cache = linear_forward(A, W, b)
print("A = " + str(A))
print("W = " + str(W))
print("b = " + str(b))
print("Z = " + str(Z))

[Result]

============== Testing linear_forward ==============
A = [[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]
 [ 0.86540763 -2.3015387 ]]
W = [[ 1.74481176 -0.7612069   0.3190391 ]]
b = [[-0.24937038]]
Z = [[ 3.26295337 -1.23429987]]

4.2 Linear-Activation Forward

  We will use two activation functions:

   ∙ Sigmoid: $\sigma(Z) = \sigma(WA + b) = \frac{1}{1 + e^{-(WA + b)}}$. This function returns two values: the activation value "A" and a "cache" containing "Z" (which we will feed into the corresponding backward function to compute gradients). You can obtain both as follows:

A, activation_cache = sigmoid(Z)

   ∙ ReLU: $A = \mathrm{ReLU}(Z) = \max(0, Z)$. This function returns two values: the activation value "A" and a "cache" containing "Z" (which we will feed into the corresponding backward function to compute gradients). You can obtain both as follows (a minimal sketch of both helpers is given right after this list):

A, activation_cache = relu(Z)
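
  For reference, here is a minimal sketch of what these two helpers could look like; the versions shipped in dnn_utils.py may differ in detail, but they follow this interface (returning the activation and a cache holding Z):

import numpy as np

# Minimal sketch of the activation helpers (the dnn_utils.py versions may differ).

def sigmoid(Z):
    A = 1 / (1 + np.exp(-Z))
    cache = Z                   # Z is cached for the backward pass
    return A, cache

def relu(Z):
    A = np.maximum(0, Z)
    cache = Z
    return A, cache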

  For convenience, we group the two functions (linear and activation) into one (LINEAR -> ACTIVATION). We therefore implement a function that performs the LINEAR forward step followed by the ACTIVATION forward step.

[Instructions]: implement the forward propagation of the LINEAR -> ACTIVATION layer. The mathematical expression is $A^{[l]} = g^{[l]}\left(Z^{[l]}\right) = g^{[l]}\left(W^{[l]}A^{[l-1]} + b^{[l]}\right)$, where the activation "g" can be sigmoid() or relu().

[Code]

# GRADED FUNCTION: linear_activation_forward

def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function, also called the post-activation value 
    cache -- a python dictionary containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """

    if activation == "sigmoid":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)

    elif activation == "relu":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)

    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache

[Test]

# Test linear_activation_forward
print("============== Testing linear_activation_forward ==============")
A_prev, W, b = testCases.linear_activation_forward_test_case()
print("A_prev = " + str(A_prev))
print("W = " + str(W))
print("b = " + str(b))

A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation="sigmoid")
print("sigmoid,A = " + str(A))

A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation="relu")
print("ReLU,A = " + str(A))

[Result]

============== Testing linear_activation_forward ==============
A_prev = [[-0.41675785 -0.05626683]
 [-2.1361961   1.64027081]
 [-1.79343559 -0.84174737]]
W = [[ 0.50288142 -1.24528809 -1.05795222]]
b = [[-0.90900761]]
sigmoid,A = [[0.96890023 0.11013289]]
ReLU,A = [[3.43896131 0.        ]]

[Note]: In deep learning, the "[LINEAR->ACTIVATION]" computation is counted as a single layer of the neural network, not as two.

4.3 L-Layer Model

  We have finished the forward propagation functions needed for the two-layer model. What does forward propagation look like for a multi-layer model? We implement it by calling the two functions above: to make building the L-layer network more convenient, we need a function that repeats the previous one (linear_activation_forward with RELU) L-1 times and then follows it with one linear_activation_forward with SIGMOID. Its structure looks like this:
(Figure: the [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID model)

  In the code below, the variable AL denotes $A^{[L]} = \sigma\left(Z^{[L]}\right) = \sigma\left(W^{[L]}A^{[L-1]} + b^{[L]}\right)$, sometimes also written Yhat, i.e. $\hat{Y}$.

[Code]

# GRADED FUNCTION: L_model_forward

def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()

    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_relu_forward() (there are L-1 of them, indexed from 0 to L-2)
                the cache of linear_sigmoid_forward() (there is one, indexed L-1)
    """

    caches = []
    A = X
    L = len(parameters) // 2  # number of layers in the neural network

    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)], parameters['b' + str(l)],
                                             activation="relu")
        caches.append(cache)

    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], activation="sigmoid")
    caches.append(cache)

    assert (AL.shape == (1, X.shape[1]))

    return AL, caches

[Test]

# Test L_model_forward
print("============== Testing L_model_forward ==============")
X, parameters = testCases.L_model_forward_test_case()
AL, caches = L_model_forward(X, parameters)
print("X = " + str(X))
print("parameters = " + str(parameters))
print("AL = " + str(AL))
print("caches 的长度为 = " + str(len(caches)))
print("caches = " + str(caches))

[Result]

============== Testing L_model_forward ==============
X = [[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]
 [ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]]
parameters = {'W1': array([[ 0.3190391 , -0.24937038,  1.46210794, -2.06014071],
       [-0.3224172 , -0.38405435,  1.13376944, -1.09989127],
       [-0.17242821, -0.87785842,  0.04221375,  0.58281521]]), 'b1': array([[-1.10061918],
       [ 1.14472371],
       [ 0.90159072]]), 'W2': array([[ 0.50249434,  0.90085595, -0.68372786]]), 'b2': array([[-0.12289023]])}
AL = [[0.17007265 0.2524272 ]]
length of caches = 2
caches = [((array([[ 1.62434536, -0.61175641],
       [-0.52817175, -1.07296862],
       [ 0.86540763, -2.3015387 ],
       [ 1.74481176, -0.7612069 ]]), array([[ 0.3190391 , -0.24937038,  1.46210794, -2.06014071],
       [-0.3224172 , -0.38405435,  1.13376944, -1.09989127],
       [-0.17242821, -0.87785842,  0.04221375,  0.58281521]]), array([[-1.10061918],
       [ 1.14472371],
       [ 0.90159072]])), array([[-2.77991749, -2.82513147],
       [-0.11407702, -0.01812665],
       [ 2.13860272,  1.40818979]])), ((array([[0.        , 0.        ],
       [0.        , 0.        ],
       [2.13860272, 1.40818979]]), array([[ 0.50249434,  0.90085595, -0.68372786]]), array([[-0.12289023]])), array([[-1.58511248, -1.08570881]]))]

  We now have a complete forward propagation module: it takes the input $X$ and outputs a row vector $A^{[L]}$ containing the predictions. It also records all intermediate values in "caches", and from $A^{[L]}$ we can compute the cost of the predictions.
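
  As a quick sanity check, the forward module can be exercised end to end on random data; the layer sizes below are illustrative choices, and the snippet assumes the functions defined above are available:

import numpy as np

np.random.seed(1)

layers_dims = [5, 4, 3, 1]       # illustrative sizes: 5 input features, two hidden layers, 1 output
X = np.random.randn(5, 10)       # 10 random examples

parameters = initialize_parameters_deep(layers_dims)
AL, caches = L_model_forward(X, parameters)

print(AL.shape)                  # (1, 10): one prediction per example
print(len(caches))               # 3: one cache per (LINEAR -> ACTIVATION) layer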

5 Cost Function

  We have finished the forward propagation for both models. Now we need to compute the cost, so we can check whether the model is actually learning. The cross-entropy cost $J$ is computed with the following formula: $$J = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)}\log\left(a^{[L](i)}\right) + \left(1 - y^{(i)}\right)\log\left(1 - a^{[L](i)}\right)\right)$$

[Code]

# GRADED FUNCTION: compute_cost

def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (7).

    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """

    m = Y.shape[1]

    # Compute loss from aL and y.
    cost = -1 / m * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL), axis=1, keepdims=True)

    cost = np.squeeze(cost)  # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert (cost.shape == ())

    return cost

[Test]

# Test compute_cost
print("============== Testing compute_cost ==============")
Y, AL = testCases.compute_cost_test_case()
print("Y = " + str(Y))
print("AL = " + str(AL))
print("cost = " + str(compute_cost(AL, Y)))

[Result]

============== Testing compute_cost ==============
Y = [[1 1 1]]
AL = [[0.8 0.9 0.4]]
cost = 0.41493159961539694

6 Backward Propagation Module

  Backward propagation is used to compute the gradients of the loss function with respect to the parameters. Let's look at the flow of forward and backward propagation:

(Figure: forward and backward propagation for LINEAR -> RELU -> LINEAR -> SIGMOID)

  If you know some calculus, the chain rule gives the derivative of the loss with respect to $z^{[1]}$ in a two-layer network as follows: $$dz^{[1]} = \frac{\partial L}{\partial z^{[1]}} = \frac{\partial L}{\partial a^{[2]}}\cdot \frac{\partial a^{[2]}}{\partial z^{[2]}}\cdot \frac{\partial z^{[2]}}{\partial a^{[1]}}\cdot \frac{\partial a^{[1]}}{\partial z^{[1]}} \tag{1}$$

  To compute the gradient $dW^{[1]}$, you extend equation (1) with one more factor: $$dW^{[1]} = dz^{[1]}\cdot \frac{\partial z^{[1]}}{\partial W^{[1]}} \tag{2}$$

  Similarly, to compute the gradient $db^{[1]}$: $$db^{[1]} = dz^{[1]}\cdot \frac{\partial z^{[1]}}{\partial b^{[1]}} \tag{3}$$

  This is why we talk about backward propagation.
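
  For instance, for the two-layer LINEAR -> RELU -> LINEAR -> SIGMOID model with the cross-entropy loss, carrying the chain rule through for a single example gives the familiar per-example expressions (stated here only as an illustration of equations (1)-(3)): $$dz^{[2]} = a^{[2]} - y, \qquad dz^{[1]} = W^{[2]T} dz^{[2]} \ast g'\left(z^{[1]}\right), \qquad dW^{[1]} = dz^{[1]} x^{T}, \qquad db^{[1]} = dz^{[1]}$$ where $g$ is the ReLU activation of the hidden layer.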

  Now, similarly to forward propagation, backward propagation is built in three steps:

   ∙ LINEAR backward
   ∙ LINEAR -> ACTIVATION backward, where ACTIVATION computes the derivative of either the ReLU or the sigmoid activation
   ∙ [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID backward (the whole model)

6.1 Linear Backward

  For layer $l$, the linear part is: $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$.

(Figure: the LINEAR backward step for layer l)

  Suppose you have already computed the derivative $dZ^{[l]} = \frac{\partial L}{\partial Z^{[l]}}$. From the input $dZ^{[l]}$ you then need to compute the three outputs $dW^{[l]}$, $db^{[l]}$, and $dA^{[l-1]}$. The formulas you need are: $$dW^{[l]} = \frac{\partial L}{\partial W^{[l]}} = \frac{1}{m}dZ^{[l]}A^{[l-1]T} \tag{4}$$ $$db^{[l]} = \frac{\partial L}{\partial b^{[l]}} = \frac{1}{m}\sum_{i=1}^{m}dZ^{[l](i)} \tag{5}$$ $$dA^{[l-1]} = \frac{\partial L}{\partial A^{[l-1]}} = W^{[l]T}dZ^{[l]} \tag{6}$$

[Code]

# GRADED FUNCTION: linear_backward

def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = 1 / m * np.dot(dZ, A_prev.T)
    db = 1 / m * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)

    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)

    return dA_prev, dW, db

[Test]

# Test linear_backward
print("============== Testing linear_backward ==============")
dZ, linear_cache = testCases.linear_backward_test_case()

dA_prev, dW, db = linear_backward(dZ, linear_cache)
print("dA_prev = " + str(dA_prev))
print("dW = " + str(dW))
print("db = " + str(db))

[Result]

============== Testing linear_backward ==============
dA_prev = [[ 0.51822968 -0.19517421]
 [-0.40506361  0.15255393]
 [ 2.37496825 -0.89445391]]
dW = [[-0.10076895  1.40685096  1.64992505]]
db = [[0.50629448]]

6.2 Linear-Activation Backward

  To help you implement linear_activation_backward, two backward functions are provided:

   ∙ sigmoid_backward: implements the backward propagation for the SIGMOID unit. You can use it like this:

dZ = sigmoid_backward(dA, activation_cache)

   ∙ relu_backward: implements the backward propagation for the RELU unit. You can use it like this:

dZ = relu_backward(dA, activation_cache)

  If $g(\cdot)$ is the activation function, then sigmoid_backward and relu_backward compute: $$dZ^{[l]} = dA^{[l]} \ast g'\left(Z^{[l]}\right)$$
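
  For reference, a minimal sketch of what these two backward helpers could look like follows; the versions shipped in dnn_utils.py may differ in detail, but they compute exactly $dZ^{[l]} = dA^{[l]} \ast g'(Z^{[l]})$:

import numpy as np

# Minimal sketch of the activation backward helpers (the dnn_utils.py versions may differ).

def relu_backward(dA, cache):
    Z = cache
    dZ = np.array(dA, copy=True)    # the gradient passes through unchanged where Z > 0
    dZ[Z <= 0] = 0                  # and is zero where the ReLU was inactive
    return dZ

def sigmoid_backward(dA, cache):
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    dZ = dA * s * (1 - s)           # sigmoid'(Z) = sigmoid(Z) * (1 - sigmoid(Z))
    return dZ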

[Code]

# GRADED FUNCTION: linear_activation_backward

def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.

    Arguments:
    dA -- post-activation gradient for current layer l
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache

    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)

    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)

    return dA_prev, dW, db

[Test]

# Test linear_activation_backward
print("============== Testing linear_activation_backward ==============")
AL, linear_activation_cache = testCases.linear_activation_backward_test_case()

dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation="sigmoid")
print("sigmoid:")
print("dA_prev = " + str(dA_prev))
print("dW = " + str(dW))
print("db = " + str(db) + "\n")

dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation="relu")
print("relu:")
print("dA_prev = " + str(dA_prev))
print("dW = " + str(dW))
print("db = " + str(db))

[Result]

============== Testing linear_activation_backward ==============
sigmoid:
dA_prev = [[ 0.11017994  0.01105339]
 [ 0.09466817  0.00949723]
 [-0.05743092 -0.00576154]]
dW = [[ 0.10266786  0.09778551 -0.01968084]]
db = [[-0.05729622]]

relu:
dA_prev = [[ 0.44090989  0.        ]
 [ 0.37883606  0.        ]
 [-0.2298228   0.        ]]
dW = [[ 0.44513824  0.37371418 -0.10478989]]
db = [[-0.20837892]]

6.3 L-Layer Model Backward

  Now you will implement the backward propagation function for the whole network. Recall that when you implemented L_model_forward, at every iteration you stored a cache containing (A, W, b, and Z). In the backward propagation module we use these variables to compute the gradients. In the L_model_backward function we therefore iterate backwards through all the hidden layers, starting from layer $L$; at each step, we use the cached values of layer $l$ to backpropagate through layer $l$. The figure below shows the backward pass.

(Figure: the backward pass for the [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID model)

  For the output layer we have $A^{[L]} = \sigma\left(Z^{[L]}\right)$, so we first need to compute dAL, which can be done with the following line:

dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

  You can then use this post-activation gradient dAL to keep going backward. As shown in the figure above, you feed dAL into the LINEAR -> SIGMOID backward function you implemented (which uses the cache values stored by L_model_forward). After that, you use a for loop to iterate through all the other layers with the LINEAR -> RELU backward function, storing each dA, dW, and db in the grads dictionary.

[Code]

# GRADED FUNCTION: L_model_backward

def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group

    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])

    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ...
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches)  # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)  # after this line, Y is the same shape as AL

    # Initializing the backpropagation
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    # Lth layer (SIGMOID -> LINEAR) gradients. 
    # Inputs: "AL, Y, caches". 
    # Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]
    current_cache = caches[L - 1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = \
        linear_activation_backward(dAL, current_cache, activation="sigmoid")

    for l in reversed(range(L - 1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 2)], caches". 
        # Outputs: "grads["dA" + str(l + 1)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)] 
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 2)], 
                                                                    current_cache,
                                                                    activation="relu")
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
        
    return grads

[Test]

# Test L_model_backward
print("============== Testing L_model_backward ==============")
AL, Y_assess, caches = testCases.L_model_backward_test_case()
grads = L_model_backward(AL, Y_assess, caches)
print("dW1 = " + str(grads["dW1"]))
print("db1 = " + str(grads["db1"]))
print("dA1 = " + str(grads["dA1"]))

[Result]

============== Testing L_model_backward ==============
dW1 = [[0.41010002 0.07807203 0.13798444 0.10502167]
 [0.         0.         0.         0.        ]
 [0.05283652 0.01005865 0.01777766 0.0135308 ]]
db1 = [[-0.22007063]
 [ 0.        ]
 [-0.02835349]]
dA1 = [[ 0.          0.52257901]
 [ 0.         -0.3269206 ]
 [ 0.         -0.32070404]
 [ 0.         -0.74079187]]

6.4 Update Parameters

  Finally, the parameters of the model are updated using gradient descent: $$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]} \tag{7}$$ $$b^{[l]} = b^{[l]} - \alpha \, db^{[l]} \tag{8}$$ where $\alpha$ is the learning rate. After computing the updated parameters, store them back in the parameters dictionary.

[Code]

# GRADED FUNCTION: update_parameters

def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients, output of L_model_backward

    Returns:
    parameters -- python dictionary containing your updated parameters 
                  parameters["W" + str(l)] = ... 
                  parameters["b" + str(l)] = ...
    """

    L = len(parameters) // 2  # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    for l in range(L):
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]

    return parameters

[Test]

# Test update_parameters
print("============== Testing update_parameters ==============")
parameters, grads = testCases.update_parameters_test_case()
parameters = update_parameters(parameters, grads, 0.1)

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

[Result]

============== Testing update_parameters ==============
W1 = [[-0.59562069 -0.09991781 -2.14584584  1.82662008]
 [-1.76569676 -0.80627147  0.51115557 -1.18258802]
 [-1.0535704  -0.86128581  0.68284052  2.20374577]]
b1 = [[-0.04659241]
 [-1.28888275]
 [ 0.53405496]]
W2 = [[-0.55569196  0.0354055   1.32964895]]
b2 = [[-0.84610769]]

  With this, we have built all the functions needed for a deep neural network.
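
  The next post assembles these helpers into full models. As a preview, here is a minimal sketch of how they could be wired together into a training loop for an L-layer model; the hyperparameters below are illustrative placeholders, not the next post's settings:

import numpy as np

def L_layer_model(X, Y, layers_dims, learning_rate=0.0075, num_iterations=2500, print_cost=False):
    """Minimal sketch of a training loop built from the helpers above (illustrative only)."""
    np.random.seed(1)
    costs = []

    parameters = initialize_parameters_deep(layers_dims)

    for i in range(num_iterations):
        # Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID
        AL, caches = L_model_forward(X, parameters)

        # Cross-entropy cost
        cost = compute_cost(AL, Y)

        # Backward propagation
        grads = L_model_backward(AL, Y, caches)

        # Gradient descent step
        parameters = update_parameters(parameters, grads, learning_rate)

        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
            costs.append(cost)

    return parameters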
