Andrew Ng Machine Learning Notes: Multi-Layer Fully Connected Networks

Latest recommended article published on 2023-07-26 23:13:38

高级混子


Article link: https://blog.csdn.net/A843151774/article/details/108941586


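Multi-layer fully connected neural networks

Notation for multi-layer neural networks

When counting the layers of a network, we usually count only the hidden layers and the output layer. The network in the figure above is therefore a 4-layer neural network; letting L denote the number of layers, L = 4.

The input layer is conventionally called "layer 0". Since arrays in most programming languages are indexed from 0, this naming keeps the notation uniform.

Every layer from 1 to L has two parameters, W and b. To tell the layers apart we write them as $W^{[l]}, b^{[l]}$. Both are matrices, with the following dimensions: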

$W^{[l]}: (n^{[l]},n^{[l-1]})$

$b^{[l]}:(n^{[l]},1)$

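In the expressions above, $n^{[l]}$ is the number of neurons in layer l. This is where "layer 0" becomes useful: $W^{[1]}$ has dimensions $(n^{[1]},n^{[0]})$, and $n^{[0]}$ is exactly the number of neurons in the input layer (equivalently, the number of features of x).

As described earlier, each layer produces a linear output $Z^{[l]}$ and, after Z is passed through the activation function for a non-linear transformation, $A^{[l]}$. Their dimensions are: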

$Z^{[l]},A^{[l]}: (n^{[l]},m)$

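Here m is the number of input samples. Finally, gradient descent needs the partial derivatives. The derivative terms $dW^{[l]}, db^{[l]}$ have the same dimensions as the parameters $W^{[l]},b^{[l]}$: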

$dW^{[l]}: (n^{[l]},n^{[l-1]})$

$db^{[l]}:(n^{[l]},1)$
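The shape rules above can be checked numerically. The following is a minimal sketch, assuming a hypothetical size list `layer_dims = [5, 4, 3, 1]` and `m = 7` samples (both chosen only for illustration), with `tanh` standing in for an arbitrary element-wise activation:

```python
import numpy as np

# Hypothetical sizes: n[0]=5 input features, two hidden layers, one output unit.
layer_dims = [5, 4, 3, 1]
m = 7  # number of samples (assumed)

rng = np.random.default_rng(0)
A = rng.standard_normal((layer_dims[0], m))  # A[0] = X, shape (n[0], m)
shapes = {}

for l in range(1, len(layer_dims)):
    W = rng.standard_normal((layer_dims[l], layer_dims[l - 1]))  # (n[l], n[l-1])
    b = np.zeros((layer_dims[l], 1))                             # (n[l], 1)
    Z = W @ A + b     # b is broadcast across the m columns
    A = np.tanh(Z)    # an element-wise activation keeps the shape (n[l], m)
    shapes[l] = (W.shape, b.shape, Z.shape, A.shape)
    print(l, shapes[l])
```

The gradients dW and db would have the same shapes as the W and b entries recorded here.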

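General formulas for a multi-layer neural network

For a multi-layer network we want one general formula that holds for every layer l, so that each layer can be treated as a module or a "building block" (in programming terms, each layer is abstracted into one generic function). The whole model can then be described by calling this function repeatedly.

The schematic of one module is shown below (lowercase a and z indicate that a single sample is being processed):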

[Figure: schematic of one layer module (./multi-layer/2.jpg); image not available]

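During forward propagation, layer l receives the output $a^{[l-1]}$ of the previous layer as its input. Using its own $W^{[l]} ,b^{[l]}$ and activation function, it computes its output $a^{[l]}$, which is passed on to the next layer.

At the same time, it caches the $z^{[l]},a^{[l]}$ computed along the way, which are needed to compute derivatives during backward propagation.

Once forward propagation finishes, the cost function J is computed. The cost is then the "starting point" of backward propagation.

During backward propagation, layer l receives the derivative $da^{[l]}$ from the layer after it (layer l+1, or from the cost when l = L) and computes the derivatives of its own parameters, $dW^{[l]}, db^{[l]}$, which gradient descent uses to update the parameters. It also computes $da^{[l-1]}$, which layer l-1 needs in order to compute its own gradients.

The formulas are as follows.

Forward propagation:

$Z^{[l]}_{n^{[l]}*m} = W^{[l]}_{n^{[l]}*n^{[l-1]}} A^{[l-1]}_{n^{[l-1]}*m}+b^{[l]}_{n^{[l]}*1}$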

$A^{[l]}=g(Z^{[l]})$

Computing the cost function:

$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)) \tag{7}$

Backward propagation:

$dZ^{[l]}_{n^{[l]}*m} ={\partial J \over \partial Z^{[l]}}= dA^{[l]}_{n^{[l]}*m}\odot g'^{[l]}(Z^{[l]})_{n^{[l]}*m}$

$dW^{[l]}_{(n^{[l]}*n^{[l-1]})} = {\partial J \over \partial W^{[l]}} = {1\over m}dZ^{[l]}_{(n^{[l]}*m)} A^{[l-1]T}_{(m*n^{[l-1]})}$

$db^{[l]}_{(n^{[l]}*1)}={\partial J \over \partial b^{[l]}} = {1\over m}\sum^{m}_{i=1}dZ^{[l](i)}$

$dA^{[l-1]}_{(n^{[l-1]}*m)}={\partial J\over \partial A^{[l-1]}}=W^{[l]T}_{ (n^{[l-1]}*n^{[l]})} dZ^{[l]}_{(n^{[l]}*m)}$
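One way to convince yourself that these formulas are consistent is a numerical gradient check. The sketch below is not part of the course code: it assumes a single sigmoid layer with toy sizes, a helper named `cost` introduced here for illustration, and a fixed random seed. It compares the analytic dW from the formulas with a central-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, A_prev, Y):
    """Cross-entropy cost J of a single sigmoid layer (illustrative helper)."""
    A = sigmoid(W @ A_prev + b)
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m)

rng = np.random.default_rng(1)
n_prev, n, m = 3, 1, 5                      # assumed toy sizes
A_prev = rng.standard_normal((n_prev, m))
Y = rng.integers(0, 2, size=(n, m)).astype(float)
W = rng.standard_normal((n, n_prev)) * 0.1
b = np.zeros((n, 1))

# Analytic gradients from the backward propagation formulas.
A = sigmoid(W @ A_prev + b)
dA = -(Y / A - (1 - Y) / (1 - A))           # derivative of the cost w.r.t. A, without the 1/m factor
dZ = dA * A * (1 - A)                       # g'(Z) = A(1-A) for the sigmoid
dW = dZ @ A_prev.T / m
db = np.sum(dZ, axis=1, keepdims=True) / m

# Numerical gradient of J with respect to each entry of W.
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW_num[i, j] = (cost(W_plus, b, A_prev, Y) - cost(W_minus, b, A_prev, Y)) / (2 * eps)

max_diff = np.max(np.abs(dW - dW_num))
print("max |dW - dW_num| =", max_diff)
```

A very small difference indicates that the 1/m factor and the transpose in the dW formula are placed correctly.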

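Code for a multi-layer fully connected neural network

The code below comes from the programming assignment of Andrew Ng's course; each part is analyzed in turn.

Initializing the parameters

import numpy as np

# GRADED FUNCTION: initialize_parameters_deep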

def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """
    
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)            # length of layer_dims = number of layers + 1 (index 0 is the input layer)

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layer_dims[l],layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l],1))
        ### END CODE HERE ###
        
        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))

        
    return parameters

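Input of initialize_parameters_deep

The input is a list called "layer_dims". Its length is L+1: layer_dims[0] is the dimension of layer 0 (the input features), and layer_dims[1] through layer_dims[L] are the dimensions of layers 1 through L.

Output of initialize_parameters_deep

The output is a dictionary called parameters; $W^{[3]}$ is accessed as parameters["W3"], and likewise for the other entries.

Analysis of initialize_parameters_deep

The parameters are initialized with the shapes given by the following two formulas: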

$W^{[l]}: (n^{[l]},n^{[l-1]})$

$b^{[l]}:(n^{[l]},1)$

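W must not be initialized to all zeros: as explained in the course, with identical weights every unit in a layer computes the same output and receives the same gradient, so the units never become different from one another. The code initializes $W^{[1]},b^{[1]}$ through $W^{[L]},b^{[L]}$. Layer 0 has no W or b.

The interesting part is the layer_dims array. As noted above, its length is L+1, with layer_dims[0] holding the dimension of layer 0 (the input features) and layer_dims[1] through layer_dims[L] the dimensions of layers 1 through L. Layer L is the output layer and all the others are hidden layers, whose sizes can be chosen freely. The variable L in the program (call it L' to distinguish it from the number of layers) is therefore the number of layers plus 1. Because range excludes its upper bound L', the loop runs exactly from layer 1 to layer L.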
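The indexing can be seen in a small stand-alone sketch. It assumes a hypothetical `layer_dims = [5, 4, 3, 1]` and uses `np.random.default_rng` instead of the assignment's `np.random.seed` and `np.random.randn`, so the random values differ from the course output; the name `L_prime` mirrors the L' of the text.

```python
import numpy as np

layer_dims = [5, 4, 3, 1]      # hypothetical: n[0]=5 inputs, so L = 3 layers
L_prime = len(layer_dims)      # L' in the text: number of layers + 1

rng = np.random.default_rng(3)
parameters = {}
for l in range(1, L_prime):    # l = 1, ..., L because range excludes L'
    parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
    parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))

print(sorted(parameters))      # ['W1', 'W2', 'W3', 'b1', 'b2', 'b3']
print(len(parameters) // 2)    # 3, the number of layers L
```

The last line is the same expression that the later functions use to recover the number of layers from the dictionary.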

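Forward propagation

Forward propagation is split into two parts. The first part implements forward propagation for a single "module". The second part calls the functions from the first part repeatedly to carry out forward propagation through the whole network.

Part 1: forward propagation for a single generic module

Part 1 implements the following two formulas, using two functions.

$Z^{[l]}_{n^{[l]}*m} = W^{[l]}_{n^{[l]}*n^{[l-1]}} A^{[l-1]}_{n^{[l-1]}*m}+b^{[l]}_{n^{[l]}*1}$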

$A^{[l]}=g(Z^{[l]})$

The linear_forward function implements the first formula

# GRADED FUNCTION: linear_forward

def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter 
    cache -- a python tuple containing "A", "W" and "b" ; stored for computing the backward pass efficiently
    """
    
    ### START CODE HERE ### (≈ 1 line of code)
    Z = np.dot(W,A)+b
    ### END CODE HERE ###
    
    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)
    
    return Z, cache

The function takes the three arguments $W^{[l]},b^{[l]},A^{[l-1]}$ and returns the linear output $Z^{[l]}$. It also stores $W^{[l]},b^{[l]},A^{[l-1]}$ in the cache.

linear_activation_forward implements both formulas

def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function, also called the post-activation value 
    cache -- a python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """
    if activation == "sigmoid":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        ### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
        ### END CODE HERE ###
    
    elif activation == "relu":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        ### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
        ### END CODE HERE ###
    
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache

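This function also takes $W^{[l]},b^{[l]},A^{[l-1]}$, plus an additional argument naming the activation function. The value of that argument determines which operation is applied to the linear output $Z^{[l]}$.

Note that relu and sigmoid return not only the corresponding $A^{[l]}$ but also cache $Z^{[l]}$. Together with the cache from linear_forward(), four quantities are cached for layer l: $W^{[l]},b^{[l]},A^{[l-1]},Z^{[l]}$

Note the structure of cache: cache[0] = linear_cache = ( $A^{[l-1]},W^{[l]},b^{[l]})$ and cache[1] = activation_cache = ( $Z^{[l]}$ )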
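The post does not show the `sigmoid` and `relu` helpers that the course supplies. The sketch below is an assumption about their shape, inferred from how they are called above (each returns the activation and Z as its cache); the small input values are made up for illustration. It also builds one layer's cache by hand to show the nesting.

```python
import numpy as np

def sigmoid(Z):
    """Return A = 1 / (1 + exp(-Z)) and Z itself as the activation cache."""
    A = 1.0 / (1.0 + np.exp(-Z))
    return A, Z

def relu(Z):
    """Return A = max(0, Z) and Z itself as the activation cache."""
    A = np.maximum(0, Z)
    return A, Z

# One layer's cache, mirroring what linear_activation_forward assembles.
A_prev = np.array([[1.0, -2.0], [0.5, 3.0]])   # shape (2, 2): n[l-1]=2, m=2
W = np.array([[1.0, -1.0]])                    # shape (1, 2): n[l]=1
b = np.array([[0.5]])
Z = W @ A_prev + b
A, activation_cache = relu(Z)
cache = ((A_prev, W, b), activation_cache)

linear_cache, Z_cached = cache                 # cache[0] and cache[1]
print(Z)   # linear output, one column per sample
print(A)   # negative entries of Z are clipped to 0
```

Unpacking `cache` into two names, as in the last step, is usually easier to read than indexing with cache[0] and cache[1].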

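Part 2: combining L layers into a network

Part 2 implements forward propagation for the whole model by repeatedly calling the functions from Part 1.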

def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation
    
    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()
    
    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_activation_forward() (there are L of them, indexed from 0 to L-1)
    """

    caches = []
    A = X
    L = len(parameters) // 2                  # number of layers in the neural network
    
    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A 
        ### START CODE HERE ### (≈ 2 lines of code)
        A, cache = linear_activation_forward(A_prev, parameters["W"+str(l)],parameters["b"+str(l)] , 'relu')
        caches.append(cache)
        ### END CODE HERE ###
    
    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    ### START CODE HERE ### (≈ 2 lines of code)
    AL, cache = linear_activation_forward(A, parameters["W"+str(L)],parameters["b"+str(L)] , 'sigmoid')
    caches.append(cache)
    ### END CODE HERE ###
    
    assert(AL.shape == (1,X.shape[1]))
            
    return AL, caches

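Input of L_model_forward

The function takes two arguments: the training data X (a matrix of shape $(n^{[0]},m)$) and the parameters dictionary. Assuming every hidden layer uses relu and the output layer uses sigmoid, parameters describes the structure of the whole model.

Output of L_model_forward

The function returns the final result of forward propagation together with the caches of every layer collected along the way, L caches in total (L is the number of layers in the model).

Analysis of L_model_forward

First, L here is no longer the number of layers plus 1 but the number of layers itself. The for loop therefore handles L-1 layers, that is, all the hidden layers, using the 'relu' argument. The output layer is then handled separately with 'sigmoid'.

After every loop iteration, and after the separate output-layer step, the cache is appended to caches. caches is therefore a list of length L (indices 0 to L-1). Each element is a tuple (linear_cache, activation_cache) that together holds the four quantities $W^{[l]},b^{[l]},A^{[l-1]},Z^{[l]}$

Concretely, as described above: caches[l-1] is the cache of layer l, with cache[0] = linear_cache = ( $A^{[l-1]},W^{[l]},b^{[l]})$ and cache[1] = activation_cache = ( $Z^{[l]}$ )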

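Computing the cost function

Here the problem is assumed to be binary classification, and cross-entropy is used to measure how good the model is:

$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)) \tag{7}$

The Python code is very simple: it takes the predictions AL for the m samples and the true labels Y, and measures the error between them.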

def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (7).

    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    
    m = Y.shape[1]

    # Compute loss from aL and y.
    ### START CODE HERE ### (≈ 1 lines of code)
    cost = -(1/m) * np.sum(Y*np.log(AL)+(1-Y)*np.log(1-AL), axis=1, keepdims=True)
    ### END CODE HERE ###
    
    cost = np.squeeze(cost)      # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(cost.shape == ())
    
    return cost
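As a worked check of equation (7), the sketch below evaluates the cost on three made-up predictions and compares it with the same sum written out by hand. The clipping step at the end is an optional safeguard that is not in the course code; it keeps `np.log` from receiving exactly 0 when a prediction saturates.

```python
import numpy as np

AL = np.array([[0.9, 0.2, 0.6]])   # hypothetical predictions for m = 3 samples
Y = np.array([[1, 0, 1]])          # hypothetical labels
m = Y.shape[1]

# Vectorized form of equation (7).
cost = float(-np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m)

# The same value sample by sample: -log(0.9), -log(1 - 0.2), -log(0.6), averaged.
by_hand = -(np.log(0.9) + np.log(0.8) + np.log(0.6)) / 3

# Optional safeguard (an assumption, not in the assignment): clip away from 0 and 1.
eps = 1e-12
AL_safe = np.clip(AL, eps, 1 - eps)
cost_safe = float(-np.sum(Y * np.log(AL_safe) + (1 - Y) * np.log(1 - AL_safe)) / m)

print(cost, by_hand, cost_safe)
```

Each sample contributes only one of the two log terms, depending on whether its label is 1 or 0.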

Backward propagation

The backward propagation code implements the following four formulas:

$dZ^{[l]}_{n^{[l]}*m} ={\partial J \over \partial Z^{[l]}}= dA^{[l]}_{n^{[l]}*m}\odot g'^{[l]}(Z^{[l]})_{n^{[l]}*m}$

$dW^{[l]}_{(n^{[l]}*n^{[l-1]})} = {\partial J \over \partial W^{[l]}} = {1\over m}dZ^{[l]}_{(n^{[l]}*m)} A^{[l-1]T}_{(m*n^{[l-1]})}$

$db^{[l]}_{(n^{[l]}*1)}={\partial J \over \partial b^{[l]}} = {1\over m}\sum^{m}_{i=1}dZ^{[l](i)}$

$dA^{[l-1]}_{(n^{[l-1]}*m)}={\partial J\over \partial A^{[l-1]}}=W^{[l]T}_{ (n^{[l-1]}*n^{[l]})} dZ^{[l]}_{(n^{[l]}*m)}$

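As with forward propagation, the work is split into two parts. Part 1 implements the four formulas above, turning one layer of the network into a "module". Part 2 combines these modules into backward propagation for the complete network.

Part 1

Part 1 is done with two functions, mirroring forward propagation. The last three formulas correspond to the linear "WX+b" part of forward propagation, and the first formula corresponds to the non-linear step that passes the linear output through the activation function.

The linear_backward function implements the last three formulas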

def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]

    ### START CODE HERE ### (≈ 3 lines of code)
    dW = 1/m * np.dot(dZ,A_prev.T)
    db = 1/m * np.sum(dZ,axis=1, keepdims=True)
    dA_prev = np.dot(W.T,dZ)
    ### END CODE HERE ###
    
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    
    return dA_prev, dW, db

Computing the three formulas below requires the three quantities $dZ^{[l]},A^{[l-1]},W^{[l]}$. The last two are passed into the function as part of the cache tuple.

$dW^{[l]}_{(n^{[l]}*n^{[l-1]})} = {\partial J \over \partial W^{[l]}} = {1\over m}dZ^{[l]}_{(n^{[l]}*m)} A^{[l-1]T}_{(m*n^{[l-1]})}$

$db^{[l]}_{(n^{[l]}*1)}={\partial J \over \partial b^{[l]}} = {1\over m}\sum^{m}_{i=1}dZ^{[l](i)}$

$dA^{[l-1]}_{(n^{[l-1]}*m)}={\partial J\over \partial A^{[l-1]}}=W^{[l]T}_{ (n^{[l-1]}*n^{[l]})} dZ^{[l]}_{(n^{[l]}*m)}$

linear_activation_backward calls the function above to implement the full module

def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.
    
    Arguments:
    dA -- post-activation gradient for current layer l 
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache
    
    if activation == "relu":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###
        
    elif activation == "sigmoid":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###
    
    return dA_prev, dW, db

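Input: $dA^{[l]}$ for layer l, plus the cached quantities needed to compute the gradients (not repeated here; see the cache structure described earlier).

Output: $dW^{[l]}, db^{[l]},dA^{[l-1]}$.

Note the following: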

$dZ^{[l]}_{n^{[l]}*m} ={\partial J \over \partial Z^{[l]}}= dA^{[l]}_{n^{[l]}*m}\odot g'^{[l]}(Z^{[l]})_{n^{[l]}*m}$

This formula is already implemented by the provided relu_backward() and sigmoid_backward() helpers. They are called with $dA^{[l]}$ and the activation_cache of layer l.
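The bodies of these two helpers are not shown in the post. A minimal sketch of what they might look like, assuming the activation cache is just Z and taking the ReLU derivative at Z = 0 to be 0 (a convention, not something the source states):

```python
import numpy as np

def relu_backward(dA, activation_cache):
    """dZ = dA * g'(Z), where g'(Z) is 1 for Z > 0 and 0 otherwise."""
    Z = activation_cache
    dZ = np.array(dA, copy=True)   # work on a copy so dA is left untouched
    dZ[Z <= 0] = 0
    return dZ

def sigmoid_backward(dA, activation_cache):
    """dZ = dA * s * (1 - s), where s = sigmoid(Z)."""
    Z = activation_cache
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1 - s)

Z = np.array([[1.0, -4.5, 0.0]])   # made-up cached linear outputs
dA = np.array([[2.0, 3.0, 4.0]])   # made-up upstream gradients
print(relu_backward(dA, Z))        # [[2. 0. 0.]]
print(sigmoid_backward(dA, Z))
```

Both functions return an array with the same shape as Z, which is what linear_backward expects as dZ.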

Part 2

def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
    
    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])
    
    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ... 
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ... 
    """
    grads = {}
    L = len(caches) # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL
    
    # Initializing the backpropagation
    ### START CODE HERE ### (1 line of code)
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    ### END CODE HERE ###
    
    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "dAL, current_cache". Outputs: "grads["dAL-1"], grads["dWL"], grads["dbL"]
    ### START CODE HERE ### (approx. 2 lines)
    current_cache = caches[L-1]
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, 'sigmoid')
    ### END CODE HERE ###
    
    # Loop from l=L-2 to l=0
    for l in reversed(range(L-1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 1)], current_cache". Outputs: "grads["dA" + str(l)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)] 
        ### START CODE HERE ### (approx. 5 lines)
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA"+str(l+1)], current_cache, 'relu')
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
        ### END CODE HERE ###

    return grads

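Input: the predictions AL for the m samples, the labels Y of the m samples, and the caches saved during forward propagation.

Output: the grads dictionary, which holds all of the $d W, d b, d A$ values. grads["dW3"] = $dW^{[3]}$, and likewise for the others.

Analysis (a few points to note):

L = len(caches), so L here is the number of layers rather than the number of layers plus 1. caches[0] through caches[L-1] hold the caches of layers 1 through L.
$dA^{[L]}$ for the output layer cannot be obtained from the formula below, because there is no layer L+1 to supply $dZ^{[L+1]}$. It is computed directly from the cost instead: $dA^{[L]} = -\left({Y \over A^{[L]}} - {1-Y \over 1-A^{[L]}}\right)$ (element-wise), which is the dAL line in the code.

$dA^{[l-1]}_{(n^{[l-1]}*m)}={\partial J\over \partial A^{[l-1]}}=W^{[l]T}_{ (n^{[l-1]}*n^{[l]})} dZ^{[l]}_{(n^{[l]}*m)}$
Backward propagation runs from layer L down to layer 1, so the loop uses reversed(range(L-1)).
In for l in reversed(range(L-1)), l is an index into caches, so it corresponds to layer l+1.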
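The index bookkeeping in these points is easy to get wrong, so here is a small trace of it. It is only an illustration, assuming a hypothetical network with L = 4 layers; it performs no numerical work and just lists which cache and which gradient keys each step touches.

```python
L = 4  # hypothetical number of layers, i.e. len(caches)

# The output layer is handled first, outside the loop: caches[L-1] belongs to layer L.
steps = [(L - 1, L, "sigmoid")]

# The loop then walks the hidden layers from layer L-1 down to layer 1.
for l in reversed(range(L - 1)):      # l = L-2, ..., 0
    steps.append((l, l + 1, "relu"))  # caches[l] belongs to layer l+1

for cache_index, layer, activation in steps:
    print(f"caches[{cache_index}] -> layer {layer} ({activation}): "
          f"reads dA{layer}, writes dA{cache_index}, dW{layer}, db{layer}")
```

Every step reads the dA that the previous step wrote, which is why the loop has to run in reverse order.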

Part 3: updating the parameters

def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients, output of L_model_backward
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
                  parameters["W" + str(l)] = ... 
                  parameters["b" + str(l)] = ...
    """
    
    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    ### START CODE HERE ### (≈ 3 lines of code)
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)]-learning_rate*grads["dW"+str(l+1)]
        parameters["b" + str(l+1)] = parameters["b"+str(l+1)] - learning_rate*grads["db"+str(l+1)]
    ### END CODE HERE ###
    return parameters

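Input: all the W and b values, stored in the dictionary parameters; grads, which holds the gradient of every W and b; and the learning rate.

Output: all the W and b values after the update.

Procedure (λ is the learning rate):

$W^{[l]} \leftarrow W^{[l]} - \lambda dW^{[l]}$

$b^{[l]} \leftarrow b^{[l]} - \lambda db^{[l]}$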
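A one-step worked example of the update rule, using made-up values for a single layer and an assumed learning rate of 0.1:

```python
import numpy as np

learning_rate = 0.1   # lambda in the formulas (assumed value)
parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.2, -0.4]]), "db1": np.array([[1.0]])}

L = len(parameters) // 2
for l in range(1, L + 1):
    parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
    parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]

print(parameters["W1"])   # W1 - 0.1 * dW1
print(parameters["b1"])   # b1 - 0.1 * db1
```

Each parameter moves against its gradient: the first weight goes from 1.0 to 0.98, the second from -2.0 to -1.96, and the bias from 0.5 to 0.4.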

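Code summary

We start with the training set X, Y and a manually defined array layer_dims giving the number of units in each layer. The sizes of layer 0 and layer L are fixed by the data, while the layers in between can be chosen freely. The number of layers is L = len(layer_dims) - 1.

First, initialize the parameters of the network:

parameters = initialize_parameters_deep(layer_dims)

Then run forward propagation to obtain the output and the caches:

AL,caches = L_model_forward(X, parameters)

Compute the cost function:

cost = compute_cost(AL, Y)

Run backward propagation and update the parameters:

grads = L_model_backward(AL, Y, caches)
parameters = update_parameters(parameters, grads, learning_rate)

Backward propagation needs the values of W and b. This code does not read them from parameters; instead, they were stored a second time in caches during the forward pass.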
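In practice these four steps are repeated in a loop. The following is a condensed, self-contained sketch of such a loop, not the course's code: the function names, the toy data, the 0.5 initialization scale, the learning rate, and the iteration count are all assumptions made for illustration. It also uses the simplification dZ = AL - Y for the sigmoid output layer, which follows from multiplying dAL by AL(1 - AL), and it stores a flatter cache (A_prev, W, Z) than the nested tuples above.

```python
import numpy as np

def initialize(layer_dims, rng):
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.5
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

def forward(X, params):
    L = len(params) // 2
    caches, A = [], X
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = W @ A + b
        caches.append((A, W, Z))                      # A is still A[l-1] here
        A = np.maximum(0, Z) if l < L else 1 / (1 + np.exp(-Z))
    return A, caches

def compute_cost(AL, Y):
    AL = np.clip(AL, 1e-12, 1 - 1e-12)                # guard against log(0)
    return float(-np.mean(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)))

def backward(AL, Y, caches):
    L, m = len(caches), Y.shape[1]
    grads = {}
    dZ = AL - Y                                       # sigmoid output with cross-entropy cost
    for l in range(L, 0, -1):
        A_prev, W, _ = caches[l - 1]
        grads["dW" + str(l)] = dZ @ A_prev.T / m
        grads["db" + str(l)] = np.sum(dZ, axis=1, keepdims=True) / m
        if l > 1:
            Z_prev = caches[l - 2][2]
            dZ = (W.T @ dZ) * (Z_prev > 0)            # dA[l-1] times the ReLU derivative
    return grads

def update(params, grads, learning_rate):
    for key in params:
        params[key] = params[key] - learning_rate * grads["d" + key]
    return params

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))                     # toy inputs, n[0]=2, m=200
Y = (X[0:1] + X[1:2] > 0).astype(float)               # toy linearly separable labels
params = initialize([2, 4, 1], rng)

costs = []
for i in range(500):
    AL, caches = forward(X, params)
    costs.append(compute_cost(AL, Y))
    grads = backward(AL, Y, caches)
    params = update(params, grads, learning_rate=0.1)

print("first cost:", costs[0], "last cost:", costs[-1])
```

On this toy problem the cost should fall over the iterations; with the course functions the loop body would be the four calls listed in the summary.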
