Andrew Ng Machine Learning Notes: Multi-Layer Fully Connected Networks

Multi-Layer Fully Connected Neural Networks

Notation for Multi-Layer Neural Networks

When counting the layers of a network, only the hidden layers and the output layer are counted, so the figure above shows a 4-layer neural network. Let L denote the number of layers; here L = 4.

The input layer is conventionally called "layer 0". Since arrays in programming usually start at index 0, calling it layer 0 keeps the notation uniform.

Each layer (from 1 to L) has two parameters, W and b. To distinguish the W and b of different layers, we write them as $W^{[l]}, b^{[l]}$. Both are matrices, with the following dimensions:

$$W^{[l]}: (n^{[l]}, n^{[l-1]})$$

$$b^{[l]}: (n^{[l]}, 1)$$

In the expressions above, $n^{[l]}$ is the number of neurons in layer l. This is where "layer 0" pays off: the dimension of $W^{[1]}$ is $(n^{[1]}, n^{[0]})$, and $n^{[0]}$ is exactly the number of neurons in the input layer (which can also be understood as the number of features of x).

As described earlier, each layer produces a linear output $Z^{[l]}$ and, after passing Z through the activation function, a non-linear output $A^{[l]}$. Their dimensions are:

$$Z^{[l]}, A^{[l]}: (n^{[l]}, m)$$

where m is the number of input samples. Finally, gradient descent needs the partial-derivative terms $dW^{[l]}, db^{[l]}$, whose dimensions equal those of the corresponding parameters $W^{[l]}, b^{[l]}$:

$$dW^{[l]}: (n^{[l]}, n^{[l-1]})$$

$$db^{[l]}: (n^{[l]}, 1)$$
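
As a quick shape check, here is a small numpy sketch; the sizes ($n^{[0]} = 3$, $n^{[1]} = 2$, $m = 5$) are made up purely for illustration:

import numpy as np

n0, n1, m = 3, 2, 5              # illustrative sizes: 3 input features, 2 units in layer 1, 5 examples
W1 = np.random.randn(n1, n0)     # W[1]: (n[1], n[0])
b1 = np.zeros((n1, 1))           # b[1]: (n[1], 1)
A0 = np.random.randn(n0, m)      # A[0] = X: (n[0], m)
Z1 = np.dot(W1, A0) + b1         # Z[1]: (n[1], m); b[1] broadcasts across the m columns
print(Z1.shape)                  # (2, 5)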

General Formulas for a Multi-Layer Neural Network

For a multi-layer neural network, we want to abstract a general formula for every layer l, treating each layer as a module or a building block (in programming terms, abstracting each layer into a generic function). By calling this function repeatedly, the whole model can be described conveniently.

Below is a diagram of such a module (lowercase a and z indicate that a single sample is being processed):

[Figure: schematic of a single-layer module (./multi-layer/2.jpg)]

During forward propagation, layer l of the network receives the previous layer's output $a^{[l-1]}$ as its input. Using this layer's $W^{[l]}, b^{[l]}$ and activation function, it computes the layer's output $a^{[l]}$, which is passed on to the next layer.

At the same time, it needs to cache the $z^{[l]}$ and $a^{[l]}$ computed along the way, for use when computing derivatives during backward propagation.

After forward propagation finishes, the cost function J is computed; the cost is then used in backward propagation as its "starting point".

During backward propagation, the layer receives the derivative $da^{[l]}$ coming from layer l+1 and computes this layer's parameter derivatives $dW^{[l]}, db^{[l]}$, which gradient descent uses to update the parameter values. It also computes $da^{[l-1]}$, which the previous layer uses to update its own parameters.

The formulas are given below.

Forward propagation:

$$Z^{[l]}_{(n^{[l]} \times m)} = W^{[l]}_{(n^{[l]} \times n^{[l-1]})}\, A^{[l-1]}_{(n^{[l-1]} \times m)} + b^{[l]}_{(n^{[l]} \times 1)}$$

$$A^{[l]} = g(Z^{[l]})$$

Cost function:

$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right) \tag{7}$$

Backward propagation:

$$dZ^{[l]}_{(n^{[l]} \times m)} = \frac{\partial J}{\partial Z^{[l]}} = dA^{[l]}_{(n^{[l]} \times m)} \odot g'^{[l]}(Z^{[l]})_{(n^{[l]} \times m)}$$

$$dW^{[l]}_{(n^{[l]} \times n^{[l-1]})} = \frac{\partial J}{\partial W^{[l]}} = \frac{1}{m}\, dZ^{[l]}_{(n^{[l]} \times m)}\, A^{[l-1]T}_{(m \times n^{[l-1]})}$$

$$db^{[l]}_{(n^{[l]} \times 1)} = \frac{\partial J}{\partial b^{[l]}} = \frac{1}{m} \sum^{m}_{i=1} dZ^{[l](i)}$$

$$dA^{[l-1]}_{(n^{[l-1]} \times m)} = \frac{\partial J}{\partial A^{[l-1]}} = W^{[l]T}_{(n^{[l-1]} \times n^{[l]})}\, dZ^{[l]}_{(n^{[l]} \times m)}$$
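
These identities follow from the chain rule. For example, since $A^{[l]} = g(Z^{[l]})$ is applied element-wise, each entry satisfies

$$\frac{\partial J}{\partial Z^{[l]}_{ij}} = \frac{\partial J}{\partial A^{[l]}_{ij}} \cdot g'\!\left(Z^{[l]}_{ij}\right) \quad\Longrightarrow\quad dZ^{[l]} = dA^{[l]} \odot g'(Z^{[l]})$$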

Code for the Multi-Layer Fully Connected Neural Network

This is the code from the Andrew Ng course assignment; let's walk through it.

Initializing the parameters

# GRADED FUNCTION: initialize_parameters_deep

def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """
    
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)            # number of layers in the network, plus 1 (it also counts the input layer)

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layer_dims[l],layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l],1))
        ### END CODE HERE ###
        
        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))

        
    return parameters

Input of initialize_parameters_deep

The input is a list called "layer_dims". Its actual length is L+1: layer_dims[0] is the dimension of layer 0 (i.e., the input features), and layer_dims[1] through layer_dims[L] are the dimensions of layers 1 through L.

Output of initialize_parameters_deep

The output is a dictionary called parameters; $W^{[3]}$ is accessed as parameters["W3"], and so on for the rest.

Analysis of initialize_parameters_deep

The parameters are initialized according to the following two shape rules:

$$W^{[l]}: (n^{[l]}, n^{[l-1]})$$

$$b^{[l]}: (n^{[l]}, 1)$$

When initializing, W must not be set to all zeros (the reason was explained in the course). This step initializes $W^{[1]}, b^{[1]}$ through $W^{[L]}, b^{[L]}$; layer 0 has no W or b.

The interesting part is the layer_dims array. Its actual length is L+1: layer_dims[0] is the dimension of layer 0 (the input features), and layer_dims[1] through layer_dims[L] are the dimensions of layers 1 through L. Layer L is the output layer; the rest are hidden layers whose dimensions you can choose yourself. So the variable L defined inside the function (call it L' to avoid confusion) is the number of layers plus 1, but since range(1, L') excludes L', the loop iterates over exactly layers 1 through L.
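
As a quick sanity check, here is a minimal usage sketch (the layer sizes are made up for illustration; it assumes numpy is imported and the function above is defined, as in the assignment notebook):

layer_dims = [5, 4, 3, 1]                           # n[0]=5 features, two hidden layers, one output unit, so L = 3
parameters = initialize_parameters_deep(layer_dims)
print(parameters["W1"].shape)                       # (4, 5) == (n[1], n[0])
print(parameters["b1"].shape)                       # (4, 1)
print(parameters["W3"].shape)                       # (1, 3) == (n[3], n[2])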

Forward propagation

Forward propagation is split into two parts: the first implements forward propagation for one "module", and the second repeatedly calls the first part's functions to carry out forward propagation for the whole network.

Part 1: Forward propagation for a single generic module

Part 1 implements the following two formulas, using one function for each.

$$Z^{[l]}_{(n^{[l]} \times m)} = W^{[l]}_{(n^{[l]} \times n^{[l-1]})}\, A^{[l-1]}_{(n^{[l-1]} \times m)} + b^{[l]}_{(n^{[l]} \times 1)}$$

$$A^{[l]} = g(Z^{[l]})$$

linear_forward implements the first formula
# GRADED FUNCTION: linear_forward

def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter 
    cache -- a python tuple containing "A", "W" and "b" ; stored for computing the backward pass efficiently
    """
    
    ### START CODE HERE ### (≈ 1 line of code)
    Z = np.dot(W,A)+b
    ### END CODE HERE ###
    
    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)
    
    return Z, cache

This function takes the three quantities $W^{[l]}, b^{[l]}, A^{[l-1]}$ as input, computes the linear output $Z^{[l]}$, and at the same time caches $W^{[l]}, b^{[l]}, A^{[l-1]}$.

linear_activation_forward implements both formulas
def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function, also called the post-activation value 
    cache -- a python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """
    if activation == "sigmoid":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        ### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
        ### END CODE HERE ###
    
    elif activation == "relu":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        ### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
        ### END CODE HERE ###
    
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache

This function likewise takes $W^{[l]}, b^{[l]}, A^{[l-1]}$ as input, but in addition specifies the activation function; depending on its value, the appropriate operation is applied to the linear output $Z^{[l]}$.

Note that besides returning the corresponding $A^{[l]}$, the relu and sigmoid helpers also cache $Z^{[l]}$. Together with the cache from linear_forward(), layer l therefore caches four quantities in total: $W^{[l]}, b^{[l]}, A^{[l-1]}, Z^{[l]}$.

Pay attention to the structure of cache: cache[0] = linear_cache = ($A^{[l-1]}, W^{[l]}, b^{[l]}$), while cache[1] = activation_cache = $Z^{[l]}$.
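
For clarity, a small sketch of how one layer's cache unpacks (the variable names on the left are mine, chosen purely for illustration):

A, cache = linear_activation_forward(A_prev, W, b, activation="relu")
linear_cache, activation_cache = cache               # cache[0], cache[1]
A_prev_cached, W_cached, b_cached = linear_cache     # (A[l-1], W[l], b[l]), in the order stored by linear_forward
Z_cached = activation_cache                          # Z[l], stored by relu()/sigmoid()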

Part 2: Combining L layers into a network

Part 2 repeatedly calls the functions from Part 1 to implement forward propagation for the whole model.

def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation
    
    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()
    
    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_activation_forward() (there are L of them, indexed from 0 to L-1)
    """

    caches = []
    A = X
    L = len(parameters) // 2                  # number of layers in the neural network
    
    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A 
        ### START CODE HERE ### (≈ 2 lines of code)
        A, cache = linear_activation_forward(A_prev, parameters["W"+str(l)],parameters["b"+str(l)] , 'relu')
        caches.append(cache)
        ### END CODE HERE ###
    
    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    ### START CODE HERE ### (≈ 2 lines of code)
    AL, cache = linear_activation_forward(A, parameters["W"+str(L)],parameters["b"+str(L)] , 'sigmoid')
    caches.append(cache)
    ### END CODE HERE ###
    
    assert(AL.shape == (1,X.shape[1]))
            
    return AL, caches

Input of L_model_forward

The function takes two arguments: the training data X (a matrix of shape $(n^{[0]}, m)$) and the model's structure (assuming every hidden layer uses relu and the output layer uses sigmoid, the parameters dictionary describes the structure of the whole model).

Output of L_model_forward

The function outputs the final result of forward propagation, plus the caches of every layer collected along the way; there are L caches in total (L is the number of layers).

Analysis of L_model_forward

First, the L here is no longer the number of layers plus 1 but the number of layers itself, so the for loop handles L-1 layers, i.e. all the hidden layers, using the 'relu' argument. The output layer is then handled separately with 'sigmoid'.

Also, after every loop iteration, including the final separate step, the cache is appended to caches, so caches is a list of length L (indices 0 to L-1). Each of its elements is in turn a nested tuple holding the four quantities $W^{[l]}, b^{[l]}, A^{[l-1]}, Z^{[l]}$.

Concretely, as described above: caches[n] = cache, where cache[0] = linear_cache = ($A^{[l-1]}, W^{[l]}, b^{[l]}$) and cache[1] = activation_cache = $Z^{[l]}$.
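
A minimal usage sketch (the shapes are illustrative; it assumes numpy and the functions above are already defined, as in the assignment notebook):

np.random.seed(1)
X = np.random.randn(5, 7)                               # n[0]=5 features, m=7 examples (made-up sizes)
parameters = initialize_parameters_deep([5, 4, 3, 1])   # L = 3 layers
AL, caches = L_model_forward(X, parameters)
print(AL.shape)     # (1, 7): one sigmoid output per example
print(len(caches))  # 3: one cache per layer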

Computing the cost function

Here the problem is assumed to be binary classification, and cross-entropy is used to evaluate the model. The formula is:

$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right) \tag{7}$$

The python code here is very simple: it takes the predictions AL for the m samples and the true labels Y, and measures the discrepancy between them.

def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (7).

    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    
    m = Y.shape[1]

    # Compute loss from aL and y.
    ### START CODE HERE ### (≈ 1 lines of code)
    cost = -(1/m) * np.sum(Y*np.log(AL)+(1-Y)*np.log(1-AL), axis=1, keepdims=True)
    ### END CODE HERE ###
    
    cost = np.squeeze(cost)      # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(cost.shape == ())
    
    return cost
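
A tiny hand-checkable example (the values are made up):

AL = np.array([[0.9, 0.2]])     # predicted probabilities for two examples
Y = np.array([[1, 0]])          # true labels
print(compute_cost(AL, Y))      # ≈ 0.164, i.e. -(log(0.9) + log(0.8)) / 2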

Backward propagation

The backward-propagation code is responsible for implementing the following four formulas:

$$dZ^{[l]}_{(n^{[l]} \times m)} = \frac{\partial J}{\partial Z^{[l]}} = dA^{[l]}_{(n^{[l]} \times m)} \odot g'^{[l]}(Z^{[l]})_{(n^{[l]} \times m)}$$

$$dW^{[l]}_{(n^{[l]} \times n^{[l-1]})} = \frac{\partial J}{\partial W^{[l]}} = \frac{1}{m}\, dZ^{[l]}_{(n^{[l]} \times m)}\, A^{[l-1]T}_{(m \times n^{[l-1]})}$$

$$db^{[l]}_{(n^{[l]} \times 1)} = \frac{\partial J}{\partial b^{[l]}} = \frac{1}{m} \sum^{m}_{i=1} dZ^{[l](i)}$$

$$dA^{[l-1]}_{(n^{[l-1]} \times m)} = \frac{\partial J}{\partial A^{[l-1]}} = W^{[l]T}_{(n^{[l-1]} \times n^{[l]})}\, dZ^{[l]}_{(n^{[l]} \times m)}$$

As with forward propagation, this is split into two parts: the first implements the four formulas above, turning one layer of the network into a "module"; the second combines these modules into backward propagation for the complete network.

Part 1

Part 1 is split across two functions, mirroring forward propagation. The last three formulas correspond to the "WX + b" part of forward propagation, i.e. the linear part; the first formula corresponds to passing the linear output through the activation function, i.e. the non-linear part.

linear_backward implements the last three formulas
def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]

    ### START CODE HERE ### (≈ 3 lines of code)
    dW = 1/m * np.dot(dZ,A_prev.T)
    db = 1/m * np.sum(dZ,axis=1, keepdims=True)
    dA_prev = np.dot(W.T,dZ)
    ### END CODE HERE ###
    
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    
    return dA_prev, dW, db

To evaluate the three formulas below, we need the three quantities $dZ^{[l]}, A^{[l-1]}, W^{[l]}$; the latter two are passed into the function as part of the cache tuple.

$$dW^{[l]}_{(n^{[l]} \times n^{[l-1]})} = \frac{\partial J}{\partial W^{[l]}} = \frac{1}{m}\, dZ^{[l]}_{(n^{[l]} \times m)}\, A^{[l-1]T}_{(m \times n^{[l-1]})}$$

$$db^{[l]}_{(n^{[l]} \times 1)} = \frac{\partial J}{\partial b^{[l]}} = \frac{1}{m} \sum^{m}_{i=1} dZ^{[l](i)}$$

$$dA^{[l-1]}_{(n^{[l-1]} \times m)} = \frac{\partial J}{\partial A^{[l-1]}} = W^{[l]T}_{(n^{[l-1]} \times n^{[l]})}\, dZ^{[l]}_{(n^{[l]} \times m)}$$

linear_activation_backward calls the function above to complete the module
def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.
    
    Arguments:
    dA -- post-activation gradient for current layer l 
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache
    
    if activation == "relu":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###
        
    elif activation == "sigmoid":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###
    
    return dA_prev, dW, db

Function input: the current layer's $dA^{[l]}$ and the quantities needed to compute the gradients (not repeated here; see the cache structure described earlier).

Output: $dW^{[l]}, db^{[l]}, dA^{[l-1]}$.

One thing to note here:

$$dZ^{[l]}_{(n^{[l]} \times m)} = \frac{\partial J}{\partial Z^{[l]}} = dA^{[l]}_{(n^{[l]} \times m)} \odot g'^{[l]}(Z^{[l]})_{(n^{[l]} \times m)}$$

The formula above is already implemented by relu_backward() and sigmoid_backward(), thankfully. They are called with $dA^{[l]}$ and layer l's activation_cache.
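
These helpers come from the assignment's utility file. A minimal sketch of what they plausibly look like, following the formula above (the exact course implementation may differ):

def relu_backward(dA, activation_cache):
    # dZ = dA ⊙ g'(Z) for g = ReLU: the derivative is 1 where Z > 0 and 0 elsewhere
    Z = activation_cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ

def sigmoid_backward(dA, activation_cache):
    # dZ = dA ⊙ g'(Z) for g = sigmoid: the derivative is s(Z) * (1 - s(Z))
    Z = activation_cache
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)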

Part 2
def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
    
    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])
    
    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ... 
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ... 
    """
    grads = {}
    L = len(caches) # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL
    
    # Initializing the backpropagation
    ### START CODE HERE ### (1 line of code)
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    ### END CODE HERE ###
    
    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "dAL, current_cache". Outputs: "grads["dAL-1"], grads["dWL"], grads["dbL"]
    ### START CODE HERE ### (approx. 2 lines)
    current_cache = caches[L-1]
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, 'sigmoid')
    ### END CODE HERE ###
    
    # Loop from l=L-2 to l=0
    for l in reversed(range(L-1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 1)], current_cache". Outputs: "grads["dA" + str(l)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)] 
        ### START CODE HERE ### (approx. 5 lines)
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA"+str(l+1)], current_cache, 'relu')
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
        ### END CODE HERE ###

    return grads

Input: the predictions AL for the m samples, the labels Y for the m samples, and the caches kept during forward propagation.

Output: the grads dictionary, which holds all of $dW, db, dA$; grads["dW3"] = $dW^{[3]}$, and so on.

Process analysis (a few points to note):

  1. L = len(caches): here L is the number of layers, not the number of layers plus 1; caches[0] through caches[L-1] correspond to the caches of layers 1 through L.

  2. The output layer's $dA^{[L]}$ cannot be obtained from the formula below and has to be initialized separately (see the worked derivative after this list):

    $$dA^{[l-1]}_{(n^{[l-1]} \times m)} = \frac{\partial J}{\partial A^{[l-1]}} = W^{[l]T}_{(n^{[l-1]} \times n^{[l]})}\, dZ^{[l]}_{(n^{[l]} \times m)}$$

  3. Note that backpropagation runs from layer L back to layer 1, which is why the loop uses reversed(range(L-1)).

  4. Note that in for l in reversed(range(L-1)), l is used as an index into caches, so it corresponds to layer l+1.
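
Regarding point 2: the dAL line in the code comes from differentiating each example's cross-entropy loss with respect to its prediction $a^{[L](i)}$ (the $\frac{1}{m}$ factor is applied later, inside linear_backward's dW and db lines):

$$dA^{[L]} = -\left( \frac{Y}{A^{[L]}} - \frac{1 - Y}{1 - A^{[L]}} \right) \quad \text{(element-wise)}$$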

Part 3: Updating the parameters
def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients, output of L_model_backward
    learning_rate -- the learning rate, a scalar
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
                  parameters["W" + str(l)] = ... 
                  parameters["b" + str(l)] = ...
    """
    
    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    ### START CODE HERE ### (≈ 3 lines of code)
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)]-learning_rate*grads["dW"+str(l+1)]
        parameters["b" + str(l+1)] = parameters["b"+str(l+1)] - learning_rate*grads["db"+str(l+1)]
    ### END CODE HERE ###
    return parameters

Input: all the W and b, stored in the parameters dictionary; grads, holding the gradient of every W and b; and finally the learning rate.

Output: all the updated W and b.

Update rule:

$$W^{[l]} \leftarrow W^{[l]} - \lambda\, dW^{[l]}$$

$$b^{[l]} \leftarrow b^{[l]} - \lambda\, db^{[l]}$$

Code summary

We start with the training set X, Y and the user-defined layer_dims array giving the number of units in each layer; of course the sizes of layer 0 and layer L are fixed by the problem, while the layers in between can be chosen freely. The number of layers is L = len(layer_dims) - 1.

First, initialize the parameters that define the network's initial structure:

parameters = initialize_parameters_deep(layer_dims)

Then run forward propagation to obtain the output and the caches:

AL,caches = L_model_forward(X, parameters)

Compute the cost function:

cost = compute_cost(AL, Y)

Run backward propagation and update the parameters:

grads = L_model_backward(AL, Y, caches)
parameters = update_parameters(parameters, grads, learning_rate)

Note that backward propagation needs the values of W and b; in this code they are not read from parameters but are instead stored a second time inside caches.
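
Putting the pieces together, here is a minimal training-loop sketch (the function name L_layer_model and the hyperparameter defaults are illustrative; the course assignment defines a similar function):

def L_layer_model(X, Y, layer_dims, learning_rate=0.0075, num_iterations=2500, print_cost=False):
    # Train an L-layer [LINEAR->RELU]*(L-1) -> LINEAR->SIGMOID network with batch gradient descent.
    parameters = initialize_parameters_deep(layer_dims)

    for i in range(num_iterations):
        AL, caches = L_model_forward(X, parameters)               # forward pass
        cost = compute_cost(AL, Y)                                # cross-entropy cost
        grads = L_model_backward(AL, Y, caches)                   # backward pass
        parameters = update_parameters(parameters, grads, learning_rate)

        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))

    return parameters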
