Improving Deep Neural Networks

1. Setting Up Your Machine Learning Application

1.1 Training Set, Cross-Validation (Dev) Set, and Test Set

When training a neural network, we have to make several decisions:

  • how many layers the network should have
  • how many hidden units each layer should have
  • the learning rate
  • which activation function each layer uses

We usually cannot know the right values of these hyperparameters ahead of time, so we have to iterate: go from an idea to code to an experiment, then on to the next idea, looping again and again while trying different hyperparameter settings. How efficiently we can go around this loop is a key factor in how fast a project progresses.

Building high-quality training, validation (cross-validation / dev), and test sets helps make this loop more efficient: train the model on the training set, use the dev set to pick the best model, and once the final model is chosen, evaluate it on the test set.

When the dataset is small, a 70/30 or 60/20/20 split is common. When the dataset is very large, the dev and test sets become a much smaller fraction of the total, so the split may become 98/1/1 or even more extreme.
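
As a hedged illustration only (the helper name and the column-major data layout are my assumptions, matching the shape conventions used in the assignment code later in these notes), a 98/1/1 split of a large dataset might look like this:

import numpy as np

def split_dataset(X, Y, dev_frac=0.01, test_frac=0.01, seed=0):
    # X has shape (n_features, m), Y has shape (1, m); examples are columns
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)                 # shuffle before splitting
    X, Y = X[:, perm], Y[:, perm]
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    n_train = m - n_dev - n_test
    train = (X[:, :n_train], Y[:, :n_train])
    dev = (X[:, n_train:n_train + n_dev], Y[:, n_train:n_train + n_dev])
    test = (X[:, n_train + n_dev:], Y[:, n_train + n_dev:])
    return train, dev, test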

In general, make sure the dev set and the test set come from the same distribution; this keeps the machine learning iteration loop moving faster.

It is also fine to have no test set at all. The purpose of the test set is to give an unbiased estimate of the performance of the final chosen network; if you do not need that unbiased estimate, you can skip the test set.

1.2 Bias and Variance

[Figure: high bias corresponds to underfitting, high variance corresponds to overfitting]
We can read off bias and variance from the training-set error and the dev-set error (the judgments below assume a base error close to 0%; if the base error changes, the judgments change accordingly):

  • Suppose the training error is 1% and the dev error is 11%. The model does very well on the training set but clearly worse on the dev set; it has probably overfit the training set and fails to generalize to the dev set. This is high variance.
  • Suppose the training error is 15% and the dev error is 16%. The model does not even fit the training data well, i.e. the data is underfit, so the algorithm has high bias, and it does not fit the dev set well either.

1.3 Basic Recipe for Machine Learning

After training an initial model, first check whether the bias is high by looking at training-set performance. If it is, try a bigger network (more hidden layers or hidden units), train for longer, or try a better optimization algorithm. Keep iterating until the model fits the training data reasonably well.

Once the bias is down to an acceptable level, check the variance by looking at dev-set performance. If the variance is high, use more data or try regularization to reduce overfitting.

2. Regularizing Your Neural Network

2.1 Regularization

When the model overfits (high variance), consider regularization.
Taking logistic regression as an example, the regularized cost is
$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\lVert w \rVert_2^2$
The bias parameter b can be left out of the regularization term; including it makes little practical difference.
There are two variants, L1 and L2 regularization. With L1 regularization w ends up sparse, but that does not save much memory in practice, so L2 regularization is normally used.
λ is called the regularization parameter. It is usually tuned with the dev (cross-validation) set and is typically set to a fairly small value.

For a neural network, the regularization term sums the squared norms of all the weight matrices:
$J(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\lVert W^{[l]} \rVert_F^2$
The matrix norm used here (the squared norm) is the sum of the squares of all entries of the matrix; it is called the Frobenius norm (the matrix analogue of the L2 norm).
How is this used during training? First compute dW with backpropagation, then compute the updated value of W, as sketched below.
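
A minimal sketch of the regularized update for one weight matrix, assuming dW_from_backprop, W, lambd, m, and learning_rate already exist (these names are mine, not from the assignment; the graded version appears in section 4.2):

# dW_from_backprop is the gradient of the unregularized cost for this layer
dW = dW_from_backprop + (lambd / m) * W    # extra term from the Frobenius-norm penalty
W = W - learning_rate * dW                 # the usual gradient descent step ("weight decay")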

2.2 Dropout Regularization

For an overfitting network, dropout works as follows: conceptually make a copy of the network, go through every layer and set a probability of eliminating each node, then remove those nodes together with all of their incoming and outgoing connections. This leaves a smaller, thinned-out network, which is then trained with backprop. Each training example is effectively trained on its own thinned network.

The standard implementation is inverted dropout, illustrated here with a three-layer network (a consolidated sketch follows the list):

  • First define a dropout mask d; d3 denotes the dropout mask for layer 3.
d3 = np.random.rand(a3.shape[0], a3.shape[1])
  • Then compare each entry of d3 with a number called keep_prob, the probability of keeping a given hidden unit. The comparison turns d3 into a random binary mask.
  • Next take the activations of layer 3 (a3 holds the activations to be filtered) and multiply element-wise by the mask:
a3 = np.multiply(a3, d3)
  • This zeroes out the activations wherever d3 equals 0.
  • Finally scale a3 back up by dividing it by keep_prob. To see why, suppose layer 3 has 50 units, so after the steps above a3 has shape 50 × m. With keep_prob = 0.8, on average 10 units are zeroed out.
a3 = a3 / keep_prob
  • Z4 is then computed as Z4 = W4 * a3 + b4. Since a3 has been reduced by 20% in expectation, dividing a3 by 0.8 compensates for the missing 20%, so the expected value of Z4 is unchanged. This division by keep_prob is what makes the method "inverted" dropout: whatever keep_prob is, the expected value of a3 stays the same.
  • At test time this makes evaluating the network easier, because no extra scaling is needed. If a particular layer is more likely to overfit, its keep_prob can be set lower than the other layers'. The downside is that cross-validation then has to search over more hyperparameters.
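
Putting the list above together, a minimal sketch of inverted dropout on layer 3 during forward propagation (the example a3 and keep_prob = 0.8 are made up here; the graded assignment version is in section 4.2 below):

import numpy as np

a3 = np.random.rand(50, 10)                      # example activations: 50 units, 10 examples
keep_prob = 0.8                                  # probability of keeping a unit
d3 = np.random.rand(a3.shape[0], a3.shape[1])    # random matrix with the same shape as a3
d3 = (d3 < keep_prob).astype(int)                # binary mask: 1 = keep, 0 = drop
a3 = np.multiply(a3, d3)                         # shut down the dropped units
a3 = a3 / keep_prob                              # invert: keep the expected value of a3 unchanged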

2.3 Other Regularization Methods

1. Data augmentation: apply simple transformations to the existing data to create additional synthetic examples. This costs far less than collecting more data.
2. Early stopping: stop the training iterations partway through, at a point where the Frobenius norm of W is of moderate size (a minimal sketch is shown below). The drawback is that optimizing the cost function and preventing overfitting can no longer be treated as two independent problems.
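
A minimal sketch of early stopping driven by the dev-set cost, assuming hypothetical helpers train_one_epoch and compute_dev_cost and an existing parameters dictionary (none of these come from the assignment):

import copy

best_cost, best_parameters, patience, bad_epochs = float("inf"), None, 10, 0
for epoch in range(max_epochs):
    parameters = train_one_epoch(parameters)        # one pass of gradient descent
    dev_cost = compute_dev_cost(parameters)         # monitor performance on the dev set
    if dev_cost < best_cost:
        best_cost = dev_cost
        best_parameters = copy.deepcopy(parameters) # remember the best model so far
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                  # dev cost stopped improving
            break
parameters = best_parameters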

3. Setting Up Your Optimization Problem

3.1 Normalizing Inputs

Step 1: zero-center the data. Let μ be the mean of the inputs and subtract it from every example:
$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad x := x - \mu$
Step 2: normalize the variance:
$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}\right)^2 \ \text{(element-wise)}, \qquad x := x / \sigma$
The same μ and σ must be used on both the training set and the test set.
Normalizing the features is especially important when the input features live on very different scales.
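
A minimal numpy sketch of these two steps, applied to made-up data with features on very different scales (the small epsilon is my addition to avoid dividing by zero):

import numpy as np

X_train = np.random.randn(3, 100) * np.array([[1.0], [100.0], [0.01]])  # toy training data
X_test = np.random.randn(3, 20) * np.array([[1.0], [100.0], [0.01]])    # toy test data
mu = np.mean(X_train, axis=1, keepdims=True)       # per-feature mean from the training set
sigma = np.std(X_train, axis=1, keepdims=True)     # per-feature standard deviation
X_train = (X_train - mu) / (sigma + 1e-8)          # zero mean, unit variance
X_test = (X_test - mu) / (sigma + 1e-8)            # reuse the training-set statistics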

3.2 Vanishing and Exploding Gradients

When training deep networks, the derivatives can become very small (vanish) or very large (explode), which makes training harder.
Recall that z is a linear combination of the w's and x's, so the larger the number of inputs n to a unit, the smaller we want each individual w to be. A sensible choice is therefore to initialize a layer's weight matrix as:

w = np.random.randn(shape) * np.sqrt(1 / n)   # n = number of input features feeding this layer
# With ReLU activations, using 2 instead of 1 in the numerator (He initialization)
# often works better and further reduces vanishing/exploding gradients.

3.3 Gradient Checking

When doing gradient checking, approximate the derivative with the two-sided difference, i.e. verify the partial derivatives you computed against the limit definition $\frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon}$.
The discrepancy is then measured with $\frac{\lVert grad - gradapprox \rVert_2}{\lVert grad \rVert_2 + \lVert gradapprox \rVert_2}$; the smaller this value, the better.
Points to note:

  • Do not run gradient checking during training; use it only for debugging, because computing the approximate gradients is very slow.
  • If gradient checking fails, look at which components are off; that helps you track down where the bug is.
  • If regularization is used, the gradient computation must include the regularization term.
  • Gradient checking cannot be used together with dropout.
  • Run gradient checking after random initialization, and possibly again after some training (rarely needed).

4. Exercise 1

4.1 Parameter Initialization

Method 1: initialize everything to zero. This works poorly.

def initialize_parameters_zeros(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    
    parameters = {}
    L = len(layers_dims)            # number of layers in the network
    
    for l in range(1, L):
        #(≈ 2 lines of code)
        # parameters['W' + str(l)] = 
        # parameters['b' + str(l)] = 
        # YOUR CODE STARTS HERE
        
        parameters['W' + str(l)] = np.zeros((layers_dims[l],layers_dims[l-1])) 
        parameters['b' + str(l)] = np.zeros((layers_dims[l],1))
        
        # YOUR CODE ENDS HERE
    return parameters

The cost versus iterations looks like this:
[Figure: cost curve for zero initialization]

Method 2: random initialization (weights scaled by 10)

"""
Implement the following function to initialize your weights to large random values (scaled by *10) and 
your biases to zeros. Use np.random.randn(..,..) * 10 for weights and np.zeros((.., ..)) for biases. 
You're using a fixed np.random.seed(..) to make sure your "random" weights match ours,
so don't worry if running your code several times always gives you the same initial values for the parameters.
"""

def initialize_parameters_random(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    
    np.random.seed(3)               # This seed makes sure your "random" numbers will be the same as ours
    parameters = {}
    L = len(layers_dims)            # integer representing the number of layers
    
    for l in range(1, L):
        #(≈ 2 lines of code)
        # parameters['W' + str(l)] = 
        # parameters['b' + str(l)] =
        # YOUR CODE STARTS HERE
        
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1])*10
        parameters['b' + str(l)] = np.zeros((layers_dims[l],1))
        
        # YOUR CODE ENDS HERE

    return parameters

The cost versus iterations looks like this:
[Figure: cost curve for random initialization]
Method 3: He initialization, which scales the random weights by sqrt(2 / layers_dims[l-1]), as shown in the code below.

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1 # integer representing the number of layers
     
    for l in range(1, L + 1):
        #(≈ 2 lines of code)
        # parameters['W' + str(l)] = 
        # parameters['b' + str(l)] =
        # YOUR CODE STARTS HERE
        
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2./layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l],1))
        
        # YOUR CODE ENDS HERE
        
    return parameters

The cost versus iterations looks like this:
[Figure: cost curve for He initialization]

4.2 Applying Regularization to the Model

1. Define the model (a three-layer network that can optionally use L2 regularization or dropout):

def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- If True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob - probability of keeping a neuron active during drop-out, scalar.
    
    Returns:
    parameters -- parameters learned by the model. They can then be used to predict.
    """
        
    grads = {}
    costs = []                            # to keep track of the cost
    m = X.shape[1]                        # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]
    
    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)
        
        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)
            
        # Backward propagation.
        assert (lambd == 0 or keep_prob == 1)   # it is possible to use both L2 regularization and dropout, 
                                                # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)
        
        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)
        
        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)
    
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()
    
    return parameters

2. L2 regularization. The regularized cost adds $\frac{\lambda}{2m}\sum_{l}\lVert W^{[l]} \rVert_F^2$ to the cross-entropy cost (formula (2) in the assignment):

# Compute the cost with L2 regularization

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.
    
    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model
    
    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    
    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost
    
    #(≈ 1 lines of code)
    # L2_regularization_cost = 
    # YOUR CODE STARTS HERE
    
    L2_regularization_cost = (1 / m) * (lambd / 2) * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
    
    # YOUR CODE ENDS HERE
    
    cost = cross_entropy_cost + L2_regularization_cost
    
    return cost

For the backward pass, each dW gains an extra term (lambd / m) * W from the regularization:

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.
    Backward propagation with L2 regularization.

    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    #(≈ 1 lines of code)
    # dW3 = 1./m * np.dot(dZ3, A2.T) + None
    # YOUR CODE STARTS HERE
    
    dW3 = 1./m * np.dot(dZ3, A2.T) + (lambd / m) * W3
    
    # YOUR CODE ENDS HERE
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    #(≈ 1 lines of code)
    # dW2 = 1./m * np.dot(dZ2, A1.T) + None
    # YOUR CODE STARTS HERE
    
    dW2 = 1./m * np.dot(dZ2, A1.T) + (lambd / m) * W2
    
    # YOUR CODE ENDS HERE
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    #(≈ 1 lines of code)
    # dW1 = 1./m * np.dot(dZ1, X.T) + None
    # YOUR CODE STARTS HERE
    
    dW1 = 1./m * np.dot(dZ1, X.T) + (lambd / m) * W1
    
    # YOUR CODE ENDS HERE
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

3. Dropout regularization. Forward propagation applies a random mask to the activations of layers 1 and 2 and rescales them by keep_prob:

def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation
    """
    
    np.random.seed(1)
    
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    #(≈ 4 lines of code)         # Steps 1-4 below correspond to the Steps 1-4 described above. 
    # D1 =                                           # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    # D1 =                                           # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    # A1 =                                           # Step 3: shut down some neurons of A1
    # A1 =                                           # Step 4: scale the value of neurons that haven't been shut down
    # YOUR CODE STARTS HERE
    
    D1 = np.random.rand(A1.shape[0],A1.shape[1])                                          
    D1 = (D1 < keep_prob).astype(int)                                          
    A1 = A1 * D1                                          
    A1 = A1 / keep_prob
    
    # YOUR CODE ENDS HERE
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    #(≈ 4 lines of code)
    # D2 =                                           # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    # D2 =                                           # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    # A2 =                                           # Step 3: shut down some neurons of A2
    # A2 =                                           # Step 4: scale the value of neurons that haven't been shut down
    # YOUR CODE STARTS HERE
    
    D2 = np.random.rand(A2.shape[0],A2.shape[1])                                          
    D2 = (D2 < keep_prob).astype(int)                                          
    A2 = A2 * D2                                         
    A2 = A2 / keep_prob   
    
    # YOUR CODE ENDS HERE
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    
    return A3, cache

The backward pass must apply the same masks D1 and D2 (saved in the cache) and the same scaling by keep_prob to the corresponding dA terms:

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)
    dA2 = np.dot(W3.T, dZ3)
    #(≈ 2 lines of code)
    # dA2 =                # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    # dA2 =                # Step 2: Scale the value of neurons that haven't been shut down
    # YOUR CODE STARTS HERE
    
    dA2 = dA2 * D2               
    dA2 = dA2 / keep_prob             
    
    # YOUR CODE ENDS HERE
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)
    
    dA1 = np.dot(W2.T, dZ2)
    #(≈ 2 lines of code)
    # dA1 =                # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    # dA1 =                # Step 2: Scale the value of neurons that haven't been shut down
    # YOUR CODE STARTS HERE
    
    dA1 =  dA1 * D1         
    dA1 =  dA1 / keep_prob              
    
    # YOUR CODE ENDS HERE
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

4.3 Gradient Checking

After computing the cost with forward propagation and the parameter gradients with backward propagation, use the definition of the derivative to check that the gradients computed by backprop are correct.
1. One-dimensional gradient check
The forward propagation function; here the cost is simply J(theta) = theta * x.

def forward_propagation(x, theta):
    """    
    Implement the linear forward propagation (compute J) presented in Figure 1 (J(theta) = theta * x)
    
    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    
    Returns:
    J -- the value of function J, computed using the formula J(theta) = theta * x
    """
    
    # (approx. 1 line)
    # J = 
    # YOUR CODE STARTS HERE
    
    J = theta * x
    
    # YOUR CODE ENDS HERE
    
    return J

From the cost expression, the derivative of J with respect to theta is x. The backward propagation function is:

def backward_propagation(x, theta):
    """
    Computes the derivative of J with respect to theta (see Figure 1).
    
    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    
    Returns:
    dtheta -- the gradient of the cost with respect to theta
    """
    
    # (approx. 1 line)
    # dtheta = 
    # YOUR CODE STARTS HERE
    
    dtheta = x
    
    # YOUR CODE ENDS HERE
    
    return dtheta

Then run the gradient check:

def gradient_check(x, theta, epsilon=1e-7, print_msg=False):
    """
    Implement the backward propagation presented in Figure 1.
    
    Arguments:
    x -- a float input
    theta -- our parameter, a float as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient. Float output
    """
    
    # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.
    # (approx. 5 lines)
    # theta_plus =                                 # Step 1
    # theta_minus =                                # Step 2
    # J_plus =                                    # Step 3
    # J_minus =                                   # Step 4
    # gradapprox =                                # Step 5
    # YOUR CODE STARTS HERE
     
    theta_plus = theta + epsilon                             
    theta_minus = theta - epsilon                               
    J_plus = forward_propagation(x,theta_plus)                                   
    J_minus = forward_propagation(x,theta_minus)                                 
    gradapprox = (J_plus - J_minus) / (theta_plus - theta_minus)
                  
    # YOUR CODE ENDS HERE
    
    # Check if gradapprox is close enough to the output of backward_propagation()
    #(approx. 1 line) DO NOT USE "grad = gradapprox"
    # grad =
    # YOUR CODE STARTS HERE
    
    grad = backward_propagation(x, theta)
    
    # YOUR CODE ENDS HERE
    
    #(approx. 3 lines)
    # numerator =                                 # Step 1'
    # denominator =                               # Step 2'
    # difference =                                # Step 3'
    # YOUR CODE STARTS HERE
    
    numerator = np.linalg.norm(grad - gradapprox)                                
    denominator = np.linalg.norm(grad) +np.linalg.norm(gradapprox)                               
    difference =  numerator / denominator                              
    
    # YOUR CODE ENDS HERE
    if print_msg:
        if difference > 2e-7:
            print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
        else:
            print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    
    return difference

2. N-dimensional gradient check
Forward propagation and cost computation:

def forward_propagation_n(X, Y, parameters):
    """
    Implements the forward propagation (and computes the cost) presented in Figure 3.
    
    Arguments:
    X -- training set for m examples
    Y -- labels for m examples 
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (5, 4)
                    b1 -- bias vector of shape (5, 1)
                    W2 -- weight matrix of shape (3, 5)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    
    Returns:
    cost -- the cost function (logistic cost for one example)
    cache -- a tuple with the intermediate values (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)

    """
    
    # retrieve parameters
    m = X.shape[1]
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    # Cost
    log_probs = np.multiply(-np.log(A3),Y) + np.multiply(-np.log(1 - A3), 1 - Y)
    cost = 1. / m * np.sum(log_probs)
    
    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)
    
    return cost, cache

Backward propagation; the commented-out lines show the buggy code before it was fixed:

def backward_propagation_n(X, Y, cache):
    """
    Implement the backward propagation presented in figure 2.
    
    Arguments:
    X -- input datapoint, of shape (input size, 1)
    Y -- true "label"
    cache -- cache output from forward_propagation_n()
    
    Returns:
    gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.
    """
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    #dW2 = 1. / m * np.dot(dZ2, A1.T) * 2
    dW2 = 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    #db1 = 4. / m * np.sum(dZ1, axis=1, keepdims=True)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

Now run the gradient check. Note that all the parameters are stored as a dictionary in parameters; to run the check they first have to be flattened into a single vector, and converted back into a dictionary afterwards (the assignment provides dictionary_to_vector and vector_to_dictionary for this).

def gradient_check_n(parameters, gradients, X, Y, epsilon=1e-7, print_msg=False):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n
    
    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    grad -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters. 
    x -- input datapoint, of shape (input size, 1)
    y -- true "label"
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    
    # Compute gradapprox
    for i in range(num_parameters):
        
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function you have to outputs two parameters but we only care about the first one
        #(approx. 3 lines)
        # theta_plus =                                        # Step 1
        # theta_plus[i] =                                     # Step 2
        # J_plus[i], _ =                                     # Step 3
        # YOUR CODE STARTS HERE
        
        theta_plus = np.copy(parameters_values)                                      
        theta_plus[i] = theta_plus[i] + epsilon                                   
        J_plus[i], _ = forward_propagation_n(X, Y,vector_to_dictionary(theta_plus))                                    
        
        # YOUR CODE ENDS HERE
        
        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        #(approx. 3 lines)
        # theta_minus =                                    # Step 1
        # theta_minus[i] =                                 # Step 2        
        # J_minus[i], _ =                                 # Step 3
        # YOUR CODE STARTS HERE
        
        theta_minus = np.copy(parameters_values)                                      
        theta_minus[i] = theta_minus[i] - epsilon                                   
        J_minus[i], _ = forward_propagation_n(X, Y,vector_to_dictionary(theta_minus))
        
        # YOUR CODE ENDS HERE
        
        # Compute gradapprox[i]
        # (approx. 1 line)
        # gradapprox[i] = 
        # YOUR CODE STARTS HERE
        
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
        
        # YOUR CODE ENDS HERE
    
    # Compare gradapprox to backward propagation gradients by computing difference.
    # (approx. 3 line)
    # numerator =                                             # Step 1'
    # denominator =                                           # Step 2'
    # difference =                                            # Step 3'
    # YOUR CODE STARTS HERE
    
    numerator = np.linalg.norm(grad - gradapprox)                                
    denominator = np.linalg.norm(grad) +np.linalg.norm(gradapprox)                               
    difference =  numerator / denominator  
    
    # YOUR CODE ENDS HERE
    if print_msg:
        if difference > 2e-7:
            print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
        else:
            print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference

5. Optimization Algorithms

5.1 Mini-batch Gradient Descent

With a very large dataset, ordinary gradient descent has to process the entire training set before taking a single step.
A faster approach is to start taking gradient descent steps before finishing a pass over the whole training set.
Concretely: split the training set into smaller subsets called mini-batches (Y is split the same way), then for each mini-batch run forward propagation, compute the cost, run backward propagation, and update the parameters.

With the previous approach, one pass over the training set allowed only one gradient descent step; with mini-batch gradient descent, one pass over the training set allows as many steps as there are mini-batches.

Notation: a superscript in curly braces, e.g. X^{t}, denotes the t-th mini-batch.

If the mini-batch size equals m, this is exactly the batch gradient descent used before.
If the mini-batch size equals 1, it becomes stochastic gradient descent, which keeps oscillating around the neighborhood of the optimum instead of settling on it.
For very large datasets, the mini-batch size is usually a power of two such as 64, 128, 256, or 512, which can make the computation faster.

Also, as the code below shows, stochastic gradient descent needs three loops: over the iterations, over the m training examples, and over the layers when updating the parameters.

# (Batch) gradient descent:
X = data_input
Y = labels
m = X.shape[1]  # Number of training examples
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost
    cost_total = compute_cost(a, Y)  # Cost for m training examples
    # Backward propagation
    grads = backward_propagation(a, caches, parameters)
    # Update parameters
    parameters = update_parameters(parameters, grads)
    # Compute average cost
    cost_avg = cost_total / m
# Stochastic gradient descent:
X = data_input
Y = labels
m = X.shape[1]  # Number of training examples
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    cost_total = 0
    for j in range(0, m):
        # Forward propagation
        a, caches = forward_propagation(X[:,j], parameters)
        # Compute cost
        cost_total += compute_cost(a, Y[:,j])  # Cost for one training example
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters
        parameters = update_parameters(parameters, grads)
    # Compute average cost
    cost_avg = cost_total / m

Running mini-batch gradient descent takes two main steps:
Step 1: shuffle the dataset.
Step 2: partition the shuffled dataset into chunks of size mini_batch_size.
The assignment already shows how to slice out one mini-batch:

first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size] 

The assignment code is below:

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer
    
    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    
    np.random.seed(seed)            # To make your "random" minibatches the same as ours
    m = X.shape[1]                  # number of training examples
    mini_batches = []
        
    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))
    
    inc = mini_batch_size

    # Step 2 - Partition (shuffled_X, shuffled_Y).
    # Cases with a complete mini batch size only i.e each of 64 examples.
    num_complete_minibatches = math.floor(m / mini_batch_size) # number of mini batches of size mini_batch_size in your partitionning
    for k in range(0, num_complete_minibatches):
        # (approx. 2 lines)
        # mini_batch_X =  
        # mini_batch_Y =
        # YOUR CODE STARTS HERE
        
        mini_batch_X =  shuffled_X[:,k * mini_batch_size: (k+1) * mini_batch_size]
        mini_batch_Y =  shuffled_Y[:,k * mini_batch_size: (k+1) * mini_batch_size]
        
        # YOUR CODE ENDS HERE
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    # For handling the end case (last mini-batch < mini_batch_size i.e less than 64)
    if m % mini_batch_size != 0:
        #(approx. 2 lines)
        # mini_batch_X =
        # mini_batch_Y =
        # YOUR CODE STARTS HERE
        
        mini_batch_X = shuffled_X[:,math.floor(m/mini_batch_size)*mini_batch_size: m ]
        mini_batch_Y = shuffled_Y[:,math.floor(m/mini_batch_size)*mini_batch_size: m ]
        
        # YOUR CODE ENDS HERE
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    return mini_batches

5.2 Other Optimization Algorithms

5.2.1 Preliminary: Exponentially Weighted Averages

$v_t = \beta v_{t-1} + (1 - \beta)\,\theta_t$
Here β is a parameter between 0 and 1.
A larger β gives a smoother curve with smaller fluctuations, because it averages over more values.
A smaller β produces a noisier curve that reacts more strongly to outliers.

An ordinary average gives each of the n terms a weight of 1/n, whereas the exponentially weighted average gives the terms exponentially decaying weights.
Its advantage is that it takes only one line of code and essentially a single number of memory.

There is also a technique called bias correction that makes the average more accurate.
During the first few steps, the v_t computed with the formula above is too small, so in the early phase v_t is divided by (1 - β^t). As t grows, this denominator approaches 1 and the correction has essentially no effect. Bias correction is usually only worth doing when you care about accuracy during the initial phase. A small sketch follows.
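
A minimal sketch of the exponentially weighted average with bias correction (the noisy data series here is made up purely for illustration):

import numpy as np

np.random.seed(0)
data = np.random.randn(100) + 10           # made-up noisy series to be smoothed
beta = 0.9
v = 0.0
averaged = []
for t, theta_t in enumerate(data, start=1):
    v = beta * v + (1 - beta) * theta_t    # exponentially weighted average
    v_corrected = v / (1 - beta ** t)      # bias correction for the early steps
    averaged.append(v_corrected)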

5.2.2 Gradient Descent with Momentum

Plain gradient descent takes many small steps on its way to the minimum.
Sometimes we want to learn faster in one direction (to reach the minimum quickly) and slower in another (to avoid large oscillations); mini-batch training in particular introduces oscillations, and momentum damps them.
The update rule is:
$v_{dW} = \beta v_{dW} + (1 - \beta)\,dW, \qquad v_{db} = \beta v_{db} + (1 - \beta)\,db$
$W := W - \alpha\, v_{dW}, \qquad b := b - \alpha\, v_{db}$
A variant drops the (1 - β) factor from the v_dW and v_db formulas, but then the learning rate hyperparameter has to be retuned.

The assignment code:
1. Initialize the velocity v to zero arrays with the same shapes as the corresponding W and b.

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    
    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    
    # Initialize velocity
    for l in range(1, L + 1):
        # (approx. 2 lines)
        # v["dW" + str(l)] =
        # v["db" + str(l)] =
        # YOUR CODE STARTS HERE
        v["dW" + str(l)] = np.zeros((parameters['W' + str(l)].shape[0],parameters['W' + str(l)].shape[1]))
        v["db" + str(l)] = np.zeros((parameters['b' + str(l)].shape[0],parameters['b' + str(l)].shape[1]))
        # YOUR CODE ENDS HERE
        
    return v

2. Update the parameters using these velocities (the update rule is the one shown in section 5.2.2 above).
Note the hyperparameter β: the larger it is, the smoother the updates. It is most commonly set in the range 0.8 to 0.999.

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- python dictionary containing your updated velocities
    """

    L = len(parameters) // 2 # number of layers in the neural networks
    
    # Momentum update for each parameter
    for l in range(1, L + 1):
        
        # (approx. 4 lines)
        # compute velocities
        # v["dW" + str(l)] = ...
        # v["db" + str(l)] = ...
        # update parameters
        # parameters["W" + str(l)] = ...
        # parameters["b" + str(l)] = ...
        # YOUR CODE STARTS HERE
        
        v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads['dW' + str(l)]
        v["db" + str(l)] = beta * v["db" + str(l)] + (1 - beta) * grads['db' + str(l)]
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * v["dW" + str(l)]
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * v["db" + str(l)]
        
        # YOUR CODE ENDS HERE
        
    return parameters, v

5.2.3 RMSprop

$S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\,dW^2, \qquad S_{db} = \beta_2 S_{db} + (1 - \beta_2)\,db^2$
$W := W - \alpha \frac{dW}{\sqrt{S_{dW}} + \varepsilon}, \qquad b := b - \alpha \frac{db}{\sqrt{S_{db}} + \varepsilon}$

As shown above, we want learning to speed up in the W direction and the oscillation in the b direction to shrink.
So we want S_dW to stay relatively small, so that the W update divides by a small number,
and S_db to be relatively large, so that the b update divides by a large number and the oscillation in the vertical direction is damped.
A small ε is added to avoid division by zero. A minimal sketch follows.
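
A minimal sketch of RMSprop steps for a single weight matrix (W and the stand-in gradients are made-up examples; the graded assignment only implements momentum and Adam):

import numpy as np

beta2, learning_rate, epsilon = 0.999, 0.01, 1e-8
W = np.random.randn(3, 2) * 0.01           # example weight matrix
S_dW = np.zeros_like(W)                    # running average of squared gradients
for step in range(100):
    dW = np.random.randn(*W.shape)         # stand-in for the gradient from backprop
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2
    W = W - learning_rate * dW / (np.sqrt(S_dW) + epsilon)   # damp directions with large gradients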

5.2.4 The Adam Optimization Algorithm

Adam combines the momentum of 5.2.2 with the RMSprop of 5.2.3 and is a very widely used optimizer. β1 is usually 0.9, β2 is typically 0.999, and the learning rate α should be tuned by trying several values.

The assignment code is below. In the update expressions, t counts how many Adam steps have been taken, l indexes the layer, β1 and β2 are the two exponential-decay hyperparameters, and ε is a small value that prevents division by zero:
$v_{dW} = \beta_1 v_{dW} + (1-\beta_1)\,dW, \qquad v^{corrected}_{dW} = \frac{v_{dW}}{1-\beta_1^{t}}$
$s_{dW} = \beta_2 s_{dW} + (1-\beta_2)\,dW^2, \qquad s^{corrected}_{dW} = \frac{s_{dW}}{1-\beta_2^{t}}$
$W := W - \alpha \frac{v^{corrected}_{dW}}{\sqrt{s^{corrected}_{dW}} + \varepsilon}$
First initialize the Adam variables:

def initialize_adam(parameters) :
    """
    Initializes v and s as two python dictionaries with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters["W" + str(l)] = Wl
                    parameters["b" + str(l)] = bl
    
    Returns: 
    v -- python dictionary that will contain the exponentially weighted average of the gradient. Initialized with zeros.
                    v["dW" + str(l)] = ...
                    v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted average of the squared gradient. Initialized with zeros.
                    s["dW" + str(l)] = ...
                    s["db" + str(l)] = ...

    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    s = {}
    
    # Initialize v, s. Input: "parameters". Outputs: "v, s".
    for l in range(1, L + 1):
    # (approx. 4 lines)
        # v["dW" + str(l)] = ...
        # v["db" + str(l)] = ...
        # s["dW" + str(l)] = ...
        # s["db" + str(l)] = ...
    # YOUR CODE STARTS HERE
    
        v["dW" + str(l)] = np.zeros((parameters["W" + str(l)].shape[0], parameters["W" + str(l)].shape[1]))
        v["db" + str(l)] = np.zeros((parameters["b" + str(l)].shape[0], parameters["b" + str(l)].shape[1]))
        s["dW" + str(l)] = np.zeros((parameters["W" + str(l)].shape[0], parameters["W" + str(l)].shape[1]))
        s["db" + str(l)] = np.zeros((parameters["b" + str(l)].shape[0], parameters["b" + str(l)].shape[1]))
    
    # YOUR CODE ENDS HERE
    
    return v, s

Then update the parameters with Adam:

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
                                beta1 = 0.9, beta2 = 0.999,  epsilon = 1e-8):
    """
    Update parameters using Adam
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    t -- Adam variable, counts the number of taken steps
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates 
    beta2 -- Exponential decay hyperparameter for the second moment estimates 
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    
    L = len(parameters) // 2                 # number of layers in the neural networks
    v_corrected = {}                         # Initializing first moment estimate, python dictionary
    s_corrected = {}                         # Initializing second moment estimate, python dictionary
    
    # Perform Adam update on all parameters
    for l in range(1, L + 1):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        # (approx. 2 lines)
        # v["dW" + str(l)] = ...
        # v["db" + str(l)] = ...
        # YOUR CODE STARTS HERE
        
        v["dW" + str(l)] = beta1 * v["dW" + str(l)] + (1 - beta1) * grads['dW' + str(l)]
        v["db" + str(l)] = beta1 * v["db" + str(l)] + (1 - beta1) * grads['db' + str(l)]
        
        # YOUR CODE ENDS HERE

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        # (approx. 2 lines)
        # v_corrected["dW" + str(l)] = ...
        # v_corrected["db" + str(l)] = ...
        # YOUR CODE STARTS HERE
        
        v_corrected["dW" + str(l)] = v["dW" + str(l)] / (1 - beta1 ** t)
        v_corrected["db" + str(l)] = v["db" + str(l)] / (1 - beta1 ** t)
        
        # YOUR CODE ENDS HERE

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        #(approx. 2 lines)
        # s["dW" + str(l)] = ...
        # s["db" + str(l)] = ...
        # YOUR CODE STARTS HERE
        
        s["dW" + str(l)] = beta2 * s["dW" + str(l)] + (1 - beta2) * (grads['dW' + str(l)] ** 2)
        s["db" + str(l)] = beta2 * s["db" + str(l)] + (1 - beta2) * (grads['db' + str(l)] ** 2)
        
        # YOUR CODE ENDS HERE

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        # (approx. 2 lines)
        # s_corrected["dW" + str(l)] = ...
        # s_corrected["db" + str(l)] = ...
        # YOUR CODE STARTS HERE
         
        s_corrected["dW" + str(l)] = s["dW" + str(l)] / (1 - beta2 ** t)
        s_corrected["db" + str(l)] = s["db" + str(l)] / (1 - beta2 ** t)

        # YOUR CODE ENDS HERE

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        # (approx. 2 lines)
        # parameters["W" + str(l)] = ...
        # parameters["b" + str(l)] = ...
        # YOUR CODE STARTS HERE
         
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * (v_corrected["dW" + str(l)] / (s_corrected["dW" + str(l)] ** 0.5+ epsilon ))
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * (v_corrected["db" + str(l)] / (s_corrected["db" + str(l)] ** 0.5+ epsilon ))
        
        # YOUR CODE ENDS HERE

    return parameters, v, s, v_corrected, s_corrected

Finally, a model that can be run with any of the three optimizers:

def model(X, Y, layers_dims, optimizer, learning_rate = 0.0007, mini_batch_size = 64, beta = 0.9,
          beta1 = 0.9, beta2 = 0.999,  epsilon = 1e-8, num_epochs = 5000, print_cost = True):
    """
    3-layer neural network model which can be run in different optimizer modes.
    
    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    optimizer -- the optimizer to be passed, gradient descent, momentum or adam
    layers_dims -- python list, containing the size of each layer
    learning_rate -- the learning rate, scalar.
    mini_batch_size -- the size of a mini batch
    beta -- Momentum hyperparameter
    beta1 -- Exponential decay hyperparameter for the past gradients estimates 
    beta2 -- Exponential decay hyperparameter for the past squared gradients estimates 
    epsilon -- hyperparameter preventing division by zero in Adam updates
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    parameters -- python dictionary containing your updated parameters 
    """

    L = len(layers_dims)             # number of layers in the neural networks
    costs = []                       # to keep track of the cost
    t = 0                            # initializing the counter required for Adam update
    seed = 10                        # For grading purposes, so that your "random" minibatches are the same as ours
    m = X.shape[1]                   # number of training examples
    
    # Initialize parameters
    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer
    if optimizer == "gd":
        pass # no initialization required for gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)
    
    # Optimization loop
    for i in range(num_epochs):
        
        # Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)
        cost_total = 0
        
        for minibatch in minibatches:

            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch

            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)

            # Compute cost and add to the cost total
            cost_total += compute_cost(a3, minibatch_Y)

            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1 # Adam counter
                parameters, v, s, _, _ = update_parameters_with_adam(parameters, grads, v, s,
                                                               t, learning_rate, beta1, beta2,  epsilon)
        cost_avg = cost_total / m
        
        # Print the cost every 1000 epoch
        if print_cost and i % 1000 == 0:
            print ("Cost after epoch %i: %f" %(i, cost_avg))
        if print_cost and i % 100 == 0:
            costs.append(cost_avg)
                
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters

The results look like this:
[Figure: cost curves and decision boundaries for the three optimizers]
Momentum usually helps, but with a small learning rate and a simple dataset its effect is often small or even negative. Adam clearly outperforms the other two here.

5.3 Learning Rate Decay

Another way to speed up learning is to slowly reduce the learning rate over time, called learning rate decay. It lets training take large steps early on, and smaller steps once it starts to converge.
Several schedules are common, for example inverse-time decay $\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}}\,\alpha_0$, exponential decay $\alpha = 0.95^{\,\text{epoch\_num}}\,\alpha_0$, $\alpha = \frac{k}{\sqrt{\text{epoch\_num}}}\,\alpha_0$, or a discrete staircase.
Related assignment: add learning rate decay to the model above.

def model(X, Y, layers_dims, optimizer, learning_rate = 0.0007, mini_batch_size = 64, beta = 0.9,
          beta1 = 0.9, beta2 = 0.999,  epsilon = 1e-8, num_epochs = 5000, print_cost = True, decay=None, decay_rate=1):
    """
    3-layer neural network model which can be run in different optimizer modes.
    
    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    layers_dims -- python list, containing the size of each layer
    learning_rate -- the learning rate, scalar.
    mini_batch_size -- the size of a mini batch
    beta -- Momentum hyperparameter
    beta1 -- Exponential decay hyperparameter for the past gradients estimates 
    beta2 -- Exponential decay hyperparameter for the past squared gradients estimates 
    epsilon -- hyperparameter preventing division by zero in Adam updates
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    parameters -- python dictionary containing your updated parameters 
    """

    L = len(layers_dims)             # number of layers in the neural networks
    costs = []                       # to keep track of the cost
    t = 0                            # initializing the counter required for Adam update
    seed = 10                        # For grading purposes, so that your "random" minibatches are the same as ours
    m = X.shape[1]                   # number of training examples
    lr_rates = []
    learning_rate0 = learning_rate   # the original learning rate
    
    # Initialize parameters
    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer
    if optimizer == "gd":
        pass # no initialization required for gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)
    
    # Optimization loop
    for i in range(num_epochs):
        
        # Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)
        cost_total = 0
        
        for minibatch in minibatches:

            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch

            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)

            # Compute cost and add to the cost total
            cost_total += compute_cost(a3, minibatch_Y)

            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1 # Adam counter
                parameters, v, s, _, _ = update_parameters_with_adam(parameters, grads, v, s,
                                                               t, learning_rate, beta1, beta2,  epsilon)
        cost_avg = cost_total / m
        if decay:
            learning_rate = decay(learning_rate0, i, decay_rate)
        # Print the cost every 1000 epoch
        if print_cost and i % 1000 == 0:
            print ("Cost after epoch %i: %f" %(i, cost_avg))
            if decay:
                print("learning rate after epoch %i: %f"%(i, learning_rate))
        if print_cost and i % 100 == 0:
            costs.append(cost_avg)
                
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters

The two decay formulas used (one applied every epoch, the other only every fixed number of epochs) are:
$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\,\alpha_0$
$\alpha = \frac{1}{1 + \text{decay\_rate} \times \lfloor \text{epoch\_num} / \text{time\_interval} \rfloor}\,\alpha_0$
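
A minimal sketch of two decay functions compatible with the decay(learning_rate0, epoch_num, decay_rate) call in the model above (the function names and the default time_interval are my assumptions; the assignment's own implementations may differ in details):

import numpy as np

def update_lr(learning_rate0, epoch_num, decay_rate):
    # inverse-time decay applied at every epoch
    return learning_rate0 / (1 + decay_rate * epoch_num)

def schedule_lr_decay(learning_rate0, epoch_num, decay_rate, time_interval=1000):
    # the same decay, but only stepped down every time_interval epochs
    return learning_rate0 / (1 + decay_rate * np.floor(epoch_num / time_interval))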

6. Hyperparameter Tuning

The learning rate α is usually the most important hyperparameter. For the others, when their relative importance is unknown, sample values at random rather than on a grid, so that more candidate values of the important hyperparameters get explored.
Another useful strategy is coarse-to-fine search: once a promising region is identified, narrow the search range and sample within it.

Take tuning α as an example. If its range is [0.0001, 1] and you sample uniformly along a linear axis, only about 10% of the samples land in [0.0001, 0.1].
So search on a logarithmic scale instead, sampling uniformly on the log axis:

r = -4 * np.random.rand()   # r is uniform in [-4, 0]
alpha = np.power(10, r)     # alpha is log-uniform in [1e-4, 1]

One more point: with limited computing resources you can babysit a single model, watching whether the cost keeps decreasing and adjusting the hyperparameters as it trains.
With plenty of resources you can train many models in parallel and pick whichever performs best.

6.1 Fitting Batch Norm into a Neural Network

Batch Norm is applied to z before it is passed to the activation function a.
In short: normalize the z values of a layer using their mean and variance, then rescale with two new learnable parameters γ and β, $\tilde{z} = \gamma z_{norm} + \beta$, which let the hidden units take on whatever mean and variance works best. The resulting $\tilde{z}$ is then fed into the activation function.
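
A minimal numpy sketch of Batch Norm applied to the pre-activations Z of one layer during training (gamma and beta would be learned alongside W and b; the function name and shapes here are my assumptions, not assignment code):

import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    # Z has shape (units in this layer, examples in the mini-batch)
    mu = np.mean(Z, axis=1, keepdims=True)         # mean over the mini-batch
    var = np.var(Z, axis=1, keepdims=True)         # variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)     # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta                # learnable scale and shift
    return Z_tilde                                 # passed on to the activation function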

7. Softmax Regression

Softmax is applied at the output layer; it turns the network from a 0/1 classifier into a multi-class classifier.
Given the linear output z of the final layer, the activation is
$a^{[L]}_i = \frac{e^{z^{[L]}_i}}{\sum_{j=1}^{C} e^{z^{[L]}_j}}$
where C is the number of classes. The numerator exponentiates each component of the vector z; letting $t = e^{z^{[L]}}$, the denominator is the sum of the components of t.
The resulting a gives the predicted probability of each class.
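
A minimal numpy sketch of a numerically stable softmax over the output layer (subtracting the column-wise maximum is my addition for numerical stability, not part of the formula above):

import numpy as np

def softmax(Z):
    # Z has shape (num_classes, number of examples)
    t = np.exp(Z - np.max(Z, axis=0, keepdims=True))   # exponentiate each class score
    return t / np.sum(t, axis=0, keepdims=True)        # each column now sums to 1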
