05 - (Course 2) Week 1 Assignments: Practical Aspects of Deep Learning


1. Initialization


1.1 The neural network model

You will use a 3-layer neural network (already implemented for you; the input layer is not counted). Here are the initialization methods you will experiment with:

  • Zeros initialization – setting initialization = "zeros" in the input argument.
  • Random initialization – setting initialization = "random" in the input argument. This initializes the weights to large random values.
  • He initialization – setting initialization = "he" in the input argument. This initializes the weights to random values scaled according to a paper by He et al., 2015.
def model(X, Y, learning_rate = 0.01, num_iterations = 15000, print_cost = True, initialization = "he"):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.
    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (containing 0 for red dots; 1 for blue dots), of shape (1, number of examples)
    learning_rate -- learning rate for gradient descent
    num_iterations -- number of iterations to run gradient descent
    print_cost -- if True, print the cost every 1000 iterations
    initialization -- flag to choose which initialization to use ("zeros", "random" or "he")

    Returns:
    parameters -- parameters learnt by the model
    """

    grads = {}
    costs = [] # to keep track of the loss
    m = X.shape[1]                         # number of examples
    layers_dims = [X.shape[0], 10, 5, 1]   # sizes of the network's layers

    # Initialize parameters dictionary (the three initializers are implemented below).
    if initialization == "zeros":
        parameters = initialize_parameters_zeros(layers_dims)
    elif initialization == "random":
        parameters = initialize_parameters_random(layers_dims)
    elif initialization == "he":
        parameters = initialize_parameters_he(layers_dims)

    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        a3, cache = forward_propagation(X, parameters)

        # Loss
        cost = compute_loss(a3, Y)

        # Backward propagation.
        grads = backward_propagation(X, Y, cache)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
            costs.append(cost)

    # plot the loss
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters

1.2 Zeros initialization

There are two types of parameters to initialize in a neural network:

  • the weight matrices (W[1], W[2], W[3], ..., W[L-1], W[L])
  • the bias vectors (b[1], b[2], b[3], ..., b[L-1], b[L])

Exercise: Implement the following function to initialize all parameters to zeros. You'll see later that this does not work well since it fails to "break symmetry", but let's try it anyway and see what happens. Use np.zeros((..,..)) with the correct shapes.

def initialize_parameters_zeros(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    parameters = {}
    L = len(layers_dims)            # number of layers in the network

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###
    return parameters

Observed results:

parameters = model(train_X, train_Y, initialization = "zeros")
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

[Figure: cost curve with zeros initialization]
Notice that the cost never changes across iterations; it stays equal to its initial value. This means none of the parameters are being updated: the final activation (the sigmoid) is always 0.5, and since outputs in the range 0 to 0.5 are assigned label 0, the model predicts label 0 for every example. That is why the accuracy is 50% on both the training and test sets.
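The 50% accuracy can be checked directly: with all parameters at zero, every pre-activation is zero, so the final sigmoid outputs exactly 0.5 for every example. A minimal standalone sketch (the helper names here mirror the assignment but are re-defined for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# forward pass of the 3-layer net with every parameter set to zero
X = np.random.randn(2, 5)                       # five arbitrary examples
W1, b1 = np.zeros((10, 2)), np.zeros((10, 1))
W2, b2 = np.zeros((5, 10)), np.zeros((5, 1))
W3, b3 = np.zeros((1, 5)), np.zeros((1, 1))

A1 = relu(W1 @ X + b1)
A2 = relu(W2 @ A1 + b2)
A3 = sigmoid(W3 @ A2 + b3)
print(A3)   # every entry is exactly 0.5
```

Since sigmoid(0) = 0.5 regardless of the input, the prediction is the same for every example.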

Let's also set every weight matrix to the same non-zero constant, first 0.01 and then 1, and look at the results:

parameters['W' + str(l)] = np.ones((layers_dims[l], layers_dims[l-1])) * 0.01   # initialize every weight to 0.01

[Figure: cost curve and decision boundary with all weights set to 0.01]

parameters['W' + str(l)] = np.ones((layers_dims[l], layers_dims[l-1]))   # initialize every weight to 1

[Figure: cost curve and decision boundary with all weights set to 1]
Whether the constant is 0.01 or 1, the cost does decrease at every iteration, but extremely slowly; the network learns very poorly.
In general, initializing all the weights to zero (or to any identical constant) results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, so you might as well be training a neural network with n[l] = 1 for every layer, and the network is no more powerful than a linear classifier such as logistic regression.

What you should remember:

  • The weights W[l] should be initialized randomly to break symmetry.
  • It is however okay to initialize the biases b[l] to zeros. Symmetry is still broken so long as W[l] is initialized randomly.

1.3 Random initialization

To break symmetry, let's initialize the weights randomly. Following random initialization, each neuron can then proceed to learn a different function of its inputs. In this exercise, you will see what happens if the weights are initialized randomly, but to very large values.

parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * 10
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
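Folded into a full helper, the large-random-value initializer might look like this (a sketch consistent with the snippet above; the fixed seed is chosen here for reproducibility and may differ from the notebook's):

```python
import numpy as np

def initialize_parameters_random(layers_dims):
    """Weights drawn from N(0, 1) and scaled by 10; biases set to zero."""
    np.random.seed(3)   # fixed for reproducibility (illustrative choice)
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

params = initialize_parameters_random([2, 10, 5, 1])
print(params['W1'].shape, params['b1'].shape)   # (10, 2) (10, 1)
```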

[Figure: cost curve and decision boundary with large random initialization]
Observations:
- The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets such an example wrong it incurs a very high loss for that example. Indeed, when log(a[3]) = log(0), the loss goes to infinity.
- Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm.
- If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.

In summary:

  • Initializing weights to very large random values does not work well.
  • Hopefully, initializing with small random values does better. The important question is: how small should these random values be? Let's find out in the next part!

1.4 He initialization

Finally, try "He initialization"; this is named for the first author of He et al., 2015. (If you have heard of "Xavier initialization", this is similar, except Xavier initialization uses a scaling factor of sqrt(1./layers_dims[l-1]) for the weights W[l], where He initialization would use sqrt(2./layers_dims[l-1]).)

parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * np.sqrt(2 / layers_dims[l - 1])
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
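As with the other initializers, the snippet above sits inside a loop over the layers. A hedged standalone sketch of the full helper:

```python
import numpy as np

def initialize_parameters_he(layers_dims):
    """Scale each weight matrix by sqrt(2 / fan_in), per He et al., 2015."""
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        parameters['W' + str(l)] = (np.random.randn(layers_dims[l], layers_dims[l - 1])
                                    * np.sqrt(2. / layers_dims[l - 1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

params = initialize_parameters_he([2, 10, 5, 1])
print(params['W2'].shape, params['b2'].shape)   # (5, 10) (5, 1)
```

The sqrt(2/fan_in) factor keeps the variance of the pre-activations roughly constant across layers when ReLU is used, which is why this scheme pairs well with ReLU networks.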

[Figure: cost curve and decision boundary with He initialization]

1.5 Summary

You have seen three different types of initialization. For the same number of iterations and the same hyperparameters, the comparison is:
[Table: comparison of the three initialization methods]
What you should remember from this notebook:

  • Different initializations lead to different results.
  • Random initialization is used to break symmetry and make sure different hidden units can learn different things.
  • Don't initialize to values that are too large.
  • He initialization works well for networks with ReLU activations.

2. Regularization

Welcome to the second assignment of this week. Deep learning models have so much flexibility and capacity that overfitting can be a serious problem if the training dataset is not big enough. Such a model may do well on the training set, but the learned network doesn't generalize to new examples that it has never seen!

Problem Statement: You have just been hired as an AI expert by the French Football Corporation. They would like you to recommend positions where France's goal keeper should kick the ball so that the French team's players can then hit it with their head.
[Figure: the football-field dataset of header positions]
Each dot corresponds to a position on the football field where a football player has hit the ball with his/her head after the French goal keeper has shot the ball from the left side of the football field.

  • If the dot is blue, it means the French player managed to hit the ball with his/her head.
  • If the dot is red, it means the other team's player hit the ball with their head.

Your goal: Use a deep learning model to find the positions on the field where the goalkeeper should kick the ball.

2.1 Non-regularized model

You will use the following neural network (already implemented for you below). This model can be used:

  • in regularization mode – by setting the lambd input to a non-zero value. We use "lambd" instead of "lambda" because "lambda" is a reserved keyword in Python.
  • in dropout mode – by setting keep_prob to a value less than one.

You will first try the model without any regularization. Then, you will implement:

  • L2 regularization – functions: "compute_cost_with_regularization()" and "backward_propagation_with_regularization()".
  • Dropout – functions: "forward_propagation_with_dropout()" and "backward_propagation_with_dropout()".
def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- If True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob - probability of keeping a neuron active during drop-out, scalar.

    Returns:
    parameters -- parameters learned by the model. They can then be used to predict.
    """

    grads = {}
    costs = []                            # to keep track of the cost
    m = X.shape[1]                        # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]

    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)

        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)

        # Backward propagation.
        assert(lambd == 0 or keep_prob == 1)    # it is possible to use both L2 regularization and dropout,
                                                # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)  # L2 regularization only changes the cost computation and backprop
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters

[Figure: cost curve of the non-regularized model]
The train accuracy is 94.8% while the test accuracy is 91.5%. This is the baseline model (you will observe the impact of regularization on this model). Run the following code to plot the decision boundary of your model.
[Figure: decision boundary of the non-regularized model]
The non-regularized model is obviously overfitting the training set: it is fitting the noisy points! Let's now look at two techniques for reducing overfitting.

2.2 L2 regularization

The standard way to avoid overfitting is called L2 regularization. It consists of appropriately modifying your cost function, from:

J = −(1/m) Σ_{i=1}^{m} [ y^(i) log(a^[L](i)) + (1 − y^(i)) log(1 − a^[L](i)) ]    (1)

To:
J_regularized = −(1/m) Σ_{i=1}^{m} [ y^(i) log(a^[L](i)) + (1 − y^(i)) log(1 − a^[L](i)) ]   (cross-entropy cost)
                + (λ/(2m)) Σ_l Σ_k Σ_j (W_{k,j}^[l])²   (L2 regularization cost)    (2)

Let’s modify your cost and observe the consequences.

Exercise: Implement compute_cost_with_regularization() which computes the cost given by formula (2). To calculate Σ_k Σ_j (W_{k,j}^[l])², use:

np.sum(np.square(Wl))

Note that you have to do this for W[1], W[2] and W[3], then sum the three terms and multiply by λ/(2m).
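As a quick numeric sanity check of that recipe (the matrices and hyperparameters below are made up for illustration):

```python
import numpy as np

lambd, m = 0.7, 4
W1 = np.array([[1., 2.], [3., 4.]])   # sum of squared entries: 1 + 4 + 9 + 16 = 30
W2 = np.array([[0.5, -0.5]])          # sum of squared entries: 0.25 + 0.25 = 0.5

# sum the per-matrix squared entries, then multiply by lambda / (2m)
L2_regularization_cost = lambd / (2 * m) * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
print(L2_regularization_cost)   # 0.7 / 8 * 30.5 = 2.66875
```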

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.

    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model

    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost

    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) / (2 * m)

    ### END CODE HERE ###

    cost = cross_entropy_cost + L2_regularization_cost

    return cost

Of course, because you changed the cost, you have to change backward propagation as well! All the gradients have to be computed with respect to this new cost.
Exercise: Implement the changes needed in backward propagation to take into account regularization. The changes only concern dW1, dW2 and dW3. For each, you have to add the regularization term's gradient: d/dW ( (λ/(2m)) W² ) = (λ/m) W.

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.

    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y

    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + (lambd * W3) / m
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + (lambd * W2) / m
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + (lambd * W1) / m
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

[Figure: cost curve and decision boundary with L2 regularization]
Observations:

  • The value of λ is a hyperparameter that you can tune using a dev set.
  • L2 regularization makes your decision boundary smoother. If λ is too large, it is also possible to "oversmooth", resulting in a model with high bias.

What is L2-regularization actually doing?:

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.
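The "weight decay" effect can be made concrete: with the extra gradient term (λ/m)·W and (for clarity) a zero data gradient, one gradient-descent step multiplies every weight by the factor 1 − αλ/m. An illustrative sketch with made-up numbers:

```python
import numpy as np

alpha, lambd, m = 0.1, 2.0, 100
W = np.array([[4.0, -2.0]])
dW_data = np.zeros_like(W)       # pretend the cross-entropy gradient is zero

# one update with the L2 term added to the gradient
W_new = W - alpha * (dW_data + (lambd / m) * W)
print(W_new)   # each weight shrinks by factor 1 - alpha * lambd / m = 0.998
```

In a real training step dW_data is non-zero, but the multiplicative shrinkage is still applied on top of it at every iteration, which is why the weights end up smaller.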

What you should remember – the implications of L2-regularization on:

  • The cost computation:
    • A regularization term is added to the cost.
  • The backpropagation function:
    • There are extra terms in the gradients with respect to weight matrices
  • Weights end up smaller ("weight decay"):
    • Weights are pushed to smaller values.

2.3 Dropout

Finally, dropout is a widely used regularization technique that is specific to deep learning. It randomly shuts down some neurons in each iteration.
[Figure: dropout randomly shutting down neurons in the hidden layers]
At each iteration, you shut down (= set to zero) each neuron of a layer with probability 1 − keep_prob or keep it with probability keep_prob (50% here). The dropped neurons don't contribute to the training in either the forward or backward propagation of that iteration.

2.3.1 Forward propagation with dropout

Exercise: Implement the forward propagation with dropout. You are using a 3-layer neural network, and will add dropout to the first and second hidden layers. We will not apply dropout to the input layer or output layer.

Instructions:
You would like to shut down some neurons in the first and second layers. To do that, you are going to carry out 4 steps:
1. In lecture, we discussed creating a variable d[1] with the same shape as a[1] using np.random.rand() to randomly get numbers between 0 and 1. Here, you will use a vectorized implementation, so create a random matrix D[1] = [d[1](1) d[1](2) ... d[1](m)] of the same dimension as A[1].
2. Set each entry of D[1] to be 0 with probability (1 - keep_prob) or 1 with probability (keep_prob), by thresholding values in D[1] appropriately. Hint: to set all the entries of a matrix X to 0 (if entry is less than 0.5) or 1 (if entry is more than 0.5) you would do: X = (X < 0.5). Note that 0 and 1 are respectively equivalent to False and True.
3. Set A[1] to A[1] * D[1]. (You are shutting down some neurons.) You can think of D[1] as a mask, so that when it is multiplied with another matrix, it shuts down some of the values.
4. Divide A[1] by keep_prob. By doing this you are assuring that the result of the cost will still have the same expected value as without dropout. (This technique is also called inverted dropout.)

def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation
    """

    np.random.seed(1)

    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)         # Steps 1-4 below correspond to the Steps 1-4 described above. 
    D1 = np.random.rand(A1.shape[0], A1.shape[1])     # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = D1 < keep_prob                               # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                      # Step 3: shut down some neurons of A1
    A1 = A1 / keep_prob                               # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])      # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = D2 < keep_prob                                # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                       # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                                # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)  # note that the cache also stores the masks D1 and D2

    return A3, cache

2.3.2 Backward propagation with dropout

Exercise: Implement the backward propagation with dropout. As before, you are training a 3 layer network. Add dropout to the first and second hidden layers, using the masks D[1] and D[2] stored in the cache.

Instruction:
Backpropagation with dropout is actually quite easy. You will have to carry out 2 steps:
1. You had previously shut down some neurons during forward propagation, by applying a mask D[1] to A1. In backpropagation, you will have to shut down the same neurons, by reapplying the same mask D[1] to dA1.
2. During forward propagation, you had divided A1 by keep_prob. In backpropagation, you'll therefore have to divide dA1 by keep_prob again (the calculus interpretation is that if A[1] is scaled by keep_prob, then its derivative dA[1] is also scaled by the same keep_prob).

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA2 = dA2 * D2               # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 / keep_prob       # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA1 = dA1 * D1              # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 / keep_prob              # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

Let's now run the model with dropout (keep_prob = 0.86). This means that at every iteration you shut down each neuron of layers 1 and 2 with 14% probability. The function model() will now call:

  • forward_propagation_with_dropout instead of forward_propagation.
  • backward_propagation_with_dropout instead of backward_propagation.
parameters = model(train_X, train_Y, keep_prob = 0.86, learning_rate = 0.3)

print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

[Figure: cost curve and decision boundary with dropout]
Note:

  • A common mistake when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only in training.
  • Deep learning frameworks like tensorflow, PaddlePaddle, keras or caffe come with a dropout layer implementation. Don’t stress - you will soon learn some of these frameworks.

What you should remember about dropout:

  • Dropout is a regularization technique.
  • You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
  • Apply dropout both during forward and backward propagation.
  • During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
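A quick Monte-Carlo check of that last point (illustrative only; the activations and keep_prob are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
A = np.full((3, 4), 2.0)          # activations, all equal to 2

# average the inverted-dropout output over many random masks
total = np.zeros_like(A)
n_trials = 20000
for _ in range(n_trials):
    D = rng.random(A.shape) < keep_prob   # keep each unit with probability keep_prob
    total += (A * D) / keep_prob          # inverted dropout: mask, then rescale
mean_activation = total.mean() / n_trials
print(mean_activation)            # close to 2.0: the expected activation is preserved
```

Without the division by keep_prob, the mean would instead settle near keep_prob * 2 = 1.6, biasing every downstream computation.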

2.4 Summary

Here are the results of our three models:
[Table: results of the three models]
What we want you to remember from this notebook:

  • Regularization will help you reduce overfitting.
  • Regularization will drive your weights to lower values.
  • L2 regularization and dropout are two very effective regularization techniques.

3. Gradient Checking

Welcome to the final assignment for this week! In this assignment you will learn to implement and use gradient checking.

You are part of a team working to make mobile payments available globally, and are asked to build a deep learning model to detect fraud: whenever someone makes a payment, you want to see if the payment might be fraudulent, such as when the user's account has been taken over by a hacker.

But backpropagation is quite challenging to implement, and sometimes has bugs. Because this is a mission-critical application, your company's CEO wants to be really certain that your implementation of backpropagation is correct. Your CEO says, "Give me a proof that your backpropagation is actually working!" To give this reassurance, you are going to use "gradient checking".

3.1 How does gradient checking work?

Backpropagation computes the gradient ∂J/∂θ, where θ denotes the parameters of the model. J is computed using forward propagation and your loss function.

Because forward propagation is relatively easy to implement, you're confident you got that right, and so you're almost 100% sure that you're computing the cost J correctly. Thus, you can use your code for computing J to verify the code for computing ∂J/∂θ.

Let's look back at the definition of a derivative (or gradient):

∂J/∂θ = lim_{ε→0} [J(θ + ε) − J(θ − ε)] / (2ε)    (1)

If you're not familiar with the "lim_{ε→0}" notation, it's just a way of saying "when ε is really, really small."

We know the following:

  • ∂J/∂θ is what you want to make sure you're computing correctly.
  • You can compute J(θ + ε) and J(θ − ε) (in the case that θ is a real number), since you're confident your implementation of J is correct.

Let's use equation (1) and a small value for ε to convince your CEO that your code for computing ∂J/∂θ is correct!

3.2 1-dimensional gradient checking

Consider a 1D linear function J(θ) = θx. The model contains only a single real-valued parameter θ, and takes x as input.

You will implement code to compute J(.) and its derivative ∂J/∂θ. You will then use gradient checking to make sure your derivative computation for J is correct.
[Figure 1: the 1D linear model J(θ) = θx]
Exercise: Implement "forward propagation" and "backward propagation" for this simple function. I.e., compute both J(.) ("forward propagation") and its derivative with respect to θ ("backward propagation"), in two separate functions.

def forward_propagation(x, theta):
    """
    Implement the linear forward propagation (compute J) presented in Figure 1 (J(theta) = theta * x)

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well

    Returns:
    J -- the value of function J, computed using the formula J(theta) = theta * x
    """

    ### START CODE HERE ### (approx. 1 line)
    J = np.dot(theta, x)
    ### END CODE HERE ###

    return J

Exercise: Now, implement the backward propagation step (derivative computation) of Figure 1. That is, compute the derivative of J(θ) = θx with respect to θ. To save you from doing the calculus, you should get dtheta = ∂J/∂θ = x.

def backward_propagation(x, theta):
    """
    Computes the derivative of J with respect to theta (see Figure 1).

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well

    Returns:
    dtheta -- the gradient of the cost with respect to theta
    """

    ### START CODE HERE ### (approx. 1 line)
    dtheta = x
    ### END CODE HERE ###

    return dtheta

Instructions:

  • First compute "gradapprox" using formula (1) above and a small value of $\varepsilon$. Here are the steps to follow:
    1. $\theta^{+} = \theta + \varepsilon$
    2. $\theta^{-} = \theta - \varepsilon$
    3. $J^{+} = J(\theta^{+})$
    4. $J^{-} = J(\theta^{-})$
    5. $gradapprox = \frac{J^{+} - J^{-}}{2\varepsilon}$
  • Then compute the gradient using backward propagation, and store the result in a variable "grad".
  • Finally, compute the relative difference between "gradapprox" and "grad" using the following formula:

    $$difference = \frac{\|grad - gradapprox\|_2}{\|grad\|_2 + \|gradapprox\|_2} \tag{2}$$

    You will need 3 steps to compute this formula:
    • 1'. compute the numerator using np.linalg.norm(...)
    • 2'. compute the denominator; you will need to call np.linalg.norm(...) twice.
    • 3'. divide them.
  • If this difference is small (say less than $10^{-7}$), you can be quite confident that you have computed your gradient correctly. Otherwise, there may be a mistake in the gradient computation.

def gradient_check(x, theta, epsilon = 1e-7):
    """
    Implement the gradient check for the 1D model of Figure 1.

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Compute gradapprox using formula (1). epsilon is small enough; you don't need to worry about the limit.
    ### START CODE HERE ### (approx. 5 lines)
    thetaplus = theta + epsilon                               # Step 1
    thetaminus = theta - epsilon                              # Step 2
    J_plus = forward_propagation(x, thetaplus)                # Step 3
    J_minus = forward_propagation(x, thetaminus)              # Step 4
    gradapprox = (J_plus - J_minus) / (2 * epsilon)           # Step 5
    ### END CODE HERE ###

    # Check if gradapprox is close enough to the output of backward_propagation()
    ### START CODE HERE ### (approx. 1 line)
    grad = backward_propagation(x, theta)
    ### END CODE HERE ###

    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                               # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)             # Step 2'
    difference = numerator / denominator                                        # Step 3'
    ### END CODE HERE ###

    if difference < 1e-7:
        print ("The gradient is correct!")
    else:
        print ("The gradient is wrong!")

    return difference
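As a quick sanity check, the pieces above can be wired together by hand. The inputs x = 2 and theta = 4 mirror the notebook's test cell; the inlined forward/backward expressions below are a self-contained restatement, not the graded functions themselves:

```python
import numpy as np

# Self-contained restatement of the 1-D check, for a quick demo.
x, theta, epsilon = 2., 4., 1e-7

J_plus = (theta + epsilon) * x             # forward_propagation(x, theta + epsilon)
J_minus = (theta - epsilon) * x            # forward_propagation(x, theta - epsilon)
gradapprox = (J_plus - J_minus) / (2 * epsilon)
grad = x                                   # backward_propagation(x, theta)

# Relative difference, formula (2)
difference = np.linalg.norm(grad - gradapprox) / (
    np.linalg.norm(grad) + np.linalg.norm(gradapprox))
# difference comes out far below 1e-7, so the analytic gradient matches
```

Since $J(\theta) = \theta x$ is linear in $\theta$, the centered difference is exact up to floating-point rounding, so the difference is tiny.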

3.3 N-Dimensional Gradient Checking
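The backward pass below consumes a cache produced by forward_propagation_n, which ships with the assignment and is not reprinted in this excerpt. A sketch consistent with the cache layout used by backward_propagation_n (shapes taken from the notebook's test case, X of shape (4, m)) might look like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def forward_propagation_n(X, Y, parameters):
    """Forward pass + cross-entropy cost of the 3-layer model
    LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.
    Sketch of the assignment helper; returns (cost, cache)."""
    m = X.shape[1]
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    W3, b3 = parameters["W3"], parameters["b3"]

    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    # Cross-entropy cost averaged over the m examples
    logprobs = np.multiply(-np.log(A3), Y) + np.multiply(-np.log(1 - A3), 1 - Y)
    cost = 1. / m * np.sum(logprobs)

    # Same 12-tuple layout that backward_propagation_n unpacks
    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)
    return cost, cache
```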

[Figure 2: the 3-layer model — LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID]

The backward propagation module:

def backward_propagation_n(X, Y, cache):
    """
    Implement the backward propagation presented in figure 2.

    Arguments:
    X -- input datapoint, of shape (input size, 1)
    Y -- true "label"
    cache -- cache output from forward_propagation_n()

    Returns:
    gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.
    """

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T) * 2  # original buggy line (kept on purpose; gradient checking catches it below)
    #dW2 = 1./m * np.dot(dZ2, A1.T)     # corrected version
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 4./m * np.sum(dZ1, axis=1, keepdims = True)  # original buggy line (kept on purpose; gradient checking catches it below)
    # db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)  # corrected version

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

How does gradient checking work?

As in sections 1) and 2), you want to compare "gradapprox" to the gradient computed by backpropagation. The formula is still:

$$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon} \tag{1}$$

However, $\theta$ is not a scalar anymore. It is a dictionary called "parameters". We implemented a function "dictionary_to_vector()" for you. It converts the "parameters" dictionary into a vector called "values", obtained by reshaping all parameters (W1, b1, W2, b2, W3, b3) into vectors and concatenating them.

The inverse function is "vector_to_dictionary", which outputs back the "parameters" dictionary.
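The exact helper code lives in the assignment's gc_utils module and is not reprinted here. A plausible sketch, assuming the 3-layer shapes of the notebook's test case (W1: (5,4), b1: (5,1), W2: (3,5), b2: (3,1), W3: (1,3), b3: (1,1), 47 entries in total):

```python
import numpy as np

def dictionary_to_vector(parameters):
    """Flatten W1, b1, ..., W3, b3 into a single (47, 1) column vector.
    Returns the vector and the key order used, mirroring the helper."""
    keys = ["W1", "b1", "W2", "b2", "W3", "b3"]
    columns = [parameters[key].reshape(-1, 1) for key in keys]
    return np.concatenate(columns, axis=0), keys

def vector_to_dictionary(theta):
    """Inverse operation: rebuild the parameters dict from the column vector.
    Shapes are hard-coded for the notebook's 3-layer test network."""
    return {
        "W1": theta[0:20].reshape(5, 4),
        "b1": theta[20:25].reshape(5, 1),
        "W2": theta[25:40].reshape(3, 5),
        "b2": theta[40:43].reshape(3, 1),
        "W3": theta[43:46].reshape(1, 3),
        "b3": theta[46:47].reshape(1, 1),
    }
```

A round trip dictionary_to_vector → vector_to_dictionary recovers the original dictionary exactly, which is what lets gradient_check_n perturb one flattened entry at a time and still call forward_propagation_n with a well-formed parameters dict.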
Exercise: Implement gradient_check_n().

Instructions: Here is pseudo-code that will help you implement the gradient check.

For each i in num_parameters:

  • To compute J_plus[i]:
    1. Set $\theta^{+}$ to np.copy(parameters_values)
    2. Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
    3. Calculate $J^{+}_i$ using forward_propagation_n(x, y, vector_to_dictionary($\theta^{+}$)).
  • To compute J_minus[i]: do the same thing with $\theta^{-}$.
  • Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2\varepsilon}$.

Thus, you get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to parameter_values[i]. You can now compare this gradapprox vector to the gradients vector from backpropagation. Just like for the 1D case (Steps 1', 2', 3'), compute:

$$difference = \frac{\|grad - gradapprox\|_2}{\|grad\|_2 + \|gradapprox\|_2} \tag{3}$$

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters
    X -- input datapoint, of shape (input size, 1)
    Y -- true "label"
    epsilon -- tiny shift to the input to compute the approximated gradient with formula (1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)  # flatten the parameters dict into one column vector
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]              # number of entries in that column vector
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox
    for i in range(num_parameters):

        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function you have to outputs two parameters but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)                                      # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                                 # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))  # Step 3
        ### END CODE HERE ###

        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                     # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                               # Step 2        
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))  # Step 3
        ### END CODE HERE ###

        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
        ### END CODE HERE ###

    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                                  # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)                # Step 2'
    difference = numerator / denominator                                           # Step 3'
    ### END CODE HERE ###

    if difference > 2e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference

Testing the n-dimensional gradient check:

[Output: the check reports a mistake in the backward propagation, with a large difference]

It seems that there were errors in the backward_propagation_n code we gave you! Good that you've implemented the gradient check. Go back to backward_propagation and try to find/correct the errors (Hint: check dW2 and db1). Rerun the gradient check when you think you've fixed it. Remember you'll need to re-execute the cell defining backward_propagation_n() if you modify the code.

Correcting the backward_propagation_n code:

#dW2 = 1./m * np.dot(dZ2, A1.T) * 2  # original buggy line
dW2 = 1./m * np.dot(dZ2, A1.T)
#db1 = 4./m * np.sum(dZ1, axis=1, keepdims = True)  # original buggy line
db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

[Output: "Your backward propagation works perfectly fine!" with a small difference]

Can you get gradient check to declare your derivative computation correct? Even though this part of the assignment isn't graded, we strongly urge you to try to find the bug and re-run gradient check until you're convinced backprop is now correctly implemented.

Note

  • Gradient checking is slow! Approximating the gradient with $\frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon}$ is computationally costly. For this reason, we don't run gradient checking at every iteration during training; just a few times to check that the gradient is correct.
  • Gradient checking, at least as we've presented it, doesn't work with dropout. You would usually run the gradient check algorithm without dropout to make sure your backprop is correct, then add dropout.

Congrats, you can be confident that your deep learning model for fraud detection is working correctly! You can even use this to convince your CEO. :)

What you should remember from this notebook:

  • Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
  • Gradient checking is slow, so we don't run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.