Andrew Ng Deep Learning: Initialization, Regularization, Gradient Checking


parameters – python dictionary containing your parameters “W1”, “b1”, …, “WL”, “bL”:

  • W1 – weight matrix of shape (layers_dims[1], layers_dims[0])
  • b1 – bias vector of shape (layers_dims[1], 1)
  • WL – weight matrix of shape (layers_dims[L], layers_dims[L-1])
  • bL – bias vector of shape (layers_dims[L], 1)

Zero initialization

# GRADED FUNCTION: initialize_parameters_zeros
import numpy as np

def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)            # number of layers in the network
    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros((layers_dims[l] , layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l] , 1))
    return parameters

Random initialization

# GRADED FUNCTION: initialize_parameters_random

def initialize_parameters_random(layers_dims):
    parameters = {}
    L = len(layers_dims)            # integer representing the number of layers
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l] , layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l] , 1))
    return parameters

He initialization

He initialization is designed for layers with ReLU-family activations (ReLU and PReLU).
The weights $W^{[l]}$ of each layer are drawn from a Gaussian distribution with mean 0 and variance $\frac{2}{n^{[l-1]}}$ (that is, standard deviation $\sqrt{\frac{2}{n^{[l-1]}}}$), which keeps the variance of each layer's input on a consistent scale.

$$W^{[l]} = \text{random} \times \sqrt{\frac{2}{\text{layers\_dims}[l-1]}}$$

# GRADED FUNCTION: initialize_parameters_he
def initialize_parameters_he(layers_dims):
    parameters = {}
    L = len(layers_dims) - 1 
    for l in range(1, L + 1):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l] , layers_dims[l-1]) * np.sqrt(2 / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
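
To see why the scaling factor matters, the sketch below pushes random inputs through a stack of ReLU layers and compares the spread of the final activations under two weight scales: the He factor and the factor 10 used in the random initialization above. The layer sizes, the seed, and the helper name final_activation_std are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def final_activation_std(layers_dims, scale_fn, seed=0):
    """Forward random data through ReLU layers; return the std of the last activations.

    scale_fn(n_prev) returns the factor that multiplies the standard normal weights.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((layers_dims[0], 1000))  # 1000 random examples
    for l in range(1, len(layers_dims)):
        W = rng.standard_normal((layers_dims[l], layers_dims[l - 1])) * scale_fn(layers_dims[l - 1])
        A = np.maximum(0, W @ A)  # ReLU, biases left at zero
    return A.std()

dims = [100] * 11  # hypothetical network: 10 layers of width 100
std_he = final_activation_std(dims, lambda n_prev: np.sqrt(2 / n_prev))
std_big = final_activation_std(dims, lambda n_prev: 10.0)
print(std_he, std_big)
```

With the He factor the activations should stay on the order of 1 after ten layers, while the factor 10 makes them grow by many orders of magnitude, which is what pushes the sigmoid output into saturation.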


L2 Regularization

$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[L](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[L](i)}\right) \right)$$

$$J_{regularized} = \underbrace{-\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[L](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[L](i)}\right) \right)}_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum_l \sum_k \sum_j W_{k,j}^{[l]2}}_\text{L2 regularization cost}$$

To calculate $\sum_k \sum_j W_{k,j}^{[l]2}$, use np.sum(np.square(Wl)).



# GRADED FUNCTION: compute_cost_with_regularization
def compute_cost_with_regularization(A3, Y, parameters, lambd):

    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    cross_entropy_cost = compute_cost(A3, Y)
    L2_regularization_cost = (1 / m) * (lambd / 2) * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
    cost = cross_entropy_cost + L2_regularization_cost
    return cost
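
The function above hardcodes three layers. As a rough sketch, the same penalty can be computed for any number of layers by looping over the dictionary. The helper name l2_regularization_cost and the rule "every key starting with W is a weight matrix" are assumptions made for this illustration.

```python
import numpy as np

def l2_regularization_cost(parameters, lambd, m):
    """Sum of squared weights over every 'W<l>' entry, scaled by lambd / (2 * m)."""
    squared_sum = sum(np.sum(np.square(value))
                      for key, value in parameters.items()
                      if key.startswith("W"))
    return (lambd / (2 * m)) * squared_sum

# Tiny hand-checkable example: the squared weights sum to 1 + 4 + 9 = 14
parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.zeros((1, 1)),
              "W2": np.array([[3.0]]), "b2": np.zeros((1, 1))}
cost_term = l2_regularization_cost(parameters, lambd=0.5, m=7)
print(cost_term)  # 0.5 / (2 * 7) * 14 = 0.5, up to floating-point rounding
```

The biases are skipped on purpose, matching the cost formula, which only sums over the weight matrices.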


$$\frac{d}{dW} \left( \frac{1}{2} \frac{\lambda}{m} W^2 \right) = \frac{\lambda}{m} W$$
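
This derivative can be checked numerically. The sketch below compares the analytic term $\frac{\lambda}{m} W$ with a two-sided finite difference of the penalty for a single small weight matrix. The values of lambd, m and eps, as well as the helper l2_cost, are arbitrary choices for illustration.

```python
import numpy as np

def l2_cost(W, lambd, m):
    """L2 penalty for a single weight matrix: (lambd / (2 * m)) * sum(W ** 2)."""
    return (lambd / (2 * m)) * np.sum(np.square(W))

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 2))
lambd, m, eps = 0.7, 5, 1e-6   # arbitrary illustrative values

analytic = (lambd / m) * W      # the extra term added to dW in backpropagation

numeric = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    numeric[idx] = (l2_cost(W_plus, lambd, m) - l2_cost(W_minus, lambd, m)) / (2 * eps)

max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)
```

Because the penalty is quadratic, the two-sided difference is exact up to rounding, so the reported error should be tiny.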

# GRADED FUNCTION: backward_propagation_with_regularization
def backward_propagation_with_regularization(X, Y, cache, lambd):

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    dZ3 = A3 - Y  # gradient of the cost w.r.t. Z3 (sigmoid output with cross-entropy cost)
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1. / m * np.dot(dZ3 , A2.T) + lambd / m * W3
    ### END CODE HERE ###

    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    # All layers except the last use ReLU, whose gradient is 1 where the input is positive and 0 elsewhere
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))  

    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1. / m * np.dot(dZ2 , A1.T) + lambd / m * W2
    ### END CODE HERE ###

    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))

    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1. / m * np.dot(dZ1 , X.T) + lambd / m * W1
    ### END CODE HERE ###

    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    return gradients


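  • The value of $\lambda$ is a hyperparameter that you can tune using a validation (dev) set.
  • L2 regularization makes the decision boundary smoother. If $\lambda$ is too large, the boundary can be "oversmoothed", resulting in a model with high bias.

L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. By penalizing the squared values of the weights in the cost function, it drives all the weights toward smaller values: large weights simply become too costly. This leads to a smoother model in which the output changes more slowly as the input changes.

What L2 regularization means for:

  • Cost computation: a regularization term is added to the cost.
  • Backpropagation: the gradients with respect to the weight matrices contain an extra term.
  • Weights end up smaller ("weight decay"): the weights are pushed toward smaller values.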
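
The "weight decay" name can be made concrete with a small sketch: a gradient step with the regularized gradient equals first shrinking the weights by a constant factor and then taking a plain gradient step. The learning rate alpha, the values of lambd and m, and the random stand-in for the data gradient are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 3))
dW_data = rng.standard_normal((4, 3))  # stand-in for the cross-entropy gradient
alpha, lambd, m = 0.1, 0.7, 50          # hypothetical hyperparameters

# Update with the regularized gradient, as in backward_propagation_with_regularization
W_reg = W - alpha * (dW_data + (lambd / m) * W)

# Same update rewritten as "shrink the weights, then take a plain gradient step"
decay = 1 - alpha * lambd / m
W_decay = decay * W - alpha * dW_data

print(decay, np.allclose(W_reg, W_decay))
```

The factor decay is slightly below 1, so every step multiplies the weights by a number smaller than 1 before the data gradient is applied.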



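  • A common mistake when using Dropout is to apply it during both training and testing. You should use Dropout (randomly eliminating nodes) only during training.
  • Deep learning frameworks such as tensorflow, PaddlePaddle, keras, and caffe come with a Dropout layer implementation. Don't worry, you will soon learn some of these frameworks.

What to remember about Dropout:

  • Dropout is a regularization technique.
  • Use Dropout only during training. Do not use Dropout (randomly eliminating nodes) at test time.
  • Apply Dropout in both forward and backward propagation.
  • During training, divide the output of each Dropout layer by keep_prob so that the expected value of the activations stays the same. For example, if keep_prob is 0.5, then on average half of the nodes are shut down, so the output is scaled by 0.5 because only the remaining half contribute to the solution. Dividing by 0.5 is equivalent to multiplying by 2, so the output has the same expected value again. You can check that this also works for values of keep_prob other than 0.5.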
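
The claim about the expected value can be checked empirically. The sketch below applies inverted dropout to a matrix of ones for several keep_prob values and prints the resulting means. The helper name inverted_dropout, the matrix size, and the seed are illustrative assumptions.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Zero each entry with probability 1 - keep_prob, then rescale by 1 / keep_prob."""
    D = rng.random(A.shape) < keep_prob   # boolean mask, True = keep
    return A * D / keep_prob

rng = np.random.default_rng(3)
A = np.ones((200, 5000))                  # constant activations make the mean easy to read
means = {kp: inverted_dropout(A, kp, rng).mean() for kp in (0.5, 0.8, 0.9)}
print(means)
```

Each mean should come out close to 1, the mean of the original activations, regardless of the keep_prob used.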



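  1. In lecture, we discussed creating a variable $d^{[1]}$ with the same shape as $a^{[1]}$, using np.random.rand() to generate random numbers between 0 and 1. Here you will use a vectorized implementation, so create a random matrix $D^{[1]} = [d^{[1](1)} d^{[1](2)} ... d^{[1](m)}]$ with the same dimensions as $A^{[1]}$.
  2. Set each entry of $D^{[1]}$ to 0 with probability 1 - keep_prob and to 1 with probability keep_prob by thresholding its values appropriately. Hint: to set the entries of a matrix X to 1 where they are less than 0.5 and to 0 otherwise, run X = (X < 0.5). Note that False and True behave as 0 and 1, respectively.
  3. Set $A^{[1]}$ to $A^{[1]} \times D^{[1]}$, an elementwise product. (You are shutting down some neurons.) You can think of $D^{[1]}$ as a mask: when it is multiplied with another matrix, it "masks out" some of the values.
  4. Divide $A^{[1]}$ by keep_prob. This ensures that the cost still has the same expected value as it would without dropout. (This technique is also called inverted dropout.)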
# GRADED FUNCTION: forward_propagation_with_dropout
def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
     # retrieve parameters
     W1 = parameters["W1"]
     b1 = parameters["b1"]
     W2 = parameters["W2"]
     b2 = parameters["b2"]
     W3 = parameters["W3"]
     b3 = parameters["b3"]
     Z1 = np.dot(W1, X) + b1
     A1 = relu(Z1)
     ### START CODE HERE ### (approx. 4 lines)         
     # Steps 1-4 below correspond to the Steps 1-4 described above.
     # Step 1: initialize matrix D1 = np.random.rand(..., ...)
     D1 = np.random.rand(A1.shape[0] , A1.shape[1])
     # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold).
     # Comparing each entry with keep_prob yields a boolean matrix (True/False) with the same shape as D1.
     D1 = (D1 < keep_prob)
     # Step 3: shut down some neurons of A1
     A1 *= D1

     # Step 4: scale the value of neurons that haven't been shut down
     A1 = np.divide(A1 , keep_prob)

     ### END CODE HERE ###
     Z2 = np.dot(W2, A1) + b2
     A2 = relu(Z2)
     ### START CODE HERE ### (approx. 4 lines)
     # Step 1: initialize matrix D2 = np.random.rand(..., ...)
     D2 = np.random.rand(A2.shape[0] , A2.shape[1])

     # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
     D2 = (D2 < keep_prob)

     # Step 3: shut down some neurons of A2
     A2 *= D2
     # Step 4: scale the value of neurons that haven't been shut down
     A2 = np.divide(A2 , keep_prob)
     ### END CODE HERE ###
     Z3 = np.dot(W3, A2) + b3
     A3 = sigmoid(Z3)
     cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
     return A3, cache


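Backpropagation with Dropout is actually quite simple. You need to carry out two steps:

  1. During forward propagation, you shut down some neurons by applying the mask $D^{[1]}$ to A1. In backpropagation, you have to shut down the same neurons by reapplying the same mask $D^{[1]}$ to dA1.
  2. During forward propagation, you divided A1 by keep_prob. In backpropagation, you therefore have to divide dA1 by keep_prob again (in calculus terms: if $A^{[1]}$ is scaled by keep_prob, then its derivative $dA^{[1]}$ is also scaled by the same keep_prob).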
# GRADED FUNCTION: backward_propagation_with_dropout
def backward_propagation_with_dropout(X, Y, cache, keep_prob):

    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
        # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 * D2
        # Step 2: Scale the value of neurons that haven't been shut down
    dA2 = dA2 / keep_prob
    ### END CODE HERE ###

    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
        # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 * D1
        # Step 2: Scale the value of neurons that haven't been shut down
    dA1 = dA1 / keep_prob
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    return gradients
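
As a sanity check on the two steps, the sketch below fixes a mask, defines a toy scalar loss on the dropped-and-rescaled activations, and compares the gradient "upstream gradient times mask, divided by keep_prob" with a two-sided finite difference. The toy loss, the matrix G standing in for the upstream gradient, and all shapes are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
keep_prob = 0.8
A = rng.standard_normal((3, 4))
D = rng.random(A.shape) < keep_prob        # mask fixed during the forward pass
G = rng.standard_normal(A.shape)           # stand-in for the upstream gradient

def loss(A_in):
    """Toy scalar loss: weighted sum of the dropped-and-rescaled activations."""
    return np.sum(G * (A_in * D / keep_prob))

analytic = G * D / keep_prob               # same mask, same 1 / keep_prob factor

eps = 1e-6
numeric = np.zeros_like(A)
for idx in np.ndindex(*A.shape):
    A_plus, A_minus = A.copy(), A.copy()
    A_plus[idx] += eps
    A_minus[idx] -= eps
    numeric[idx] = (loss(A_plus) - loss(A_minus)) / (2 * eps)

max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)
```

Entries where the mask is False receive zero gradient, and the kept entries are scaled by 1 / keep_prob, which mirrors the two lines added for dA1 and dA2 above.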

Gradient Checking

N-dimensional gradient checking

  • The forward propagation function forward_propagation_n and the backward propagation function backward_propagation_n are provided.

$$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$$

Shifting each parameter (w or b) by a small epsilon gives a new J, and the definition of the derivative then yields an approximation of the derivative with respect to that parameter.
For each index i in num_parameters:

  • Compute J_plus[i]:
    1. Set $\theta^{+}$ to np.copy(parameters_values)
    2. Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
    3. Compute $J^{+}_i$ using forward_propagation_n(x, y, vector_to_dictionary($\theta^{+}$))
  • Compute J_minus[i]: do the same with $\theta^{-}$, subtracting $\varepsilon$ instead of adding it
  • Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2 \varepsilon}$

You thus obtain a vector gradapprox, where gradapprox[i] approximates the gradient with respect to parameters_values[i]. You can now compare this gradapprox vector with the gradient vector from backpropagation. Compute:
$$difference = \frac{\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2}$$

# GRADED FUNCTION: gradient_check_n

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters
    X -- input datapoint, of shape (input size, 1)
    Y -- true "label"
    epsilon -- tiny shift to the input to compute the approximated gradient with the two-sided difference formula

    Returns:
    difference -- relative difference between the approximated gradient and the backward propagation gradient
    """
    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0] 
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    # Compute gradapprox
    # Shift each parameter by epsilon, recompute the cost, and use the definition of the derivative to approximate dJ/dtheta_i
    for i in range(num_parameters):
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function you have to outputs two parameters but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        theta_plus = np.copy(parameters_values)
        theta_plus[i] += epsilon
        J_plus[i], _ = forward_propagation_n(X , Y , vector_to_dictionary(theta_plus))
        ### END CODE HERE ###
        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        theta_minus = np.copy(parameters_values)
        theta_minus[i] -= epsilon
        J_minus[i], _ = forward_propagation_n(X , Y , vector_to_dictionary(theta_minus))
        ### END CODE HERE ###
        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
        ### END CODE HERE ###
    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
        # Step 1'
    numerator = np.linalg.norm(grad - gradapprox)  # np.linalg.norm computes the norm (L2 by default for vectors)
        # Step 2'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
        # Step 3'
    difference = numerator / denominator
    ### END CODE HERE ###

    if difference > 1e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    return difference
  • The authors deliberately planted bugs in the backward_propagation_n function (hint: check dW2 and db1).
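
Since gradient_check_n depends on helpers from the assignment that are not shown here, the following self-contained sketch applies the same procedure to a much smaller model, logistic regression. The functions cost, grad and gradient_check, the buggy flag, and the random data are all assumptions for illustration. The buggy flag doubles the gradient to imitate the kind of planted scaling bug mentioned above.

```python
import numpy as np

def cost(theta, X, Y):
    """Cross-entropy cost of logistic regression; theta packs w (first n entries) and b."""
    w, b = theta[:-1], theta[-1]
    A = 1 / (1 + np.exp(-(w @ X + b)))
    return -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

def grad(theta, X, Y, buggy=False):
    """Analytic gradient; buggy=True doubles it to imitate a planted bug."""
    w, b = theta[:-1], theta[-1]
    A = 1 / (1 + np.exp(-(w @ X + b)))
    dZ = A - Y
    g = np.append(X @ dZ / X.shape[1], np.mean(dZ))
    return 2 * g if buggy else g

def gradient_check(theta, X, Y, analytic, epsilon=1e-7):
    """Relative difference between analytic and the two-sided numerical gradient."""
    gradapprox = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        gradapprox[i] = (cost(theta_plus, X, Y) - cost(theta_minus, X, Y)) / (2 * epsilon)
    numerator = np.linalg.norm(analytic - gradapprox)
    denominator = np.linalg.norm(analytic) + np.linalg.norm(gradapprox)
    return numerator / denominator

rng = np.random.default_rng(5)
X = rng.standard_normal((3, 20))           # 3 features, 20 examples
Y = (rng.random(20) < 0.5).astype(float)   # random binary labels
theta = rng.standard_normal(4)             # 3 weights followed by the bias

diff_ok = gradient_check(theta, X, Y, grad(theta, X, Y))
diff_bug = gradient_check(theta, X, Y, grad(theta, X, Y, buggy=True))
print(diff_ok, diff_bug)
```

The correct gradient should give a very small difference. Doubling the gradient gives a difference close to 1/3, because the numerator is then about the norm of the true gradient and the denominator about three times that norm.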


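  • Gradient checking is slow! Approximating the gradient with $\frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$ is computationally expensive, so we do not run gradient checking at every iteration during training. Running it a few times to verify that the gradient is correct is enough.
  • Gradient checking, at least as presented here, does not work with Dropout. You would usually run the gradient check without Dropout first to make sure your backpropagation is correct, and then add Dropout. 😃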


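  1. 1e-7: a common strict criterion, meaning the maximum allowed relative error is about one part in ten million. If the difference is below this value, the gradient computed by backpropagation is generally considered close enough to the numerical approximation to be accepted.
  2. 1e-5: a slightly looser criterion that allows a relative error of about one part in a hundred thousand. In some cases, for example with limited computing resources or a more complex model, a somewhat higher tolerance may have to be accepted.
  3. 1e-2: for some large networks, or when computing resources are extremely limited, an even looser criterion may be needed. Note, however, that a difference this large may indicate a significant error in backpropagation and should be treated with caution.

Choosing the threshold means weighing the required precision, the computational cost, and the complexity of the model. Gradient checking is normally used only for debugging in the early stages of model development. Once backpropagation has been confirmed to be correct, it can be switched off, because it is too expensive to run during regular training. In practice, most developers prefer a relatively strict threshold (such as 1e-7 or 1e-5) to make sure backpropagation is highly accurate. If the difference exceeds the chosen threshold, look for errors in the backpropagation implementation, fix them, and run the gradient check again.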

