Gradient Checking

This article first appeared on ai-exception.

Abstract

Gradient checking is an important method to verify whether our backpropagation code is correct. In this article, I will first introduce the theory of gradient checking for a general multi-layer network, and then implement gradient checking for a 3-layer network step by step. I hope that after reading this article, you will be able to use gradient checking to track down bugs in your own neural network's backpropagation.

Theory

We know that we can use $\frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}$ to estimate the gradient of $J$ at $\theta$:

$$\lim\limits_{\epsilon \to 0}\frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}\approx\frac{\partial J}{\partial \theta}$$
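As a quick one-dimensional sanity check of this formula, here is a minimal sketch (the quadratic cost $J(\theta)=\theta^2$ and the point $\theta=3$ are assumptions made only for this example):

import numpy as np

def J(theta):
    # toy cost function, J(theta) = theta ** 2, chosen only for illustration
    return theta ** 2

def dJ(theta):
    # analytic derivative of the toy cost: dJ/dtheta = 2 * theta
    return 2 * theta

theta = 3.0
epsilon = 1e-7

# centered (two-sided) difference approximation of dJ/dtheta
grad_approx = (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)

print(grad_approx)  # approximately 6.0
print(dJ(theta))    # exactly 6.0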
But an $l$-layer neural network has many parameters $W^{[1]},b^{[1]},W^{[2]},b^{[2]},\dots,W^{[l]},b^{[l]}$, and backpropagation produces the corresponding gradients $dW^{[1]},db^{[1]},dW^{[2]},db^{[2]},\dots,dW^{[l]},db^{[l]}$. We can merge these parameters into one big parameter vector named $\theta$, which contains $\theta^{[1]},\theta^{[2]},\dots,\theta^{[l]}$, and merge the gradients into $d\theta$, which contains $d\theta^{[1]},d\theta^{[2]},\dots,d\theta^{[l]}$. For gradient checking, we fix the other components of $\theta$ and treat $J$ as a function of a single parameter $\theta^{[i]}$, so we can estimate the gradient of $J$ with respect to $\theta^{[i]}$:
$$d\theta_{approx}^{[i]}=\lim\limits_{\epsilon \to 0}\frac{J(\theta^{[i]}+\epsilon)-J(\theta^{[i]}-\epsilon)}{2\epsilon}\approx\frac{\partial J}{\partial \theta^{[i]}}$$
Thus we can calculate $d\theta_{approx}^{[i]}$ for every $\theta^{[i]}$, and then we should measure whether $d\theta_{approx}$ is close to $d\theta$ within our acceptable range using this formula:
$$difference=\frac{\|d\theta_{approx}-d\theta\|_2}{\|d\theta_{approx}\|_2+\|d\theta\|_2}$$
I will set $\epsilon=10^{-7}$. If $difference \le 10^{-7}$, we can say our backpropagation is correct; if $10^{-7} < difference \le 10^{-5}$, we should check carefully whether our backpropagation is correct; and if $difference > 10^{-5}$, there is a great possibility that our backpropagation is incorrect.
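To make the comparison concrete, here is a minimal sketch of the difference formula applied to two small gradient vectors (the numbers are made up purely for illustration):

import numpy as np

# hypothetical backpropagation gradient and its numerical approximation
dtheta        = np.array([0.5, -1.2, 3.0])
dtheta_approx = np.array([0.5000001, -1.2000001, 2.9999999])

numerator = np.linalg.norm(dtheta_approx - dtheta)
denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
difference = numerator / denominator

print(difference)  # on the order of 1e-8, well below the 1e-7 threshold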

Implement

First of all, we need to implement the dictionary_to_vector and vector_to_dictionary functions to convert between the parameters dictionary and a single parameter vector:

dictionary_to_vector

import numpy as np


def dictionary_to_vector(parameters):
    """
    Roll all our parameters dictionary into a single vector satisfying our specific required shape.
    """
    keys = []
    count = 0
    for key in ["W1", "b1", "W2", "b2", "W3", "b3"]:

        # flatten parameter into a column vector
        new_vector = np.reshape(parameters[key], (-1, 1))
        keys = keys + [key] * new_vector.shape[0]

        if count == 0:
            theta = new_vector
        else:
            theta = np.concatenate((theta, new_vector), axis=0)
        count = count + 1

    return theta, keys

vector_to_dictionary

def vector_to_dictionary(theta):
    """
    Unroll all our parameters dictionary from a single vector satisfying our specific required shape.
    """
    parameters = {}
    parameters["W1"] = theta[:20].reshape((5,4))
    parameters["b1"] = theta[20:25].reshape((5,1))
    parameters["W2"] = theta[25:40].reshape((3,5))
    parameters["b2"] = theta[40:43].reshape((3,1))
    parameters["W3"] = theta[43:46].reshape((1,3))
    parameters["b3"] = theta[46:47].reshape((1,1))

    return parameters
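The gradient_check_n function shown later also relies on a gradients_to_vector helper that is not listed in this article. A minimal sketch, assuming the same 3-layer layout, the same ordering as dictionary_to_vector, and numpy imported as np as above, could look like this:

def gradients_to_vector(gradients):
    """
    Roll the dW/db entries of the gradients dictionary into a single vector,
    in the same order as dictionary_to_vector ("W1", "b1", ..., "W3", "b3").
    """
    count = 0
    for key in ["dW1", "db1", "dW2", "db2", "dW3", "db3"]:
        # flatten gradient into a column vector
        new_vector = np.reshape(gradients[key], (-1, 1))

        if count == 0:
            theta = new_vector
        else:
            theta = np.concatenate((theta, new_vector), axis=0)
        count = count + 1

    return theta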

Then we can implement the forward and backward propagation of the neural network:

forward_propagation_n


def forward_propagation_n(X, Y, parameters):
    """
    Implements the forward propagation for the 3-layer network and computes the cost.

    Arguments:
    X -- training set for m examples
    Y -- labels for m examples
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (5, 4)
                    b1 -- bias vector of shape (5, 1)
                    W2 -- weight matrix of shape (3, 5)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)

    Returns:
    cost -- the cross-entropy cost averaged over the m examples
    cache -- tuple of intermediate values needed by backward_propagation_n
    """

    # retrieve parameters
    m = X.shape[1]
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    # Cost (cross-entropy, averaged over the m examples)
    logprobs = np.multiply(-np.log(A3), Y) + np.multiply(-np.log(1 - A3), 1 - Y)
    cost = 1. / m * np.sum(logprobs)

    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)

    return cost, cache
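Note that forward_propagation_n calls relu and sigmoid helpers that are not defined in this article. A minimal sketch of these two activation functions (assuming numpy is imported as np as above):

def sigmoid(x):
    # element-wise logistic function
    return 1. / (1. + np.exp(-x))


def relu(x):
    # element-wise rectified linear unit, max(0, x)
    return np.maximum(0, x)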

backward_propagation_n

def backward_propagation_n(X, Y, cache):
    """
    Implements the backward propagation for the 3-layer network above.

    Arguments:
    X -- training set for m examples
    Y -- labels for m examples
    cache -- cache output from forward_propagation_n()

    Returns:
    gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.
    """

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T) * 2
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    db1 = 4. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

Then we can implement the gradient check itself:

gradient_check_n

Instructions

Here is pseudo-code that will help you implement the gradient check.

For each i in num_parameters:

  • To compute J_plus[i]:
    1. Set $\theta^{+}$ to np.copy(parameters_values)
    2. Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
    3. Calculate $J^{+}_i$ using forward_propagation_n(X, Y, vector_to_dictionary($\theta^{+}$)).
  • To compute J_minus[i]: do the same thing with $\theta^{-}$
  • Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2 \varepsilon}$

Thus, we get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to parameter_values[i]. You can now compare this gradapprox vector to the gradients vector from backpropagation. Just like for the 1D case (Steps 1’, 2’, 3’), compute:
$$difference = \frac {\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2 } \tag{3}$$

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n
    
    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters.
    X -- training set for m examples
    Y -- labels for m examples
    epsilon -- tiny shift applied to the parameters to compute the approximated gradient
    
    Returns:
    difference -- difference (formula (3)) between the approximated gradient and the backward propagation gradient
    """
    
    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    
    # Compute gradapprox
    for i in range(num_parameters):
        
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because forward_propagation_n returns two values but we only care about the first one

        thetaplus = np.copy(parameters_values)  # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon  # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))  # Step 3
    

        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
       
        thetaminus = np.copy(parameters_values)  # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon  # Step 2
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))  # Step 3
       

        # Compute gradapprox[i]
       
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
      
    # Compare gradapprox to backward propagation gradients by computing difference.
   
    numerator = np.linalg.norm(grad - gradapprox)  # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)  # Step 2'
    difference = numerator / denominator  # Step 3'
  

    if difference > 2e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    
    return difference

Test gradient checking
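The test below uses gradient_check_n_test_case, a helper from the course materials that is not listed here. A hypothetical stand-in that produces random data with the shapes this 3-layer network expects (assuming numpy is imported as np as above) might look like this:

def gradient_check_n_test_case():
    """
    Hypothetical stand-in for the course-provided test case: random inputs,
    labels and parameters with the shapes expected by the 3-layer network.
    """
    np.random.seed(1)
    x = np.random.randn(4, 3)     # 3 examples with 4 input features each
    y = np.array([[1., 1., 0.]])  # binary labels, shape (1, 3)
    W1 = np.random.randn(5, 4)
    b1 = np.random.randn(5, 1)
    W2 = np.random.randn(3, 5)
    b2 = np.random.randn(3, 1)
    W3 = np.random.randn(1, 3)
    b3 = np.random.randn(1, 1)
    parameters = {"W1": W1, "b1": b1,
                  "W2": W2, "b2": b2,
                  "W3": W3, "b3": b3}
    return x, y, parameters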

X, Y, parameters = gradient_check_n_test_case()

cost, cache = forward_propagation_n(X, Y, parameters)
gradients = backward_propagation_n(X, Y, cache)
difference = gradient_check_n(parameters, gradients, X, Y)

Output:

[Screenshot: "There is a mistake in the backward propagation! difference = ..."]

It seems that there were errors in the backward_propagation_n code we implemented! It is a good thing we implemented gradient checking. Go back to backward_propagation_n and find/correct the errors:

dW2 = 1. / m * np.dot(dZ2, A1.T) * 2
db1 = 4. / m * np.sum(dZ1, axis=1, keepdims=True)

It should be:

dW2 = 1. / m * np.dot(dZ2, A1.T) 
db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

Then we run the gradient checking test code again:

[Screenshot: "Your backward propagation works perfectly fine! difference = ..."]

The backpropagation is correct now!

What you should remember from this article

  • Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
  • Gradient checking is slow, so we don’t run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.

Links

github
