Neural Networks: Learning: Gradient checking

Abstract: This article is the transcript of Lecture 76, "Gradient Checking", from Chapter 10, "Neural Networks: Learning" (backpropagation for neural network parameters), of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and edited it to make it more concise and readable, for later reference, and I am sharing it here. If there are any mistakes, corrections are very welcome, and I sincerely thank you for them. I hope it is helpful for your studies.
————————————————
In the last few videos, we talked about how to do forward propagation and back propagation in a neural network in order to compute derivatives. But back prop as an algorithm has a lot of details and can be a little tricky to implement. And one unfortunate property is that there are many ways to have subtle bugs in back prop, so that if you run it with gradient descent or some other optimization algorithm, it could actually look like it's working. Your cost function J(\Theta ) may end up decreasing on every iteration of gradient descent, even though there is some bug in your implementation of back prop. So it looks like J(\Theta ) is decreasing, but you might just wind up with a neural network that has a higher level of error than you would with a bug-free implementation, and you might never know that this subtle bug was what gave you that performance. So what can we do about this? There's an idea called gradient checking that eliminates almost all of these problems. So today, every time I implement back propagation or a similar gradient descent algorithm on a neural network, or any other reasonably complex model, I always implement gradient checking. If you do this, it will help you gain high confidence that your implementation of forward prop and back prop is 100% correct. And in my experience, this pretty much eliminates all the problems associated with a buggy implementation of back prop. In the previous videos, I asked you to take on faith that the formulas I gave for computing the \delta s, the Ds, and so on actually do compute the gradients of the cost function. But once you implement numerical gradient checking, which is the topic of this video, you'll be able to verify for yourself that the code you're writing is indeed computing the derivative of the cost function J(\Theta ). So here's the idea.

Consider the following example. Suppose I have the function J(\theta ) and some value \theta, and for this example I'm going to assume that \theta is just a real number. Let's say I want to estimate the derivative of this function at this point. The derivative is equal to the slope of the tangent line at that point. Here's how I'm going to numerically approximate the derivative, or rather, here's a procedure for numerically approximating the derivative. I'm going to compute \theta +\epsilon, a value a little bit to the right, and \theta -\epsilon, a value a little bit to the left. I'm going to look at those two points on the curve, connect them by a straight line, and use the slope of that line as my approximation to the derivative, where the true derivative is the slope of the tangent line. And it seems like it would be a pretty good approximation. Mathematically, the slope of this line is its vertical height divided by its horizontal width: the point on top is J(\theta +\epsilon ), the point on the bottom is J(\theta -\epsilon ), so the vertical difference is J(\theta +\epsilon )-J(\theta -\epsilon ), and the horizontal distance is just 2\epsilon. So my approximation is that the derivative of J(\theta ) with respect to \theta, at this value of \theta, is approximately:

\frac{d}{d\theta }J(\theta )\approx \frac{J(\theta +\epsilon )-J(\theta -\epsilon )}{2\epsilon }

Usually, I use a pretty small value for \epsilon, maybe on the order of 10^{-4}. There's usually a large range of different values for \epsilon that work just fine. In fact, if you let \epsilon become really small, then mathematically this term becomes exactly the derivative, exactly the slope of the function at this point. It's just that we don't want to use an \epsilon that's too small, because then you might run into numerical problems. So I usually use \epsilon around 10^{-4}, say. By the way, some of you may have seen an alternative formula for estimating the derivative, \frac{J(\theta +\epsilon )-J(\theta )}{\epsilon }, which is called a one-sided difference, whereas the formula above is called a two-sided difference. The two-sided difference gives us a slightly more accurate estimate, so I usually use that rather than the one-sided difference estimate. So concretely, what you implement in Octave is a call to compute gradApprox, which is going to be the approximation to the derivative, using just this formula. This will give you a numerical estimate of the gradient at that point, and in this example it seems like a pretty good estimate.
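For the scalar case, the two-sided difference is a single line of Octave. Here is a minimal sketch; the cost function J used below is an illustrative assumption, not one from the lecture:

```matlab
EPSILON = 1e-4;            % on the order of 10^-4, as suggested above
J = @(theta) theta.^3;     % illustrative cost function (assumption)
theta = 1.0;

% Two-sided (centered) difference approximation of dJ/dtheta:
gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2 * EPSILON);

% For J(theta) = theta^3, the true derivative at theta = 1 is 3,
% and gradApprox comes out very close to 3.
```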

Now, on the previous slide, we considered the case where \theta was a real number. Let's look at the more general case where \theta is a vector of parameters. So let's say \theta \in \mathbb{R}^{n}, and it might be an unrolled version of the parameters of our neural network. So \theta is a vector with n elements, \theta _{1} up to \theta _{n}. We can then use a similar idea to approximate all of the partial derivative terms:

\frac{\partial }{\partial \theta _{1}}J(\theta )\approx \frac{J(\theta _{1}+\epsilon ,\theta _{2},\dots ,\theta _{n})-J(\theta _{1}-\epsilon ,\theta _{2},\dots ,\theta _{n})}{2\epsilon }

\vdots

\frac{\partial }{\partial \theta _{n}}J(\theta )\approx \frac{J(\theta _{1},\theta _{2},\dots ,\theta _{n}+\epsilon )-J(\theta _{1},\theta _{2},\dots ,\theta _{n}-\epsilon )}{2\epsilon }

These equations give you a way to numerically approximate the partial derivative of J with respect to any one of your parameters \theta _{i}.

Concretely, what we implement in Octave to numerically compute the derivatives is the following. We say for i equals 1 through n, where n is the dimension of our parameter vector \theta. And I usually do this with the unrolled version of the parameters, so \theta is just a long list of all of the parameters in my neural network. I'm going to set thetaPlus = theta, and then increment the ith element of thetaPlus by \epsilon. So thetaPlus is equal to theta, except for thetaPlus(i), which is now incremented by \epsilon: thetaPlus is \theta _{1}, \theta _{2}, and so on, with \theta _{i}+\epsilon in the ith position, down to \theta _{n}. Similarly, two more lines set thetaMinus to something similar, except that \theta _{i}+\epsilon now becomes \theta _{i}-\epsilon. And then finally, you compute gradApprox(i), and this will give you your approximation to the partial derivative of J(\theta ) with respect to \theta _{i}. The way we use this in our neural network implementation is that we run this for loop to compute the partial derivatives of the cost function with respect to every parameter in our network, and we can then compare to the gradient that we got from back propagation. So DVec is the vector of derivatives we got from back prop; back-propagation was a relatively efficient way to compute the derivatives, the partial derivatives, of the cost function with respect to all of our parameters. And what I usually do is then take my numerically computed derivative, this gradApprox that we just had up here, and make sure that it is equal, or approximately equal up to small values of numerical round-off, to the DVec that I got from back prop. If these two ways of computing the derivative give me the same answer, or at least very similar answers up to a few decimal places, then I'm much more confident that my implementation of back prop is correct. And when I plug these DVec vectors into gradient descent or some advanced optimization algorithm, I can then be much more confident that I'm computing the derivatives correctly, and therefore that hopefully my code will run correctly and do a good job optimizing J(\theta ).
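In Octave, the loop just described looks like the sketch below. It assumes theta is the unrolled parameter vector and J is a function handle that evaluates the cost on such a vector (the handle name is an assumption; the loop itself follows the description above):

```matlab
EPSILON = 1e-4;
n = length(theta);          % theta: unrolled parameter vector
gradApprox = zeros(n, 1);

for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + EPSILON;    % perturb only the ith element
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end

% gradApprox should now be approximately equal, up to numerical
% round-off, to the DVec computed by back propagation.
```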

Finally, I want to put everything together and tell you how to implement this numerical gradient checking. Here's what I usually do. The first thing I do is implement back-propagation to compute DVec; this is the procedure we talked about in an earlier video, where DVec may be the unrolled version of the D matrices. Then I implement numerical gradient checking to compute gradApprox; this is what I described earlier in this video, on the previous slide. Then you should make sure that DVec and gradApprox give similar values, let's say up to a few decimal places. And finally, and this is the important step, before you start to use your code for learning, for seriously training your network, it's important to turn off gradient checking and to no longer compute gradApprox using the numerical derivative formulas that we talked about earlier in this video. The reason is that the numerical gradient checking code is a very computationally expensive, very slow way to approximate the derivative. Whereas in contrast, the back-propagation algorithm that we talked about earlier, the one for computing D^{(1)}, D^{(2)}, D^{(3)}, or DVec, is a much more computationally efficient way of computing the derivatives. So once you've verified that your implementation of back-propagation is correct, you should turn off gradient checking and just stop using it. Just to reiterate, be sure to disable your gradient checking code before running your algorithm for many iterations of gradient descent, or for many iterations of the advanced optimization algorithms, in order to train your classifier. Concretely, if you were to run numerical gradient checking on every single iteration of gradient descent, or in the inner loop of your cost function, your code would be very slow, because the numerical gradient checking code is much slower than the back-propagation method, where, you remember, we were computing \delta ^{(4)}, \delta ^{(3)}, \delta ^{(2)}, and so on. The back-propagation algorithm is a much faster way to compute derivatives than gradient checking. So once you've verified that your implementation of back-propagation is correct, make sure you turn off, or disable, your gradient checking code while you're training your algorithm, or your code could run very slowly. So that's how you compute gradients numerically, and that's how you can verify that your implementation of back-propagation is correct. Whenever I implement back-propagation or a similar gradient computation for a complex model, I always use gradient checking; it really helps me make sure that my code is correct.
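As a concrete illustration of the comparison step, one common way to check that gradApprox and DVec agree is a relative-difference test. This is a minimal sketch; the 1e-9 threshold is a conventional choice, not a value given in the lecture:

```matlab
% Relative difference between the numerical gradient and the
% back prop gradient; a small value means the two agree.
relDiff = norm(gradApprox - DVec) / norm(gradApprox + DVec);

if relDiff < 1e-9
  fprintf('Back prop looks correct (relative difference %g).\n', relDiff);
else
  fprintf('Possible bug in back prop (relative difference %g).\n', relDiff);
end

% Once this check passes, disable the gradient checking code before
% training for many iterations -- it is far too slow to run inside
% the optimization loop.
```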

<end>
