Neural Networks: Learning: Gradient checking

Abstract: This article is the transcript of Lecture 76, "Gradient Checking", from Chapter 10, "Neural Networks: Learning" (backpropagation for neural network parameters), of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and edited it to make it more concise and readable, for later reference, and I am sharing it here. If there are any mistakes, corrections are very welcome, and I sincerely thank you for them. I hope it is helpful for your studies.
————————————————
In the last few videos, we talked about how to do forward propagation and back propagation in a neural network in order to compute derivatives. But back prop as an algorithm has a lot of details and can be a little tricky to implement. And one unfortunate property is that there are many ways to have subtle bugs in back prop, so that if you run it with gradient descent or some other optimization algorithm, it could actually look like it's working. Your cost function J(\Theta ) may end up decreasing on every iteration of gradient descent, even though there is some bug in your implementation of back prop. So it looks like J(\Theta ) is decreasing, but you might just wind up with a neural network that has a higher level of error than you would with a bug-free implementation, and you might never know that this subtle bug was what gave you that performance. So what can we do about this? There's an idea called gradient checking that eliminates almost all of these problems. So today, every time I implement back propagation or a similar gradient descent algorithm on a neural network, or any other reasonably complex model, I always implement gradient checking. If you do this, it will help you gain high confidence that your implementation of forward prop and back prop is 100% correct. And in my experience, this pretty much eliminates all the problems associated with a buggy implementation of back prop. In the previous videos, I asked you to take on faith that the formulas I gave for computing the \delta s, the Ds, and so on actually do compute the gradients of the cost function. But once you implement numerical gradient checking, which is the topic of this video, you'll be able to verify for yourself that the code you're writing is indeed computing the derivative of the cost function J(\Theta ). So here's the idea.

Consider the following example. Suppose I have the function J(\theta ) and some value \theta, and for this example I'm going to assume that \theta is just a real number. Let's say I want to estimate the derivative of this function at this point. The derivative is equal to the slope of the tangent line at that point. Here's how I'm going to numerically approximate the derivative, or rather, here's a procedure for numerically approximating the derivative. I'm going to compute \theta +\epsilon, a value a little bit to the right, and \theta -\epsilon, a value a little bit to the left. I'm going to look at those two points on the curve, connect them by a straight line, and use the slope of that line as my approximation to the derivative, where the true derivative is the slope of the tangent line. And it seems like it would be a pretty good approximation. Mathematically, the slope of this line is its vertical height divided by its horizontal width: the point on top is J(\theta +\epsilon ), the point on the bottom is J(\theta -\epsilon ), so the vertical difference is J(\theta +\epsilon )-J(\theta -\epsilon ), and the horizontal distance is just 2\epsilon. So my approximation is that the derivative of J(\theta ) with respect to \theta, at this value of \theta, is approximately:

\frac{d}{d\theta }J(\theta )\approx \frac{J(\theta +\epsilon )-J(\theta -\epsilon )}{2\epsilon }

Usually, I use a pretty small value for \epsilon, maybe on the order of 10^{-4}. There's usually a large range of different values for \epsilon that work just fine. In fact, if you let \epsilon become really small, then mathematically this term becomes exactly the derivative, exactly the slope of the function at this point. It's just that we don't want to use an \epsilon that's too small, because then you might run into numerical problems. So I usually use \epsilon around 10^{-4}, say. By the way, some of you may have seen an alternative formula for estimating the derivative, \frac{J(\theta +\epsilon )-J(\theta )}{\epsilon }, which is called a one-sided difference, whereas the formula above is called a two-sided difference. The two-sided difference gives us a slightly more accurate estimate, so I usually use that rather than the one-sided difference estimate. So concretely, what you implement in Octave is a call to compute gradApprox, which is going to be the approximation to the derivative, using just this formula. This will give you a numerical estimate of the gradient at that point, and in this example it seems like a pretty good estimate.
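For the scalar case, the two-sided difference is a single line of Octave. Here is a minimal sketch; the cost function J used below is an illustrative assumption, not one from the lecture:

```matlab
EPSILON = 1e-4;            % on the order of 10^-4, as suggested above
J = @(theta) theta.^3;     % illustrative cost function (assumption)
theta = 1.0;

% Two-sided (centered) difference approximation of dJ/dtheta:
gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2 * EPSILON);

% For J(theta) = theta^3, the true derivative at theta = 1 is 3,
% and gradApprox comes out very close to 3.
```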

Now, on the previous slide, we considered the case where \theta was a real number. Let's look at the more general case where \theta is a vector of parameters. So let's say \theta \in \mathbb{R}^{n}, and it might be an unrolled version of the parameters of our neural network. So \theta is a vector with n elements, \theta _{1} up to \theta _{n}. We can then use a similar idea to approximate all of the partial derivative terms:

\frac{\partial }{\partial \theta _{1}}J(\theta )\approx \frac{J(\theta _{1}+\epsilon ,\theta _{2},\dots ,\theta _{n})-J(\theta _{1}-\epsilon ,\theta _{2},\dots ,\theta _{n})}{2\epsilon }

\vdots

\frac{\partial }{\partial \theta _{n}}J(\theta )\approx \frac{J(\theta _{1},\theta _{2},\dots ,\theta _{n}+\epsilon )-J(\theta _{1},\theta _{2},\dots ,\theta _{n}-\epsilon )}{2\epsilon }

These equations give you a way to numerically approximate the partial derivative of J with respect to any one of your parameters \theta _{i}.

Concretely, what we implement in Octave to numerically compute the derivatives is the following. We say for i equals 1 through n, where n is the dimension of our parameter vector \theta. And I usually do this with the unrolled version of the parameters, so \theta is just a long list of all of the parameters in my neural network. I'm going to set thetaPlus = theta, and then increment the ith element of thetaPlus by \epsilon. So thetaPlus is equal to theta, except for thetaPlus(i), which is now incremented by \epsilon: thetaPlus is \theta _{1}, \theta _{2}, and so on, with \theta _{i}+\epsilon in the ith position, down to \theta _{n}. Similarly, two more lines set thetaMinus to something similar, except that \theta _{i}+\epsilon now becomes \theta _{i}-\epsilon. And then finally, you compute gradApprox(i), and this will give you your approximation to the partial derivative of J(\theta ) with respect to \theta _{i}. The way we use this in our neural network implementation is that we run this for loop to compute the partial derivatives of the cost function with respect to every parameter in our network, and we can then compare to the gradient that we got from back propagation. So DVec is the vector of derivatives we got from back prop; back-propagation was a relatively efficient way to compute the derivatives, the partial derivatives, of the cost function with respect to all of our parameters. And what I usually do is then take my numerically computed derivative, this gradApprox that we just had up here, and make sure that it is equal, or approximately equal up to small values of numerical round-off, to the DVec that I got from back prop. If these two ways of computing the derivative give me the same answer, or at least very similar answers up to a few decimal places, then I'm much more confident that my implementation of back prop is correct. And when I plug these DVec vectors into gradient descent or some advanced optimization algorithm, I can then be much more confident that I'm computing the derivatives correctly, and therefore that hopefully my code will run correctly and do a good job optimizing J(\theta ).
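In Octave, the loop just described looks like the sketch below. It assumes theta is the unrolled parameter vector and J is a function handle that evaluates the cost on such a vector (the handle name is an assumption; the loop itself follows the description above):

```matlab
EPSILON = 1e-4;
n = length(theta);          % theta: unrolled parameter vector
gradApprox = zeros(n, 1);

for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + EPSILON;    % perturb only the ith element
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end

% gradApprox should now be approximately equal, up to numerical
% round-off, to the DVec computed by back propagation.
```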

Finally, I want to put everything together and tell you how to implement this numerical gradient checking. Here's what I usually do. The first thing I do is implement back-propagation to compute DVec; this is the procedure we talked about in an earlier video, where DVec may be the unrolled version of the D matrices. Then I implement numerical gradient checking to compute gradApprox; this is what I described earlier in this video, on the previous slide. Then you should make sure that DVec and gradApprox give similar values, let's say up to a few decimal places. And finally, and this is the important step, before you start to use your code for learning, for seriously training your network, it's important to turn off gradient checking and to no longer compute gradApprox using the numerical derivative formulas that we talked about earlier in this video. The reason is that the numerical gradient checking code is a very computationally expensive, very slow way to approximate the derivative. Whereas in contrast, the back-propagation algorithm that we talked about earlier, the one for computing D^{(1)}, D^{(2)}, D^{(3)}, or DVec, is a much more computationally efficient way of computing the derivatives. So once you've verified that your implementation of back-propagation is correct, you should turn off gradient checking and just stop using it. Just to reiterate, be sure to disable your gradient checking code before running your algorithm for many iterations of gradient descent, or for many iterations of the advanced optimization algorithms, in order to train your classifier. Concretely, if you were to run numerical gradient checking on every single iteration of gradient descent, or in the inner loop of your cost function, your code would be very slow, because the numerical gradient checking code is much slower than the back-propagation method, where, you remember, we were computing \delta ^{(4)}, \delta ^{(3)}, \delta ^{(2)}, and so on. The back-propagation algorithm is a much faster way to compute derivatives than gradient checking. So once you've verified that your implementation of back-propagation is correct, make sure you turn off, or disable, your gradient checking code while you're training your algorithm, or your code could run very slowly. So that's how you compute gradients numerically, and that's how you can verify that your implementation of back-propagation is correct. Whenever I implement back-propagation or a similar gradient computation for a complex model, I always use gradient checking; it really helps me make sure that my code is correct.
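As a concrete illustration of the comparison step, one common way to check that gradApprox and DVec agree is a relative-difference test. This is a minimal sketch; the 1e-9 threshold is a conventional choice, not a value given in the lecture:

```matlab
% Relative difference between the numerical gradient and the
% back prop gradient; a small value means the two agree.
relDiff = norm(gradApprox - DVec) / norm(gradApprox + DVec);

if relDiff < 1e-9
  fprintf('Back prop looks correct (relative difference %g).\n', relDiff);
else
  fprintf('Possible bug in back prop (relative difference %g).\n', relDiff);
end

% Once this check passes, disable the gradient checking code before
% training for many iterations -- it is far too slow to run inside
% the optimization loop.
```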

<end>
