Neural Networks: Learning: Backpropagation intuition

Abstract: This article is the transcript of lecture 74, "Backpropagation intuition", from Chapter 10, "Neural Networks: Learning", of Andrew Ng's Machine Learning course. I transcribed it while studying the video and lightly corrected it to make it more concise and easier to read, for future reference, and I am now sharing it here. If you find any mistakes, corrections are very welcome and sincerely appreciated. I also hope it is helpful to your studies.
————————————————
In the previous video, we talked about the back propagation algorithm. To a lot of people seeing it for the first time, the first impression is often: wow, this is a very complicated algorithm with all these different steps, I'm not quite sure how they fit together, and it's kind of like a black box. In case that's how you are feeling about back propagation, that's actually okay. Back propagation, unfortunately, is a less mathematically clean, less mathematically simple algorithm compared to linear regression or logistic regression. I've actually used back propagation pretty successfully for many years, and even today I sometimes feel I don't have a very good sense of just what it's doing, or much intuition about what back propagation is doing. For those of you doing the programming exercises, those will at least mechanically step you through the different steps of how to implement back propagation, so you will be able to get it to work for yourself. What I want to do in this video is look a little bit more at the mechanical steps of back propagation and try to give you a little more intuition about what those mechanical steps are doing, to hopefully convince you that it is at least a reasonable algorithm. If even after this video back propagation still seems very black box, with too many complicated steps, a little bit magical to you, that's actually okay. Even though I have used back prop for many years, it is sometimes a difficult algorithm to understand. But hopefully this video will help a little bit.

In order to better understand back propagation, let's take another closer look at what forward propagation is doing. Here is a neural network with two input units (not counting the bias unit), two hidden units in this layer, two hidden units in the next layer, and then finally one output unit. Again, these counts of 2, 2, 2 do not include the bias units on top. In order to illustrate forward propagation, I'm going to draw this network a little bit differently.

And in particular, I'm going to draw this neural network with the nodes drawn as these very fat ellipses, so that I can write text in them. When we perform forward propagation, we might have some particular training example, say (x^{(i)}, y^{(i)}), and it is this x^{(i)} that we feed into the input layer. So x^{(i)}_{1} and x^{(i)}_{2} are the values we set the input layer to. When we forward propagate to the first hidden layer, we compute z^{(2)}_{1} and z^{(2)}_{2}; these are the weighted sums of the input units. Then we apply the sigmoid, the logistic activation function, to the z values, and that gives us the activation values a^{(2)}_{1} and a^{(2)}_{2}. We then forward propagate again to get z^{(3)}_{1}, apply the activation function to that to get a^{(3)}_{1}, and similarly, like so, until we get z^{(4)}_{1}; applying the activation function gives us a^{(4)}_{1}, which is the final output value of the neural network. Let me erase this arrow to give myself some space. If you look at what this computation is really doing, focusing on this hidden unit, let's say: this weight, shown in magenta, is my weight \theta^{(2)}_{10} (the indexing is not important); this weight here, which I'm highlighting in red, is \theta^{(2)}_{11}; and this weight here, which I'm drawing in cyan, is \theta^{(2)}_{12}. So the way we compute the value z^{(3)}_{1} is z^{(3)}_{1}=\theta^{(2)}_{10}\times 1+\theta^{(2)}_{11}\times a^{(2)}_{1}+\theta^{(2)}_{12}\times a^{(2)}_{2}. And that's forward propagation. It turns out that, as we'll see later in this video, what back propagation is doing is a process very similar to this, except that instead of the computations flowing from the left to the right of the network, they flow from the right to the left, using a very similar computation. I'll say in two slides exactly what I mean by that.
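To make the mechanics concrete, here is a minimal numpy sketch of the forward pass for the 2-2-2-1 network described above. The names sigmoid, forward_propagate, Theta1/Theta2/Theta3 and the exact matrix shapes are my own illustration for this note, not code from the course.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2, Theta3):
    """Forward propagation for the 2-2-2-1 network in the lecture.

    x      : input vector of length 2 (x1, x2), bias not included
    Theta1 : 2x3 weight matrix mapping layer 1 -> layer 2
    Theta2 : 2x3 weight matrix mapping layer 2 -> layer 3
    Theta3 : 1x3 weight matrix mapping layer 3 -> layer 4
    """
    a1 = np.concatenate(([1.0], x))       # add the bias unit; a1 has length 3
    z2 = Theta1 @ a1                      # weighted sums for layer 2
    a2 = np.concatenate(([1.0], sigmoid(z2)))
    # z3[0] = Theta2[0,0]*1 + Theta2[0,1]*a2[1] + Theta2[0,2]*a2[2],
    # which is the z^{(3)}_1 computation written out in the text above.
    z3 = Theta2 @ a2
    a3 = np.concatenate(([1.0], sigmoid(z3)))
    z4 = Theta3 @ a3
    a4 = sigmoid(z4)                      # h_theta(x), the network's output
    return a4, (a1, a2, a3, z2, z3, z4)
```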

To better understand what back propagation is doing, let's look at the cost function. Here's the cost function we have when there is only one output unit; if we have more than one output unit, we just add a summation over the output units, indexed by k, but with only one output unit this is the cost function. Suppose we do forward propagation and back propagation on one example at a time. Let's focus on the single example (x^{(i)}, y^{(i)}), on the case of having one output unit, so y^{(i)} here is just a real number, and let's ignore regularization, so \lambda = 0 and that final regularization term goes away. Now, if you look inside the summation, you find that the cost term associated with the i^{th} training example, that is, the cost associated with the training example (x^{(i)}, y^{(i)}), is given by the expression on the slide. What this cost function does is play a role similar to the squared error. So rather than looking at that complicated expression, if you want, you can think of cost(i) as being approximately the squared difference between what the neural network outputs and the actual value, cost(i) \approx (h_{\theta}(x^{(i)}) - y^{(i)})^{2}. Just as in logistic regression, we actually prefer to use the slightly more complicated cost function using the log, but for the purpose of intuition, feel free to think of the cost function as being sort of the squared error cost function. So cost(i) measures how well the network is doing on correctly predicting example i, that is, how close the output is to the actually observed label y^{(i)}.
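As a small sketch of the single-example cost described here, assuming the standard logistic cost from earlier in the course (written with the usual minus sign so that smaller is better), alongside the squared-error approximation used for intuition; the function names are mine:

```python
import numpy as np

def cost_i(h, y):
    """Logistic (cross-entropy) cost for a single training example,
    with one output unit and lambda = 0 (no regularization).
    h is the network output h_theta(x^{(i)}); y is the label y^{(i)}."""
    return -(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def squared_error_i(h, y):
    """The rough intuition from the lecture: cost(i) behaves roughly
    like the squared difference between the output and the label."""
    return (h - y) ** 2
```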

Now let's look at what back propagation is doing. One useful intuition is that back propagation computes these \delta^{(l)}_{j} terms, and we can think of them as sort of the "error" of the activation value we got for unit j in layer l. More formally, and this is maybe only for those of you who are familiar with calculus, what the \delta terms actually are is this: they are the partial derivatives of the cost function with respect to z^{(l)}_{j}, the weighted sums of inputs that we compute as the z terms. Concretely, the cost function is a function of the label y and of the value h_{\theta}(x) output by the neural network. If we could go inside the neural network and just change those z^{(l)}_{j} values a little bit, that would affect the values the neural network is outputting, and so it would end up changing the cost function. Again, this is really only for those of you who are familiar with partial derivatives: these \delta terms turn out to be the partial derivatives of the cost function with respect to these intermediate terms that we're computing. So they're a measure of how much we would like to change the neural network's weights, in order to affect these intermediate values of the computation, so as to affect the final output of the neural network h_{\theta}(x) and therefore affect the overall cost. In case this last part, the partial derivative intuition, didn't make sense, don't worry about it; the rest of this we can do without really talking about partial derivatives. But let's look in more detail at what back propagation is doing. For the output layer, it first sets this \delta term: \delta^{(4)}_{1}=y^{(i)}-a^{(4)}_{1}, if we're doing forward propagation and back propagation on this training example i. So this really is the error, the difference between the actual value of y and the value that was predicted. That's how we compute \delta^{(4)}_{1}. Next, we're going to propagate these values backwards (I'll explain this in a second) and end up computing the \delta terms of the previous layer, \delta^{(3)}_{1} and \delta^{(3)}_{2}. Then we propagate further backward and end up computing \delta^{(2)}_{1} and \delta^{(2)}_{2}. Now, the back propagation calculation is a lot like running the forward propagation algorithm, but doing it backwards. Here's what I mean. Let's look at how we end up with this value of \delta^{(2)}_{2}. Similar to forward propagation, let me label a couple of weights. This weight, which I'm drawing in cyan, let's say is \theta^{(2)}_{12}, and this weight down here, which I'm highlighting in red, is, let's say, \theta^{(2)}_{22}. If we look at how \delta^{(2)}_{2} is computed for this node, it turns out that what we do is take this value \delta^{(3)}_{1} and multiply it by the weight \theta^{(2)}_{12}, and add to it this value \delta^{(3)}_{2} multiplied by the weight \theta^{(2)}_{22}. So it's really a weighted sum of these \delta values, weighted by the corresponding edge strengths. Concretely, let me fill this in: \delta^{(2)}_{2}=\theta^{(2)}_{12}\delta^{(3)}_{1}+\theta^{(2)}_{22}\delta^{(3)}_{2}. And as another example, let's look at this value: how do we get the value \delta^{(3)}_{2}?
It's a similar process. If this weight, which I'm going to highlight in green, is equal to, say, \theta^{(3)}_{12}, then we have that \delta^{(3)}_{2}=\theta^{(3)}_{12}\delta^{(4)}_{1}. And by the way, so far I've been writing the \delta values only for the hidden units, excluding the bias units. Depending on how you define the back propagation algorithm, and depending on how you implement it, you may end up computing \delta values for the bias units as well. The bias units always output the value "+1", they are just what they are, and there is no way for us to change their value. So, depending on your implementation of back prop, the way I usually implement it, I do end up computing these \delta values, but we just discard them; we don't use them, because they don't end up being part of the calculation needed to compute the derivatives. So hopefully that gives you a little better intuition about what back propagation is doing. A sketch of this backward pass is given below.
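Here is a minimal numpy sketch of the backward pass just described, following the simplified weighted-sum picture used in this video. The function name and the matrix shapes are my own assumptions, matching the 2-2-2-1 network above; the sigmoid-gradient factor from the previous video's full algorithm is deliberately left out, just as it is in this intuition.

```python
import numpy as np

def backpropagate_deltas(a4, y, Theta2, Theta3):
    """Propagate the delta "error" terms from right to left, as in the
    intuition above. (The full algorithm from the previous video also
    multiplies each hidden-layer delta by the sigmoid gradient
    a * (1 - a), elementwise; that factor is omitted here.)

    a4     : the output a^{(4)}_1 of the network for example i
    y      : the label y^{(i)} for this example
    Theta2 : 2x3 weights from layer 2 to layer 3 (column 0 is the bias weight)
    Theta3 : 1x3 weights from layer 3 to layer 4 (column 0 is the bias weight)
    """
    delta4 = y - a4                      # delta^{(4)}_1 = y^{(i)} - a^{(4)}_1
    # Each delta^{(3)}_j is a weighted sum of the layer-4 deltas, weighted by
    # the edges leaving unit j, e.g. delta^{(3)}_2 = Theta^{(3)}_{12} delta^{(4)}_1.
    delta3 = Theta3[:, 1:].T @ np.atleast_1d(delta4)
    # delta^{(2)}_2 = Theta^{(2)}_{12} delta^{(3)}_1 + Theta^{(2)}_{22} delta^{(3)}_2
    delta2 = Theta2[:, 1:].T @ delta3
    return delta2, delta3, delta4
```

Note that the bias columns (column 0 of each Theta) are skipped here, which matches the remark above that the deltas for the bias units are not needed to compute the derivatives.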

In case all of this still seems sort of magical and sort of like a black box, in a later video, the "putting it together" video, I'll try to give a little more intuition about what back propagation is doing. But unfortunately this is a difficult algorithm to try to visualize and understand what it is really doing. Fortunately, many people have been using it very successfully for many years, and if you implement the algorithm you can have a very effective learning algorithm, even though the inner workings of exactly how it works can be hard to visualize.

<end>
