Neural Networks: Learning: Backpropagation algorithm

Abstract: This article is the transcript of Lesson 73, "Backpropagation algorithm", from Chapter 10, "Neural Networks: Learning", of Andrew Ng's Machine Learning course. I transcribed it while studying the videos and lightly edited it for conciseness and readability, so that it can serve as a reference later. I am sharing it here in the hope that it helps others; corrections of any mistakes are warmly welcomed and sincerely appreciated.
————————————————
In the previous video, we talked about the cost function for the neural network. In this video, let’s start to talk about an algorithm for trying to minimize the cost function. In particular, we’ll talk about the back propagation algorithm.

Here’s the cost function that we wrote down in the previous video. What we’d like to do is find parameters \Theta that minimize J(\Theta ). In order to use either gradient descent or one of the advanced optimization algorithms, what we need to do is write code that takes as input the parameters \Theta and computes J(\Theta ) and these partial derivative terms. Remember that the parameters in the neural network are these things \Theta ^{(l)}_{ji}, which are real numbers, and so these are the partial derivative terms we need to compute. In order to compute the cost function J(\Theta ), we just use the formula up here, so what I want to do for most of this video is focus on how we can compute these partial derivative terms.
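For reference, the regularized cost function from the previous video, which the J(\Theta ) above refers to, is J(\Theta )=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left [ y^{(i)}_{k}\log (h_{\Theta }(x^{(i)}))_{k}+(1-y^{(i)}_{k})\log (1-(h_{\Theta }(x^{(i)}))_{k}) \right ]+\frac{\lambda }{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l}}\sum_{j=1}^{s_{l+1}}(\Theta ^{(l)}_{ji})^{2}, where K is the number of output units and s_{l} is the number of units in layer l.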

Let’s start by talking about the case where we have only one training example. So imagine, if you will, that our entire training set comprises only one training example, which is a pair (x,y). I’m not going to write this as (x^{(1)},y^{(1)}); I’ll just write this one training example as (x,y), and let’s step through the sequence of calculations we would do with it. The first thing we do is apply forward propagation in order to compute what our hypothesis actually outputs given the input x. Concretely, remember that a^{(1)} denotes the activation values of the first layer, that is, the input layer. So I’m going to set that to x, and then we’re going to compute z^{(2)}=\Theta ^{(1)}a^{(1)} and a^{(2)}=g(z^{(2)}), and this gives us our activations for the first hidden layer, that is, layer two of the network, and we also add the bias term. Next, we apply two more steps of forward propagation to compute a^{(3)} and a^{(4)}, which is also the output of the hypothesis h_{\Theta }(x). So this is our vectorized implementation of forward propagation, and it allows us to compute the activation values for all of the neurons in our neural network. Next, in order to compute the derivatives, we’re going to use an algorithm called back propagation.
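Before moving on, here is a minimal NumPy sketch of the forward-propagation steps just described, for the four-layer network used in this example. The function and variable names (forward_propagate, Theta1, Theta2, Theta3) are illustrative assumptions, not notation from the lecture.

```python
import numpy as np

def sigmoid(z):
    """The logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2, Theta3):
    """Forward propagation through a 4-layer network (L = 4) for one example x.

    Theta1, Theta2, Theta3 play the role of Theta^(1), Theta^(2), Theta^(3);
    each maps layer l (including its bias unit) to layer l+1.
    """
    a1 = np.concatenate(([1.0], x))            # a^(1) = x, plus the bias unit
    z2 = Theta1 @ a1                           # z^(2) = Theta^(1) a^(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a^(2) = g(z^(2)), plus the bias unit
    z3 = Theta2 @ a2                           # z^(3) = Theta^(2) a^(2)
    a3 = np.concatenate(([1.0], sigmoid(z3)))  # a^(3) = g(z^(3)), plus the bias unit
    z4 = Theta3 @ a3                           # z^(4) = Theta^(3) a^(3)
    a4 = sigmoid(z4)                           # a^(4) = h_Theta(x), the output
    return a1, z2, a2, z3, a3, z4, a4
```

Each Theta matrix has one column per unit of the previous layer plus one for the bias unit, which is why an entry of 1 is prepended before every multiplication.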

The intuition of the back propagation algorithm is that for each node we’re going to compute a term \delta ^{(l)}_{j} that’s going to somehow represent the error of node j in layer l. So, recall that a^{(l)}_{j} denotes the activation of the j^{th} unit of layer l, and this \delta term is, in some sense, going to capture our error in the activation of that neural node; that is, how much we might wish the activation of that node were slightly different. Concretely, take the example neural network that we have on the right, which has four layers, so capital L is equal to 4. For each output unit, we’re going to compute these \delta terms. So \delta ^{(4)}_{j} for the j^{th} unit in the fourth layer is equal to just the activation of that unit, a^{(4)}_{j}, minus the actual value observed in our training example, y_{j}. This activation term can also be written h_{\Theta }(x)_{j}, so this \delta term is just the difference between what our hypothesis outputs and what the value of y was in our training set, where y_{j} is the j^{th} element of the vector-valued y in our labeled training set. And by the way, if you think of \delta, a and y as vectors, then you can also come up with a vectorized implementation of this, which is just \delta ^{(4)}=a^{(4)}-y, where each of \delta ^{(4)}, a^{(4)} and y is a vector whose dimension is equal to the number of output units in our network. So, we’ve now computed the error term \delta ^{(4)} for our network.

What we do next is compute the \delta terms for the earlier layers in our network. Here’s the formula for computing \delta ^{(3)}: \delta ^{(3)}=(\Theta ^{(3)})^{T}\delta ^{(4)}.*g^{'}(z^{(3)}). The term g^{'}(z^{(3)}) formally is the derivative of the activation function g evaluated at the input values given by z^{(3)}. If you know calculus, you can try to work it out yourself and see that it simplifies to the same answer I get, but I’ll just tell you pragmatically what it means: what you do to compute this derivative term g' is just a^{(3)}.*(1-a^{(3)}), where a^{(3)} is the vector of activation values for that layer and 1 is the vector of ones. Next, you apply a similar formula to compute \delta ^{(2)}: \delta ^{(2)}=(\Theta ^{(2)})^{T}\delta ^{(3)}.*g^{'}(z^{(2)}). I won’t prove it here, but it’s possible to prove, if you know calculus, that the expression a^{(2)}.*(1-a^{(2)}) is mathematically equal to the derivative of the activation function g, which I’m denoting by g prime (g'). And finally, that’s it: there is no \delta ^{(1)} term, because the first layer corresponds to the input layer, and that’s just the features we observed in our training set, so there’s no error associated with it; we don’t really want to try to change those values. So we have \delta terms only for layers 2, 3 and 4 in this example. The name back propagation comes from the fact that we start by computing the \delta term for the output layer, then we go back a layer and compute the \delta terms for the third layer, and then we go back another step to compute \delta ^{(2)}. So we’re sort of back propagating the errors from the output layer to layer 3, then to layer 2, hence the name back propagation.
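Continuing the sketch above, the \delta computations for this four-layer network might look as follows. One practical detail here is an assumption of this sketch: because a^{(2)} and a^{(3)} carry a bias unit, the bias column of each \Theta is dropped when propagating the error back, so that the dimensions match.

```python
def backward_deltas(y, a2, a3, a4, Theta2, Theta3):
    """Error terms for the 4-layer network: delta^(4), delta^(3), delta^(2)."""
    delta4 = a4 - y                                 # delta^(4) = a^(4) - y
    # For the sigmoid, g'(z^(l)) = a^(l) .* (1 - a^(l)); drop the bias entry
    # (index 0) of a^(l), since no error term is propagated to bias units.
    g_prime3 = a3[1:] * (1.0 - a3[1:])              # g'(z^(3))
    delta3 = (Theta3[:, 1:].T @ delta4) * g_prime3  # (Theta^(3))^T delta^(4) .* g'(z^(3))
    g_prime2 = a2[1:] * (1.0 - a2[1:])              # g'(z^(2))
    delta2 = (Theta2[:, 1:].T @ delta3) * g_prime2  # (Theta^(2))^T delta^(3) .* g'(z^(2))
    return delta2, delta3, delta4                   # no delta^(1) for the input layer
```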
Finally, the derivation is surprisingly complicated and surprisingly involved, but if you just do these few steps of computation, it is possible to prove, via a frankly somewhat complicated mathematical proof, that, if you ignore regularization, the partial derivative terms you want are exactly given by the activations and these \delta terms: \frac{\partial }{\partial \Theta ^{(l)}_{ij}}J(\Theta )=a^{(l)}_{j}\delta ^{(l+1)}_{i}. This ignores \lambda, or alternatively assumes the regularization term \lambda is equal to 0; we’ll fix this detail about the regularization term later. So, by performing back propagation and computing these \delta terms, you can pretty quickly compute these partial derivative terms for all of your parameters.
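Ignoring regularization, the formula above says that the gradient with respect to \Theta ^{(l)} for a single example is just the outer product of \delta ^{(l+1)} and a^{(l)}. Continuing the sketch (the function name is again an illustrative assumption):

```python
def single_example_gradients(a1, a2, a3, delta2, delta3, delta4):
    """Unregularized gradients dJ/dTheta^(l) for one (x, y) pair."""
    grad1 = np.outer(delta2, a1)  # dJ/dTheta^(1): shape (s2, s1 + 1)
    grad2 = np.outer(delta3, a2)  # dJ/dTheta^(2): shape (s3, s2 + 1)
    grad3 = np.outer(delta4, a3)  # dJ/dTheta^(3): shape (s4, s3 + 1)
    return grad1, grad2, grad3
```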

So that was a lot of detail. Let’s take everything and put it all together to talk about how to implement back propagation to compute derivatives with respect to your parameters, for the case where we have a large training set, not just a training set of one example. Here’s what we do. Suppose we have a training set of m examples, (x^{(1)},y^{(1)}),\ldots ,(x^{(m)},y^{(m)}), like that shown here. The first thing we’re going to do is set these \Delta ^{(l)}_{ij}=0 (for all l,i,j). This triangular symbol is actually the capital Greek letter delta; the symbol we had on the previous slide was the lowercase \delta, so the triangle is capital \Delta, and we set it equal to 0 for all values of l,i,j. Eventually, this \Delta ^{(l)}_{ij} will be used to compute the partial derivative of J(\Theta ) with respect to \Theta ^{(l)}_{ij}. As we’ll see in a second, these \Delta s are going to be used as accumulators that slowly add things up in order to compute these partial derivatives.

Next, we’re going to loop through our training set: we say for i equals 1 through m, and for the i^{th} iteration we work with the training example (x^{(i)},y^{(i)}). The first thing we do is set a^{(1)}, the activations of the input layer, to be equal to x^{(i)}, the input for our i^{th} training example, and then we perform forward propagation to compute the activations for layer 2, layer 3 and so on up to the final layer, layer L. Next, we use the output label y^{(i)} from this specific example to compute the error term \delta ^{(L)} for the output layer. So \delta ^{(L)} is what our hypothesis outputs minus what the target label was, a^{(L)}-y^{(i)}. Then we use the back propagation algorithm to compute \delta ^{(L-1)}, \delta ^{(L-2)}, and so on down to \delta ^{(2)}; once again, there is no \delta ^{(1)} because we don’t associate an error term with the input layer. And finally, we use these capital \Delta terms to accumulate the partial derivative terms we wrote down on the previous slide: \Delta ^{(l)}_{ij}:=\Delta ^{(l)}_{ij}+a^{(l)}_{j}\delta ^{(l+1)}_{i}. And by the way, it’s possible to vectorize this too. Concretely, if you think of \Delta ^{(l)} as a matrix indexed by the subscripts i,j, then we can rewrite this as \Delta ^{(l)}:=\Delta ^{(l)}+\delta ^{(l+1)}(a^{(l)})^{T}, which automatically does the update for all values of i and j. Finally, after executing the body of the for loop, we go outside the for loop and compute the following: we compute capital D^{(l)}_{ij}, with two separate cases for j=0 and j\neq 0. The case j=0 corresponds to the bias term, which is why it is missing the extra regularization term. Finally, while the formal proof is pretty complicated, what you can show is that, once you’ve computed these D terms, they are exactly the partial derivatives of the cost function with respect to each of your parameters, and so you can use them in either gradient descent or in one of the advanced optimization algorithms.
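Putting the pieces together, a minimal sketch of the whole procedure over m examples, reusing forward_propagate and backward_deltas from above, might look like this. X is assumed to be an (m, n) matrix of inputs and Y an (m, K) matrix of labels (one row per example); the regularization term for j\neq 0 is scaled by \lambda /m here, matching the \lambda /(2m) factor in the regularized cost.

```python
def backprop_gradients(X, Y, Theta1, Theta2, Theta3, lam):
    """Accumulate Delta over all m examples and return D^(1), D^(2), D^(3)."""
    m = X.shape[0]
    # Set Delta^(l)_ij = 0 for all l, i, j: one accumulator per Theta matrix.
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    Delta3 = np.zeros_like(Theta3)
    for i in range(m):
        # Forward propagation for the example (x^(i), y^(i)).
        a1, z2, a2, z3, a3, z4, a4 = forward_propagate(X[i], Theta1, Theta2, Theta3)
        # Back propagate the error terms delta^(4), delta^(3), delta^(2).
        delta2, delta3, delta4 = backward_deltas(Y[i], a2, a3, a4, Theta2, Theta3)
        # Delta^(l) := Delta^(l) + delta^(l+1) (a^(l))^T
        Delta1 += np.outer(delta2, a1)
        Delta2 += np.outer(delta3, a2)
        Delta3 += np.outer(delta4, a3)
    # D^(l) = (1/m) Delta^(l), plus the regularization term for columns j != 0.
    D1, D2, D3 = Delta1 / m, Delta2 / m, Delta3 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]  # skip column j = 0 (bias weights)
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    D3[:, 1:] += (lam / m) * Theta3[:, 1:]
    return D1, D2, D3  # the partial derivatives of J with respect to each Theta^(l)
```

These D matrices can then be unrolled into a single vector and handed, alongside J(\Theta ), to gradient descent or one of the advanced optimizers.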

So that’s the back propagation algorithm, and that’s how you compute the derivatives of the cost function for a neural network. I know there were a lot of details and a lot of steps strung together, but both in the programming assignments and in later videos, we’ll give you a summary that connects all the pieces of the algorithm together, so that you know exactly what you need to implement if you want to use back propagation to compute the derivatives of your neural network’s cost function with respect to those parameters.

<end>
