Abstract: This is the transcript of Lesson 75, "Implementation Note: Unrolling Parameters", from Chapter 10, "The Backpropagation Algorithm for Neural Network Parameters", of Andrew Ng's Machine Learning course. I took these notes down while working through the videos and lightly edited them for concision and readability, for my own future reference, and am sharing them here in the hope that they help others. If you find any errors, corrections are welcome and sincerely appreciated.
————————————————
In the previous video, we talked about how to use back propagation to compute the derivatives of your cost function. In this video, I want to quickly tell you about one implementation detail of unrolling your parameters from matrices to vectors, which we need in order to use the advanced optimization routines.
Concretely, let’s say you’ve implemented a cost function that takes as input the parameters theta and returns the cost jVal and the derivatives gradient: function [jVal, gradient] = costFunction(theta). Then you can pass this to an advanced optimization algorithm like fminunc — and fminunc isn’t the only one, by the way; there are other advanced optimization algorithms as well. But what all of them do is take as input a pointer to the cost function costFunction, and some initial value of theta. And both of these routines assume that theta and the initial value initialTheta are parameter vectors, maybe in R^n or R^(n+1), but vectors. They also assume that your costFunction will return, as a second return value, the gradient, which is likewise a vector in R^n or R^(n+1). This worked fine when we were using logistic regression, but now that we’re using a neural network, our parameters are no longer vectors. Instead they are matrices: for a 4-layer neural network we have the parameter matrices Θ(1), Θ(2), Θ(3), which we might represent in Octave as Theta1, Theta2, Theta3. And similarly for the gradient terms that we’re expected to return: in the previous video we showed how to compute the gradient matrices D(1), D(2), D(3), which we might represent in Octave as matrices D1, D2 and D3. In this video I want to quickly tell you about the idea of how to take these matrices and unroll them into vectors, so that they end up in a format suitable for passing in as theta here, or for getting a gradient out there.
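To see why the unrolling matters, here is a minimal sketch (in Python/NumPy rather than Octave, and with hypothetical names) of the contract such optimization routines expect: the routine only ever manipulates theta as a flat vector, so the cost function must accept a vector and return a cost together with a gradient vector of the same shape. A plain gradient-descent loop stands in for fminunc here.

```python
import numpy as np

def cost_function(theta):
    # Toy cost J(theta) = ||theta||^2 and its gradient 2*theta.
    j_val = np.sum(theta ** 2)
    gradient = 2 * theta
    return j_val, gradient

def simple_optimizer(cost_fn, initial_theta, alpha=0.1, iters=100):
    # The optimizer never sees matrices -- only a flat parameter vector,
    # which is exactly why neural-network Thetas must be unrolled first.
    theta = initial_theta.copy()
    for _ in range(iters):
        _, grad = cost_fn(theta)
        theta -= alpha * grad  # gradient descent step
    return theta

opt_theta = simple_optimizer(cost_function, np.array([1.0, -2.0, 3.0]))
```

Real routines like fminunc follow the same shape of interface: a function handle plus an initial parameter vector in, an optimized parameter vector out.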
Concretely, let’s say we have a neural network with one input layer with 10 units, two hidden layers each with 10 units, and one output layer with just 1 unit. So s1 = 10 is the number of units in layer one, s2 = 10 is the number of units in layer two, s3 = 10 is the number of units in layer three, and s4 = 1 is the number of units in layer four. In this case, the dimensions of your matrices Θ and D are given by these expressions: for example, Θ(1) is going to be a 10×11 matrix, Θ(2) a 10×11 matrix, and Θ(3) a 1×11 matrix, and so on. So in Octave, if you want to convert between these matrices and vectors, what you can do is take your Theta1, Theta2 and Theta3 and write thetaVec = [Theta1(:); Theta2(:); Theta3(:)]. This takes all the elements of your three theta matrices — all the elements of Theta1, then all the elements of Theta2, then all the elements of Theta3 — unrolls them, and puts them into one big long vector, thetaVec. Similarly, the second command, DVec = [D1(:); D2(:); D3(:)], takes all of your D matrices and unrolls them into one long vector, DVec. Finally, if you want to go back from the vector representation to the matrix representation, what you do to get back Theta1, say, is pull out the first 110 elements — Theta1 has 110 elements because it’s a 10×11 matrix — and then use the reshape command to reshape those back into Theta1. Similarly, to get back Theta2 you pull out the next 110 elements, thetaVec(111:220), and reshape them. And for Theta3, you pull out the final 11 elements, thetaVec(221:231), and run reshape to get back Theta3.
>> Theta1 = ones(10,11)
Theta1 =
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
>> Theta2 = 2*ones(10,11)
Theta2 =
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
>> Theta3 = 3*ones(1,11)
Theta3 =
3 3 3 3 3 3 3 3 3 3 3
>> thetaVec = [Theta1(:);Theta2(:);Theta3(:)];
>> size(thetaVec)
ans =
231 1
>> reshape(thetaVec(1:110), 10, 11)
ans =
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
>> reshape(thetaVec(111:220), 10, 11)
ans =
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2
>> reshape(thetaVec(221:231), 1, 11)
ans =
3 3 3 3 3 3 3 3 3 3 3
>>
Here’s a quick Octave demo of that process. So for this example, let’s set Theta1 to be ones(10,11), so it’s a 10×11 matrix of all ones. And just to make this easier to see, let’s set Theta2 to be 2*ones(10,11), and Theta3 to be 3*ones(1,11). So this is 3 separate matrices: Theta1, Theta2, Theta3. We want to put all of these into a vector: thetaVec = [Theta1(:); Theta2(:); Theta3(:)]. Right, that’s a colon in the middle, like so. And now thetaVec is going to be a very long vector with 231 elements. If I display it, I find that this very long vector has all the elements of the first matrix, then all the elements of the second matrix, then all the elements of the third matrix. And if I want to get back my original matrices, I can reshape thetaVec: pull out the first 110 elements and reshape them to a 10×11 matrix, and that gives me back Theta1. If I then pull out the next 110 elements, that is indices 111 to 220, I get back all of my 2s. And if I go from 221 up to the last element, and reshape to 1×11, I get back Theta3.
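For readers working outside Octave, the same unroll/reshape round trip can be sketched in NumPy. One detail to be careful about: Octave stores matrices column-major, so Theta1(:) reads the matrix column by column; passing order='F' to NumPy reproduces that ordering.

```python
import numpy as np

# Mirror of the Octave session above.
Theta1 = np.ones((10, 11))
Theta2 = 2 * np.ones((10, 11))
Theta3 = 3 * np.ones((1, 11))

# Unroll all three matrices into one long vector (column-major,
# matching Octave's Theta1(:) semantics).
theta_vec = np.concatenate([Theta1.flatten(order='F'),
                            Theta2.flatten(order='F'),
                            Theta3.flatten(order='F')])  # 231 elements

# Reshape slices of the vector back into the original matrices.
T1 = theta_vec[0:110].reshape(10, 11, order='F')
T2 = theta_vec[110:220].reshape(10, 11, order='F')
T3 = theta_vec[220:231].reshape(1, 11, order='F')
```

Note the only difference from the Octave indices is that NumPy slices are 0-based and half-open, so thetaVec(111:220) becomes theta_vec[110:220].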
To make this process really concrete, here’s how we use the unrolling idea to implement our learning algorithm. Let’s say that you have some initial values of the parameters Θ(1), Θ(2), Θ(3). What we’re going to do is take these and unroll them into a long vector we’re gonna call initialTheta, to pass into fminunc as the initial setting of the parameters theta. The other thing we need to do is implement the cost function. Here’s my implementation of the cost function, function [jVal, gradientVec] = costFunction(thetaVec). The cost function is going to get as input thetaVec, which is all of my parameters in the form that’s been unrolled into a vector. So the first thing I’m going to do is use the reshape function: I’ll pull out elements from thetaVec and use reshape to get back my original parameter matrices Θ(1), Θ(2), Θ(3). That gives me a more convenient form in which to use these matrices, so that I can run forward propagation and back propagation to compute my derivatives D(1), D(2), D(3) and my cost function J(Θ). And finally, I can then take my derivatives and unroll them, keeping the elements in the same ordering as when I unrolled my thetas: I’m gonna unroll D(1), D(2), D(3) to get gradientVec, which is what my cost function can return — a vector of these derivatives.
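The reshape-compute-unroll pattern just described can be sketched as a skeleton (again in NumPy). The shapes match the 10-10-10-1 network from this video; the forward- and back-propagation computations are stubbed out with placeholder values, since the point here is only the wrapping, not the learning algorithm itself.

```python
import numpy as np

def cost_function(theta_vec):
    # 1. Recover the matrix form (column-major, matching Octave reshape).
    Theta1 = theta_vec[0:110].reshape(10, 11, order='F')
    Theta2 = theta_vec[110:220].reshape(10, 11, order='F')
    Theta3 = theta_vec[220:231].reshape(1, 11, order='F')

    # 2. ... run forward propagation and back propagation here, using
    #    Theta1..Theta3, to compute the cost and gradient matrices ...
    j_val = 0.0                      # placeholder cost J(Theta)
    D1 = np.zeros_like(Theta1)       # placeholder gradient matrices
    D2 = np.zeros_like(Theta2)
    D3 = np.zeros_like(Theta3)

    # 3. Unroll the gradients in the same order the thetas were unrolled,
    #    so element i of gradient_vec differentiates element i of theta_vec.
    gradient_vec = np.concatenate([D1.flatten(order='F'),
                                   D2.flatten(order='F'),
                                   D3.flatten(order='F')])
    return j_val, gradient_vec

j, g = cost_function(np.arange(231.0))
```

Keeping the unroll order identical on the way in and the way out is the one invariant the optimizer depends on.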
So, hopefully, you now have a good sense of how to convert back and forth between the matrix representation of the parameters and the vector representation. The advantage of the matrix representation is that when your parameters are stored as matrices, it’s more convenient to do forward propagation and back propagation, and easier to take advantage of vectorized implementations. Whereas, in contrast, the advantage of the vector representation — when you have thetaVec and DVec — is when you are using the advanced optimization algorithms: those algorithms tend to assume that you have all of your parameters unrolled into a big long vector. And so, with what we just went through, hopefully you can now quickly convert between the two as needed.
<end>