Neural Networks: Learning: Implementation note: unrolling parameters

Abstract: This article is the transcript of lecture 75, "Implementation note: unrolling parameters", from Chapter 10, "Backpropagation for Neural Network Parameters", of Andrew Ng's Machine Learning course. I wrote it down while studying the videos and lightly edited it to make it more concise and easier to read for future reference, and I am sharing it here in the hope that it helps others. If you find any errors, corrections are very welcome, and thank you sincerely in advance.
————————————————
In the previous video, we talked about how to use back propagation to compute the derivatives of your cost function. In this video, I want to quickly tell you about one implementation detail of unrolling your parameters from matrices to vectors, which we need in order to use the advanced optimization routines.

Concretely, let’s say you’ve implemented a cost function that takes as input the parameters (theta) and returns the cost (jVal) and the derivatives (gradient). You can then pass this to an advanced optimization algorithm like fminunc, and fminunc isn’t the only one, by the way; there are other advanced optimization algorithms as well. What all of them do is take as input a pointer to the cost function (@costFunction) and some value of theta (initialTheta). Both of these routines (costFunction and fminunc) assume that theta (the argument of costFunction) and the initial value of theta (the initialTheta passed to fminunc) are parameter vectors, maybe in \mathbb{R}^{n} or \mathbb{R}^{n+1}, but in any case vectors. They also assume that your cost function returns, as a second return value, the gradient (gradient), which is also in \mathbb{R}^{n} or \mathbb{R}^{n+1}, so also a vector. This worked fine when we were using logistic regression, but now that we’re using a neural network, our parameters are no longer vectors; instead they are matrices, where for a 4-layer neural network we would have parameter matrices \Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, which we might represent in Octave as the matrices Theta1, Theta2, Theta3. Similarly for the gradient terms we are expected to return: in the previous video we showed how to compute the gradient matrices D^{(1)}, D^{(2)} and D^{(3)}, which we might represent in Octave as the matrices D1, D2 and D3. In this video I want to quickly tell you about the idea of how to take these matrices and unroll them into vectors, so that they end up in a format suitable for passing in as theta here, or for returning as the gradient there.
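
As a reminder of that calling pattern, an fminunc call in Octave looks roughly like the sketch below. This is only a sketch: the 'MaxIter' value of 100 and the variable names optTheta, functionVal and exitFlag are illustrative, not part of the lecture.

% costFunction must have the signature [jVal, gradient] = costFunction(theta),
% where theta and gradient are vectors and jVal is the scalar cost.
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);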

 

Concretely, let’s say we have a neural network with one input layer with 10 units, two hidden layers each with 10 units, and one output layer with just 1 unit, so that s_{1} = 10, s_{2} = 10, s_{3} = 10 and s_{4} = 1, where s_{l} denotes the number of units in layer l. In this case, the dimensions of your matrices \Theta and D are given by these expressions: \Theta^{(1)}, D^{(1)} \in \mathbb{R}^{10\times 11}; \Theta^{(2)}, D^{(2)} \in \mathbb{R}^{10\times 11}; \Theta^{(3)}, D^{(3)} \in \mathbb{R}^{1\times 11}. So \Theta^{(1)}, for example, is going to be a 10×11 matrix, and so on. In Octave, if you want to convert between these matrices and vectors, what you can do is take your Theta1, Theta2 and Theta3 and write the piece of code shown below: it takes all the elements of Theta1, all the elements of Theta2 and all the elements of Theta3, unrolls them, and puts them into one big long vector, thetaVec. Similarly, the second command takes all of your D matrices and unrolls them into one long vector called DVec. Finally, if you want to go back from the vector representation to the matrix representation, what you do to get back, say, Theta1 is take thetaVec and pull out the first 110 elements (Theta1 has 110 elements because it’s a 10×11 matrix) and then use the reshape command to reshape those back into Theta1. Similarly, to get back Theta2 you pull out the next 110 elements and reshape them, and for Theta3 you pull out the final 11 elements and run reshape to get back Theta3.
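
The piece of code referred to here is the following (assuming, as in the lecture, that the gradient matrices D^{(1)}, D^{(2)}, D^{(3)} are stored in Octave as D1, D2 and D3):

% Unroll the parameter and gradient matrices into long column vectors.
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
DVec = [D1(:); D2(:); D3(:)];

% Recover the matrices from the unrolled vector (110 + 110 + 11 = 231 elements).
Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);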

>> Theta1 = ones(10,11)
Theta1 =

   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1

>> Theta2 = 2*ones(10,11)
Theta2 =

   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2

>> Theta3 = 3*ones(1,11)
Theta3 =

   3   3   3   3   3   3   3   3   3   3   3

>> thetaVec = [Theta1(:);Theta2(:);Theta3(:)];
>> size(thetaVec)
ans =

   231     1
>> reshape(thetaVec(1:110), 10, 11)
ans =

   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1

>> reshape(thetaVec(111:220), 10, 11)
ans =

   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2
   2   2   2   2   2   2   2   2   2   2   2

>> reshape(thetaVec(221:231), 1, 11)
ans =

   3   3   3   3   3   3   3   3   3   3   3

>>

Here’s a quick Octave demo of that process (the session shown above). For this example, let’s set Theta1 to be ones(10,11), so it’s a 10×11 matrix of all ones. And just to make this easier to see, let’s set Theta2 to be 2*ones(10,11), and let’s set Theta3 equal to 3*ones(1,11). So these are 3 separate matrices Theta1, Theta2, Theta3. We want to put all of these into a vector: thetaVec = [Theta1(:); Theta2(:); Theta3(:)], with colons in the middle, like so. Now thetaVec is going to be a very long vector with 231 elements. If I display it, I find that it is this very long vector containing all the elements of the first matrix, then all the elements of the second matrix, then all the elements of the third matrix. And if I want to get back my original matrices, I can reshape slices of thetaVec. Let’s pull out the first 110 elements and reshape them into a 10×11 matrix: this gives me back Theta1. If I then pull out the next 110 elements, indices 111 to 220, I get back all of my 2s. And if I go from 221 up to the last element, 231, and reshape to 1×11, I get back Theta3.

To make this process really concrete, here’s how we use the unrolling idea to implement our learning algorithm. Let’s say that you have some initial values of the parameters \Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}. What we’re going to do is take these and unroll them into a long vector we’re gonna call initialTheta, to pass into fminunc as the initial setting of the parameters theta. The other thing we need to do is implement the cost function. Here’s my implementation of the cost function. The cost function gets as input thetaVec, which contains all of my parameters unrolled into a single vector. So the first thing I’m going to do is use thetaVec with the reshape function: I pull out elements from thetaVec and use reshape to get back my original parameter matrices \Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}. These matrices give me a more convenient form in which to run forward propagation and back propagation to compute my derivatives and my cost function J(\Theta). Finally, I take my derivatives D^{(1)}, D^{(2)}, D^{(3)} and unroll them, keeping the elements in the same ordering as when I unrolled my thetas, to get gradientVec, which is what my cost function returns as its second value: a vector of these derivatives. A skeleton of this function is sketched below.
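
Put together, the cost function might look like the following skeleton. This is only a sketch: the forward propagation / back propagation computations are summarized as a comment, the reshape indices assume the 10-10-10-1 network from the example above, and the function would normally live in its own file (e.g. costFunction.m) so that fminunc can call it.

function [jVal, gradientVec] = costFunction(thetaVec)
  % Reshape the unrolled parameter vector back into the weight matrices.
  % The indices assume s1 = 10, s2 = 10, s3 = 10, s4 = 1 as in the example.
  Theta1 = reshape(thetaVec(1:110), 10, 11);
  Theta2 = reshape(thetaVec(111:220), 10, 11);
  Theta3 = reshape(thetaVec(221:231), 1, 11);

  % ... run forward propagation and back propagation with Theta1, Theta2,
  % Theta3 to compute the cost jVal = J(Theta) and the gradient matrices
  % D1, D2, D3 (as described in the previous videos) ...

  % Unroll the gradient matrices, in the same ordering as the parameters,
  % so that the optimizer receives the gradient as one long vector.
  gradientVec = [D1(:); D2(:); D3(:)];
end

The initial parameters are unrolled in the same way before the call, for example initialTheta = [initialTheta1(:); initialTheta2(:); initialTheta3(:)] (the names initialTheta1, initialTheta2, initialTheta3 are illustrative), and initialTheta is then passed to fminunc as shown earlier.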

So, hopefully, you now have a good sense of how to convert back and forth between the matrix representation of the parameters and the vector representation of the parameters. The advantage of the matrix representation is that when your parameters are stored as matrices it’s more convenient to run forward propagation and back propagation, and it’s easier to take advantage of vectorized implementations. In contrast, the advantage of the vector representation, when you have thetaVec and DVec, is that it is what the advanced optimization algorithms expect: those algorithms tend to assume that you have all of your parameters unrolled into one big long vector. With what we just went through, hopefully you can now quickly convert between the two as needed.

<end>
