Neural Networks - Model Representation II

Abstract: This post is the transcript of Lesson 65, "Model Representation II", from Chapter 9, "Neural Networks Learning", of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it to make it more concise and easier to read, for my own later reference, and I am sharing it here in the hope that it helps others as well. If you find any mistakes, corrections are warmly welcomed and sincerely appreciated.
————————————————
In the last video, we gave a mathematical definition of how to represent or how to compute the hypothesis used by a neural network. In this video, I'd like to show you how to actually carry out that computation efficiently; that is, to show you a vectorized implementation. Second, and more importantly, I want to start giving you intuition about why these neural network representations might be a good idea, and how they can help us learn complex nonlinear hypotheses.

Consider this neural network. Previously, we said that the sequence of steps we need in order to compute the output of a hypothesis is these equations given on the left, where we compute the activation values of the three hidden units, and then use those to compute the final output of our hypothesis h_{\theta}(x). Now, I'm going to define a few extra terms. So, this term I'm underlining here, I'm going to define to be z^{(2)}_{1}, so that we have a^{(2)}_{1} = g(z^{(2)}_{1}). And by the way, these superscripts (2), what they mean is that z^{(2)}, and this a^{(2)} as well, are values associated with layer 2, that is, with the hidden layer in the neural network. Now this term here I'm going to similarly define as z^{(2)}_{2}. And finally, this last term here that I'm underlining, let me define as z^{(2)}_{3}, so that similarly we have a^{(2)}_{3} = g(z^{(2)}_{3}). So these z values are just a linear combination, a weighted linear combination, of the input values x_{1}, x_{2}, x_{3} that go into a particular neuron.

Now, if you look at this block of numbers, you may notice that it looks suspiciously similar to a matrix-vector operation, the matrix-vector multiplication of \Theta^{(1)} times the vector x. Using this observation, we're going to be able to vectorize this computation of the neural network. Concretely, let's define the feature vector x, as usual, to be the vector of x_{0}, x_{1}, x_{2}, x_{3}, where x_{0}, as usual, is always equal to 1. And let's define z^{(2)} to be the vector of these z-values, z^{(2)}_{1}, z^{(2)}_{2}, z^{(2)}_{3}. Notice that this z^{(2)} is a three-dimensional vector. We can now vectorize the computation of a^{(2)}_{1}, a^{(2)}_{2}, a^{(2)}_{3} in two steps. We compute z^{(2)} = \Theta^{(1)} x, which gives us this vector z^{(2)}, and then a^{(2)} = g(z^{(2)}). Just to be clear, z^{(2)} is a three-dimensional vector, and a^{(2)} is also a three-dimensional vector, and so this activation g applies the sigmoid function element-wise to each of z^{(2)}'s elements.

By the way, to make our notation a little more consistent with what we'll do later, in this input layer we have the inputs x, but we can also think of these as the activations of the first layer. So, if I define a^{(1)} to be equal to x, so that a^{(1)} is a vector, I can take this x here and replace it, writing z^{(2)} = \Theta^{(1)} a^{(1)}, just by defining a^{(1)} to be the activations of my input layer. Now, with what I've written so far, I've got the values a^{(2)}_{1}, a^{(2)}_{2}, a^{(2)}_{3}. But I need one more value: I also want a^{(2)}_{0}, which corresponds to the bias unit in the hidden layer that goes to the output there. Of course, there was a bias unit here too; I just didn't draw it in. To take care of this extra bias unit, what we're going to do is add a^{(2)}_{0} = 1. After taking this step, a^{(2)} is now a four-dimensional vector, because we just added this extra a^{(2)}_{0} = 1 corresponding to the bias unit in the hidden layer. And finally, to compute the actual output value of our hypothesis, we simply need to compute z^{(3)}. So z^{(3)} is equal to this term here that I'm underlining; this inner term is z^{(3)}, and z^{(3)} = \Theta^{(2)} a^{(2)}.
And finally, my hypothesis output h_{\theta}(x) = a^{(3)} = g(z^{(3)}), which is the activation of the only unit in my output layer. So that's just a real number; you can write it as a^{(3)} or a^{(3)}_{1}, and it equals g(z^{(3)}). This process of computing h_{\theta}(x) is also called forward propagation. It's called that because we start off with the activations of the input units, then we forward-propagate those to the hidden layer and compute the activations of the hidden layer, and then we forward-propagate again and compute the activations of the output layer. This process of computing the activations from the input layer, to the hidden layer, to the output layer is forward propagation. What we just did is work out a vectorized implementation of this procedure: if you implement it using the equations we have on the right, this gives you a relatively efficient way of computing h_{\theta}(x).
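The two vectorized steps above translate almost line for line into code. Here is a minimal NumPy sketch (the course itself uses Octave, so this is only an illustration, not the course's code); the network sizes, the parameter values in `Theta1` and `Theta2`, and the input `x` are all made up:

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation g(z)."""
    return 1.0 / (1.0 + np.exp(-z))

# Theta^(1) maps layer 1 -> layer 2: shape (3, 4) = (hidden units, inputs + bias).
# Theta^(2) maps layer 2 -> layer 3: shape (1, 4) = (output units, hidden units + bias).
# These parameter values are made up purely for illustration.
Theta1 = np.random.randn(3, 4)
Theta2 = np.random.randn(1, 4)

x = np.array([0.5, -1.2, 3.0])       # an example input (x1, x2, x3)

a1 = np.concatenate(([1.0], x))      # a^(1): the input with the bias unit x0 = 1
z2 = Theta1 @ a1                     # z^(2) = Theta^(1) a^(1)
a2 = sigmoid(z2)                     # a^(2) = g(z^(2)), applied element-wise
a2 = np.concatenate(([1.0], a2))     # add the bias unit a^(2)_0 = 1
z3 = Theta2 @ a2                     # z^(3) = Theta^(2) a^(2)
h = sigmoid(z3)                      # h_theta(x) = a^(3) = g(z^(3))

print(h)                             # a single number (in a length-1 array)
```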

This forward propagation view also helps us understand what neural networks might be doing, and why they might help us learn interesting nonlinear hypotheses. Consider the following neural network, and let's say I cover up the left part of this picture for now. If you look at what's left, it looks a lot like logistic regression, where that single node is just a logistic regression unit, and we're using it to make a prediction h_{\theta}(x). Concretely, what the hypothesis outputs is h_{\theta}(x) = g(\Theta^{(2)}_{10}a^{(2)}_{0} + \Theta^{(2)}_{11}a^{(2)}_{1} + \Theta^{(2)}_{12}a^{(2)}_{2} + \Theta^{(2)}_{13}a^{(2)}_{3}), where the values a^{(2)}_{1}, a^{(2)}_{2}, a^{(2)}_{3} are those given by the three hidden units.

Now, to be consistent with my earlier notation, I need to fill in these superscript (2)'s here, and I also have the index 1 there, because I only have one output unit. But if you focus on the blue parts of the notation, this looks awfully like the standard logistic regression model, except that I now have a capital \Theta instead of a lowercase \theta. What this is doing is just logistic regression, but where the features fed into logistic regression are the values computed by the hidden layer. Just to say that again: what this neural network is doing is just like logistic regression, except that rather than using the original features x_{1}, x_{2}, x_{3}, it is using these new features a_{1}, a_{2}, a_{3} (again, with the superscripts (2) to be consistent with the notation; a small code sketch of this view follows below).

And the cool thing about this is that the features a_{1}, a_{2}, a_{3} are themselves learned as functions of the input. Concretely, the function mapping from layer 1 to layer 2 is determined by some other set of parameters, \Theta^{(1)}. So the neural network, instead of being constrained to feed the features x_{1}, x_{2}, x_{3} into logistic regression, gets to learn its own features a_{1}, a_{2}, a_{3} to feed into logistic regression. And as you can imagine, depending on what parameters it chooses for \Theta^{(1)}, it can learn some pretty interesting and complex features, and therefore you can end up with a better hypothesis than if you were constrained to use the raw features x_{1}, x_{2}, x_{3}, or constrained to choose, say, polynomial terms like x_{1}x_{2}, x_{2}x_{3}, and so on. Instead, this algorithm has the flexibility to learn whatever features it wants, using these a_{1}, a_{2}, a_{3}, to feed into this last unit, which is essentially logistic regression. I realize this example is described at a somewhat high level, so I'm not sure this intuition of the neural network having more complex features will quite make sense yet. If it doesn't, in the next two videos I'm going to go through a specific example of how a neural network can use this hidden layer to compute more complex features to feed into the output layer, and how that can learn more complex hypotheses. So, in case what I'm saying here doesn't quite make sense, stick with me for the next two videos; hopefully, after working through those examples, this explanation will make a little more sense.
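To make the "logistic regression on learned features" point concrete, here is a small sketch under the same illustrative assumptions as before: the helper names `hidden_features` and `logistic_hypothesis`, the layer sizes, and the parameter values are all mine, not from the course. The last line is literally the logistic regression hypothesis, just applied to a^{(2)} instead of x.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_features(x, Theta1):
    """Map raw inputs x to the learned features a^(2), with the bias unit a^(2)_0 = 1 prepended."""
    a1 = np.concatenate(([1.0], x))
    return np.concatenate(([1.0], sigmoid(Theta1 @ a1)))

def logistic_hypothesis(theta, features):
    """Ordinary logistic regression: g(theta^T * features)."""
    return sigmoid(theta @ features)

Theta1 = np.random.randn(3, 4)   # layer 1 -> layer 2 parameters (illustrative values)
Theta2 = np.random.randn(4)      # the output unit's parameters Theta^(2)_{1,0..3}

x = np.array([0.5, -1.2, 3.0])
a2 = hidden_features(x, Theta1)          # the learned features a^(2)
h = logistic_hypothesis(Theta2, a2)      # same formula as logistic regression,
                                         # just with a^(2) in place of x
```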

But just to point out, you can have neural networks with other types of diagrams as well. The way that a neural network is connected is called its architecture, so the term architecture refers to how the different neurons are connected to each other. This is an example of a different neural network architecture, and once again you may be able to get the intuition of how the second layer, here with three hidden units, computes some complex function of the input layer; the third layer can then take the second layer's features and compute even more complex features, so that by the time you get to the output layer, layer four, you have even more complex features than what you were able to compute in layer three, and so you get very interesting nonlinear hypotheses (see the code sketch below). By the way, in a network like this, layer one is called the input layer, layer four is still our output layer, and this network has two hidden layers. So anything that is not an input layer or an output layer is called a hidden layer.

So, hopefully from this video you've gotten a sense of how the forward propagation step in a neural network works, where you start from the activations of the input layer, forward-propagate those to the first hidden layer, then the second hidden layer, and then finally to the output layer. And you also saw how we can vectorize that computation. I realize that some of the intuition in this video, of how the later layers compute complex features of the earlier layers, may still be slightly abstract and at a high level. So what I would like to do in the next two videos is work through a detailed example of how a neural network can be used to compute nonlinear functions of the input, and I hope that will give you a good sense of the sorts of complex nonlinear hypotheses we can get out of neural networks.
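For a deeper architecture like the four-layer network just described, the same two-step pattern (form z, apply g) simply repeats once per layer. Here is a minimal sketch of that generalization; the layer sizes and parameter values are made up for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Thetas):
    """Forward propagation through an arbitrary architecture.

    Thetas = [Theta^(1), Theta^(2), ...], where Theta^(l) maps layer l to layer l+1
    and has shape (units in layer l+1, units in layer l plus 1 for the bias).
    """
    a = x
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))   # add this layer's bias unit
        a = sigmoid(Theta @ a)           # a^(l+1) = g(Theta^(l) a^(l))
    return a                             # activations of the output layer

# Illustrative sizes only: 3 inputs -> 3 hidden -> 2 hidden -> 1 output.
Thetas = [np.random.randn(3, 4), np.random.randn(2, 4), np.random.randn(1, 3)]
print(forward_propagate(np.array([0.5, -1.2, 3.0]), Thetas))
```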

<end>
