Neural Networks: Learning: Cost function

Abstract: This article is the transcript of Lecture 72, "Cost function", from Chapter 10 ("Backpropagation for neural network parameters") of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it for conciseness and readability, so that it can be consulted later. I am sharing it here in the hope that it helps others with their study; corrections are welcome and sincerely appreciated.
Neural networks are one of the most powerful learning algorithms that we have today. In this and the next few videos, I'd like to start talking about a learning algorithm for fitting the parameters of a neural network given a training set. As with the discussion of most of our learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network.

I'm going to focus on the application of neural networks to classification problems. So, suppose we have a network like the one shown on the left, and suppose we have a training set of m examples (x^{(i)}, y^{(i)}). I'm going to use uppercase L to denote the total number of layers in the network; for the network shown on the left, we would have L=4. And I'm going to use s_{l} to denote the number of units, that is, the number of neurons, not counting the bias unit, in layer l of the network. So, for example, we would have s_{1}, the input layer, equal to 3 units, and s_{2} in my example is 5 units. The output layer has s_{4} units, which also equals s_{L} because L=4; in my example the output layer has 4 units. We're going to consider two types of classification problems. The first is binary classification, where the labels y are either 0 or 1. In this case we would have one output unit: the neural network on top has four output units, but for binary classification we would have only one output unit that computes h_{\Theta}(x), and the output of the neural network h_{\Theta}(x) would be a real number. So the number of units in the output layer, s_{L}, where L is again the index of the final layer, would be equal to 1. To simplify notation later, I'm also going to set K=1 in this case, so you can think of K as denoting the number of units in the output layer. The second type is the multi-class classification problem, where we may have K distinct classes. In an earlier video we had a representation for y with four classes; in that case we would have K output units, and our hypothesis outputs vectors that are K-dimensional. We will usually have K greater than or equal to 3 here, because we need the one-versus-all method only if we have 3 or more classes; if we have only 2 classes, a single output unit suffices. The multi-class representation for y is written out below.
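For concreteness, here is that representation: with K=4 classes (the pedestrian/car/motorcycle/truck example from the earlier one-versus-all video), each label y^{(i)} is one of four 4-dimensional indicator vectors,

y \in \left\{ \begin{bmatrix} 1\\0\\0\\0 \end{bmatrix}, \begin{bmatrix} 0\\1\\0\\0 \end{bmatrix}, \begin{bmatrix} 0\\0\\1\\0 \end{bmatrix}, \begin{bmatrix} 0\\0\\0\\1 \end{bmatrix} \right\}

so that y_{k}=1 exactly when the example belongs to class k. Now, let's define the cost function for our neural network.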

The cost function we use for the neural network is going to be a generalization of the one that we used for logistic regression. For logistic regression, we minimized the cost function

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log h_{\theta}(x^{(i)}) + (1-y^{(i)})\log (1-h_{\theta}(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}

where the regularization sum runs from j=1 through n because we did not regularize the bias term \theta_{0}. For a neural network, our cost function is a generalization of this: instead of having essentially one logistic regression output unit, we may have K of them. The neural network now outputs vectors in \mathbb{R}^{K}, where K might be equal to 1 if we have a binary classification problem. I'm going to use the notation (h_{\Theta}(x))_{k} to denote the k-th output; that is, h_{\Theta}(x)\in \mathbb{R}^{K}, and the subscript k just selects the k-th element of the vector output by the neural network. Our cost function J(\Theta) is then

J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_{k}^{(i)}\log (h_{\Theta}(x^{(i)}))_{k} + (1-y_{k}^{(i)})\log (1-(h_{\Theta}(x^{(i)}))_{k}) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l}}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^{2}

The first part is similar to what we have in logistic regression, except that we also sum from k=1 through K, that is, over the K output units. So if I have 4 output units, meaning the final layer of my neural network has four output units, this is a sum from k=1 through 4 of basically the logistic regression cost function, evaluated on each of my 4 output units in turn. Notice in particular that it compares (h_{\Theta}(x))_{k}, the k-th output unit, to y_{k}, the element of the label vector that says which class the example belongs to. Finally, the second term is the regularization term, similar to the one we had for logistic regression. This triple summation looks really complicated, but all it's doing is summing over the terms \Theta_{ji}^{(l)} for all values of i, j, and l, except that we don't sum over the terms corresponding to the bias values, just as in logistic regression. Concretely, we don't sum over the terms where i equals 0. That's because when we compute the activation of a neuron, we have terms like \Theta_{j0}x_{0}+\Theta_{j1}x_{1}+..., where beyond the first layer we would have a_{0}^{(l)}, a_{1}^{(l)} in place of x_{0}, x_{1}; the weights with index 0 multiply into an x_{0} or an a_{0}, which is the bias unit. By analogy to what we did for logistic regression, we won't include those terms in the regularization, because we don't want to regularize them and shrink their values toward 0. But this is just one possible convention; even if you were to sum over i from 0 up to s_{l}, it would work about the same and wouldn't make a big difference. The convention of not regularizing the bias terms is perhaps just slightly more common. So, that's the cost function we're going to use to fit our neural network; a small code sketch of it is given below. In the next video, we'll start to talk about an algorithm for trying to optimize this cost function.
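To make the formula concrete, here is a minimal NumPy sketch of this computation. This is my own illustration rather than code from the course (the course itself uses Octave): the function name nn_cost, the list-of-matrices representation of the \Theta^{(l)}, and the one-hot label matrix Y are assumptions made for this example, and I take s_{3}=5 for the hidden-layer size not stated in the transcript.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(thetas, X, Y, lam):
    """Regularized neural-network cost J(Theta).

    thetas -- list of weight matrices Theta^(l), each of shape
              (s_{l+1}, s_l + 1); column 0 holds the bias weights.
    X      -- (m, s_1) matrix of training inputs.
    Y      -- (m, K) matrix of labels (one-hot rows; K = 1 for binary).
    lam    -- regularization parameter lambda.
    """
    m = X.shape[0]

    # Forward propagation: prepend the bias unit at each layer,
    # then apply the sigmoid of the weighted sum.
    a = X
    for theta in thetas:
        a = np.hstack([np.ones((m, 1)), a])  # bias unit a_0 = 1
        a = sigmoid(a @ theta.T)
    h = a  # (m, K): the outputs (h_Theta(x))_k for every example

    # Cross-entropy term: sum over all m examples and all K output units.
    J = -np.sum(Y * np.log(h) + (1.0 - Y) * np.log(1.0 - h)) / m

    # Regularization term: every weight except column 0 (the bias column),
    # i.e. we skip the i = 0 terms exactly as in the formula above.
    J += lam / (2.0 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return J

# Tiny demo on random data, shaped like the network in this video.
rng = np.random.default_rng(0)
thetas = [rng.normal(0.0, 0.1, (5, 4)),   # Theta^(1): s_2=5 by s_1=3 (+bias)
          rng.normal(0.0, 0.1, (5, 6)),   # Theta^(2): s_3=5 by s_2=5 (+bias)
          rng.normal(0.0, 0.1, (4, 6))]   # Theta^(3): s_4=4 by s_3=5 (+bias)
X = rng.normal(size=(10, 3))              # m=10 examples, 3 input features
Y = np.eye(4)[rng.integers(0, 4, 10)]     # one-hot labels, K=4 classes
print(nn_cost(thetas, X, Y, lam=1.0))
```

Slicing off column 0 in the regularization line mirrors the convention discussed above of not penalizing the bias weights; including it would work about the same.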

<end>
