C1 Week 3: (1) Classification & Cost Function & Gradient Descent

Preface: These notes were taken while studying the new machine learning course offered by Andrew Ng's team on Coursera. The new version is taught in Python and contains some minor changes compared with the older version.
Official course page: Machine Learning Specialization [3 courses] (Stanford) | Coursera
Video link: (强推|双字)2022吴恩达机器学习Deeplearning.ai课程_哔哩哔哩_bilibili

Course 1 Week 3 - Classification covers the following:
1.Classification
2.Cost Function
3.Gradient Descent
4.Regularization to Reduce Overfitting (covered in the separate note: C1 Week 3: (2) Regularization to Reduce Overfitting)
5.Andrew Ng and Fei-Fei Li on Human-Centered AI


catalog

1.Classification

1.1 Motivation

1.2 Logistic Regression

1.3 Decision Boundary

2.Cost Function

2.1 Cost Function for Logistic Regression

2.2 Simplified Cost Function 

3.Gradient Descent

3.1 Gradient Descent Implementation


1.Classification

1.1 Motivation

->Last week you learned about linear regression, which predicts a number. This week, you'll learn about classification, where your output variable y can take on only one of a small handful of possible values instead of any number in an infinite range. It turns out that linear regression is not a good algorithm for classification problems. Let's take a look at why, and this will lead us into a different algorithm called logistic regression, which is one of the most popular and most widely used learning algorithms today.

->Here are some examples of classification problems. Recall the example of trying to figure out whether an email is spam, so the answer you want to output is going to be either a no or a yes. Another example would be figuring out if an online financial transaction is fraudulent: given a financial transaction, can your learning algorithm figure out whether this transaction is fraudulent, such as, was this credit card stolen? Another example we've touched on before was trying to classify a tumor as malignant versus not. In each of these problems, the variable that you want to predict can only be one of two possible values, no or yes. This type of classification problem, where there are only two possible outputs, is called binary classification, where the word binary refers to there being only two possible classes or two possible categories.
->In these problems I will use the terms class and category relatively interchangeably; they mean basically the same thing. By convention, we can refer to these two classes or categories in a few common ways. We often designate classes as no or yes, or sometimes equivalently false or true, or very commonly using the numbers zero or one, following the common convention in computer science with zero denoting false and one denoting true. I'm usually going to use the numbers zero and one to represent the answer y, because that will fit in most easily with the types of learning algorithms we want to implement. But when we talk about it, we'll often say no or yes, or false or true, as well. One piece of terminology commonly used is to call the false or zero class the negative class, and the true or one class the positive class.
->For example, for spam classification, an email that is not spam may be referred to as a negative example, because the answer to the question "is it spam?" is no, or zero. In contrast, an email that is spam might be referred to as a positive training example, because the answer to "is it spam?" is yes, or true, or one. To be clear, negative and positive do not necessarily mean bad versus good or evil versus good. It's just that negative and positive examples are used to convey the concepts of absence (zero or false) versus presence (true or one) of something you might be looking for, such as the absence or presence of the spam property of an email, the absence or presence of fraudulent activity, or the absence or presence of malignancy of a tumor. Between non-spam and spam emails, which one you call false or zero and which one you call true or one is a little bit arbitrary; often either choice could work, so a different engineer might actually swap it around and have the positive class be the presence of a good email, or the positive class be the presence of a real financial transaction or a healthy patient.
->So how do you build a classification algorithm? Here's an example of a training set for classifying whether a tumor is malignant: class one, the positive or yes class, versus class zero, the negative class. I plotted both the tumor size on the horizontal axis as well as the label y on the vertical axis. By the way, in week one, when we first talked about classification, this is how we previously visualized it on the number line, except that now we're calling the classes zero and one and plotting them on the vertical axis. Now, one thing you could try on this training set is to apply the algorithm you already know, linear regression, and try to fit a straight line to the data. If you do that, maybe the straight line looks like this, and that's your f of x. Linear regression predicts not just the values zero and one, but all numbers between zero and one, or even less than zero or greater than one. But here we want to predict categories. One thing you could try is to pick a threshold of, say, 0.5, so that if the model outputs a value below 0.5, then you predict y equals zero, or not malignant, and if the model outputs a number equal to or greater than 0.5, then you predict y equals one, or malignant. Notice that this threshold value of 0.5 intersects the best fit straight line at this point, so if you draw this vertical line here, everything to the left ends up with a prediction of y equals zero, and everything on the right ends up with a prediction of y equals one. Now, for this particular dataset, it looks like linear regression could do something reasonable.
->But now let's see what happens if your dataset has one more training example, this one way over here on the right. Let's also extend the horizontal axis. Notice that this training example shouldn't really change how you classify the data points. The vertical dividing line that we drew just now still makes sense as the cutoff, where tumors smaller than this should be classified as zero and tumors larger than this should be classified as one. But once you've added this extra training example on the right, the best fit line for linear regression will shift over like this, and if you continue using the threshold of 0.5, you now notice that everything to the left of this point is predicted as zero, non-malignant, and everything to the right of this point is predicted to be one, or malignant. This isn't what we want, because adding that example way off to the right shouldn't change any of our conclusions about how to classify malignant versus benign tumors. But if you try to do this with linear regression, adding this one example, which feels like it shouldn't be changing anything, ends up giving us a much worse function for this classification problem. Clearly, when the tumor is large, we want the algorithm to classify it as malignant. So what we just saw was that adding one more example on the right causes the best fit line from linear regression to shift over, and causes the dividing line, also called the decision boundary, to shift over to the right.
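【Code sketch — Linear regression with a 0.5 threshold】
A minimal NumPy sketch (not from the course materials) of the failure mode described above, using made-up tumor sizes and labels; np.polyfit fits the least-squares straight line, and fit_line_and_cutoff is a helper name chosen just for illustration.

```python
import numpy as np

# Made-up 1-D training set: tumor sizes (cm) and labels (0 = benign, 1 = malignant)
x = np.array([1.0, 1.5, 2.0, 3.0, 3.5, 4.0])
y = np.array([0,   0,   0,   1,   1,   1  ])

def fit_line_and_cutoff(x, y):
    """Fit a least-squares straight line and return the tumor size where it crosses 0.5."""
    w, b = np.polyfit(x, y, deg=1)   # straight line f(x) = w*x + b
    return (0.5 - b) / w             # x at which w*x + b == 0.5

print("cutoff without the extra example:", fit_line_and_cutoff(x, y))

# Add one very large malignant tumor far to the right
x2, y2 = np.append(x, 10.0), np.append(y, 1)
print("cutoff with the extra example:   ", fit_line_and_cutoff(x2, y2))
# The cutoff shifts to the right even though the new example should not change
# how the earlier tumors are classified.
```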
->You'll learn more about the decision boundary in the next video. You'll also learn about an algorithm called logistic regression, where the output value will always be between zero and one, and the algorithm will avoid the problems we're seeing on this slide. By the way, one thing confusing about the name logistic regression is that even though it has the word regression in it, it is actually used for classification. Don't be confused by the name, which was given for historical reasons. It's actually used to solve binary classification problems where the output label y is either zero or one.

【About optional lab】
->In the upcoming optional lab you also get to take a look at what happens when you try to use linear regression for classification. Sometimes you get lucky and it may work, but often it will not work well, which is why I don't use linear regression myself for classification. In the optional lab, you'll see an interactive plot that attempts to classify between two categories, and hopefully notice how this often doesn't work very well, which is okay, because that motivates the need for a different model to do classification tasks.

1.2 Logistic Regression

->Let's talk about logistic regression, which is probably the single most widely used classification algorithm in the world. This is something that I use all the time in my work. Let's continue with the example of classifying whether a tumor is malignant. As before, we're going to use the label 1 or yes for the positive class to represent malignant tumors, and 0 or no for negative examples to represent benign tumors. Here's a graph of the dataset where the horizontal axis is the tumor size and the vertical axis takes on only the values 0 and 1, because this is a classification problem. You saw in the last video that linear regression is not a good algorithm for this problem. In contrast, what logistic regression ends up doing is fitting an S-shaped curve that looks like this to the dataset. For this example, if a patient comes in with a tumor of this size, which I'm showing on the x-axis, then the algorithm will output 0.7, suggesting that it is closer to, or maybe more likely to be, malignant than benign. We'll say more later about what 0.7 actually means in this context, but the output label y is never 0.7; it is only ever 0 or 1. To build up to the logistic regression algorithm, there's an important mathematical function I'd like to describe, which is called the Sigmoid function, sometimes also referred to as the logistic function. The Sigmoid function looks like this.
->Notice that the x-axes of the graphs on the left and right are different. In the graph on the left, the x-axis is the tumor size, so it is all positive numbers. Whereas in the graph on the right, you have 0 down here, and the horizontal axis takes on both negative and positive values; I've labeled the horizontal axis z, and I'm showing here just a range of negative 3 to plus 3. The Sigmoid function outputs values between 0 and 1. If I use g of z to denote this function, then the formula for g of z is 1 over 1 plus e to the negative z, where e is a mathematical constant that takes on a value of about 2.7, and e to the negative z is that mathematical constant raised to the power of negative z. Notice that if z were really big, say 100, then e to the negative z is e to the negative 100, which is a tiny number, so this ends up being 1 over 1 plus a tiny little number, and the denominator will be basically very close to 1. That is why when z is large, g of z, the Sigmoid function of z, is going to be very close to 1. Conversely, you can also check for yourself that when z is a very large negative number, g of z becomes 1 over a giant number, which is why g of z is very close to 0. That's why the sigmoid function has this shape, where it starts very close to zero and slowly builds up or grows to the value of one. Also, in the Sigmoid function, when z is equal to 0, then e to the negative z is e to the negative 0, which is equal to 1, so g of z is equal to 1 over 1 plus 1, which is 0.5; that's why it passes the vertical axis at 0.5.
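【Code sketch — Sigmoid function】
A minimal sketch of the sigmoid function described above; the function name sigmoid and the spot-check values are my own choices.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Spot-check the properties described above
print(sigmoid(100))   # ~1.0 : large positive z gives an output very close to 1
print(sigmoid(-100))  # ~0.0 : large negative z gives an output very close to 0
print(sigmoid(0))     # 0.5  : the curve crosses the vertical axis at 0.5
```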
->Now, let's use this to build up the logistic regression algorithm. We're going to do this in two steps. In the first step, I hope you remember that a straight line function, like a linear regression function, can be written as the dot product of w and x, plus b. Let's store this value in a variable which I'm going to call z, and this will turn out to be the same z as the one you saw on the previous slide, but we'll get to that in a minute. The next step is to take this value of z and pass it to the Sigmoid function, also called the logistic function, g. Then g of z outputs a value computed by this formula, 1 over 1 plus e to the negative z, which is going to be between 0 and 1. When you take these two equations and put them together, they give you the logistic regression model f of x, which is equal to g of w dot x plus b, or equivalently g of z, which is equal to this formula over here. This is the logistic regression model, and what it does is take as input a feature or set of features x and output a number between 0 and 1.
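【Code sketch — Logistic regression model】
A minimal sketch of the two-step model just described, f(x) = g(w · x + b); the helper name predict_proba and the example parameters and features are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Logistic regression model: f(x) = g(w . x + b), a number between 0 and 1."""
    z = np.dot(w, x) + b   # step 1: the linear part, z = w . x + b
    return sigmoid(z)      # step 2: pass z through the sigmoid

# Made-up parameters and a single example with two features
w = np.array([1.5, -0.5])
b = -1.0
x = np.array([2.0, 1.0])
print(predict_proba(x, w, b))   # a value strictly between 0 and 1
```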
->Next, let's take a look at how to interpret the output of logistic regression. We'll return to the tumor classification example. The way I encourage you to think of logistic regression's output is as the probability that the class or label y will be equal to 1, given a certain input x. For example, in this application, where x is the tumor size and y is either 0 or 1, if a patient comes in with a tumor of a certain size x, and based on this input x the model outputs 0.7, then what that means is that the model is predicting, or the model thinks, there's a 70 percent chance that the true label y is equal to 1 for this patient. In other words, the model is telling us that it thinks the patient has a 70 percent chance of the tumor turning out to be malignant.
->Now, let me ask you a question; see if you can get this right. We know that y has to be either 0 or 1, so if y has a 70 percent chance of being 1, what is the chance that it is 0? Since y has to be either 0 or 1, the probabilities of these two outcomes have to add up to one, or to 100 percent. That's why, if the chance of y being 1 is 0.7, or 70 percent, then the chance of it being 0 has to be 0.3, or 30 percent. If someday you read research papers or blog posts about logistic regression, sometimes you'll see the notation that f of x is equal to p of y equals 1, given the input features x and with parameters w and b. What the semicolon here denotes is just that w and b are parameters that affect this computation of the probability of y being equal to 1 given the input feature x. For the purpose of this class, don't worry too much about what this vertical line and the semicolon mean. You don't need to remember or follow any of this mathematical notation for this class; I'm mentioning it only because you may see it in other places.

【About optional lab】
->In the optional lab that follows this video, you also get to see how the Sigmoid function is implemented in code. You can see a plot that uses the Sigmoid function so as to do better on the classification tasks that you saw in the previous optional lab. Remember that the code will be provided to you, so you just have to run it. I hope you take a look and get familiar with the code.
->Congrats on getting here. You now know what the logistic regression model is, as well as the mathematical formula that defines logistic regression. For a long time, a lot of Internet advertising was actually driven by basically a slight variation of logistic regression. This was very lucrative for some large companies, and this was basically the algorithm that decided what ad was shown to you and many others on some large websites. Now, there's even more to learn about this algorithm. In the next video, we'll take a look at the details of logistic regression. We'll look at some visualizations and also examine something called the decision boundary. This will give you a few different ways to map the numbers that this model outputs, such as 0.3, 0.7, or 0.65, to a prediction of whether y is actually 0 or 1. Let's go on to the next video to learn more about logistic regression.

1.3 Decision Boundary

->In the last video, you learned about the logistic regression model. Now, let's take a look at the decision boundary to get a better sense of how logistic regression is computing these predictions. To recap, here's how the logistic regression model's output is computed, in two steps. In the first step, you compute z as w dot x plus b. Then you apply the Sigmoid function g to this value z. Here again is the formula for the Sigmoid function. Another way to write this is to say that f of x is equal to g, the Sigmoid function, also called the logistic function, applied to w dot x plus b, where this is, of course, the value of z. If you take the definition of the Sigmoid function and plug in the definition of z, then you find that f of x is equal to this formula over here, 1 over 1 plus e to the negative z, where z is w dot x plus b. You may remember we said in the previous video that we interpret this as the probability that y is equal to 1, given x and with parameters w and b. This is going to be a number like maybe 0.7 or 0.3.
->Now, what if you want the algorithm to predict whether the value of y is going to be zero or one? Well, one thing you might do is set a threshold above which you predict y is one, that is, you set y hat, your prediction, equal to one, and below which you set y hat, your prediction, equal to zero. A common choice is to pick a threshold of 0.5, so that if f of x is greater than or equal to 0.5, then you predict y is one; we write that prediction as y hat equals 1. If f of x is less than 0.5, then you predict y is 0; in other words, the prediction y hat is equal to 0. Now, let's dive deeper into when the model would predict one; in other words, when is f of x greater than or equal to 0.5? Recall that f of x is just equal to g of z, so f is greater than or equal to 0.5 whenever g of z is greater than or equal to 0.5. But when is g of z greater than or equal to 0.5? Well, here's the Sigmoid function over here. g of z is greater than or equal to 0.5 whenever z is greater than or equal to 0, that is, whenever z is on the right half of this axis. Finally, when is z greater than or equal to zero? Well, z is equal to w dot x plus b, so z is greater than or equal to zero whenever w dot x plus b is greater than or equal to zero. To recap, what you've seen here is that the model predicts 1 whenever w dot x plus b is greater than or equal to 0. Conversely, when w dot x plus b is less than zero, the algorithm predicts y is 0.
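【Code sketch — Thresholding the output】
A small sketch of the prediction rule just described, assuming the common 0.5 threshold; the function name predict is my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b, threshold=0.5):
    """Return y_hat = 1 when f(x) >= threshold, else 0.
    With threshold = 0.5 this is equivalent to checking w . x + b >= 0."""
    z = np.dot(w, x) + b
    return 1 if sigmoid(z) >= threshold else 0
```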

【Eg one — Linear Decision Boundary】
->Given this, let's now visualize how the model makes predictions. I'm going to take an example of a classification problem where you have two features, x1 and x2, instead of just one feature. Here's a training set where the little red crosses denote the positive examples and the little blue circles denote negative examples. The red crosses correspond to y equals 1, and the blue circles correspond to y equals 0. The logistic regression model will make predictions using the function f of x equals g of z, where z is now this expression over here, w1 x1 plus w2 x2 plus b, because we have two features x1 and x2. Let's just say for this example that the values of the parameters are w1 equals 1, w2 equals 1, and b equals negative 3. Let's now take a look at how logistic regression makes predictions. In particular, let's figure out when wx plus b is greater than or equal to 0 and when wx plus b is less than 0. To figure that out, there's a very interesting line to look at, which is where wx plus b is exactly equal to 0. It turns out that this line is also called the decision boundary, because that's the line where you're just about neutral about whether y is 0 or y is 1. Now, for the values of the parameters w_1, w_2, and b that we had written down above, this decision boundary is just x_1 plus x_2 minus 3. When is x_1 plus x_2 minus 3 equal to 0? Well, that corresponds to the line x_1 plus x_2 equals 3, and that is this line shown over here. This line turns out to be the decision boundary, where if the features x are to the right of this line, logistic regression would predict 1, and to the left of this line, logistic regression would predict 0. In other words, what we have just visualized is the decision boundary for logistic regression when the parameters w_1, w_2, and b are 1, 1, and negative 3. Of course, if you had a different choice of the parameters, the decision boundary would be a different line.
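【Code sketch — Linear decision boundary】
A short sketch using the parameter values from this example (w1 = 1, w2 = 1, b = -3), so the boundary is the line x1 + x2 = 3; the test points are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, 1.0])   # w1 = 1, w2 = 1
b = -3.0                   # b = -3, so the decision boundary is x1 + x2 - 3 = 0

for x in (np.array([1.0, 1.0]),    # x1 + x2 = 2 < 3 -> predict 0
          np.array([2.0, 2.0]),    # x1 + x2 = 4 > 3 -> predict 1
          np.array([1.5, 1.5])):   # x1 + x2 = 3     -> exactly on the boundary, f = 0.5
    f = sigmoid(np.dot(w, x) + b)
    print(x, round(float(f), 3), int(f >= 0.5))
```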

【Eg two — Non-linear Decision Boundary】
->Now let's look at a more complex example where the decision boundary is no longer a straight line. As before, crosses denote the class y equals 1, and the little circles denote the class y equals 0. Earlier last week, you saw how to use polynomials in linear regression, and you can do the same in logistic regression. Let's set z to be w_1 x_1 squared plus w_2 x_2 squared plus b. With this choice, we're feeding polynomial features into logistic regression. f of x, which equals g of z, is now g of this expression over here. Let's say that we ended up choosing w_1 and w_2 to be 1 and b to be negative 1, so z is equal to 1 times x_1 squared plus 1 times x_2 squared minus 1. The decision boundary, as before, corresponds to where z is equal to 0, and this expression is equal to 0 when x_1 squared plus x_2 squared is equal to 1. If you plot, on the diagram on the left, the curve corresponding to x_1 squared plus x_2 squared equals 1, this turns out to be a circle. When x_1 squared plus x_2 squared is greater than or equal to 1, that's the area outside the circle, and that's when you predict y to be 1. Conversely, when x_1 squared plus x_2 squared is less than 1, that's the area inside the circle, and that's when you predict y to be 0.
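【Code sketch — Circular decision boundary】
A short sketch using the polynomial features and parameter values from this example (w1 = w2 = 1, b = -1), so the boundary is the unit circle; the test points are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w1, w2, b = 1.0, 1.0, -1.0   # z = x1^2 + x2^2 - 1, so the boundary is the unit circle

def predict(x1, x2):
    z = w1 * x1**2 + w2 * x2**2 + b   # polynomial features x1^2 and x2^2
    return int(sigmoid(z) >= 0.5)

print(predict(0.2, 0.3))   # inside the circle  -> 0
print(predict(1.0, 1.0))   # outside the circle -> 1
```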

【Eg three — Non-linear Decision Boundary】
->Can we come up with even more complex decision boundaries than these? Yes, you can, by including even higher-order polynomial terms. Say z is w_1 x_1 plus w_2 x_2 plus w_3 x_1 squared plus w_4 x_1 x_2 plus w_5 x_2 squared. Then it's possible to get even more complex decision boundaries. The model can define decision boundaries such as this example, an ellipse just like this, or, with a different choice of the parameters, you can get even more complex decision boundaries that maybe look like that. This is an example of an even more complex decision boundary than the ones we've seen previously. This implementation of logistic regression will predict y equals 1 inside this shape and, outside the shape, will predict y equals 0. With these polynomial features, you can get very complex decision boundaries. In other words, logistic regression can learn to fit pretty complex data. However, if you do not include any of these higher-order polynomials, so that the only features you use are x_1, x_2, x_3, and so on, then the decision boundary for logistic regression will always be linear; it will always be a straight line.

【About optional lab】
->In the upcoming optional lab, you also get to see the code implementation of the decision boundary. In the example in the lab, there will be two features, so you can see the decision boundary as a line. With this visualization, I hope that you now have a sense of the range of possible models you can get with logistic regression. Now that you've seen what f of x can potentially compute, let's take a look at how you can actually train a logistic regression model. We'll start by looking at the cost function for logistic regression and, after that, figure out how to apply gradient descent to it. Let's go on to the next video.

2.Cost Function

2.1 Cost Function for Logistic Regression

->Remember that the cost function gives you a way to measure how well a specific set of parameters fits the training data, and thereby gives you a way to try to choose better parameters. In this video, we'll look at why the squared error cost function is not an ideal cost function for logistic regression, and we'll take a look at a different cost function that can help us choose better parameters for logistic regression. Here's what the training set for our logistic regression model might look like, where each row might correspond to a patient that paid a visit to the doctor and ended up with some diagnosis. As before, we'll use m to denote the number of training examples. Each training example has one or more features, such as the tumor size, the patient's age, and so on, for a total of n features. Let's call the features x_1 through x_n. Since this is a binary classification task, the target label y takes on only two values, either 0 or 1. Finally, the logistic regression model is defined by this equation. The question you want to answer is: given this training set, how can you choose parameters w and b?
->Recall that for linear regression, this is the squared error cost function. The only thing I've changed is that I put the one half inside the summation instead of outside the summation. You might remember that in the case of linear regression, where f of x is the linear function w dot x plus b, the cost function looks like this: it is a convex function, with a bowl or hammock shape. Gradient descent will look like this, where you take one step, one step, and so on, to converge at the global minimum. Now, you could try to use the same cost function for logistic regression. But it turns out that if I were to write f of x equals 1 over 1 plus e to the negative of wx plus b, and plot the cost function using this value of f of x, then the cost will look like this. This is what's called a non-convex cost function; it is not convex. What this means is that if you were to try to use gradient descent, there are lots of local minima that you can get stuck in. It turns out that for logistic regression, this squared error cost function is not a good choice.
->Instead, there will be a different cost function that can make the cost function convex again, so that gradient descent can be guaranteed to converge to the global minimum. Putting the one half inside the summation will make the math you see later on this slide a little bit simpler. In order to build a new cost function, one that we'll use for logistic regression, I'm going to change a little bit the definition of the cost function J of w and b. In particular, if you look inside this summation, let's call the term inside the loss on a single training example. I'm going to denote the loss by this capital L, as a function of the prediction of the learning algorithm, f of x, as well as of the true label y. The loss, given the prediction f of x and the true label y, is equal in this case to one half of the squared difference. We'll see shortly that by choosing a different form for this loss function, we'll be able to keep the overall cost function, which is 1 over m times the sum of these loss functions, a convex function.
->Now, the loss function takes as input f of x and the true label y, and tells us how well we're doing on that example. I'm going to just write down here the definition of the loss function we'll use for logistic regression. If the label y is equal to 1, then the loss is negative log of f of x, and if the label y is equal to 0, then the loss is negative log of 1 minus f of x.
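【Code sketch — Logistic loss on one example】
A minimal sketch of the piecewise loss definition just given; the function name logistic_loss and the example probabilities are my own choices.

```python
import numpy as np

def logistic_loss(f, y):
    """Loss on a single training example:
       -log(f)      if y == 1
       -log(1 - f)  if y == 0
    where f is the model's output, a probability between 0 and 1."""
    if y == 1:
        return -np.log(f)
    return -np.log(1.0 - f)

# Intuition check for y = 1: confident and correct -> tiny loss, confident and wrong -> big loss
print(logistic_loss(0.99, 1))   # ~0.01
print(logistic_loss(0.5,  1))   # ~0.69
print(logistic_loss(0.1,  1))   # ~2.30 (predicted only a 10% chance, but the tumor was malignant)
```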
->Let's take a look at why this loss function hopefully makes sense. Let's first consider the case of y equals 1 and plot what this function looks like, to gain some intuition about what this loss function is doing. Remember, the loss function measures how well you're doing on one training example, and it is by summing up the losses on all of the training examples that you then get the cost function, which measures how well you're doing on the entire training set. If you plot log of f, it looks like this curve here, where f is on the horizontal axis. A plot of the negative of the log of f looks like this, where we just flip the curve about the horizontal axis. Notice that it intersects the horizontal axis at f equals 1 and continues downward from there. Now, f is the output of logistic regression, so f is always between zero and one, because the output of logistic regression is always between zero and one. The only part of the function that's relevant is therefore this part over here, corresponding to f between 0 and 1. Let's zoom in and take a closer look at this part of the graph. If the algorithm predicts a probability close to 1 and the true label is 1, then the loss is very small; it's pretty much 0, because you're very close to the right answer. Now, continuing with the example of the true label y being 1, say the tumor is malignant: if the algorithm predicts 0.5, then the loss is at this point here, which is a bit higher but not that high. Whereas in contrast, if the algorithm were to output 0.1, if it thinks there is only a 10 percent chance of the tumor being malignant but y really is 1, it really is malignant, then the loss is this much higher value over here. So when y is equal to 1, the loss function incentivizes, or nudges, or helps push the algorithm to make more accurate predictions, because the loss is lowest when it predicts values close to 1. On this slide, we've been looking at what the loss is when y is equal to 1.
->On this slide, let's look at the second part of the loss function corresponding to when y is equal to 0. In this case, the loss is negative log of 1 minus f of x. When this function is plotted, it actually looks like this. The range of f is limited to 0 to 1 because logistic regression only outputs values between 0 and 1. If we zoom in, this is what it looks like. In this plot, corresponding to y equals 0, the vertical axis shows the value of the loss for different values of f of x. When f is 0 or very close to 0, the loss is also going to be very small which means that if the true label is 0 and the model's prediction is very close to 0, well, you nearly got it right so the loss is appropriately very close to 0. The larger the value of f of x gets, the bigger the loss because the prediction is further from the true label 0. In fact, as that prediction approaches 1, the loss actually approaches infinity.
->Going back to the tumor prediction example, this says that if the model predicts that the patient's tumor is almost certain to be malignant, say, a 99.9 percent chance of malignancy, and it turns out to actually not be malignant, so y equals 0, then we penalize the model with a very high loss. In this case of y equals 0, just as in the case of y equals 1 on the previous slide, the further the prediction f of x is from the true value of y, the higher the loss. And for y equals 1, if f of x approaches 0, the loss grows really large and in fact approaches infinity; when the true label is 1, the algorithm is strongly incentivized not to predict something too close to 0.
->In this video, you saw why the squared error cost function doesn't work well for logistic regression. We also defined the loss for a single training example and came up with a new definition for the loss function for logistic regression.
->It turns out that with this choice of loss function, the overall cost function will be convex, and thus you can reliably use gradient descent to take you to the global minimum. Proving that this function is convex is beyond the scope of this course. You may remember that the cost function is a function of the entire training set and is, therefore, the average, or 1 over m times the sum, of the loss function on the individual training examples. The cost for a certain set of parameters, w and b, is equal to 1 over m times the sum over all the training examples of the loss on those training examples. If you can find the values of the parameters w and b that minimize this, then you'd have a pretty good set of values for the parameters w and b for logistic regression.
->In the upcoming optional lab, you'll get to take a look at how the squared error cost function doesn't work very well for classification, because you'll see that the surface plot results in a very wiggly cost surface with many local minima. Then you'll take a look at the new logistic loss function. As you can see here, this produces a nice and smooth convex surface plot that does not have all those local minima. Please take a look at the code and the plots after this video. We've seen a lot in this video. In the next video, let's go back and take the loss function for a single training example and use that to define the overall cost function for the entire training set. We'll also figure out a simpler way to write out the cost function, which will then later allow us to run gradient descent to find good parameters for logistic regression. Let's go on to the next video.

2.2 Simplified Cost Function 

->In the last video you saw the loss function and the cost function for logistic regression. In this video you'll see a slightly simpler way to write out the loss and cost functions, so that the implementation can be a bit simpler when we get to gradient descent for fitting the parameters of a logistic regression model. Let's take a look. 
->As a reminder, here is the loss function that we had defined in the previous video for logistic regression. Because we're still working on a binary classification problem, y is either zero or one. Because y is either zero or one and cannot take on any value other than zero or one, we'll be able to come up with a simpler way to write this loss function. You can write the loss function as follows. Given a prediction f of x and the target label y, the loss equals negative y times log of f minus 1 minus y times log of 1 minus f. It turns out this equation, which we just wrote in one line, is completely equivalent to this more complex formula up here. Let's see why this is the case. Remember, y can only take on the values of either one or zero. In the first case, let's say y equals 1. This first y over here is one and this 1 minus y is 1 minus 1, which is therefore equal to 0. So the loss becomes negative 1 times log of f of x minus 0 times a bunch of stuff. That becomes zero and goes away. When y is equal to 1, the loss is indeed the first term on top, negative log of f of x. Let's look at the second case, when y is equal to 0. In this case, this y here is equal to 0, so this first term goes away, and the second term is 1 minus 0 times that logarithmic term. The loss becomes this negative 1 times log of 1 minus f of x. That's just equal to this second term up here. In the case of y equals 0, we also get back the original loss function as defined above. What you see is that whether y is one or zero, this single expression here is equivalent to the more complex expression up here, which is why this gives us a simpler way to write the loss with just one equation without separating out these two cases, like we did on top. Using this simplified loss function, let's go back and write out the cost function for logistic regression.
->Here again is the simplified loss function. Recall that the cost J is just the average loss, averaged across the entire training set of m examples, so it's 1 over m times the sum of the loss from i equals 1 to m. If you plug in the definition of the simplified loss from above, then it looks like this: 1 over m times the sum of this term above. If you bring the negative signs out and move them outside, then you end up with this expression over here, and this is the cost function, the cost function that pretty much everyone uses to train logistic regression. You might be wondering, why do we choose this particular function when there could be tons of other cost functions we could have chosen?
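【Code sketch — Logistic regression cost function】
A minimal NumPy sketch of the cost described above, 1/m times the sum of the simplified losses; the helper name compute_cost and the tiny dataset are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b):
    """Logistic regression cost:
       J(w, b) = (1/m) * sum_i [ -y_i * log(f_i) - (1 - y_i) * log(1 - f_i) ]"""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                              # model outputs for all m examples
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)     # the simplified one-line loss
    return loss.sum() / m

# Tiny made-up dataset: 4 examples, 2 features
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
print(compute_cost(X, y, w=np.array([1.0, 1.0]), b=-3.0))
```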
->Although we won't have time to go into great detail on this in this class, I'd just like to mention that this particular cost function is derived from statistics using a statistical principle called maximum likelihood estimation, which is an idea from statistics about how to efficiently find parameters for different models. This cost function has the nice property that it is convex. But don't worry about learning the details of maximum likelihood; it's just a deeper rationale and justification behind this particular cost function. The upcoming optional lab will show you how the logistic cost function is implemented in code. I recommend taking a look at it, because you'll implement this later in the practice lab at the end of the week.

【About optional lab】
->This upcoming optional lab also shows you how two different choices of the parameters will lead to different cost calculations. You can see in the plot that the better fitting blue decision boundary has a lower cost relative to the magenta decision boundary. So with the simplified cost function, we're now ready to jump into applying gradient descent to logistic regression. Let's go see that in the next video.

3.Gradient Descent

3.1 Gradient Descent Implementation

->To fit the parameters of a logistic regression model, we're going to try to find the values of the parameters w and b that minimize the cost function J of w and b, and we'll again apply gradient descent to do this. Let's take a look at how. In this video we'll focus on how to find a good choice of the parameters w and b. After you've done so, if you give the model a new input x, say a new patient at the hospital with a certain tumor size and age, the model can then make a prediction, or it can try to estimate the probability that the label y is one.
->The algorithm you can use to minimize the cost function is gradient descent. Here again is the cost function. If you want to minimize the cost J as a function of w and b, well, here's the usual gradient descent algorithm, where you repeatedly update each parameter as the old value minus alpha, the learning rate, times this derivative term.
->Let's take a look at the derivative of J with respect to w_j, this term up on top here, where as usual j goes from one through n, with n the number of features. If you apply the rules of calculus, you can show that the derivative of the cost function capital J with respect to w_j is equal to this expression over here: 1 over m times the sum from i equals 1 through m of this error term, that is, f of x superscript i minus the label y superscript i, times x subscript j superscript i. Here, x superscript i subscript j is the j-th feature of training example i. Now let's also look at the derivative of J with respect to the parameter b. It turns out to be this expression over here; it's quite similar to the expression above, except that it is not multiplied by the x superscript i subscript j at the end. Just as a reminder, similar to what you saw for linear regression, the way to carry out these updates is to use simultaneous updates, meaning that you first compute the right-hand side for all of these updates and then simultaneously overwrite all the values on the left at the same time. Let me take these derivative expressions here and plug them into these terms here. This gives you gradient descent for logistic regression.
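【Code sketch — Gradient descent for logistic regression】
A minimal NumPy sketch of the update rule just described, computing the gradients 1/m times the sum of (f - y) times x_j, and 1/m times the sum of (f - y), in vectorized form; the function names, learning rate, iteration count, and tiny dataset are my own choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradients(X, y, w, b):
    """Gradients of the logistic cost:
       dJ/dw_j = (1/m) * sum_i (f_i - y_i) * x_ij
       dJ/db   = (1/m) * sum_i (f_i - y_i)"""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y   # f(x_i) - y_i for every example
    dj_dw = X.T @ err / m          # one entry per feature j
    dj_db = err.sum() / m
    return dj_dw, dj_db

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Repeat the simultaneous updates w := w - alpha * dJ/dw and b := b - alpha * dJ/db."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        dj_dw, dj_db = compute_gradients(X, y, w, b)
        w = w - alpha * dj_dw      # both parameters are updated using gradients computed
        b = b - alpha * dj_db      # at the same (old) values of w and b
    return w, b

# Tiny made-up dataset: 6 examples, 2 features
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(gradient_descent(X, y))
```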
->Now, one funny thing you might be wondering is: that's weird, these two equations look exactly like the equations we had come up with previously for linear regression, so is linear regression actually secretly the same as logistic regression? Well, even though these equations look the same, the reason that this is not linear regression is that the definition of the function f of x has changed.
->In linear regression, f of x is wx plus b, but in logistic regression, f of x is defined to be the sigmoid function applied to wx plus b. Although the update rules look the same for both linear regression and logistic regression, they're actually two very different algorithms, because the definition of f of x is not the same. When we talked about gradient descent for linear regression previously, you saw how you can monitor gradient descent to make sure it converges; you can just apply the same method for logistic regression to make sure it also converges. I've written out these updates as if you're updating the parameters w_j one parameter at a time. Similar to the discussion on vectorized implementations of linear regression, you can also use vectorization to make gradient descent run faster for logistic regression.
->I won't dive into the details of the vectorized implementation in this video, but you can learn more about it and see the code in the optional labs. Now you know how to implement gradient descent for logistic regression. You might also remember feature scaling from when we were using linear regression, where you saw how feature scaling, that is, scaling all the features to take on similar ranges of values, say between negative 1 and plus 1, can help gradient descent converge faster. Feature scaling, applied in the same way to scale the different features to take on similar ranges of values, can also speed up gradient descent for logistic regression.
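【Code sketch — Feature scaling】
A minimal sketch of one common scaling choice, z-score normalization (the course labs may use a different variant); the function name zscore_normalize and the example data are made up.

```python
import numpy as np

def zscore_normalize(X):
    """Scale each feature to a similar range by subtracting its mean
    and dividing by its standard deviation (z-score normalization)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

X = np.array([[100.0, 0.1], [200.0, 0.3], [300.0, 0.2]])   # features on very different scales
X_norm, mu, sigma = zscore_normalize(X)
print(X_norm)   # every column now has mean 0 and a comparable spread
```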

【About optional lab】
->In the upcoming optional lab, you'll also see how the gradient for logistic regression can be calculated in code. This will be useful to look at because you'll also implement this in the practice lab at the end of this week. After you run gradient descent in this lab, there'll be a nice set of animated plots that show gradient descent in action. You'll see how the sigmoid function, the contour plot of the cost, the 3D surface plot of the cost, and the learning curve all evolve as gradient descent runs. There will be another optional lab after that, which is short and sweet but also very useful, because it shows you how to use the popular scikit-learn library to train a logistic regression model for classification. Many machine learning practitioners in many companies today use scikit-learn regularly as part of their job. I hope you check out the scikit-learn functions as well and take a look at how they are used. That's it. You should now know how to implement logistic regression. This is a very powerful and very widely used learning algorithm, and you now know how to get it to work yourself. Congratulations.
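【Code sketch — scikit-learn】
A minimal example of training logistic regression with scikit-learn's LogisticRegression class, as mentioned above; the tiny dataset is made up, and note that scikit-learn applies L2 regularization by default.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: 6 examples, 2 features, binary labels
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()   # note: regularized by default
model.fit(X, y)                # learns w and b for you

print(model.predict(X))              # hard 0/1 predictions
print(model.predict_proba(X)[:, 1])  # estimated probability that y = 1
print(model.score(X, y))             # accuracy on the training set
```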
