Abstract: This article is the transcript of lesson 101, "Optimization Objective," from Chapter 13, "Support Vector Machines," of Andrew Ng's (吴恩达) Machine Learning course. I wrote it down while studying the videos and lightly edited it for conciseness and readability, so that it can be reviewed later. I am sharing it here; if you find any errors, corrections are welcome and sincerely appreciated. I hope it is helpful for your studies.
————————————————
By now, you've seen the range of different learning algorithms. Within supervised learning, the performance of many supervised learning algorithms will be pretty similar, and what matters less will often be whether you use learning algorithm A or learning algorithm B. What matters more will often be things like the amount of data you train these algorithms on, as well as your skill in applying these algorithms: things like your choice of the features that you design to give the learning algorithms, and how you choose the regularization parameter, and so on. But there's one more algorithm that is very powerful and is very widely used both within industry and in academia, and that's called the Support Vector Machine. Compared to both logistic regression and neural networks, the Support Vector Machine, or SVM, sometimes gives a cleaner and sometimes more powerful way of learning complex nonlinear functions. And so I'd like to take the next videos to talk about that. Later in this course, I will do a quick survey of the range of different supervised learning algorithms, just to very briefly describe them. But the support vector machine, given its popularity and how powerful it is, will be the last of the supervised learning algorithms that I'll spend a significant amount of time on in this course. As with our development of other learning algorithms, we're going to start by talking about the optimization objective, so let's get started on this algorithm.
In order to describe the support vector machine, I'm actually going to start with logistic regression and show how we can modify it a bit and get what is essentially the support vector machine. So, in logistic regression, we have our familiar form of the hypothesis, $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$, with the sigmoid activation function shown on the right. In order to explain some of the math, I'm going to use $z$ to denote $\theta^T x$. Now let's think about what we would like logistic regression to do. If we have an example with $y = 1$ (and by this I mean an example in either the training set, the test set, or the cross validation set), then we are sort of hoping that $h_\theta(x) \approx 1$. So we are hoping to correctly classify that example. And having $h_\theta(x) \approx 1$ means that $\theta^T x \gg 0$. The $\gg$ sign means much, much greater than 0. Because of the shape of the sigmoid, when $z = \theta^T x$ is much bigger than 0, that is, far to the right of the figure, the output of logistic regression becomes close to 1. Conversely, if we have an example where $y = 0$, then what we're hoping for is that the hypothesis will output a value close to 0 ($h_\theta(x) \approx 0$), and that corresponds to $\theta^T x \ll 0$, that is, much less than 0.
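To make this concrete, here is a minimal Python sketch (my own illustration, not part of the original lecture; the function names are assumptions) of the logistic regression hypothesis and how its output relates to the sign and magnitude of $z = \theta^T x$:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    """Logistic regression hypothesis h_theta(x) = g(theta^T x)."""
    z = np.dot(theta, x)          # z = theta^T x
    return sigmoid(z)

# When z = theta^T x is much greater than 0, the output is close to 1;
# when z is much less than 0, the output is close to 0.
print(sigmoid(10.0))   # ~0.99995  (z >> 0: h ~ 1, what we want when y = 1)
print(sigmoid(-10.0))  # ~0.00005  (z << 0: h ~ 0, what we want when y = 0)
```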
If you look at the cost function of logistic regression, what you find is that each example $(x, y)$ contributes a term like $-\left(y \log h_\theta(x) + (1-y)\log\left(1 - h_\theta(x)\right)\right)$ to the overall cost function. For the overall cost function, we also have a sum over all the training examples and a $\frac{1}{m}$ term, but this expression is the term that a single training example contributes to the overall objective function for logistic regression. Now, if I take the full definition of my hypothesis and plug it in, then what I get is that each training example contributes the term $-y \log \frac{1}{1 + e^{-\theta^T x}} - (1-y)\log\left(1 - \frac{1}{1 + e^{-\theta^T x}}\right)$ (ignoring the $\frac{1}{m}$) to my overall cost function for logistic regression. Now, let's consider the two cases $y = 1$ and $y = 0$. In the first case, let's suppose that $y = 1$. In that case, only the first term in the objective matters, so what we get is $-\log \frac{1}{1 + e^{-z}}$, where $z = \theta^T x$. And if we plot this function as a function of $z$, what you find is that you get the curve shown on the lower left of this slide. Thus, we also see that when $z$ is large, that is, when $\theta^T x$ is large, that corresponds to a value of $z$ that gives a very small value, a very small contribution to the cost function. And this kind of explains why, when logistic regression sees a positive example with $y = 1$, it tries to set $\theta^T x$ to be very large, because this corresponds to the term $-\log \frac{1}{1 + e^{-z}}$ in the cost function being small.
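As a small sketch of these two cases (my own illustration, not from the lecture), here is the per-example logistic regression cost written as a function of $z = \theta^T x$ and the label $y$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(z, y):
    """Cost contributed by one example with z = theta^T x and label y."""
    h = sigmoid(z)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# For y = 1, the cost -log(1/(1+e^-z)) shrinks toward 0 as z grows large.
for z in (-2.0, 0.0, 2.0, 5.0):
    print("y=1, z=%5.1f, cost=%.4f" % (z, logistic_cost(z, y=1)))

# For y = 0, the cost -log(1 - 1/(1+e^-z)) shrinks toward 0 as z becomes very negative.
for z in (2.0, 0.0, -2.0, -5.0):
    print("y=0, z=%5.1f, cost=%.4f" % (z, logistic_cost(z, y=0)))
```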
Now, to build the Support Vector Machine, here's what we're going to do. We're going to take this cost function and modify it a little bit. Let me take the point $z = 1$ over here and let me draw the cost function that I'm going to use. The new cost function is going to be flat from here on out to the right, and then I'm going to draw something that grows as a straight line, similar to logistic regression, but it is going to be a straight line in this portion. So the curve that I just drew in magenta is a pretty close approximation to the cost function used by logistic regression, except that it is now made out of two line segments: a flat portion on the right and a straight-line portion on the left. Don't worry about the exact slope of the straight-line portion. You can imagine it should do something pretty similar to logistic regression, but it turns out that this will give the support vector machine a computational advantage and give us an easier optimization problem. The other case is $y = 0$. In that case, if you look at the cost, only the second term applies, so the cost of the example, or the contribution to the cost function, is going to be given by $-\log\left(1 - \frac{1}{1 + e^{-z}}\right)$. If you plot that as a function of $z$, you end up with this graph. And for the Support Vector Machine, once again, we're going to replace this blue line with something similar. In particular, we can replace it with a new cost in magenta, flat for $z \le -1$ and a straight line to the right of that. So, let me give these two functions names. The function on the left, I'm going to call $\text{cost}_1(z)$, and the function on the right I'm going to call $\text{cost}_0(z)$. The subscript just refers to the cost corresponding to $y = 1$ or $y = 0$. Armed with these definitions, we're now ready to build the Support Vector Machine.
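The lecture only sketches $\text{cost}_1$ and $\text{cost}_0$ graphically and explicitly says the exact slope of the straight-line portion doesn't matter. A common hinge-style choice that matches the description (zero beyond $z = 1$, respectively $z = -1$, and a straight line elsewhere) is sketched below; the slope of 1 is my assumption, not something fixed by the lecture:

```python
import numpy as np

def cost1(z):
    """SVM cost for a y = 1 example: flat (zero) for z >= 1, straight line for z < 1."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """SVM cost for a y = 0 example: flat (zero) for z <= -1, straight line for z > -1."""
    return np.maximum(0.0, 1.0 + z)

# The flat portions mean the SVM pays no cost once theta^T x >= 1 (for y = 1)
# or theta^T x <= -1 (for y = 0).
print(cost1(2.0), cost1(0.5))    # 0.0 0.5
print(cost0(-2.0), cost0(0.5))   # 0.0 1.5
```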
Here's the cost function that we have for logistic regression: $\min_\theta \frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\left(-\log h_\theta(x^{(i)})\right) + (1-y^{(i)})\left(-\log\left(1 - h_\theta(x^{(i)})\right)\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$. In case this equation looks a bit unfamiliar, it's because previously we had a minus sign outside, but here what I did was move the minus signs inside the expression. For the Support Vector Machine, what we're going to do is essentially take the first term, $-\log h_\theta(x^{(i)})$, and replace it with $\text{cost}_1(\theta^T x^{(i)})$, and take the second term, $-\log\left(1 - h_\theta(x^{(i)})\right)$, and replace it with $\text{cost}_0(\theta^T x^{(i)})$. So what we have for the Support Vector Machine is a minimization problem of the resulting equation.

Now, by convention, for the Support Vector Machine we actually write things slightly differently. First, we're going to get rid of the $\frac{1}{m}$ term. This is just a slightly different convention that people use for the Support Vector Machine compared to logistic regression. Here's what I mean: I'm going to get rid of the $\frac{1}{m}$ terms, and this should give me the same optimal value of $\theta$, because $\frac{1}{m}$ is just a constant. So, whether I solve this minimization problem with $\frac{1}{m}$ in front or not, I should end up with the same optimal value of $\theta$. To see why, suppose I had a minimization problem: minimize over a real number $u$ the function $(u-5)^2 + 1$. The minimum of this is $u = 5$. Now if I take this objective function and multiply it by 10, that is, minimize $10\left[(u-5)^2 + 1\right]$, the value of $u$ that minimizes this is still $u = 5$. So, multiplying something you minimize over by some constant doesn't change the value of $u$ that minimizes the function. In the same way, what I've done by crossing out the $\frac{1}{m}$ is multiplying my objective function by some constant, and it doesn't change the value of $\theta$ that achieves the minimum.

The second bit of notational change is, again, just the most standard convention when using SVMs. For logistic regression, we had two terms in our objective function. The first is the term that is the cost that comes from the training set, and the second is the regularization term. We controlled the trade-off between these by saying we want to minimize $A + \lambda B$, where I'm using $A$ to denote the first term and $B$ to denote the second term. By parameterizing this as $A + \lambda B$ and setting different values for the regularization parameter $\lambda$, we could trade off the relative weight between how much we want to fit the training set, that is, minimizing $A$, versus how much we care about keeping the values of the parameters small, that is, minimizing $B$. For the SVM, just by convention, we're going to use a different parameter. So, instead of using $\lambda$ here to control the relative weighting between the first and second terms, we're going to use a different parameter, which by convention is called $C$, and we're going to minimize $C A + B$. So for logistic regression, if we set a very large value of $\lambda$, that means giving $B$ a very high weight. Here, if we set $C$ to a very small value, that corresponds to giving $B$ a much larger weight than $A$. So, this is just a different way of controlling the trade-off between how much we care about optimizing the first term versus how much we care about optimizing the second term. And you can think of the parameter $C$ as playing a role similar to $\frac{1}{\lambda}$. It's not that these two expressions are equal when $C = \frac{1}{\lambda}$; that's not the case. Rather, if $C = \frac{1}{\lambda}$, then these two optimization objectives should give you the same optimal value of $\theta$. So, just filling that in, I'm going to cross out $\lambda$ here and write in the constant $C$ there. That gives us our overall optimization objective function for the SVM: $\min_\theta C\sum_{i=1}^m \left[y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^n \theta_j^2$. And when you minimize that function, what you have are the parameters learned by the SVM.
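Putting the pieces together, here is a sketch of the SVM objective described above (my own illustration, not the course's code, and it reuses the assumed hinge-style cost1/cost0 from before). Following the regularization sum over $j = 1, \dots, n$, the intercept $\theta_0$ is not regularized:

```python
import numpy as np

def cost1(z):
    return np.maximum(0.0, 1.0 - z)   # assumed hinge-style cost for y = 1

def cost0(z):
    return np.maximum(0.0, 1.0 + z)   # assumed hinge-style cost for y = 0

def svm_objective(theta, X, y, C):
    """SVM optimization objective:
       C * sum_i [ y_i*cost1(theta^T x_i) + (1-y_i)*cost0(theta^T x_i) ]
         + (1/2) * sum_{j>=1} theta_j^2
    X has shape (m, n+1) with a leading column of ones; theta_0 is not regularized."""
    z = X @ theta                                  # theta^T x for every example
    data_term = np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)        # skip the intercept theta_0
    return C * data_term + reg_term

# Tiny usage example with made-up numbers:
X = np.array([[1.0, 2.0], [1.0, -1.5], [1.0, 0.5]])   # first column is the intercept
y = np.array([1, 0, 1])
theta = np.array([0.0, 1.0])
print(svm_objective(theta, X, y, C=1.0))
```

Consistent with the discussion above, making C large behaves like making $\lambda$ small: the data term dominates and fitting the training set is weighted more heavily than keeping the parameters small.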
Finally, unlike logistic regression, the SVM doesn't output a probability. Instead, we have this cost function, which we minimize to get the parameters $\theta$, and what the SVM does is make a prediction of $y = 1$ or $y = 0$ directly. So the hypothesis will predict 1 if $\theta^T x \ge 0$, and predict 0 otherwise. So, having learned the parameters $\theta$, this is the form of the hypothesis for the SVM.
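For completeness, a minimal sketch of this hypothesis (my own illustration): predict 1 when $\theta^T x \ge 0$, and 0 otherwise, with no probability attached.

```python
import numpy as np

def svm_predict(theta, X):
    """SVM hypothesis: predict 1 if theta^T x >= 0, else 0 (no probability output)."""
    return (X @ theta >= 0).astype(int)

X = np.array([[1.0, 2.0], [1.0, -1.5]])   # first column is the intercept term
theta = np.array([0.0, 1.0])
print(svm_predict(theta, X))               # [1 0]
```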
So, that was a mathematical definition of what an SVM does. In the next few videos, let's try to build intuition about what this optimization objective leads to and what sorts of hypotheses an SVM will learn, and also talk about how to modify this just a little bit to learn complex, nonlinear functions.