Support Vector Machine - The mathematics behind large margin classification

Abstract: This article is the transcript of Lecture 103, "The mathematics behind large margin classification", from Chapter 13 "Support Vector Machines" of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it for conciseness and readability, so that it is easy to consult later. I am sharing it here in the hope that it helps others with their study; corrections are welcome and sincerely appreciated.
————————————————

In this video, I'd like to tell you a bit about the math behind large margin classification. This video is optional, so please feel free to skip it. But it may also give you better intuition about how the optimization problem of the SVM leads to a large margin classifier.

In order to get started, let me first remind you of a couple of properties of vector inner products. Let's say I have two vectors {\color{Blue} u} and {\color{Blue} v}, both two-dimensional. Let's see what {\color{Blue} u^{T}v} looks like. {\color{Blue} u^{T}v} is also called the inner product between the vectors {\color{Blue} u} and {\color{Blue} v}. {\color{Blue} u} is a two-dimensional vector, so I can plot it on this figure. So, let's say that is the vector {\color{Blue} u}. And what I mean by that is, on the horizontal axis, that value is whatever {\color{Blue} u_{1}} is; and on the vertical axis, the height is whatever {\color{Blue} u_{2}} is. Now, one quantity that will be nice to have is the norm of the vector {\color{Blue} u}. Those double bars on the left and right denote the norm, or length, of {\color{Blue} u}. So this just means the Euclidean length of the vector {\color{Blue} u}, and by the Pythagorean theorem it is equal to {\color{Blue} \sqrt{u_{1}^{2}+u_{2}^{2}}}. This is the length of the vector {\color{Blue} u}, and it is a real number. Now, let's go back and look at the vector {\color{Blue} v}, because we want to compute the inner product. So {\color{Blue} v} will be some other vector with some other values {\color{Blue} \begin{bmatrix} v_{1}\\ v_{2} \end{bmatrix}}, and the vector {\color{Blue} v} will look like that. Now, let's look at how to compute the inner product between {\color{Blue} u} and {\color{Blue} v}. Let me take the vector {\color{Blue} v} and project it down onto the vector {\color{Blue} u}. So, I'm going to take an orthogonal projection, a 90-degree projection, and project it down onto {\color{Blue} u} like so. What I'm going to do is measure the length of this red line I just drew here. I'm going to call the length of that red line {\color{Blue} p}. So {\color{Blue} p} is the length, or the magnitude, of the projection of the vector {\color{Blue} v} onto the vector {\color{Blue} u}. And it's possible to show that the inner product {\color{Blue} u^{T}v} is equal to {\color{Blue} p\cdot \left \| u \right \|}. So, this is one way to compute the inner product. And if you actually do the geometry and figure out what {\color{Blue} p} is and what the norm of {\color{Blue} u} is, this should give you the same answer as the other way of computing the inner product, which is {\color{Blue} u^{T}v=u_{1}v_{1}+u_{2}v_{2}}. It is a theorem of linear algebra that these two formulas give you the same answer. And by the way, {\color{Blue} u^{T}v} is also equal to {\color{Blue} v^{T}u}. So, if you were to do the same process in reverse, instead of projecting {\color{Blue} v} onto {\color{Blue} u}, you could project {\color{Blue} u} onto {\color{Blue} v}, do the same process with the roles of {\color{Blue} u} and {\color{Blue} v} reversed, and you should get the same number, whatever that number is. And just to clarify what's going on in this equation: the norm of {\color{Blue} u} is a real number, and {\color{Blue} p} is also a real number, so {\color{Blue} u^{T}v} is the regular multiplication of two real numbers, {\color{Blue} p} and the norm of {\color{Blue} u}. Just one last detail: {\color{Blue} p} is actually signed, and it can be either positive or negative.
So, let me say what I mean by that. Suppose {\color{Blue} u} is a vector like this, and {\color{Blue} v} is a vector that looks like this, so the angle between {\color{Blue} u} and {\color{Blue} v} is greater than 90^{\circ}. Then if I project {\color{Blue} v} onto {\color{Blue} u}, what I get is a projection that looks like this, and you have that length {\color{Blue} p}. In this case, I will still have that {\color{Blue} u^{T}v=p\left \| u \right \|}, except that in this example {\color{Blue} p} will be negative. So, for the inner product, if the angle between {\color{Blue} u} and {\color{Blue} v} is less than 90^{\circ}, then {\color{Blue} p} is the positive length of that red line; whereas if the angle is greater than 90^{\circ}, then {\color{Blue} p} will be the negative of the length of that little line. So the inner product between two vectors can also be negative. That's how vector inner products work. We're going to use these properties of the vector inner product to try to understand the SVM optimization objective over there.
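These two ways of computing the inner product, and the sign of the projection {\color{Blue} p}, are easy to check numerically. Below is a minimal NumPy sketch (the vectors are made-up illustrative values, not from the lecture) that computes {\color{Blue} u^{T}v} once as {\color{Blue} u_{1}v_{1}+u_{2}v_{2}} and once as {\color{Blue} p\left \| u \right \|}, obtaining {\color{Blue} p} geometrically from the angle between the two vectors.

```python
import numpy as np

# Two made-up 2-dimensional vectors (illustrative values only)
u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

# Way 1: the component-wise formula u1*v1 + u2*v2
inner_direct = u[0] * v[0] + u[1] * v[1]

# Way 2: p * ||u||, where p is the signed length of the projection of v onto u.
# Here p is obtained geometrically, from the angle between the two vectors.
norm_u = np.linalg.norm(u)                     # ||u|| = sqrt(u1^2 + u2^2)
angle = np.arctan2(v[1], v[0]) - np.arctan2(u[1], u[0])
p = np.linalg.norm(v) * np.cos(angle)          # signed projection of v onto u
inner_via_projection = p * norm_u

print(inner_direct, inner_via_projection)      # both are 10.0 (up to rounding)

# If the angle between the vectors exceeds 90 degrees, p (and hence the
# inner product) comes out negative:
w = np.array([-3.0, 1.0])
angle_w = np.arctan2(w[1], w[0]) - np.arctan2(u[1], u[0])
p_w = np.linalg.norm(w) * np.cos(angle_w)
print(p_w, p_w * norm_u, u @ w)                # about -2.236, -10.0, -10.0
```

The last example, where the angle exceeds 90^{\circ}, is the signed case described above: {\color{Blue} p} is negative, and so is the inner product.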

Here's the optimization objective for the SVM. Just for the purpose of this slide, I am going to make one simplification, just to make the objective easier to analyze. What I'm going to do is ignore the intercept term and set {\color{Blue} \theta _{0}=0}. To make things easier to plot, I'm also going to set {\color{Blue} n}, the number of features, to be equal to 2, so we have only 2 features, {\color{Blue} x_{1}} and {\color{Blue} x_{2}}. Now, let's look at the objective function, the optimization objective of the SVM. This can be written as {\color{Blue} \frac{1}{2}(\theta _{1}^{2}+\theta _{2}^{2})}. What I'm going to do is rewrite this a bit, as {\color{Blue} \frac{1}{2}\left ( \sqrt{\theta _{1}^{2}+\theta _{2}^{2}} \right )^{2}}. And the reason I can do that is that for any non-negative number {\color{Blue} w}, {\color{Blue} w=\left ( \sqrt{w} \right )^{2}}. Now what you may notice is that the term inside the parentheses is equal to the norm, or the length, of the vector {\color{Blue} \theta }. And what I mean by that is that if we write out the vector {\color{Blue} \theta =\begin{bmatrix} \theta _{1}\\ \theta _{2} \end{bmatrix}}, then the term that I've just underlined in red is exactly the length, or the norm, of the vector {\color{Blue} \theta }. In fact, this is equal to the length of the vector {\color{Blue} \theta } whether you write it as {\color{Blue} \begin{bmatrix} \theta _{0}\\ \theta _{1}\\ \theta _{2} \end{bmatrix}} with {\color{Blue} \theta _{0}=0}, or just as {\color{Blue} \begin{bmatrix} \theta _{1}\\ \theta _{2} \end{bmatrix}}; but for this slide, I'm going to ignore {\color{Blue} \theta _{0}}. So finally, this means that my optimization objective is {\color{Blue} \frac{1}{2}\left \| \theta \right \|^{2}}. So all the SVM is doing in the optimization objective is minimizing the squared norm, or the squared length, of the parameter vector {\color{Blue} \theta }. Now, what I'd like to do is look at the terms {\color{Blue} \theta ^{T}x} and understand better what they're doing. So, given the parameter vector {\color{Blue} \theta } and given an example {\color{Blue} x}, what is this equal to? On the previous slide, we figured out what {\color{Blue} u^{T}v} looks like for different vectors {\color{Blue} u} and {\color{Blue} v}. So, we're going to take those definitions, with {\color{Blue} \theta } and {\color{Blue} x^{(i)}} playing the roles of {\color{Blue} u} and {\color{Blue} v}, and let's see what that picture looks like. Let's say I look at a single training example. Let's say I have a positive example that I'm drawing right there, and let's say that's my example {\color{Blue} x^{(i)}}. What that really means is that I plotted, on the horizontal axis, some value {\color{Blue} x^{(i)}_{1}}, and on the vertical axis, {\color{Blue} x^{(i)}_{2}}. That's how I plot my training examples. And although we haven't really been thinking of this as a vector, what this really is, is a vector from the origin {\color{Blue} (0,0)} out to the location of the training example. And now, let's say we have a parameter vector {\color{Blue} \theta}, and I'm going to plot that as a vector as well. So what is the inner product {\color{Blue} \theta ^{T}x^{(i)}}? Well, using the earlier method, the way we compute that is we take my example and project it onto my parameter vector {\color{Blue} \theta }, and then look at the length of this segment, which I'm coloring in red.
And I'm going to call that {\color{Red} P^{(i)}}, to denote that this is the projection of the {\color{Blue} i^{th}} training example onto the parameter vector {\color{Blue} \theta }. And so what we have is that {\color{Blue} \theta ^{T}x^{(i)}=}{\color{Red} P^{(i)}}\cdot {\color{Blue} \left \| \theta \right \|=\theta _{1}x^{(i)}_{1}+\theta_{2}x^{(i)}_{2}}. Each of these is an equally valid way of computing the inner product between {\color{Blue} \theta } and {\color{Blue} x^{(i)}}. So, where does this leave us? It means that the constraints {\color{Blue} \theta ^{T}x^{(i)}\geqslant 1} and {\color{Blue} \theta ^{T}x^{(i)}\leqslant -1} can be replaced by the constraints {\color{Red} P^{(i)}}{\color{Blue} \left \| \theta \right \|\geqslant 1} and {\color{Red} P^{(i)}}{\color{Blue} \left \| \theta \right \|\leqslant -1}.
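As a quick numerical sanity check of this rewriting, the sketch below uses made-up values for {\color{Blue} \theta } and one training example {\color{Blue} x^{(i)}} (not values from the lecture), computes {\color{Blue} \theta ^{T}x^{(i)}} both directly and as {\color{Red} P^{(i)}}{\color{Blue} \left \| \theta \right \|}, and then evaluates the margin constraint in both forms.

```python
import numpy as np

# Made-up parameter vector and one made-up positive training example
theta = np.array([2.0, 1.0])          # theta_0 is taken to be 0, as on the slide
x_i = np.array([1.0, 2.0])            # a positive example, y^(i) = 1

norm_theta = np.linalg.norm(theta)    # ||theta||
# P^(i): signed projection of x^(i) onto theta (standard scalar-projection formula)
p_i = (theta @ x_i) / norm_theta

# The two equivalent ways of writing the inner product
direct = theta[0] * x_i[0] + theta[1] * x_i[1]   # theta_1*x_1^(i) + theta_2*x_2^(i)
via_projection = p_i * norm_theta                # P^(i) * ||theta||
print(direct, via_projection)                    # both are 4.0

# For a positive example, theta^T x^(i) >= 1 and P^(i)*||theta|| >= 1
# are the same constraint:
print(direct >= 1, via_projection >= 1)          # True True
```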

So, writing that into our optimization objective, this is what we get. Just to remind you, we worked out earlier that this optimization objective is {\color{Blue} \frac{1}{2}\left \| \theta \right \|^{2}}. So, now let's consider the training examples that we have at the bottom, and for now continue to use the simplification that {\color{Blue} \theta _{0}=0}. Let's see what decision boundary the SVM will choose. Here's one option: let's say the SVM were to choose this decision boundary (green line). This is not a very good choice, because it has a very small margin; this decision boundary comes very close to the training examples. Let's see why the support vector machine will not do this. For this choice of parameters, it's possible to show that the parameter vector {\color{Blue} \theta } (blue line) is actually at 90^{\circ} to the decision boundary. So that green decision boundary corresponds to a parameter vector {\color{Blue} \theta } that points in that direction. And by the way, the simplification {\color{Blue} \theta _{0}=0} just means that the decision boundary has to pass through the origin {\color{Blue} (0,0)} over there. Now, let's look at what this implies for the optimization objective. Let's say that this example here is my first example, {\color{Blue} x^{(1)}}. If we look at the projection of this example onto my parameter vector {\color{Blue} \theta }, that little red segment is {\color{Red} P^{(1)}}, and it is going to be pretty small, right? Similarly, if this example here happens to be {\color{Blue} x^{(2)}}, then, if I look at its projection, this little magenta line segment is {\color{Magenta} P^{(2)}}. It will actually be a negative number, because it points in the opposite direction: the vector {\color{Blue} x^{(2)}} makes an angle of more than 90^{\circ} with my parameter vector {\color{Blue} \theta }, so {\color{Magenta} P^{(2)}} is going to be less than 0. So what we're finding is that these terms {\color{Magenta} P^{(i)}} are going to be pretty small numbers. Now look at the optimization objective: for positive examples, we need {\color{Red} P^{(i)}}{\color{Blue} \left \| \theta \right \|\geqslant 1}. But if {\color{Red} P^{(1)}} over here is pretty small, that means that we need {\color{Blue} \left \| \theta \right \|} to be pretty large. Similarly, for our negative example, we need {\color{Red} P^{(i)}}{\color{Blue} \left \| \theta \right \|\leqslant -1}, and we saw that in this example {\color{Magenta} P^{(2)}} is going to be a pretty small negative number, so the only way for that to happen is, again, for the norm of {\color{Blue} \theta } to be large. But what we're doing in the optimization objective is trying to find a setting of parameters where {\color{Blue} \left \| \theta \right \|} is small, so this doesn't seem like a good direction for the parameter vector {\color{Blue} \theta }. In contrast, let's look at a different decision boundary. Here, let's say, the SVM chooses that decision boundary. Now the picture is going to be very different. If that is the decision boundary, here is the corresponding direction for {\color{Blue} \theta }. So, with the decision boundary being that vertical line, it is possible to show, using linear algebra, that the way to get this green decision boundary is to have the vector {\color{Blue} \theta } be at 90^{\circ} to it.
And now, if you look at the projection of your data onto the vector {\color{Blue} \theta }, let's say, as before, that this example is my example {\color{Blue} x^{(1)}}. When I project it onto {\color{Blue} \theta }, what I find is that this is {\color{Red} P^{(1)}}. And for the other example, {\color{Blue} x^{(2)}}, I do the same projection, and this length here (magenta) is {\color{Magenta} P^{(2)}}, which is going to be less than 0. And you notice that now {\color{Red} P^{(1)}} and {\color{Magenta} P^{(2)}}, these lengths of the projections, are going to be much bigger. So, if we still need to enforce the constraint {\color{Red} P^{(1)}}{\color{Blue} \left \| \theta \right \|\geqslant 1}, then because {\color{Red} P^{(1)}} is so much bigger now, {\color{Blue} \left \| \theta \right \|} can be smaller. And so, what this means is that by choosing the decision boundary shown on the right, the SVM can make the norm of the parameter vector {\color{Blue} \theta } much smaller, and therefore make {\color{Blue} \left \| \theta \right \|^{2}} smaller, which is why the SVM would choose this hypothesis on the right instead. And this is how the SVM gives rise to this large margin classification effect. Namely, if you look at this green line, this green hypothesis, we want the projections of the positive and negative examples onto {\color{Blue} \theta } to be large, and the only way for that to hold true is if there is a large margin surrounding the green line, a large margin that separates the positive and negative examples. The magnitude of this gap, the magnitude of this margin, is exactly the values of {\color{Red} P^{(1)}}, {\color{Magenta} P^{(2)}}, {\color{Magenta} P^{(3)}} and so on. And so by making the margin large, the SVM can end up with a smaller value for the norm of {\color{Blue} \theta }, which is what it is trying to do in the objective. And this is why the SVM ends up being a large margin classifier: it is effectively trying to make these {\color{Magenta} P^{(i)}}, the distances from the training examples to the decision boundary, large. Finally, we did this whole derivation using the simplification that {\color{Blue} \theta _{0}=0}. The effect of that, as I mentioned briefly, is that if {\color{Blue} \theta _{0}=0}, we're entertaining only decision boundaries that pass through the origin. If you allow {\color{Blue} \theta _{0}} to be non-zero, then you entertain decision boundaries that don't pass through the origin, like the one I just drew. I'm not going to do the full derivation, but it turns out that the same large margin argument works in pretty much exactly the same way. There's a generalization of the argument we just went through, which I won't go through here, showing that even when {\color{Blue} \theta _{0}} is non-zero, what the SVM is trying to do, when you have this optimization objective, is still to find a large margin separator between the positive and negative examples, which again corresponds to the case when C is very large. So, that explains how the SVM is a large margin classifier. In the next video, we'll start to talk about how to take some of these SVM ideas and apply them to build complex nonlinear classifiers as well.
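To make this comparison concrete, here is a small sketch with a made-up, linearly separable 2-D data set (with {\color{Blue} \theta _{0}=0}, so every candidate boundary passes through the origin). For each of two candidate directions of {\color{Blue} \theta }, one corresponding to a small-margin boundary and one to the large-margin boundary, it computes the signed projections {\color{Magenta} P^{(i)}} and then the smallest norm {\color{Blue} \left \| \theta \right \|} that still satisfies {\color{Red} P^{(i)}}{\color{Blue} \left \| \theta \right \|\geqslant 1} for positives and {\color{Red} P^{(i)}}{\color{Blue} \left \| \theta \right \|\leqslant -1} for negatives, namely 1 divided by the smallest |{\color{Red} P^{(i)}}|. The data and directions are invented for illustration.

```python
import numpy as np

# Made-up, linearly separable training set (theta_0 = 0, so the decision
# boundary passes through the origin)
X = np.array([[ 2.0,  2.0],    # positive examples
              [ 3.0,  1.0],
              [-2.0, -2.0],    # negative examples
              [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

def smallest_feasible_norm(direction):
    """Smallest ||theta||, for theta pointing along `direction`, that satisfies
    P^(i)*||theta|| >= 1 for positive examples and <= -1 for negative ones."""
    d = direction / np.linalg.norm(direction)   # unit vector along theta
    p = X @ d                                   # signed projections P^(i)
    if np.any(y * p <= 0):                      # some example on the wrong side
        return np.inf                           # no norm can satisfy the constraints
    return 1.0 / np.min(np.abs(p))

# A direction whose boundary passes close to some examples: projections are small
theta_small_margin = np.array([1.0, -0.5])
# The large-margin direction: projections are large
theta_large_margin = np.array([1.0, 1.0])

print(smallest_feasible_norm(theta_small_margin))   # about 1.12
print(smallest_feasible_norm(theta_large_margin))   # about 0.35
```

The direction with the larger projections admits a much smaller norm of {\color{Blue} \theta }, which is exactly why minimizing {\color{Blue} \frac{1}{2}\left \| \theta \right \|^{2}} under these constraints favors the large-margin boundary.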

