Support Vector Machine - Kernels II

Abstract: This article is the transcript of video 105, "Kernels II", in Chapter 13, "Support Vector Machines", of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it to make it more concise and readable, so that it can be consulted later. I'm sharing it here; if you find any errors, corrections are warmly welcome and sincerely appreciated. I hope it is also helpful for your own studies.
————————————————

In the last video, we started to talk about the kernels idea and how it can be used to define new features for the support vector machine. In this video, I'd like to throw in some of the missing details and also say a few words about how to use these ideas in practice, such as how they pertain to the bias-variance trade-off in support vector machines.

In the last video, I talked about the process of picking a few landmarks l^{(1)}, l^{(2)}, l^{(3)}, and that allowed us to define the similarity function, also called the kernel; in this example, the similarity function is a Gaussian kernel. And that allowed us to build this form of hypothesis function. But where do we get these landmarks from? It also seems that, for complex learning problems, we may want a lot more landmarks than just three that we choose by hand. So in practice, this is how the landmarks are chosen: given the machine learning problem, we have some data set of positive and negative examples. The idea is that we're going to take the examples and, for every training example that we have, put a landmark at exactly the same location as that training example. So if I have one training example x^{(1)}, I'm going to choose my first landmark to be at exactly the same location as my first training example. And if I have a different example x^{(2)}, I'm going to set the second landmark to be at the location of my second training example. On the figure on the right, I use red and blue dots just as an illustration; the color of the dots is not significant. What I'm going to end up with using this method is m landmarks l^{(1)}, l^{(2)} down to l^{(m)} if I have m training examples, with one landmark at the location of each of my training examples. And this is nice because it is saying that my features are basically going to measure how close an example is to one of the things I saw in my training set.
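As a minimal sketch of this landmark-placement step (assuming NumPy and a hypothetical toy training matrix X, neither of which appears in the lecture), choosing the landmarks is just a copy of the training set:

```python
import numpy as np

# Hypothetical toy training set: m = 3 examples, n = 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [0.0, 1.5]])

# Put one landmark at the location of every training example:
# l^(1) = x^(1), l^(2) = x^(2), ..., l^(m) = x^(m).
landmarks = X.copy()   # shape (m, n), one landmark per training example
```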

So, to write this out a little more concretely: given m training examples, I'm going to choose the locations of my landmarks to be exactly the locations of my m training examples. When you're given an example x, which can be something in the training set, in the cross-validation set, or in the test set, we're going to compute these features f_{1}, f_{2}, and so on, where l^{(1)} is actually equal to x^{(1)} and so on. These then give me a feature vector. So let me write f as a feature vector \begin{bmatrix} f_1\\ f_{2}\\ ...\\ f_m\end{bmatrix}. By convention, if we want, we can add an extra feature f_{0}, which is always equal to 1. This plays a role similar to what we had previously for x_{0}, which was our intercept term. For example, if we have a training example (x^{(i)}, y^{(i)}), the features we would compute for this training example will be as follows:

f^{(i)}_{1}=SIM(x^{(i)}, l^{(1)})

f^{(i)}_{2}=SIM(x^{(i)}, l^{(2)})

...

f^{(i)}_{m}=SIM(x^{(i)}, l^{(m)})

And, somewhere in the middle, for the i-th component, I will actually have one feature f^{(i)}_{i}, which is the similarity between x^{(i)} and l^{(i)}: f^{(i)}_{i}=SIM(x^{(i)}, l^{(i)})=SIM(x^{(i)}, x^{(i)})=exp(-\frac{0}{2\sigma ^{2}})=1 if we're using the Gaussian kernel. So one of my features for this training example is going to be equal to 1. And then, similar to what I have above, I can take all of these m features and group them into a feature vector. So instead of representing my example using x^{(i)}, which is an \mathbb{R}^{n} or \mathbb{R}^{n+1} dimensional vector, we can now represent my training example using this feature vector f. I'm going to write this as f^{(i)}=\begin{bmatrix}f^{(i)}_0\\ f^{(i)}_1\\ f^{(i)}_2\\ ...\\ f^{(i)}_m \end{bmatrix}. Note that, as usual, we also add f^{(i)}_{0}=1. And so this vector gives me my new feature vector with which to represent my training example. So given these kernels and similarity functions, here's how we use a support vector machine.
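Here is a small sketch of that feature mapping, assuming NumPy, the Gaussian kernel from the last video, and hypothetical toy data; the names gaussian_kernel and feature_vector are my own, not from the lecture:

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity (Gaussian kernel) between an example x and a landmark l."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

def feature_vector(x, landmarks, sigma):
    """Map x to f = [f_0, f_1, ..., f_m], with f_0 = 1 by convention."""
    sims = [gaussian_kernel(x, l, sigma) for l in landmarks]
    return np.concatenate(([1.0], sims))   # shape (m + 1,)

# Hypothetical data: the landmarks are the training examples themselves.
X = np.array([[1.0, 2.0], [3.0, 0.5], [0.0, 1.5]])
sigma = 1.0

f_1 = feature_vector(X[0], X, sigma)
# The entry for the example's own landmark is exp(0) = 1, as noted above.
print(f_1)   # f_1[1] == 1.0
```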

If you already have learned a set of parameters \theta, then if you're given a value of x and you want to make a prediction, what we do is compute the features f, which is now an \mathbb{R}^{m+1} dimensional vector. We have m here because we have m training examples and thus m landmarks. And what we do is predict 1 if \theta ^{T}f=\theta _{0}f_{0}+\theta _{1}f_{1}+...+\theta _{m}f_{m}\geq 0, where \theta \in \mathbb{R}^{m+1}. So that's how you make a prediction if you already have a setting for the parameters \theta. But how do you get the parameters \theta? Well, you do that using the SVM learning algorithm, and specifically you solve this minimization problem. Only now, instead of making predictions using \theta^{T}x^{(i)}, we're using \theta^{T}f^{(i)} to make predictions. It's by solving this minimization problem that you get the parameters for your support vector machine. One last detail is that for this optimization problem we have n=m features. There's one mathematical detail I should mention, which is that in the way the support vector machine is implemented, this last term is actually done a little bit differently. You don't really need to know about this detail in order to use the SVM, and in fact the equations written down here should give you all the intuitions that you need. But in the way the SVM is implemented, the last term \sum _{j=1} ^{m} {\theta_j}^2=\theta^T\theta if we ignore \theta_{0}. And what most support vector machine implementations do is actually replace this \theta^{T}\theta with \theta ^{T} times some matrix M, which depends on the kernel you use, times \theta, that is \theta ^{T} M \theta. This gives us a slightly different distance metric: instead of minimizing exactly the norm of \theta squared (\left \| \theta \right \|^{2}), we instead minimize something slightly similar to it, a rescaled version of the parameter vector \theta that depends on the kernel. But this is a mathematical detail that allows the SVM software to run more efficiently, and the reason the SVM does this modification is that it allows it to scale to much bigger training sets. For example, if you have a training set with 10,000 training examples, we end up with 10,000 landmarks, and \theta \in \mathbb{R}^{10,000}. Maybe that works, but when m becomes really big, like 50,000 or 100,000, then solving for all of these parameters can become expensive for the SVM optimization software that solves the minimization problem I drew here. So, as a mathematical detail, which again you really don't need to know about, it actually modifies that last term a little bit to optimize something slightly different than just minimizing the norm squared of \theta. But if you want, you can feel free to think of this as an implementation detail that does change the objective a bit, but it's done primarily for reasons of computational efficiency, so usually you don't really have to worry about this. And by the way, in case you're wondering why we don't apply the kernels idea to other algorithms as well, like logistic regression, it turns out that if you want, you can actually apply the kernels idea and define features using landmarks and so on for logistic regression. But the computational tricks that apply for support vector machines don't generalize well to other algorithms like logistic regression. And so, using kernels with logistic regression is going to be very slow. Whereas, because of computational tricks like the one embodied in how it modifies this last term (meaning \theta ^{T} M \theta) and the details of how the support vector machine software is implemented, support vector machines and kernels tend to go particularly well together. Whereas logistic regression and kernels, you can do it, but it would run very slowly, and it won't be able to take advantage of the advanced optimization techniques that people have figured out for the particular case of running a support vector machine with a kernel. But all this pertains only to how you actually implement software to minimize the cost function. I'll say more about that in the next video, but you really don't need to know how to write software to minimize the cost function, because you can find very good off-the-shelf software for doing so. And just as I wouldn't recommend writing code to invert a matrix or to compute a square root, I actually do not recommend writing software to minimize this cost function yourself; instead, use the off-the-shelf software packages that people have developed. Those software packages already embody these numerical tricks, so you don't really have to worry about them.
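Going back to the prediction rule above, here is a minimal sketch of computing f and predicting 1 when \theta^{T}f\geq 0, assuming NumPy, hypothetical toy data, and an already-learned \theta; training \theta and the \theta^{T}M\theta detail are deliberately left to the off-the-shelf packages just mentioned:

```python
import numpy as np

def svm_kernel_predict(x, X_train, theta, sigma):
    """Predict 1 if theta^T f >= 0, else 0, using kernel features f.

    The landmarks are the training examples X_train, so theta has m + 1
    entries (one per landmark plus the intercept term f_0 = 1). This only
    sketches the prediction rule; learning theta is done by the SVM package.
    """
    sims = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    f = np.concatenate(([1.0], sims))      # f in R^(m+1)
    return 1 if theta @ f >= 0 else 0

# Hypothetical training set and (already learned) parameters theta.
X_train = np.array([[1.0, 2.0], [3.0, 0.5], [0.0, 1.5]])
theta = np.array([-0.5, 1.2, -0.3, 0.8])   # shape (m + 1,)
print(svm_kernel_predict(np.array([1.1, 1.9]), X_train, theta, sigma=1.0))
```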

But one other thing that is worth knowing about is, when you're applying a support vector machine, how to choose its parameters. And the last thing I want to do in this video is say a little bit about the bias and variance trade-offs when using an SVM. When using an SVM, one of the things you need to choose is the parameter C in the optimization objective, and you recall that C played a role similar to 1/\lambda, where \lambda is the regularization parameter we had for logistic regression. So, if you have a large value of C, this corresponds to what we had back in logistic regression with a small value of \lambda, meaning not using much regularization. And if you do that, you tend to have a hypothesis with lower bias and higher variance. Whereas if you have a smaller value of C, then this corresponds to using logistic regression with a large value of \lambda, and that corresponds to a hypothesis with higher bias and lower variance. And so, a hypothesis with large C has higher variance and is more prone to overfitting, whereas a hypothesis with small C has higher bias and thus is more prone to underfitting. So this parameter C is one of the parameters we need to choose. The other is the parameter \sigma^{2}, which appears in the Gaussian kernel. If \sigma^{2} is large, then the similarity function exp(-\frac{\left \| x-l^{(i)} \right \|^{2}}{2\sigma ^{2}}) tends to fall off relatively slowly. In this example with a single feature x_{1} and a landmark l, a large \sigma^{2} gives a smoother function that varies more smoothly. This would give me a hypothesis with higher bias and lower variance, because with a Gaussian kernel that falls off smoothly, you tend to get a hypothesis that varies slowly as you change the input x. Whereas in contrast, if \sigma^{2} is small, the Gaussian kernel will vary more abruptly. So if \sigma^{2} is small, then my features vary less smoothly, with higher slopes or higher derivatives. And using this, you end up fitting hypotheses with lower bias, but you can have higher variance. If you look at this week's programming exercise, you actually get to play around with some of these ideas and see these effects yourself.
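To make the \sigma^{2} discussion concrete, here is a tiny sketch (assuming NumPy and hypothetical values of \sigma^{2}) that prints how quickly the Gaussian kernel falls off with the distance \left \| x-l \right \|:

```python
import numpy as np

# How quickly the Gaussian kernel falls off with distance for two values of
# sigma^2 (hypothetical values). A larger sigma^2 decays slowly (smoother
# hypothesis: higher bias, lower variance); a smaller sigma^2 decays abruptly
# (lower bias, higher variance).
distances = np.array([0.0, 0.5, 1.0, 2.0, 3.0])   # ||x - l||
for sigma2 in (0.5, 3.0):
    kernel_values = np.exp(-distances ** 2 / (2.0 * sigma2))
    print(f"sigma^2 = {sigma2}: {np.round(kernel_values, 3)}")
```

Running this shows the kernel values staying close to 1 over a wider range of distances when \sigma^{2} is larger, which is exactly the smoother, higher-bias behavior described above.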

So, that was the support vector machine with kernels algorithm. And hopefully this discussion of bias and variance will give you some sense of how you can expect this algorithm to behave as well.

<end>
