Support Vector Machine - Kernels II

Abstract: This article is the transcript of video 105, "Kernels II", in Chapter 13, "Support Vector Machines", of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it to make it more concise and readable, so that it can be consulted later. I'm sharing it here; if you find any errors, corrections are warmly welcome and sincerely appreciated. I hope it is also helpful for your own studies.
————————————————

In the last video, we started to talk about the kernels idea and how it can be used to define new features for the support vector machine. In this video, I'd like to throw in some of the missing details and also say a few words about how to use these ideas in practice, such as how they pertain to the bias-variance trade-off in support vector machines.

In the last video, I talked about the process of picking a few landmarks l^{(1)}, l^{(2)}, l^{(3)}, and that allowed us to define the similarity function, also called the kernel; in this example, the similarity function is a Gaussian kernel. And that allowed us to build this form of hypothesis function. But where do we get these landmarks from? It also seems that, for complex learning problems, we may want a lot more landmarks than just three that we choose by hand. So in practice, this is how the landmarks are chosen: given the machine learning problem, we have some data set of positive and negative examples. The idea is that we're going to take the examples and, for every training example that we have, put a landmark at exactly the same location as that training example. So if I have one training example x^{(1)}, I'm going to choose my first landmark to be at exactly the same location as my first training example. And if I have a different example x^{(2)}, I'm going to set the second landmark to be at the location of my second training example. On the figure on the right, I use red and blue dots just as an illustration; the color of the dots is not significant. What I'm going to end up with using this method is m landmarks l^{(1)}, l^{(2)} down to l^{(m)} if I have m training examples, with one landmark at the location of each of my training examples. And this is nice because it is saying that my features are basically going to measure how close an example is to one of the things I saw in my training set.
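As a minimal sketch of this landmark-placement step (assuming NumPy and a hypothetical toy training matrix X, neither of which appears in the lecture), choosing the landmarks is just a copy of the training set:

```python
import numpy as np

# Hypothetical toy training set: m = 3 examples, n = 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [0.0, 1.5]])

# Put one landmark at the location of every training example:
# l^(1) = x^(1), l^(2) = x^(2), ..., l^(m) = x^(m).
landmarks = X.copy()   # shape (m, n), one landmark per training example
```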

So, to write this out a little more concretely: given m training examples, I'm going to choose the locations of my landmarks to be exactly the locations of my m training examples. When you're given an example x, which can be something in the training set, in the cross-validation set, or in the test set, we're going to compute these features f_{1}, f_{2}, and so on, where l^{(1)} is actually equal to x^{(1)} and so on. These then give me a feature vector. So let me write f as a feature vector \begin{bmatrix} f_1\\ f_{2}\\ ...\\ f_m\end{bmatrix}. By convention, if we want, we can add an extra feature f_{0}, which is always equal to 1. This plays a role similar to what we had previously for x_{0}, which was our intercept term. For example, if we have a training example (x^{(i)}, y^{(i)}), the features we would compute for this training example will be as follows:

f^{(i)}_{1}=SIM(x^{(i)}, l^{(1)})

f^{(i)}_{2}=SIM(x^{(i)}, l^{(2)})

...

f^{(i)}_{m}=SIM(x^{(i)}, l^{(m)})

And, somewhere in the middle, for the i-th component, I will actually have one feature f^{(i)}_{i}, which is the similarity between x^{(i)} and l^{(i)}: f^{(i)}_{i}=SIM(x^{(i)}, l^{(i)})=SIM(x^{(i)}, x^{(i)})=exp(-\frac{0}{2\sigma ^{2}})=1 if we're using the Gaussian kernel. So one of my features for this training example is going to be equal to 1. And then, similar to what I have above, I can take all of these m features and group them into a feature vector. So instead of representing my example using x^{(i)}, which is an \mathbb{R}^{n} or \mathbb{R}^{n+1} dimensional vector, we can now represent my training example using this feature vector f. I'm going to write this as f^{(i)}=\begin{bmatrix}f^{(i)}_0\\ f^{(i)}_1\\ f^{(i)}_2\\ ...\\ f^{(i)}_m \end{bmatrix}. Note that, as usual, we also add f^{(i)}_{0}=1. And so this vector gives me my new feature vector with which to represent my training example. So given these kernels and similarity functions, here's how we use a support vector machine.
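Here is a small sketch of that feature mapping, assuming NumPy, the Gaussian kernel from the last video, and hypothetical toy data; the names gaussian_kernel and feature_vector are my own, not from the lecture:

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity (Gaussian kernel) between an example x and a landmark l."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

def feature_vector(x, landmarks, sigma):
    """Map x to f = [f_0, f_1, ..., f_m], with f_0 = 1 by convention."""
    sims = [gaussian_kernel(x, l, sigma) for l in landmarks]
    return np.concatenate(([1.0], sims))   # shape (m + 1,)

# Hypothetical data: the landmarks are the training examples themselves.
X = np.array([[1.0, 2.0], [3.0, 0.5], [0.0, 1.5]])
sigma = 1.0

f_1 = feature_vector(X[0], X, sigma)
# The entry for the example's own landmark is exp(0) = 1, as noted above.
print(f_1)   # f_1[1] == 1.0
```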

If you already have learned a set of parameters \theta, then if you're given a value of x and you want to make a prediction, what we do is compute the features f, which is now an \mathbb{R}^{m+1} dimensional vector. We have m here because we have m training examples and thus m landmarks. And what we do is predict 1 if \theta ^{T}f=\theta _{0}f_{0}+\theta _{1}f_{1}+...+\theta _{m}f_{m}\geq 0, where \theta \in \mathbb{R}^{m+1}. So that's how you make a prediction if you already have a setting for the parameters \theta. But how do you get the parameters \theta? Well, you do that using the SVM learning algorithm, and specifically you solve this minimization problem. Only now, instead of making predictions using \theta^{T}x^{(i)}, we're using \theta^{T}f^{(i)} to make predictions. It's by solving this minimization problem that you get the parameters for your support vector machine. One last detail is that for this optimization problem we have n=m features. There's one mathematical detail I should mention, which is that in the way the support vector machine is implemented, this last term is actually done a little bit differently. You don't really need to know about this detail in order to use the SVM, and in fact the equations written down here should give you all the intuitions that you need. But in the way the SVM is implemented, the last term \sum _{j=1} ^{m} {\theta_j}^2=\theta^T\theta if we ignore \theta_{0}. And what most support vector machine implementations do is actually replace this \theta^{T}\theta with \theta ^{T} times some matrix M, which depends on the kernel you use, times \theta, that is \theta ^{T} M \theta. This gives us a slightly different distance metric: instead of minimizing exactly the norm of \theta squared (\left \| \theta \right \|^{2}), we instead minimize something slightly similar to it, a rescaled version of the parameter vector \theta that depends on the kernel. But this is a mathematical detail that allows the SVM software to run more efficiently, and the reason the SVM does this modification is that it allows it to scale to much bigger training sets. For example, if you have a training set with 10,000 training examples, we end up with 10,000 landmarks, and \theta \in \mathbb{R}^{10,000}. Maybe that works, but when m becomes really big, like 50,000 or 100,000, then solving for all of these parameters can become expensive for the SVM optimization software that solves the minimization problem I drew here. So, as a mathematical detail, which again you really don't need to know about, it actually modifies that last term a little bit to optimize something slightly different than just minimizing the norm squared of \theta. But if you want, you can feel free to think of this as an implementation detail that does change the objective a bit, but it's done primarily for reasons of computational efficiency, so usually you don't really have to worry about this. And by the way, in case you're wondering why we don't apply the kernels idea to other algorithms as well, like logistic regression, it turns out that if you want, you can actually apply the kernels idea and define features using landmarks and so on for logistic regression. But the computational tricks that apply for support vector machines don't generalize well to other algorithms like logistic regression. And so, using kernels with logistic regression is going to be very slow. Whereas, because of computational tricks like the one embodied in how it modifies this last term (meaning \theta ^{T} M \theta) and the details of how the support vector machine software is implemented, support vector machines and kernels tend to go particularly well together. Whereas logistic regression and kernels, you can do it, but it would run very slowly, and it won't be able to take advantage of the advanced optimization techniques that people have figured out for the particular case of running a support vector machine with a kernel. But all this pertains only to how you actually implement software to minimize the cost function. I'll say more about that in the next video, but you really don't need to know how to write software to minimize the cost function, because you can find very good off-the-shelf software for doing so. And just as I wouldn't recommend writing code to invert a matrix or to compute a square root, I actually do not recommend writing software to minimize this cost function yourself; instead, use the off-the-shelf software packages that people have developed. Those software packages already embody these numerical tricks, so you don't really have to worry about them.
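Going back to the prediction rule above, here is a minimal sketch of computing f and predicting 1 when \theta^{T}f\geq 0, assuming NumPy, hypothetical toy data, and an already-learned \theta; training \theta and the \theta^{T}M\theta detail are deliberately left to the off-the-shelf packages just mentioned:

```python
import numpy as np

def svm_kernel_predict(x, X_train, theta, sigma):
    """Predict 1 if theta^T f >= 0, else 0, using kernel features f.

    The landmarks are the training examples X_train, so theta has m + 1
    entries (one per landmark plus the intercept term f_0 = 1). This only
    sketches the prediction rule; learning theta is done by the SVM package.
    """
    sims = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    f = np.concatenate(([1.0], sims))      # f in R^(m+1)
    return 1 if theta @ f >= 0 else 0

# Hypothetical training set and (already learned) parameters theta.
X_train = np.array([[1.0, 2.0], [3.0, 0.5], [0.0, 1.5]])
theta = np.array([-0.5, 1.2, -0.3, 0.8])   # shape (m + 1,)
print(svm_kernel_predict(np.array([1.1, 1.9]), X_train, theta, sigma=1.0))
```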

But one other thing that is worth knowing about is, when you're applying a support vector machine, how to choose its parameters. And the last thing I want to do in this video is say a little bit about the bias and variance trade-offs when using an SVM. When using an SVM, one of the things you need to choose is the parameter C in the optimization objective, and you recall that C played a role similar to 1/\lambda, where \lambda is the regularization parameter we had for logistic regression. So, if you have a large value of C, this corresponds to what we had back in logistic regression with a small value of \lambda, meaning not using much regularization. And if you do that, you tend to have a hypothesis with lower bias and higher variance. Whereas if you have a smaller value of C, then this corresponds to using logistic regression with a large value of \lambda, and that corresponds to a hypothesis with higher bias and lower variance. And so, a hypothesis with large C has higher variance and is more prone to overfitting, whereas a hypothesis with small C has higher bias and thus is more prone to underfitting. So this parameter C is one of the parameters we need to choose. The other is the parameter \sigma^{2}, which appears in the Gaussian kernel. If \sigma^{2} is large, then the similarity function exp(-\frac{\left \| x-l^{(i)} \right \|^{2}}{2\sigma ^{2}}) tends to fall off relatively slowly. In this example with a single feature x_{1} and a landmark l, a large \sigma^{2} gives a smoother function that varies more smoothly. This would give me a hypothesis with higher bias and lower variance, because with a Gaussian kernel that falls off smoothly, you tend to get a hypothesis that varies slowly as you change the input x. Whereas in contrast, if \sigma^{2} is small, the Gaussian kernel will vary more abruptly. So if \sigma^{2} is small, then my features vary less smoothly, with higher slopes or higher derivatives. And using this, you end up fitting hypotheses with lower bias, but you can have higher variance. If you look at this week's programming exercise, you actually get to play around with some of these ideas and see these effects yourself.
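To make the \sigma^{2} discussion concrete, here is a tiny sketch (assuming NumPy and hypothetical values of \sigma^{2}) that prints how quickly the Gaussian kernel falls off with the distance \left \| x-l \right \|:

```python
import numpy as np

# How quickly the Gaussian kernel falls off with distance for two values of
# sigma^2 (hypothetical values). A larger sigma^2 decays slowly (smoother
# hypothesis: higher bias, lower variance); a smaller sigma^2 decays abruptly
# (lower bias, higher variance).
distances = np.array([0.0, 0.5, 1.0, 2.0, 3.0])   # ||x - l||
for sigma2 in (0.5, 3.0):
    kernel_values = np.exp(-distances ** 2 / (2.0 * sigma2))
    print(f"sigma^2 = {sigma2}: {np.round(kernel_values, 3)}")
```

Running this shows the kernel values staying close to 1 over a wider range of distances when \sigma^{2} is larger, which is exactly the smoother, higher-bias behavior described above.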

So, that was the support vector machine with kernels algorithm. And hopefully this discussion of bias and variance will give you some sense of how you can expect this algorithm to behave as well.

<end>
