Support Vector Machine - Kernels I

Abstract: This article is the transcript of Lesson 104, "Kernels I", from Chapter 13, "Support Vector Machines", of Andrew Ng's Machine Learning course. I recorded and edited it while studying the videos to make it more concise and easier to read, for future reference, and am sharing it here. If there are any errors, corrections are welcome and sincerely appreciated. I hope it is helpful for your learning.
————————————————

In this video, I'd like to start adapting support vector machines in order to develop complex nonlinear classifiers. The main technique for doing that is something called kernels. Let's see what these kernels are and how to use them.

Suppose you have a training set that looks like this, and you want to find a nonlinear decision boundary to distinguish the positive and negative examples, maybe a decision boundary that looks like that. One way to do so is to come up with a set of complex polynomial features, right? So, a set of features that looks like this, so that you end up with a hypothesis h_{\theta }(x) that predicts 1 if \theta _{0}+\theta _{1}x_{1}+\theta _{2}x_{2}+\theta _{3}x_{1}x_{2}+\theta _{4}x_{1}^{2}+\theta _{5}x_{2}^{2}+\dots \geq 0, and predicts 0 otherwise. Another way of writing this, introducing a little bit of new notation that I'll use later, is to think of the hypothesis as computing a decision boundary using \theta _{0}+\theta _{1}f_{1}+\theta _{2}f_{2}+\theta _{3}f_{3}+\dots, where I'm going to use this new notation f_{1},f_{2},f_{3} and so on to denote these new sorts of features I'm computing. So, f_{1}=x_{1}, f_{2}=x_{2}, f_{3}=x_{1}x_{2}, f_{4}=x_{1}^{2}, f_{5}=x_{2}^{2} and so on. We've seen previously that coming up with these high-order polynomials is one way to get lots more features. But the question is: is there a different choice of features, or a better choice of features, than these high-order polynomials? Because it's not clear that these high-order polynomials are what we want, and when we talked about computer vision, where the input is an image with lots of pixels, we also saw how using high-order polynomials becomes very computationally expensive, because there are a lot of these high-order polynomial terms. So, is there a different or better choice of features that we can plug into this sort of hypothesis form?
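As a concrete illustration of this feature notation, here is a minimal Python sketch (my own, not from the lecture) that builds the polynomial features f_{1} through f_{5} from a two-dimensional input and evaluates the hypothesis; the parameter vector theta is an arbitrary placeholder, just to show the mechanics.

```python
import numpy as np

def polynomial_features(x):
    """Map a 2-D input x = [x1, x2] to the features f1..f5 named in the text."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

def predict(theta, x):
    """Predict 1 if theta0 + theta1*f1 + ... + theta5*f5 >= 0, else 0."""
    f = polynomial_features(x)
    score = theta[0] + theta[1:] @ f
    return 1 if score >= 0 else 0

# theta here is a made-up placeholder, not a learned parameter vector.
theta = np.array([-1.0, 0.5, 0.5, 1.0, -0.3, -0.3])
print(predict(theta, np.array([1.0, 2.0])))
```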

So, here's one idea for how to define the new features f_{1}, f_{2}, f_{3}. On this slide, I'm going to define only three new features, but for real problems you get to define a much larger number. Here's what I'm going to do. In this space of features x_{1}, x_{2} (and I'm going to leave x_{0} out of this), I'm going to manually pick a few points and call these points l^{(1)}, l^{(2)}, l^{(3)}. For now, let's just say that I'm going to choose these three points manually. I'm going to call these three points landmarks, so landmarks l^{(1)}, l^{(2)}, l^{(3)}. What I'm going to do is define my new features as follows. Given an example x, let me define my first feature f_{1} to be some measure of the similarity between my training example x and my first landmark. The specific formula that I'm going to use to measure similarity is f_{1}=similarity(x, l^{(1)})=\exp(-\frac{\left \| x-l^{(1)} \right \|^{2}}{2\sigma ^{2}}). Here, \left \| x-l^{(1)} \right \|^{2} is the squared Euclidean distance between the point x and the landmark l^{(1)}. We'll see more about this later, but that's my first feature. My second feature is f_{2}=similarity(x, l^{(2)})=\exp(-\frac{\left \| x-l^{(2)} \right \|^{2}}{2\sigma ^{2}}), and similarly, f_{3}=similarity(x, l^{(3)})=\exp(-\frac{\left \| x-l^{(3)} \right \|^{2}}{2\sigma ^{2}}). The mathematical term for this similarity function is a kernel function, and the specific kernel I'm using here is actually called a Gaussian kernel. So, this formula, this particular choice of similarity function, is called a Gaussian kernel. The way the terminology goes is that these different similarity functions are called kernels, and we can have different similarity functions; the specific example I'm giving here is called the Gaussian kernel. We'll see examples of other kernels, but for now just think of these as similarity functions. And so, instead of writing similarity(x,l), sometimes we also write this as a kernel, denoted k(x,l^{(i)}). So, let's see what these kernels actually do and why these sorts of similarity functions might make sense.
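The Gaussian kernel above is straightforward to write in code. The following short sketch (my own, not from the lecture; the landmark coordinates are illustrative choices) computes similarity(x, l) for a given landmark and bandwidth \sigma^{2}:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma_sq=1.0):
    """similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))."""
    diff = x - landmark
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma_sq))

# Three manually chosen landmarks, as in the lecture (positions are illustrative).
l1, l2, l3 = np.array([3.0, 5.0]), np.array([1.0, 1.0]), np.array([5.0, 1.0])
x = np.array([3.2, 4.8])
f1, f2, f3 = (gaussian_kernel(x, l) for l in (l1, l2, l3))
print(f1, f2, f3)
```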

So let's take my first landmark l^{(1)}, which is one of those points I chose on my figure just now. The similarity, the kernel between x and l^{(1)}, is given by this expression. Just to make sure we're on the same page about what the numerator term is, the numerator can also be written as a sum of component-wise squared distances between the vector x and the vector l^{(1)}: \left \| x-l^{(1)} \right \|^{2}=\sum_{j=1}^{n}(x_{j}-l_{j}^{(1)})^{2}. Again, for the purpose of these slides I'm ignoring x_{0}, that is, ignoring the intercept term x_{0}, which is always equal to 1. So, this is how you compute the kernel, the similarity between x and a landmark. Let's see what this function does. Suppose x is close to one of the landmarks. Then the Euclidean distance in the numerator will be close to 0, so f_{1}\approx \exp(-\frac{0^{2}}{2\sigma ^{2}})\approx 1. I put the approximation symbol here because the distance may not be exactly 0, but if x is close to the landmark, this term will be close to 0 and f_{1} will be close to 1. Conversely, if x is far from l^{(1)}, then f_{1}\approx \exp(-\frac{(\text{large number})^{2}}{2\sigma ^{2}})\approx 0. So what these features do is measure how similar x is to one of the landmarks: the feature f is going to be close to 1 when x is close to your landmark, and 0 or close to 0 when x is far from your landmark. Each of the landmarks on the previous slide defines a new feature f_{1}, f_{2}, f_{3}. That is, given a training example x, we can now compute three new features: f_{1}, f_{2}, f_{3}. But first, let's look at this exponentiation function, the similarity function, plot some figures, and understand better what it really looks like.
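To see this near/far behaviour numerically, a quick check (reusing the gaussian_kernel helper sketched above, with an illustrative landmark) confirms that the feature is 1 when x sits on the landmark and collapses toward 0 as x moves away:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma_sq=1.0):
    diff = x - landmark
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma_sq))

l1 = np.array([3.0, 5.0])
print(gaussian_kernel(np.array([3.0, 5.0]), l1))    # x on the landmark   -> 1.0
print(gaussian_kernel(np.array([3.1, 4.9]), l1))    # x near the landmark -> close to 1
print(gaussian_kernel(np.array([10.0, -4.0]), l1))  # x far away          -> essentially 0
```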

For this example, let's say we have two features x_{1} and x_{2}, my first landmark is l^{(1)}=\begin{bmatrix} 3\\ 5 \end{bmatrix}, and I set \sigma ^{2}=1. If I plot what this feature f_{1}=\exp(-\frac{\left \| x-l^{(1)} \right \|^{2}}{2\sigma ^{2}}) looks like, what I get is this figure. The vertical axis, the height of the surface, is the value of f_{1}, and the horizontal axes are x_{1} and x_{2}. Given a certain training example with some value of x_{1} and x_{2}, the height of the surface above that point shows the corresponding value of f_{1}. Down below, the same figure is shown as a contour plot, with x_{1} on the horizontal axis and x_{2} on the vertical axis; the figure on the bottom is just a contour plot of the 3D surface. Notice that when x=\begin{bmatrix} 3\\ 5 \end{bmatrix} exactly, f_{1} takes the value 1, because that's the maximum, and as x moves away, this feature takes on values that are close to 0. So this feature f_{1} really measures how close x is to the first landmark, and it varies between 0 and 1 depending on how close x is to the first landmark l^{(1)}. The other thing I want to do on this slide is show the effect of varying the parameter \sigma ^{2}. So, \sigma ^{2} is the parameter of the Gaussian kernel, and as you vary it, you get slightly different effects. Let's set \sigma ^{2}=0.5 and see what we get. With \sigma ^{2}=0.5, what you find is that the kernel looks similar, except that the width of the bump becomes narrower; the contours shrink a bit too. So if \sigma ^{2}=0.5, as you start from x=\begin{bmatrix} 3\\ 5 \end{bmatrix} and move away, the feature f_{1} falls to zero much more rapidly. Conversely, if you increase \sigma ^{2} to 3, then as you move away from l^{(1)}, the value of the feature falls away much more slowly.
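If you want to reproduce these contour pictures yourself, a rough matplotlib sketch (my own code, using the same l^{(1)}=[3, 5] and the three \sigma^{2} values from the lecture; the grid range is an arbitrary choice) looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt

l1 = np.array([3.0, 5.0])
x1, x2 = np.meshgrid(np.linspace(0, 6, 200), np.linspace(2, 8, 200))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, sigma_sq in zip(axes, [0.5, 1.0, 3.0]):
    # f1 = exp(-||x - l1||^2 / (2 * sigma^2)) evaluated over the grid
    f1 = np.exp(-((x1 - l1[0]) ** 2 + (x2 - l1[1]) ** 2) / (2.0 * sigma_sq))
    ax.contourf(x1, x2, f1, levels=20)
    ax.set_title(f"sigma^2 = {sigma_sq}")
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
plt.tight_layout()
plt.show()
```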

So, given this definition of the features, let's see what sort of hypothesis we can learn. Given a training example x, we're going to compute these features f_{1}, f_{2}, f_{3}, and the hypothesis is going to predict 1 when \theta _{0}+\theta _{1}f_{1}+\theta _{2}f_{2}+\theta _{3}f_{3}\geq 0. For this particular example, let's say that I've already run a learning algorithm and somehow ended up with these values of the parameters: \theta _{0}=-0.5, \theta _{1}=1, \theta _{2}=1, \theta _{3}=0. What I want to do is consider what happens if we have a training example located at this magenta dot. So, given this training example x, what would my hypothesis predict? Well, if I look at the formula, because my training example x is close to l^{(1)}, we have f_{1}\approx 1, and because it is far from l^{(2)} and l^{(3)}, f_{2}\approx 0 and f_{3}\approx 0. So, plugging into the formula, \theta _{0}+\theta _{1}\times 1+\theta _{2}\times 0+\theta _{3}\times 0=-0.5+1=0.5\geq 0. So, at this magenta point, we predict y=1. Now, let's take a different point, drawn in cyan. You can make a similar computation: f_{1}\approx f_{2}\approx f_{3}\approx 0, and \theta _{0}+\theta _{1}f_{1}+\theta _{2}f_{2}+\theta _{3}f_{3}\approx -0.5< 0, so we predict y=0. If you do this yourself for a range of different points, you should be able to convince yourself that if you have a training example that's close to l^{(2)}, then we'll also predict y=1. In fact, if you look around this space, what you'll find is that for points near l^{(1)} and l^{(2)} we end up predicting positive, and for points far away from these two landmarks we end up predicting that the class is equal to 0. So what we end up with is a decision boundary for this hypothesis that looks something like this, where inside this red decision boundary we predict y=1 and outside we predict y=0. And so this is how, with this definition of the landmarks and the kernel function, we can learn a pretty complex nonlinear decision boundary, like the one I just drew, where we predict positive when we're close to either one of those two landmarks and negative when we're very far away from them. And so this is part of the idea of kernels and how we use them with the support vector machine: we define these extra features using landmarks and similarity functions to learn more complex nonlinear classifiers.
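Putting the pieces together, here is a small sketch (the landmark coordinates are my own illustrative choices, since the lecture only shows them on a figure; theta uses the values from the lecture: \theta_{0}=-0.5, \theta_{1}=1, \theta_{2}=1, \theta_{3}=0) that evaluates the hypothesis at a point near l^{(1)} and at a point far from all landmarks:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma_sq=1.0):
    diff = x - landmark
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma_sq))

# Illustrative landmark positions (not specified numerically in the lecture).
landmarks = [np.array([3.0, 5.0]), np.array([1.0, 1.0]), np.array([5.0, 1.0])]
theta = np.array([-0.5, 1.0, 1.0, 0.0])  # theta0..theta3 from the lecture example

def predict(x):
    f = np.array([gaussian_kernel(x, l) for l in landmarks])
    return 1 if theta[0] + theta[1:] @ f >= 0 else 0

print(predict(np.array([3.1, 4.9])))    # near l1 -> predicts 1
print(predict(np.array([10.0, 10.0])))  # far from all landmarks -> predicts 0
```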

So, hopefully that gives you a sense of the idea of kernels and how we can use them to define new features for the support vector machine. But there are a couple of questions that we haven't answered yet. One is: how do we get these landmarks, how do we choose them? Another is: what other similarity functions, if any, can we use besides the one we talked about, which is called the Gaussian kernel? In the next video, we'll give answers to these questions and put everything together to show how the support vector machine with kernels can be a powerful way to learn complex nonlinear functions.

<end>
