Support Vector Machine - Kernels I

Abstract: This article is the transcript of Lesson 104, "Kernels I", from Chapter 13, "Support Vector Machines", of Andrew Ng's Machine Learning course. I recorded and edited it while studying the videos to make it more concise and easier to read, for future reference, and am sharing it here. If there are any errors, corrections are welcome and sincerely appreciated. I hope it is helpful for your learning.
————————————————

In this video, I'd like to start adapting support vector machines in order to develop complex nonlinear classifiers. The main technique for doing that is something called kernels. Let's see what these kernels are and how to use them.

Suppose you have a training set that looks like this, and you want to find a nonlinear decision boundary to distinguish the positive and negative examples, maybe a decision boundary that looks like that. One way to do so is to come up with a set of complex polynomial features, right? So, a set of features that looks like this, so that you end up with a hypothesis h_{\theta }(x) that predicts 1 if \theta _{0}+\theta _{1}x_{1}+\theta _{2}x_{2}+\theta _{3}x_{1}x_{2}+\theta _{4}x_{1}^{2}+\theta _{5}x_{2}^{2}+\dots \geq 0, and predicts 0 otherwise. Another way of writing this, introducing a little bit of new notation that I'll use later, is to think of the hypothesis as computing a decision boundary using \theta _{0}+\theta _{1}f_{1}+\theta _{2}f_{2}+\theta _{3}f_{3}+\dots, where I'm going to use this new notation f_{1},f_{2},f_{3} and so on to denote these new sorts of features I'm computing. So, f_{1}=x_{1}, f_{2}=x_{2}, f_{3}=x_{1}x_{2}, f_{4}=x_{1}^{2}, f_{5}=x_{2}^{2} and so on. We've seen previously that coming up with these high-order polynomials is one way to get lots more features. But the question is: is there a different choice of features, or a better choice of features, than these high-order polynomials? Because it's not clear that these high-order polynomials are what we want, and when we talked about computer vision, where the input is an image with lots of pixels, we also saw how using high-order polynomials becomes very computationally expensive, because there are a lot of these high-order polynomial terms. So, is there a different or better choice of features that we can plug into this sort of hypothesis form?
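As a concrete illustration of this feature notation, here is a minimal Python sketch (my own, not from the lecture) that builds the polynomial features f_{1} through f_{5} from a two-dimensional input and evaluates the hypothesis; the parameter vector theta is an arbitrary placeholder, just to show the mechanics.

```python
import numpy as np

def polynomial_features(x):
    """Map a 2-D input x = [x1, x2] to the features f1..f5 named in the text."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

def predict(theta, x):
    """Predict 1 if theta0 + theta1*f1 + ... + theta5*f5 >= 0, else 0."""
    f = polynomial_features(x)
    score = theta[0] + theta[1:] @ f
    return 1 if score >= 0 else 0

# theta here is a made-up placeholder, not a learned parameter vector.
theta = np.array([-1.0, 0.5, 0.5, 1.0, -0.3, -0.3])
print(predict(theta, np.array([1.0, 2.0])))
```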

So, here's one idea for how to define the new features f_{1}, f_{2}, f_{3}. On this slide, I'm going to define only three new features, but for real problems you get to define a much larger number. Here's what I'm going to do. In this space of features x_{1}, x_{2} (and I'm going to leave x_{0} out of this), I'm going to manually pick a few points and call these points l^{(1)}, l^{(2)}, l^{(3)}. For now, let's just say that I'm going to choose these three points manually. I'm going to call these three points landmarks, so landmarks l^{(1)}, l^{(2)}, l^{(3)}. What I'm going to do is define my new features as follows. Given an example x, let me define my first feature f_{1} to be some measure of the similarity between my training example x and my first landmark. The specific formula that I'm going to use to measure similarity is f_{1}=similarity(x, l^{(1)})=\exp(-\frac{\left \| x-l^{(1)} \right \|^{2}}{2\sigma ^{2}}). Here, \left \| x-l^{(1)} \right \|^{2} is the squared Euclidean distance between the point x and the landmark l^{(1)}. We'll see more about this later, but that's my first feature. My second feature is f_{2}=similarity(x, l^{(2)})=\exp(-\frac{\left \| x-l^{(2)} \right \|^{2}}{2\sigma ^{2}}), and similarly, f_{3}=similarity(x, l^{(3)})=\exp(-\frac{\left \| x-l^{(3)} \right \|^{2}}{2\sigma ^{2}}). The mathematical term for this similarity function is a kernel function, and the specific kernel I'm using here is actually called a Gaussian kernel. So, this formula, this particular choice of similarity function, is called a Gaussian kernel. The way the terminology goes is that these different similarity functions are called kernels, and we can have different similarity functions; the specific example I'm giving here is called the Gaussian kernel. We'll see examples of other kernels, but for now just think of these as similarity functions. And so, instead of writing similarity(x,l), sometimes we also write this as a kernel, denoted k(x,l^{(i)}). So, let's see what these kernels actually do and why these sorts of similarity functions might make sense.
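The Gaussian kernel above is straightforward to write in code. The following short sketch (my own, not from the lecture; the landmark coordinates are illustrative choices) computes similarity(x, l) for a given landmark and bandwidth \sigma^{2}:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma_sq=1.0):
    """similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))."""
    diff = x - landmark
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma_sq))

# Three manually chosen landmarks, as in the lecture (positions are illustrative).
l1, l2, l3 = np.array([3.0, 5.0]), np.array([1.0, 1.0]), np.array([5.0, 1.0])
x = np.array([3.2, 4.8])
f1, f2, f3 = (gaussian_kernel(x, l) for l in (l1, l2, l3))
print(f1, f2, f3)
```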

So let's take my first landmark l^{(1)}, which is one of those points I chose on my figure just now. The similarity, the kernel between x and l^{(1)}, is given by this expression. Just to make sure we're on the same page about what the numerator term is, the numerator can also be written as a sum of component-wise squared distances between the vector x and the vector l^{(1)}: \left \| x-l^{(1)} \right \|^{2}=\sum_{j=1}^{n}(x_{j}-l_{j}^{(1)})^{2}. Again, for the purpose of these slides I'm ignoring x_{0}, that is, ignoring the intercept term x_{0}, which is always equal to 1. So, this is how you compute the kernel, the similarity between x and a landmark. Let's see what this function does. Suppose x is close to one of the landmarks. Then the Euclidean distance in the numerator will be close to 0, so f_{1}\approx \exp(-\frac{0^{2}}{2\sigma ^{2}})\approx 1. I put the approximation symbol here because the distance may not be exactly 0, but if x is close to the landmark, this term will be close to 0 and f_{1} will be close to 1. Conversely, if x is far from l^{(1)}, then f_{1}\approx \exp(-\frac{(\text{large number})^{2}}{2\sigma ^{2}})\approx 0. So what these features do is measure how similar x is to one of the landmarks: the feature f is going to be close to 1 when x is close to your landmark, and 0 or close to 0 when x is far from your landmark. Each of the landmarks on the previous slide defines a new feature f_{1}, f_{2}, f_{3}. That is, given a training example x, we can now compute three new features: f_{1}, f_{2}, f_{3}. But first, let's look at this exponentiation function, the similarity function, plot some figures, and understand better what it really looks like.
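To see this near/far behaviour numerically, a quick check (reusing the gaussian_kernel helper sketched above, with an illustrative landmark) confirms that the feature is 1 when x sits on the landmark and collapses toward 0 as x moves away:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma_sq=1.0):
    diff = x - landmark
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma_sq))

l1 = np.array([3.0, 5.0])
print(gaussian_kernel(np.array([3.0, 5.0]), l1))    # x on the landmark   -> 1.0
print(gaussian_kernel(np.array([3.1, 4.9]), l1))    # x near the landmark -> close to 1
print(gaussian_kernel(np.array([10.0, -4.0]), l1))  # x far away          -> essentially 0
```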

For this example, let's say we have two features x_{1} and x_{2}, my first landmark is l^{(1)}=\begin{bmatrix} 3\\ 5 \end{bmatrix}, and I set \sigma ^{2}=1. If I plot what this feature f_{1}=\exp(-\frac{\left \| x-l^{(1)} \right \|^{2}}{2\sigma ^{2}}) looks like, what I get is this figure. The vertical axis, the height of the surface, is the value of f_{1}, and the horizontal axes are x_{1} and x_{2}. Given a certain training example with some value of x_{1} and x_{2}, the height of the surface above that point shows the corresponding value of f_{1}. Down below, the same figure is shown as a contour plot, with x_{1} on the horizontal axis and x_{2} on the vertical axis; the figure on the bottom is just a contour plot of the 3D surface. Notice that when x=\begin{bmatrix} 3\\ 5 \end{bmatrix} exactly, f_{1} takes the value 1, because that's the maximum, and as x moves away, this feature takes on values that are close to 0. So this feature f_{1} really measures how close x is to the first landmark, and it varies between 0 and 1 depending on how close x is to the first landmark l^{(1)}. The other thing I want to do on this slide is show the effect of varying the parameter \sigma ^{2}. So, \sigma ^{2} is the parameter of the Gaussian kernel, and as you vary it, you get slightly different effects. Let's set \sigma ^{2}=0.5 and see what we get. With \sigma ^{2}=0.5, what you find is that the kernel looks similar, except that the width of the bump becomes narrower; the contours shrink a bit too. So if \sigma ^{2}=0.5, as you start from x=\begin{bmatrix} 3\\ 5 \end{bmatrix} and move away, the feature f_{1} falls to zero much more rapidly. Conversely, if you increase \sigma ^{2} to 3, then as you move away from l^{(1)}, the value of the feature falls away much more slowly.
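If you want to reproduce these contour pictures yourself, a rough matplotlib sketch (my own code, using the same l^{(1)}=[3, 5] and the three \sigma^{2} values from the lecture; the grid range is an arbitrary choice) looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt

l1 = np.array([3.0, 5.0])
x1, x2 = np.meshgrid(np.linspace(0, 6, 200), np.linspace(2, 8, 200))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, sigma_sq in zip(axes, [0.5, 1.0, 3.0]):
    # f1 = exp(-||x - l1||^2 / (2 * sigma^2)) evaluated over the grid
    f1 = np.exp(-((x1 - l1[0]) ** 2 + (x2 - l1[1]) ** 2) / (2.0 * sigma_sq))
    ax.contourf(x1, x2, f1, levels=20)
    ax.set_title(f"sigma^2 = {sigma_sq}")
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
plt.tight_layout()
plt.show()
```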

So, given this definition of the features, let's see what sort of hypothesis we can learn. Given a training example x, we're going to compute these features f_{1}, f_{2}, f_{3}, and the hypothesis is going to predict 1 when \theta _{0}+\theta _{1}f_{1}+\theta _{2}f_{2}+\theta _{3}f_{3}\geq 0. For this particular example, let's say that I've already run a learning algorithm and somehow ended up with these values of the parameters: \theta _{0}=-0.5, \theta _{1}=1, \theta _{2}=1, \theta _{3}=0. What I want to do is consider what happens if we have a training example located at this magenta dot. So, given this training example x, what would my hypothesis predict? Well, if I look at the formula, because my training example x is close to l^{(1)}, we have f_{1}\approx 1, and because it is far from l^{(2)} and l^{(3)}, f_{2}\approx 0 and f_{3}\approx 0. So, plugging into the formula, \theta _{0}+\theta _{1}\times 1+\theta _{2}\times 0+\theta _{3}\times 0=-0.5+1=0.5\geq 0. So, at this magenta point, we predict y=1. Now, let's take a different point, drawn in cyan. You can make a similar computation: f_{1}\approx f_{2}\approx f_{3}\approx 0, and \theta _{0}+\theta _{1}f_{1}+\theta _{2}f_{2}+\theta _{3}f_{3}\approx -0.5< 0, so we predict y=0. If you do this yourself for a range of different points, you should be able to convince yourself that if you have a training example that's close to l^{(2)}, then we'll also predict y=1. In fact, if you look around this space, what you'll find is that for points near l^{(1)} and l^{(2)} we end up predicting positive, and for points far away from these two landmarks we end up predicting that the class is equal to 0. So what we end up with is a decision boundary for this hypothesis that looks something like this, where inside this red decision boundary we predict y=1 and outside we predict y=0. And so this is how, with this definition of the landmarks and the kernel function, we can learn a pretty complex nonlinear decision boundary, like the one I just drew, where we predict positive when we're close to either one of those two landmarks and negative when we're very far away from them. And so this is part of the idea of kernels and how we use them with the support vector machine: we define these extra features using landmarks and similarity functions to learn more complex nonlinear classifiers.
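Putting the pieces together, here is a small sketch (the landmark coordinates are my own illustrative choices, since the lecture only shows them on a figure; theta uses the values from the lecture: \theta_{0}=-0.5, \theta_{1}=1, \theta_{2}=1, \theta_{3}=0) that evaluates the hypothesis at a point near l^{(1)} and at a point far from all landmarks:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma_sq=1.0):
    diff = x - landmark
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma_sq))

# Illustrative landmark positions (not specified numerically in the lecture).
landmarks = [np.array([3.0, 5.0]), np.array([1.0, 1.0]), np.array([5.0, 1.0])]
theta = np.array([-0.5, 1.0, 1.0, 0.0])  # theta0..theta3 from the lecture example

def predict(x):
    f = np.array([gaussian_kernel(x, l) for l in landmarks])
    return 1 if theta[0] + theta[1:] @ f >= 0 else 0

print(predict(np.array([3.1, 4.9])))    # near l1 -> predicts 1
print(predict(np.array([10.0, 10.0])))  # far from all landmarks -> predicts 0
```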

So, hopefully that gives you a sense of the idea of kernels and how we can use them to define new features for the support vector machine. But there are a couple of questions that we haven't answered yet. One is: how do we get these landmarks, how do we choose them? Another is: what other similarity functions, if any, can we use besides the one we talked about, which is called the Gaussian kernel? In the next video, we'll give answers to these questions and put everything together to show how the support vector machine with kernels can be a powerful way to learn complex nonlinear functions.

<end>
