What are Kernels in Machine Learning and SVM?

A very good explanation of kernels:

https://www.quora.com/What-are-Kernels-in-Machine-Learning-and-SVM

I have excerpted it below.


What are Kernels in Machine Learning and SVM?

I'm trying to get into SVM, but I cannot get the idea of kernels. What are they and why do we need them?



Bharath Hariharan, PhD student in Computer Vision
Great answers here already, but there are some additional things that I would want to say. So here goes.

What are kernels?
A kernel is a similarity function. It is a function that you, as the domain expert, provide to a machine learning algorithm. It takes two inputs and spits out how similar they are.

Suppose your task is to learn to classify images. You have (image, label) pairs as training data. Consider the typical machine learning pipeline: you take your images, you compute features, you string the features for each image into a vector, and you feed these "feature vectors" and labels into a learning algorithm.

Data --> Features --> Learning algorithm

Kernels offer an alternative. Instead of defining a slew of features, you define a single kernel function to compute similarity between images. You provide this kernel, together with the images and labels to the learning algorithm, and out comes a classifier.

Of course, the standard SVM / logistic regression / perceptron formulation doesn't work with kernels: it works with feature vectors. How on earth do we use kernels then? Two beautiful mathematical facts come to our rescue:
  1. Under some conditions, every kernel function can be expressed as a dot product in a (possibly infinite dimensional) feature space (Mercer's theorem).
  2. Many machine learning algorithms can be expressed entirely in terms of dot products.
These two facts mean that I can take my favorite machine learning algorithm, express it in terms of dot products, and then since my kernel is also a dot product in some space, replace the dot product by my favorite kernel. Voila!
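
To make fact 2 concrete, here is a minimal kernel perceptron sketch in Python/NumPy (my own illustration, not part of the original answer; the quadratic kernel and the toy data are assumptions). Training and prediction touch the data only through kernel evaluations, so the dot product can be swapped for any valid kernel:

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=100):
    """Train a perceptron written purely in terms of kernel evaluations k(x_i, x_j)."""
    n = len(X)
    alpha = np.zeros(n)                                    # dual coefficients, one per training point
    K = np.array([[k(a, b) for b in X] for a in X])        # Gram matrix of pairwise similarities
    for _ in range(epochs):
        for i in range(n):
            # the decision value needs only kernel values, never an explicit feature vector
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1.0
    # prediction for a new point x also needs only kernel evaluations against the training set
    return lambda x: np.sign(np.sum(alpha * y * np.array([k(xi, x) for xi in X])))

# assumed toy data: labels not linearly separable in 1-D, but separable under a quadratic kernel
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([1, -1, -1, 1])
predict = kernel_perceptron(X, y, k=lambda a, b: (1.0 + a @ b) ** 2)
print([int(predict(x)) for x in X])                        # [1, -1, -1, 1]
```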

Why kernels?
Why kernels, as opposed to feature vectors? One big reason is that in many cases, computing the kernel is easy, but computing the feature vector corresponding to the kernel is really really hard. The feature vector for even simple kernels can blow up in size, and for kernels like the RBF kernel (k(x, y) = exp(-||x - y||^2); see Radial basis function kernel) the corresponding feature vector is infinite dimensional. Yet, computing the kernel is almost trivial.
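
For instance (a small sketch of my own, with an assumed gamma parameter and assumed points), evaluating the RBF kernel is a one-liner even though its feature space is infinite dimensional:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2): a subtraction, a squared norm, and an exp."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 4.0])
print(rbf_kernel(x, y))   # exp(-2) ≈ 0.135, with no infinite-dimensional feature vector in sight
```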

Many machine learning algorithms can be written to only use dot products, and then we can replace the dot products with kernels. By doing so, we don't have to use the feature vector at all. This means that we can work with highly complex, efficient-to-compute, and yet high-performing kernels without ever having to write down the huge and potentially infinite-dimensional feature vector. Thus, if not for the ability to use kernel functions directly, we would be stuck with relatively low-dimensional, low-performance feature vectors. This "trick" is called the kernel trick (Kernel trick).

Endnote
I want to clear up two confusions which seem prevalent on this page:
  1. A function that transforms one feature vector into a higher-dimensional feature vector is not a kernel function. Thus f(x) = [x, x^2] is not a kernel; it is simply a new feature vector. You do not need kernels to do this. You need kernels if you want to do this, or more complicated feature transformations, without blowing up dimensionality (see the sketch after this list).
  2. A kernel is not restricted to SVMs. Any learning algorithm that only works with dot products can be written down using kernels. The idea of SVMs is beautiful, the kernel trick is beautiful, and convex optimization is beautiful, and they stand quite independent of one another.
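
A minimal sketch of point 1 (my own addition, not Bharath's): f below is only a feature map, taking one input and returning a longer vector, while k takes two inputs and returns a similarity score; it happens to be the dot product in the feature space defined by f.

```python
import numpy as np

def f(x):
    """A feature map, not a kernel: one input in, a higher-dimensional feature vector out."""
    return np.concatenate([x, x ** 2])

def k(x, y):
    """A kernel: two inputs in, one similarity score out (here, a dot product of mapped inputs)."""
    return f(x) @ f(y)

x, y = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(f(x))      # [1. 2. 1. 4.] -- just a new feature vector
print(k(x, y))   # 84.0          -- a similarity between x and y
```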
Lili Jiang, Data Scientist at Quora
Briefly speaking, a kernel is a shortcut that helps us do certain calculations faster that would otherwise involve computations in a higher-dimensional space.

Mathematical definition: K(x, y) = <f(x), f(y)>. Here K is the kernel function, x and y are n-dimensional inputs, f is a map from n-dimensional to m-dimensional space, and <x, y> denotes the dot product. Usually m is much larger than n.

Intuition: normally calculating <f(x), f(y)> requires us to calculate f(x) and f(y) first, and then do the dot product. These two computation steps can be quite expensive as they involve manipulations in m-dimensional space, where m can be a large number. But after all the trouble of going to the high-dimensional space, the result of the dot product is really a scalar: we come back to one-dimensional space again! Now, the question we have is: do we really need to go through all the trouble to get this one number? Do we really have to go to the m-dimensional space? The answer is no, if you find a clever kernel.

Simple Example: x = (x1, x2, x3); y = (y1, y2, y3). Then for the function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y) = (<x, y>)^2.

Let's plug in some numbers to make this more intuitive: suppose x = (1, 2, 3); y = (4, 5, 6). Then:
f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
<f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024

A lot of algebra, mainly because f is a mapping from 3-dimensional to 9-dimensional space.

Now let us use the kernel instead:
K(x, y) = (4 + 10 + 18)^2 = 32^2 = 1024
Same result, but this calculation is so much easier.
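
The arithmetic above is easy to check in a few lines of NumPy (my own verification, not part of Lili's answer):

```python
import numpy as np

x, y = np.array([1, 2, 3]), np.array([4, 5, 6])
f = lambda v: np.outer(v, v).ravel()   # the explicit 9-dimensional map f
print(f(x) @ f(y))                     # 1024, via the 9-dimensional detour
print((x @ y) ** 2)                    # 1024, via the kernel K(x, y) = <x, y>^2
```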

Additional beauty of kernels: kernels allow us to do stuff in infinite dimensions! Sometimes going to a higher dimension is not just computationally expensive, but also impossible. f(x) can be a mapping from n dimensions to an infinite-dimensional space, which we may have little idea of how to deal with. Then the kernel gives us a wonderful shortcut.

Relation to SVM: now how is this related to SVM? The idea of SVM is that y = w·phi(x) + b, where w is the weight vector, phi is the feature map, and b is the bias. If y > 0, then we classify the datum to class 1, else to class 0. We want to find a set of weights and a bias such that the margin is maximized. Previous answers mention that the kernel makes the data linearly separable for the SVM. I think a more precise way to put this is: kernels do not make the data linearly separable; the feature vector phi(x) makes the data linearly separable. The kernel is there to make the calculation faster and easier, especially when the feature vector phi is of very high dimension (for example, x1, x2, ..., x_D, x1^2, x2^2, ..., x_D^2).
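
A small scikit-learn sketch of this point (my own addition; the toy data and gamma value are assumptions): once the SVM is trained, its decision value is computed entirely from kernel evaluations against the support vectors, so the high-dimensional phi(x) is never formed.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# assumed toy data: one class inside the unit circle, the other outside
rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = (np.linalg.norm(X, axis=1) < 1.0).astype(int)

clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

# decision value = sum_i (alpha_i * y_i) * K(support_vector_i, x) + b  -- kernels only, no phi(x)
x_new = np.array([[0.2, -0.1]])
manual = clf.dual_coef_ @ rbf_kernel(clf.support_vectors_, x_new, gamma=0.5) + clf.intercept_
print(manual.ravel(), clf.decision_function(x_new))   # the two values agree
```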

Why it can also be understood as a measure of similarity:
if we put the definition of the kernel above, <f(x), f(y)>, in the context of SVM and feature vectors, it becomes <phi(x), phi(y)>. The inner product means the projection of phi(x) onto phi(y), or colloquially, how much overlap phi(x) and phi(y) have in feature space. In other words, how similar they are.
Sameer Gupta, Flat Out
The analogy of a spring-mass configuration at equilibrium, which is attained when the configuration has minimum potential energy, is yet another way to understand it.
(Caveat: low math, limited scope.)
We can compare:
  1. The learning model to the horizontal line in the figure, which contains the queried data point y_0.
  2. The data points y_i to weighted particles.
  3. The similarity of the data points to the query point to the spring lengths.
  4. The spring constants k_i to kernel functions, adding weights to data points corresponding to their similarity to the one queried.
  5. The potential energy U = (1/2) Σ k_i (y_i - y_0)^2 to the error in predicting, from the learning model, the data points similar to the one queried.
  6. The equilibrium to the state in which this error is minimum.

All springs are identical until kernel functions are used to weight the data according to their similarity to the query point.



Once kernels are used, the springs are no longer equal.

Sometimes no values of the parameters of a [non-linear] global model can provide a good approximation of the true function. There are two approaches to this problem. First, we could use a larger, more complex global model and hope that it can approximate the data sufficiently. The second approach is to fit a simple model to local patches instead of the whole region of interest.

In the second approach, the error in prediction behaves much as thin-plate splines minimize a combination of the bending energy of a plate and the energy of the constraints pulling on the plate; the planar local model (the line in the figure) can now rotate as well as translate. The springs are forced to remain oriented vertically, rather than move to the smallest distance between the data points and the line. The fit (the line in equilibrium) produced by equally strong springs attached to a set of data points (the black dots), minimizing this criterion, is the ordinary least-squares line.



As the kernels are put into action, springs nearer to the query point are strengthened and springs further away are weakened. The strengths of the springs are given by K(d(x_i, q)), and the fit that minimizes this weighted criterion (the new equilibrium) bends toward the data points near the query.
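
A rough sketch of this spring picture in code (my own reading of the analogy, not part of Sameer's answer; the RBF "spring constants", the bandwidth, and the toy data are assumptions). For a constant local model, minimizing U = (1/2) Σ K(d(x_i, q)) (y_i - y_hat)^2 gives the kernel-weighted mean of the y_i:

```python
import numpy as np

def local_constant_fit(x_train, y_train, q, bandwidth=0.3):
    """Minimize 0.5 * sum_i K(d(x_i, q)) * (y_i - y_hat)^2 over a constant y_hat.
    The kernel values act as spring constants: points similar to the query pull harder."""
    k = np.exp(-((x_train - q) ** 2) / (2 * bandwidth ** 2))   # RBF 'spring constants'
    return np.sum(k * y_train) / np.sum(k)                     # the kernel-weighted mean

# assumed toy data: noisy samples from a sine curve
rng = np.random.RandomState(0)
x_train = np.linspace(0, 2 * np.pi, 50)
y_train = np.sin(x_train) + 0.1 * rng.randn(50)
print(local_constant_fit(x_train, y_train, q=np.pi / 2))       # close to sin(pi/2) = 1
```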
Rahul Agarwal, Data Scientist at Citi
Found this on Reddit: Please explain Support Vector Machines (SVM) like I am a 5 year old. • /r/MachineLearning

Simply the best explanation of SVM I ever found.
----------------------------------------------------------------------------------------------
We have 2 colors of balls on the table that we want to separate.


We get a stick and put it on the table, this works pretty well right?


Some villain comes and places more balls on the table. The stick kind of still works, but one of the balls is on the wrong side, and there is probably a better place to put the stick now.

SVMs try to put the stick in the best possible place by having as big a gap on either side of the stick as possible.




Now when the villain returns the stick is still in a pretty good spot.


There is another trick in the SVM toolbox that is even more important. Say the villain has seen how good you are with a stick so he gives you a new challenge.

There’s no stick in the world that will let you split those balls well, so what do you do? You flip the table of course! Throwing the balls into the air. Then, with your pro ninja skills, you grab a sheet of paper and slip it between the balls.

Now, looking at the balls from where the villain is standing, the balls will look split by some curvy line.
Boring adults call the balls data, the stick a classifier, the biggest-gap trick optimization, flipping the table kernelling, and the piece of paper a hyperplane.
------------------------- ------------------------- ------------------------- --------------------

Now see this:

--------------------------------------------------------------------------------------------
One other point that I like to mention about the SVM (unrelated to this question) is how it is defined by the boundary-case examples. (Taken from the CS109 course: SVM Explanation.)

Assume you want to separate apples from oranges using an SVM. The red squares are apples and the blue circles are oranges.



Now see the support vectors (the filled points), which define the margin here.

Intuitively, the filled blue circle is an orange that looks very much like an apple, while the filled red squares are apples that look very much like oranges.

Think about this for a moment. If you wanted your kid to learn to differentiate between an apple and an orange, you would show them a perfect apple and a perfect orange. Not SVMs: they only want to see an apple that looks like an orange and vice versa. This approach is very different from how most machine learning algorithms operate, and maybe that's why it works so well in some cases.
Charles H Martin, Calculation Consulting; we predict things
The intent of the kernel trick: to allow us to represent an infinite set of discrete functions with a family of continuous functions. And in many cases in machine learning, we can express our dot products with a simple analytic function with some adjustable parameters.

Please see my blog posts for details


Kernels Part 1: What is an RBF Kernel?  Really?

Kernels Part 2:  Affine Quantum Gravity

Kernels and Quantum Gravity Part 3: Coherent States
Gillis Danielsen, degree in physics, job in finance
Here is an even shorter visualization of what an SVM does. Basically, the trick is to look for a projection to a higher-dimensional space where there is a linear separation of the data (others have posted much more detailed and correct answers, so I just wanted to share the vid, not made by me).

Abhishek Shivkumar, Data Scientist
I know that pasting a link to an external talk would not be an appropriate answer here, but this talk is just so awesome and answers your question so intuitively that I encourage you to watch it:

http://www.google.co.in/url?sa=t...
Shashank Gupta, Learning the nuts and bolts of Machine Learning.
An SVM in its general setting works by finding an "optimal" hyperplane which separates two point clouds of data. But this is limited in the sense that it only works when the two point clouds (each corresponding to a class) can be separated by a hyperplane. What if the separating boundary is non-linear?
This is an image I found on the web, and it is a perfect example of the limitation of the SVM in its general setting. The SVM will learn the red line from the data points, which we can clearly see is not optimal. The optimal boundary is the green curve, which is a non-linear structure.
To overcome this limitation we use the concept of kernels. Given a test point, the kernel fits a curve over the data points, specifically over those that are 'close' to the test point. So, when using an RBF kernel, it fits a Gaussian distribution over the points near the test point, where closeness is defined using a standard distance metric.
Suppose this is the function we are trying to approximate. The yellow region is the Gaussian fit over the training points that are close to the test point (assume it is the one in the center of the Gaussian). So it approximates the function (the separating boundary) using piece-wise Gaussians (for the RBF kernel), and it uses only a sparse set of the training data (the support vectors) to do so.
This is the intuition behind using kernels in SVMs. Of course there are nice theoretical arguments for using kernels, like the projection of data points into an infinite-dimensional space where they become linearly separable, but this is the intuition that I have for using kernels. HTH
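
As a quick illustration of this limitation and the fix (my own sketch using scikit-learn; the concentric-circles data set and the parameters are assumptions, not from the answer):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# assumed toy data: two concentric rings, which no straight line can separate
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print(f"linear kernel accuracy: {linear_acc:.2f}")   # roughly 0.5, no better than chance
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")      # close to 1.0 on this data
```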
Ashkon Farhangi, Studied AI @ Stanford
To add to the other answers, the RBF kernel, perhaps the most commonly used kernel, acts essentially as a low-pass filter that prefers smoother models. For a full mathematical justification of this fact, check out Charles Martin's explanation.
Hasan Poonawala
SVMs depend on two ideas: VC dimension and optimization. Given points in the plane and the fact that they are separable (Nikhil's answer), there are infinitely many lines (half-planes in 2D) that would separate them. The SVM finds the best one by solving an optimization problem over those separating half-planes. The nice thing is that it is a convex quadratic programming problem, which is fast and easy to solve.

The VC dimension is related to a fact we took for granted: are the training points separable? Well, when it comes to points in 2D, the maximum number of points that are guaranteed to be separable is 3, which is precisely the VC dimension. Note the diagram in Nikhil's answer projects the three points on a line to 2D, and he drew a classifier that works. The VC dimension depends on the classifiers (hyperplanes in the case of SVMs) and the dimension in which the data lies (2D in this case). Now if the points are in n dimensions, the VC dimension is n+1. So instead, if we have m points, we need the points to lie in a space of dimension at least (m-1). When this happens, we are sure to find a solution to the optimization we perform. This is where the kernel comes in. We project the m points lying in n dimensions, n < (m-1), into an (m-1)-dimensional space so that we are guaranteed to find a linear classifier in that dimension.

To make it concrete, as in Nikhil's example, we had three points in 1 dimension. Not solvable. So we project them into a space of at least (3-1) = 2 dimensions, and we will find a solution.
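
Nikhil's diagram is not reproduced here, but the idea can be sketched (my own illustration, with assumed points and an assumed quadratic feature map): three 1-D points labelled +, -, + cannot be split by a single threshold, yet after mapping x to (x, x^2) a horizontal line separates them.

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])      # three 1-D points
y = np.array([1, -1, 1])            # +, -, + : no single threshold separates them

phi = np.column_stack([x, x ** 2])  # project into 2-D: (x, x^2)
# in 2-D the line x2 = 0.5 does the job: the outer points have x^2 = 1, the middle one 0
predictions = np.where(phi[:, 1] > 0.5, 1, -1)
print(predictions)                  # [ 1 -1  1] -- matches the labels
```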

Edit:
As Yan King Yin points out, three points in a plane cannot be separated if they are collinear. The VC dimension excludes these cases (measure-zero sets); otherwise half-planes couldn't separate anything.

