Support Vector Machines - Using an SVM

Abstract: This is the transcript of lesson 106, "Using an SVM", from Chapter 13, "Support Vector Machines", of Andrew Ng's Machine Learning course. I wrote it down while working through the videos and lightly edited it to make it more concise and easier to read, for later reference. I am sharing it here; if you find any mistakes, corrections are very welcome and sincerely appreciated. I also hope it is helpful for your own studies.
————————————————

So far we've been talking about SVMs at a fairly abstract level. In this video, I'd like to talk about what you actually need to do in order to run or use an SVM.

The support vector machine algorithm poses a particular optimization problem. But as I briefly mentioned in an earlier video, I really do not recommend writing your own software to solve for the parameters \theta yourself. Just as today very few of us, almost none of us, would think of writing code ourselves to invert a matrix or take the square root of a number, we just call a library function to do that. In the same way, the software for solving the SVM optimization problem is very complex, and researchers have been doing numerical optimization research for many years to come up with good software libraries and packages for this. So I strongly recommend using one of the highly optimized software libraries rather than trying to implement something yourself. There are lots of good libraries out there; the two that I happen to use most often are liblinear and libsvm, but there are really many good libraries that you can link to from most of the major programming languages you might be using to code up a learning algorithm.

Even though you shouldn't be writing your own SVM optimization software, there are a few things you do need to do. First, come up with a choice of the parameter C; we talked a little bit about its bias/variance properties in an earlier video. Second, you need to choose the kernel, or similarity function, that you want to use.

One choice is to decide not to use any kernel, and the idea of no kernel is also called a linear kernel. So if someone says they use an SVM with a linear kernel, what that means is they use an SVM without a kernel: a version of the SVM that just uses \theta^{T}x, and predicts y=1 if \theta_{0}+\theta_{1}x_{1}+...+\theta_{n}x_{n} \geq 0. You can think of the term "linear kernel" as the version of the SVM that just gives you a standard linear classifier. That would be a reasonable choice for some problems, and many software libraries, like liblinear, can train an SVM without a kernel, also called a linear kernel. Why would you want to do this? If you have a large number of features n, and the number of examples m is small, maybe you just want to fit a linear decision boundary rather than a very complicated nonlinear function, because you might not have enough data, and you risk overfitting if you try to fit a very complicated function in a very high dimensional feature space with a small training set. So this would be one reasonable setting where you might decide not to use a kernel, or to use what's called a linear kernel.

A second choice for the kernel is the Gaussian kernel, which is what we had previously. If you do this, then the other choice you need to make is the parameter \sigma^{2}. We also talked a little bit about the bias/variance tradeoff here: if \sigma^{2} is large, you tend to get a higher bias, lower variance classifier; if \sigma^{2} is small, you get a higher variance, lower bias classifier. So when would you choose a Gaussian kernel?
If your original features x\in \mathbb{R}^{n}, and if n is small and ideally m is large, that is, if we have, say, a two-dimensional training set like the example I drew earlier, where n=2 but we have a pretty large training set, then maybe you want to use a kernel to fit a more complex nonlinear decision boundary, and the Gaussian kernel would be a fine way to do this. I'll say more towards the end of the video about when you might choose a linear kernel, a Gaussian kernel, and so on. But concretely, if you decide to use a Gaussian kernel, here's what you need to do.
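As a concrete illustration (not part of the lecture), here is a minimal sketch of these two choices using scikit-learn, which wraps liblinear and libsvm. The toy data and the values of C and \sigma are made up for the example, and scikit-learn parameterizes the Gaussian (RBF) kernel by gamma = 1/(2\sigma^{2}) rather than by \sigma^{2} directly.

```python
import numpy as np
from sklearn.svm import LinearSVC, SVC

# Toy data (made up for illustration): m examples, n features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # nonlinear boundary

# Choice 1: no kernel ("linear kernel"); LinearSVC is backed by liblinear.
linear_clf = LinearSVC(C=1.0)            # C controls the bias/variance tradeoff
linear_clf.fit(X, y)

# Choice 2: Gaussian (RBF) kernel; SVC is backed by libsvm.
sigma = 0.5                              # example value; tune via cross-validation
gaussian_clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * sigma ** 2))
gaussian_clf.fit(X, y)

print(linear_clf.score(X, y), gaussian_clf.score(X, y))
```

On data with a circular decision boundary like this, the Gaussian-kernel SVM should fit noticeably better than the linear one, which is the point of the kernel.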

Depending on what support vector machine software package you use, it may ask you to implement a kernel function, or the similarity function. So if you're using an Octave or MATLAB implementation of an SVM, it may ask you to provide a function to compute a particular feature of the kernel. This is really computing f_{i} for one particular value of i, where f_{i} is just a single real number. What you need to do is write a kernel function that takes as input some vector x that is a training example or a test example, and a landmark; but I've written x_{1} and x_{2} here because the landmarks are really training examples as well. So you write software that takes as input x_{1} and x_{2}, computes the similarity function between them, and returns a real number. What some support vector machine packages do is expect you to provide this kernel function that takes x_{1} and x_{2} as input and returns a real number, and then the package will take it from there: it will automatically take x and map it to \begin{bmatrix} f_{1}\\ f_{2}\\ ... \\ f_{m}\end{bmatrix} using this function, generate all the features, and train the support vector machine from there. But sometimes you do need to provide this function yourself. If you are using the Gaussian kernel, some SVM implementations will also include it built in, along with a few other kernels, since the Gaussian kernel is probably the most common kernel; the Gaussian and linear kernels are really the two most popular kernels by far.

Just one implementation note. If you have features of very different scales, it is important to perform feature scaling before using the Gaussian kernel. Here's why. Imagine computing the norm between x and l. What this is really doing is computing v = x - l, and then \left \| v \right \|^{2}=v_{1}^{2}+v_{2}^{2}+...+v_{n}^{2}, where x\in \mathbb{R}^{n} (ignoring x_{0}). So \left \| x-l \right \|^{2}=\left \| v \right \|^{2}=(x_{1}-l_{1})^{2}+(x_{2}-l_{2})^{2}+...+(x_{n}-l_{n})^{2}. Now suppose your features take on very different ranges of values. Take a housing prediction example, where your data is about houses: the first feature x_{1}, the size, is in the range of thousands of square feet, but the second feature x_{2}, the number of bedrooms, is in the range of 1 to 5. Then (x_{1}-l_{1})^{2} is going to be huge, possibly on the order of 1000^{2}, whereas (x_{2}-l_{2})^{2} is going to be much smaller. If that's the case, these distances will be almost entirely dominated by the sizes of the houses, and the number of bedrooms will be largely ignored. So, to avoid this and to make the SVM work well, do perform feature scaling. That will make sure the SVM gives a comparable amount of attention to all of your different features.
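As a sketch of both points (my own illustration, not the lecture's code), the function name gaussian_kernel and the use of scikit-learn's StandardScaler below are choices made for this example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def gaussian_kernel(x1, x2, sigma):
    """Similarity between two vectors: exp(-||x1 - x2||^2 / (2 sigma^2))."""
    v = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-np.dot(v, v) / (2.0 * sigma ** 2))

# Feature scaling before using the Gaussian kernel, so that a feature in the
# thousands (square feet) does not dominate one in the range 1-5 (bedrooms).
X = np.array([[2104.0, 3], [1600.0, 3], [2400.0, 4], [852.0, 2]])
y = np.array([1, 0, 1, 0])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

sigma = 1.0
clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * sigma ** 2))
clf.fit(X_scaled, y)
print(gaussian_kernel(X_scaled[0], X_scaled[1], sigma))
```

Without the scaling step, the similarity values would be driven almost entirely by the square-footage column, which is exactly the failure mode described above.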

When you're applying the support vector machine, chances are that by far the two most common kernels you'll use will be the linear kernel, meaning no kernel, or the Gaussian kernel that we talked about. Just one note of warning: not all similarity functions you might come up with are valid kernels. The Gaussian kernel, the linear kernel, and the other kernels people sometimes use all need to satisfy a technical condition called Mercer's Theorem. The reason for this is that SVM implementations have lots of clever numerical optimization tricks for solving for the parameters \theta efficiently, and in the original design the decision was made to restrict attention only to kernels that satisfy Mercer's Theorem. That is what makes sure all of these SVM software packages can use that large class of optimizations and get the parameters \theta very quickly. So, what most people end up doing is using either the linear or the Gaussian kernel, but there are a few other kernels that also satisfy Mercer's Theorem that you may run across other people using, although I end up using them very rarely, if at all.

Just to mention some of the other kernels you may run across. One is the polynomial kernel. One version of it defines the similarity between x and l as (x^{T}l)^{2}: if x and l are very close to each other, the inner product tends to be large. It is not used that often, but you may run across some people using it. Another option is (x^{T}l)^{3}, and these are all examples of the polynomial kernel: (x^{T}l+1)^{3}, (x^{T}l+5)^{4}. So the polynomial kernel actually has two parameters: the constant you add (which could be 0), and the degree of the polynomial, that is, the power you raise this to. The more general form of the polynomial kernel is (x^{T}l+\text{constant})^{\text{degree}}. The polynomial kernel almost always, or usually, performs worse than the Gaussian kernel and is not used that much, but it is something you may run across. It is usually used only for data where x and l are all strictly non-negative, which ensures that these inner products are never negative. It has some other properties as well, but people tend not to use it much.

Then, depending on what you're doing, there are other, more esoteric kernels as well. There is the string kernel, which is sometimes used if your input data is text strings. There are things like the chi-square kernel, the histogram intersection kernel, and so on. These are more esoteric kernels you can use to measure similarity between different objects. For example, if you're trying to do some sort of text classification problem where the input x is a string, you might want to define the similarity between two strings using the string kernel. But I personally end up using these esoteric kernels very rarely. I think I might have used the chi-square kernel maybe once in my life, the histogram intersection kernel maybe once or twice, and I've actually never used the string kernel myself. But in case you run across them in other applications, a quick web search, a quick Google or Bing search, should find you the definitions of these kernels.
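For reference (again my own illustration), the polynomial kernel's two parameters map onto scikit-learn's SVC as coef0 (the added constant) and degree; scikit-learn also multiplies the inner product by a gamma factor, so gamma is pinned to 1 here to match (x^{T}l+1)^{3} exactly.

```python
import numpy as np
from sklearn.svm import SVC

# Polynomial kernel (x^T l + constant)^degree:
# kernel="poly", coef0 = constant, degree = power, gamma=1.0 to disable scaling.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([0, 0, 1, 1])

poly_clf = SVC(kernel="poly", degree=3, coef0=1.0, gamma=1.0, C=1.0)
poly_clf.fit(X, y)
print(poly_clf.predict(X))
```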

So, just two details I want to talk about in this video. One is multiclass classification. Suppose you have four classes, or more generally k classes. How do you get an SVM to output an appropriate decision boundary between your multiple classes? Many SVM packages already have built-in multiclass classification functionality. Otherwise, one way to do this is to use the one-versus-all method that we talked about when we were developing logistic regression. What you do is train k SVMs if you have k classes, one to distinguish each class from the rest, and this gives you k parameter vectors: \theta ^{(1)}, which tries to distinguish class y=1 from all of the other classes; \theta ^{(2)}, which you get when y=2 is the positive class and all the others are the negative class; and so on, up to \theta ^{(k)}, which distinguishes the final class k from everything else. Then you just predict the class i with the largest (\theta ^{(i)})^{T}x. There is a good chance that whatever software package you use already has built-in multiclass classification functionality, so you may not need to worry about this.
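Here is a small sketch of one-versus-all with linear SVMs (my own example, not the lecture's code; note that scikit-learn's LinearSVC already applies a one-vs-rest scheme internally, so you would rarely write this loop yourself):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy multiclass data (made up): 3 classes in 2 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 30)

# One-versus-all: train one linear SVM per class, class i vs. the rest.
classifiers = []
for i in np.unique(y):
    clf = LinearSVC(C=1.0).fit(X, (y == i).astype(int))
    classifiers.append(clf)

# Predict the class whose classifier gives the largest (theta^(i))^T x.
scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
y_pred = np.argmax(scores, axis=1)
print((y_pred == y).mean())
```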

Finally, we developed the support vector machine starting off with logistic regression and then modifying the cost function a little bit. The last thing I want to do in this video is say a little bit about when you would use one of these two algorithms. Let's say n is the number of features and m is the number of training examples. When should we use one algorithm versus the other?

If n is large relative to your training set size, then what I would usually do is use logistic regression, or use the SVM without a kernel, that is, with a linear kernel. With so many features and a smaller training set, a linear function will probably do fine, and you don't really have enough data to fit a very complicated nonlinear function.

If n is small and m is intermediate, by which I mean n is maybe anywhere from 1 to 1,000 and m is maybe anywhere from 10 to 10,000, or maybe 50,000 examples, so m is pretty big like 10,000 but not a million, then an SVM with a Gaussian kernel will often work well. We talked about this earlier with a two-dimensional training set: if n=2 and you have a pretty large number of training examples, the Gaussian kernel will do a pretty good job separating positive and negative examples.

A third setting of interest is when n is small but m is large. In this case, an SVM with the Gaussian kernel will be somewhat slow to run. Today's SVM packages are very good, but they can still struggle a little bit with a massive training set when using a Gaussian kernel. What I usually do in that case is try to manually create more features and then use logistic regression or an SVM without a kernel. If you look at the slides, you'll see that in both of these places I pair logistic regression and the SVM without a kernel together, and there is a reason for that: logistic regression and the SVM without a kernel are really pretty similar algorithms. They usually do pretty similar things and give pretty similar performance, though depending on your implementation details one may be more efficient than the other. But where one of these algorithms applies, the other is likely to work pretty well as well.

Finally, where do neural networks fit in? For all of these problems, a well designed neural network is likely to work well. The one disadvantage, or one reason you might sometimes not use a neural network, is that for some of these problems a neural network might be slow to train, whereas a good SVM package could run quite a bit faster. Also, although we didn't show this earlier, it turns out that the optimization problem the SVM poses is a convex optimization problem, so a good SVM optimization software package will always find the global minimum, or something close to it. So, for the SVM you don't need to worry about local optima.
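Purely as a mnemonic, here is a sketch that encodes these rough guidelines; the thresholds are the approximate ranges mentioned above, not hard rules, and the function name is my own.

```python
def suggest_model(n_features: int, m_examples: int) -> str:
    """Rough heuristic from the lecture: pick a model family from n and m."""
    if n_features >= m_examples or n_features > 10_000:
        # Many features relative to examples: a linear boundary is usually enough.
        return "logistic regression, or SVM with a linear kernel (no kernel)"
    if n_features <= 1_000 and m_examples <= 50_000:
        # Few features, intermediate number of examples.
        return "SVM with a Gaussian kernel"
    # Few features, very large m: Gaussian-kernel SVMs become slow to train.
    return "create more features manually, then logistic regression or linear-kernel SVM"

print(suggest_model(n_features=10_000, m_examples=1_000))
print(suggest_model(n_features=2, m_examples=10_000))
print(suggest_model(n_features=50, m_examples=5_000_000))
```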

In case the guidelines I gave here seem a little bit vague when you're looking at some problem, that's okay. When I face a machine learning problem, sometimes it's just not clear which is the best algorithm to use. But as you saw in the earlier videos, while the algorithm does matter, what often matters even more is things like how much data you have, how skilled you are at doing error analysis and debugging learning algorithms, figuring out how to design new features, figuring out what other features to give your learning algorithm, and so on. Often those things will matter more than whether you're using logistic regression or an SVM. That said, the SVM is still widely perceived as one of the most powerful learning algorithms, and there is this regime of n and m in which it is a very effective way to learn complex nonlinear functions. So with logistic regression, neural networks, and SVMs, you're very well positioned to build state-of-the-art machine learning systems for a wide range of applications. This is another very powerful tool in your arsenal, one that is used all over the place in Silicon Valley, in industry, and in academia to build many high-performance learning systems.

<end>
