Study notes for Support Vector Machines (2)

SVM is a method for the classification of both linear and nonlinear data, i.e., linearly separable and nonlinearly separable data. In addition, SVMs can be used for numeric prediction. Put simply, an SVM uses a nonlinear mapping to transform the original data into a higher dimension, in which it searches for the maximum marginal hyperplane (i.e., a "decision boundary") separating the data of one class from another. For linear data, the mapping may not be necessary. The SVM finds this hyperplane using support vectors (the essential or critical training tuples that lie closest to the hyperplane, hence the most difficult tuples to classify and the ones that give the most information about the classification) and margins (defined by the support vectors). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.
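As a minimal sketch of these ideas (assuming scikit-learn is installed; the tiny 2-D data set below is made up purely for illustration), the following code fits a linear SVM and inspects the support vectors that define the maximum marginal hyperplane:

```python
# Minimal sketch: fit a linear SVM and inspect its support vectors.
# Assumptions: scikit-learn is available; the toy data is illustrative only.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2-D
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class 0
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The critical training tuples that lie closest to the decision boundary
print("support vectors:\n", clf.support_vectors_)
# The separating hyperplane w.x + b = 0 found from these support vectors
print("w =", clf.coef_, "b =", clf.intercept_)
```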

1. Pros and Cons

Pros:
  1. SVMs are highly accurate and less prone to overfitting than many other methods, because the complexity of the learned classifier is characterized by the number of support vectors rather than by the dimensionality of the data;
  2. SVM training always finds a global solution, unlike neural networks trained with backpropagation, where many local minima usually exist.
Cons:
  1. How to improve the speed of training and testing so as to scale to very large data sets (e.g., millions of support vectors);
  2. How to select the best kernel function and how to find better solutions for multiclass classification.

2. Kernels and Kernel Methods

The kernel function is used to speed up training by avoiding the explicit dot product in the higher-dimensional space and computing it in the original data space instead. In other words, we may not need to perform the mapping explicitly in practice.
  • Commonly used admissible kernel functions are: the polynomial kernel of degree h, the Gaussian radial basis function (RBF) kernel, the sigmoid kernel, and the linear kernel.
  • There are no golden rules for determining which admissible kernel will result in the most accurate SVM.
  • In practice, the kernel chosen does not generally make a large difference in the resulting accuracy. An SVM with a Gaussian radial basis function (RBF) kernel gives the same decision hyperplane as a type of neural network known as a radial basis function network. An SVM with a sigmoid kernel is equivalent to a simple two-layer neural network known as a multilayer perceptron (with no hidden layers).
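As a small sketch of how these kernels are typically compared in practice (assuming scikit-learn; the synthetic data and parameter values are illustrative, not taken from the text above):

```python
# Sketch: train one SVM per kernel on the same data and compare accuracy.
# Assumptions: scikit-learn is installed; data and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale").fit(X_tr, y_tr)
    print(f"{kernel:8s} test accuracy: {clf.score(X_te, y_te):.3f}")
```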
Kernel Methods

Kernel Methods (KMs) approach the problem (pattern analysis) by mapping the data into a high dimensional feature space, where each coordinate corresponds to one feature of the data items, transforming the data into a set of points in a Euclidean space. In that space, a variety of methods can be used to find relations in the data. Since the mapping can be quite general (not necessarily linear, for example), the relations found in this way are accordingly very general. KMs owe their name to the use of kernel functions, which enable them to operate in the feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. This approach is called the kernel trick. Kernel functions have been introduced for sequence data, graphs, text, images, as well as vectors.
Blog (in Chinese): the main idea of kernel methods (核方法的主要思想)
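The kernel trick can be illustrated with a tiny numerical sketch (the degree-2 polynomial kernel and the explicit feature map phi below are standard textbook examples, written out here only to show that the two computations agree):

```python
# Kernel trick sketch: the kernel value equals an inner product in feature
# space, but is computed entirely in the original (2-D) input space.
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3 for the kernel K(x, z) = (x . z)^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    """Same quantity computed without ever mapping to feature space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))  # inner product after explicit mapping: 16.0
print(poly_kernel(x, z))       # identical value via the kernel trick: 16.0
```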

Kernel Selection
  • In general, the RBF kernel is a reasonable first choice. 
  • A recent result shows that if RBF is used with model selection, then there is no need to consider the linear kernel. 
  • The kernel matrix obtained with the sigmoid kernel may not be positive definite, and in general its accuracy is not better than that of the RBF kernel.
  • Polynomial kernels are acceptable, but if a high degree is used, numerical difficulties tend to arise.
  • In some cases the RBF kernel is not used, namely when the number of examples is smaller than the number of features/attributes.
    • For example, suppose we have two data sets, one for training and one for testing, where the training set contains 38 instances, the test set contains 34 instances, and the number of features is 7,129. With the RBF kernel the accuracy is 71.05%, while with the linear kernel it is 94.73%. In such cases the linear kernel is more accurate than the RBF kernel.

3. Grid Search

The de facto standard way of performing hyperparameter optimization is grid search, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set. Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search.

For example, a typical soft-margin SVM classifier equipped with an RBF kernel has at least two hyperparameters that need to be tuned for good performance on unseen data: a regularization constant C and a kernel hyperparameter gamma. Both parameters are continuous, so to perform grid search one selects a finite set of "reasonable" values for each, say C in {10, 100, 1000} and gamma in {0.1, 0.2, 0.5, 1.0}, or, following the paper "CENTRIST: A Visual Descriptor for Scene Categorization", log2(C) in [-5, 15] and log2(gamma) in [-11, 3] with a grid step size of 2. Grid search then trains an SVM with each pair (C, gamma) in the cross-product of these two sets and evaluates its performance on a held-out validation set (or by internal cross-validation on the training set, in which case multiple SVMs are trained per pair). Finally, the grid search algorithm outputs the settings that achieved the highest score in the validation procedure.

Grid search suffers from the curse of dimensionality, but it is often embarrassingly parallel because the hyperparameter settings it evaluates are typically independent of each other. Note that the Python package sklearn implements grid search in its GridSearchCV class.
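A minimal sketch of this procedure using scikit-learn's GridSearchCV (the data set is synthetic and the grid values simply follow the example above):

```python
# Sketch: grid search over (C, gamma) for an RBF-kernel soft-margin SVM.
# Assumptions: scikit-learn is installed; the data is synthetic/illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "C": [10, 100, 1000],           # regularization constant
    "gamma": [0.1, 0.2, 0.5, 1.0],  # RBF kernel hyperparameter
}

# 5-fold cross-validation on the training set guides the search;
# n_jobs=-1 exploits the embarrassingly parallel structure of grid search.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```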

4. Feature Selection for SVM

To obtain a sparse representation of the SVM (i.e., a sparse SVM), researchers have proposed many approaches to select features from the original input space.
  • Recursive Feature Elimination (RFE) was proposed for feature selection. SVM-RFE can obtain nested subsets of input features and has shown state-of-the-art performance on gene selection in microarray data analysis (see Guyon et al., "Gene selection for cancer classification using support vector machines"). Its drawback is that the nested "monotonic" feature selection scheme may be suboptimal in identifying the most informative subset of input features. A small sketch of this approach is given after this list.
  • L1-norm regularization, i.e., using ||w||_1 as the regularizer. The resulting problem is convex and can be solved by Linear Programming (LP) solvers or Newton's method.
  • L0-norm regularization. Weston et al. (2003) ["Use of the zero-norm with linear models and kernel methods"] propose an approximation of zero-norm minimization to solve the sparse SVM with ||w||_0 as the regularizer. However, the resulting optimization problem is non-convex and may suffer from local minima.
  • Feature Generating Machine (FGM). It adopts a cutting-plane algorithm: it iteratively generates a pool of violated sparse feature subsets and then combines them via an efficient Multiple Kernel Learning (MKL) algorithm. FGM shows great scalability for non-monotonic feature selection on large-scale and very high-dimensional datasets.
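The sketch below shows two of these approaches with scikit-learn: SVM-RFE via the RFE wrapper around a linear SVM, and an L1-regularized linear SVM (the synthetic data and the choice of 10 selected features are illustrative assumptions, not values from the papers above):

```python
# Sketch: SVM-RFE and L1-norm feature selection for a sparse SVM.
# Assumptions: scikit-learn is installed; data and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# SVM-RFE: recursively drop the feature with the smallest linear-SVM weight
# until 10 features remain, yielding nested ("monotonic") feature subsets.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=1)
rfe.fit(X, y)
print("RFE-selected features:", rfe.get_support(indices=True))

# L1-norm regularization: ||w||_1 drives many weights exactly to zero,
# so feature selection is embedded in the training itself.
l1_svm = LinearSVC(penalty="l1", dual=False, C=0.1).fit(X, y)
print("non-zero L1-SVM weights:", int((l1_svm.coef_ != 0).sum()))
```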

5. Further Reading
