Study notes for Support Vector Machines (2)

SVM is a method for the classification of both linear and nonlinear data, i.e., linearly separable and nonlinearly separable data. In addition, SVMs can be used for numeric prediction. Put simply, an SVM uses a nonlinear mapping to transform the original data into a higher dimension, in which it searches for the maximum marginal hyperplane (i.e., a "decision boundary") separating the data of one class from another. For linear data, the mapping may not be necessary. The SVM finds this hyperplane using support vectors (the essential or critical training tuples that lie closest to the hyperplane, hence the most difficult tuples to classify and the ones that give the most information about the classification) and margins (defined by the support vectors). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.
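As a minimal sketch of these ideas (assuming scikit-learn is installed; the tiny 2-D data set below is made up purely for illustration), the following code fits a linear SVM and inspects the support vectors that define the maximum marginal hyperplane:

```python
# Minimal sketch: fit a linear SVM and inspect its support vectors.
# Assumptions: scikit-learn is available; the toy data is illustrative only.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2-D
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class 0
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The critical training tuples that lie closest to the decision boundary
print("support vectors:\n", clf.support_vectors_)
# The separating hyperplane w.x + b = 0 found from these support vectors
print("w =", clf.coef_, "b =", clf.intercept_)
```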

1. Pros and Cons

Pros:
  1. SVMs are highly accurate and less prone to overfitting than many other methods, because the complexity of the learned classifier is characterized by the number of support vectors rather than by the dimensionality of the data;
  2. SVM training always finds a global solution, unlike neural networks trained with backpropagation, where many local minima usually exist.
Cons:
  1. How to improve the speed of training and testing so as to scale to very large data sets (e.g., millions of support vectors);
  2. How to select the best kernel function and how to find better solutions for multiclass classification.

2. Kernels and Kernel Methods

The kernel function is used to speed up training by avoiding the explicit dot product in the higher-dimensional space and computing it in the original data space instead. In other words, we may not need to perform the mapping explicitly in practice.
  • Commonly used admissible kernel functions are: the polynomial kernel of degree h, the Gaussian radial basis function (RBF) kernel, the sigmoid kernel, and the linear kernel.
  • There are no golden rules for determining which admissible kernel will result in the most accurate SVM.
  • In practice, the kernel chosen does not generally make a large difference in the resulting accuracy. An SVM with a Gaussian radial basis function (RBF) kernel gives the same decision hyperplane as a type of neural network known as a radial basis function network. An SVM with a sigmoid kernel is equivalent to a simple two-layer neural network known as a multilayer perceptron (with no hidden layers).
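As a small sketch of how these kernels are typically compared in practice (assuming scikit-learn; the synthetic data and parameter values are illustrative, not taken from the text above):

```python
# Sketch: train one SVM per kernel on the same data and compare accuracy.
# Assumptions: scikit-learn is installed; data and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale").fit(X_tr, y_tr)
    print(f"{kernel:8s} test accuracy: {clf.score(X_te, y_te):.3f}")
```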
Kernel Methods

Kernel Methods (KMs) approach the problem (pattern analysis) by mapping the data into a high dimensional feature space, where each coordinate corresponds to one feature of the data items, transforming the data into a set of points in a Euclidean space. In that space, a variety of methods can be used to find relations in the data. Since the mapping can be quite general (not necessarily linear, for example), the relations found in this way are accordingly very general. KMs owe their name to the use of kernel functions, which enable them to operate in the feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. This approach is called the kernel trick. Kernel functions have been introduced for sequence data, graphs, text, images, as well as vectors.
Blog (in Chinese): the main idea of kernel methods (核方法的主要思想)
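The kernel trick can be illustrated with a tiny numerical sketch (the degree-2 polynomial kernel and the explicit feature map phi below are standard textbook examples, written out here only to show that the two computations agree):

```python
# Kernel trick sketch: the kernel value equals an inner product in feature
# space, but is computed entirely in the original (2-D) input space.
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3 for the kernel K(x, z) = (x . z)^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    """Same quantity computed without ever mapping to feature space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))  # inner product after explicit mapping: 16.0
print(poly_kernel(x, z))       # identical value via the kernel trick: 16.0
```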

Kernel Selection
  • In general, the RBF kernel is a reasonable first choice. 
  • A recent result shows that if RBF is used with model selection, then there is no need to consider the linear kernel. 
  • The kernel matrix obtained with the sigmoid kernel may not be positive definite, and in general its accuracy is not better than that of the RBF kernel.
  • Polynomial kernels are acceptable, but if a high degree is used, numerical difficulties tend to arise.
  • In some cases the RBF kernel is not used, namely when the number of examples is smaller than the number of features/attributes.
    • For example, suppose we have two data sets, one for training and one for testing, where the training set contains 38 instances, the test set contains 34 instances, and the number of features is 7,129. With the RBF kernel the accuracy is 71.05%, while with the linear kernel it is 94.73%. In such cases the linear kernel is more accurate than the RBF kernel.

3. Grid Search

The de facto standard way of performing hyperparameter optimization is grid search, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set. Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search.

For example, a typical soft-margin SVM classifier equipped with an RBF kernel has at least two hyperparameters that need to be tuned for good performance on unseen data: a regularization constant C and a kernel hyperparameter gamma. Both parameters are continuous, so to perform grid search one selects a finite set of "reasonable" values for each, say C in {10, 100, 1000} and gamma in {0.1, 0.2, 0.5, 1.0}, or, following the paper "CENTRIST: A Visual Descriptor for Scene Categorization", log2(C) in [-5, 15] and log2(gamma) in [-11, 3] with a grid step size of 2. Grid search then trains an SVM with each pair (C, gamma) in the cross-product of these two sets and evaluates its performance on a held-out validation set (or by internal cross-validation on the training set, in which case multiple SVMs are trained per pair). Finally, the grid search algorithm outputs the settings that achieved the highest score in the validation procedure.

Grid search suffers from the curse of dimensionality, but it is often embarrassingly parallel because the hyperparameter settings it evaluates are typically independent of each other. Note that the Python package sklearn implements grid search in its GridSearchCV class.
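A minimal sketch of this procedure using scikit-learn's GridSearchCV (the data set is synthetic and the grid values simply follow the example above):

```python
# Sketch: grid search over (C, gamma) for an RBF-kernel soft-margin SVM.
# Assumptions: scikit-learn is installed; the data is synthetic/illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "C": [10, 100, 1000],           # regularization constant
    "gamma": [0.1, 0.2, 0.5, 1.0],  # RBF kernel hyperparameter
}

# 5-fold cross-validation on the training set guides the search;
# n_jobs=-1 exploits the embarrassingly parallel structure of grid search.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```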

4. Feature Selection for SVM

To obtain a sparse representation of the SVM (i.e., a sparse SVM), researchers have proposed many approaches to select features from the original input space.
  • Recursive Feature Elimination (RFE) was proposed for feature selection. SVM-RFE can obtain nested subsets of input features and has shown state-of-the-art performance on gene selection in microarray data analysis (see Guyon et al., "Gene selection for cancer classification using support vector machines"). Its drawback is that the nested "monotonic" feature selection scheme may be suboptimal in identifying the most informative subset of input features. A small sketch of this approach is given after this list.
  • L1-norm regularization, i.e., using ||w||_1 as the regularizer. The resulting problem is convex and can be solved by Linear Programming (LP) solvers or Newton's method.
  • L0-norm regularization. Weston et al. (2003) ["Use of the zero-norm with linear models and kernel methods"] propose an approximation of zero-norm minimization to solve the sparse SVM with ||w||_0 as the regularizer. However, the resulting optimization problem is non-convex and may suffer from local minima.
  • Feature Generating Machine (FGM). It adopts a cutting-plane algorithm: it iteratively generates a pool of violated sparse feature subsets and then combines them via an efficient Multiple Kernel Learning (MKL) algorithm. FGM shows great scalability for non-monotonic feature selection on large-scale and very high-dimensional datasets.
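The sketch below shows two of these approaches with scikit-learn: SVM-RFE via the RFE wrapper around a linear SVM, and an L1-regularized linear SVM (the synthetic data and the choice of 10 selected features are illustrative assumptions, not values from the papers above):

```python
# Sketch: SVM-RFE and L1-norm feature selection for a sparse SVM.
# Assumptions: scikit-learn is installed; data and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# SVM-RFE: recursively drop the feature with the smallest linear-SVM weight
# until 10 features remain, yielding nested ("monotonic") feature subsets.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=1)
rfe.fit(X, y)
print("RFE-selected features:", rfe.get_support(indices=True))

# L1-norm regularization: ||w||_1 drives many weights exactly to zero,
# so feature selection is embedded in the training itself.
l1_svm = LinearSVC(penalty="l1", dual=False, C=0.1).fit(X, y)
print("non-zero L1-SVM weights:", int((l1_svm.coef_ != 0).sum()))
```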

5. Further Reading
