Summary
A support vector machine (SVM) is a supervised learning model used for classification and regression analysis.
Given a set of labeled training examples, an SVM training algorithm builds a model that assigns new examples to one category or the other.
An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a gap that is as wide as possible.
SVMs can perform linear classification, and can also perform non-linear classification using the kernel trick.
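As a concrete illustration, the following minimal sketch trains a linear SVM classifier on a toy two-class dataset using scikit-learn (an assumed implementation choice; the data points are made up):

```python
# Minimal sketch: linear SVM classification with scikit-learn's SVC.
# The dataset below is an illustrative assumption, not from the text.
from sklearn.svm import SVC

# Two tiny, clearly separated classes in the plane.
X = [[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X, y)

# New examples are assigned to one of the two categories.
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))  # → [0 1]
```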
Definition
SVM constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.
A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class, since in general the larger the margin, the lower the generalization error of the classifier.
Often the sets to discriminate are not linearly separable in a finite-dimensional space. For this reason, the original finite-dimensional space is mapped into a much higher-dimensional space, presumably making the separation easier there.
By defining a kernel function $k(x, y)$ suited to the problem, the mappings are designed so that dot products in the high-dimensional space can be computed easily in terms of the variables in the original space.
The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant.
The vectors defining the hyperplanes can be chosen to be linear combinations, with parameters $\alpha_i$, of images of feature vectors that occur in the data.
With this choice of hyperplane, the points $x$ in the feature space that are mapped into the hyperplane are defined by the relation:
$\sum_i \alpha_i\, k(x_i, x) = \text{constant}.$
If $k(x, y)$ becomes small as $y$ grows farther away from $x$, each term in the sum measures the closeness of the test point $x$ to the data point $x_i$; in this way, the sum of kernels can be used to measure the relative nearness of each test point to the data points originating in the sets to be discriminated.
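This nearness idea can be sketched numerically; the Gaussian kernel and the data points below are illustrative assumptions:

```python
import numpy as np

# Sketch: the kernel sum  sum_i k(x_i, x)  shrinks as the test point x
# moves away from the data points (Gaussian kernel, made-up points).
def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def kernel_sum(x):
    return sum(rbf(xi, x) for xi in data)

near = kernel_sum(np.array([0.3, 0.3]))  # close to the data set
far = kernel_sum(np.array([5.0, 5.0]))   # far from the data set
print(near, far)  # near is much larger than far
```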
Motivation
The goal is to decide which class a new data point will be in.
Consider three candidate hyperplanes: H1 does not separate the classes; H2 does, but only with a small margin; H3 separates them with the maximum margin.
We choose the hyperplane so that the distance from it to the nearest data point on each side is maximized.
If such a hyperplane exists, it is known as the maximum-margin hyperplane.
Linear SVM
Given some training data $\mathcal{D}$, a set of $n$ points of the form
$\mathcal{D} = \{ (x_i, y_i) \mid x_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\} \}_{i=1}^{n},$
where $y_i$ is either 1 or $-1$, indicating the class to which the point $x_i$ belongs. Each $x_i$ is a $p$-dimensional real vector.
Any hyperplane can be written as the set of points $x$ satisfying
$w \cdot x - b = 0,$
where $w$ is the normal vector to the hyperplane and the parameter $\tfrac{b}{\|w\|}$ determines the offset of the hyperplane from the origin along $w$.
(Figure: maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.)
If the training data are linearly separable, we can select two parallel hyperplanes that separate the two classes of points, described by the equations
$w \cdot x - b = 1$
and
$w \cdot x - b = -1.$
The distance between these two hyperplanes is $\tfrac{2}{\|w\|}$, so maximizing the margin means minimizing $\|w\|$. In order to prevent data points from falling into the margin, we add the following constraint: for each $i$, either
$w \cdot x_i - b \ge 1$ for $x_i$ of the first class,
or
$w \cdot x_i - b \le -1$ for $x_i$ of the second.
This can be rewritten as:
$y_i (w \cdot x_i - b) \ge 1, \quad \text{for all } 1 \le i \le n. \qquad (1)$
Putting this together, we get the optimization problem:
Minimize (in $w, b$)
$\|w\|$
subject to (for any $i = 1, \dots, n$)
$y_i (w \cdot x_i - b) \ge 1.$
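The constraint $y_i (w \cdot x_i - b) \ge 1$ can be checked numerically for a hand-picked hyperplane; the weight vector, offset, and toy points below are assumptions for illustration, not a solved optimum:

```python
import numpy as np

# Sketch: checking the hard-margin constraints y_i (w . x_i - b) >= 1
# for a hand-picked feasible hyperplane (assumed values, not an optimum).
w = np.array([1.0, 0.0])
b = 2.0  # the vertical line x1 = 2 separates the two toy classes below

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1, -1, 1, 1])

margins = y * (X @ w - b)  # each entry must be >= 1 for feasibility
print(margins)             # → [2. 1. 1. 2.]
print(np.all(margins >= 1))
```

Points with margin exactly 1 lie on the margin hyperplanes; at the optimum these would be the support vectors.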
Primal form
The optimization problem above is difficult to solve because it depends on $\|w\|$, which involves a square root. For mathematical convenience it is possible to substitute $\|w\|$ with $\tfrac{1}{2}\|w\|^2$ without changing the solution. This is a quadratic programming optimization problem:
Minimize (in $w, b$)
$\tfrac{1}{2}\|w\|^2$
subject to (for any $i = 1, \dots, n$)
$y_i (w \cdot x_i - b) \ge 1.$
By introducing Lagrange multipliers $\alpha$, the previous constrained problem can be expressed as
$\min_{w, b} \max_{\alpha \ge 0} \left\{ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right] \right\}.$
The Karush–Kuhn–Tucker (KKT) conditions imply that the solution can be expressed as a linear combination of the training vectors:
$w = \sum_{i=1}^{n} \alpha_i y_i x_i.$
Only a few $\alpha_i$ will be greater than zero. The corresponding $x_i$ are exactly the support vectors, which lie on the margin and satisfy $y_i (w \cdot x_i - b) = 1$. From this one can derive that the support vectors also satisfy
$w \cdot x_i - b = 1 / y_i = y_i \iff b = w \cdot x_i - y_i,$
which allows one to define the offset $b$. In practice, it is more robust to average over all $N_{SV}$ support vectors:
$b = \frac{1}{N_{SV}} \sum_{i=1}^{N_{SV}} (w \cdot x_i - y_i).$
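Averaging over support vectors can be sketched as follows; the values of $w$ and the support vectors are assumed toy inputs:

```python
import numpy as np

# Sketch: recover the offset b by averaging  w . x_i - y_i  over the
# support vectors (w and the points below are illustrative assumptions).
w = np.array([1.0, 0.0])
sv_X = np.array([[1.0, 1.0], [3.0, 0.0]])  # points lying on the margins
sv_y = np.array([-1, 1])

b = np.mean(sv_X @ w - sv_y)
print(b)  # → 2.0
```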
Dual form
Writing the classification rule in its unconstrained dual form reveals that the maximum-margin hyperplane, and therefore the classification task, is a function only of the support vectors, i.e. the training data that lie on the margin.
Using the fact that $\|w\|^2 = w \cdot w$ and substituting $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, one can show that the dual of the SVM reduces to the following optimization problem:
Maximize (in $\alpha_i$)
$\tilde{L}(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^{\mathsf T} x_j = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$
subject to (for any $i = 1, \dots, n$)
$\alpha_i \ge 0,$
and to the constraint from the minimization in $b$:
$\sum_{i=1}^{n} \alpha_i y_i = 0.$
Here the kernel is defined by $k(x_i, x_j) = x_i \cdot x_j$.
$w$ can be computed thanks to the $\alpha$ terms:
$w = \sum_i \alpha_i y_i x_i.$
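Recovering $w$ from the multipliers can be sketched as follows (the $\alpha$ values below are assumed for illustration, not obtained from a solver):

```python
import numpy as np

# Sketch: given dual multipliers alpha_i, the normal vector is
# w = sum_i alpha_i y_i x_i  (alpha values below are assumptions).
X = np.array([[1.0, 1.0], [3.0, 0.0]])  # support vectors only
y = np.array([-1.0, 1.0])
alpha = np.array([0.4, 0.4])

w = (alpha * y) @ X
print(w)  # → [ 0.8 -0.4]
```

Non-support vectors have $\alpha_i = 0$ and contribute nothing to the sum, which is why they are omitted here.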
Soft Margin
If there exists no hyperplane that can split the two classes of examples, the soft margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples.
The method introduces non-negative slack variables $\xi_i$, which measure the degree of misclassification of the datum $x_i$:
$y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad 1 \le i \le n. \qquad (2)$
The objective function is then increased by a term that penalizes non-zero $\xi_i$, and the optimization becomes a trade-off between a large margin and a small error penalty. With a linear penalty function, the problem becomes:
Minimize (in $w, \xi, b$)
$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
subject to (for any $i = 1, \dots, n$)
$y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$
The constraint in (2), along with the objective of minimizing $\|w\|$, can be handled using Lagrange multipliers as done above. One then has to solve the following problem:
$\min_{w, \xi, b} \max_{\alpha, \beta} \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i \right\}$
with $\alpha_i, \beta_i \ge 0$.
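The slack variables and the soft-margin objective can be computed directly for a fixed hyperplane; $w$, $b$, $C$, and the points below are illustrative assumptions:

```python
import numpy as np

# Sketch: slack variables xi_i = max(0, 1 - y_i (w . x_i - b)) and the
# soft-margin objective for an assumed fixed hyperplane; C trades margin
# width against the total error penalty.
w = np.array([1.0, 0.0])
b = 2.0
C = 1.0

X = np.array([[0.0, 0.0], [1.9, 0.0], [3.0, 0.0], [2.5, 1.0]])
y = np.array([-1, -1, 1, 1])

xi = np.maximum(0.0, 1.0 - y * (X @ w - b))  # how far inside the margin
objective = 0.5 * np.dot(w, w) + C * np.sum(xi)
print(xi, objective)  # points with xi > 0 violate the hard margin
```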
Dual form
Maximize (in $\alpha_i$)
$\tilde{L}(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$
subject to (for any $i = 1, \dots, n$)
$0 \le \alpha_i \le C,$
and
$\sum_{i=1}^{n} \alpha_i y_i = 0.$
The key advantage of a linear penalty function is that the slack variables vanish from the dual problem, with the constant $C$ appearing only as an additional constraint on the Lagrange multipliers.
Non-linear classification
To obtain a non-linear classifier, the kernel trick is applied to the maximum-margin hyperplane: the resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space.
Some common kernels include:
- Polynomial (homogeneous): $k(x_i, x_j) = (x_i \cdot x_j)^d$
- Polynomial (inhomogeneous): $k(x_i, x_j) = (x_i \cdot x_j + 1)^d$
- Gaussian radial basis function: $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, for $\gamma > 0$. Sometimes parametrized using $\gamma = 1/(2\sigma^2)$.
- Hyperbolic tangent: $k(x_i, x_j) = \tanh(\kappa\, x_i \cdot x_j + c)$, for some (not every) $\kappa > 0$ and $c < 0$.
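Minimal NumPy sketches of these kernels follow (the parameter values $d$, $\gamma$, $\kappa$, $c$ are illustrative assumptions):

```python
import numpy as np

# Sketch implementations of the common kernels listed above.
def poly_homogeneous(x, y, d=2):
    return np.dot(x, y) ** d

def poly_inhomogeneous(x, y, d=2):
    return (np.dot(x, y) + 1) ** d

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def tanh_kernel(x, y, kappa=1.0, c=-1.0):
    return np.tanh(kappa * np.dot(x, y) + c)

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(poly_homogeneous(a, b))    # → 4.0
print(poly_inhomogeneous(a, b))  # → 9.0
print(rbf(a, b), tanh_kernel(a, b))
```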
Properties
The effectiveness of an SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter $C$. The best combination of parameters is often selected by a grid search with cross-validation.
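A hedged sketch of such a parameter search using scikit-learn's GridSearchCV (the grid values and the synthetic dataset are assumptions):

```python
# Sketch: selecting the kernel and C by cross-validated grid search,
# using scikit-learn (an assumed implementation choice).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

grid = GridSearchCV(
    SVC(),
    {"kernel": ["linear", "rbf"], "C": [0.1, 1.0, 10.0]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```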
Implementation
The parameters of the maximum-margin hyperplane are derived by solving the optimization. There exist several specialized algorithms for quickly solving the quadratic programming problem that arises from SVMs.
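As a sketch, the hard-margin dual can be solved for a tiny dataset with a general-purpose solver (SciPy's SLSQP here, an assumed choice; production implementations use specialized QP or SMO solvers):

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: numerically maximizing the hard-margin dual
#   L(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i . x_j
# subject to alpha_i >= 0 and sum_i alpha_i y_i = 0 (toy data below).
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
G = y[:, None] * X
Q = G @ G.T  # Q_ij = y_i y_j x_i . x_j

def neg_dual(alpha):  # minimize the negated dual objective
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(len(y)),
    bounds=[(0.0, None)] * len(y),                     # alpha_i >= 0
    constraints={"type": "eq", "fun": lambda a: a @ y},  # sum alpha_i y_i = 0
)
alpha = res.x
w = (alpha * y) @ X  # recover the normal vector from the multipliers
print(w)
```

For this toy dataset only the two nearest opposite-class points end up with non-zero multipliers, recovering the maximum-margin normal vector.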