Classifier Reviews - #1
SVM (Support Vector Machine)
1. Concept
SVM is a classifier: for two sets of linearly separable data, it finds a line (more generally, a hyperplane) that separates them. This line is special because it sits right in the middle of the two sets: it is the line with the largest distance (the margin) to the closest data points. Because the margin is maximal, the line still works well when out-of-sample data are added to the sets.
Removing a data point that lies on the margin (a support vector) will change the decision boundary; removing any other point will not.
Datasets with a clear classification boundary work best with SVMs.
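As a minimal sketch of these ideas (assuming scikit-learn is available; the data is made up for illustration), `SVC` with a linear kernel exposes exactly which points act as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in 2-D.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - [3, 3],   # class 0
               rng.randn(20, 2) + [3, 3]])  # class 1
y = np.array([0] * 20 + [1] * 20)

# With separable data, a large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only the support vectors pin down the maximal-margin line;
# removing any non-support-vector point leaves it unchanged.
print("support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```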
2. Parameters
1. C
C is the misclassification penalty.
The higher the C, the less tolerance for misclassification, which may result in over-fitting.
The lower the C, the more tolerance for misclassification, which may result in under-fitting.
A C that is too big or too small lowers the model's ability to generalize.
Hard margin: the SVM allows very little classification error (large C).
Soft margin: also called a noisy linear SVM; it tolerates some misclassified points (small C).
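A rough sketch of this trade-off (the C values here are arbitrary picks, not recommendations), using overlapping blobs so that some misclassification is unavoidable:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters: a hard margin is impossible here.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: soft margin, tolerates violations, wider margin.
    # Large C: approaches a hard margin, fits the training set tightly.
    print(f"C={C:>6}: train accuracy={clf.score(X, y):.3f}, "
          f"support vectors={clf.support_vectors_.shape[0]}")
```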
2. Kernel
The selection of the kernel in SVM is very important, especially for data that is not linearly separable. The goal is to project the linearly inseparable data into a high-dimensional feature space where it becomes linearly separable. We denote this projection by $\Phi(x)$.
During optimization, inner products of the form $\Phi(x_i) \cdot \Phi(x_j)$ appear, and computing them in the high-dimensional space would be very expensive. So we introduce a kernel function $k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ that evaluates the same inner product directly in the input space, which is much faster.
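As a sanity check of the kernel trick (plain NumPy; the hand-written feature map is only for this illustration): for the degree-2 polynomial kernel, evaluating $k(x, z) = ((x \cdot z) + 1)^2$ in the original 2-D space gives the same number as first projecting into the 6-D feature space and taking the inner product there:

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

def phi(v):
    """Explicit feature map whose inner product equals ((x . z) + 1)^2."""
    v1, v2 = v
    return np.array([1.0,
                     np.sqrt(2) * v1, np.sqrt(2) * v2,
                     v1 ** 2, v2 ** 2,
                     np.sqrt(2) * v1 * v2])

explicit = phi(x) @ phi(z)      # inner product in the 6-D feature space
kernel = (x @ z + 1) ** 2       # same value, computed in the 2-D input space

print(explicit, kernel)         # both are 25.0
```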
Here are some kernels that are often used in SVM:
| Name | Usage | Function |
| --- | --- | --- |
| Linear kernel | Mainly used for linearly separable data, and also when there is a large number of features | $k(x, x_i) = x \cdot x_i$ |
| Polynomial kernel | Can achieve the projection, but has many parameters; when the degree $d$ is high, entries of the kernel matrix get close to zero and the computational complexity becomes huge | $k(x, x_i) = ((x \cdot x_i) + 1)^d$ |
| RBF kernel | For linearly inseparable data; few parameters; suited to a normal number of samples with a smaller number of features; when you don't know what to use, try this one first (the most used one) | $k(x, x_i) = \exp\left(-\frac{\lVert x - x_i \rVert^2}{\sigma^2}\right)$ |
| Sigmoid kernel | Makes the SVM behave like a neural network (a multilayer perceptron) | $k(x, x_i) = \tanh(\kappa (x \cdot x_i) + c)$ |
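To make the table concrete, here is a short sketch (dataset and default settings are arbitrary choices) comparing the four kernels on data that is not linearly separable:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(f"{kernel:>8}: test accuracy = {clf.score(X_test, y_test):.3f}")
```

On this kind of data the RBF kernel usually comes out ahead, matching the "try this one first" advice in the table.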
3. gamma
gamma is a parameter of the RBF kernel; in the formula above it plays the role of $1/\sigma^2$, so a larger gamma means each training example's influence reaches less far.
The greater the gamma, the fewer the support vectors; the smaller the gamma, the more support vectors. The number of support vectors affects the training and prediction speed.
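One way to observe this (a sketch with arbitrary gamma values; in scikit-learn's parameterization, `gamma` corresponds to $1/\sigma^2$ in the RBF formula above) is to count support vectors as gamma varies:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for gamma in (0.01, 0.1, 1.0, 10.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    # Prediction sums one kernel term per support vector, so the
    # support-vector count directly drives prediction speed.
    print(f"gamma={gamma:>6}: support vectors={clf.n_support_.sum()}, "
          f"train accuracy={clf.score(X, y):.3f}")
```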