Table of Contents
1. Large margin classification
1.1 Optimization objective
1.2 Large Margin Intuition
1.3 Mathematics Behind Large Margin Classification
2. Kernels
2.1 Kernels I
2.2 Kernels II
3. SVMs in practice
1. Large margin classification
1.1 Optimization objective
An alternative view of logistic regression (LR):
fig. 1
(from Coursera Week 7, Optimization objective)
If y = 1, we want h(x) approximately equal to 1, θ^T * x >> 0
If y = 0, we want h(x) approximately equal to 0, θ^T * x << 0
if y = 1, want θ^T * x >> 0
fig. 2
(from Coursera Week 7, Optimization objective)
if y = 0, want θ^T * x << 0
fig. 3
(from Coursera Week 7, Optimization objective)
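The curves in fig. 2 and fig. 3 are flat at zero on one side and linear on the other; up to a scaling constant they can be written as
cost_1(z) = 0 for z >= 1, growing linearly as z decreases below 1 (roughly max(0, 1 - z))
cost_0(z) = 0 for z <= -1, growing linearly as z increases above -1 (roughly max(0, 1 + z))
where z = θ^T * x.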
Support Vector Machine:
LR:
SVM:
hypothesis:
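Written out (a reconstruction of the formulas on the slides, in the same notation as above):

SVM training objective:
min_θ C * Σ_{i=1}^{m} [ y^(i) * cost_1(θ^T * x^(i)) + (1 - y^(i)) * cost_0(θ^T * x^(i)) ] + (1/2) * Σ_{j=1}^{n} θ_j^2

Compared with LR, the two log-cost terms are replaced by cost_1 and cost_0, the 1/m factor is dropped, and the regularization trade-off is re-parameterized with C (which plays a role similar to 1/λ).

Hypothesis (the SVM outputs a label directly, not a probability):
h_θ(x) = 1 if θ^T * x >= 0, and 0 otherwise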
1.2 Large Margin Intuition
SVM:
fig. 4
(from Coursera Week 7, Large Margin Intuition)
SVM Decision Boundary:
whenever y^i = 1:
θ^T * x^i >= 1
whenever y^i = 0:
θ^T * x^i <= -1
linearly separable case:
fig. 5
(from Coursera Week 7, Large Margin Intuition)
The SVM will choose the black decision boundary, because it separates the classes more robustly and has a large margin. For this reason the SVM is sometimes called a Large Margin classifier.
Large Margin Classifier in the presence of outliers:
fig. 6
(from Coursera Week 7, Large Margin Intuition)
1.3 Mathematics Behind Large Margin Classification
Vector Inner Product:
inner product = u^T * v = u_1 * v_1 + u_2 * v_2 = p * ||u||, where p is the signed length of the projection of v onto u
SVM Decision Boundary:
simplification: θ_0 = 0, n = 2
fig. 7
(from Coursera Week 7, Mathematics Behind Large Margin Classification)
where p^(i) is the projection of x^(i) onto the vector θ. Simplification: with θ_0 = 0, the decision boundary passes through the origin.
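Spelling out the argument: since θ^T * x^(i) = p^(i) * ||θ||, the constraints above become p^(i) * ||θ|| >= 1 whenever y^(i) = 1 and p^(i) * ||θ|| <= -1 whenever y^(i) = 0. The objective (1/2) * Σ θ_j^2 = (1/2) * ||θ||^2 wants ||θ|| small, and the only way to satisfy the constraints with a small ||θ|| is to make the projections p^(i) large in magnitude, i.e. to pick a boundary with a large margin.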
E.g.
fig. 8
(from Coursera Week 7, Mathematics Behind Large Margin Classification)
fig. 9
(from Coursera Week 7, Mathematics Behind Large Margin Classification)
The SVM will choose the boundary in fig. 9, because the projections p^(i) onto θ are larger there, which allows ||θ|| to be smaller.
2. Kernels
2.1 Kernels I
Non-linear Decision Boundary:
fig. 10
(from Coursera Week 7, Kernels I)
Is there a different/better choice of features?
Kernels:
Given x, compute new features depending on proximity to landmarks l_1, l_2, l_3
fig. 11
(from Coursera Week 7, Kernels I)
Given x:
f_1 = similarity(x, l_1) = exp(-||x - l_1||^2 / (2σ^2)); this similarity function is called the Gaussian kernel
note: if x is approximately equal to l_1, then f_1 is approximately 1
if x is far from l_1, then f_1 is approximately 0
fig. 12
(from Coursera Week 7, Kernels I)
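To make the landmark feature concrete, here is a minimal NumPy sketch (my own, not from the lecture) of f_i = exp(-||x - l_i||^2 / (2σ^2)):

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    # f = similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

x  = np.array([3.0, 4.0])
l1 = np.array([3.0, 4.0])    # x close to this landmark -> f1 ~= 1
l2 = np.array([10.0, -2.0])  # x far from this landmark -> f2 ~= 0
print(gaussian_kernel(x, l1), gaussian_kernel(x, l2))
```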
2.2 Kernels II
Choose the landmarks:
choose l^(i) = x^(i), i.e. put one landmark at every training example
Given example x, compute f_i = similarity(x, l^(i)) for i = 1, ..., m:
fig. 13
(from Coursera Week 7, Kernels II)
Given the training set (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)):
for training example (x^(i), y^(i)), compute the feature vector f^(i), where f^(i)_j = similarity(x^(i), l^(j))
Hypothesis: given x, compute features f ∈ R^(m+1); predict "y = 1" if θ^T * f >= 0 (θ ∈ R^(m+1))
Training: minimize over θ
C * Σ_{i=1}^{m} [ y^(i) * cost_1(θ^T * f^(i)) + (1 - y^(i)) * cost_0(θ^T * f^(i)) ] + (1/2) * Σ_{j=1}^{m} θ_j^2
note: the sum in the regularization term now runs over the m parameters θ_1, ..., θ_m instead of n
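A small NumPy sketch (my own naming, not from the lecture) of how an example x is mapped to the feature vector f and classified when the landmarks are the training examples:

```python
import numpy as np

def features_from_landmarks(x, landmarks, sigma=1.0):
    # f = [1, f_1, ..., f_m] with f_i = exp(-||x - l^(i)||^2 / (2 * sigma^2))
    f = [1.0]  # intercept feature f_0
    for l in landmarks:
        f.append(np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2)))
    return np.array(f)

# The landmarks are the training examples themselves: l^(i) = x^(i), so m = 3 here.
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
theta = np.array([-0.5, 1.0, 1.0, 1.0])  # θ in R^(m+1); in practice found by training
f = features_from_landmarks(np.array([2.0, 3.0]), X_train)
prediction = 1 if theta @ f >= 0 else 0
print(f, prediction)
```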
SVM parameters:
C (plays a role similar to 1/λ):
large C: lower bias, high variance
small C: higher bias, low variance
σ^2:
large σ^2: features f_i vary more smoothly, so higher bias, lower variance
small σ^2: features f_i vary less smoothly, so lower bias, higher variance
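These two knobs map directly onto common library parameters. A minimal sketch with scikit-learn's SVC (a tooling choice of mine, not part of the lecture): C is passed as-is, and since SVC's RBF kernel is exp(-gamma * ||x - l||^2), one sets gamma = 1 / (2σ^2).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1])

sigma_sq = 2.0
# SVC's RBF kernel is exp(-gamma * ||x - l||^2), so gamma = 1 / (2 * sigma^2).
# Larger C / smaller sigma^2 -> lower bias, higher variance, and vice versa.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2.0 * sigma_sq)).fit(X, y)
print(clf.predict([[4.5, 4.5]]))
```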
3. SVMs in practice
Using an SVM:
Need to specify:
- choice of parameter C
- choice of kernel
E.g. no kernel ("linear kernel") ---- when n is large and m is small
Gaussian kernel (need to choose σ^2) ---- when n is small and m is large
note: Do perform feature scaling before using the Gaussian kernel
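A minimal scikit-learn sketch (the data is made up) of scaling features before the Gaussian kernel:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two features on very different scales (e.g. square feet vs. number of bedrooms):
# without scaling, the first feature would dominate ||x - l||^2 inside the kernel.
X = np.array([[2100.0, 3.0], [1600.0, 2.0], [2400.0, 4.0], [850.0, 1.0]])
y = np.array([1, 0, 1, 0])

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))
model.fit(X, y)
print(model.predict([[2000.0, 3.0]]))
```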
other choice of kernel:
note: Not all similarity functions similarity(x, l) make valid kernels; they need to satisfy a technical condition called "Mercer's Theorem" so that SVM packages' optimizations run correctly and do not diverge.
many off-the-shelf kernels are available:
- Polynomial kernel: k(x, l) = (x^T * l + r)^d
- More esoteric: string kernel, chi-square kernel, histogram intersection kernel, ....
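For illustration, the polynomial kernel above is easy to compute directly (a toy sketch; the values of r and d are arbitrary):

```python
import numpy as np

def polynomial_kernel(x, l, r=1.0, d=3):
    # k(x, l) = (x^T * l + r)^d; the constant r and degree d are kernel parameters
    return (np.dot(x, l) + r) ** d

print(polynomial_kernel(np.array([1.0, 2.0]), np.array([0.5, -1.0])))
```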
Multi-class classification:
fig. 14
(from Coursera Week 7, SVMs in practice)
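If a package does not handle multi-class problems directly, one-vs-all works: train one SVM per class and predict the class whose classifier is most confident. A sketch using scikit-learn's OneVsRestClassifier (my tooling choice, not the lecture's):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0], [8.0, 0.0], [9.0, 1.0]])
y = np.array([0, 0, 1, 1, 2, 2])  # K = 3 classes

# One-vs-all: fit one SVM per class, predict the class whose classifier is most confident.
clf = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)
print(clf.predict([[4.5, 4.5]]))
```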
LR vs. SVMs:
n = # features, m = # training examples
If n is large (relative to m): use LR, or an SVM without a kernel ("linear kernel")
If n is small and m is intermediate: use an SVM with a Gaussian kernel
If n is small and m is large: create/add more features, then use LR or an SVM without a kernel
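As a toy summary, these rules of thumb could be encoded roughly like this (the numeric thresholds are my own illustrative choices, not from the lecture):

```python
def suggest_model(n, m):
    # Rules of thumb from this section; the numeric thresholds are illustrative only.
    if n >= m:            # n large relative to m
        return "LR, or SVM without a kernel (linear kernel)"
    if m <= 10000:        # n small, m intermediate
        return "SVM with a Gaussian kernel"
    return "create more features, then LR or SVM without a kernel"  # n small, m large

print(suggest_model(n=10000, m=1000))
print(suggest_model(n=100, m=5000))
print(suggest_model(n=100, m=500000))
```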