Chapter 7 Support Vector Machines (Large Margin Classifiers)
1 Optimization Objective
- Hypothesis:
$$h_\theta(x)=\begin{cases} 1 & \text{if } \theta^Tx \ge 0\\ 0 & \text{otherwise} \end{cases}$$
- Cost Function:
$$J(\theta)=C\sum_{i=1}^m\left[y^{(i)}\,\text{cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\,\text{cost}_0(\theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^n\theta_j^2$$
$C$ plays a role similar to $\frac{1}{\lambda}$
- Goal:
$$\mathop{\text{minimize}}\limits_{\theta}\ J(\theta)$$
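To make the objective concrete, here is a minimal sketch (not any package's actual API) of $J(\theta)$ in NumPy, assuming the piecewise-linear $\text{cost}_1$/$\text{cost}_0$ surrogates drawn in the lecture:

```python
import numpy as np

def cost1(z):
    # Surrogate cost when y = 1: zero once z >= 1, linear below that.
    return np.maximum(0, 1 - z)

def cost0(z):
    # Surrogate cost when y = 0: zero once z <= -1, linear above that.
    return np.maximum(0, 1 + z)

def svm_cost(theta, X, y, C):
    """J(theta) = C * sum_i [y_i*cost1(theta^T x_i) + (1 - y_i)*cost0(theta^T x_i)]
    + (1/2) * sum_{j>=1} theta_j^2.
    X includes a leading column of ones; theta[0] is the unregularized intercept."""
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)
    return data_term + reg_term
```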
2 Large Margin Intuition
- $$\begin{aligned} &\text{if } y^{(i)}=1\text{, we want } \theta^Tx \ge 1 \text{ (not just } \ge 0\text{)}\\ &\text{if } y^{(i)}=0\text{, we want } \theta^Tx \le -1 \text{ (not just } \le 0\text{)} \end{aligned}$$
3 The Mathematics Behind Large Margins
$$u=\left[\begin{matrix} u_1\\u_2\end{matrix}\right],\quad v=\left[\begin{matrix} v_1\\v_2\end{matrix}\right]$$
- $u^Tv=p \cdot ||u||=u_1v_1+u_2v_2$
$p$: the signed length of the projection of $v$ onto $u$
- $u^Tv=v^Tu$
- $||u||=\sqrt{u_1^2+u_2^2}$
- $\theta_0=0$: this means the decision boundary must pass through the origin $(0,0)$
- The SVM minimizes $||\theta||$ (see the sketch after this list):
$$\begin{aligned} \mathop{\text{minimize}}\limits_{\theta}\ & \frac{1}{2}\sum_{j=1}^n\theta_j^2\\ \text{s.t.}\ \ & p^{(i)} \cdot ||\theta|| \ge 1 && \text{if } y^{(i)}=1\\ & p^{(i)} \cdot ||\theta|| \le -1 && \text{if } y^{(i)}=0 \end{aligned}$$
- Since $||\theta||$ is kept small, the projections $p^{(i)}$ must be large, which is what yields a large margin
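A quick numeric check of the projection identity above, in plain NumPy with arbitrary example vectors:

```python
import numpy as np

# Verify u^T v = p * ||u||, where p is the signed length of the
# projection of v onto u.
u = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])

p = u @ v / np.linalg.norm(u)       # signed projection length of v onto u
print(u @ v)                        # u1*v1 + u2*v2 = 5.0
print(p * np.linalg.norm(u))        # p * ||u||     = 5.0
```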
4 Kernels
- $h_\theta(x)=\theta_1f_1+\theta_2f_2+\cdots+\theta_nf_n$
- Given $x$, compute new features $f$ depending on proximity to landmarks $l$
- Kernel: $f_i=\text{similarity}(x,l^{(i)})$
- Choose the landmarks:
$$l^{(1)}=x^{(1)},\ l^{(2)}=x^{(2)},\ \cdots,\ l^{(m)}=x^{(m)}$$
$$f^{(i)}=\left[\begin{matrix} f_0^{(i)}=1\\ f_1^{(i)}=\text{sim}(x^{(i)},l^{(1)})\\ f_2^{(i)}=\text{sim}(x^{(i)},l^{(2)})\\ \vdots\\ f_i^{(i)}=\text{sim}(x^{(i)},l^{(i)})=e^0=1\\ \vdots\\ f_m^{(i)}=\text{sim}(x^{(i)},l^{(m)}) \end{matrix}\right]$$
4.1 Gaussian Kernel
- $$f_i=\text{similarity}(x,l^{(i)})=\exp\left(-\frac{||x-l^{(i)}||^2}{2\sigma^2}\right)=\exp\left(-\frac{\sum_{j=1}^n(x_j-l_j^{(i)})^2}{2\sigma^2}\right)$$
Despite the name, this has no real connection to the normal distribution
- If $x \approx l^{(i)}$: $f_i \approx 1$
  If $x$ is far from $l^{(i)}$: $f_i \approx 0$
- Perform feature scaling before using the Gaussian kernel
- The bandwidth $\sigma^2$ must be chosen (see the sketch below)
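A minimal sketch of computing the landmark features $f^{(i)}$ with the Gaussian kernel; the function names `gaussian_similarity` and `landmark_features` are made up for illustration:

```python
import numpy as np

def gaussian_similarity(x, l, sigma):
    # f = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

def landmark_features(x, landmarks, sigma):
    """Map one (already feature-scaled) example x to f = [1, f_1, ..., f_m],
    using the training examples themselves as landmarks (l^(i) = x^(i))."""
    f = [gaussian_similarity(x, l, sigma) for l in landmarks]
    return np.concatenate(([1.0], f))   # prepend the intercept feature f_0 = 1
```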
4.2 Linear Kernel
- Means using no similarity function: the raw features $x$ are used directly
- Predict "$y=1$" if $\theta^Tx \ge 0$
5 SVM with Kernels
- Hypothesis: Given $x$, compute features $f\in\mathbb{R}^{m+1}$. Predict "$y=1$" if $\theta^Tf \ge 0$.
- Training:
$$\mathop{\text{min}}\limits_{\theta}\ C\sum_{i=1}^m\left[y^{(i)}\,\text{cost}_1(\theta^Tf^{(i)})+(1-y^{(i)})\,\text{cost}_0(\theta^Tf^{(i)})\right]+\theta^TM\theta$$
where $M$ is a matrix that depends on the chosen kernel; implementations use $\theta^TM\theta$ in place of $\frac{1}{2}\sum_j\theta_j^2$ so the optimization runs more efficiently on large training sets
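In practice one calls an off-the-shelf solver rather than minimizing this objective by hand. A sketch using scikit-learn, whose `rbf` kernel is the Gaussian kernel above with `gamma` $=\frac{1}{2\sigma^2}$; the toy data is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated classes of two points each.
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.1], [2.2, 1.9]])
y = np.array([0, 0, 1, 1])

sigma = 1.0
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X, y)
print(clf.predict([[2.1, 2.0]]))    # -> [1]
```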
6 Effects of the Parameters $C$ and $\sigma^2$
- Large $C$: acts like small $\lambda$, which can lead to overfitting (high variance)
  Small $C$: acts like large $\lambda$, which can lead to underfitting (high bias)
- Large $\sigma^2$: the features $f_i$ vary more smoothly, giving higher bias and lower variance
  Small $\sigma^2$: the features $f_i$ vary less smoothly, giving lower bias and higher variance
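Because of these trade-offs, $C$ and $\sigma^2$ are usually chosen by cross-validation. A sketch with scikit-learn's `GridSearchCV`; the grids are illustrative, and `X_train`/`y_train` are assumed to exist:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],   # small C -> high bias, large C -> high variance
    "gamma": [0.01, 0.1, 1, 10],    # gamma = 1 / (2 * sigma^2)
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train)
# print(search.best_params_)
```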
7 Using an SVM
- Use an SVM software package to solve for the parameters $\theta$
- Need to specify:
(1) Choice of parameter $C$
(2) Choice of kernel (similarity function)
7.1 Other Choices of Kernel
- Not all similarity functions make valid kernels
They need to satisfy a technical condition called "Mercer's Theorem" so that SVM packages' optimizations run correctly and do not diverge
- Many off-the-shelf kernels:
(1) Polynomial kernel
(2) String kernel
(3) Chi-square kernel
(4) Histogram intersection kernel
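For concreteness, sketches of two of the listed kernels as plain functions; the constant `c` and degree `d` of the polynomial kernel are free parameters chosen here for illustration:

```python
import numpy as np

def polynomial_kernel(x, l, c=1.0, d=3):
    # (x^T l + c)^d
    return (x @ l + c) ** d

def histogram_intersection_kernel(x, l):
    # Sum of elementwise minima (for nonnegative histogram features).
    return np.sum(np.minimum(x, l))
```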
7.2 Multi-class classification
- Train $K$ SVMs, one to distinguish $y=i$ from the rest, for $i=1,2,\cdots,K$, and get $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(K)}$. Pick the class $i$ with the largest $(\theta^{(i)})^Tx$ (see the sketch below)
- Or use the multi-class functionality already built into most SVM packages
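A minimal sketch of the one-vs-all prediction step; `Theta` is assumed to be a $(K, n{+}1)$ array whose $i$-th row is $\theta^{(i)}$:

```python
import numpy as np

def predict_multiclass(Theta, x):
    # Score each class with (theta^(i))^T x and pick the largest.
    scores = Theta @ x
    return int(np.argmax(scores))
```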
8 Logistic Regression vs. Support Vector Machines
$n$: number of features
$m$: number of training examples

| Sizes of $n$ and $m$ | Recommended choice |
|---|---|
| $n \gg m$ | logistic regression, or an SVM without a kernel |
| $n$ small, $m$ moderate | an SVM with a Gaussian kernel |
| $n$ small, $m$ large | create/add more features, then logistic regression or an SVM without a kernel |
- A neural network may be very slow to train in any of the above cases
- A key advantage of the SVM is that its cost function is convex, so there are no local minima
9 References
- Andrew Ng, Machine Learning (Coursera)
- Huang Haiguang, Machine Learning notes