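Support Vector Machine (SVM)
1. History of SVM
- The original algorithm was proposed by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963.
- The current standard version (soft margin) was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995.
- Before deep learning took off (2012), SVM was regarded as the most successful and best-performing machine learning algorithm of the preceding decade or so.
An SVM looks for the hyperplane that separates the two classes with the largest possible margin.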
Vector inner product
Let $x$ and $y$ be two $n$-dimensional vectors:
$$
x = \begin{Bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{Bmatrix} \quad\quad\quad\quad\quad y = \begin{Bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{Bmatrix}
$$
Inner product (algebraic form):
$$
x\cdot y = x_1y_1 + x_2y_2 + \dots + x_ny_n
$$
Inner product (geometric form), where $\theta$ is the angle between $x$ and $y$:
$$
x\cdot y = \begin{Vmatrix}x \end{Vmatrix}\begin{Vmatrix}y \end{Vmatrix}\cos(\theta)
$$
Norm:
$$
\begin{Vmatrix}x \end{Vmatrix} = \sqrt {x\cdot x} = \sqrt{x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2}
$$
When ||x|| ≠ 0 and ||y|| ≠ 0, the cosine similarity can be computed:
$$
\cos(\theta) = \frac{x\cdot y}{\begin{Vmatrix}x\end{Vmatrix}\begin{Vmatrix}y \end{Vmatrix}}
$$
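The three formulas above translate almost directly into code. The following is a minimal sketch in plain Python; the function names `dot`, `norm` and `cosine_similarity` and the sample vectors are chosen here for illustration only.

```python
import math


def dot(x, y):
    """Inner product x . y = x_1*y_1 + ... + x_n*y_n."""
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    """Euclidean norm ||x|| = sqrt(x . x)."""
    return math.sqrt(dot(x, x))


def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (||x|| * ||y||), defined only for nonzero vectors."""
    nx, ny = norm(x), norm(y)
    if nx == 0 or ny == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(x, y) / (nx * ny)


# Illustrative vectors (not taken from the text above)
x = [3.0, 4.0]
y = [4.0, 3.0]
print(dot(x, y))                # 24.0
print(norm(x))                  # 5.0
print(cosine_similarity(x, y))  # 0.96
```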
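Some derivations
Let $x_1$ be a point on the positive margin hyperplane and $x_2$ a point on the negative one. Subtracting the two equations and applying the geometric form of the inner product gives the margin width $d$:
$$
\begin{aligned}
w\cdot x_1 + b &= 1 \\
w \cdot x_2 + b &= -1 \\
w \cdot (x_1 - x_2) &= 2 \\
\begin{Vmatrix}w\end{Vmatrix}\begin{Vmatrix}(x_1 - x_2)\end{Vmatrix}\cos(\theta) &= 2\\
\begin{Vmatrix}w\end{Vmatrix}\cdot d &= 2\\
d &= \frac{2}{\begin{Vmatrix}w\end{Vmatrix}}
\end{aligned}
$$
Here $d = \begin{Vmatrix}(x_1 - x_2)\end{Vmatrix}\cos(\theta)$ is the projection of $x_1 - x_2$ onto the direction of $w$, that is, the distance between the two margin hyperplanes.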
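To make the result d = 2/||w|| concrete, here is a small sketch that computes the margin width in two ways: from the closed formula, and by projecting the difference of two margin points onto the direction of w. The weight vector, bias and points are hypothetical values picked so that the points lie exactly on the two margin hyperplanes.

```python
import math


def margin_width(w):
    """Margin width d = 2 / ||w|| from the derivation above."""
    norm_w = math.sqrt(sum(c * c for c in w))
    if norm_w == 0:
        raise ValueError("w must be nonzero")
    return 2.0 / norm_w


def projected_distance(w, p_pos, p_neg):
    """Length of (p_pos - p_neg) projected onto the unit vector w / ||w||."""
    norm_w = math.sqrt(sum(c * c for c in w))
    return sum(wi * (a - b) for wi, a, b in zip(w, p_pos, p_neg)) / norm_w


# Hypothetical example: w = (3, 4), b = 0
w = [3.0, 4.0]
p_pos = [0.12, 0.16]    # satisfies w . x + b = 1
p_neg = [-0.12, -0.16]  # satisfies w . x + b = -1

print(margin_width(w))                      # 0.4
print(projected_distance(w, p_pos, p_neg))  # agrees up to rounding
```

Both numbers coincide because w · (x1 - x2) = 2 for any pair of points taken from the two margin hyperplanes.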
Converting to a convex optimization problem
$$
\begin{aligned}
& w\cdot x + b \geq 1 \;\Rightarrow\; \text{class } y=1\\
& w \cdot x + b \leq -1 \;\Rightarrow\; \text{class } y=-1\\
& \text{so both cases combine into } y(w\cdot x + b) \geq 1
\end{aligned}
$$
Maximizing $d = \frac{2}{\begin{Vmatrix}w\end{Vmatrix}}$ is therefore the same as solving
$$
\min \frac{{\begin{Vmatrix}w\end{Vmatrix}}^2}{2} \quad\quad s.t.\ y(w\cdot x + b) \geq 1 \text{ for every training sample}
$$
Convex optimization problems
- Unconstrained optimization: min f(x)
  Solved with Fermat's theorem: the gradient of f vanishes at an extremum.
- Optimization with equality constraints: min f(x), s.t. h_i(x) = 0, i = 1, 2, …, n
  Solved with the method of Lagrange multipliers:
  $$
  L(x,\lambda) = f(x) + \sum_{i=1}^{n}\lambda_ih_i(x)
  $$
- Optimization with inequality constraints: min f(x), s.t. h_i(x) = 0, i = 1, 2, …, n and g_i(x) ≤ 0, i = 1, 2, …, k
  Solved with the KKT conditions:
  $$
  L(x, \lambda, v) = f(x) + \sum_{i=1}^{k}\lambda_ig_i(x)+\sum_{i=1}^nv_ih_i(x)
  $$
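As a small worked illustration of the equality-constrained case, consider a toy problem that is not part of the SVM derivation itself: minimize $f(x, y) = x^2 + y^2$ subject to $h(x, y) = x + y - 1 = 0$.

$$
\begin{aligned}
L(x, y, \lambda) &= x^2 + y^2 + \lambda(x + y - 1) \\
\frac{\partial L}{\partial x} = 2x + \lambda = 0,\quad \frac{\partial L}{\partial y} = 2y + \lambda = 0 \;&\Rightarrow\; x = y = -\frac{\lambda}{2} \\
x + y - 1 = 0 \;&\Rightarrow\; \lambda = -1,\quad x = y = \frac{1}{2} \\
f\left(\tfrac{1}{2}, \tfrac{1}{2}\right) &= \frac{1}{2}
\end{aligned}
$$

The SVM problem is handled in the same spirit, except that its constraints are inequalities, so the multipliers must additionally be non-negative.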
Generalized Lagrange multiplier method
For the SVM problem, with multipliers $\alpha_i \ge 0$:
$$
L(w, b,\alpha) = \frac{1}{2}\begin{Vmatrix}w\end{Vmatrix}^2 - \sum_{i=1}^{n}\alpha_i(y_i(w^Tx_i+b)-1)
$$
$$
\frac{\partial L}{\partial w} = 0 \to w = \sum_{i=1}^{n}\alpha_iy_ix_i
$$
$$
\frac{\partial L}{\partial b} = 0 \to \sum_{i=1}^n\alpha_iy_i = 0
$$
Simplifying further into the dual problem
The problem above can be rewritten as:
$$
\min\limits_{w, b} \max\limits_{\alpha_i\ge0}L(w,b, \alpha) = p^*
$$
For this convex problem the optimal values of the primal and the dual coincide ($d^* = p^*$) when the data are linearly separable, so it can be replaced by the following dual problem:
$$
\max\limits_{\alpha_i\ge0}\min\limits_{w, b}L(w, b, \alpha) = d^*
$$
Eliminating w and b
Substituting the two stationarity conditions back into $L$ removes $w$ and $b$:
$$
L(w, b, \alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i, j = 1}^{n}\alpha_i\alpha_jy_iy_jx_i^Tx_j
$$
$$
\max\limits_{\alpha_i\ge0}\min\limits_{w, b}L(w, b, \alpha) = \max\limits_{\alpha}\left[\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i, j = 1}^{n}\alpha_i\alpha_jy_iy_j(x_i)^Tx_j\right]
$$
Constraints:
$$
s.t. \sum_{i=1}^n\alpha_iy_i = 0, \quad\quad \alpha_i\ge0,\ i=1,2,\dots,n
$$
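The elimination of $w$ and $b$ can be written out step by step. This is a sketch of the standard substitution, using only $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$:

$$
\begin{aligned}
L(w, b, \alpha) &= \frac{1}{2}w^Tw - \sum_{i=1}^{n}\alpha_iy_iw^Tx_i - b\sum_{i=1}^{n}\alpha_iy_i + \sum_{i=1}^{n}\alpha_i \\
&= \frac{1}{2}w^Tw - w^T\Big(\sum_{i=1}^{n}\alpha_iy_ix_i\Big) - b\cdot 0 + \sum_{i=1}^{n}\alpha_i \\
&= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}w^Tw \\
&= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_jy_iy_jx_i^Tx_j
\end{aligned}
$$

The last line uses $w^Tw = \big(\sum_i \alpha_iy_ix_i\big)^T\big(\sum_j \alpha_jy_jx_j\big)$.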
Negating the objective turns this into an equivalent minimization problem:
$$
\min\limits_{\alpha}\left[-\sum_{i=1}^{n}\alpha_i + \frac{1}{2}\sum_{i, j =1}^{n}\alpha_i\alpha_jy_iy_j(x_i)^Tx_j\right] = \min\limits_{\alpha}\left[-\sum_{i=1}^{n}\alpha_i + \frac{1}{2}\sum_{i, j = 1}^{n}\alpha_i\alpha_jy_iy_j(x_i\cdot x_j)\right]
$$
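The dual objective and its constraints are easy to evaluate numerically for a candidate α. The sketch below is plain Python; the helper names `dual_objective` and `is_feasible` and the three-point toy data set are assumptions made for illustration, not part of the derivation.

```python
def dual_objective(alpha, X, y):
    """Value of 1/2 * sum_ij a_i a_j y_i y_j (x_i . x_j) - sum_i a_i."""
    n = len(alpha)
    quad = 0.0
    for i in range(n):
        for j in range(n):
            inner = sum(p * q for p, q in zip(X[i], X[j]))
            quad += alpha[i] * alpha[j] * y[i] * y[j] * inner
    return 0.5 * quad - sum(alpha)


def is_feasible(alpha, y, tol=1e-9):
    """Check alpha_i >= 0 for all i and sum_i alpha_i * y_i = 0."""
    non_negative = all(a >= -tol for a in alpha)
    balanced = abs(sum(a * t for a, t in zip(alpha, y))) <= tol
    return non_negative and balanced


# Hypothetical toy data: two positive points and one negative point
X = [(3.0, 3.0), (4.0, 3.0), (1.0, 1.0)]
y = [1, 1, -1]

alpha = [0.25, 0.0, 0.25]
print(is_feasible(alpha, y))        # True
print(dual_objective(alpha, X, y))  # -0.25
```

Any feasible α gives an upper bound on the minimum of this objective, which makes such a function handy for sanity-checking a solver.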
Solving this problem gives the optimal solution α*. Substituting it back yields:
$$
w^* = \sum_{i=1}^{n}\alpha_i^*y_ix_i
$$
$$
b^* = y_i - (w^*)^Tx_i
$$
where $i$ is the index of any support vector, that is, any sample with $\alpha_i^* > 0$.
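As a sketch of how the two formulas are used once α* is known, the following plain-Python helper rebuilds w* and b*. The name `recover_w_b`, the tolerance and the toy data are illustrative assumptions; b* is computed from the first sample whose multiplier is positive.

```python
def recover_w_b(alpha, X, y, tol=1e-12):
    """Return (w, b) with w = sum_i alpha_i y_i x_i and b = y_s - w . x_s."""
    dim = len(X[0])
    w = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(dim)]
    # Any sample with alpha_i > 0 is a support vector and lies on the margin
    s = next(i for i, a in enumerate(alpha) if a > tol)
    b = y[s] - sum(wd * xd for wd, xd in zip(w, X[s]))
    return w, b


# Hypothetical toy data and a dual solution for it
X = [(3.0, 3.0), (4.0, 3.0), (1.0, 1.0)]
y = [1, 1, -1]
alpha = [0.25, 0.0, 0.25]

w, b = recover_w_b(alpha, X, y)
print(w, b)  # [0.5, 0.5] -2.0
```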
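SMO algorithm
Sequential minimal optimization (SMO) was proposed in 1998 by John C. Platt of Microsoft Research. It performs especially well for linear SVMs and for sparse data.
Basic idea: first assign random initial values to α that satisfy the constraints. Then pick two multipliers and adjust only those two so that the objective function is minimized; pick another two and adjust them in the same way; and so on. Two multipliers are changed at a time because the constraint $\sum_i \alpha_iy_i = 0$ would be violated if only one of them moved.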
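The pairwise update can be sketched in a few lines. Note that this is a deliberately simplified illustration, not Platt's full algorithm: it assumes a hard margin (no upper bound on α), a linear kernel, an all-zero starting point instead of a random one, and a plain sweep over all pairs instead of Platt's selection heuristics. The function name `smo_hard_margin` and the toy data are hypothetical.

```python
def smo_hard_margin(X, y, sweeps=20, tol=1e-12):
    """Pairwise updates for the hard-margin dual (simplified sketch)."""
    n = len(X)
    K = [[sum(p * q for p, q in zip(X[i], X[j])) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n  # all zeros satisfies sum(alpha_i * y_i) = 0 and alpha_i >= 0

    def error(k):
        # f(x_k) - y_k with the bias left out; it cancels in error(i) - error(j)
        return sum(alpha[m] * y[m] * K[m][k] for m in range(n)) - y[k]

    for _ in range(sweeps):
        changed = False
        for i in range(n):
            for j in range(i + 1, n):
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
                if eta <= 0.0:
                    continue  # degenerate pair, e.g. duplicate points
                # Optimum for alpha_j along the line that keeps the equality constraint
                new_j = alpha[j] + y[j] * (error(i) - error(j)) / eta
                # Clip so that both alpha_i and alpha_j stay non-negative
                if y[i] != y[j]:
                    low, high = max(0.0, alpha[j] - alpha[i]), float("inf")
                else:
                    low, high = 0.0, alpha[i] + alpha[j]
                new_j = min(max(new_j, low), high)
                if abs(new_j - alpha[j]) > tol:
                    alpha[i] += y[i] * y[j] * (alpha[j] - new_j)
                    alpha[j] = new_j
                    changed = True
        if not changed:
            break  # no pair can be improved any further
    return alpha


# Hypothetical toy data: two positive points and one negative point
X = [(3.0, 3.0), (4.0, 3.0), (1.0, 1.0)]
y = [1, 1, -1]
print(smo_hard_margin(X, y))  # [0.25, 0.0, 0.25]
```

Each update changes α_i and α_j in opposite, label-weighted amounts, so the equality constraint holds after every step.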
A simple SVM example
The example has three training samples. The calculations below use $x_1 = (3, 3)$ and $x_3 = (1, 1)$, and the constraint implies labels $y_1 = y_2 = +1$, $y_3 = -1$. The objective is
$$
\min\limits_{\alpha}f(\alpha), \quad\quad s.t.\ \alpha_1 + \alpha_2 - \alpha_3 = 0,\quad\quad \alpha_i\ge0,\ i=1,2,3
$$
where
$$
\begin{aligned}
f(\alpha) &= \frac{1}{2}\sum_{i,j=1}^{3}\alpha_i\alpha_jy_iy_j(x_i\cdot x_j)-\sum_{i=1}^3\alpha_i\\
&= \frac{1}{2}(18\alpha_1^2+25\alpha_2^2+2\alpha_3^2+42\alpha_1\alpha_2-12\alpha_1\alpha_3-14\alpha_2\alpha_3) - \alpha_1 - \alpha_2 - \alpha_3
\end{aligned}
$$
Next, substituting α3 = α1 + α2 into the objective gives a function of α1 and α2 only:
$$
s(\alpha_1,\alpha_2) = 4\alpha_1^2 + \frac{13}{2}\alpha_2^2 + 10\alpha_1\alpha_2 - 2\alpha_1-2\alpha_2
$$
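Setting the partial derivatives with respect to α1 and α2 to zero shows that s(α1, α2) has its stationary point at (1.5, -1). This point violates the constraint αi ≥ 0, so the minimum must be attained on the boundary. On the boundary α1 = 0 the minimum is s(0, 2/13) = -2/13 ≈ -0.1538; on the boundary α2 = 0 it is s(1/4, 0) = -0.25. Hence s(α1, α2) attains its minimum at α1 = 1/4, α2 = 0, which gives α3 = α1 + α2 = 1/4. Because α1 and α3 are nonzero, the corresponding points x1 and x3 are the support vectors.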
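The arithmetic in this step can be checked exactly with rational numbers. The sketch below uses Python's `fractions` module; the function name `s` mirrors the notation of the text, and the variable names are otherwise illustrative.

```python
from fractions import Fraction as F


def s(a1, a2):
    """s(a1, a2) = 4*a1^2 + 13/2*a2^2 + 10*a1*a2 - 2*a1 - 2*a2."""
    return 4 * a1**2 + F(13, 2) * a2**2 + 10 * a1 * a2 - 2 * a1 - 2 * a2


# Stationary point: 8*a1 + 10*a2 = 2 and 10*a1 + 13*a2 = 2 (Cramer's rule)
det = 8 * 13 - 10 * 10
a1_free = F(2 * 13 - 10 * 2, det)
a2_free = F(8 * 2 - 10 * 2, det)
print(a1_free, a2_free)  # 3/2 -1, infeasible because a2 < 0

# Minimizers on the two boundaries of the feasible region
on_a1_zero = (F(0), F(2, 13))  # minimizes 13/2*a2^2 - 2*a2
on_a2_zero = (F(1, 4), F(0))   # minimizes 4*a1^2 - 2*a1
print(s(*on_a1_zero))  # -2/13
print(s(*on_a2_zero))  # -1/4

best = min([on_a1_zero, on_a2_zero], key=lambda p: s(*p))
alpha3 = best[0] + best[1]
print(best, alpha3)
```

The smaller boundary value, -1/4, is the constrained minimum, and α3 follows from α3 = α1 + α2.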
$$
w^* = \sum_{i=1}^{3}\alpha_i^*y_ix_i = \frac{1}{4}\cdot (3,3) - \frac{1}{4}(1, 1) = (\frac{1}{2}, \frac{1}{2})
$$
That is, w1 = 0.5 and w2 = 0.5. Using the support vector (3, 3), whose label is 1, we then have
$$
b^* = 1-(w_1, w_2)\cdot(3, 3) = -2
$$
The maximum-margin separating hyperplane is therefore (here $x_1$ and $x_2$ denote the two coordinates of a point $x$)
$$
\frac{1}{2}x_1 + \frac{1}{2}x_2 - 2 = 0
$$
and the classification decision function is
$$
f(x) = \operatorname{sign}(\frac{1}{2}x_1+\frac{1}{2}x_2-2)
$$
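The decision function can be tried out directly. In this sketch the name `decision` is illustrative, a score of exactly 0 is mapped to +1 by convention, and the second positive training point is assumed to be (4, 3), since the text does not list it explicitly.

```python
def decision(x, w=(0.5, 0.5), b=-2.0):
    """sign(w . x + b) for the hyperplane derived above."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1  # convention assumed here: sign(0) -> +1


# Training points: (3, 3) and (1, 1) appear in the calculations above,
# (4, 3) is an assumed value for the remaining positive sample
print(decision((3, 3)))  # 1
print(decision((4, 3)))  # 1
print(decision((1, 1)))  # -1

# Two unseen points
print(decision((0, 0)))  # -1
print(decision((5, 5)))  # 1
```

The two support vectors give scores of exactly +1 and -1, as required for points lying on the margin.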
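Advantages of SVM
- The complexity of a trained model is determined by the number of support vectors rather than by the dimensionality of the data, so SVMs are less prone to overfitting.
- The trained model depends entirely on the support vectors: even if every non-support-vector point is removed from the training set and training is repeated, exactly the same model is obtained.
- If training yields a relatively small number of support vectors, the resulting model tends to generalize well.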