Reference: Chapter 7 of Li Hang (李航), *Statistical Learning Methods* (统计学习方法).
The basic idea of support vector machine learning is to find the separating hyperplane that correctly classifies the training data and has the largest geometric margin.
Finding the hyperplane with the largest geometric margin means classifying the training data with the greatest possible confidence: not only are the positive and negative instances separated, but even the hardest instances (those closest to the hyperplane) are separated with a sufficiently large margin. Such a hyperplane should predict well on unseen instances.
The solution of the primal problem can be obtained by solving the dual problem, which in turn determines the separating hyperplane and the decision function.
① From the primal problem to the dual problem
The primal optimization problem of the SVM:
$$\begin{aligned} &\min \limits_{w,b}\frac{1}{2}\|w\|^2 \\ &\text{s.t.} \quad y_i(w\cdot x_i +b)-1\ge 0,\quad i =1,2,\dots,N \end{aligned}$$
$\Leftrightarrow$ (rewritten in the standard form with "$\le$" constraints)
$$\begin{aligned} &\min \limits_{w,b}\frac{1}{2}\|w\|^2 \\ &\text{s.t.} \quad -y_i(w\cdot x_i +b)+1\le 0,\quad i =1,2,\dots,N \end{aligned}$$
Lagrange duality
The general primal constrained optimization problem:
$$\begin{aligned} &\min \limits_{x\in \mathrm{R}^n}f(x) \\ &\text{s.t.} \ \ c_i(x)\le 0,\quad i =1,2,\dots,k\\ &\quad\ \ \ \ \ h_j(x)= 0,\quad j =1,2,\dots,l \end{aligned}$$
Its Lagrangian is
$$L(x, \alpha, \beta)=f(x)+\sum_{i=1}^{k} \alpha_{i} c_{i}(x)+\sum_{j=1}^{l} \beta_{j} h_{j}(x)$$
where $\alpha_i\ge0$. Maximizing over the multipliers $\alpha,\beta$ yields an equivalent form of $f(x)$:
$$\theta_{P}(x)=\max _{\alpha, \beta: \alpha_{i} \geqslant 0} L(x, \alpha, \beta)$$
For a given $x$: if some $i$ violates the constraint $c_i(x)\le 0$, then letting $\alpha_i\rightarrow +\infty$ drives $\theta_{P}(x)=+\infty$; if some $j$ violates the constraint $h_j(x)= 0$, then choosing $\beta_j$ so that $\beta_j h_j(x)\rightarrow +\infty$ likewise gives $\theta_{P}(x)=+\infty$. Only when $x$ satisfies both kinds of constraints does maximizing $L(x, \alpha, \beta)$ yield the finite value $\theta_P(x)=f(x)$.
The primal problem is therefore the minimax problem of the generalized Lagrangian:
$$\min\limits_{x}\theta_{P}(x)=\min\limits_{x}\max _{\alpha, \beta: \alpha_{i} \geqslant 0} L(x, \alpha, \beta)$$
The dual problem is the maximin problem of the generalized Lagrangian (note that $\theta_D$ is a function of the multipliers, not of $x$):
$$\max_{\alpha, \beta: \alpha_{i} \geqslant 0}\theta_{D}(\alpha,\beta)=\max_{\alpha, \beta: \alpha_{i} \geqslant 0}\min\limits_{x} L(x, \alpha, \beta)$$
where
$$\theta_{D}(\alpha,\beta)=\min\limits_{x} L(x, \alpha, \beta)$$
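A one-line check relates the two problems (weak duality): for any $x$ and any $\alpha,\beta$ with $\alpha_i\ge 0$,

```latex
\theta_{D}(\alpha, \beta)=\min _{x} L(x, \alpha, \beta) \leq L(x, \alpha, \beta) \leq \max _{\alpha, \beta: \alpha_{i} \geqslant 0} L(x, \alpha, \beta)=\theta_{P}(x)
```

so $\max\limits_{\alpha, \beta: \alpha_{i} \geqslant 0}\theta_D(\alpha,\beta)\le\min\limits_{x}\theta_P(x)$ always holds. For the separable SVM problem (convex objective, affine constraints) the two optimal values actually coincide, which is why the dual can be solved in place of the primal.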
The Lagrangian of the SVM primal problem:
$$L(w, b, \alpha)=\frac{1}{2}\|w\|^{2}-\sum_{i=1}^{N} \alpha_{i} y_{i}\left(w \cdot x_{i}+b\right)+\sum_{i=1}^{N} \alpha_{i}$$
In the SVM setting, by Lagrange duality the dual of the primal problem is the maximin problem:
$$\max _{\alpha} \min _{w, b} L(w, b, \alpha)$$
② Simplifying the dual problem
The following steps first cast the primal problem into the dual form (in the dual variables); solving the dual problem then yields the primal solution indirectly.
(1) Compute $\min \limits_{w, b} L(w, b, \alpha)$:
$$\begin{aligned} \nabla_{w} L(w, b, \alpha)&=w-\sum_{i=1}^{N} \alpha_{i} y_{i} x_{i}=0 \\ \nabla_{b} L(w, b, \alpha)&=-\sum_{i=1}^{N} \alpha_{i} y_{i}=0 \\ \Rightarrow\quad &w=\sum_{i=1}^{N} \alpha_{i} y_{i} x_{i} \\ &\sum_{i=1}^{N} \alpha_{i} y_{i}=0 \end{aligned}$$
Substituting these results back into $L$ gives
$$\begin{aligned} \min \limits_{w, b} L(w, b, \alpha) &=\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j}\left(x_{i} \cdot x_{j}\right)-\sum_{i=1}^{N} \alpha_{i} y_{i}\left(\left(\sum_{j=1}^{N} \alpha_{j} y_{j} x_{j}\right) \cdot x_{i}+b\right)+\sum_{i=1}^{N} \alpha_{i} \\ &=-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j}\left(x_{i} \cdot x_{j}\right)+\sum_{i=1}^{N} \alpha_{i} \end{aligned}$$
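This algebraic simplification can be sanity-checked numerically. The sketch below uses random hypothetical data and verifies that once $w=\sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i=0$ hold, the Lagrangian equals the simplified expression for any value of $b$:

```python
# Numeric sanity check of the substitution (random hypothetical data): once
# w = sum_i a_i y_i x_i and sum_i a_i y_i = 0 hold, the Lagrangian equals
# -1/2 sum_ij a_i a_j y_i y_j (x_i . x_j) + sum_i a_i for any value of b.
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3
X = rng.normal(size=(N, d))
y = np.where(rng.random(N) < 0.5, -1.0, 1.0)
a = rng.random(N)
a -= y * (a @ y) / N            # project so that sum_i a_i y_i = 0 (uses y_i^2 = 1)

w = (a * y) @ X                 # w = sum_i a_i y_i x_i
b = rng.normal()                # arbitrary: the b term cancels
lagrangian = 0.5 * w @ w - np.sum(a * y * (X @ w + b)) + a.sum()
dual_value = -0.5 * (a * y) @ (X @ X.T) @ (a * y) + a.sum()
# lagrangian and dual_value agree up to floating-point error
```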
(2) Maximize $\min \limits_{w, b} L(w, b, \alpha)$ over $\alpha$:
$$\begin{array}{ll} \max \limits_{\alpha} & -\frac{1}{2} \sum\limits_{i=1}^{N} \sum\limits_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j}\left(x_{i} \cdot x_{j}\right)+\sum\limits_{i=1}^{N} \alpha_{i} \\ \text { s.t. } & \sum\limits_{i=1}^{N} \alpha_{i} y_{i}=0 \\ & \alpha_{i} \geqslant 0, \quad i=1,2, \cdots, N \end{array}$$
(3) Turning this maximization of the objective into a minimization, we finally obtain the dual optimization problem equivalent to the primal:
$$\begin{array}{ll} \min \limits_{\alpha} & \frac{1}{2} \sum\limits_{i=1}^{N} \sum\limits_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j}\left(x_{i} \cdot x_{j}\right)-\sum\limits_{i=1}^{N} \alpha_{i} \\ \text { s.t. } & \sum\limits_{i=1}^{N} \alpha_{i} y_{i}=0 \\ & \alpha_{i} \geqslant 0, \quad i=1,2, \cdots, N \end{array}$$
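The dual above is a quadratic program in $\alpha$ and can be handed to a generic QP solver. Below is a minimal sketch (on an assumed two-point toy set; production SVM libraries use specialized SMO-style solvers instead) that solves the dual with SciPy's SLSQP:

```python
# A minimal sketch (assumed toy data, not a production solver): solve the SVM
# dual QP with SciPy's SLSQP on a linearly separable two-point data set.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 0.0], [-1.0, 0.0]])    # hypothetical inputs x_1, x_2
y = np.array([1.0, -1.0])                  # labels y_1, y_2
N = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)  # Q_ij = y_i y_j (x_i . x_j)

def dual_obj(a):
    # objective of the min form: (1/2) a^T Q a - sum_i a_i
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(
    dual_obj,
    x0=np.zeros(N),
    method="SLSQP",
    bounds=[(0.0, None)] * N,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x   # optimal multipliers, approximately (0.5, 0.5) here
```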
Therefore there exist $w^*$, $\alpha^*$, $b^*$ such that $w^*,b^*$ is a solution of the primal problem and $\alpha^*$ is a solution of the dual problem.
③ From the dual solution to the primal solution
Suppose the dual optimization problem has solution $\alpha^*$. Then the solution $w^*,b^*$ of the primal optimization problem can be recovered from $\alpha^*$ as follows:
$$\begin{array}{c} w^{*}=\sum_{i=1}^{N} \alpha_{i}^{*} y_{i} x_{i} \\ b^{*}=y_{j}-\sum_{i=1}^{N} \alpha_{i}^{*} y_{i}\left(x_{i} \cdot x_{j}\right) \end{array}$$
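These recovery formulas translate directly into code. The sketch below assumes the dual solution $\alpha^*=(0.5,0.5)$ for a hypothetical toy set $x_1=(1,0),y_1=+1$ and $x_2=(-1,0),y_2=-1$ (these values are illustrative assumptions, not from the book):

```python
# Sketch of the recovery formulas, assuming the dual solution alpha* = (0.5, 0.5)
# for the hypothetical toy set x1 = (1, 0), y1 = +1 and x2 = (-1, 0), y2 = -1.
import numpy as np

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha_star = np.array([0.5, 0.5])              # dual solution (assumed)

w_star = (alpha_star * y) @ X                  # w* = sum_i alpha_i* y_i x_i
j = int(np.argmax(alpha_star > 0))             # any index j with alpha_j* > 0
b_star = y[j] - (alpha_star * y) @ (X @ X[j])  # b* = y_j - sum_i alpha_i* y_i (x_i . x_j)
# here w_star = [1., 0.] and b_star = 0.0
```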
Proof:
Let $w^*,b^*,\alpha^*$ be the common optimal solution of the dual and primal problems, with Lagrangian
$$L(w, b, \alpha)=\frac{1}{2}\|w\|^{2}-\sum_{i=1}^{N} \alpha_{i} y_{i}\left(w \cdot x_{i}+b\right)+\sum_{i=1}^{N} \alpha_{i}$$
By the KKT conditions for the primal problem,
$$\begin{array}{l} \nabla_{w} L\left(w^{*}, b^{*}, \alpha^{*}\right)=w^{*}-\sum_{i=1}^{N} \alpha_{i}^{*} y_{i} x_{i}=0 \\ \nabla_{b} L\left(w^{*}, b^{*}, \alpha^{*}\right)=-\sum_{i=1}^{N} \alpha_{i}^{*} y_{i}=0 \\ \alpha_{i}^{*}\left(-y_{i}\left(w^{*} \cdot x_{i}+b^{*}\right)+1\right)=0, \quad i=1,2, \cdots, N \\ -y_{i}\left(w^{*} \cdot x_{i}+b^{*}\right)+1 \leqslant 0, \quad i=1,2, \cdots, N \\ \alpha_{i}^{*} \geqslant 0, \quad i=1,2, \cdots, N \end{array}$$
The first equation gives
$$w^{*}=\sum_{i=1}^{N} \alpha_{i}^{*} y_{i} x_{i}$$
At least one $\alpha_j^*>0$ (this is not an extra assumption: if all $\alpha_i^*$ were zero, the first equation would force $w^*=0$, which cannot be a valid solution). Pick any such $j$; complementary slackness then gives $y_{j}\left(w^{*} \cdot x_{j}+b^{*}\right)-1=0$.
Noting that $y_{j}^{2}=1$, combining the two equations above yields
$$b^*=\frac{1}{y_j}-w^*\cdot x_j=y_j-\sum_{i=1}^{N} \alpha_{i}^{*} y_{i}\left(x_{i}\cdot x_j\right)$$
This completes the proof.
④ From the primal solution to the separating hyperplane, decision function, and support vectors
Further, the separating hyperplane can be written as
$$\sum_{i=1}^{N} \alpha_{i}^{*} y_{i}\left(x \cdot x_{i}\right)+b^{*}=0$$
and the classification decision function as
$$f(x)=\operatorname{sign}\left(\sum_{i=1}^{N} \alpha_{i}^{*} y_{i}\left(x \cdot x_{i}\right)+b^{*}\right)$$
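This dual-form decision function uses new inputs only through inner products with training points. The sketch below evaluates it with the assumed toy values from this note ($\alpha^*=(0.5,0.5)$, $b^*=0$), not the book's worked example:

```python
# Sketch of the dual-form decision function; alpha*, b* and the data are the
# assumed toy values from this note, not from the book's worked example.
import numpy as np

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha_star = np.array([0.5, 0.5])
b_star = 0.0

def f(x):
    # f(x) = sign(sum_i alpha_i* y_i (x . x_i) + b*): the input x enters
    # only through inner products with the training points x_i
    return np.sign((alpha_star * y) @ (X @ x) + b_star)
```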
Observe that $w^*,b^*$ depend only on the samples with $\alpha_j^*>0$; the corresponding training samples are called support vectors.
For a support vector,
$$\alpha_{i}^{*}\left(-y_{i}\left(w^{*} \cdot x_{i}+b^{*}\right)+1\right)=0,\quad \alpha_{i}^{*}>0\\ \Rightarrow \quad y_{i}\left(w^{*} \cdot x_{i}+b^{*}\right)-1=0\\ \Rightarrow \quad w^{*} \cdot x_{i}+b^{*}=1\ \text{or} -1$$
That is, every support vector lies exactly on one of the margin boundaries!
- The SVM's classification decision depends only on the support vectors, which is one key reason for its high prediction efficiency.
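Concretely, prediction only needs the samples with $\alpha_i^*>0$. The sketch below (dataset, dual solution, and the $10^{-8}$ threshold are all illustrative assumptions) filters out the support vectors and predicts from that subset alone:

```python
# Sketch: prediction only needs the samples with alpha_i* > 0 (the support
# vectors). The data set, dual solution, and threshold 1e-8 are assumptions.
import numpy as np

X = np.array([[1.0, 0.0], [-1.0, 0.0], [3.0, 2.0], [-4.0, 1.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
alpha = np.array([0.5, 0.5, 0.0, 0.0])   # hypothetical dual solution
b = 0.0

sv = alpha > 1e-8                        # support-vector mask: first two points

def predict(x):
    # the sum runs over support vectors only; the dropped terms are zero anyway
    return np.sign((alpha[sv] * y[sv]) @ (X[sv] @ x) + b)
```

Prediction cost thus scales with the number of support vectors, not with the full training-set size.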