Classification decision function:
$$f(x)=\operatorname{sign}(wx+b)$$
where
$$\operatorname{sign}(z)=\begin{cases}+1 & \text{if } z\ge 0\\ -1 & \text{otherwise}\end{cases}$$
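As a quick illustration, the decision function can be written directly in NumPy (a minimal sketch; the names `decision`, `w`, `b` are chosen here, not from the text):

```python
import numpy as np

def decision(w, b, x):
    """f(x) = sign(w.x + b); the tie z == 0 maps to +1, matching the case definition above."""
    z = float(np.dot(w, x)) + b
    return 1 if z >= 0 else -1

# A point on the positive side of the hyperplane w.x + b = 0:
print(decision(np.array([1.0, -1.0]), 0.5, np.array([2.0, 1.0])))  # z = 1.5 -> 1
```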
The functional margin of a sample $(x^{(i)},y^{(i)})$ in the training data $T$ is:
$$\hat\gamma^{(i)} = y^{(i)}(wx^{(i)} + b).$$
The functional margin has a problem: scaling $w$ and $b$ by a factor of 2 also doubles the margin, yet nothing meaningful has changed, because the hyperplane is still the same hyperplane, and our goal is to choose a good hyperplane.
To fix this we define the geometric margin, which is invariant to the scale of $w$ and $b$ (normalize $w$ and $b$ by the L2 norm $||w||$):
$$\gamma^{(i)} = y^{(i)}\left(\dfrac{w}{||w||}\, x^{(i)} + \dfrac{b}{||w||}\right).$$
The geometric margin of the whole training set $T$:
$$\gamma=\min_{i=1,\cdots,N}\gamma^{(i)}$$
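Both margins are easy to compute for a whole dataset at once; a sketch (the data here is illustrative):

```python
import numpy as np

def margins(w, b, X, y):
    """Per-sample functional margins y_i(w.x_i + b) and geometric margins (divide by ||w||)."""
    functional = y * (X @ w + b)
    geometric = functional / np.linalg.norm(w)
    return functional, geometric

w, b = np.array([3.0, 4.0]), -1.0                  # ||w|| = 5
X = np.array([[1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, -1.0])
f, g = margins(w, b, X, y)
# f = [6.0, 1.0]; g = [1.2, 0.2]; the margin of the whole set is min(g) = 0.2
```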
The optimization problem is:
$$\begin{aligned} &\max_{w,b} \quad \gamma \\ &\text{s.t.}\quad y^{(i)}\left(\dfrac{w}{||w||}x^{(i)}+\dfrac{b}{||w||}\right) \ge\gamma \end{aligned}$$
Since $\gamma=\dfrac{\hat\gamma}{||w||}$, this optimization problem is equivalent to:
$$\begin{aligned} &\max_{w,b} \quad \dfrac{\hat\gamma}{||w||} \\ &\text{s.t.}\quad y^{(i)}\left(\dfrac{w}{||w||}x^{(i)}+\dfrac{b}{||w||}\right) \ge \dfrac{\hat\gamma}{||w||} \end{aligned}$$
that is,
$$\begin{aligned} &\max_{w,b} \quad \dfrac{\hat\gamma}{||w||} \\ &\text{s.t.}\quad y^{(i)}(wx^{(i)}+b) \ge \hat\gamma \end{aligned}$$
Because rescaling $w$ and $b$ does not affect the inequality constraints, we can set $\hat\gamma=1$, and the optimization problem becomes
$$\begin{aligned} &\max_{w,b} \quad \dfrac{1}{||w||} \\ &\text{s.t.}\quad y^{(i)}(wx^{(i)}+b) \ge 1 \end{aligned}$$
Since maximizing $\dfrac{1}{||w||}$ is equivalent to minimizing $\dfrac{1}{2}||w||^2$, the optimization problem becomes:
$$\begin{aligned} &\min_{w,b} \quad \dfrac{1}{2}||w||^2 \\ &\text{s.t.}\quad y^{(i)}(wx^{(i)}+b)-1 \ge 0 \end{aligned}$$
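On a tiny separable toy set this quadratic program can be solved numerically. The sketch below uses SciPy's general-purpose SLSQP solver; the data and starting point are made up here, and real SVM libraries use specialized solvers such as SMO instead:

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(theta):                      # theta = (w1, w2, b)
    w = theta[:2]
    return 0.5 * float(w @ w)

# One inequality constraint per sample: y_i (w.x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda th, i=i: y[i] * (X[i] @ th[:2] + th[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
# For this data the optimum is w = (0.5, 0.5), b = -1: the samples (2,2) and (0,0)
# lie exactly on the margin boundaries y(wx + b) = 1.
```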
Introduce a Lagrange multiplier $\alpha_i\ge 0$ for each inequality constraint and define the Lagrangian:
$$L(w,b,\alpha)=\dfrac{1}{2}||w||^2 -\sum_{i=1}^N\alpha_i \left[y^{(i)}(wx^{(i)}+b)-1\right]$$
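The Lagrangian is straightforward to evaluate numerically, which makes for a useful sanity check; a sketch (all data here is illustrative):

```python
import numpy as np

def lagrangian(w, b, alpha, X, y):
    """L(w,b,a) = 0.5||w||^2 - sum_i a_i [y_i(w.x_i + b) - 1]."""
    return 0.5 * float(w @ w) - float(alpha @ (y * (X @ w + b) - 1.0))

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.5]), -1.0          # a feasible (w, b): every constraint holds
# On the feasible set each bracket term is >= 0, so any alpha >= 0 can only decrease L;
# the max over alpha is attained at alpha = 0 and recovers the primal objective 0.5||w||^2.
print(lagrangian(w, b, np.zeros(4), X, y))   # 0.25
```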
The primal problem is then:
$$\min_{w,b}\max_{\alpha} L(w,b,\alpha)$$
and its Lagrange dual problem is:
$$\max_{\alpha}\min_{w,b} L(w,b,\alpha)$$
Setting
$$\begin{aligned} &\nabla_w L(w,b,\alpha)=w-\sum_{i=1}^N\alpha_i y^{(i)}x^{(i)}=0 \\ &\nabla_b L(w,b,\alpha)=-\sum_{i=1}^N\alpha_i y^{(i)}=0 \end{aligned}$$
we obtain
$$\begin{aligned} \min_{w,b} L(w,b,\alpha)&=-\dfrac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\alpha_i\alpha_j y^{(i)}y^{(j)}(x^{(i)}\cdot x^{(j)}) +\sum_{i=1}^N\alpha_i \\ w&=\sum_{i=1}^N\alpha_i y^{(i)}x^{(i)} \end{aligned}$$
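The resulting dual objective is convenient to evaluate with a Gram matrix; a sketch with illustrative multipliers (for this toy data, $\alpha = (0.25, 0, 0.25, 0)$ satisfies $\sum_i \alpha_i y^{(i)} = 0$, and the dual value equals the primal value $\frac{1}{2}||w||^2 = 0.25$, illustrating strong duality):

```python
import numpy as np

def dual_objective(alpha, X, y):
    """W(a) = sum_i a_i - 0.5 sum_i sum_j a_i a_j y_i y_j <x_i, x_j>, vectorized."""
    Z = y[:, None] * X                     # row i is y_i x_i
    return float(alpha.sum() - 0.5 * alpha @ (Z @ Z.T) @ alpha)

def w_from_alpha(alpha, X, y):
    """Recover the primal weights: w = sum_i a_i y_i x_i."""
    return (alpha * y) @ X

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])   # illustrative multipliers
print(dual_objective(alpha, X, y))         # 0.25
print(w_from_alpha(alpha, X, y))           # [0.5 0.5]
```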
In general $\max\min \le \min\max$ (weak duality); only when the KKT conditions hold does the solution of the primal problem coincide with the solution of the dual problem.
For a problem of the form
$$\begin{aligned} \min_w \quad & f(w) \\ \text{s.t.}\quad & g_i(w) \le 0, \quad i = 1,\dots,k\\ & h_i(w) = 0, \quad i = 1,\dots,l \end{aligned}$$
with Lagrangian
$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^k \alpha_i g_i(w) + \sum_{i=1}^l \beta_i h_i(w),$$
the KKT conditions are as follows:
$$\begin{aligned} \dfrac{\partial}{\partial w_i}L(w, \alpha, \beta) &= 0, \quad i = 1,\cdots,n \\ \dfrac{\partial}{\partial \beta_i}L(w, \alpha, \beta) &= 0, \quad i = 1, \cdots,l \\ \alpha_i g_i(w) &= 0, \quad i = 1, \cdots,k\\ g_i(w) &\le 0, \quad i = 1, \cdots,k \\ \alpha_i &\ge 0, \quad i = 1, \cdots,k \end{aligned}$$
Note that since
$$g_i(w)=-\left[y^{(i)}(wx^{(i)}+b)-1\right]\le 0,$$
if $\alpha_i>0$, then necessarily $g_i(w)=0$, which means the functional margin of sample $i$ to the separating hyperplane is exactly 1 (its geometric distance is $1/||w||$). Such points are called support vectors.
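In code, support vectors are found by thresholding the multipliers (a sketch; the $\alpha$ values and data are illustrative, and a small tolerance replaces the exact test $\alpha_i > 0$ in floating point):

```python
import numpy as np

alpha = np.array([0.25, 0.0, 0.25, 0.0])      # illustrative solved multipliers
support = np.flatnonzero(alpha > 1e-8)        # indices of the support vectors

# Those samples lie exactly on the margin: y_i (w.x_i + b) = 1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.5]), -1.0
margins = y[support] * (X[support] @ w + b)   # both equal 1
```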
Suppose $\alpha_j>0$; then
$$y^{(j)}(wx^{(j)}+b)-1=0$$
Substituting the expression for $w$ gives
$$\begin{aligned} 0&=y^{(j)}\left(\sum\alpha_i y^{(i)}x^{(i)}x^{(j)}+b\right)-1 \\ &=(y^{(j)})^2\left(\sum\alpha_i y^{(i)}x^{(i)}x^{(j)}+b\right)-y^{(j)} \\ &=\sum\alpha_i y^{(i)}x^{(i)}x^{(j)}+b-y^{(j)} \end{aligned}$$
Thus
$$b=y^{(j)}-\sum_{i=1}^N\alpha_i y^{(i)}x^{(i)}x^{(j)}$$
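Numerically, $b$ can be recovered from any support vector $j$; a sketch (the data and $\alpha$ are illustrative):

```python
import numpy as np

def intercept(alpha, X, y, j):
    """b = y_j - sum_i a_i y_i <x_i, x_j> for a support vector j (alpha_j > 0)."""
    return float(y[j] - (alpha * y) @ (X @ X[j]))

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])
# Both support vectors (j = 0 and j = 2) give the same b = -1.
print(intercept(alpha, X, y, 0), intercept(alpha, X, y, 2))
```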
Having obtained $w$ and $b$, to classify a new data point $x$ we evaluate the sign of $wx+b$; substituting the expression for $w$ gives:
$$\begin{aligned} w^T x + b &= \left(\sum_{i=1}^N\alpha_i y^{(i)}x^{(i)}\right)^T x+b \\ &=\sum_{i=1}^N\alpha_i y^{(i)}\langle x^{(i)},x\rangle+b \end{aligned}$$
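Prediction therefore needs only inner products with the support vectors; a sketch (the data, $\alpha$, and $b$ are illustrative):

```python
import numpy as np

def predict(alpha, X, y, b, x_new):
    """sign(sum_i a_i y_i <x_i, x_new> + b), summing only over support vectors."""
    sv = np.flatnonzero(alpha > 1e-8)
    score = sum(alpha[i] * y[i] * float(X[i] @ x_new) for i in sv) + b
    return 1 if score >= 0 else -1

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])
b = -1.0
print(predict(alpha, X, y, b, np.array([3.0, 0.0])))   # score = 0.5 -> 1
```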
Note that only for a support vector $i$ can $\alpha_i$ be greater than 0; all other $\alpha_i$ are 0. Since only a handful of training samples are support vectors, the inner-product computation above saves a great deal of work. This also shows that the final classifier depends only on the support vectors and not on any of the other points.