If this article helps you even a little, please follow and give it a like; it would make me very happy~
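0. Preface
In the sample space, a separating hyperplane is determined by the linear equation $w^Tx+b=0$.
Strictly speaking, an upper and a lower boundary are placed on either side of the hyperplane, as shown in the figure below (figure source: 机器学习, *Machine Learning*):
and the samples satisfy: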
$$
\left\{\begin{matrix} w^Tx_i+b\geqslant +1,\ y_i=+1\\ w^Tx_i+b\leqslant -1,\ y_i=-1 \end{matrix}\right.
$$
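Support vectors are the samples for which equality holds in the expression above. The distance between the upper and lower boundaries is called the margin: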
$$
\gamma=\frac{2}{||w||}
$$
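To make the definitions concrete, here is a minimal Python sketch that computes the margin and picks out the support vectors for a given hyperplane. The toy data `X`, `y` and the values of `w` and `b` are hypothetical, chosen by hand for illustration; the helper names `dot`, `margin`, and `support_vectors` are also my own.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def margin(w):
    """Margin 2 / ||w|| between the two boundary hyperplanes."""
    return 2.0 / math.sqrt(dot(w, w))

def support_vectors(w, b, X, y, tol=1e-9):
    """Indices of samples for which y_i (w^T x_i + b) = 1 holds (within tol)."""
    return [i for i, (xi, yi) in enumerate(zip(X, y))
            if abs(yi * (dot(w, xi) + b) - 1.0) <= tol]

# Hypothetical toy data and hyperplane, chosen by hand for illustration
X = [[1.0, 1.0], [3.0, 3.0], [0.0, 0.0], [4.0, 4.0]]
y = [-1, 1, -1, 1]
w, b = [0.5, 0.5], -2.0

print(margin(w))                    # about 2.828, i.e. 2 * sqrt(2)
print(support_vectors(w, b, X, y))  # [0, 1]
```

In this example the margin equals the Euclidean distance between the two support vectors $(1,1)$ and $(3,3)$, because they lie on opposite boundaries along the direction of $w$.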
1. Lagrange Multipliers
The method of Lagrange multipliers finds the extrema of a multivariate function subject to a set of constraints.
Suppose the problem has $m$ equality constraints and $n$ inequality constraints:
$$
\begin{aligned} \min_x\ \ &f(x)\\ s.t.\ \ &h_i(x)=0\ \ (i=1,...,m)\\ &g_j(x)\leqslant 0\ \ (j=1,...,n) \end{aligned}
$$
Introducing the Lagrange multipliers $\lambda$ and $\mu$, the corresponding Lagrangian is:
$$
L(x,\lambda, \mu)=f(x)+\sum_{i=1}^m\lambda_ih_i(x)+\sum_{j=1}^n\mu_jg_j(x)
$$
The corresponding KKT (Karush-Kuhn-Tucker) conditions are:
$$
\left\{\begin{aligned} &\nabla_xL=0\\ &h_i(x)=0\\ &g_j(x)\leqslant 0\\ &\mu_j\geqslant 0\\ &\mu_jg_j(x)=0 \end{aligned}\right.
$$
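As a small worked example (my own, not from the original text), consider minimizing $f(x)=x^2$ subject to $g(x)=1-x\leqslant 0$. The Lagrangian is $L=x^2+\mu(1-x)$, and the KKT conditions yield $x=1$, $\mu=2$. The sketch below simply checks each condition numerically; the function name `kkt_satisfied` is hypothetical.

```python
def kkt_satisfied(x, mu, tol=1e-9):
    """Check the KKT conditions for min x^2 subject to 1 - x <= 0."""
    grad_L = 2.0 * x - mu   # stationarity: dL/dx = 2x - mu = 0
    g = 1.0 - x             # primal feasibility: g(x) <= 0
    return (abs(grad_L) <= tol      # gradient of L vanishes
            and g <= tol            # constraint holds
            and mu >= -tol          # dual feasibility: mu >= 0
            and abs(mu * g) <= tol) # complementary slackness: mu * g(x) = 0

print(kkt_satisfied(1.0, 2.0))  # True: the constrained minimizer
print(kkt_satisfied(0.0, 0.0))  # False: the unconstrained minimizer violates g(x) <= 0
```

Note how complementary slackness forces a choice: either $\mu=0$ (the constraint is inactive) or $g(x)=0$ (the solution lies on the boundary). Here only the second case is feasible.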
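2. Solving for the SVM Parameters
To find the maximum-margin separating hyperplane, we maximize $\gamma=\frac{2}{||w||}$, which is equivalent to solving: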
$$
\begin{aligned} \min_{w,b}\ \ &\frac{1}{2}||w||^2\\ s.t.\ \ &y_i(w^Tx_i+b)\geqslant 1,\ i=1,...,m \end{aligned}
$$
Applying the method of Lagrange multipliers gives:
$$
L(w,b,\alpha)=\frac{1}{2}||w||^2+\sum_{i=1}^m\alpha_i(1-y_i(w^Tx_i+b))
$$
$$
\begin{aligned} &\frac{\partial L}{\partial w}=0\Rightarrow w=\sum_{i=1}^m\alpha_iy_ix_i\\ &\frac{\partial L}{\partial b}=0\Rightarrow 0=\sum_{i=1}^m\alpha_iy_i \end{aligned}
$$
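The substitution step is worth spelling out. Expanding the Lagrangian and inserting $w=\sum_{i=1}^m\alpha_iy_ix_i$ and $\sum_{i=1}^m\alpha_iy_i=0$ eliminates both $w$ and $b$ (a sketch of the algebra, using only the symbols defined above):

```latex
\begin{aligned}
L(w,b,\alpha)
&=\frac{1}{2}w^Tw+\sum_{i=1}^m\alpha_i-w^T\sum_{i=1}^m\alpha_iy_ix_i-b\sum_{i=1}^m\alpha_iy_i\\
&=\frac{1}{2}w^Tw+\sum_{i=1}^m\alpha_i-w^Tw-0\\
&=\sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_jx_i^Tx_j
\end{aligned}
```

The remaining expression depends only on $\alpha$ and on inner products between samples.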
This gives the dual problem:
$$
\begin{aligned} \max_\alpha\ \ &\sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_jx_i^Tx_j\\ s.t.\ \ &\sum_{i=1}^m\alpha_iy_i=0,\\ &\alpha_i\geqslant 0,\ i=1,...,m \end{aligned}
$$
The corresponding KKT conditions are:
$$
\left\{\begin{aligned} &\alpha_i \geqslant 0\\ &y_if(x_i)-1\geqslant 0\\ &\alpha_i(y_if(x_i)-1)=0 \end{aligned}\right.
$$
The parameters can be solved with SMO (Sequential Minimal Optimization). The basic idea is to fix all parameters except $\alpha_i$ and optimize over $\alpha_i$ alone. However, because of the constraint $\sum_{i=1}^m\alpha_iy_i=0$, $\alpha_i$ is fully determined by the other variables once they are fixed. SMO therefore selects two variables $\alpha_i$ and $\alpha_j$ at a time, fixes all the others, solves the resulting dual subproblem to update $\alpha_i$ and $\alpha_j$, and repeats this process until convergence.
Once $\alpha$ has been found, the hyperplane is $f(x)=w^Tx+b=\sum_{i=1}^m\alpha_iy_ix_i^Tx+b$. Every support vector $(x_s,y_s)$ satisfies $y_sf(x_s)=1$, so averaging over the support-vector index set $S=\{i\mid\alpha_i>0\}$ gives $b=\frac{1}{|S|}\sum_{s\in S}\left(1/y_s-\sum_{i\in S}\alpha_iy_ix_i^Tx_s\right)$.
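The procedure can be sketched in plain Python. This is a simplified SMO variant under my own assumptions, not the full algorithm with its original selection heuristics: the second variable is chosen as the one with the largest $|E_i-E_j|$, the kernel is linear, and `C=float("inf")` stands for the hard-margin case. The function name `smo_train` and the toy data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def smo_train(X, y, C=float("inf"), tol=1e-6, max_passes=100):
    """Simplified SMO for the linear-kernel dual; C=inf means hard margin."""
    m = len(X)
    K = [[dot(X[i], X[j]) for j in range(m)] for i in range(m)]
    alpha, b = [0.0] * m, 0.0

    def error(i):
        # E_i = f(x_i) - y_i, with f written in terms of the dual variables
        return sum(alpha[k] * y[k] * K[k][i] for k in range(m)) + b - y[i]

    for _ in range(max_passes):
        changed = False
        for i in range(m):
            E_i = error(i)
            violates = ((y[i] * E_i < -tol and alpha[i] < C) or
                        (y[i] * E_i > tol and alpha[i] > 0))
            if not violates:
                continue  # alpha_i already satisfies the KKT conditions
            # Heuristic (an assumption of this sketch): largest |E_i - E_j|
            j = max((k for k in range(m) if k != i),
                    key=lambda k: abs(E_i - error(k)))
            E_j = error(j)
            a_i, a_j = alpha[i], alpha[j]
            # Feasible interval for alpha_j that keeps sum(alpha_i * y_i) = 0
            if y[i] != y[j]:
                lo, hi = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                lo, hi = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            eta = 2.0 * K[i][j] - K[i][i] - K[j][j]
            if lo >= hi or eta >= 0:
                continue
            # Unconstrained optimum along the pair, clipped to [lo, hi]
            new_j = min(max(a_j - y[j] * (E_i - E_j) / eta, lo), hi)
            if abs(new_j - a_j) < 1e-12:
                continue
            new_i = a_i + y[i] * y[j] * (a_j - new_j)
            # Intermediate bias so that the updated pair satisfies KKT
            b1 = (b - E_i - y[i] * (new_i - a_i) * K[i][i]
                  - y[j] * (new_j - a_j) * K[i][j])
            b2 = (b - E_j - y[i] * (new_i - a_i) * K[i][j]
                  - y[j] * (new_j - a_j) * K[j][j])
            if 0 < new_i < C:
                b = b1
            elif 0 < new_j < C:
                b = b2
            else:
                b = (b1 + b2) / 2.0
            alpha[i], alpha[j] = new_i, new_j
            changed = True
        if not changed:
            break

    # Final bias: average over the support vectors, as in the formula above
    S = [s for s in range(m) if alpha[s] > 1e-8]
    if S:
        b = sum(1.0 / y[s] - sum(alpha[i] * y[i] * K[i][s] for i in S)
                for s in S) / len(S)
    w = [sum(alpha[k] * y[k] * X[k][d] for k in range(m))
         for d in range(len(X[0]))]
    return alpha, w, b

# Hypothetical linearly separable toy data
X = [[1.0, 1.0], [3.0, 3.0], [0.0, 0.0], [4.0, 4.0]]
y = [-1, 1, -1, 1]
alpha, w, b = smo_train(X, y)
print(alpha, w, b)  # [0.25, 0.25, 0.0, 0.0] [0.5, 0.5] -2.0
```

On this toy set a single pair update is enough: only the two closest points of opposite class end up with $\alpha_i>0$, which matches the definition of support vectors. Real data generally needs many passes, and production solvers use more careful working-set selection.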
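3. Soft Margin
In many cases the data are not linearly separable, and even when a separating hyperplane is found, it is hard to tell whether the apparent separability is merely the result of overfitting.
A soft margin addresses this by allowing the SVM to make mistakes on some samples. Introducing slack variables $\xi$: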
$$
\begin{aligned} \min_{w,b,\xi}\ \ & \frac{1}{2}||w||^2+C\sum_{i=1}^m\xi_i\\ s.t.\ \ & y_i(w^Tx_i+b)\geqslant 1-\xi_i\\ & \xi_i\geqslant 0,\ i=1,...,m \end{aligned}
$$
Applying the method of Lagrange multipliers gives:
$$
L(w,b,\xi,\alpha,\mu)=\frac{1}{2}||w||^2+C\sum_{i=1}^m\xi_i+\sum_{i=1}^m\alpha_i(1-\xi_i-y_i(w^Tx_i+b))-\sum_{i=1}^m\mu_i\xi_i
$$
$$
\begin{aligned} &\frac{\partial L}{\partial w}=0\Rightarrow w=\sum_{i=1}^m\alpha_iy_ix_i\\ &\frac{\partial L}{\partial b}=0\Rightarrow 0=\sum_{i=1}^m\alpha_iy_i\\ &\frac{\partial L}{\partial \xi_i}=0\Rightarrow C=\alpha_i+\mu_i \end{aligned}
$$
This gives the dual problem, whose only difference from the hard-margin case lies in the constraints:
$$
\begin{aligned} \max_\alpha\ \ &\sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_jx_i^Tx_j\\ s.t.\ \ &\sum_{i=1}^m\alpha_iy_i=0,\\ &0\leqslant \alpha_i\leqslant C,\ i=1,...,m \end{aligned}
$$
The corresponding KKT conditions are:
$$
\left\{\begin{aligned} &\alpha_i \geqslant 0,\ \ \mu_i\geqslant 0\\ &y_if(x_i)-1+\xi_i\geqslant 0\\ &\alpha_i(y_if(x_i)-1+\xi_i)=0\\ &\xi_i\geqslant 0,\ \ \mu_i\xi_i= 0 \end{aligned}\right.
$$
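For fixed $w$ and $b$, the smallest slack that satisfies both primal constraints is $\xi_i=\max(0,\ 1-y_if(x_i))$. The following sketch computes these slacks and the soft-margin objective. The data, the hyperplane, and the value `C=1.0` are hypothetical values picked for illustration, and the function names are my own.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, X, y):
    """Smallest feasible slack per sample: max(0, 1 - y_i (w^T x_i + b))."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """Primal objective 0.5 * ||w||^2 + C * sum(xi_i)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))

# Hypothetical data: the third sample sits on the wrong side of the hyperplane
X = [[1.0, 1.0], [3.0, 3.0], [2.5, 2.5], [4.0, 4.0]]
y = [-1, 1, -1, 1]
w, b = [0.5, 0.5], -2.0

print(slacks(w, b, X, y))                        # [0.0, 0.0, 1.5, 0.0]
print(soft_margin_objective(w, b, X, y, C=1.0))  # 1.75
```

A slack above 1, as for the third sample here, means the sample is misclassified; a slack between 0 and 1 means it is inside the margin but still on the correct side. A larger `C` penalizes such violations more heavily.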
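4. Kernel Methods
The samples are mapped from the original space to a higher-dimensional feature space so that they become linearly separable in that feature space.
The kernel function is defined as: $k(x_i,x_j)=\phi(x_i)^T\phi(x_j)$
The dual problem then becomes: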
$$
\begin{aligned} \max_\alpha\ \ &\sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_jk(x_i,x_j)\\ s.t.\ \ &\sum_{i=1}^m\alpha_iy_i=0,\\ &\alpha_i\geqslant 0,\ i=1,...,m \end{aligned}
$$
The hyperplane becomes: $f(x)=w^T\phi(x)+b$
Common kernel functions are shown in the figure below (figure source: 机器学习, *Machine Learning*):
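To illustrate the identity $k(x_i,x_j)=\phi(x_i)^T\phi(x_j)$, the sketch below implements three widely used kernels (linear, polynomial, and Gaussian) and checks the degree-2 polynomial kernel against an explicit feature map in two dimensions. The function names and the default parameters `d=2` and `sigma=1.0` are my own choices, not taken from the original.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, d=2):
    """Homogeneous polynomial kernel (u^T v)^d."""
    return dot(u, v) ** d

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def phi_poly2(x):
    """Explicit feature map for (u^T v)^2 when the input is 2-dimensional."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

u, v = [1.0, 2.0], [3.0, 4.0]
print(polynomial_kernel(u, v))         # 121.0
print(dot(phi_poly2(u), phi_poly2(v))) # approximately 121.0
print(gaussian_kernel(u, u))           # 1.0
```

The point of the kernel trick is visible here: `polynomial_kernel` never builds the 3-dimensional vector `phi_poly2(x)`, yet it returns the same inner product. For the Gaussian kernel the corresponding feature space is infinite-dimensional, so evaluating $k$ directly is the only practical option.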
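5. Support Vector Regression
Support Vector Regression (SVR) builds a band of width $2\varepsilon$ around $f(x)$; a sample that falls inside the band is considered correctly predicted.
The amount of slack may differ on the two sides of the band, which gives: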
$$
\begin{aligned} \min_{w,b,\xi,\hat{\xi}}\ \ & \frac{1}{2}||w||^2+C\sum_{i=1}^m(\xi_i+\hat{\xi}_i)\\ s.t.\ \ & f(x_i)-y_i\leqslant \varepsilon+\xi_i\\ & y_i-f(x_i)\leqslant \varepsilon+\hat{\xi}_i\\ & \xi_i\geqslant 0,\ \hat{\xi}_i\geqslant 0,\ i=1,...,m \end{aligned}
$$
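For fixed predictions $f(x_i)$, the smallest feasible slacks are $\xi_i=\max(0,\ f(x_i)-y_i-\varepsilon)$ and $\hat{\xi}_i=\max(0,\ y_i-f(x_i)-\varepsilon)$. A minimal sketch with made-up predictions and targets (the function name `svr_slacks` and all numbers are hypothetical):

```python
def svr_slacks(f_values, y, eps):
    """Per-sample slacks on the two sides of the epsilon band."""
    # xi_i > 0 when the prediction overshoots the target by more than eps
    xi = [max(0.0, f - yi - eps) for f, yi in zip(f_values, y)]
    # xi_hat_i > 0 when the prediction undershoots the target by more than eps
    xi_hat = [max(0.0, yi - f - eps) for f, yi in zip(f_values, y)]
    return xi, xi_hat

# Hypothetical predictions f(x_i) and targets y_i
f_values = [1.0, 2.0, 3.0]
y = [1.25, 1.0, 4.0]
xi, xi_hat = svr_slacks(f_values, y, eps=0.5)

print(xi)      # [0.0, 0.5, 0.0]
print(xi_hat)  # [0.0, 0.0, 0.5]
```

The first sample lies inside the band and incurs no loss. For every sample at most one of the two slacks is nonzero, since a prediction cannot overshoot and undershoot at the same time.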
Applying the method of Lagrange multipliers gives:
$$
\begin{aligned} &L(w,b,\xi,\hat{\xi},\alpha,\hat{\alpha},\mu,\hat{\mu})\\ &=\frac{1}{2}||w||^2+C\sum_{i=1}^m(\xi_i+\hat{\xi}_i)-\sum_{i=1}^m\mu_i\xi_i-\sum_{i=1}^m\hat{\mu}_i\hat{\xi}_i\\ &+\sum_{i=1}^m\alpha_i(f(x_i)-y_i-\varepsilon-\xi_i)+\sum_{i=1}^m\hat{\alpha}_i(y_i-f(x_i)-\varepsilon-\hat{\xi}_i) \end{aligned}
$$
$$
\begin{aligned} &\frac{\partial L}{\partial w}=0\Rightarrow w=\sum_{i=1}^m(\hat{\alpha}_i-\alpha_i)x_i\\ &\frac{\partial L}{\partial b}=0\Rightarrow 0=\sum_{i=1}^m(\hat{\alpha}_i-\alpha_i)\\ &\frac{\partial L}{\partial \xi_i}=0\Rightarrow C=\alpha_i+\mu_i\\ &\frac{\partial L}{\partial \hat{\xi}_i}=0\Rightarrow C=\hat{\alpha}_i+\hat{\mu}_i \end{aligned}
$$
This gives the dual problem:
$$
\begin{aligned} \max_{\alpha,\hat{\alpha}}\ \ &\sum_{i=1}^m\left[y_i(\hat{\alpha}_i-\alpha_i)-\varepsilon(\hat{\alpha}_i+\alpha_i)\right]-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m(\hat{\alpha}_i-\alpha_i)(\hat{\alpha}_j-\alpha_j)x_i^Tx_j\\ s.t.\ \ &\sum_{i=1}^m(\hat{\alpha}_i-\alpha_i)=0,\\ &0\leqslant \alpha_i,\hat{\alpha}_i\leqslant C,\ i=1,...,m \end{aligned}
$$
The corresponding KKT conditions are:
$$
\left\{\begin{aligned} &\alpha_i(f(x_i)-y_i-\varepsilon-\xi_i)=0\\ &\hat{\alpha}_i(y_i-f(x_i)-\varepsilon-\hat{\xi}_i)=0\\ &\alpha_i\hat{\alpha}_i=0,\ \xi_i\hat{\xi}_i=0\\ &(C-\alpha_i)\xi_i=0,\ (C-\hat{\alpha}_i)\hat{\xi}_i=0 \end{aligned}\right.
$$
The SVR solution is: $f(x)=\sum_{i=1}^m(\hat{\alpha}_i-\alpha_i)x_i^Tx+b$.
Here, for any sample with $0<\alpha_i<C$, the KKT conditions force $\xi_i=0$ and $f(x_i)-y_i-\varepsilon=0$, so $b=y_i+\varepsilon-\sum_{j=1}^m(\hat{\alpha}_j-\alpha_j)x_j^Tx_i$.
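As a sanity check of these two formulas, consider a hypothetical one-dimensional problem with two samples, $(x,y)=(0,0)$ and $(2,2)$, and $\varepsilon=0.5$. Working through the hard-band problem by hand (with $C$ assumed large enough that no slack is needed) gives $w=0.5$, $b=0.5$, with dual variables $\alpha_1=0.25$ and $\hat{\alpha}_2=0.25$. The sketch below plugs those hand-derived values into the formulas; it does not solve the dual itself, and the function names are my own.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def svr_bias(alpha, alpha_hat, X, y, eps, i):
    """b = y_i + eps - sum_j (alpha_hat_j - alpha_j) x_j^T x_i, for 0 < alpha_i < C."""
    return y[i] + eps - sum((ah - a) * dot(xj, X[i])
                            for a, ah, xj in zip(alpha, alpha_hat, X))

def svr_predict(alpha, alpha_hat, X, b, x):
    """f(x) = sum_i (alpha_hat_i - alpha_i) x_i^T x + b."""
    return sum((ah - a) * dot(xi, x)
               for a, ah, xi in zip(alpha, alpha_hat, X)) + b

# Hypothetical toy problem; the dual values below were derived by hand
X = [[0.0], [2.0]]
y = [0.0, 2.0]
eps = 0.5
alpha = [0.25, 0.0]      # first sample touches the band with f(x) - y = eps
alpha_hat = [0.0, 0.25]  # second sample touches the band with y - f(x) = eps

b = svr_bias(alpha, alpha_hat, X, y, eps, i=0)
print(b)                                              # 0.5
print(svr_predict(alpha, alpha_hat, X, b, [0.0]))     # 0.5
print(svr_predict(alpha, alpha_hat, X, b, [2.0]))     # 1.5
```

Both training predictions sit exactly $\varepsilon$ away from their targets, on opposite sides of the band, which is the flattest function (smallest $||w||$) that keeps both samples inside it. The dual values also satisfy $\sum_{i}(\hat{\alpha}_i-\alpha_i)=0$ and $\alpha_i\hat{\alpha}_i=0$.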