Derivation of the SVM Dual Form
The SVM primal problem:

$$\min_{w,b}\ \frac{1}{2}\|w\|_{2}^{2}$$

$$\mathrm{s.t.}\quad y_{i}\left(w^{T}x_{i} + b\right) \geq 1$$
First rewrite the inequality in the standard "≤ 0" form:

$$1 - y_{i}\left(w^{T}x_{i} + b\right) \leq 0$$
Form the Lagrangian and the corresponding min-max problem:

$$\min_{w,b}\left\{\max_{\alpha}\left\{\frac{1}{2}\|w\|_{2}^{2} + \sum_{i=1}^{m}\alpha_{i}\left(1 - y_{i}\left(w^{T}x_{i} + b\right)\right)\right\}\right\}$$

$$\mathrm{s.t.}\quad \alpha_{i} \geq 0$$
Why does an extra max over $\alpha$ appear?

To fold the inequality constraints into the objective, each $\alpha_i$ acts like a penalty factor that constrains the value of $\left(1-y_i\left(w^Tx_i+b\right)\right)$. The constraint requires $\left(1-y_i\left(w^Tx_i+b\right)\right)\le 0$, while $\alpha_i\ge 0$. Taking the max of their product over $\alpha$ forces every $\left(1-y_i\left(w^Tx_i+b\right)\right)$ to be nonpositive: if any of these terms were positive, the inner max could drive the objective to $\infty$, but the outer min would never accept $\infty$, so at the optimum all the constraints must hold. This moves the inequality constraints into the objective function, which is exactly the Lagrangian dual construction.
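The max trick can be illustrated numerically (a minimal sketch; the multiplier grid below stands in for the supremum over all $\alpha \ge 0$ and is only for demonstration):

```python
# For g = 1 - y_i * (w^T x_i + b):
# sup over alpha >= 0 of alpha * g is 0 when g <= 0,
# and grows without bound when g > 0.
def penalized(g, alphas):
    return max(a * g for a in alphas)

alphas = [0.0, 1.0, 10.0, 1e6]      # sample nonnegative multipliers
print(penalized(-0.5, alphas))      # constraint satisfied -> max is 0
print(penalized(0.5, alphas))       # constraint violated -> unbounded above
```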
Denote the expression above by $L\left(w,b,\alpha\right)$, then swap the positions of $\min$ and $\max$ (for this convex problem strong duality holds by Slater's condition, so the swap does not change the optimal value):
$$L\left(w,b,\alpha\right) = \max_{\alpha}\left\{\min_{w,b}\left\{\frac{1}{2}\|w\|_{2}^{2} + \sum_{i=1}^{m}\alpha_{i}\left(1 - y_{i}\left(w^{T}x_{i} + b\right)\right)\right\}\right\}$$
The inner min can be eliminated by differentiating with respect to w and b and setting the derivatives to 0. Differentiate the expression inside the min:
$$f = \frac{1}{2}\|w\|_{2}^{2} + \sum_{i=1}^{m}\alpha_{i}\left(1 - y_{i}\left(w^{T}x_{i} + b\right)\right)$$

$$= \frac{1}{2}w^{T}w + \sum_{i=1}^{m}\alpha_{i}\left(1 - y_{i}\left(w^{T}x_{i} + b\right)\right)$$
Taking derivatives:
$$\frac{\partial f}{\partial w} = w - \sum_{i=1}^{m}\alpha_{i}y_{i}x_{i}$$

$$\frac{\partial f}{\partial b} = -\sum_{i=1}^{m}\alpha_{i}y_{i}$$
$$\begin{cases}\dfrac{\partial f}{\partial w} = 0\\[4pt]\dfrac{\partial f}{\partial b} = 0\end{cases}\;\Rightarrow\;\begin{cases}w = \sum_{i=1}^{m}\alpha_{i}y_{i}x_{i}\\[4pt]0 = \sum_{i=1}^{m}\alpha_{i}y_{i}\end{cases}$$
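The two stationarity conditions can be sanity-checked with finite differences (a sketch on random data; the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 2
Xs = rng.normal(size=(m, d))          # random sample points
ys = rng.choice([-1.0, 1.0], size=m)  # random labels
alpha = rng.uniform(0, 1, size=m)     # fixed multipliers

def f(w, b):
    # the inner objective: 1/2 ||w||^2 + sum_i alpha_i (1 - y_i (w^T x_i + b))
    return 0.5 * w @ w + alpha @ (1 - ys * (Xs @ w + b))

w0, b0 = rng.normal(size=d), 0.3
grad_w = w0 - (alpha * ys) @ Xs   # analytic: w - sum_i alpha_i y_i x_i
grad_b = -alpha @ ys              # analytic: -sum_i alpha_i y_i

eps = 1e-6
num_w = np.array([(f(w0 + eps * np.eye(d)[k], b0) -
                   f(w0 - eps * np.eye(d)[k], b0)) / (2 * eps) for k in range(d)])
num_b = (f(w0, b0 + eps) - f(w0, b0 - eps)) / (2 * eps)
print(np.allclose(grad_w, num_w, atol=1e-5), np.isclose(grad_b, num_b, atol=1e-5))
```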
Substitute the solution back to eliminate w:
$$f = \frac{1}{2}\left(\sum_{i=1}^{m}\alpha_{i}y_{i}x_{i}\right)^{T}\left(\sum_{j=1}^{m}\alpha_{j}y_{j}x_{j}\right) + \sum_{i=1}^{m}\alpha_{i}\left(1 - y_{i}\left(\left(\sum_{j=1}^{m}\alpha_{j}y_{j}x_{j}\right)^{T}x_{i} + b\right)\right)$$
$$= \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j} + \sum_{i=1}^{m}\alpha_{i} - \sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j} - b\underbrace{\sum_{i=1}^{m}\alpha_{i}y_{i}}_{=\,0}$$
$$= \sum_{i=1}^{m}\alpha_{i} - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j}$$
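The algebraic simplification can be verified numerically: for any $\alpha \ge 0$ with $\sum_i\alpha_i y_i = 0$ and $w = \sum_i\alpha_i y_i x_i$, the Lagrangian value matches the simplified dual objective for any $b$ (a sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 6, 2
Xs = rng.normal(size=(m, d))
ys = np.array([1., 1., 1., -1., -1., -1.])
alpha = rng.uniform(0.1, 1.0, size=m)
# rescale the negative-class multipliers so that sum_i alpha_i y_i = 0
alpha[ys == -1] *= alpha[ys == 1].sum() / alpha[ys == -1].sum()

w = (alpha * ys) @ Xs                 # w = sum_i alpha_i y_i x_i
b = 0.7                               # arbitrary: its term vanishes
lagrangian = 0.5 * w @ w + alpha @ (1 - ys * (Xs @ w + b))

G = (ys[:, None] * Xs) @ (ys[:, None] * Xs).T  # G[i,j] = y_i y_j x_i^T x_j
dual = alpha.sum() - 0.5 * alpha @ G @ alpha
print(np.isclose(lagrangian, dual))   # the two expressions agree
```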
This is the SVM dual form:
$$L(\alpha) = \max_{\alpha}\left\{\sum_{i=1}^{m}\alpha_{i} - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j}\right\}$$

$$\mathrm{s.t.}\quad \sum_{i=1}^{m}\alpha_{i}y_{i} = 0,\quad \alpha_{i} \ge 0$$
How is it solved? The dual is a quadratic program, so it can be handed to an optimizer such as Gurobi.
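As a solver-agnostic sketch first (assuming SciPy is available; SLSQP handles the equality constraint and the bounds), the same QP can be solved on a tiny toy dataset:

```python
import numpy as np
from scipy.optimize import minimize

# toy linearly separable data: two points per class
Xs = np.array([[0., 0.], [1., 0.], [3., 3.], [4., 3.]])
ys = np.array([-1., -1., 1., 1.])
m = len(ys)
G = (ys[:, None] * Xs) @ (ys[:, None] * Xs).T  # G[i,j] = y_i y_j x_i^T x_j

def neg_dual(alpha):  # minimize the negated dual objective
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, np.ones(m),
               bounds=[(0, None)] * m,
               constraints={'type': 'eq', 'fun': lambda a: a @ ys})
alpha = res.x
w = (alpha * ys) @ Xs                 # recover w = sum_i alpha_i y_i x_i
sv = alpha > 1e-4                     # support vectors: nonzero alpha
b = np.mean(ys[sv] - Xs[sv] @ w)      # recover b from the support vectors
print(w, b)
```

All training points should end up on the correct side with margin at least 1.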
Note: in the Gurobi code below, the range of each $\alpha_i$ is set to $(0, 1000)$, which coincidentally makes it a soft-margin SVM.
Hard-margin SVM dual constraint: $0 \le \alpha_i$

Soft-margin SVM dual constraint: $0 \le \alpha_i \le C$

The range of $\alpha_i$ is exactly the only difference between the soft-margin and hard-margin SVM dual forms.
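For context (a standard derivation not carried out in this note): the box constraint comes from the soft-margin primal with slack variables $\xi_i$,

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|_{2}^{2} + C\sum_{i=1}^{m}\xi_{i}\quad \mathrm{s.t.}\quad y_{i}\left(w^{T}x_{i} + b\right) \ge 1 - \xi_{i},\ \ \xi_{i} \ge 0.$$

Stationarity of its Lagrangian with respect to $\xi_i$ gives $C - \alpha_i - \mu_i = 0$ with multipliers $\mu_i \ge 0$, hence $\alpha_i \le C$.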
Solving the SVM Dual Form with Gurobi
from gurobipy import *
import numpy as np
# import iris
from sklearn.datasets import load_iris
iris = load_iris()
X_raw = iris.data
y_raw = iris.target
Turn the iris dataset into a binary classification problem, keeping only the first two feature dimensions of x for easier visualization.
# keep iris classes 0 and 1 as training data, first two features only
X = X_raw[y_raw != 2][:, :2]
y = y_raw[y_raw != 2]
# convert y==0 to y==-1
y[y == 0] = -1
y
y
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
X.shape,y.shape
((100, 2), (100,))
# plot X,y, with different colors for class 1 and class 2
import matplotlib.pyplot as plt
plt.scatter(X[y==-1,0],X[y==-1,1],c='r',marker='o',label='class 1')
plt.scatter(X[y==1,0],X[y==1,1],c='b',marker='x',label='class 2')
Build the Gurobi model
model = Model()
m = X.shape[0]
Define the dual variables $\alpha_i,\ i = 1, \dots, m$
a = {}
for i in range(m):
    name = 'a_' + str(i) + '_'
    a[i] = model.addVar(0, 1000, vtype=GRB.CONTINUOUS, name=name)
Set the objective ($\max$): $\mathrm{obj} = \sum_{i=1}^{m}\alpha_{i} - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j}$
obj = LinExpr(0)
for i in range(m):
    obj.addTerms(1, a[i])
for i in range(m):
    for j in range(m):
        obj = obj + (-0.5 * y[i] * y[j] * (X[i] @ X[j]) * a[i] * a[j])
model.setObjective(obj, GRB.MAXIMIZE)
Add the constraint $\sum_{i=1}^{m}\alpha_i y_i = 0$

lhs = LinExpr(0)
for i in range(m):
    lhs.addTerms(y[i], a[i])
model.addConstr(lhs == 0)
Add the constraints $\alpha_i \ge 0$ (redundant here, since the variables' lower bounds are already 0):

for i in range(m):
    lhs = LinExpr(0)
    lhs.addTerms(1, a[i])
    model.addConstr(lhs >= 0)
# set model time limit
model.Params.TimeLimit = 60
Solve
model.optimize()
Gurobi Optimizer version 9.5.1 build v9.5.1rc2 (win64)
Thread count: 8 physical cores, 16 logical processors, using up to 16 threads
Optimize a model with 101 rows, 100 columns and 101 nonzeros
Model fingerprint: 0x531e14cb
Model has 5050 quadratic objective terms
Coefficient statistics:
Matrix range [1e+00, 1e+00]
Objective range [1e+00, 1e+00]
QObjective range [3e+01, 1e+02]
Bounds range [1e+03, 1e+03]
RHS range [0e+00, 0e+00]
Presolve removed 101 rows and 1 columns
Presolve time: 0.00s
Presolved: 0 rows, 99 columns, 0 nonzeros
Presolved model has 4950 quadratic objective terms
Ordering time: 0.00s
Barrier statistics:
Free vars : 2
AA' NZ : 1.000e+00
Factor NZ : 3.000e+00
Factor Ops : 5.000e+00 (less than 1 second per iteration)
Threads : 1
Objective Residual
...
Barrier solved model in 14 iterations and 0.01 seconds (0.00 work units)
Optimal objective 3.98490889e+03
for v in model.getVars():
    print('%s %g' % (v.varName, v.x))
print('Obj: %g' % model.objVal)
a_0_ 8.66013e-11
a_1_ 192.72
a_2_ 1.11011e-10
a_3_ 1.3588e-10
a_4_ 5.60845e-11
a_5_ 4.88591e-11
a_6_ 5.23398e-11
a_7_ 1.00946e-10
a_8_ 2.46651e-10
a_9_ 7.19769e-10
a_10_ 8.36457e-11
a_11_ 6.78189e-11
a_12_ 3.88936e-09
a_13_ 9.53111e-11
a_14_ 6.77632e-11
a_15_ 3.0376e-11
a_16_ 4.88591e-11
a_17_ 8.66013e-11
a_18_ 1.14911e-10
a_19_ 4.23708e-11
a_20_ 6.43613e-10
a_21_ 5.11846e-11
a_22_ 3.9366e-11
a_23_ 2.84358e-10
a_24_ 6.78189e-11
...
a_97_ 4.33557e-11
a_98_ 7.59785e-11
a_99_ 0
Obj: 3984.91
Inspect all nonzero $\alpha$
for v in model.getVars():
    if v.x > 1e-8:
        print('%s %g' % (v.varName, v.x))
print('Obj: %g' % model.objVal)
a_1_ 192.72
a_25_ 1000
a_41_ 1000
a_84_ 823.764
a_85_ 1000
Obj: 3984.91
How do we recover b? We only need to know which points are the support vectors 👇
https://stats.stackexchange.com/questions/362046/how-is-bias-term-calculated-in-svms
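The reason support vectors suffice is complementary slackness from the KKT conditions:

$$\alpha_{i} > 0 \;\Rightarrow\; y_{i}\left(w^{T}x_{i} + b\right) = 1 \;\Rightarrow\; b = y_{i} - w^{T}x_{i}\quad (\text{using } y_{i}^{2} = 1).$$

For the soft-margin dual, this holds exactly only for $0 < \alpha_i < C$; averaging b over such support vectors is numerically more stable.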
Collect all support vectors
# find support vectors x_i (those whose alpha are non-zero)
support_vectors = []
for i in range(m):
    if a[i].x > 1e-8:
        support_vectors.append(X[i])
support_vectors
support_vectors
[array([4.9, 3. ]),
array([5., 3.]),
array([4.5, 2.3]),
array([5.4, 3. ]),
array([6. , 3.4])]
Compute $b = y_{sv} - w^{T}x_{sv}$ from the support vectors (run the w computation below first, and index the labels by the support vectors' positions in the dataset, not by their position in the `support_vectors` list):

# calculate b ( b = y_i - w^T x_i, averaged over support vectors )
sv_idx = [i for i in range(m) if a[i].x > 1e-8]
b = 0
for i in sv_idx:
    b += y[i] - (w @ X[i])
b /= len(sv_idx)  # average
Compute $w = \sum_{i=1}^{m}\alpha_{i}y_{i}x_{i}$
w = np.zeros(2)
for i, v in enumerate(model.getVars()):
    w += v.x * y[i] * X[i]
w,b
(array([ 4. , -6.86666667]), -1.451999999981188)
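As a cross-check (a sketch, assuming scikit-learn is available): sklearn's `SVC` with a linear kernel and `C=1000` solves the same soft-margin problem on this data, and its hyperplane should be comparable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X2 = iris.data[iris.target != 2][:, :2]   # classes 0/1, first two features
y2 = np.where(iris.target[iris.target != 2] == 0, -1, 1)

clf = SVC(kernel='linear', C=1000).fit(X2, y2)
print(clf.coef_[0], clf.intercept_[0])    # comparable to (w, b) above
```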
Plot
# plot w^T x + b =0 and data
xx = np.linspace(3,7,100)
yy = -(w[0]*xx + b)/w[1]
plt.plot(xx,yy,'r')
# plot dataset
plt.scatter(X[y==-1,0],X[y==-1,1],c='r',marker='o',label='class 1')
plt.scatter(X[y==1,0],X[y==1,1],c='b',marker='x',label='class 2')
(Figures: the resulting decision boundaries for C = 1000, C = 1, and C = 0.2.)
Observed conclusion: according to the formula

$$\min_{f}\ \Omega(f) + C\sum_{i=1}^{m}\ell\left(f(x_{i}), y_{i}\right)$$

the left term is the structural risk (guards against overfitting); the right term is the empirical risk (fits the dataset).
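For the SVM specifically, $\Omega(f) = \frac{1}{2}\|w\|_{2}^{2}$ and $\ell$ is the hinge loss, so the regularized-risk form reads

$$\min_{w,b}\ \frac{1}{2}\|w\|_{2}^{2} + C\sum_{i=1}^{m}\max\left(0,\ 1 - y_{i}\left(w^{T}x_{i} + b\right)\right),$$

which is equivalent to the soft-margin primal with slacks $\xi_{i} = \max\left(0,\ 1 - y_{i}\left(w^{T}x_{i} + b\right)\right)$.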
As C decreases, the model fits the dataset less and less tightly (i.e., it behaves more and more like a soft-margin SVM).
Here, C is adjusted via the bounds on $\alpha_i$ in the Gurobi model setup:

a = {}
for i in range(m):
    name = 'a_' + str(i) + '_'
    a[i] = model.addVar(0, 1000, vtype=GRB.CONTINUOUS, name=name)  # adjust the upper bound here (C=1000)