This post continues from the previous one, Machine Learning (4): Understanding and Implementing Support Vector Machines (Part 1), covering the SMO algorithm and its code implementation.
Sequential Minimal Optimization (SMO)
In the previous post we converted the original optimization problem into its dual, ultimately reducing it to the following constrained program:
$$\min_{\alpha}L(w,b,\alpha)=\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_jy_iy_jK(x_i,x_j)-\sum_{i=1}^{N}\alpha_i\\
s.t.\quad \sum_{i=1}^{N}\alpha_iy_i=0\\
0\leq\alpha_i\leq C,\quad i=1,2,\dots,N$$
The previous post ended with a maximization problem; here we negate the objective to turn it into the minimization above. In practice, the number of parameters to solve for grows in proportion to the amount of training data, and algorithms that optimize all of them jointly become extremely slow and inefficient. To address this, Platt proposed the SMO algorithm in 1998. Its basic idea: at each step select two of the parameters, hold all the others fixed, reduce the original problem to a two-variable optimization, solve it by setting the gradient to zero, and keep updating pairs of parameters until all of them satisfy the KKT conditions, at which point we have the optimal solution.
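For concreteness, the objective above is straightforward to evaluate with numpy once the kernel matrix is precomputed. This is purely an illustrative sketch; the helper name dual_objective and its arguments are mine, not part of the training code later in this post:

import numpy as np

def dual_objective(alpha, y, K):
    # 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij - sum_i alpha_i
    ay = alpha * y                    # elementwise products alpha_i * y_i
    return 0.5 * ay @ K @ ay - alpha.sum()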
Let's walk through the optimization in detail. Without loss of generality, suppose the two selected variables are $\alpha_1$ and $\alpha_2$. Dropping the terms that involve neither $\alpha_1$ nor $\alpha_2$ (the other parameters are fixed and act as known constants, so those terms have zero derivative) gives the subproblem SMO needs to optimize:
$$\min\limits_{\alpha_1,\alpha_2}W(\alpha_1,\alpha_2)=\frac{1}{2}K_{11}\alpha_1^2+\frac{1}{2}K_{22}\alpha_2^2+y_1\alpha_1\sum_{i=3}^{N}y_i\alpha_iK_{i1}+y_2\alpha_2\sum_{i=3}^{N}y_i\alpha_iK_{i2}+y_1\alpha_1y_2\alpha_2K_{12}-\alpha_1-\alpha_2\\
s.t.\quad \alpha_1y_1+\alpha_2y_2=-\sum_{i=3}^{N}\alpha_iy_i=\varsigma\\
0\leq\alpha_i\leq C,\quad i=1,2,\dots,N$$
where $K_{ij}=K(x_i,x_j)$. Multiplying both sides of the equality constraint $\alpha_1y_1+\alpha_2y_2=-\sum_{i=3}^{N}\alpha_iy_i=\varsigma$ by $y_1$ gives $\alpha_1=y_1(\varsigma-\alpha_2y_2)$. Letting $v_j=\sum_{i=3}^{N}y_i\alpha_iK_{ij}$ and substituting $v_1$, $v_2$, and $\alpha_1$ into the objective yields a function of $\alpha_2$ alone:
$$W(\alpha_2)=\frac{1}{2}K_{11}(\varsigma-\alpha_2y_2)^2+\frac{1}{2}K_{22}\alpha_2^2+(\varsigma-\alpha_2y_2)v_1+y_2\alpha_2v_2+(\varsigma-\alpha_2y_2)y_2\alpha_2K_{12}-y_1(\varsigma-\alpha_2y_2)-\alpha_2$$
To find the minimum, differentiate the objective with respect to $\alpha_2$:
$$\frac{\partial W(\alpha_2)}{\partial \alpha_2}=\alpha_2(K_{11}+K_{22}-2K_{12})+\varsigma y_2(K_{12}-K_{11})+y_2(v_2-v_1)+y_1y_2-1$$
For an input $x$, the model's predicted output is $f(x)=\sum_{i=1}^{N}y_i\alpha_iK(x_i,x)+b$, and we can express $v_1$ and $v_2$ in terms of $f(x)$:
$$v_1=f(x_1)-b-y_1\alpha_1K_{11}-y_2\alpha_2K_{12}\\
v_2=f(x_2)-b-y_1\alpha_1K_{12}-y_2\alpha_2K_{22}$$
We already established that $\alpha_1=y_1(\varsigma-\alpha_2y_2)$; substituting it into the expressions above gives:
$$v_1-v_2=f(x_1)-f(x_2)+y_2\alpha_2(K_{11}+K_{22}-2K_{12})+\varsigma(K_{12}-K_{11})$$
Because the prediction $f(x)$ is computed with the parameters before the update, denote the pre-update parameters by $\alpha^{old}$ and $b^{old}$ and the updated parameters we are solving for by $\alpha^{new}$ and $b^{new}$; then $v_1-v_2$ can be written entirely in terms of the old parameters. Also define the error between the prediction and the true label as $E_i=f(x_i)-y_i$. Substituting $v_1-v_2$ into the gradient gives:
$$\begin{aligned}\frac{\partial W(\alpha_2)}{\partial \alpha_2}&=\alpha_2(K_{11}+K_{22}-2K_{12})-\alpha_{2}^{old}(K_{11}+K_{22}-2K_{12})+y_2(y_1-f(x_1))-y_2(y_2-f(x_2))\\&=\alpha_2(K_{11}+K_{22}-2K_{12})-\alpha_{2}^{old}(K_{11}+K_{22}-2K_{12})-y_2(E_1-E_2)\end{aligned}$$
Setting the gradient to zero gives the update for $\alpha_2$:
$$\alpha_2^{new}=\alpha_2^{old}+\frac{y_2(E_1-E_2)}{\eta},\quad \eta=K_{11}+K_{22}-2K_{12}$$
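As a quick sanity check of this update rule, here is a tiny standalone computation; every number below is made up purely for illustration:

# Hypothetical values purely to illustrate
# alpha2_new = alpha2_old + y2 * (E1 - E2) / eta.
k11, k22, k12 = 1.0, 1.0, 0.5          # Gaussian kernel values lie in (0, 1]
eta = k11 + k22 - 2 * k12              # eta = 1.0
alpha2_old, y2 = 0.3, 1.0
E1, E2 = -0.8, 0.4                     # prediction errors E_i = f(x_i) - y_i
alpha2_new = alpha2_old + y2 * (E1 - E2) / eta
print(alpha2_new)                      # 0.3 + (-1.2) / 1.0 = -0.9

The result violates $\alpha_2\geq 0$, which is exactly why the clipping step below is needed.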
We now have an update for $\alpha_2$, but it was derived without the box constraint $0\leq\alpha_i\leq C$, so the value must be clipped to satisfy it. Rename the unconstrained solution $\alpha_2^{new}$ as $\alpha_2^{new,unc}$, and let the clipped result be $\alpha_2^{new}$. Given the constraints and the values of $y$, the feasible range of $\alpha_2^{new}$ can be illustrated by the following two figures:
The left figure shows the case $y_1\neq y_2$: the constraint is a line segment with slope one, and its minimum and maximum are:
$$L=\max(0,\alpha_2^{old}-\alpha_1^{old}),\quad H=\min(C,C+\alpha_2^{old}-\alpha_1^{old})$$
When $y_1=y_2$, as shown in the right figure, the minimum and maximum are:
$$L=\max(0,\alpha_2^{old}+\alpha_1^{old}-C),\quad H=\min(C,\alpha_2^{old}+\alpha_1^{old})$$
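Both cases fold into one small helper; this is a sketch, and the name compute_bounds is mine rather than something from the implementation below:

def compute_bounds(alpha1_old, alpha2_old, y1, y2, C):
    # Feasible interval [L, H] for alpha2 along the constraint line.
    if y1 != y2:   # alpha1 - alpha2 is constant (the slope-one segment)
        L = max(0, alpha2_old - alpha1_old)
        H = min(C, C + alpha2_old - alpha1_old)
    else:          # alpha1 + alpha2 is constant
        L = max(0, alpha2_old + alpha1_old - C)
        H = min(C, alpha2_old + alpha1_old)
    return L, H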
Combining these bounds with the unconstrained solution, the final expression for $\alpha_2^{new}$ is:
$$\alpha_2^{new}= \left\{ \begin{array}{lr} L,&\alpha_2^{new,unc}<L\\ \alpha_2^{new,unc},&L \leq \alpha_2^{new,unc} \leq H\\ H,&\alpha_2^{new,unc}>H \end{array} \right.$$
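The piecewise rule translates directly into a small helper; a minimal sketch (clip_alpha2 is an illustrative name of my own):

def clip_alpha2(alpha2_new_unc, L, H):
    # Clip the unconstrained optimum into the feasible interval [L, H].
    if alpha2_new_unc < L:
        return L
    if alpha2_new_unc > H:
        return H
    return alpha2_new_unc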
From $\alpha_2^{new}$ we can then recover $\alpha_1^{new}$:
$$\alpha_1^{new}=\alpha_1^{old}+y_1y_2(\alpha_2^{old}-\alpha_2^{new})$$
With the updated $\alpha$ in hand we can compute $b^{new}$. When $0<\alpha_1^{new}<C$, the KKT conditions give:
$$\sum_{i=1}^{N}\alpha_iy_iK_{i1}+b=y_1$$
Since nothing other than $\alpha_1$ and $\alpha_2$ has changed:
$$b_1^{new}=y_1-\sum_{i=3}^{N}\alpha_iy_iK_{i1}-\alpha_1^{new}y_1K_{11}-\alpha_2^{new}y_2K_{21}$$
The first two terms on the right-hand side can be expressed using $E_1$:
$$y_1-\sum_{i=3}^{N}\alpha_iy_iK_{i1}=-E_1+\alpha_1^{old}y_1K_{11}+\alpha_2^{old}y_2K_{21}+b^{old}$$
Substituting back yields:
$$b_1^{new}=-E_1-y_1K_{11}(\alpha_1^{new}-\alpha_1^{old})-y_2K_{21}(\alpha_2^{new}-\alpha_2^{old})+b^{old}$$
Similarly, when $0<\alpha_2^{new}<C$:
$$b_2^{new}=-E_2-y_1K_{12}(\alpha_1^{new}-\alpha_1^{old})-y_2K_{22}(\alpha_2^{new}-\alpha_2^{old})+b^{old}$$
When $0<\alpha_i^{new}<C$ holds for both $i=1,2$, $b_1^{new}$ and $b_2^{new}$ are equal. If $\alpha_1^{new}$ and $\alpha_2^{new}$ both take the value $0$ or $C$, then $b_1^{new}$, $b_2^{new}$, and every value between them satisfy the KKT conditions, and we can take their average; see the sketch below.
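A minimal sketch of this selection rule (select_b is an illustrative name; the same logic appears inline in train() further down):

def select_b(b1_new, b2_new, alpha1_new, alpha2_new, C):
    # An alpha strictly inside (0, C) pins b down exactly through its KKT
    # condition; otherwise any value between b1 and b2 is valid, so take
    # the average.
    if 0 < alpha1_new < C:
        return b1_new
    if 0 < alpha2_new < C:
        return b2_new
    return (b1_new + b2_new) / 2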
Code Implementation
Now let's implement SMO training for an SVM in code. We can model the SVM as a class, work out which values are needed to solve for $\alpha_1^{new}$ and $\alpha_2^{new}$, and split the required functionality into separate methods that are called as needed, i.e., modularize it. The solution process breaks down roughly into five modules: (1) the kernel value between every pair of samples, which is used repeatedly, so it can be computed once up front and cached; (2) the error between the predicted and true values, which must be kept up to date during training, so we store all samples' errors in a vector and refresh it after each update; (3) how to pick $\alpha_1$ and $\alpha_2$, i.e., which two parameters to update next (for details, see the selection method described in Li Hang's book); (4) training the SVM model; (5) predicting on the test set. The full implementation follows:
import random

import numpy as np

# Note: Model (the base class that loads the CSV dataset) and calTime (a
# timing decorator) come from the accompanying code base for this series.

class SVM(Model):
    """SVM with a Gaussian kernel, trained by the SMO algorithm."""

    def __init__(self, C=200, trainfile='../mnist_train.csv', sigma=10):
        super(SVM, self).__init__(trainfile)
        self.C = C
        self.sigma = sigma
        self.n, self.m = self.data.shape  # (samplenum, featurenum)
        self.alpha = np.zeros(self.n)
        # self.alpha = np.random.random(self.n)
        self.b = 0
        self.E = np.zeros(self.n)
        self.g = np.zeros(self.n)
        self.data, self.label = self.preprocess(self.data, self.label)
        # precompute the kernel value between every pair of training samples
        self.kernel = self.calGaussianKernel(self.data, self.data)
        self.cal_E()

    def calGaussianKernel(self, data1, data2):
        # ||x1 - x2||^2 = x1·x1 + x2·x2 - 2 x1·x2, computed for all pairs at once
        n1 = data1.shape[0]
        n2 = data2.shape[0]
        data1_1 = np.sum(data1 * data1, axis=-1, keepdims=True)  # (data1·data1)
        data2_2 = np.sum(data2 * data2, axis=-1, keepdims=True)  # (data2·data2)
        data1_2 = np.dot(data1, data2.T)
        data = np.tile(data1_1, (1, n2)) + np.tile(data2_2.T, (n1, 1)) - 2 * data1_2
        data = np.exp(-data / (2 * self.sigma ** 2))
        return data

    def cal_E(self):
        # g(x_i) = sum_j alpha_j y_j K(x_j, x_i) + b, and E_i = g(x_i) - y_i
        self.g = np.dot(self.alpha * self.label, self.kernel) + self.b
        self.E = self.g - self.label

    def IsSatisfyKKT(self, i):
        if self.alpha[i] == 0 and self.label[i] * self.g[i] >= 1:
            return True
        if 0 < self.alpha[i] < self.C and self.label[i] * self.g[i] == 1:
            return True
        if self.alpha[i] == self.C and self.label[i] * self.g[i] <= 1:
            return True
        return False

    def getAlpha2(self, i):
        # pick the j that maximizes |E_i - E_j|; fall back to a random j != i
        maxdiff = 0
        index = i
        for j in range(self.n):
            if abs(self.E[i] - self.E[j]) > maxdiff:
                maxdiff = abs(self.E[i] - self.E[j])
                index = j
        while index == i:
            index = random.randint(0, self.n - 1)
        return index

    @calTime
    def train(self, maxiter=50):
        it = 0
        stopiter = False
        while it < maxiter and not stopiter:
            stopiter = True
            it += 1
            for i in range(self.n):
                if not self.IsSatisfyKKT(i):
                    j = self.getAlpha2(i)
                    k11 = self.kernel[i][i]
                    k22 = self.kernel[j][j]
                    k12 = self.kernel[i][j]
                    E1 = self.E[i]
                    E2 = self.E[j]
                    y1 = self.label[i]
                    y2 = self.label[j]
                    alpha1_old = self.alpha[i]
                    alpha2_old = self.alpha[j]
                    b_old = self.b
                    eta = k11 + k22 - 2 * k12
                    if eta <= 0:
                        continue  # guard against dividing by zero for duplicate samples
                    alpha2_new = alpha2_old + y2 * (E1 - E2) / eta
                    if y1 == y2:
                        L = max(0, alpha2_old + alpha1_old - self.C)
                        H = min(self.C, alpha2_old + alpha1_old)
                    else:
                        L = max(0, alpha2_old - alpha1_old)
                        H = min(self.C, self.C + alpha2_old - alpha1_old)
                    if L == H:
                        continue
                    # clip alpha2 into [L, H]
                    if alpha2_new < L:
                        alpha2_new = L
                    elif alpha2_new > H:
                        alpha2_new = H
                    alpha1_new = alpha1_old + y1 * y2 * (alpha2_old - alpha2_new)
                    b1_new = -E1 - y1 * k11 * (alpha1_new - alpha1_old) - y2 * k12 * (alpha2_new - alpha2_old) + b_old
                    b2_new = -E2 - y2 * k22 * (alpha2_new - alpha2_old) - y1 * k12 * (alpha1_new - alpha1_old) + b_old
                    if 0 < alpha1_new < self.C:
                        b_new = b1_new
                    elif 0 < alpha2_new < self.C:
                        b_new = b2_new
                    else:
                        b_new = (b1_new + b2_new) / 2
                    self.b = b_new
                    self.alpha[i] = alpha1_new
                    self.alpha[j] = alpha2_new
                    self.cal_E()
                    if abs(alpha2_new - alpha2_old) >= 0.00001:
                        stopiter = False
        print('iterated {} times'.format(it))

    def preprocess(self, data, label):
        label[label > 0] = 1
        label[label == 0] = -1
        # Don't write `data /= 255` on integer data: it raises
        # TypeError: No loop matching the specified signature and casting
        # was found for ufunc true_divide. Alternatively, cast data to float first.
        data = data / 255
        return data, label

    @calTime
    def predict(self, data):
        kernel = self.calGaussianKernel(self.data, data)
        # include the bias term: f(x) = sum_i alpha_i y_i K(x_i, x) + b
        g = np.dot(self.alpha * self.label, kernel) + self.b
        return np.sign(g)
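A minimal usage sketch, assuming the Model base class exposes the loaded data as shown above and that test_data/test_label have been prepared the same way as the training set (these names and the accuracy computation are illustrative, not from the original code):

# Hypothetical driver code.
svm = SVM(C=200, trainfile='../mnist_train.csv', sigma=10)
svm.train(maxiter=50)
preds = svm.predict(test_data)        # test_data: (num_samples, num_features), scaled to [0, 1]
acc = np.mean(preds == test_label)    # test_label: entries in {-1, +1}
print('accuracy: {:.4f}'.format(acc))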
As an aside, let me record a pitfall I ran into. At first the model's prediction accuracy was stuck at zero no matter how I preprocessed the data or adjusted the model's parameters (in hindsight that was the clue: if the output never changes regardless of the parameters, something outside the model code must be wrong), and auditing the algorithm's code changed nothing either. Only when I finally printed the predictions did I discover the output was None, even though my predict function clearly has a return value. Then I realized the decorator was the culprit: the decorator I had originally written didn't return anything, so predict always returned None. This was the first time I used a decorator in my own code, and sure enough there is a long way between thinking you know something and actually knowing it. I'll clean up the full code and put it on github later; corrections are welcome~
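For reference, a timing decorator has to pass the wrapped function's return value through. Here is a minimal sketch of what calTime presumably needs to look like; this is my reconstruction of the fix, not the original code:

import time
from functools import wraps

def calTime(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)   # capture the return value...
        print('{} took {:.2f}s'.format(func.__name__, time.time() - start))
        return result                    # ...and return it, or callers get None
    return wrapper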