Support Vector Machine (SVM) is a supervised learning algorithm for binary classification.
Parts of this article are adapted from "Machine Learning in Action".
Problem Statement
Suppose we have the following data points on a 2-D plane, and we are asked to separate the differently colored balls with a straight line.
We could separate them like this:
But if we add more balls, the separation above may break down. As shown in the figure below, one of the red balls ends up on the left side.
So what counts as the best possible separation?
The idea of SVM is to place the separating line at the best position, so that the gap on either side of the line is as large as possible. The yellow region in the figure below is this margin.
With this separation, the earlier problem largely goes away. Solving for this line is the linear SVM problem.
But what if the data points are not neatly clustered on either side of a line, but mixed together? We can lift the data into a higher dimension and separate it there.
Viewed back on the 2-D plane, the data is then separated by a curve rather than a straight line. Solving for this boundary is the nonlinear SVM problem.
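The "lift to a higher dimension" idea can be made concrete with a small sketch. The feature map `phi` below is a standard textbook example (not part of the original article's code): it lifts 2-D points into 3-D so that a curved boundary in 2-D becomes a flat one in 3-D, and the dot product in the lifted space equals the degree-2 polynomial kernel.

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-D point:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

# Dot product in the lifted 3-D space...
lifted = np.dot(phi(x), phi(y))
# ...equals the polynomial kernel (x . y)^2 computed directly in 2-D,
# so a kernel lets us work in the high-dimensional space without ever
# constructing phi explicitly (the "kernel trick").
kernel = np.dot(x, y) ** 2
```

This is why nonlinear SVMs only ever need kernel evaluations, never the lifted coordinates themselves.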
SVM Principles
At its core, SVM solves a maximization problem. As shown in the figure below, we look for the w that maximizes the margin measured at the data points closest to the boundary.
Written as a model, the objective function is:
The n data points at this minimum margin are called support vectors, and in three or more dimensions the separating line is called a hyperplane.
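The original equation image is not reproduced here; for reference, the standard hard-margin objective from the SVM literature is:

```latex
\min_{w,\,b} \;\; \frac{1}{2}\|w\|^2
\quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1, \quad i = 1, \dots, n
```

Maximizing the margin $2/\|w\|$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$, which is why the problem is stated as a minimization over w and b.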
Solving the objective above directly is very difficult, but it can be handled with Lagrange multipliers, which transforms the objective into:
subject to the constraints:
If we introduce slack variables, with a penalty constant C controlling how many points are allowed to lie on the wrong side of the separating surface, the constraints become:
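The images for these equations are also not reproduced here; the standard dual problem under Lagrange multipliers $\alpha_i$, with the box constraint introduced by C, is:

```latex
\max_{\alpha} \;\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, x_i^\top x_j
\quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0,
\quad 0 \le \alpha_i \le C
```

These are the $\alpha$ values (`alphas`) and the bounds $[0, C]$ that the SMO code below optimizes pair by pair.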
In practice, the SMO (Sequential Minimal Optimization) algorithm is usually used to solve the SVM problem; compared with solving the Lagrangian directly, SMO is simpler and more efficient.
For nonlinear SVM problems, we need a kernel function to map the original data into a higher-dimensional space. Common kernels include the linear kernel, the polynomial kernel, and the radial basis function (RBF) / Gaussian kernel. For details, see:
https://blog.csdn.net/sunflower_sara/article/details/81228112
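The kernels listed above are simple to write down. A minimal sketch of the three common ones (these helpers are illustrative and not part of the article's implementation below):

```python
import numpy as np

def linear_kernel(x, y):
    # k(x, y) = x . y  -- equivalent to no mapping at all
    return float(np.dot(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # k(x, y) = (x . y + coef0)^degree
    return (float(np.dot(x, y)) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2)  -- the Gaussian kernel
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))
```

In a kernelized SMO, every inner product `dataMatrix[i, :] * dataMatrix[j, :].T` in the code below would be replaced by one of these kernel evaluations.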
A full treatment of SVM involves a great deal of mathematical derivation; a well-written article on the subject is available at:
https://blog.csdn.net/v_july_v/article/details/7624837
Python Implementation
Test data, testSet.txt:
3.542485 1.977398 -1
3.018896 2.556416 -1
7.551510 -1.580030 1
2.114999 -0.004466 -1
8.127113 1.274372 1
7.108772 -0.986906 1
8.610639 2.046708 1
2.326297 0.265213 -1
3.634009 1.730537 -1
0.341367 -0.894998 -1
3.125951 0.293251 -1
2.123252 -0.783563 -1
0.887835 -2.797792 -1
7.139979 -2.329896 1
1.696414 -1.212496 -1
8.117032 0.623493 1
8.497162 -0.266649 1
4.658191 3.507396 -1
8.197181 1.545132 1
1.208047 0.213100 -1
1.928486 -0.321870 -1
2.175808 -0.014527 -1
7.886608 0.461755 1
3.223038 -0.552392 -1
3.628502 2.190585 -1
7.407860 -0.121961 1
7.286357 0.251077 1
2.301095 -0.533988 -1
-0.232542 -0.547690 -1
3.457096 -0.082216 -1
3.023938 -0.057392 -1
8.015003 0.885325 1
8.991748 0.923154 1
7.916831 -1.781735 1
7.616862 -0.217958 1
2.450939 0.744967 -1
7.270337 -2.507834 1
1.749721 -0.961902 -1
1.803111 -0.176349 -1
8.804461 3.044301 1
1.231257 -0.568573 -1
2.074915 1.410550 -1
-0.743036 -1.736103 -1
3.536555 3.964960 -1
8.410143 0.025606 1
7.382988 -0.478764 1
6.960661 -0.245353 1
8.234460 0.701868 1
8.168618 -0.903835 1
1.534187 -0.622492 -1
9.229518 2.066088 1
7.886242 0.191813 1
2.893743 -1.643468 -1
1.870457 -1.040420 -1
5.286862 -2.358286 1
6.080573 0.418886 1
2.544314 1.714165 -1
6.016004 -3.753712 1
0.926310 -0.564359 -1
0.870296 -0.109952 -1
2.369345 1.375695 -1
1.363782 -0.254082 -1
7.279460 -0.189572 1
1.896005 0.515080 -1
8.102154 -0.603875 1
2.529893 0.662657 -1
1.963874 -0.365233 -1
8.132048 0.785914 1
8.245938 0.372366 1
6.543888 0.433164 1
-0.236713 -5.766721 -1
8.112593 0.295839 1
9.803425 1.495167 1
1.497407 -0.552916 -1
1.336267 -1.632889 -1
9.205805 -0.586480 1
1.966279 -1.840439 -1
8.398012 1.584918 1
7.239953 -1.764292 1
7.556201 0.241185 1
9.015509 0.345019 1
8.266085 -0.230977 1
8.545620 2.788799 1
9.295969 1.346332 1
2.404234 0.570278 -1
2.037772 0.021919 -1
1.727631 -0.453143 -1
1.979395 -0.050773 -1
8.092288 -1.372433 1
1.667645 0.239204 -1
9.854303 1.365116 1
7.921057 -1.327587 1
8.500757 1.492372 1
1.339746 -0.291183 -1
3.107511 0.758367 -1
2.609525 0.902979 -1
3.263585 1.367898 -1
2.912122 -0.202359 -1
1.731786 0.589096 -1
2.387003 1.573131 -1
Plotting the data, prepare_data.py:
import matplotlib.pyplot as plt
import numpy as np


def load_data_set(file_name):
    """Read tab-separated rows of "x1 x2 label" into feature and label lists."""
    data_mat = []
    label_mat = []
    with open(file_name) as fr:
        for line in fr.readlines():
            line_arr = line.strip().split('\t')
            data_mat.append([float(line_arr[0]), float(line_arr[1])])
            label_mat.append(float(line_arr[2]))
    return data_mat, label_mat


def show_data_set(data_mat, label_mat):
    """Scatter-plot the two classes in different colors."""
    data_plus = []
    data_minus = []
    for i in range(len(data_mat)):
        if label_mat[i] > 0:
            data_plus.append(data_mat[i])
        else:
            data_minus.append(data_mat[i])
    data_plus_np = np.array(data_plus)
    data_minus_np = np.array(data_minus)
    plt.scatter(np.transpose(data_plus_np)[0], np.transpose(data_plus_np)[1])
    plt.scatter(np.transpose(data_minus_np)[0], np.transpose(data_minus_np)[1])
    plt.show()


if __name__ == '__main__':
    data_mat, label_mat = load_data_set('testSet.txt')
    show_data_set(data_mat, label_mat)
Output:
Simplified SMO algorithm, simple_smo.py:
import matplotlib.pyplot as plt
import numpy as np
import random

import SVM.prepare_data as pd


def selectJrand(i, m):
    """Pick a random index j in [0, m) that differs from i."""
    j = i
    while j == i:
        j = int(random.uniform(0, m))
    return j


def clipAlpha(aj, H, L):
    """Clip alpha_j into the box constraint [L, H]."""
    if aj > H:
        aj = H
    if L > aj:
        aj = L
    return aj


def smoSimple(dataMatIn, classLabels, C, toler, maxIter):
    """Simplified SMO: optimize randomly chosen alpha pairs until no pair
    changes for maxIter consecutive passes over the data."""
    dataMatrix = np.mat(dataMatIn)
    labelMat = np.mat(classLabels).transpose()
    b = 0
    m, n = np.shape(dataMatrix)
    alphas = np.mat(np.zeros((m, 1)))
    iter_num = 0
    while iter_num < maxIter:
        alphaPairsChanged = 0
        for i in range(m):
            # Prediction error for sample i
            fXi = float(np.multiply(alphas, labelMat).T * (dataMatrix * dataMatrix[i, :].T)) + b
            Ei = fXi - float(labelMat[i])
            # Only optimize alphas that violate the KKT conditions
            if ((labelMat[i] * Ei < -toler) and (alphas[i] < C)) or \
                    ((labelMat[i] * Ei > toler) and (alphas[i] > 0)):
                j = selectJrand(i, m)
                fXj = float(np.multiply(alphas, labelMat).T * (dataMatrix * dataMatrix[j, :].T)) + b
                Ej = fXj - float(labelMat[j])
                alphaIold = alphas[i].copy()
                alphaJold = alphas[j].copy()
                # Bounds L and H keep the new alpha_j inside [0, C]
                if labelMat[i] != labelMat[j]:
                    L = max(0, alphas[j] - alphas[i])
                    H = min(C, C + alphas[j] - alphas[i])
                else:
                    L = max(0, alphas[j] + alphas[i] - C)
                    H = min(C, alphas[j] + alphas[i])
                if L == H:
                    print("L==H")
                    continue
                # eta is the second derivative of the objective along the constraint
                eta = 2.0 * dataMatrix[i, :] * dataMatrix[j, :].T \
                    - dataMatrix[i, :] * dataMatrix[i, :].T \
                    - dataMatrix[j, :] * dataMatrix[j, :].T
                if eta >= 0:
                    print("eta>=0")
                    continue
                alphas[j] -= labelMat[j] * (Ei - Ej) / eta
                alphas[j] = clipAlpha(alphas[j], H, L)
                if abs(alphas[j] - alphaJold) < 0.00001:
                    print("alpha_j changed too little")
                    continue
                # Update alpha_i in the opposite direction by the same amount
                alphas[i] += labelMat[j] * labelMat[i] * (alphaJold - alphas[j])
                b1 = b - Ei - labelMat[i] * (alphas[i] - alphaIold) * dataMatrix[i, :] * dataMatrix[i, :].T \
                    - labelMat[j] * (alphas[j] - alphaJold) * dataMatrix[i, :] * dataMatrix[j, :].T
                b2 = b - Ej - labelMat[i] * (alphas[i] - alphaIold) * dataMatrix[i, :] * dataMatrix[j, :].T \
                    - labelMat[j] * (alphas[j] - alphaJold) * dataMatrix[j, :] * dataMatrix[j, :].T
                if (0 < alphas[i]) and (C > alphas[i]):
                    b = b1
                elif (0 < alphas[j]) and (C > alphas[j]):
                    b = b2
                else:
                    b = (b1 + b2) / 2.0
                alphaPairsChanged += 1
                print("iteration %d, sample %d, alpha pairs changed: %d" % (iter_num, i, alphaPairsChanged))
        if alphaPairsChanged == 0:
            iter_num += 1
        else:
            iter_num = 0
        print("iteration count: %d" % iter_num)
    return b, alphas


def showClassifer(dataMat, labelMat, alphas, w, b):
    """Plot the data, the decision boundary, and circle the support vectors."""
    data_plus = []
    data_minus = []
    for i in range(len(dataMat)):
        if labelMat[i] > 0:
            data_plus.append(dataMat[i])
        else:
            data_minus.append(dataMat[i])
    data_plus_np = np.array(data_plus)
    data_minus_np = np.array(data_minus)
    plt.scatter(np.transpose(data_plus_np)[0], np.transpose(data_plus_np)[1], s=30, alpha=0.7)
    plt.scatter(np.transpose(data_minus_np)[0], np.transpose(data_minus_np)[1], s=30, alpha=0.7)
    # Draw the separating line w1*x + w2*y + b = 0 between the x extremes
    x1 = max(dataMat)[0]
    x2 = min(dataMat)[0]
    a1, a2 = w
    b = float(b)
    a1 = float(a1[0])
    a2 = float(a2[0])
    y1, y2 = (-b - a1 * x1) / a2, (-b - a1 * x2) / a2
    plt.plot([x1, x2], [y1, y2])
    # Support vectors are the points with non-zero alphas
    for i, alpha in enumerate(alphas):
        if alpha > 0:
            x, y = dataMat[i]
            plt.scatter([x], [y], s=150, c='none', alpha=0.7, linewidth=1.5, edgecolor='red')
    plt.show()


def get_w(dataMat, labelMat, alphas):
    """Recover w = sum_i alpha_i * y_i * x_i from the optimized alphas."""
    alphas, dataMat, labelMat = np.array(alphas), np.array(dataMat), np.array(labelMat)
    w = np.dot((np.tile(labelMat.reshape(1, -1).T, (1, 2)) * dataMat).T, alphas)
    return w.tolist()


if __name__ == '__main__':
    dataMat, labelMat = pd.load_data_set('testSet.txt')
    b, alphas = smoSimple(dataMat, labelMat, 0.6, 0.001, 40)
    w = get_w(dataMat, labelMat, alphas)
    showClassifer(dataMat, labelMat, alphas, w, b)
Output:
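Once smoSimple and get_w have run, classifying a new point is just the sign of the decision function w . x + b. A minimal sketch (the w and b values here are hypothetical placeholders; the actual values vary from run to run because SMO picks alpha pairs at random):

```python
import numpy as np

# Hypothetical stand-ins for the w and b returned by get_w/smoSimple above
w = np.array([0.8, -0.3])
b = -3.8

def predict(x, w, b):
    # Classify by the sign of the decision function f(x) = w . x + b
    return 1 if float(np.dot(w, x) + b) >= 0 else -1
```

With the testSet.txt data, points with x1 near 8 should come out as class 1 and points with x1 near 2 as class -1.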