There are many off-the-shelf packages that can be called directly, but to better understand the internals of the algorithm, I decided to implement it myself using only simple tools such as numpy.
The support vector machine is a classic algorithm; here is a brief recap. For details, see Li Hang's *Statistical Learning Methods* or other related books.
For linearly separable data, one can derive a convex quadratic program in w. Taking the Lagrangian dual of this primal problem yields an optimization problem in the Lagrange multipliers, which is again a convex quadratic program.
For data that is not linearly separable, the points that break separability are called outliers. Since there are usually not many outliers, a slack variable is introduced to absorb each one, together with a penalty term weighted by C. This again gives a convex quadratic program in w, now including the penalty on the slack variables. Deriving its dual shows it differs only slightly from the separable case: in the constraints, the Lagrange multipliers simply gain an upper bound C.
Introducing the kernel trick, the inner products of x are replaced by a kernel function, implicitly mapping the data into a high-dimensional space; the details are omitted here.
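For reference, the soft-margin dual problem that the code below solves, with the kernel substituted for the inner product, is:

```latex
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
 - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C .
```

Dropping the upper bound C recovers the separable (hard-margin) case.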
Now for the code.
First, the data. For convenience, the iris dataset is loaded from sklearn.datasets and reduced to two classes, then split into training and test sets.
train_test_split and normalize are hand-written functions that split the data and normalize the features.
def normalize(X, axis=-1, order=2):
    """ Normalize the dataset X """
    l2 = np.atleast_1d(np.linalg.norm(X, order, axis))
    l2[l2 == 0] = 1
    return X / np.expand_dims(l2, axis)
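A quick sanity check of the row-wise L2 normalization (a usage sketch of the function above):

```python
import numpy as np

def normalize(X, axis=-1, order=2):
    """ Normalize the dataset X (row-wise L2 by default) """
    l2 = np.atleast_1d(np.linalg.norm(X, order, axis))
    l2[l2 == 0] = 1
    return X / np.expand_dims(l2, axis)

# A 3-4-5 triangle row normalizes to unit length.
X = np.array([[3.0, 4.0]])
print(normalize(X))  # → [[0.6 0.8]]
```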
def train_test_split(X, y, test_size=0.5, shuffle=True, seed=None):
    """ Split the data into train and test sets """
    if shuffle:
        X, y = shuffle_data(X, y, seed)
    # Split the training data from test data in the ratio specified in test_size
    split_i = len(y) - int(len(y) // (1 / test_size))
    X_train, X_test = X[:split_i], X[split_i:]
    y_train, y_test = y[:split_i], y[split_i:]
    return X_train, X_test, y_train, y_test
It calls the shuffle_data function:
def shuffle_data(X, y, seed=None):
    if seed:
        np.random.seed(seed)
    idx = np.arange(X.shape[0])
    np.random.shuffle(idx)
    return X[idx], y[idx]
Then:
data = datasets.load_iris()
X = normalize(data.data[data.target != 0])
y = data.target[data.target != 0]
y[y == 1] = -1
y[y == 2] = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=False)
Next is the main SVM part. The layout follows sklearn's style: a SupportVectorMachine class is defined, holding a few hyperparameters plus the Lagrange multipliers, the support vectors, the hyperplane's intercept, and so on.
def __init__(self, C=1, kernel=rbf_kernel, power=4, gamma=None, coef=4):
    self.C = C
    self.kernel = kernel
    self.power = power  # the power of the polynomial kernel
    self.gamma = gamma
    self.coef = coef
    self.lagr_multipliers = None
    self.support_vectors = None
    self.support_vector_labels = None
    self.intercept = None
Then the fit function:
# gamma defaults to 1 / n_features
if not self.gamma:
    self.gamma = 1 / n_features
Compute the kernel matrix:
self.kernel = self.kernel(
    power=self.power,
    gamma=self.gamma,
    coef=self.coef)
kernel_matrix = np.zeros((n_samples, n_samples))
for i in range(n_samples):
    for j in range(n_samples):
        kernel_matrix[i, j] = self.kernel(X[i], X[j])
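The double loop makes O(n²) Python-level calls. For the RBF kernel the same matrix can be built with broadcasting, using the identity ||xᵢ − xⱼ||² = ||xᵢ||² + ||xⱼ||² − 2 xᵢ·xⱼ (a sketch, not the class's actual code):

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """ Pairwise RBF Gram matrix via broadcasting """
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0))  # clip tiny negatives from rounding

X = np.random.rand(6, 3)
gamma = 0.5
K_fast = rbf_kernel_matrix(X, gamma)

# Check against the explicit double loop from the text.
K_loop = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        K_loop[i, j] = np.exp(-gamma * np.linalg.norm(X[i] - X[j]) ** 2)
print(np.allclose(K_fast, K_loop))  # → True
```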
The following code defines three common kernel functions:
def linear_kernel(**kwargs):
    def f(x1, x2):
        return np.inner(x1, x2)  # inner product
    return f

def polynomial_kernel(power, coef, **kwargs):
    def f(x1, x2):
        return (np.inner(x1, x2) + coef)**power
    return f

def rbf_kernel(gamma, **kwargs):
    '''
    K(x, y) = exp(-γ ||x - y||²), γ > 0
    '''
    def f(x1, x2):
        distance = np.linalg.norm(x1 - x2) ** 2
        return np.exp(-gamma * distance)
    return f
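A quick check of the three kernel factories on small vectors:

```python
import numpy as np

def linear_kernel(**kwargs):
    def f(x1, x2):
        return np.inner(x1, x2)
    return f

def polynomial_kernel(power, coef, **kwargs):
    def f(x1, x2):
        return (np.inner(x1, x2) + coef)**power
    return f

def rbf_kernel(gamma, **kwargs):
    def f(x1, x2):
        return np.exp(-gamma * np.linalg.norm(x1 - x2) ** 2)
    return f

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(linear_kernel()(x1, x2))                     # → 11.0  (3 + 8)
print(polynomial_kernel(power=2, coef=1)(x1, x2))  # → 144.0 ((11 + 1)^2)
print(rbf_kernel(gamma=1.0)(x1, x1))               # → 1.0   (identical inputs)
```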
In the end, training an SVM comes down to solving a quadratic program; sklearn reportedly uses the SMO algorithm. Here the qp solver from the cvxopt package is used instead.
That means providing the corresponding P, q, G, h, A, b: the quadratic coefficients, the linear coefficients, the inequality constraints, and the equality constraints.
The setup depends on whether the penalty C is used:
'''
G, h are the inequality constraints; A, b are the equality constraints.
P is the quadratic coefficient matrix; q is the linear coefficient vector.
'''
P = cvxopt.matrix(np.outer(y, y) * kernel_matrix, tc='d')  # np.outer is the outer product: entry (i, j) is y[i] * y[j]
q = cvxopt.matrix(np.ones(n_samples) * -1)
A = cvxopt.matrix(y, (1, n_samples), tc='d')
b = cvxopt.matrix(0, tc='d')
if not self.C:
    G = cvxopt.matrix(np.identity(n_samples) * -1)
    h = cvxopt.matrix(np.zeros(n_samples))
else:
    G_max = np.identity(n_samples) * -1
    G_min = np.identity(n_samples)
    G = cvxopt.matrix(np.vstack((G_max, G_min)))
    h_max = cvxopt.matrix(np.zeros(n_samples))
    h_min = cvxopt.matrix(np.ones(n_samples) * self.C)
    h = cvxopt.matrix(np.vstack((h_max, h_min)))
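Before handing these to cvxopt it is worth checking the shapes: with the box constraint active, G stacks to (2n, n) and h to (2n,), encoding −αᵢ ≤ 0 and αᵢ ≤ C (a numpy-only sketch with a placeholder Gram matrix):

```python
import numpy as np

n_samples, C = 4, 2.0
y = np.array([1.0, -1.0, 1.0, -1.0])
kernel_matrix = np.eye(n_samples)  # placeholder Gram matrix for the shape check

P = np.outer(y, y) * kernel_matrix           # (n, n) quadratic term
q = -np.ones(n_samples)                      # (n,)  linear term
G = np.vstack((-np.eye(n_samples),           # -alpha_i <= 0
               np.eye(n_samples)))           #  alpha_i <= C
h = np.hstack((np.zeros(n_samples),
               C * np.ones(n_samples)))
print(P.shape, G.shape, h.shape)  # → (4, 4) (8, 4) (8,)
```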
Then solve:
minimization = cvxopt.solvers.qp(P, q, G, h, A, b)
The solution is the set of optimal Lagrange multipliers:
lagr_mult = np.ravel(minimization['x'])
The multipliers strictly greater than zero correspond to the support vectors; a small threshold is used to approximate "greater than zero":
idx = lagr_mult > 1e-7
self.lagr_multipliers = lagr_mult[idx]
self.support_vectors = X[idx]
self.support_vector_labels = y[idx]
Compute the intercept b:
self.intercept = self.support_vector_labels[0]
for i in range(len(self.lagr_multipliers)):
    self.intercept -= self.lagr_multipliers[i] * self.support_vector_labels[i] * \
        self.kernel(self.support_vectors[i], self.support_vectors[0])
Next is prediction: check which side of the decision surface each test point falls on.
def predict(self, X):
    y_pred = []
    for sample in X:
        # accumulate the kernelized decision value, then take its sign
        prediction = 0
        for i in range(len(self.lagr_multipliers)):
            prediction += self.lagr_multipliers[i] * self.support_vector_labels[i] * \
                self.kernel(self.support_vectors[i], sample)
        prediction += self.intercept
        y_pred.append(np.sign(prediction))
    return np.array(y_pred)
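In other words, predict evaluates the sign of the kernelized decision function, summing over the support vectors:

```latex
f(x) = \operatorname{sign}\Big(\sum_{i \in SV} \alpha_i \, y_i \, K(x_i, x) + b\Big)
```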
Finally, let's see how it performs against sklearn's SVM, both using a polynomial kernel with power=4 and coef=1.
# classify with the hand-written SVM
start1 = time.time()
clf1 = SupportVectorMachine(C=2, kernel=polynomial_kernel, power=4, coef=1)
clf1.fit(X_train, y_train)
y_pred1 = clf1.predict(X_test)
end1 = time.time()
elasp1 = end1 - start1
accuracy1 = accuracy_score(y_test, y_pred1)
print("My_Accuracy:", accuracy1, "Time: ", elasp1)
# use sklearn's SVM; it has many more parameters than the hand-written version
start2 = time.time()
clf2 = SVC(C=2, kernel='poly',degree=4, coef0=1)
clf2.fit(X_train, y_train)
y_pred2 = clf2.predict(X_test)
end2 = time.time()
elasp2 = end2 - start2
accuracy2 = accuracy_score(y_test, y_pred2)
print("Sklearn_Accuracy:", accuracy2, "Time: ", elasp2)
The output:
My_Accuracy: 0.818181818182 Time: 0.01800084114074707
Sklearn_Accuracy: 0.0 Time: 0.0010001659393310547
sklearn's SVM failed completely. The split above used shuffle=False, so the test set consists entirely of the last samples, i.e. a single class, which likely explains the 0.0 accuracy. Setting shuffle to True and rerunning:
My_Accuracy: 0.909090909091 Time: 0.020000934600830078
Sklearn_Accuracy: 0.939393939394 Time: 0.0
My_Accuracy: 0.939393939394 Time: 0.018001079559326172
Sklearn_Accuracy: 0.636363636364 Time: 0.0
My_Accuracy: 0.939393939394 Time: 0.02000117301940918
Sklearn_Accuracy: 0.818181818182 Time: 0.0010001659393310547
My_Accuracy: 0.878787878788 Time: 0.02000117301940918
Sklearn_Accuracy: 0.909090909091 Time: 0.0009999275207519531
My_Accuracy: 0.969696969697 Time: 0.020000934600830078
Sklearn_Accuracy: 1.0 Time: 0.0010001659393310547
My_Accuracy: 0.878787878788 Time: 0.019001007080078125
Sklearn_Accuracy: 1.0 Time: 0.0010001659393310547
My_Accuracy: 0.939393939394 Time: 0.021001100540161133
Sklearn_Accuracy: 0.909090909091 Time: 0.0010001659393310547
My_Accuracy: 0.939393939394 Time: 0.02000117301940918
Sklearn_Accuracy: 0.878787878788 Time: 0.0009999275207519531
My_Accuracy: 0.878787878788 Time: 0.02000117301940918
Sklearn_Accuracy: 0.909090909091 Time: 0.0009999275207519531
My_Accuracy: 0.969696969697 Time: 0.019001007080078125
Sklearn_Accuracy: 0.878787878788 Time: 0.0010001659393310547
My_Accuracy: 1.0 Time: 0.019001007080078125
Sklearn_Accuracy: 1.0 Time: 0.0
My_Accuracy: 0.939393939394 Time: 0.022001266479492188
Sklearn_Accuracy: 0.878787878788 Time: 0.0010001659393310547
After all these runs, it is clear that sklearn's optimization really is better — its running time is far shorter. On accuracy, the two trade wins.