Support Vector Machine

SVM

Q: Why a hyperplane?
A: A dataset normally has more than 2 dimensions. When dim = 2, we need a (2-1)-dimensional line to separate the data; when dim = 3, we need a (3-1)-dimensional plane. In general, we need an (n-1)-dimensional hyperplane to separate n-dimensional data.

Q: Which hyperplane is the best choice for classification?
[Figure: several candidate separating hyperplanes]

A: The middle one, because it has the widest margin of tolerance, the highest robustness, and therefore the best generalization ability.

Hyperplane

The hyperplane is given by $\omega^T x + b = 0$; we use $(\omega, b)$ as the notation for a hyperplane.
Assuming $(\omega, b)$ classifies the training samples correctly, we have:
$$
\text{if } y_i = +1,\ \omega^T x_i + b > 0; \qquad \text{if } y_i = -1,\ \omega^T x_i + b < 0
$$
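A tiny sketch (my own toy numbers, not from the post): classify points by the sign of $\omega^T x + b$ for a hand-picked hyperplane $\omega=(1,1)$, $b=-3$.

import numpy as np

w = np.array([1.0, 1.0])   # hand-picked hyperplane parameters (illustrative only)
b = -3.0
for x, y_true in [(np.array([2.0, 3.0]), +1), (np.array([1.0, 1.0]), -1)]:
    pred = +1 if w @ x + b > 0 else -1     # sign of w^T x + b decides the class
    print(x, pred, pred == y_true)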

Margin

Maximize the margin for robustness.
The distance to the hyperplane is the same for every point on the dashed lines; the hyperplane can be located by just these few points, which is why they are called 'support vectors'.
Maximize the margin by finding the corresponding $\omega$ and $b$:
$$
\arg\max_{\omega,b}\frac{2}{\|\omega\|}
$$
Transform the maximization into an equivalent minimization:
$$
\begin{aligned}
&\arg\min_{\omega,b}\ \frac{1}{2}\|\omega\|^2\\
&\text{s.t. } y_i(\omega^T x_i + b) \geq 1,\quad i=1,2,\dots,m.
\end{aligned}
$$
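As a quick illustration (my addition, using sklearn rather than solving the program by hand): after fitting a nearly hard-margin linear SVC (large C), the width of the margin the optimizer maximised is $2/\|\omega\|$.

import numpy as np
from sklearn import svm

X = [[2, 0], [1, 1], [2, 3]]
y = [0, 0, 1]
clf = svm.SVC(kernel='linear', C=1e6).fit(X, y)   # large C ~ hard margin

w = clf.coef_[0]
print(2 / np.linalg.norm(w))   # geometric margin 2 / ||w||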

Model

To solve it, introduce Lagrange multipliers:
$$
L(\omega,b,\alpha)=\frac{1}{2}\|\omega\|^2-\sum_{i=1}^{m}\alpha_i\left(y_i(\omega^T x_i+b)-1\right)
$$
Setting $\frac{\partial L}{\partial \omega}=\frac{\partial L}{\partial b}=0$ gives
$$
\omega=\sum_{i=1}^{m}\alpha_i y_i x_i,\qquad \sum_{i=1}^{m}\alpha_i y_i=0.
$$
Substituting these back into $L$ eliminates $\omega$ and $b$ and yields the dual problem below.

Dual Problem

$$
\begin{aligned}
\min_{\alpha}\ &\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_j y_i y_j x_i^T x_j-\sum_{i=1}^m\alpha_i\\
\text{s.t. } &\sum_{i=1}^m\alpha_i y_i=0,\quad \alpha_i\geq0,\ i=1,2,\dots,m
\end{aligned}
$$
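A minimal sketch (my own, not the post's code) of solving this dual directly with scipy's SLSQP solver on a tiny 2-D dataset, then recovering $\omega$ and $b$ from the optimal $\alpha$:

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 0.0], [1.0, 1.0], [2.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0])                   # labels must be +/-1 here
G = (y[:, None] * X) @ (y[:, None] * X).T         # G[i,j] = y_i y_j x_i^T x_j

def objective(alpha):
    return 0.5 * alpha @ G @ alpha - alpha.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ y},)  # sum_i alpha_i y_i = 0
bnds = [(0, None)] * len(y)                       # alpha_i >= 0
res = minimize(objective, np.zeros(len(y)), bounds=bnds, constraints=cons)

alpha = res.x
w = (alpha * y) @ X                               # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                                 # support vectors have alpha > 0
b = np.mean(y[sv] - X[sv] @ w)                    # b recovered from the support vectors
print(alpha, w, b)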

Classification model

Here $x$ is the test sample:
$$
f(x)=\omega^T x+b=\sum_{i=1}^m\alpha_i y_i x_i^T x+b
$$
The solution must satisfy the KKT (Karush-Kuhn-Tucker) conditions:
$$
\left\{
\begin{aligned}
&\alpha_i\geq0\\
&y_i f(x_i)\geq1\\
&\alpha_i\bigl(y_i f(x_i)-1\bigr)=0
\end{aligned}
\right.
$$
Sparsity of the solution: most training samples do not need to be kept after training; the final model depends only on the support vectors. This is one reason the SVM does not overfit easily.

from sklearn import svm

# Three 2-D training points with labels 0 / 1
X = [[2, 0], [1, 1], [2, 3]]
y = [0, 0, 1]

clf = svm.SVC(kernel='linear')
clf.fit(X, y)

print(clf.support_vectors_)   # array([[1., 1.],
                              #        [2., 3.]])
print(clf.support_)           # array([1, 2])  -- indices of the support vectors
print(clf.predict([[2, 0]]))  # array([0])
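A small check of the formulas above (my addition): for a linear SVC, clf.dual_coef_ stores $\alpha_i y_i$ for the support vectors only, so $\omega = \sum_i \alpha_i y_i x_i$ can be rebuilt from the support vectors alone, which is exactly the sparsity property noted above.

import numpy as np
from sklearn import svm

clf = svm.SVC(kernel='linear').fit([[2, 0], [1, 1], [2, 3]], [0, 0, 1])
w_rebuilt = clf.dual_coef_ @ clf.support_vectors_   # sum_i alpha_i y_i x_i
print(w_rebuilt, clf.coef_)                          # identical for a linear kernel
print(np.allclose(w_rebuilt, clf.coef_))             # True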
import numpy as np
import pylab as pl
from sklearn import svm

np.random.seed(0)
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
Y = [0] * 20 + [1] * 20
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)

# hyperplane: w0*x0 + w1*x1 + b = 0  =>  x1 = -(w0/w1)*x0 - b/w1
w = clf.coef_[0]          # weight vector w
a = -w[0] / w[1]          # slope of the separating line
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]  # intercept_ is the bias b

# parallel margin lines through the first and the last support vector
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = clf.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])

print("w: ", w)
print("a: ", a)
print("support_vectors_: ", clf.support_vectors_)
print("clf.coef_: ", clf.coef_)

pl.plot(xx, yy, 'k-')
pl.plot(xx, yy_down, 'k--')
pl.plot(xx, yy_up, 'k--')
pl.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
           s=80, facecolors='none')
pl.scatter(X[:, 0], X[:, 1], c=Y, cmap=pl.cm.Paired)

pl.axis('tight')
pl.show()

w:  [0.90230696 0.64821811]
a:  -1.391980476255765
support_vectors_:  [[-1.02126202  0.2408932 ]
 [-0.46722079 -0.53064123]
 [ 0.95144703  0.57998206]]
clf.coef_:  [[0.90230696 0.64821811]]

[Figure: separating hyperplane (solid), margin lines (dashed), and circled support vectors produced by the code above]

Kernel

Linear Separability

Q: What if no hyperplane exists that separates the samples into two classes?
A: Map the original space into a higher-dimensional feature space in which the samples become linearly separable.
Dual Problem

$$
\begin{aligned}
\min_{\alpha}\ &\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_j y_i y_j \phi(x_i)^T\phi(x_j)-\sum_{i=1}^m\alpha_i\\
\text{s.t. } &\sum_{i=1}^m\alpha_i y_i=0,\quad \alpha_i\geq0,\ i=1,2,\dots,m
\end{aligned}
$$

Classification model

$$
f(x)=\omega^T\phi(x)+b=\sum_{i=1}^m\alpha_i y_i \phi(x_i)^T\phi(x)+b
$$
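To see this formula in code (a sketch of my own, not from the post): with an RBF kernel, the decision function of a fitted SVC is a weighted sum of kernel evaluations against the support vectors; clf.dual_coef_ holds $\alpha_i y_i$ and clf.intercept_ holds $b$.

import numpy as np
from sklearn import svm
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.5]])
y = [0, 0, 1, 1]
clf = svm.SVC(kernel='rbf', gamma=0.5).fit(X, y)

x_new = np.array([[2.5, 2.5]])
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.5)   # K(x_i, x_new) for each support vector
manual = clf.dual_coef_ @ K + clf.intercept_             # sum_i alpha_i y_i K(x_i, x) + b
print(manual.ravel(), clf.decision_function(x_new))      # same value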

Kernel trick

For the feature map $\phi(x)=(x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2)$:
$$
\begin{aligned}
K(x,z)&=\phi(x)\cdot\phi(z)\\
&=x_1^2 z_1^2+2x_1x_2z_1z_2+x_2^2 z_2^2\\
&=(x_1z_1+x_2z_2)^2\\
&=\left(\begin{bmatrix}x_1\\x_2\end{bmatrix}\cdot\begin{bmatrix}z_1\\z_2\end{bmatrix}\right)^2\\
&=(x\cdot z)^2
\end{aligned}
$$
Computing $K(x,z)$ directly is often faster than computing the feature transformation and then the inner product.
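A quick numerical check (my own sketch) that the explicit feature map and the kernel give the same number, without ever constructing $\phi$:

import numpy as np

def phi(v):
    # explicit feature map for the quadratic kernel: (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))   # inner product in the feature space
print((x @ z) ** 2)      # kernel evaluated in the original space -> same value (1.0)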

Mercer theorem

As long as the kernel matrix corresponding to a symmetric function is positive semidefinite, the function can be used as a kernel function.
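An illustrative check (my addition): build the Gram matrix of the quadratic kernel $(x_i^T x_j)^2$ on a few random points and confirm that it is symmetric and positive semidefinite, as Mercer's condition requires.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
K = (X @ X.T) ** 2                 # Gram matrix K_ij = (x_i . x_j)^2
eigvals = np.linalg.eigvalsh(K)    # real eigenvalues of the symmetric matrix
print(np.all(eigvals >= -1e-10))   # True: K is positive semidefinite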

Common kernel functions:

Linear kernel:
$$K(x_i,x_j)=x_i^T x_j$$
Polynomial kernel:
$$K(x_i,x_j)=(x_i^T x_j)^d$$
Gaussian radial basis function (RBF) kernel:
$$K(x_i,x_j)=\exp\!\left(-\frac{\|x_i-x_j\|^2}{2\sigma^2}\right)$$
Sigmoid kernel:
$$K(x_i,x_j)=\tanh(\beta x_i^T x_j+\theta)$$
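These four kernels map directly onto sklearn's SVC kernel parameter; below is a minimal sketch of my own on toy data (note that sklearn parametrises the RBF kernel with gamma, which plays the role of $1/(2\sigma^2)$):

from sklearn import svm

X = [[0, 0], [1, 1], [2, 2], [3, 3.5]]
y = [0, 0, 1, 1]

for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    clf = svm.SVC(kernel=kernel, degree=2, gamma='scale', coef0=0.0)
    clf.fit(X, y)
    print(kernel, clf.predict([[2.5, 2.5]]))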

Example: the breast_cancer dataset with SVMs using different kernels

from sklearn import datasets
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn import svm
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
import numpy as np

# Plotting helper: draw the learning curve of a given estimator
def plot_learning_curve(estimator, title, X_data, y_target, ylim=None, cv=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 20)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X_data, y_target, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt


# Load the breast cancer dataset
data_cancer = datasets.load_breast_cancer()
X = data_cancer.data
y = data_cancer.target
print(X, y)
print("====================")
# Split into training and test samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
# Initialize the SVM classifier (default RBF kernel)
svc_classifier = svm.SVC()
# Train the SVM classifier on the training data
svc_classifier.fit(X_train, y_train)
'''
# Predict the test samples
y_predict = svc_classifier.predict(X_test)
print(y_predict)
'''
# Evaluate the model on the test samples
svc_accuracy = svc_classifier.score(X_test, y_test)
print("svc_accuracy:%f %%" % (100 * svc_accuracy))
plot_learning_curve(svc_classifier, "SVM", X, y, (0.87, 0.95), cv=cv, n_jobs=1)
# Linear kernel (no nonlinear kernel function)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
title = 'Learning Curves (C=1.0, kernel=linear, degree=2) '
estimator = svm.SVC(C=1.0, kernel='linear', degree=2)
SVC_2=estimator
plot_learning_curve(estimator, title, X, y, (0.9, 1.0), cv=cv, n_jobs=1)
plt.show()

cv.get_n_splits(X, y)
for train_index, test_index in cv.split(X, y):
    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]
SVC_2.fit(X_train, y_train)
train_score = SVC_2.score(X_train, y_train)
test_score = SVC_2.score(X_test, y_test)
print('train score: {0}; test score: {1}'.format(train_score, test_score))
[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 0 0 0 0 0 1]
====================
svc_accuracy:92.397661 %

[Figures: learning curves for the default RBF-kernel SVC and for the linear-kernel SVC]

train score: 0.9604395604395605; test score: 0.9649122807017544