SVM
Q: Why a hyperplane?
A: Datasets usually have more than two dimensions. When dim = 2, a (2 - 1)-dimensional line separates the data; when dim = 3, a (3 - 1)-dimensional plane does. In general, n-dimensional data is separated by an (n - 1)-dimensional hyperplane.
Q: Which hyperplane is best for classification?
A: The one in the middle, between the two classes. It tolerates the widest range of perturbations in the samples, so it is the most robust and generalizes best.
Hyperplane
The equation of the hyperplane is $\omega^Tx+b=0$; we use $(\omega,b)$ to denote the hyperplane.
Assume $(\omega,b)$ classifies the training samples correctly. Then, for every training sample $(x_i,y_i)$:
$$
\begin{aligned}
&\text{if } y_i=+1,\ \omega^Tx_i+b>0;\\
&\text{if } y_i=-1,\ \omega^Tx_i+b<0
\end{aligned}
$$
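The sign condition can be checked directly in code. This is a minimal sketch, assuming labels in {-1, +1}; the toy samples, the hyperplane values, and the helper names are made up for illustration.

```python
def decision_value(w, b, x):
    """Return w^T x + b for a single sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classifies_correctly(w, b, samples, labels):
    """True if y_i * (w^T x_i + b) > 0 for every sample, i.e. each point
    lies strictly on the side of the hyperplane that matches its label."""
    return all(y * decision_value(w, b, x) > 0 for x, y in zip(samples, labels))


# Illustrative toy data (assumption): two negative points, one positive point.
samples = [[2, 0], [1, 1], [2, 3]]
labels = [-1, -1, +1]

# Illustrative hyperplane (assumption): the horizontal line x2 = 2.
w, b = [0.0, 1.0], -2.0
print(classifies_correctly(w, b, samples, labels))  # True
```

Both conditions collapse into the single test $y_i(\omega^Tx_i+b)>0$, which is why the helper multiplies by the label.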
Margin
Maximize the margin, because a wider margin makes the classifier more robust.
Every point on the dashed lines is at the same distance from the hyperplane, and the hyperplane is determined by these few points alone. That is why they are called "support vectors".
Maximize the margin by finding the corresponding $\omega$ and $b$:
$$
\arg\max_{\omega,b}\frac{2}{\|\omega\|}
$$
Maximizing $2/\|\omega\|$ is equivalent to minimizing $\frac{1}{2}\|\omega\|^2$, so the maximization can be transformed into a minimization:
$$
\begin{aligned}
&\arg\min_{\omega,b}\frac{1}{2}\|\omega\|^2\\
&\text{s.t.}\ y_i(\omega^Tx_i+b)\geq1,\quad i=1,2,\dots,m.
\end{aligned}
$$
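To make the objective and the constraints concrete, the sketch below compares two candidate hyperplanes on a three-point toy set. The data and both candidates are illustrative assumptions (the second was worked out by hand for this toy set), and the helper names are hypothetical.

```python
import math


def margin_width(w):
    """Width of the margin, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def is_feasible(w, b, samples, labels, tol=1e-9):
    """Check the constraints y_i * (w^T x_i + b) >= 1, up to rounding error."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1 - tol
        for x, y in zip(samples, labels)
    )


# Illustrative toy data (assumption), labels in {-1, +1}.
samples = [[2, 0], [1, 1], [2, 3]]
labels = [-1, -1, +1]

candidate_a = ([0.0, 1.0], -2.0)   # horizontal line x2 = 2
candidate_b = ([0.4, 0.8], -2.2)   # hand-derived for this toy set

for w, b in (candidate_a, candidate_b):
    print(w, b, is_feasible(w, b, samples, labels), round(margin_width(w), 3))
```

Both candidates satisfy every constraint, but the second has the smaller $\|\omega\|$ and therefore the wider margin (about 2.236 against 2.0), so the minimization prefers it.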
Model
To solve it, introduce a Lagrange multiplier $\alpha_i\geq0$ for each constraint:
$$
L(\omega,b,\alpha)=\frac{1}{2}\|\omega\|^2-\sum_{i=1}^{m}\alpha_i\left(y_i(\omega^Tx_i+b)-1\right)
$$
Set the partial derivatives with respect to $\omega$ and $b$ to zero:
$$
\frac{\partial L}{\partial\omega}=\frac{\partial L}{\partial b}=0
$$
which gives
$$
\omega=\sum_{i=1}^{m}\alpha_iy_ix_i,\qquad\sum_{i=1}^{m}\alpha_iy_i=0
$$
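Substituting these two conditions back into the Lagrangian eliminates $\omega$ and $b$. The intermediate steps, written out as a sketch:

$$
\begin{aligned}
L(\omega,b,\alpha)
&=\frac{1}{2}\omega^T\omega-\sum_{i=1}^m\alpha_iy_i\,\omega^Tx_i-b\sum_{i=1}^m\alpha_iy_i+\sum_{i=1}^m\alpha_i\\
&=\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_jx_i^Tx_j-\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_jx_i^Tx_j-b\cdot0+\sum_{i=1}^m\alpha_i\\
&=\sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_jx_i^Tx_j
\end{aligned}
$$

Maximizing this expression over $\alpha_i\geq0$ is the same as minimizing its negative, which is the form the dual problem takes.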
Dual Problem
$$
\begin{aligned}
&\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_jx_i^Tx_j-\sum_{i=1}^m\alpha_i\\
&\text{s.t.}\ \sum_{i=1}^m\alpha_iy_i=0,\quad\alpha_i\geq0,\quad i=1,2,\dots,m
\end{aligned}
$$
Classification model
For a test sample $x$:
$$
f(x)=\omega^Tx+b=\sum_{i=1}^m\alpha_iy_ix_i^Tx+b
$$
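The dual form of the classifier only needs the stored samples, their labels, the multipliers, and the bias. Below is a minimal sketch; the numeric values are hand-derived for a toy problem and are assumptions for illustration, not the output of a solver, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))


def decision_function(x, support_x, support_y, alphas, b):
    """f(x) = sum_i alpha_i * y_i * x_i^T x + b over the stored samples."""
    return sum(a * y * dot(xi, x) for a, y, xi in zip(alphas, support_y, support_x)) + b


def predict(x, support_x, support_y, alphas, b):
    """Class label in {-1, +1}: the sign of f(x)."""
    return 1 if decision_function(x, support_x, support_y, alphas, b) >= 0 else -1


# Hand-derived toy values (assumptions): sample (1, 1) with y = -1 and
# sample (2, 3) with y = +1, both with alpha = 0.4, and bias b = -2.2.
support_x = [[1, 1], [2, 3]]
support_y = [-1, +1]
alphas = [0.4, 0.4]
b = -2.2

print(predict([2, 0], support_x, support_y, alphas, b))  # -1
print(predict([3, 4], support_x, support_y, alphas, b))  # 1
```

The test sample enters only through inner products $x_i^Tx$, which is what later allows a kernel to be substituted.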
The solution must satisfy the KKT (Karush-Kuhn-Tucker) conditions:
$$
\left\{\begin{aligned}
&\alpha_i\geq0\\
&y_if(x_i)\geq1\\
&\alpha_i\left(y_if(x_i)-1\right)=0
\end{aligned}\right.
$$
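The three conditions are easy to test numerically once $\alpha_i$ and the margins $y_if(x_i)$ are known. A small sketch follows; the function name, the tolerance, and the sample values are assumptions for illustration.

```python
def kkt_satisfied(alphas, margins, tol=1e-9):
    """Check the three KKT conditions given alpha_i and m_i = y_i * f(x_i)."""
    for a, m in zip(alphas, margins):
        if a < -tol:                  # alpha_i >= 0
            return False
        if m < 1 - tol:               # y_i f(x_i) >= 1
            return False
        if abs(a * (m - 1)) > tol:    # alpha_i * (y_i f(x_i) - 1) = 0
            return False
    return True


# Illustrative values (assumptions): one sample strictly outside the margin
# with alpha = 0, and two samples exactly on the margin with alpha > 0.
alphas = [0.0, 0.4, 0.4]
margins = [1.4, 1.0, 1.0]
print(kkt_satisfied(alphas, margins))  # True

# A sample off the margin with a non-zero multiplier breaks the third condition.
print(kkt_satisfied([0.1, 0.4, 0.4], margins))  # False
```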
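Sparsity of the solution: by the KKT conditions, every training sample has either $\alpha_i=0$ or $y_if(x_i)=1$. Samples with $\alpha_i=0$ drop out of $f(x)$, so most training samples need not be kept after training, and the final model depends only on the support vectors. This is one reason an SVM tends to be less prone to overfitting.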
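from sklearn import svm

# Three training samples in 2-D and their class labels
X = [[2, 0], [1, 1], [2, 3]]
y = [0, 0, 1]

# Fit an SVM with a linear kernel
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

# Support vectors found during training
print(clf.support_vectors_)
# [[1. 1.]
#  [2. 3.]]

# Indices of the support vectors in X
print(clf.support_)
# [1 2]

# Predict the class of a sample
print(clf.predict([[2, 0]]))
# [0]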
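The fitted model also exposes the dual solution, which ties this example back to $\omega=\sum_i\alpha_iy_ix_i$. The check below is a sketch that relies on scikit-learn's documented attributes `dual_coef_` (the products $\alpha_iy_i$ for the support vectors only), `support_vectors_` and `coef_`. It assumes a linear kernel, since `coef_` is only available in that case.

```python
import numpy as np
from sklearn import svm

X = np.array([[2, 0], [1, 1], [2, 3]])
y = np.array([0, 0, 1])
clf = svm.SVC(kernel="linear").fit(X, y)

# Rebuild the weight vector from the dual coefficients: w = sum_i (alpha_i * y_i) * x_i
w_from_dual = clf.dual_coef_ @ clf.support_vectors_

print("w from dual:", w_from_dual)
print("clf.coef_:  ", clf.coef_)
print("sum of alpha_i * y_i:", clf.dual_coef_.sum())
```

The two weight vectors should agree, and the dual coefficients should sum to approximately zero, matching the constraint $\sum_i\alpha_iy_i=0$.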
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# Two Gaussian clusters of 20 points each, centred at (-2, -2) and (2, 2)
np.random.seed(0)
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
Y = [0] * 20 + [1] * 20

clf = svm.SVC(kernel='linear')
clf.fit(X, Y)

# Hyperplane: w0*x0 + w1*x1 + b = 0, i.e. x1 = -(w0/w1)*x0 - b/w1
w = clf.coef_[0]                          # coef_ holds the weight vector w
a = -w[0] / w[1]                          # slope of the separating line
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]    # intercept_ holds the bias b

# Margin boundaries: lines parallel to the hyperplane through a support vector
sv = clf.support_vectors_[0]
yy_down = a * xx + (sv[1] - a * sv[0])
sv = clf.support_vectors_[-1]
yy_up = a * xx + (sv[1] - a * sv[0])

print("w: ", w)
print("a: ", a)
print("support_vectors_: ", clf.support_vectors_)
print("clf.coef_: ", clf.coef_)

plt.plot(xx, yy, 'k-')
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')
# Circle the support vectors; an explicit edge colour keeps the hollow markers visible
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=80, facecolors='none', edgecolors='k')
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
plt.axis('tight')
plt.show()
w: [0.90230696 0.64821811]
a: -1.391980476255765
support_vectors_: [[-1.02126202 0.2408932 ]
[-0.46722079 -0.53064123]
[ 0.95144703 0.57998206]]
clf.coef_: [[0.90230696 0.64821811]]
Kernel
Linear Separability
Q: What if no hyperplane separates the samples into two classes?
A: Map the original space into a higher-dimensional feature space in which the samples become linearly separable.
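A one-dimensional toy case shows the idea. The data, the feature map $\phi(x)=(x,x^2)$, and the separating line below are all illustrative assumptions; the helper names are hypothetical.

```python
def separable_by_threshold(xs, ys):
    """Brute-force test of 1-D linear separability: is there a threshold t and
    an orientation s in {+1, -1} with y_i * s * (x_i - t) > 0 for all samples?"""
    pts = sorted(xs)
    # Candidate thresholds: below all points, between neighbours, above all points.
    candidates = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    return any(
        all(y * s * (x - t) > 0 for x, y in zip(xs, ys))
        for t in candidates
        for s in (+1, -1)
    )


def phi(x):
    """Hypothetical feature map from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)


xs = [-2, -1, 1, 2]
ys = [-1, +1, +1, -1]   # inner points positive, outer points negative

print(separable_by_threshold(xs, ys))  # False: no single threshold works

# In the mapped space the line -x^2 + 2.5 = 0, i.e. w = (0, -1), b = 2.5,
# puts the inner points on the positive side and the outer points on the negative side.
w, b = (0.0, -1.0), 2.5
mapped_ok = all(y * (w[0] * phi(x)[0] + w[1] * phi(x)[1] + b) > 0 for x, y in zip(xs, ys))
print(mapped_ok)  # True
```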
Dual Problem
$$
\begin{aligned}
&\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)-\sum_{i=1}^m\alpha_i\\
&\text{s.t.}\ \sum_{i=1}^m\alpha_iy_i=0,\quad\alpha_i\geq0,\quad i=1,2,\dots,m
\end{aligned}
$$
Classification model
$$
f(x)=\omega^T\phi(x)+b=\sum_{i=1}^m\alpha_iy_i\phi(x_i)^T\phi(x)+b
$$
Kernel trick
For example, in two dimensions take the feature map $\phi(x)=(x_1^2,\ \sqrt{2}x_1x_2,\ x_2^2)$. Then:
$$
\begin{aligned}
K(x,z)&=\phi(x)\cdot\phi(z)\\
&=x_1^2z_1^2+2x_1x_2z_1z_2+x_2^2z_2^2\\
&=(x_1z_1+x_2z_2)^2\\
&=\left(\begin{bmatrix}x_1\\x_2\end{bmatrix}\cdot\begin{bmatrix}z_1\\z_2\end{bmatrix}\right)^2\\
&=(x\cdot z)^2
\end{aligned}
$$
Computing $K(x,z)$ directly is sometimes faster than computing the feature transformation and then taking the inner product.
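The identity can be verified numerically. The sketch below uses the feature map $\phi(x)=(x_1^2,\sqrt{2}x_1x_2,x_2^2)$, which is the map consistent with the expansion above; the sample vectors are arbitrary.

```python
import math


def phi(x):
    """Explicit feature map for the degree-2 kernel in two dimensions."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def kernel(x, z):
    """K(x, z) = (x . z)^2, computed entirely in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


x, z = (1.0, 2.0), (3.0, 4.0)

# Route 1: map both vectors to 3-D, then take the inner product.
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
# Route 2: evaluate the kernel directly.
direct = kernel(x, z)

print(explicit, direct, math.isclose(explicit, direct))
```

The direct route needs one 2-D inner product and a square, while the explicit route builds two 3-D vectors first. The gap widens quickly as the input dimension and the degree grow.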
Mercer's theorem
A symmetric function can be used as a kernel function as long as its kernel matrix is positive semidefinite for every choice of samples.
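The condition can be probed numerically by building the kernel matrix for a set of points and inspecting its eigenvalues. The sketch below uses NumPy. Passing the check on one sample set is only a necessary condition, since the theorem requires it for every sample set. The sample points and the counterexample function are illustrative assumptions.

```python
import numpy as np


def gram_matrix(kernel, points):
    """Kernel (Gram) matrix with entries K[i, j] = kernel(points[i], points[j])."""
    return np.array([[kernel(p, q) for q in points] for p in points])


def is_positive_semidefinite(K, tol=1e-9):
    """A symmetric matrix is PSD when its smallest eigenvalue is >= 0, up to rounding."""
    return bool(np.linalg.eigvalsh(K).min() >= -tol)


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])


def rbf(p, q):
    """Gaussian kernel with sigma = 1."""
    return np.exp(-np.sum((p - q) ** 2) / 2.0)


def neg_sq_dist(p, q):
    """Symmetric function that is not a valid kernel: minus the squared distance."""
    return -np.sum((p - q) ** 2)


print(is_positive_semidefinite(gram_matrix(rbf, points)))          # True
print(is_positive_semidefinite(gram_matrix(neg_sq_dist, points)))  # False
```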
Kernel functions:
Linear kernel: $K(x_i,x_j)=x_i^Tx_j$
Polynomial kernel: $K(x_i,x_j)=(x_i^Tx_j)^d$
Gaussian radial basis function (RBF) kernel: $K(x_i,x_j)=\exp\left(-\frac{\|x_i-x_j\|^2}{2\sigma^2}\right)$
Sigmoid kernel: $K(x_i,x_j)=\tanh(\beta x_i^Tx_j+\theta)$
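The four kernels translate directly into code. A plain-Python sketch follows; the default parameter values are assumptions chosen for illustration, and the Gaussian kernel is written in its usual form $\exp(-\|x_i-x_j\|^2/(2\sigma^2))$.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(xi, xj):
    return dot(xi, xj)


def polynomial_kernel(xi, xj, d=2):
    return dot(xi, xj) ** d


def gaussian_kernel(xi, xj, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def sigmoid_kernel(xi, xj, beta=1.0, theta=0.0):
    return math.tanh(beta * dot(xi, xj) + theta)


xi, xj = [1, 2], [3, 4]
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(k.__name__, k(xi, xj))
```

Any of these can stand in for the inner product $x_i^Tx_j$ in the dual problem and in the classification model.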
Example: SVMs with different kernels on the breast_cancer dataset
from sklearn import datasets
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn import svm
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
import numpy as np


# Plotting helper
def plot_learning_curve(estimator, title, X_data, y_target, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 20)):
    """Plot mean training and cross-validation scores against training-set size."""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X_data, y_target, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    # Shaded bands: one standard deviation around each mean curve
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
# Load the breast cancer dataset
data_cancer = datasets.load_breast_cancer()
X = data_cancer.data
y = data_cancer.target
print(X, y)
print("====================")

# Split into training and test samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

# Initialize the SVM classifier (default kernel)
svc_classifier = svm.SVC()
# Train the SVM classifier on the training data
svc_classifier.fit(X_train, y_train)

# Predict on the test samples (disabled)
# y_predict = svc_classifier.predict(X_test)
# print(y_predict)

# Evaluate the model on the test samples
svc_accuracy = svc_classifier.score(X_test, y_test)
print("svc_accuracy:%f %%" % (100 * svc_accuracy))
plot_learning_curve(svc_classifier, "SVM", X, y, (0.87, 0.95), cv=cv, n_jobs=1)

# No kernel function (linear kernel); degree only affects the 'poly' kernel
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
title = 'Learning Curves (C=1.0, kernel=linear, degree=2) '
estimator = svm.SVC(C=1.0, kernel='linear', degree=2)
SVC_2 = estimator
plot_learning_curve(estimator, title, X, y, (0.9, 1.0), cv=cv, n_jobs=1)
plt.show()

cv.get_n_splits(X, y)
for train_index, test_index in cv.split(X, y):
    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]

# After the loop the variables hold the last of the 10 splits;
# fit and score once on that split, matching the single line of output below
SVC_2.fit(X_train, y_train)
train_score = SVC_2.score(X_train, y_train)
test_score = SVC_2.score(X_test, y_test)
print('train score: {0}; test score: {1}'.format(train_score, test_score))
[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
[2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
[1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
...
[1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
[2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
[7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0
1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 0 0 0 0 0 0 1]
====================
svc_accuracy:92.397661 %
train score: 0.9604395604395605; test score: 0.9649122807017544