Support Vector Machines (SVM)
From its inception, SVM's excellent classification performance won it wide industrial adoption, and for many years it overshadowed the neural-network field. SVM is fundamentally a binary classifier that supports both linear and nonlinear separation; later developments extended it to multi-class problems as well. Multi-class classification is in fact built on binary classification: first split the samples into two groups, then split within those groups, and so on.
The SVM optimization problem is defined as:

$$\max \;\; \frac{1}{||w||_2} \quad s.t. \quad y_i(w^Tx_i + b) \geq 1 \;\;(i = 1,2,...,m)$$
where $y_i(w^Tx_i + b)$ is the functional margin, which by convention is required to be at least 1. After a rescaling transformation, and after using a Lagrangian to convert the constrained optimization objective into an unconstrained one, we obtain the final SVM objective function:
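As a quick sketch (my own reconstruction, not the full derivation promised below), maximizing $\frac{1}{||w||_2}$ is equivalent to minimizing $\frac{1}{2}||w||_2^2$, and the Lagrangian of that rescaled problem is:

```latex
% Equivalent minimization form of the hard-margin problem
\min_{w,b} \;\; \frac{1}{2}||w||_2^2 \quad s.t. \quad y_i(w^Tx_i + b) \geq 1 \;\;(i = 1,2,...,m)

% Lagrangian with multipliers \alpha_i \geq 0
L(w, b, \alpha) = \frac{1}{2}||w||_2^2 - \sum_{i=1}^{m}\alpha_i\left[y_i(w^Tx_i + b) - 1\right]
```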
Intuitively, a support vector machine has two jobs:
- 1. Find the sample points with the smallest distance to the decision boundary (the inner part of the objective function).
- 2. Choose w and b so that the distance from these points to the decision boundary is as large as possible (the outer part of the objective function).
Since typesetting the formulas is time-consuming, I will work through the derivation by hand soon; for now this is a placeholder, or you can consult this blog post.
In practice, however, data are often affected by noise or other sources of uncertainty, and individual outliers can severely hurt SVM's generalization, as shown in the figure below (the outlier narrows the functional margin, reducing generalization):
To handle this, the SVM algorithm introduces slack variables: the functional margin plus the slack variable must be greater than or equal to 1, i.e.

$$y_i(w \bullet x_i + b) \geq 1 - \xi_i$$

Of course, introducing slack variables comes at a cost; to prevent overfitting, a penalty parameter $C$ is added as well, so the final optimization objective function is:
$$L(w,b,\xi,\alpha,\mu) = \frac{1}{2}||w||_2^2 + C\sum\limits_{i=1}^{m}\xi_i - \sum\limits_{i=1}^{m}\alpha_i\left[y_i(w^Tx_i + b) - 1 + \xi_i\right] - \sum\limits_{i=1}^{m}\mu_i\xi_i$$
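For reference (a standard step, sketched here rather than derived in full), setting the partial derivatives of $L$ with respect to $w$, $b$ and $\xi$ to zero gives the stationarity conditions and, after substitution, the dual problem:

```latex
% Stationarity conditions
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m}\alpha_i y_i x_i \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m}\alpha_i y_i = 0 \qquad
\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \mu_i = 0

% Resulting dual problem
\max_{\alpha} \;\; \sum_{i=1}^{m}\alpha_i
  - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j x_i^T x_j
\quad s.t. \quad \sum_{i=1}^{m}\alpha_i y_i = 0, \;\; 0 \leq \alpha_i \leq C
```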
In addition, when the data are not linearly separable, a kernel function is introduced. Its main role is to map the linearly non-separable low-dimensional data into a higher-dimensional space where they become linearly separable. The SVM decision boundary is then no longer a line but a hyperplane in that higher-dimensional space.
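The effect of the kernel trick can be sketched on a synthetic dataset (the data and parameters here are my own illustration, not part of the original experiment): on concentric circles a linear kernel is near chance level, while an RBF kernel separates the classes almost perfectly.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr).score(X_te, y_te)
print(linear_acc, rbf_acc)  # the RBF kernel should score far higher
```

The RBF kernel implicitly maps the points into an infinite-dimensional space, where the inner circle and outer ring become separable by a hyperplane.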
Finally, I validate the SVM algorithm on a real industrial dataset (data from a dense-medium coal preparation process).
import pandas as pd
df = pd.read_excel("./data_chuli_11.xlsx")  # read the Excel sheet with pandas
print(df.shape)
data = df.dropna(axis=0)  # drop rows that contain missing values
print(data.shape)
data.columns = ["v1", "v2", "v3", "v4", "v5", "label"]  # name each column
X = data.loc[:, ["v1", "v2", "v3", "v4", "v5"]]  # the first 5 columns are features
y = data.loc[:, ["label"]]  # the "label" column is the target
# y.values
(1080, 6)
(788, 6)
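For reference, `dropna(axis=0)` removes entire rows containing any NaN, which is why the row count drops from 1080 to 788 above; a toy sketch (the tiny DataFrame here is illustrative, not the real data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"v1": [1.0, np.nan, 3.0], "label": [0, 1, 0]})
clean = toy.dropna(axis=0)  # drop any row with a missing value
print(toy.shape, clean.shape)  # (3, 2) (2, 2)
```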
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
std = StandardScaler()
X = std.fit_transform(X)  # standardize X (zero mean, unit variance)
# reduce the data to two dimensions to make plotting easier
pca = PCA(n_components=2)
X_ = pca.fit_transform(X)
x1 = X_[y["label"]==0]
x2 = X_[y["label"]==1]
x3 = X_[y["label"]==2]
x4 = X_[y["label"]==3]
print(X_.shape)
(788, 2)
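When compressing 5 features down to 2 for plotting, it is worth checking how much variance the projection keeps; `explained_variance_ratio_` reports this per component (shown here on synthetic standardized data, since the original Excel file is not available):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(788, 5))            # stand-in for the real 5-feature data
X_demo = StandardScaler().fit_transform(X_demo)

pca = PCA(n_components=2)
pca.fit(X_demo)
retained = pca.explained_variance_ratio_.sum()  # fraction of variance kept in 2D
print(pca.explained_variance_ratio_, retained)
```

If `retained` is low, the 2D scatter plot may overlap classes that are actually separable in the full 5D space.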
from matplotlib import pyplot as plt
%matplotlib inline
plt.scatter(x1[:, 0], x1[:, 1], c="red")
plt.scatter(x2[:, 0], x2[:, 1], c="blue")
plt.scatter(x3[:, 0], x3[:, 1], c="black")
plt.scatter(x4[:, 0], x4[:, 1], c="yellow")
plt.legend(["bad", "middle", "ok", "optimal"])
plt.show()
Splitting the dataset
from sklearn.model_selection import train_test_split
y = y.astype("int32")["label"].tolist()
x_train, x_test, y_train, y_test = train_test_split(X, y)
print(type(X), type(y))
<class 'numpy.ndarray'> <class 'list'>
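`train_test_split` shuffles randomly by default; passing `random_state` makes the split reproducible, and `stratify` keeps the class proportions of an imbalanced label (such as the four quality grades here) consistent between train and test. A sketch with illustrative toy labels:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

labels = [0] * 60 + [1] * 20 + [2] * 15 + [3] * 5   # imbalanced toy labels
data = list(range(len(labels)))

tr, te, y_tr, y_te = train_test_split(
    data, labels, test_size=0.25, random_state=42, stratify=labels
)
print(Counter(y_te))  # class proportions mirror the full label set
```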
Classification with CART
from sklearn.tree import DecisionTreeClassifier, export_graphviz
clf = DecisionTreeClassifier()
clf.fit(X=x_train, y=y_train)
clf.score(x_test, y_test)
0.7157360406091371
clf.get_params
<bound method BaseEstimator.get_params of DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')>
from IPython.display import Image
import pydotplus
dot_data = export_graphviz(clf, out_file=None, feature_names=['v1', 'v2', 'v3', 'v4', 'v5'], filled=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
dtc = DecisionTreeClassifier()
params = {"max_depth": list(range(1,20))}
gsc = GridSearchCV(dtc, param_grid=params, cv=5)
gsc.fit(x_train, y_train)
gsc.score(x_test, y_test)
0.7208121827411168
gsc.best_estimator_, gsc.best_score_, gsc.best_params_
(DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=17,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'), 0.7309644670050761, {'max_depth': 17})
Classification with SVM (multi-class)
from sklearn.svm import SVC
clf = SVC(decision_function_shape="ovo", kernel="rbf")  # sklearn expects lowercase "ovo"
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
clf.score(x_test, y_test)
0.6649746192893401
Hyperparameter tuning: grid search with 5-fold cross-validation
from sklearn.model_selection import GridSearchCV
params = {"C": [0.1, 1, 10], "gamma": [10, 1, 0.1, 0.01]}
clf = SVC(decision_function_shape="ovo", kernel="rbf")
gsc = GridSearchCV(clf, param_grid=params, cv=5)
gsc.fit(x_train, y_train)
gsc.best_estimator_, gsc.best_params_, gsc.best_score_
(SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovo', degree=3, gamma=1, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False), {'C': 10, 'gamma': 1}, 0.7986463620981388)
Classification accuracy on the test set
print(gsc.score(x_test, y_test))
0.7918781725888325
Hyperparameter tuning, grid search
params = {"C": list(range(5, 20)), "gamma": [10, 1, 0.1, 0.01]}
clf = SVC(decision_function_shape="ovo", kernel="rbf", class_weight="balanced")
gsc = GridSearchCV(clf, param_grid=params, cv=5)
gsc.fit(x_train, y_train)
gsc.best_estimator_, gsc.best_params_, gsc.best_score_
(SVC(C=13, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape='ovo', degree=3, gamma=1, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False), {'C': 13, 'gamma': 1}, 0.8104906937394247)
print(gsc.score(x_test, y_test))
0.7969543147208121
params = {"C": list(range(5, 20)), "gamma": list(range(1, 11))}  # gamma must be positive
clf = SVC(decision_function_shape="ovo", kernel="rbf")
gsc = GridSearchCV(clf, param_grid=params, cv=5)
gsc.fit(x_train, y_train)
gsc.best_estimator_, gsc.best_params_, gsc.best_score_
(SVC(C=16, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovo', degree=3, gamma=1, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False), {'C': 16, 'gamma': 1}, 0.8087986463620981)
print(gsc.score(x_test, y_test))
0.8020304568527918
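An overall accuracy around 0.80 can hide weak per-class performance on imbalanced quality grades; `classification_report` and `confusion_matrix` give a finer-grained view. A sketch on synthetic 4-class data, since the industrial dataset itself is not included here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=788, n_features=5, n_informative=4,
                                     n_redundant=0, n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf = SVC(kernel="rbf", C=10, gamma=1).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

cm = confusion_matrix(y_te, y_pred)
print(cm)                                    # rows: true class, columns: predicted
print(classification_report(y_te, y_pred))   # per-class precision/recall/F1
```

On the real coal-preparation data, this would show which of the four grades ("bad", "middle", "ok", "optimal") the tuned SVM confuses most often.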