向量机的参数,属性及接口
1.探索核函数
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
from time import time
import datetime
data = load_breast_cancer()
X = data.data
y = data.target
X.shape
np.unique(y)
plt.scatter(X[:,0],X[:,1],c=y)
plt.show()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,y,test_size=0.3,random_state=420)
Kernel = ["linear","poly","rbf","sigmoid"]
for kernel in Kernel:
time0 = time()
clf= SVC(kernel = kernel
, gamma="auto"
# , degree = 1
, cache_size=10000#使用计算的内存,单位是MB,默认是200MB
).fit(Xtrain,Ytrain)
print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
print(time()-time0)
但并没有跑出来,模型停留在线性核函数中。
Kernel = ["linear","rbf","sigmoid"]
for kernel in Kernel:
time0 = time()
clf= SVC(kernel = kernel
, gamma="auto"
# , degree = 1
, cache_size=5000
).fit(Xtrain,Ytrain)
print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
print(time()-time0)
----------------------------------------------------------------------
结果(linear核函数结果效果很好)
The accuracy under kernel linear is 0.929825
0.795527458190918
The accuracy under kernel rbf is 0.596491
0.06104254722595215
The accuracy under kernel sigmoid is 0.596491
0.008005142211914062
----------------------------------------------------------------------
Kernel = ["linear","poly","rbf","sigmoid"]
for kernel in Kernel:
time0 = time()
clf= SVC(kernel = kernel
, gamma="auto"
, degree = 1
, cache_size=5000
).fit(Xtrain,Ytrain)
print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
print(time()-time0)
----------------------------------------------------------------------
结果(多项式核函数的运行速度和精度得以提升)
The accuracy under kernel linear is 0.929825
0.8025338649749756
The accuracy under kernel poly is 0.923977
0.14710068702697754
The accuracy under kernel rbf is 0.596491
0.06003713607788086
The accuracy under kernel sigmoid is 0.596491
0.011008739471435547
数据中存在严重的量纲不一的问题,我们预处理数据,并将数据进行标准化处理。
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)#将数据转化为0,1正态分布
data = pd.DataFrame(X)
data.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T#均值很接近,方差为1了
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,y,test_size=0.3,random_state=420)
Kernel = ["linear","poly","rbf","sigmoid"]
for kernel in Kernel:
time0 = time()
clf= SVC(kernel = kernel
, gamma="auto"
, degree = 1
, cache_size=5000
).fit(Xtrain,Ytrain)
print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
print(time()-time0)
----------------------------------------------------------------------
结果(所有的核函数的运行时间大大的缩小)
The accuracy under kernel linear is 0.976608
0.01501321792602539
The accuracy under kernel poly is 0.964912
0.006003141403198242
The accuracy under kernel rbf is 0.970760
0.011005401611328125
The accuracy under kernel sigmoid is 0.953216
0.0060024261474609375
结论:
1、线性核函数,尤其是degree在高次项时计算非常缓慢
2、SVM执行之前,非常推荐进行数据的无量纲化
2.与核函数相关的参数讲解
输入 | 含义 | 参数gramma | 参数degree | 参数coef0 |
---|---|---|---|---|
linear | 线性核 | No | No | No |
poly | 多项式核 | Yes | Yes | Yes |
rbf | 高斯径向基数 | Yes | No | No |
sigmoid | 双曲正切核 | Yes | No | Yes |
score = []
gamma_range = np.logspace(-10, 1, 50) #返回在对数刻度上均匀间隔的数字
for i in gamma_range:
clf = SVC(kernel="rbf",gamma = i,cache_size=5000).fit(Xtrain,Ytrain)
score.append(clf.score(Xtest,Ytest))
print(max(score), gamma_range[score.index(max(score))])
plt.plot(gamma_range,score)
plt.show()
----------------------------------------------------------------------
结果:通过学习曲线,很容易找到rbf最佳的gamma值
0.9766081871345029 0.012067926406393264
from sklearn.model_selection import StratifiedShuffleSplit#用于支持带交叉验证的网格搜索
from sklearn.model_selection import GridSearchCV#带交叉验证的网格搜索
time0 = time()
gamma_range = np.logspace(-10,1,20)
coef0_range = np.linspace(0,5,10)
param_grid = dict(gamma = gamma_range
,coef0 = coef0_range)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=420)#将数据分为5份,5份数据中测试集占30%
grid = GridSearchCV(SVC(kernel = "poly",degree=1,cache_size=5000
,param_grid=param_grid
,cv=cv)
grid.fit(X, y)
print("The best parameters are %s with a score of %0.5f" % (grid.best_params_,
grid.best_score_))
print(time()-time0)
----------------------------------------------------------------------
结果:通过网格搜索,得到poly的值的结果不如rbf和线性
The best parameters are {'coef0': 0.0, 'gamma': 0.18329807108324375} with a score of 0.96959
13.360332727432251
----------------------------------------------------------------------
----------------------------------------------------------------------
#调线性核函数
score = []
C_range = np.linspace(0.01,30,50)
for i in C_range:
clf = SVC(kernel="linear",C=i,cache_size=5000).fit(Xtrain,Ytrain)
score.append(clf.score(Xtest,Ytest))
print(max(score), C_range[score.index(max(score))])
plt.plot(C_range,score)
plt.show()
#换rbf
score = []
C_range = np.linspace(0.01,30,50)
for i in C_range:
clf = SVC(kernel="rbf",C=i,gamma = 0.012742749857031322,cache_size=5000).fit(Xtrain,Ytrain)
score.append(clf.score(Xtest,Ytest))
print(max(score), C_range[score.index(max(score))])
plt.plot(C_range,score)
plt.show()
#进一步细化
score = []
C_range = np.linspace(5,7,50)
for i in C_range:
clf = SVC(kernel="rbf",C=i,gamma =
0.012742749857031322,cache_size=5000).fit(Xtrain,Ytrain)
score.append(clf.score(Xtest,Ytest))
print(max(score), C_range[score.index(max(score))])
plt.plot(C_range,score)
plt.show()
总结:
参数 | 含义 |
---|---|
degree | 整数,默认为3,只适用于核函数为poly的参数 |
gramma | 浮点数,默认为’auto’ |
coef0 | 浮点数,默认为0 |
C | 浮点数,默认为1,可不填,松弛系数的惩罚项系数,如果C较大,能更好分类决策边界,但换来的结果是训练时间将更长 |
参考:CDA课堂,直播课后的个人笔记总结,仅供参考,有不一样的想法的大佬们,请辩证地观看,如果有问题可以在评论区指出我再订正。