Support Vector Machine Series, Part 02

Latest recommended article published on 2022-03-31 18:08:39

Winfred_Bo

Category: Machine Learning. Tags: python, machine learning, support vector machine

Article link: https://blog.csdn.net/Winfred_Bo/article/details/107888672

SVM parameters, attributes, and interfaces

1. Exploring kernel functions
2. Parameters related to the kernel functions

1. Exploring kernel functions

from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
from time import time
import datetime
 
data = load_breast_cancer()
X = data.data
y = data.target
 
X.shape
np.unique(y)
 
plt.scatter(X[:,0],X[:,1],c=y)
plt.show()
 
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,y,test_size=0.3,random_state=420)
 
Kernel = ["linear","poly","rbf","sigmoid"]
 
for kernel in Kernel:
    time0 = time()
    clf= SVC(kernel = kernel
             , gamma="auto"
            # , degree = 1
             , cache_size=10000  # memory for the kernel cache, in MB (default is 200 MB)
            ).fit(Xtrain,Ytrain)
    print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
    print(time()-time0)

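However, this run never finished: the linear kernel's result was printed and then the loop stalled on the polynomial kernel, which is very slow here at its default degree of 3. Dropping poly for now: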

Kernel = ["linear","rbf","sigmoid"]
 
for kernel in Kernel:
    time0 = time()
    clf= SVC(kernel = kernel
             , gamma="auto"
            # , degree = 1
             , cache_size=5000
            ).fit(Xtrain,Ytrain)
    print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
    print(time()-time0)
----------------------------------------------------------------------
Result (the linear kernel performs well):
The accuracy under kernel linear is 0.929825
0.795527458190918
The accuracy under kernel rbf is 0.596491
0.06104254722595215
The accuracy under kernel sigmoid is 0.596491
0.008005142211914062
----------------------------------------------------------------------
Kernel = ["linear","poly","rbf","sigmoid"]
 
for kernel in Kernel:
    time0 = time()
    clf= SVC(kernel = kernel
             , gamma="auto"
             , degree = 1
             , cache_size=5000
            ).fit(Xtrain,Ytrain)
    print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
    print(time()-time0)
----------------------------------------------------------------------
Result (with degree = 1 the polynomial kernel now finishes quickly and reaches good accuracy):
The accuracy under kernel linear is 0.929825
0.8025338649749756
The accuracy under kernel poly is 0.923977
0.14710068702697754
The accuracy under kernel rbf is 0.596491
0.06003713607788086
The accuracy under kernel sigmoid is 0.596491
0.011008739471435547
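
A rough way to see why rbf does so poorly on the raw features is to look at the kernel values themselves. The sketch below is an illustration added to these notes, not part of the course material. It assumes that gamma="auto" resolves to 1 / n_features, and `median_offdiag` is a hypothetical helper. If the typical kernel value between two different samples is essentially 0, every sample looks unrelated to every other one, which would be consistent with the model predicting nearly the same class for all test samples.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
gamma = 1.0 / X.shape[1]  # assumption: this is what gamma="auto" resolves to


def median_offdiag(K):
    """Median kernel value over all pairs of distinct samples (hypothetical helper)."""
    mask = ~np.eye(K.shape[0], dtype=bool)
    return np.median(K[mask])


# rbf kernel matrix on the raw features versus on standardized features
raw_median = median_offdiag(rbf_kernel(X, gamma=gamma))
scaled_median = median_offdiag(
    rbf_kernel(StandardScaler().fit_transform(X), gamma=gamma)
)

print("median rbf value, raw features:         ", raw_median)
print("median rbf value, standardized features:", scaled_median)
```

Features with large units (such as the area columns) dominate the squared distance, so exp(-gamma * distance^2) collapses toward 0 on the raw data.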

The features are on very different scales, so we preprocess the data by standardizing it.

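import pandas as pd
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)  # standardize each feature to zero mean and unit variance
data = pd.DataFrame(X)
data.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T  # every mean is now close to 0 and every std close to 1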
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,y,test_size=0.3,random_state=420)
 
Kernel = ["linear","poly","rbf","sigmoid"]
 
for kernel in Kernel:
    time0 = time()
    clf= SVC(kernel = kernel
             , gamma="auto"
             , degree = 1
             , cache_size=5000
            ).fit(Xtrain,Ytrain)
    print("The accuracy under kernel %s is %f" % (kernel,clf.score(Xtest,Ytest)))
    print(time()-time0)
----------------------------------------------------------------------
Result (every kernel now runs much faster, and rbf and sigmoid rise from 0.596 to above 0.95):
The accuracy under kernel linear is 0.976608
0.01501321792602539
The accuracy under kernel poly is 0.964912
0.006003141403198242
The accuracy under kernel rbf is 0.970760
0.011005401611328125
The accuracy under kernel sigmoid is 0.953216
0.0060024261474609375

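Conclusions:
1. On unstandardized data the linear kernel is slow, and the polynomial kernel at high degree is slower still
2. Standardizing (or otherwise rescaling) the features before running an SVM is strongly recommended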
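
The notes above standardize the full X before splitting. A common variant, sketched here as an assumption rather than as the course's method, is to put the scaler in a pipeline so that it is fitted on the training split only and then applied to the test split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, y, test_size=0.3, random_state=420
)

# fit() standardizes using statistics from Xtrain only;
# score() reuses those statistics to transform Xtest
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto"))
model.fit(Xtrain, Ytrain)

test_accuracy = model.score(Xtest, Ytest)
print("rbf accuracy with in-pipeline scaling: %f" % test_accuracy)
```

The same pipeline object can be passed to cross-validation utilities, which then refit the scaler inside each fold.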

2. Parameters related to the kernel functions

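Input	Meaning	Uses gamma	Uses degree	Uses coef0
linear	Linear kernel	No	No	No
poly	Polynomial kernel	Yes	Yes	Yes
rbf	Gaussian radial basis function kernel	Yes	No	No
sigmoid	Hyperbolic tangent (sigmoid) kernel	Yes	No	Yes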
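
To make the table concrete, the four kernels can be written out for a single pair of vectors. This is a minimal NumPy sketch of the standard formulas; the function names `k_linear`, `k_poly`, `k_rbf`, and `k_sigmoid` are made up for illustration and are not scikit-learn APIs.

```python
import numpy as np


def k_linear(x, z):
    # <x, z>: no tunable kernel parameters
    return x @ z


def k_poly(x, z, gamma, degree, coef0):
    # (gamma * <x, z> + coef0) ** degree
    return (gamma * (x @ z) + coef0) ** degree


def k_rbf(x, z, gamma):
    # exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))


def k_sigmoid(x, z, gamma, coef0):
    # tanh(gamma * <x, z> + coef0)
    return np.tanh(gamma * (x @ z) + coef0)


x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print("linear :", k_linear(x, z))
print("poly   :", k_poly(x, z, gamma=0.5, degree=1, coef0=0.0))
print("rbf    :", k_rbf(x, z, gamma=0.5))
print("sigmoid:", k_sigmoid(x, z, gamma=0.5, coef0=0.0))
```

With degree = 1 and coef0 = 0 the polynomial kernel is just gamma times the linear kernel, which fits the earlier observation that poly with degree = 1 behaves much like linear.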

score = []
gamma_range = np.logspace(-10, 1, 50)  # 50 values evenly spaced on a log scale, from 1e-10 to 1e1
for i in gamma_range:
    clf = SVC(kernel="rbf",gamma = i,cache_size=5000).fit(Xtrain,Ytrain)
    score.append(clf.score(Xtest,Ytest))
    
print(max(score), gamma_range[score.index(max(score))])
plt.plot(gamma_range,score)
plt.show()
----------------------------------------------------------------------
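Result: from the learning curve it is easy to find the best gamma for rbf (best test accuracy, then the gamma that achieved it)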
0.9766081871345029 0.012067926406393264

[Figure: test accuracy versus gamma for the rbf kernel]
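
The curve above scores each gamma on the test split. As a hedged alternative that is not taken from the course, the same sweep can be scored by cross-validation on the training split with `validation_curve`. The sketch assumes a pipeline built with `make_pipeline`, which names the SVC step "svc", so the parameter is addressed as "svc__gamma". The grid is coarser than the one in the notes only to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, y, test_size=0.3, random_state=420
)

gamma_range = np.logspace(-10, 1, 12)  # 1e-10, 1e-9, ..., 1e1
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# rows: one per gamma value; columns: one per cross-validation fold
train_scores, cv_scores = validation_curve(
    model, Xtrain, Ytrain,
    param_name="svc__gamma",
    param_range=gamma_range,
    cv=5,
)

mean_cv = cv_scores.mean(axis=1)
best_gamma = gamma_range[mean_cv.argmax()]
print(mean_cv.max(), best_gamma)
```

The test split is then left untouched until a final check of the chosen gamma.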

from sklearn.model_selection import StratifiedShuffleSplit  # splitter used for cross-validation inside the grid search
from sklearn.model_selection import GridSearchCV  # grid search with cross-validation
 
time0 = time()
 
gamma_range = np.logspace(-10,1,20)
coef0_range = np.linspace(0,5,10)
 
param_grid = dict(gamma = gamma_range
                  ,coef0 = coef0_range)
# 5 stratified random splits; each one holds out 30% of the data for validation
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=420)
# the SVC(...) call must be closed before param_grid and cv, which belong to GridSearchCV
grid = GridSearchCV(SVC(kernel = "poly",degree=1,cache_size=5000)
                    ,param_grid=param_grid
                    ,cv=cv)
grid.fit(X, y)
 
print("The best parameters are %s with a score of %0.5f" % (grid.best_params_, 
grid.best_score_))
print(time()-time0)
----------------------------------------------------------------------
Result: even after the grid search, the best score for poly is still lower than those of rbf and linear
The best parameters are {'coef0': 0.0, 'gamma': 0.18329807108324375} with a score of 0.96959
13.360332727432251
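
Besides `best_params_` and `best_score_`, the fitted search keeps the score of every parameter combination in `cv_results_`. The sketch below is an added illustration that loads it into a DataFrame and sorts by rank. It uses a much smaller grid than the notes (an assumption made only to keep the run short), so its numbers will not match the result above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# deliberately small grid: 4 gamma values x 3 coef0 values = 12 combinations
param_grid = {"gamma": np.logspace(-3, 0, 4), "coef0": np.linspace(0, 5, 3)}
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=420)

grid = GridSearchCV(SVC(kernel="poly", degree=1), param_grid=param_grid, cv=cv)
grid.fit(X, y)

# one row per parameter combination, best-ranked first
results = pd.DataFrame(grid.cv_results_)
columns = ["param_gamma", "param_coef0", "mean_test_score", "std_test_score"]
print(results.sort_values("rank_test_score")[columns].head())
```

Looking at the whole table shows whether the best score is a sharp peak or a broad plateau across neighbouring parameter values.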
----------------------------------------------------------------------
# Tune C for the linear kernel
score = []
C_range = np.linspace(0.01,30,50)
for i in C_range:
    clf = SVC(kernel="linear",C=i,cache_size=5000).fit(Xtrain,Ytrain)
    score.append(clf.score(Xtest,Ytest))
print(max(score), C_range[score.index(max(score))])
plt.plot(C_range,score)
plt.show()
 
# Switch to the rbf kernel, with gamma fixed near the best value from the gamma sweep
score = []
C_range = np.linspace(0.01,30,50)
for i in C_range:
    clf = SVC(kernel="rbf",C=i,gamma = 0.012742749857031322,cache_size=5000).fit(Xtrain,Ytrain)
    score.append(clf.score(Xtest,Ytest))
    
print(max(score), C_range[score.index(max(score))])
plt.plot(C_range,score)
plt.show()
 
# Refine the search over a narrower range of C
score = []
C_range = np.linspace(5,7,50)
for i in C_range:
    clf = SVC(kernel="rbf",C=i,gamma = 0.012742749857031322,cache_size=5000).fit(Xtrain,Ytrain)
    score.append(clf.score(Xtest,Ytest))
    
print(max(score), C_range[score.index(max(score))])
plt.plot(C_range,score)
plt.show()
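
One way to get a feel for what C does, beyond the accuracy curves, is to count support vectors at a few values of C. This is an added sketch rather than course material. A small C tolerates more margin violations, so more training points usually end up as support vectors, but the printed counts are something to check, not a guaranteed outcome.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # standardize first, as in the notes
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, y, test_size=0.3, random_state=420
)

n_support = {}
for C in [0.01, 1, 30]:
    clf = SVC(kernel="linear", C=C).fit(Xtrain, Ytrain)
    # n_support_ holds the number of support vectors for each class
    n_support[C] = int(clf.n_support_.sum())
    print("C=%g: %d support vectors, test accuracy %f"
          % (C, n_support[C], clf.score(Xtest, Ytest)))
```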

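Summary:

Parameter	Meaning
degree	Integer, default 3; only used when the kernel is poly
gamma	Float, or the string 'auto' or 'scale'; used by the rbf, poly, and sigmoid kernels. The default was 'auto' (1 / n_features) when these notes were written; scikit-learn 0.22 and later default to 'scale'
coef0	Float, default 0; used by the poly and sigmoid kernels
C	Float, default 1, optional. Penalty coefficient on the slack variables. A larger C penalizes misclassified training points more heavily and draws a tighter decision boundary, at the cost of longer training time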

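Reference: CDA classroom. These are my personal notes written up after the live course and are for reference only. If you see things differently, please read them critically; if you spot a problem, point it out in the comments and I will correct it.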