Applying sklearn svm.SVC + the breast cancer dataset

夺笋123

Last modified 2022-10-08 12:43:03


Column: Small examples of applying machine learning frameworks. Tags: support vector machine, machine learning, python

First published 2022-08-01 11:21:15

Article link: https://blog.csdn.net/m0_54510474/article/details/125898294


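This article explains how to draw decision boundaries with matplotlib, and uses worked examples (a few simple data points, a generated dataset, and the breast cancer dataset) to show the Gaussian and polynomial kernels in action. It focuses on overfitting, including how different kernels change the decision boundary, and uses the breast cancer dataset to show why parameter tuning matters.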


Wrapping a decision-boundary plotting function

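Reference book: 《scikit-learn机器学习：常用算法原理及编程实战》 (scikit-learn Machine Learning: Common Algorithm Principles and Programming Practice)
Reference blog post: Data classification with support vector machines and drawing the decision boundary (hyperplane)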

Code

import matplotlib.pyplot as plt
import numpy as np

def plot_hyperplane(clf, X, y,
                    h=0.02,
                    draw_sv=True,
                    title='hyperplane'):
    # Plot range: the data range padded by 0.1 on each side
    x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
    y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
    # Build the grid coordinate matrices; h controls the grid spacing
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    plt.title(title)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())

    # Predict the class of every grid point; the borders between the
    # predicted regions are the SVM decision boundary
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Fill the regions of different predicted classes with different colors
    plt.contourf(xx, yy, Z, cmap='hot', alpha=0.5)
    plt.scatter(X[:, 0], X[:, 1], c=y)

    # Optionally highlight the support vectors
    if draw_sv:
        sv = clf.support_vectors_
        plt.scatter(sv[:, 0], sv[:, 1], c='r', marker='.', s=1)
    # plt.show() is left to the caller so the function also works inside subplots

This function is built on matplotlib's contourf(), which draws contour lines and fills the regions between them with color.
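To make the grid step inside plot_hyperplane() easier to follow, here is a minimal sketch of how meshgrid, ravel and np.c_ fit together. The grid range, the step of 0.5, and the threshold rule standing in for clf.predict are all assumptions chosen for illustration, not part of the original code.

```python
import numpy as np

# Build a coarse grid over [0, 1] x [0, 2] with step h (illustrative values)
h = 0.5
xx, yy = np.meshgrid(np.arange(0, 1 + h, h), np.arange(0, 2 + h, h))

# Flatten both coordinate matrices and pair them column-wise:
# each row of grid_points is one (x, y) location to classify
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Hypothetical stand-in for clf.predict: label 1 if x + y > 1.5, else 0
Z = (grid_points.sum(axis=1) > 1.5).astype(int)

# Reshape the flat labels back to the grid shape that contourf expects
Z = Z.reshape(xx.shape)
print(xx.shape, grid_points.shape, Z.shape)
```

Passing xx, yy and the reshaped Z to plt.contourf() then colors each grid cell by its predicted label, which is what makes the boundary visible.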

Prediction examples

1. Simple data points

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
x=np.array([[-1,-1],[-2,-1],[1,1],[2,1]])
y=np.array([1,1,2,2])
clf=make_pipeline(StandardScaler(),SVC(gamma='auto'))
clf.fit(x,y)
clf.predict([[-0.8,-1]])
>>> array([1])

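Note how make_pipeline() is combined with StandardScaler() here. This pairing is commonly used to build a pipeline that handles data preprocessing, model fitting, and prediction in one go!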
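As a rough illustration of what the pipeline does, the sketch below compares it with the same two steps done by hand. The second query point [1.5, 1] and the variable names are assumptions added for the comparison.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
X_new = np.array([[-0.8, -1], [1.5, 1]])  # second point is an added example

# Pipeline: fit() scales X and then fits the SVC;
# predict() applies the same fitted scaling before predicting
pipe = make_pipeline(StandardScaler(), SVC(gamma='auto'))
pipe.fit(X, y)
pipe_pred = pipe.predict(X_new)

# Manual version of the same two steps
scaler = StandardScaler().fit(X)
svc = SVC(gamma='auto').fit(scaler.transform(X), y)
manual_pred = svc.predict(scaler.transform(X_new))

print(pipe_pred, manual_pred)
```

Both routes should give the same predictions; the pipeline simply removes the risk of forgetting to scale new data with the scaler fitted on the training data.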

2. Generated data points

First we generate a dataset with two features and three classes

from sklearn.datasets import make_blobs
x,y=make_blobs(n_samples=100,centers=3,random_state=0,cluster_std=0.8)

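Next we fit the dataset with four SVC models: a linear kernel, a degree-3 polynomial kernel, a Gaussian (RBF) kernel with $\gamma=0.5$, and a Gaussian kernel with $\gamma=0.1$, and compare how each one fits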

from sklearn.svm import SVC
clf_linear=SVC(C=1.0,kernel='linear')
clf_poly=SVC(C=1.0,kernel='poly',degree=3)
clf_rbf=SVC(C=1.0,kernel='rbf',gamma=0.5)
clf_rbf2=SVC(C=1.0,kernel='rbf',gamma=0.1)

Finally we plot the decision boundaries fitted by the four models

plt.figure(figsize=(10,10),dpi=144)
clfs=[clf_linear,clf_poly,clf_rbf,clf_rbf2]
titles=['linear','poly','rbf','rbf2']
for i,(clf,title) in enumerate(zip(clfs,titles)):
    clf.fit(x,y)
    plt.subplot(2,2,i+1)
    plot_hyperplane(clf,x,y,title=title)
plt.show()

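Result
[Figure: decision boundaries of the four SVC models (linear, poly, rbf, rbf2)]
The example above shows that when a support vector machine classifies with different kernels, the decision boundary it draws changes accordingly.
For the Gaussian kernel, a gamma that is too large causes overfitting, while a gamma that is too small makes the boundary nearly linear, so the model behaves much like one with a linear kernel. We can tune gamma to adjust the shape of the separating hyperplane.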
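One way to see the effect of gamma in numbers rather than pictures is to compare training accuracy with cross-validated accuracy. The sketch below reuses the make_blobs setup from this section; the extreme values 0.001 and 1000 and the use of 5-fold cross_val_score are assumptions added for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=3, random_state=0, cluster_std=0.8)

train_scores = {}
cv_scores = {}
# 0.1 and 0.5 come from the text; 0.001 and 1000 are illustrative extremes
for gamma in [0.001, 0.1, 0.5, 1000]:
    clf = SVC(C=1.0, kernel='rbf', gamma=gamma)
    # Mean accuracy on held-out folds
    cv_scores[gamma] = cross_val_score(clf, X, y, cv=5).mean()
    # Accuracy on the data the model was trained on
    train_scores[gamma] = clf.fit(X, y).score(X, y)
    print(f"gamma={gamma}: train={train_scores[gamma]:.3f}, cv={cv_scores[gamma]:.3f}")
```

A large gap between the training score and the cross-validated score for a given gamma is the usual numeric hint of overfitting.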

3. Breast cancer dataset

First we load the dataset

from sklearn.datasets import load_breast_cancer
cancer=load_breast_cancer()
x=cancer.data
y=cancer.target

Split it into a training set and a test set

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
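The split above is random, so the scores reported later can vary from run to run. A minimal sketch of a reproducible variant follows; the random_state=0 and stratify=y arguments are assumptions that are not in the original code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# random_state fixes the shuffle (assumed value);
# stratify=y keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)
# Fraction of each class in the training and test parts
print(np.bincount(y_train) / len(y_train), np.bincount(y_test) / len(y_test))
```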

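Next we fit the data with models that use different kernels

Gaussian kernel

We first train a model with the Gaussian kernel and compute its training-set and test-set scores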

from sklearn.svm import SVC
clf=SVC(C=1.0,kernel='rbf',gamma=0.1)
clf.fit(x_train,y_train)
train_score=clf.score(x_train,y_train)
test_score=clf.score(x_test,y_test)
train_score,test_score
>>> (0.98778345781,0.52631578949)

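The training score is close to perfect while the test score is very low: a textbook case of overfitting! Since we are using the Gaussian kernel, we can try changing the gamma parameter to adjust the model's behavior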

from sklearn.model_selection import GridSearchCV
gammas=np.linspace(0,0.0003,30)
grid={'gamma':gammas}
clf=GridSearchCV(SVC(),grid,cv=5)
clf.fit(x,y)
clf.best_params_,clf.best_score_
>>> ({'gamma': 0.00011379310344827585}, 0.9367334264865704)

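Note that GridSearchCV() is commonly used for parameter selection

Even with the best gamma, the best mean cross-validation score is only about 0.937, which suggests that this kernel is not the best fit for this problem.

We can still use a learning curve to see how the model fits, taking gamma=0.01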
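To see how the score changes across the candidates, and not only the winner, one can read GridSearchCV's cv_results_. The sketch below uses a smaller grid of six strictly positive gamma values, which is an assumption chosen to keep the run short; it is not the grid used above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Assumed grid: a few strictly positive gamma values in the range explored above
param_grid = {'gamma': np.linspace(0.00001, 0.0003, 6)}
search = GridSearchCV(SVC(C=1.0, kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

# cv_results_ holds one mean cross-validated score per candidate
mean_scores = search.cv_results_['mean_test_score']
for params, score in zip(search.cv_results_['params'], mean_scores):
    print(f"gamma={params['gamma']:.6f}: mean CV accuracy={score:.4f}")
print(search.best_params_, search.best_score_)
```

best_score_ is the highest of these mean scores, and best_params_ is the candidate that produced it.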

import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
cv=ShuffleSplit(n_splits=10,test_size=0.2,random_state=0)
title='gaussian kernel'
plt.figure(figsize=(10,4),dpi=144)
plot_learning_curve(SVC(C=1.0,gamma=0.01),title,x,y,cv=cv)

Result
[Figure: learning curve of the Gaussian-kernel SVC]

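For a wrapper that plots learning curves, see the plot_learning_curve() function in the official sklearn examples,
or the author's blog post: Wrapping learning curves with the learning_curve() function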
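Since plot_learning_curve() is not defined in this article, here is a minimal sketch of what such a helper might look like, built on sklearn.model_selection.learning_curve. It is an assumption about the helper's shape, with a signature chosen to match the calls in this article; it is not the author's or sklearn's exact implementation.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve


def plot_learning_curve(estimator, title, X, y, cv=None,
                        train_sizes=np.linspace(0.1, 1.0, 5)):
    """Plot mean training and cross-validation scores against training-set size."""
    sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=train_sizes)
    # Average over the cross-validation splits (axis 1)
    train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
    test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)

    plt.title(title)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.grid()
    # Shaded bands show one standard deviation around each mean
    plt.fill_between(sizes, train_mean - train_std, train_mean + train_std,
                     alpha=0.1, color='r')
    plt.fill_between(sizes, test_mean - test_std, test_mean + test_std,
                     alpha=0.1, color='g')
    plt.plot(sizes, train_mean, 'o-', color='r', label='Training score')
    plt.plot(sizes, test_mean, 'o-', color='g', label='Cross-validation score')
    plt.legend(loc='best')
    return sizes, train_mean, test_mean
```

The function draws on the current axes and leaves plt.show() to the caller, so it can be used inside plt.subplot() layouts like the ones in this article.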

The cross-validation score is too low, a clear sign of overfitting. Next we try a different kernel

Polynomial kernel

from sklearn.svm import SVC
clf=SVC(C=1.0,kernel='poly',degree=2)
clf.fit(x_train,y_train)
train_score=clf.score(x_train,y_train)
test_score=clf.score(x_test,y_test)
train_score,test_score
>>> (0.9186813186813186, 0.9122807017543859)

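Both the training score and the test score are fairly high, so the result looks acceptable. Next we plot the learning curves for a degree-1 and a degree-2 polynomial kernel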

from sklearn.model_selection import learning_curve, ShuffleSplit
cv=ShuffleSplit(n_splits=5,test_size=0.2,random_state=0)
degrees=[1,2]
plt.figure(figsize=(12,4),dpi=144)
for i in range(len(degrees)):
    plt.subplot(1,len(degrees),i+1)
    title='degree_{}'.format(degrees[i])
    plot_learning_curve(SVC(C=1.0,kernel='poly',degree=degrees[i]),title,x,y,cv=cv)