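Machine Learning: A Simple Python Example of Support Vector Machines (SVM), Part 1 (with Data Preprocessing, Cross-Validation, and Parameter Tuning)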

有情怀的机械男

Category: Machine Learning

Article link: https://blog.csdn.net/qq_45769063/article/details/106628800

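This blog post shows how to implement a support vector machine (SVM) in Python, covering data preprocessing, splitting into training and test sets, kernel selection, model evaluation, and parameter tuning. The author explains the theory behind SVM, data cleaning and normalization, and how to choose the best C and gamma parameters through cross-validation. Worked examples show how different C and gamma values affect the training and test accuracy, illustrating the relationship between overfitting and generalization.

Summary generated automatically by CSDN.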

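I. SVM Theory

See the following articles:

"Machine Learning: Support Vector Machines (SVM), the Linear Model"

"Machine Learning: Support Vector Machines (SVM), the Nonlinear Model, Mapping from Low to High Dimensions"

"Machine Learning: Support Vector Machines (SVM), the Nonlinear Model, the Primal and Dual Problems"

"Machine Learning: Support Vector Machines (SVM), the Nonlinear Model, Converting the Primal Problem into the Dual Problem"

"Machine Learning: Support Vector Machines (SVM), Multi-Class Problems"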

II. Related NumPy Functions

"NumPy: axis"

"What Are Random Numbers and Random Seeds"

"NumPy: mgrid"

"Python: Reshaping Arrays (flatten, flat, ravel, reshape, resize)"

"NumPy: stack"

III. Preparing the Python Implementation

[Note] The code in this article was run on Windows with PyCharm and Python 3.7.

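1. Downloading the dataset

The dataset used in this article is Iris.data, which can be downloaded from the UCI repository: http://archive.ics.uci.edu/ml/datasets/Iris

Iris.data has 5 columns. The first 4 columns are the sample features and the 5th column is the class, which takes one of three values: Iris-setosa, Iris-versicolor, Iris-virginica.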

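2. Installing the modules

pip install module_name

Scikit-Learn implements most of the commonly used machine learning algorithms. For usage details, see the official documentation:

http://scikit-learn.org/stable/auto_examples/index.html#support-vector-machines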

3. Importing the modules

The SVM example uses the sklearn, numpy, and matplotlib modules, among others.

from sklearn import svm
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import matplotlib

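IV. Steps for Implementing SVM in Python

1. Reading the dataset

Before the dataset can be loaded, the data needs some processing. Class labels must be numeric for classification, so the class names (strings) in column 5 of Iris.data have to be converted to numbers. This is done with a dictionary whose keys are the class names as bytes: looking up a key returns the corresponding value, which performs the conversion.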

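# Define the converter (a dictionary lookup)
def Iris_label(s):
    it={b'Iris-setosa':0, b'Iris-versicolor':1, b'Iris-virginica':2 }
    # Older NumPy versions pass bytes to converters; newer ones may pass str
    if isinstance(s, str):
        s = s.encode()
    return it[s]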

The conversion function defined above maps the classes Iris-setosa, Iris-versicolor, and Iris-virginica to 0, 1, and 2.

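# 1. Read the dataset
# Raw string, so the backslashes in the Windows path are not treated as escapes
path=r'E:\SCUT_study_files\PYTHON\machine_learning/Iris.data'
data=np.loadtxt(path, dtype=float, delimiter=',', converters={4:Iris_label} )
# In converters={4:Iris_label}, "4" is the 5th column (0-based index):
# its string values are converted to numeric labels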

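The file is read with the loadtxt function, which is declared as follows:

def loadtxt(fname, dtype=float, comments='#', delimiter=None,converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0)

The commonly used parameters are:

fname: the file path, e.g. path='F:/Python_Project/SVM/data/Iris.data'

dtype: the data type of the samples, e.g. dtype=float

delimiter: the field separator, e.g. delimiter=','

converters: a dictionary mapping column indices to conversion functions. For example, converters={4:Iris_label} means the values in the 5th column are passed through the Iris_label function.

usecols: which columns to read.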
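The parameters above can be tried out without downloading the file. The following is a minimal sketch that reads three Iris-style rows from an in-memory buffer; the `iris_label` helper and the `LABELS` dictionary are illustrative names rather than the article's code, and the converter accepts both bytes and str because the type passed to converters depends on the NumPy version and the `encoding` argument.

```python
import io
import numpy as np

# Hypothetical helper names, used only for this illustration
LABELS = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

def iris_label(s):
    # Converters may receive bytes or str depending on the NumPy version
    if isinstance(s, bytes):
        s = s.decode()
    return LABELS[s]

# Three rows in the Iris.data layout: 4 features, then the class name
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

data = np.loadtxt(sample, dtype=float, delimiter=",", converters={4: iris_label})
print(data.shape)    # (3, 5)
print(data[:, 4])    # [0. 1. 2.]

# usecols restricts the read to the listed columns (here the first two features)
sample.seek(0)
first_two = np.loadtxt(sample, dtype=float, delimiter=",", usecols=(0, 1))
print(first_two.shape)   # (3, 2)
```

Because the converter returns a number, the whole table fits into a single float array, which is what the later splitting step relies on.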

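2. Splitting into training and test samples

# 2. Separate the features from the labels
import sklearn.model_selection  # make sure the submodule is loaded before it is used below
x,y=np.split(data,indices_or_sections=(4,),axis=1) # x holds the features, y the labels; axis is the split direction: 1 splits by columns, 0 by rows (default 0)
x=x[:,0:2] # keep only the first two features so the result can be plotted in 2-D later; without plotting, all four columns can be used: x[:,0:4]
train_data,test_data,train_label,test_label =sklearn.model_selection.train_test_split(x,
                                                                                      y,
                                                                                      random_state=1,# random seed: fixes which samples are drawn for training and testing, so the split is reproducible
                                                                                      train_size=0.6,
                                                                                      test_size=0.4)
# train_data: training samples, test_data: test samples, train_label: training labels, test_label: test labels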
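To see what the split call produces, it can be run on synthetic data of the same shape as the Iris table. This is a sketch under stated assumptions: the arrays below are made-up stand-ins for the real features and labels (150 samples, 2 features, labels stored as a column vector as np.split returns them), and only the sizes and reproducibility of the split are shown.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the Iris features and labels (not the real data)
x = np.arange(300, dtype=float).reshape(150, 2)       # 150 samples, 2 features
y = np.repeat([0.0, 1.0, 2.0], 50).reshape(150, 1)    # labels as a column vector

train_data, test_data, train_label, test_label = train_test_split(
    x, y, random_state=1, train_size=0.6, test_size=0.4
)
print(train_data.shape, test_data.shape)      # (90, 2) (60, 2)
print(train_label.shape, test_label.shape)    # (90, 1) (60, 1)

# The same random_state reproduces exactly the same split
again = train_test_split(x, y, random_state=1, train_size=0.6, test_size=0.4)
print(np.array_equal(train_data, again[0]))   # True
```

The labels keep their (n, 1) column shape after the split. scikit-learn estimators generally expect 1-D labels, so a call such as `train_label.ravel()` is commonly used when fitting a model.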

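1) numpy.split(ary, indices_or_sections, axis=0)

ary: the array to split

indices_or_sections: if it is an integer N, the array is divided evenly into N parts. If it is a 1-D array, its entries are the positions at which the array is cut along the given axis, so a 1-D array with n elements splits the array into n+1 parts.

The NumPy documentation describes it as follows: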

indices_or_sections : int or 1-D array
        If `indices_or_sections` is an integer, N, the array will be divided
        into N equal arrays along `axis`.  If such a split is not possible,
        an error is raised.

        If `indices_or_sections` is a 1-D array of sorted integers, the entries
        indicate where along `axis` the array is split.  For example,
        ``[2, 3]`` would, for ``axis=0``, result in

          - ary[:2]
          - ary[2:3]
          - ary[3:]

        If an index exceeds the dimension of the array along `axis`,
        an empty sub-array is returned correspondingly.

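axis: the direction of the split. With axis=0 the array is cut along the rows, that is, into groups of rows. With axis=1 it is cut along the columns, that is, into groups of columns.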
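The behaviour of indices_or_sections and axis described above can be checked on a small array. The sketch below uses a toy 2x5 array in place of the Iris table (an assumption made for illustration) and shows an index-based split, an equal split, and the error raised when an equal split is impossible.

```python
import numpy as np

# Toy array standing in for the Iris table: 2 rows, 5 columns
a = np.arange(10).reshape(2, 5)

# A 1-element index sequence gives 2 pieces: columns [0:4] and [4:]
x, y = np.split(a, (4,), axis=1)
print(x.shape, y.shape)   # (2, 4) (2, 1)
print(y.ravel())          # [4 9]

# An integer asks for equal parts along the axis: 2 rows -> 2 arrays of 1 row each
top, bottom = np.split(a, 2, axis=0)
print(top)                # [[0 1 2 3 4]]

# 5 columns cannot be divided into 2 equal parts, so NumPy raises ValueError
try:
    np.split(a, 2, axis=1)
except ValueError as err:
    print("unequal split:", err)
```

This mirrors the call used on the dataset: splitting at index 4 along axis=1 separates the four feature columns from the single label column.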