Iris Classification

Latest recommended article published on 2024-03-01 23:28:47

enginelong

Category: ML. Tags: machine learning

Article link: https://blog.csdn.net/m0_46278903/article/details/107329406

Download link for the Iris dataset:
https://www.cnblogs.com/wjunneng/p/7324142.html

Step 1: Import the required packages

from matplotlib import colors
import numpy as np
from sklearn import svm
from sklearn.svm import SVC
from sklearn import model_selection
import matplotlib.pyplot as plt
import matplotlib as mpl

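Installing scikit-learn (the package is imported as sklearn, but its name on PyPI is scikit-learn):

pip install scikit-learn -i https://mirror.baidu.com/pypi/simple

I recommend the Baidu mirror here; it works really well.
Baidu mirror: -i https://mirror.baidu.com/pypi/simple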

Step 2: Load the dataset

Layout of the Iris dataset:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
……

# Map the labels Iris-setosa, Iris-versicolor, Iris-virginica to 0, 1, 2
def add_one(s):
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    # Depending on the NumPy version, loadtxt passes bytes or str to converters
    if isinstance(s, bytes):
        s = s.decode()
    return it[s]

# Load iris.data (raw string, so the backslashes are not treated as escapes)
data = np.loadtxt(r'D:\pycharm——project\iris.data', dtype='float64', delimiter=',', converters={4: add_one})
# Split the columns into features and labels
# x => [150, 4]    y => [150, 1]
x, y = np.split(data, [4, ], axis=1)
# Keep only the first two of the four features so the result can be plotted in 2-D
x = x[:, :2]
# x_train => [105, 2]    x_test => [45, 2]    y_train => [105, 1]    y_test => [45, 1]
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, random_state=1, test_size=0.3)
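
If the local iris.data file is not at hand, the same loading and splitting steps can be sketched with the copy of the dataset bundled in scikit-learn. This is an alternative to the post's file-based loading, not the method used above; load_iris is assumed to be available, its values may differ slightly from a downloaded copy, and its labels come back as a 1-D array rather than a column.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load_iris ships with scikit-learn, so no local iris.data file is needed
iris = load_iris()
x = iris.data[:, :2]   # keep sepal length and sepal width, as in the post
y = iris.target        # integer labels 0, 1, 2; already 1-D, shape (150,)

# Same split parameters as in the post: 30% of the samples go to the test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, random_state=1, test_size=0.3)

print(x_train.shape, x_test.shape)   # (105, 2) (45, 2)
```

Because y is already 1-D here, the later call to ravel() becomes a no-op but does no harm.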

Step 3: Configure the model, using the SVM classifier provided by sklearn

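# Model configuration
# Use the SVM implementation provided by sklearn
# SVM = Support Vector Machine
# SVC = Support Vector Classification, an SVM used for classification
# SVR = Support Vector Regression, an SVM used for regression
def classifier():
    # C is the penalty coefficient on misclassification; it must be a positive float (default 1.0)
    # Larger C penalizes misclassified samples more: training accuracy tends to rise, but generalization may suffer
    # Smaller C weakens the penalty: some misclassified samples are tolerated as noise, which tends to improve generalization

    # kernel: the kernel function implicitly maps the data into a higher-dimensional space,
    # so that a problem that is not linearly separable can be handled by a linear separator
    # Common kernels: linear ('linear'), Gaussian ('rbf'), polynomial ('poly')

    # decision_function_shape sets the form of the decision function (distance of a sample to the separating hyperplanes)
    # It takes 'ovo' or 'ovr'; the default is 'ovr' in current scikit-learn releases
    model = svm.SVC(C=0.5, kernel='linear', decision_function_shape='ovr')
    return model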
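
To see how C trades training accuracy against generalization, one can fit the same linear SVC with a few values of C and compare the scores. The sketch below is only an illustration: the three C values are arbitrary choices, and the data comes from scikit-learn's bundled load_iris rather than the iris.data file used in this post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Bundled copy of the dataset, first two features only (an assumption for this sketch)
iris = load_iris()
x, y = iris.data[:, :2], iris.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, random_state=1, test_size=0.3)

results = {}
for c in (0.01, 0.5, 10.0):   # illustrative values, not tuned
    model = SVC(C=c, kernel='linear', decision_function_shape='ovr')
    model.fit(x_train, y_train)
    # score() returns the mean accuracy on the given data
    results[c] = (model.score(x_train, y_train), model.score(x_test, y_test))
    print('C=%-5g train=%.3f test=%.3f' % (c, *results[c]))
```

The exact numbers depend on the split and the library version, so they are best read as a trend rather than as fixed values.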

Step 4: Create and train the model

# Create the model
model = classifier()
model.fit(x_train, y_train.ravel())  # y_train.ravel() flattens y_train from 2-D to 1-D, i.e. (105, 1) => (105,)
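
The effect of ravel() is easy to check on a stand-in array with the same shape as y_train. The array below is a placeholder filled with zeros, not the real labels. Passing the column vector directly to fit() generally still works, but scikit-learn emits a DataConversionWarning asking for a 1-D array.

```python
import numpy as np

y_col = np.zeros((105, 1))   # stand-in with the same shape as y_train
y_flat = y_col.ravel()       # flatten the column vector to 1-D

print(y_col.shape, y_flat.shape)   # (105, 1) (105,)
```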

Step 5: Evaluate accuracy

# Accuracy evaluation
# Flatten output and label to 1-D, then compute the fraction of predictions that match
def show_accuracy(output, label, data_type):
    # print(output.shape, label.shape)
    # print(output.ravel().shape, label.ravel().shape)
    acc = output.ravel() == label.ravel()
    print('%s Accuracy: %.3f' % (data_type, np.mean(acc)))

def print_accuracy(model, x_train, y_train, x_test, y_test):
    # Accuracy on the training set, usually higher than on the test set
    show_accuracy(model.predict(x_train), y_train, 'training data')
    # Accuracy on the test set
    show_accuracy(model.predict(x_test), y_test, 'testing data')
    # Distance from each sample to each separating hyperplane
    # print('decision_function_x_train:\n', model.decision_function(x_train))
    # print(model.decision_function(x_train).shape)
    # print('decision_function_x_test:\n', model.decision_function(x_test))
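
The ravel() calls in show_accuracy matter because the labels are a column vector while predict() returns a 1-D array. A small sketch with made-up labels shows what goes wrong without them, and that the manual mean agrees with scikit-learn's accuracy_score. The arrays here are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

label = np.array([[0], [1], [2], [1]])   # column vector, like y_test in this post
output = np.array([0, 1, 1, 1])          # 1-D, like model.predict(x_test)

# Without ravel(), NumPy broadcasts (4,) against (4, 1) into a (4, 4) matrix
print((output == label).shape)           # (4, 4)

# With ravel(), the comparison is element by element
manual = np.mean(output.ravel() == label.ravel())
library = accuracy_score(label.ravel(), output)
print(manual, library)                   # 0.75 0.75
```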

Step 6: Visualize the training result with matplotlib

# Visualize the training result
# Note: besides its arguments, draw() reads the module-level y and x_test defined in Step 2
def draw(model, x):
    iris_feature = 'sepal length', 'sepal width', 'petal length', 'petal width'
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()   # min and max of the first column of x
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()   # min and max of the second column of x
    # print(x1_min.shape, x1_max.shape)
    # Take 200 evenly spaced values in [x1_min, x1_max] and 200 in [x2_min, x2_max]
    # x1 => [200, 200]    x2 => [200, 200]
    x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j]
    # Flatten x1 and x2 to 1-D vectors (40000,) and stack them along axis=1 into a [40000, 2] matrix
    test_data = np.stack((x1.flat, x2.flat), axis=1)
    # z = model.decision_function(test_data)
    # Predict a class for every grid point => [0, 1, 1, 2, ...]
    label_predict = model.predict(test_data)
    label_predict = label_predict.reshape(x1.shape)   # reshape to the shape of x1, [200, 200], for 2-D plotting
    # Colormaps that tie each class to a color: light shades for regions, dark shades for samples
    cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])   # same class order as cm_light
    # Draw the decision regions; pcolormesh() picks colors from cmap according to label_predict
    plt.pcolormesh(x1, x2, label_predict, cmap=cm_light)
    plt.scatter(x[:, 0], x[:, 1], c=np.squeeze(y), edgecolors='k', s=50, cmap=cm_dark)
    # Circle the test samples; an explicit edge color keeps the hollow markers visible
    plt.scatter(x_test[:, 0], x_test[:, 1], s=120, facecolors='none', edgecolors='k', zorder=10)
    plt.xlabel(iris_feature[0], fontsize=20)
    plt.ylabel(iris_feature[1], fontsize=20)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('data classification', fontsize=30)
    plt.grid()
    plt.savefig('classification_result.png')
    plt.show()
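
The grid construction inside draw() is easier to follow on a tiny example. The sketch below builds a 3 x 4 grid instead of 200 x 200 and uses a made-up threshold rule in place of model.predict, purely to show how the shapes change at each step.

```python
import numpy as np

# 3 evenly spaced values in [0, 1] and 4 in [0, 2]; a complex step means "number of points"
g1, g2 = np.mgrid[0:1:3j, 0:2:4j]
print(g1.shape, g2.shape)      # (3, 4) (3, 4)

# One row per grid point, one column per feature
points = np.stack((g1.ravel(), g2.ravel()), axis=1)
print(points.shape)            # (12, 2)

# Hypothetical stand-in for model.predict: class 1 where the first feature exceeds 0.25
fake_labels = (points[:, 0] > 0.25).astype(int)

# Back to the grid shape, which is what pcolormesh expects
grid_labels = fake_labels.reshape(g1.shape)
print(grid_labels.shape)       # (3, 4)
```

The reshape works because ravel() and reshape() use the same row-major order, so every label lands back on the grid point it was predicted for.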

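This post is a brief record of an SVM-based classification exercise on the classic Iris dataset. But how does the support vector machine model actually work, and what mathematics is it built on? More to study!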