机器学习与鸢尾花的_PaddlePaddle课程2-机器学习入门实践-鸢尾花分类

最新推荐文章于 2022-08-20 21:09:38 发布

weixin_39614561

最新推荐文章于 2022-08-20 21:09:38 发布

阅读量169

点赞数

文章标签：机器学习与鸢尾花的

本文链接：https://blog.csdn.net/weixin_39614561/article/details/113450811

版权

任务描述：
构建一个模型，根据鸢尾花的花萼和花瓣大小将其分为三种不同的品种。

数据集
总共包含150行数据
每一行数据由 4 个特征值及一个目标值组成。
4 个特征值分别为：萼片长度、萼片宽度、花瓣长度、花瓣宽度
目标值为三种不同类别的鸢尾花，分别为： Iris Setosa、Iris Versicolour、Iris Virginica

首先导入必要的包：numpy：python第三方库，用于科学计算matplotlib:python第三方库，主要用于进行可视化sklearn:python的重要机器学习库，其中封装了大量的机器学习算法，如：分类、回归、降维以及聚类

import numpy as np                
from matplotlib import colors     
from sklearn import svm            
from sklearn.svm import SVC
from sklearn import model_selection
import matplotlib.pyplot as plt
import matplotlib as mpl

In[2]

#若提示缺少matplotlib，则可使用以下命令pip安装matplotlib
!pip install matplotlib

Looking in indexes: https://pypi.mirrors.ustc.edu.cn/simple/
Requirement already satisfied: matplotlib in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (2.2.3)
Requirement already satisfied: cycler>=0.10 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: numpy>=1.7.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (1.15.4)
Requirement already satisfied: six>=1.10 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (1.12.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (2.3.0)
Requirement already satisfied: pytz in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (2018.7)
Requirement already satisfied: python-dateutil>=2.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (2.7.5)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (1.0.1)
Requirement already satisfied: setuptools in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from kiwisolver>=1.0.1->matplotlib) (40.4.3)

tep1.数据准备

(1)从指定路径下加载数据

(2)对加载的数据进行数据分割，x_train,x_test,y_train,y_test分别表示训练集特征、训练集标签、测试集特征、测试集标签

#*************将字符串转为整型，便于数据加载***********************
def iris_type(s):
    it = {b'Iris-setosa':0, b'Iris-versicolor':1, b'Iris-virginica':2}
    return it[s]

In[4]

#加载数据
data_path='/home/aistudio/data/data5420/iris.data'          #数据文件的路径
data = np.loadtxt(data_path,                                #数据文件路径
                  dtype=float,                              #数据类型
                  delimiter=',',                            #数据分隔符
                  converters={4:iris_type})                 #将第5列使用函数iris_type进行转换
#print(data)                                                 #data为二维数组，data.shape=(150, 5)
#print(data.shape)
#数据分割
x, y = np.split(data,                                       #要切分的数组
                (4,),                                       #沿轴切分的位置，第5列开始往后为y
                axis=1)                                     #代表纵向分割，按列分割
x = x[:, 0:2]                                               #在X中我们取前两列作为特征，为了后面的可视化。x[:,0:4]代表第一维(行)全取，第二维(列)取0~2
#print(x)
x_train,x_test,y_train,y_test=model_selection.train_test_split(x,              #所要划分的样本特征集
                                                               y,              #所要划分的样本结果
                                                               random_state=1, #随机数种子
                                                               test_size=0.3)  #测试样本占比

Step2.模型搭建
C越大，相当于惩罚松弛变量，希望松弛变量接近0，即对误分类的惩罚增大，趋向于对训练集全分对的情况，这样对训练集测试时准确率很高，但泛化能力弱。 C值小，对误分类的惩罚减小，允许容错，将他们当成噪声点，泛化能力较强。
kernel='linear'时，为线性核
decision_function_shape='ovr'时，为one v rest，即一个类别与其他类别进行划分，
decision_function_shape='ovo'时，为one v one，即将类别两两之间进行划分，用二分类的方法模拟多分类的结果。

#**********************SVM分类器构建*************************
def classifier():
    #clf = svm.SVC(C=0.8,kernel='rbf', gamma=50,decision_function_shape='ovr')
    clf = svm.SVC(C=0.5,                         #误差项惩罚系数,默认值是1
                  kernel='linear',               #线性核 kenrel="rbf":高斯核
                  decision_function_shape='ovr') #决策函数
    return clf

In[6]

# 2.定义模型：SVM模型定义
clf = classifier()

Step3.模型训练

#***********************训练模型*****************************
def train(clf,x_train,y_train):
    clf.fit(x_train,         #训练集特征向量
            y_train.ravel()) #训练集目标值

In[8]

#***********************训练模型*****************************
def train(clf,x_train,y_train):
    clf.fit(x_train,         #训练集特征向量
            y_train.ravel()) #训练集目标值

In[9]

# 3.训练SVM模型
train(clf,x_train,y_train)

Step4.模型评估

#**************并判断a b是否相等，计算acc的均值*************
def show_accuracy(a, b, tip):
    acc = a.ravel() == b.ravel()
    print('%s Accuracy:%.3f' %(tip, np.mean(acc)))

In[11]

def print_accuracy(clf,x_train,y_train,x_test,y_test):
    #分别打印训练集和测试集的准确率  score(x_train,y_train):表示输出x_train,y_train在模型上的准确率
    print('trianing prediction:%.3f' %(clf.score(x_train, y_train)))
    print('test data prediction:%.3f' %(clf.score(x_test, y_test)))
    #原始结果与预测结果进行对比   predict()表示对x_train样本进行预测，返回样本类别
    show_accuracy(clf.predict(x_train), y_train, 'traing data')
    show_accuracy(clf.predict(x_test), y_test, 'testing data')
    #计算决策函数的值，表示x到各分割平面的距离
    print('decision_function:n', clf.decision_function(x_train))

In[12]

# 4.模型评估
print_accuracy(clf,x_train,y_train,x_test,y_test)

trianing prediction:0.819 test data prediction:0.778 traing data Accuracy:0.819 testing data Accuracy:0.778 decision_function: [[-0.5 1.20887337 2.29112663] [ 2.06328814 -0.0769677 1.01367956] [ 2.16674973 0.91702835 -0.08377808] [ 2.11427813 0.99765248 -0.11193061] [ 0.9925538 2.06392138 -0.05647518] [ 2.11742969 0.95255534 -0.06998503] [ 2.05615004 -0.041847 0.98569697] [-0.31866596 1.02685964 2.29180632] [-0.27166251 1.09150338 2.18015913] [-0.37827567 1.14260447 2.2356712 ] [-0.22150749 1.11104997 2.11045752] [-0.18331208 2.10066724 1.08264485] [-0.05444966 0.99927764 2.05517201] [-0.46977766 1.17853774 2.29123992] [-0.05760122 2.04437478 1.01322644] [ 2.1747228 0.93698124 -0.11170404] [-0.13315707 2.12021384 1.01294323] [-0.21752096 2.12102642 1.09649454] [ 2.11427813 0.99765248 -0.11193061] [ 2.16359817 0.96212549 -0.12572366] [-0.21038286 1.08590572 2.12447714] [ 2.21291822 0.9265985 -0.13951672] [-0.13399204 1.06514025 2.06885179] [-0.18016052 1.0555701 2.12459042] [-0.2334671 1.08112064 2.15234646] [-0.08782356 2.0747104 1.01311315] [-0.20324476 1.05078502 2.15245974] [-0.11489433 1.05994888 2.05494545] [ 2.17787437 -0.1081159 0.93024154] [-0.23578369 2.18129137 1.05449232] [-0.20639632 1.09588216 2.11051416] [-0.21038286 1.08590572 2.12447714] [-0.02969547 2.11420989 0.91548558] [-0.12685394 1.03001955 2.09683439] [-0.09496166 2.1098311 0.98513056] [ 2.10547008 -0.07737399 0.97190391] [ 2.11029159 0.98767604 -0.09796763] [ 2.20411017 -0.14842797 0.9443178 ] [-0.20324476 1.05078502 2.15245974] [ 2.19066895 0.97688701 -0.16755596] [-0.16022784 2.10545232 1.05477553] [-0.23661866 1.12621778 2.11040088] [-0.09579663 2.05475752 1.04103911] [ 2.11344315 -0.05742111 0.94397795] [ 2.10231852 0.96772315 -0.07004167] [-0.12203243 2.09506958 1.02696285] [ 2.11029159 0.98767604 -0.09796763] [-0.41248455 1.16296364 2.2495209 ] [-0.16820091 1.08549943 2.08270149] [-0.42045762 1.14301076 2.27744686] [-0.24857827 1.09628845 2.15228982] [-0.27796564 2.18169766 1.09626798] [-0.09264507 1.00966038 2.08298469] [-0.25339978 1.03123843 2.22216135] [-0.05361468 2.05435123 0.99926346] [ 2.15395516 -0.16797456 1.01401941] [-0.12203243 2.09506958 1.02696285] [ 2.06579305 1.08825305 -0.15404611] [-0.11007283 2.12499891 0.98507392] [-0.27166251 1.09150338 2.18015913] [ 2.13652739 0.94736397 -0.08389137] [-0.29789831 1.13181544 2.16608287] [ 2.15163856 0.93219616 -0.08383473] [ 2.1747228 0.93698124 -0.11170404] [-0.11174277 1.01485174 2.09689103] [-0.06872585 2.06951904 0.99920682] [-0.23745364 1.0711442 2.16630944] [ 2.12141623 0.96253178 -0.08394801] [ 2.1627632 -0.09294809 0.93018489] [-0.06557429 1.0244219 2.04115239] [ 2.16758471 0.97210193 -0.13968664] [-0.12203243 2.09506958 1.02696285] [ 2.1293893 0.98248467 -0.11187396] [-0.21038286 1.08590572 2.12447714] [ 2.01962457 1.0786829 -0.09830747] [ 2.18269588 0.95693412 -0.13963 ] [-0.16106282 1.05037873 2.11068408] [ 2.20976665 0.97169564 -0.1814623 ] [-0.03850351 2.03918342 0.9993201 ] [ 2.17555778 0.99205482 -0.1676126 ] [-0.11007283 2.12499891 0.98507392] [-0.07502898 2.15971332 0.91531566] [ 2.13254086 0.93738753 -0.06992839] [ 2.09518042 1.00284385 -0.09802427] [ 1.0045134 2.09385071 -0.09836411] [ 2.24314055 0.89626288 -0.13940344] [-0.09579663 2.05475752 1.04103911] [-0.14910321 1.08030806 2.06879515] [ 2.13652739 0.94736397 -0.08389137] [-0.2334671 1.08112064 2.15234646] [-0.07271239 2.05954259 1.0131698 ] [-0.2739791 2.1916741 1.082305 ] [-0.27564905 1.08152693 2.19412211] [-0.12203243 2.09506958 1.02696285] [ 2.06013657 -0.03187056 0.97173399] [ 2.07608272 1.00803521 -0.08411793] [-0.19443672 2.12581149 1.06862523] [-0.16421438 2.09547587 1.06873851] [-0.3440668 1.12224529 2.22182151] [-0.1180459 2.10504603 1.01299987] [-0.20240979 1.10585861 2.09655118] [-0.17617399 1.06554654 2.11062744] [-0.2477433 2.15136204 1.09638126] [-0.2334671 1.08112064 2.15234646] [ 2.11029159 0.98767604 -0.09796763]]

Step5.模型使用

def draw(clf, x):
    iris_feature = 'sepal length', 'sepal width', 'petal lenght', 'petal width'
    # 开始画图
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()               #第0列的范围
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()               #第1列的范围
    x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j]   #生成网格采样点
    grid_test = np.stack((x1.flat, x2.flat), axis=1)            #stack():沿着新的轴加入一系列数组
    print('grid_test:n', grid_test)
    # 输出样本到决策面的距离
    z = clf.decision_function(grid_test)
    print('the distance to decision plane:n', z)
    
    grid_hat = clf.predict(grid_test)                           # 预测分类值 得到【0,0.。。。2,2,2】
    print('grid_hat:n', grid_hat)  
    grid_hat = grid_hat.reshape(x1.shape)                       # reshape grid_hat和x1形状一致
                                                                #若3*3矩阵e，则e.shape()为3*3,表示3行3列   
 
    cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'b', 'r'])
 
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)                                   # pcolormesh(x,y,z,cmap)这里参数代入
                                                                                      # x1，x2，grid_hat，cmap=cm_light绘制的是背景。
    plt.scatter(x[:, 0], x[:, 1], c=np.squeeze(y), edgecolor='k', s=50, cmap=cm_dark) # 样本点
    plt.scatter(x_test[:, 0], x_test[:, 1], s=120, facecolor='none', zorder=10)       # 测试点
    plt.xlabel(iris_feature[0], fontsize=20)
    plt.ylabel(iris_feature[1], fontsize=20)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('svm in iris data classification', fontsize=30)
    plt.grid()
    plt.show()

In[14]

# 5.模型使用
draw(clf,x)

grid_test: [[4.3 2. ] [4.3 2.0120603] [4.3 2.0241206] ... [7.9 4.3758794] [7.9 4.3879397] [7.9 4.4 ]] the distance to decision plane: [[ 2.04663576 1.0980928 -0.14472856] [ 2.04808477 1.09663836 -0.14472313] [ 2.04953377 1.09518392 -0.1447177 ] ... [-0.21454554 0.96016146 2.25438408] [-0.21309653 0.95870702 2.25438951] [-0.21164753 0.95725258 2.25439495]] grid_hat: [0. 0. 0. ... 2. 2. 2.]