【新手入门】课程2-机器学习入门实践-鸢尾花分类

飞桨PaddlePaddle

于 2019-09-25 11:13:57 发布

阅读量2.9k

点赞数 1

本文链接：https://blog.csdn.net/PaddlePaddle/article/details/101352202

版权

任务描述：

构建一个模型，根据鸢尾花的花萼和花瓣大小将其分为三种不同的品种。

数据集

总共包含150行数据

每一行数据由 4 个特征值及一个目标值组成。

4 个特征值分别为：萼片长度、萼片宽度、花瓣长度、花瓣宽度

目标值为三种不同类别的鸢尾花，分别为： Iris Setosa、Iris Versicolour、Iris Virginica

首先导入必要的包：

numpy：python第三方库，用于科学计算

matplotlib:python第三方库，主要用于进行可视化

sklearn:python的重要机器学习库，其中封装了大量的机器学习算法，如：分类、回归、降维以及聚类

In[1]

import numpy as np                
from matplotlib import colors     
from sklearn import svm            
from sklearn.svm import SVC
from sklearn import model_selection
import matplotlib.pyplot as plt
import matplotlib as mpl

In[2]

#若提示缺少matplotlib，则可使用以下命令pip安装matplotlib
!pip install matplotlib

Looking in indexes: https://pypi.mirrors.ustc.edu.cn/simple/
Requirement already satisfied: matplotlib in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (2.2.3)
Requirement already satisfied: cycler>=0.10 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: numpy>=1.7.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (1.15.4)
Requirement already satisfied: six>=1.10 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (1.12.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (2.3.0)
Requirement already satisfied: pytz in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (2018.7)
Requirement already satisfied: python-dateutil>=2.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (2.7.5)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from matplotlib) (1.0.1)
Requirement already satisfied: setuptools in /opt/conda/envs/python35-paddle120-env/lib/python3.5/site-packages (from kiwisolver>=1.0.1->matplotlib) (40.4.3)

Step1.数据准备

(1)从指定路径下加载数据

(2)对加载的数据进行数据分割，x_train,x_test,y_train,y_test分别表示训练集特征、训练集标签、测试集特征、测试集标签

In[3]

#*************将字符串转为整型，便于数据加载***********************
def iris_type(s):
    it = {b'Iris-setosa':0, b'Iris-versicolor':1, b'Iris-virginica':2}
    return it[s]

In[4]

#加载数据
data_path='/home/aistudio/data/data5420/iris.data'          #数据文件的路径
data = np.loadtxt(data_path,                                #数据文件路径
                  dtype=float,                              #数据类型
                  delimiter=',',                            #数据分隔符
                  converters={4:iris_type})                 #将第5列使用函数iris_type进行转换
#print(data)                                                 #data为二维数组，data.shape=(150, 5)
#print(data.shape)
#数据分割
x, y = np.split(data,                                       #要切分的数组
                (4,),                                       #沿轴切分的位置，第5列开始往后为y
                axis=1)                                     #代表纵向分割，按列分割
x = x[:, 0:2]                                               #在X中我们取前两列作为特征，为了后面的可视化。x[:,0:4]代表第一维(行)全取，第二维(列)取0~2
#print(x)
x_train,x_test,y_train,y_test=model_selection.train_test_split(x,              #所要划分的样本特征集
                                                               y,              #所要划分的样本结果
                                                               random_state=1, #随机数种子
                                                               test_size=0.3)  #测试样本占比

Step2.模型搭建

C越大，相当于惩罚松弛变量，希望松弛变量接近0，即对误分类的惩罚增大，趋向于对训练集全分对的情况，这样对训练集测试时准确率很高，但泛化能力弱。 C值小，对误分类的惩罚减小，允许容错，将他们当成噪声点，泛化能力较强。

kernel='linear'时，为线性核

decision_function_shape='ovr'时，为one v rest，即一个类别与其他类别进行划分，

decision_function_shape='ovo'时，为one v one，即将类别两两之间进行划分，用二分类的方法模拟多分类的结果。

In[5]

#**********************SVM分类器构建*************************
def classifier():
    #clf = svm.SVC(C=0.8,kernel='rbf', gamma=50,decision_function_shape='ovr')
    clf = svm.SVC(C=0.5,                         #误差项惩罚系数,默认值是1
                  kernel='linear',               #线性核 kenrel="rbf":高斯核
                  decision_function_shape='ovr') #决策函数
    return clf

In[6]

# 2.定义模型：SVM模型定义
clf = classifier()

Step3.模型训练

In[7]

#***********************训练模型*****************************
def train(clf,x_train,y_train):
    clf.fit(x_train,         #训练集特征向量
            y_train.ravel()) #训练集目标值

In[8]

#***********************训练模型*****************************
def train(clf,x_train,y_train):
    clf.fit(x_train,         #训练集特征向量
            y_train.ravel()) #训练集目标值

In[9]

# 3.训练SVM模型
train(clf,x_train,y_train)

Step4.模型评估

In[10]

#**************并判断a b是否相等，计算acc的均值*************
def show_accuracy(a, b, tip):
    acc = a.ravel() == b.ravel()
    print('%s Accuracy:%.3f' %(tip, np.mean(acc)))

In[11]

def print_accuracy(clf,x_train,y_train,x_test,y_test):
    #分别打印训练集和测试集的准确率  score(x_train,y_train):表示输出x_train,y_train在模型上的准确率
    print('trianing prediction:%.3f' %(clf.score(x_train, y_train)))
    print('test data prediction:%.3f' %(clf.score(x_test, y_test)))
    #原始结果与预测结果进行对比   predict()表示对x_train样本进行预测，返回样本类别
    show_accuracy(clf.predict(x_train), y_train, 'traing data')
    show_accuracy(clf.predict(x_test), y_test, 'testing data')
    #计算决策函数的值，表示x到各分割平面的距离
    print('decision_function:\n', clf.decision_function(x_train))

In[12]

# 4.模型评估
print_accuracy(clf,x_train,y_train,x_test,y_test)

trianing prediction:0.819
test data prediction:0.778
traing data Accuracy:0.819
testing data Accuracy:0.778
decision_function:
 [[-0.5         1.20887337  2.29112663]
 [ 2.06328814 -0.0769677   1.01367956]
 [ 2.16674973  0.91702835 -0.08377808]
 [ 2.11427813  0.99765248 -0.11193061]
 [ 0.9925538   2.06392138 -0.05647518]
 [ 2.11742969  0.95255534 -0.06998503]
 [ 2.05615004 -0.041847    0.98569697]
 [-0.31866596  1.02685964  2.29180632]
 [-0.27166251  1.09150338  2.18015913]
 [-0.37827567  1.14260447  2.2356712 ]
 [-0.22150749  1.11104997  2.11045752]
 [-0.18331208  2.10066724  1.08264485]
 [-0.05444966  0.99927764  2.05517201]
 [-0.46977766  1.17853774  2.29123992]
 [-0.05760122  2.04437478  1.01322644]
 [ 2.1747228   0.93698124 -0.11170404]
 [-0.13315707  2.12021384  1.01294323]
 [-0.21752096  2.12102642  1.09649454]
 [ 2.11427813  0.99765248 -0.11193061]
 [ 2.16359817  0.96212549 -0.12572366]
 [-0.21038286  1.08590572  2.12447714]
 [ 2.21291822  0.9265985  -0.13951672]
 [-0.13399204  1.06514025  2.06885179]
 [-0.18016052  1.0555701   2.12459042]
 [-0.2334671   1.08112064  2.15234646]
 [-0.08782356  2.0747104   1.01311315]
 [-0.20324476  1.05078502  2.15245974]
 [-0.11489433  1.05994888  2.05494545]
 [ 2.17787437 -0.1081159   0.93024154]
 [-0.23578369  2.18129137  1.05449232]
 [-0.20639632  1.09588216  2.11051416]
 [-0.21038286  1.08590572  2.12447714]
 [-0.02969547  2.11420989  0.91548558]
 [-0.12685394  1.03001955  2.09683439]
 [-0.09496166  2.1098311   0.98513056]
 [ 2.10547008 -0.07737399  0.97190391]
 [ 2.11029159  0.98767604 -0.09796763]
 [ 2.20411017 -0.14842797  0.9443178 ]
 [-0.20324476  1.05078502  2.15245974]
 [ 2.19066895  0.97688701 -0.16755596]
 [-0.16022784  2.10545232  1.05477553]
 [-0.23661866  1.12621778  2.11040088]
 [-0.09579663  2.05475752  1.04103911]
 [ 2.11344315 -0.05742111  0.94397795]
 [ 2.10231852  0.96772315 -0.07004167]
 [-0.12203243  2.09506958  1.02696285]
 [ 2.11029159  0.98767604 -0.09796763]
 [-0.41248455  1.16296364  2.2495209 ]
 [-0.16820091  1.08549943  2.08270149]
 [-0.42045762  1.14301076  2.27744686]
 [-0.24857827  1.09628845  2.15228982]
 [-0.27796564  2.18169766  1.09626798]
 [-0.09264507  1.00966038  2.08298469]
 [-0.25339978  1.03123843  2.22216135]
 [-0.05361468  2.05435123  0.99926346]
 [ 2.15395516 -0.16797456  1.01401941]
 [-0.12203243  2.09506958  1.02696285]
 [ 2.06579305  1.08825305 -0.15404611]
 [-0.11007283  2.12499891  0.98507392]
 [-0.27166251  1.09150338  2.18015913]
 [ 2.13652739  0.94736397 -0.08389137]
 [-0.29789831  1.13181544  2.16608287]
 [ 2.15163856  0.93219616 -0.08383473]
 [ 2.1747228   0.93698124 -0.11170404]
 [-0.11174277  1.01485174  2.09689103]
 [-0.06872585  2.06951904  0.99920682]
 [-0.23745364  1.0711442   2.16630944]
 [ 2.12141623  0.96253178 -0.08394801]
 [ 2.1627632  -0.09294809  0.93018489]
 [-0.06557429  1.0244219   2.04115239]
 [ 2.16758471  0.97210193 -0.13968664]
 [-0.12203243  2.09506958  1.02696285]
 [ 2.1293893   0.98248467 -0.11187396]
 [-0.21038286  1.08590572  2.12447714]
 [ 2.01962457  1.0786829  -0.09830747]
 [ 2.18269588  0.95693412 -0.13963   ]
 [-0.16106282  1.05037873  2.11068408]
 [ 2.20976665  0.97169564 -0.1814623 ]
 [-0.03850351  2.03918342  0.9993201 ]
 [ 2.17555778  0.99205482 -0.1676126 ]
 [-0.11007283  2.12499891  0.98507392]
 [-0.07502898  2.15971332  0.91531566]
 [ 2.13254086  0.93738753 -0.06992839]
 [ 2.09518042  1.00284385 -0.09802427]
 [ 1.0045134   2.09385071 -0.09836411]
 [ 2.24314055  0.89626288 -0.13940344]
 [-0.09579663  2.05475752  1.04103911]
 [-0.14910321  1.08030806  2.06879515]
 [ 2.13652739  0.94736397 -0.08389137]
 [-0.2334671   1.08112064  2.15234646]
 [-0.07271239  2.05954259  1.0131698 ]
 [-0.2739791   2.1916741   1.082305  ]
 [-0.27564905  1.08152693  2.19412211]
 [-0.12203243  2.09506958  1.02696285]
 [ 2.06013657 -0.03187056  0.97173399]
 [ 2.07608272  1.00803521 -0.08411793]
 [-0.19443672  2.12581149  1.06862523]
 [-0.16421438  2.09547587  1.06873851]
 [-0.3440668   1.12224529  2.22182151]
 [-0.1180459   2.10504603  1.01299987]
 [-0.20240979  1.10585861  2.09655118]
 [-0.17617399  1.06554654  2.11062744]
 [-0.2477433   2.15136204  1.09638126]
 [-0.2334671   1.08112064  2.15234646]
 [ 2.11029159  0.98767604 -0.09796763]]

Step5.模型使用

In[13]

def draw(clf, x):
    iris_feature = 'sepal length', 'sepal width', 'petal lenght', 'petal width'
    # 开始画图
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()               #第0列的范围
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()               #第1列的范围
    x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j]   #生成网格采样点
    grid_test = np.stack((x1.flat, x2.flat), axis=1)            #stack():沿着新的轴加入一系列数组
    print('grid_test:\n', grid_test)
    # 输出样本到决策面的距离
    z = clf.decision_function(grid_test)
    print('the distance to decision plane:\n', z)
    
    grid_hat = clf.predict(grid_test)                           # 预测分类值 得到【0,0.。。。2,2,2】
    print('grid_hat:\n', grid_hat)  
    grid_hat = grid_hat.reshape(x1.shape)                       # reshape grid_hat和x1形状一致
                                                                #若3*3矩阵e，则e.shape()为3*3,表示3行3列   
 
    cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'b', 'r'])
 
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)                                   # pcolormesh(x,y,z,cmap)这里参数代入
                                                                                      # x1，x2，grid_hat，cmap=cm_light绘制的是背景。
    plt.scatter(x[:, 0], x[:, 1], c=np.squeeze(y), edgecolor='k', s=50, cmap=cm_dark) # 样本点
    plt.scatter(x_test[:, 0], x_test[:, 1], s=120, facecolor='none', zorder=10)       # 测试点
    plt.xlabel(iris_feature[0], fontsize=20)
    plt.ylabel(iris_feature[1], fontsize=20)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('svm in iris data classification', fontsize=30)
    plt.grid()
    plt.show()

In[14]

# 5.模型使用
draw(clf,x)

grid_test:
 [[4.3       2.       ]
 [4.3       2.0120603]
 [4.3       2.0241206]
 ...
 [7.9       4.3758794]
 [7.9       4.3879397]
 [7.9       4.4      ]]
the distance to decision plane:
 [[ 2.04663576  1.0980928  -0.14472856]
 [ 2.04808477  1.09663836 -0.14472313]
 [ 2.04953377  1.09518392 -0.1447177 ]
 ...
 [-0.21454554  0.96016146  2.25438408]
 [-0.21309653  0.95870702  2.25438951]
 [-0.21164753  0.95725258  2.25439495]]
grid_hat:
 [0. 0. 0. ... 2. 2. 2.]

飞桨PaddlePaddle

关注

1
点赞
踩
40

收藏

觉得还不错? 一键收藏
0
评论
【新手入门】课程2-机器学习入门实践-鸢尾花分类

任务描述：构建一个模型，根据鸢尾花的花萼和花瓣大小将其分为三种不同的品种。数据集总共包含150行数据每一行数据由 4 个特征值及一个目标值组成。4 个特征值分别为：萼片长度、萼片宽度、花瓣长度、花瓣宽度目标值为三种不同类别的鸢尾花，分别为： Iris Setosa、Iris Versicolour、Iris Virginica首先导入必要的包：numpy：p...
复制链接

扫一扫