【Machine Learning】Iris Machine Learning Notes

Latest recommended article published 2023-03-21 16:25:13

JPTJYY

Category: Machine Learning  Tags: Machine Learning

Article link: https://blog.csdn.net/buct200921073/article/details/100991516


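Contents

Dataset
Data processing
Model training
- Training and test sets
  - Extracting input X and output Y
  - Splitting into training and test sets
Model selection
Model evaluation
Exporting the predictions
Full code
References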

Dataset

Data source
http://archive.ics.uci.edu/ml/datasets/Iris

Summary

Type	Instances	Area	Attribute type	Attributes	Missing values?	Associated task
Multivariate	150	Life sciences	Real	4	None	Classification

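Files in the data folder

index - index of file names
bezdekIris.data - the BezdekIris dataset (data)
iris.data - the data as given in Fisher's paper [1] (data)
iris.names - documentation describing the data
All of these files can be opened with applications such as WordPad or Word.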

Notes
bezdekIris.data and iris.data differ in two samples (the values quoted below are the corrected ones, as found in bezdekIris.data):

The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa"
where the error is in the fourth feature.
The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa"
where the errors are in the second and third features.

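The table below gives the details. (The author notes that the iris dataset bundled with sklearn is the same as iris.data; this may depend on the scikit-learn version, so check it against your installed copy.)

Differing sample	Dataset	Sepal length	Sepal width	Petal length	Petal width
35	bezdekIris.data	4.9	3.1	1.5	0.2
35	iris.data	4.9	3.1	1.5	0.1
38	bezdekIris.data	4.9	3.6	1.4	0.1
38	iris.data	4.9	3.1	1.5	0.1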
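As a small illustration of the table above, the sketch below compares the two versions of samples 35 and 38 and reports which features differ. The values are copied from the table; the helper name `differing_features` and the variable names are made up for this example.

```python
# Feature order used throughout this article.
FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Samples 35 and 38 (1-based numbering) as listed in the table above.
bezdek_rows = {35: [4.9, 3.1, 1.5, 0.2], 38: [4.9, 3.6, 1.4, 0.1]}
iris_data_rows = {35: [4.9, 3.1, 1.5, 0.1], 38: [4.9, 3.1, 1.5, 0.1]}


def differing_features(row_a, row_b):
    """Return the names of the features whose values differ between two rows."""
    return [name for name, a, b in zip(FEATURES, row_a, row_b) if a != b]


diffs = {n: differing_features(bezdek_rows[n], iris_data_rows[n]) for n in bezdek_rows}
print(diffs)
```

This agrees with the note quoted earlier: sample 35 differs in the fourth feature, sample 38 in the second and third.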

This article uses the bezdekIris.data dataset.

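Data processing

Reading the data

Method 1: convert the .data file to CSV

Rename the file extension to .csv directly
(1) In Windows File Explorer, make sure "File name extensions" is checked under "Show/hide" on the "View" tab.

(2) Change the file extension to .csv. A warning that the file "might become unusable" appears; click "Yes (Y)".

(3) After the change, open the file in Excel. Note that the data has no header row, so column names must be supplied in the code that reads the data in the next step.
Reading in the data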

import numpy as np
import pandas as pd

df = pd.read_csv(r'E:\Projects\Iris\bezdekIris.csv', 
                 header = None, 
                 names = ['sepal_length','sepal_width','petal_length','petal_width','class'])

# print(df.head())  # view the first 5 rows
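To see what `header = None` and `names = [...]` do without depending on a local file path, here is a minimal sketch that reads two header-less rows from an in-memory string. The rows are samples 35 and 38 from the bezdekIris.data table earlier in this article; `raw` and `sample` are names chosen only for this example.

```python
import io

import pandas as pd

# Two header-less rows in the same comma-separated layout as the dataset file.
raw = io.StringIO(
    "4.9,3.1,1.5,0.2,Iris-setosa\n"
    "4.9,3.6,1.4,0.1,Iris-setosa\n"
)
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# header=None: the first line is data, not column names; names= supplies them.
sample = pd.read_csv(raw, header=None, names=columns)
print(sample)
print(sample.dtypes)
```

Without `header=None`, pandas would treat the first data row as column names and the frame would silently lose one sample.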


Method 2: read the .data file directly

import pandas as pd

columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
rows = []
# The .data file is plain comma-separated text, so it can be parsed line by line.
# Note: the path points at the original .data file, not the renamed .csv.
with open(r'E:\Projects\Iris\bezdekIris.data', 'r') as f:
    for line in f:
        line = line.strip()
        if line != '':  # skip blank lines
            lst = line.split(',')
            # The first 4 fields are numeric features; the last is the class label
            values = [float(v) for v in lst[:4]] + [lst[4]]
            rows.append(dict(zip(columns, values)))

df = pd.DataFrame(rows, columns=columns)
# print(df.head())  # view the first 5 rows


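Inspecting basic information about the data

Because the iris dataset is complete, it needs no cleaning or other preprocessing. In real applications, however, data is often incomplete or even wrong, so it is a good habit to inspect the basic properties of the data first.
Here we use .describe() from the pandas package.

print(len(df))  # number of rows
print(df.describe())  # summary statistics for each column
print(df.groupby('class').count())  # number of samples per class; helps when choosing an evaluation metric (see the [Tip] under "Model evaluation" below)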

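The output (a screenshot in the original post) shows the following:

len(df) returns 150, so there are 150 records in total;
in the .describe() output, the count is 150 for every feature and every min is greater than 0;
the per-class counts show that each iris class has 50 records.
In short, the data consists of 150 records (50 per class) with no missing values and no zero values, and the summary also gives a rough picture of how the values are distributed.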
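The same checks can be made explicit rather than read off the `.describe()` table. The sketch below runs them on a tiny hypothetical frame that deliberately contains one missing value and one zero, so each check has something to report; `toy` and its values are invented for illustration and are not part of the iris data.

```python
import pandas as pd

# Hypothetical frame: one missing sepal_length and one zero petal_width.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, None, 6.3],
    "petal_width": [0.2, 0.0, 1.4, 2.5],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

missing_per_column = toy.isna().sum()                                  # missing values per column
non_positive = (toy[["sepal_length", "petal_width"]] <= 0).sum()       # values that are zero or negative
class_counts = toy["class"].value_counts()                             # samples per class

print(missing_per_column)
print(non_positive)
print(class_counts)
```

On the real dataset, all of the missing and non-positive counts should come out as 0 and every class count as 50, matching the conclusions above.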

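Data visualization

The data is multivariate, so we plot it with .pairplot() from the seaborn package. It colors the points by class automatically and plots every pair of variables against each other, which gives a more intuitive picture of the data.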

import seaborn as sns

sns.set_style('ticks')
sns.pairplot(df,
             kind = 'scatter',
             diag_kind = 'hist',
             hue = 'class',
             palette = 'husl',
             markers = ['o','s','D'])

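[Figure: pair plot of the four features, colored by class]
As the plot shows, setosa (red) is more clearly separated from the other two iris classes (blue and green).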
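The visual impression can be checked numerically. As a rough sketch, the code below uses the copy of the iris data bundled with scikit-learn as a stand-in (an assumption made so the example needs no local file; a few values may differ from bezdekIris.data) and compares the petal-length range of setosa with that of the other two classes.

```python
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]  # column 2 is petal length (cm)

# Print the petal-length range for each class.
for label, name in enumerate(iris.target_names):
    values = petal_length[iris.target == label]
    print(f"{name:<12} min={values.min():.1f} max={values.max():.1f}")

# Label 0 is setosa in scikit-learn's copy of the data.
setosa_max = petal_length[iris.target == 0].max()
others_min = petal_length[iris.target != 0].min()
print("setosa separable by petal length alone:", setosa_max < others_min)
```

If the largest setosa petal is shorter than the smallest petal of the other classes, a single threshold on that one feature already isolates setosa, which is what the pair plot suggests.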

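Model training

Training and test sets

Extracting input X and output Y

Looking at the layout of the data (df), the first 4 columns are the inputs (X) the model learns from and the last column is the corresponding output (Y). Before splitting into training and test sets, X and Y have to be extracted:

X = df.iloc[:,0:4]  # first 4 columns of df (column positions 0-3; the slice is half-open)
Y = df.iloc[:,4]    # 5th column of df (column position 4)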
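A quick way to convince yourself of the half-open slice is to compare positional selection with selection by column name. This sketch uses a two-row frame built from the sample values quoted earlier in this article; `toy`, `X_pos`, and the other names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame(
    [[4.9, 3.1, 1.5, 0.2, "Iris-setosa"],
     [4.9, 3.6, 1.4, 0.1, "Iris-setosa"]],
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"],
)

X_pos = toy.iloc[:, 0:4]              # positions 0, 1, 2, 3 (position 4 is excluded)
Y_pos = toy.iloc[:, 4]                # position 4 only, returned as a Series
X_name = toy.drop(columns="class")    # the same columns, selected by name
Y_name = toy["class"]

print(X_pos.equals(X_name), Y_pos.equals(Y_name))
```

Selecting by name is a little more robust if the column order ever changes, while `iloc` is convenient when the layout is fixed, as it is here.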

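Splitting into training and test sets

To evaluate how well a trained model performs, the dataset is split into two parts: a training set and a test set.

Training set: used to train the model
Test set: used to evaluate the trained model

Common splitting methods

Hold-out: set aside a fixed proportion of the data as the test set; for example, 30% of the data is held out and takes no part in training.

Cross-validation: k-fold cross-validation is the usual choice. The data is divided into k parts and k experiments are run; in each one, a different part serves as the validation data and the remaining k-1 parts are used for training. The model is then judged on its overall performance across the k runs.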
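The hold-out method is not used in the rest of this article, so here is a minimal sketch of it for comparison. It uses `train_test_split` from scikit-learn on placeholder arrays with the same shape as the iris data (150 samples, 3 classes of 50); the `stratify` and `random_state` arguments are choices made for this example, not something the article prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data shaped like the iris set: 150 samples, 4 features, 3 balanced classes.
X_demo = np.arange(150 * 4).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

# Hold out 30% of the samples; stratify keeps the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

print(len(X_tr), len(X_te))   # sizes of the training and test sets
print(np.bincount(y_te))      # number of test samples per class
```

With only 150 samples, a single 30% hold-out gives an evaluation based on 45 samples, which is one reason k-fold cross-validation is attractive for small datasets.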

Here we use k-fold cross-validation, via KFold from sklearn.model_selection.

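from sklearn.model_selection import KFold

# Set up the splitting method
k = 10
# Shuffle the data (rather than splitting it in order) and divide it into 10 folds.
# Each of the 10 rounds uses one fold for validation and the other 9 for training.
# With 150 samples, each fold holds 15 samples.
kf = KFold(n_splits = k, shuffle = True)

# Apply the method to split the dataset
for train_index, test_index in kf.split(df):
    # KFold yields positional indices, so select rows with .iloc
    x_train, x_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]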
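To make the k-fold bookkeeping concrete, the following sketch checks two properties of a 10-fold split of 150 samples: every fold has 15 test samples, and every sample lands in a test fold exactly once. The placeholder array and the fixed `random_state` are assumptions added so the example is reproducible.

```python
import numpy as np
from sklearn.model_selection import KFold

kf_demo = KFold(n_splits=10, shuffle=True, random_state=0)
placeholder = np.zeros((150, 4))  # only the number of rows matters for splitting

test_sizes = []
seen = []
for train_idx, test_idx in kf_demo.split(placeholder):
    test_sizes.append(len(test_idx))   # size of this fold's test part
    seen.extend(test_idx.tolist())     # collect every index used for testing

print(test_sizes)
print(sorted(seen) == list(range(150)))  # each sample is tested exactly once
```

Note that plain `KFold` does not balance classes within a fold; `StratifiedKFold` from the same module does, if that matters for your data.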

Model selection

Commonly used classification algorithms include:

Decision tree
Naive Bayes
SVM (Support Vector Machine)
KNN (k-Nearest Neighbors)
Random forest
Neural networks
Boosting / AdaBoost
...

For the usual strengths and weaknesses of each algorithm, see references [2][3].

Here we try the following algorithms available in sklearn:

Decision tree: from sklearn import tree
Naive Bayes: from sklearn import naive_bayes
SVM: from sklearn import svm
Random forest: from sklearn.ensemble import RandomForestClassifier
GBDT, gradient boosted decision trees (a form of boosting): from sklearn.ensemble import GradientBoostingClassifier
MLP, multilayer perceptron (a type of neural network): from sklearn.neural_network import MLPClassifier

from sklearn import tree, naive_bayes, svm
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# Dictionary of model factories: each lambda builds a new, untrained model
models = {'decision tree': lambda: tree.DecisionTreeClassifier(),
          'naive bayes': lambda: naive_bayes.GaussianNB(),
          'svm': lambda: svm.SVC(),
          'random forest': lambda: RandomForestClassifier(),
          'GBDT': lambda: GradientBoostingClassifier(),
          'MLP': lambda: MLPClassifier(max_iter = 1000)
          }

# Loop over the models and folds: train, then predict
for model_name, make_model in models.items():
    for train_index, test_index in kf.split(df):
        x_train, x_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]
        model = make_model()  # call the lambda to get a fresh model for this fold
        model.fit(x_train, y_train)  # train on the training set
        y_predict = model.predict(x_test)  # predict on the test set

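Model evaluation

Common evaluation metrics for classification problems include [4]:

accuracy
misclassification rate
precision
recall
F1 score
ROC curve
AUC
PR curve
AP, mAP, etc.

Here we use accuracy, the proportion of correctly classified samples among all samples:
$\text{accuracy} = n_{\text{correct}} / n_{\text{total}}$
where the numerator $n_{\text{correct}}$ is the number of correctly classified samples and the denominator $n_{\text{total}}$ is the total number of samples.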

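[Tip]
Accuracy is the simplest and most intuitive metric for classification, but it has an obvious weakness. If 99% of the samples are positive, a classifier that always predicts positive reaches 99% accuracy even though it is useless in practice. In other words, when the classes are heavily imbalanced, the largest class tends to dominate the accuracy.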
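The tip can be reproduced in a few lines of plain Python. The sketch below defines accuracy as correct predictions over total samples and applies it to an artificial 99:1 dataset with a "classifier" that always answers positive; the helper names `accuracy` and `recall` and the labels `"pos"`/`"neg"` are invented for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


y_true = ["pos"] * 99 + ["neg"] * 1   # 99% positive, 1% negative
y_pred = ["pos"] * 100                # always predict positive

acc = accuracy(y_true, y_pred)
neg_recall = recall(y_true, y_pred, "neg")
print(acc, neg_recall)
```

The accuracy looks excellent while the recall on the minority class shows that not a single negative sample is ever found, which is why per-class metrics are worth checking on imbalanced data.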

Because the three classes in this dataset have 50 samples each, the classes are balanced and accuracy is an adequate metric here. We use accuracy_score from sklearn.metrics.

from sklearn.metrics import accuracy_score

# Loop over the models and folds: train, predict, and evaluate
for model_name, make_model in models.items():
    accuracies = []
    for train_index, test_index in kf.split(df):
        x_train, x_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]
        model = make_model()  # call the lambda to get a fresh model for this fold
        model.fit(x_train, y_train)  # train on the training set
        y_predict = model.predict(x_test)  # predict on the test set
        accuracy = accuracy_score(y_pred = y_predict, y_true = y_test)  # evaluate
        accuracies.append(accuracy)  # one value per fold (10 values after cross-validation)
    print(model_name, np.mean(accuracies))  # mean accuracy of each model

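[Figure: mean accuracy printed for each model]

In the author's run, the models built with the SVM and the MLP neural network reached somewhat higher accuracy than the others. Because KFold shuffles the data without a fixed random_state, the exact numbers vary from run to run.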
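The explicit loop above can also be written more compactly with scikit-learn's `cross_val_score`, which fits a fresh clone of the estimator on each fold and returns one score per fold. The sketch below is an alternative, not the article's method; it uses the iris copy bundled with scikit-learn as a stand-in for bezdekIris.data and fixes `random_state` so the result is repeatable, both of which are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Same splitting scheme as in the article: 10 shuffled folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy value per fold; the estimator is cloned and refit for each fold.
scores = cross_val_score(SVC(), X_iris, y_iris, cv=cv, scoring="accuracy")
print(len(scores), scores.mean())
```

Passing the `KFold` object explicitly matters: with an integer `cv` and a classifier, scikit-learn would use stratified folds instead, which is a slightly different experiment.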

Exporting the predictions

rows = []  # one dict per test sample, across all models and folds
for model_name, make_model in models.items():
    accuracies = []
    model_rows = []  # rows for the current model
    # i is the round (fold) number, starting at 1
    for i, (train_index, test_index) in enumerate(kf.split(df), start = 1):
        x_train, x_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]
        model = make_model()  # call the lambda to get a fresh model for this fold
        model.fit(x_train, y_train)  # train on the training set
        y_predict = model.predict(x_test)  # predict on the test set
        accuracy = accuracy_score(y_pred = y_predict, y_true = y_test)  # evaluate
        accuracies.append(accuracy)  # one value per fold (10 values after cross-validation)
        # y_predict is in the same order as test_index, so pair them up directly
        for j, j_class_predict in zip(test_index, y_predict):
            j_class_true = df.iloc[j,4]  # true class
            model_rows.append({'model': model_name,
                               'average_accuracy': None,  # filled in once all folds are done
                               'round': i,
                               'accuracy': accuracy,
                               'Y/N': 'Yes' if j_class_predict == j_class_true else 'No',
                               'index': j,
                               'sepal_length': df.iloc[j,0],
                               'sepal_width': df.iloc[j,1],
                               'petal_length': df.iloc[j,2],
                               'petal_width': df.iloc[j,3],
                               'class_true': j_class_true,
                               'class_predict': j_class_predict})
    average_accuracy = np.mean(accuracies)
    print(model_name, average_accuracy)  # mean accuracy of each model
    for row in model_rows:
        row['average_accuracy'] = average_accuracy  # mean over all 10 folds
    rows.extend(model_rows)  # collect the rows of every model

# Build the DataFrame once from the collected rows (DataFrame.append was removed in pandas 2.0)
results = pd.DataFrame(rows)
results.to_csv(r'E:\Projects\Iris\results.csv', index = False)  # write the results file


Full code

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn import tree, naive_bayes, svm
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Read the data
df = pd.read_csv(r'E:\Projects\Iris\bezdekIris.csv',
                 header = None,
                 names = ['sepal_length','sepal_width','petal_length','petal_width','class'])

# Inspect basic information about the data
print(len(df))  # number of rows
print(df.describe())  # summary statistics for each column
print(df.groupby('class').count())  # number of samples per class

# Data visualization
sns.set_style('ticks')
sns.pairplot(df,
             kind = 'scatter',
             diag_kind = 'hist',
             hue = 'class',
             palette = 'husl',
             markers = ['o','s','D'])

# Model training

## Extract input X and output Y
X = df.iloc[:,0:4]  # first 4 columns of df (column positions 0-3; the slice is half-open)
Y = df.iloc[:,4]    # 5th column of df (column position 4)

## Set up the splitting method
k = 10
kf = KFold(n_splits = k, shuffle = True)

## Dictionary of model factories: each lambda builds a new, untrained model
models = {'decision tree': lambda: tree.DecisionTreeClassifier(),
          'naive bayes': lambda: naive_bayes.GaussianNB(),
          'svm': lambda: svm.SVC(),
          'random forest': lambda: RandomForestClassifier(),
          'GBDT': lambda: GradientBoostingClassifier(),
          'MLP': lambda: MLPClassifier(max_iter = 1000)
          }

## Loop over the models and folds: train, predict, evaluate, and write the results file

rows = []  # one dict per test sample, across all models and folds
for model_name, make_model in models.items():
    accuracies = []
    model_rows = []  # rows for the current model
    # i is the round (fold) number, starting at 1
    for i, (train_index, test_index) in enumerate(kf.split(df), start = 1):
        x_train, x_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]
        model = make_model()  # call the lambda to get a fresh model for this fold
        model.fit(x_train, y_train)  # train on the training set
        y_predict = model.predict(x_test)  # predict on the test set
        accuracy = accuracy_score(y_pred = y_predict, y_true = y_test)  # evaluate
        accuracies.append(accuracy)  # one value per fold (10 values after cross-validation)
        # y_predict is in the same order as test_index, so pair them up directly
        for j, j_class_predict in zip(test_index, y_predict):
            j_class_true = df.iloc[j,4]  # true class
            model_rows.append({'model': model_name,
                               'average_accuracy': None,  # filled in once all folds are done
                               'round': i,
                               'accuracy': accuracy,
                               'Y/N': 'Yes' if j_class_predict == j_class_true else 'No',
                               'index': j,
                               'sepal_length': df.iloc[j,0],
                               'sepal_width': df.iloc[j,1],
                               'petal_length': df.iloc[j,2],
                               'petal_width': df.iloc[j,3],
                               'class_true': j_class_true,
                               'class_predict': j_class_predict})
    average_accuracy = np.mean(accuracies)
    print(model_name, average_accuracy)  # mean accuracy of each model
    for row in model_rows:
        row['average_accuracy'] = average_accuracy  # mean over all 10 folds
    rows.extend(model_rows)  # collect the rows of every model

# Build the DataFrame once from the collected rows (DataFrame.append was removed in pandas 2.0)
results = pd.DataFrame(rows)
results.to_csv(r'E:\Projects\Iris\results.csv', index = False)  # write the results file

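References

[1] Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936).
[2] "Data mining essentials summary (3): classification algorithms" (in Chinese), https://blog.csdn.net/weixin_39793644/article/details/78979859
[3] "Pros and cons of common classification algorithms" (in Chinese), https://www.cnblogs.com/binyang/p/11289299.html
[4] "Model evaluation metrics: classification tasks" (in Chinese), https://www.jianshu.com/p/c1978f632710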
