Learning to solve a classification problem in Python with the Iris dataset

Load the required modules
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris   # load the iris dataset
from sklearn.linear_model import LogisticRegression  # logistic regression classifier
from sklearn.model_selection import train_test_split  # split into training and test sets
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report  # evaluation metrics
Load the dataset
iris = load_iris()
print(iris.DESCR)
    Iris Plants Database
    ====================
    
    Notes
    -----
    Data Set Characteristics:
        :Number of Instances: 150 (50 in each of three classes)
        :Number of Attributes: 4 numeric, predictive attributes and the class
        :Attribute Information:
            - sepal length in cm
            - sepal width in cm
            - petal length in cm
            - petal width in cm
            - class:
                    - Iris-Setosa
                    - Iris-Versicolour
                    - Iris-Virginica
        :Summary Statistics:
    
        ============== ==== ==== ======= ===== ====================
                        Min  Max   Mean    SD   Class Correlation
        ============== ==== ==== ======= ===== ====================
        sepal length:   4.3  7.9   5.84   0.83    0.7826
        sepal width:    2.0  4.4   3.05   0.43   -0.4194
        petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
        petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
        ============== ==== ==== ======= ===== ====================
    
        :Missing Attribute Values: None
        :Class Distribution: 33.3% for each of 3 classes.
        :Creator: R.A. Fisher
        :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
        :Date: July, 1988
    
    This is a copy of UCI ML iris datasets.
    http://archive.ics.uci.edu/ml/datasets/Iris
    
    The famous Iris database, first used by Sir R.A Fisher
    
    This is perhaps the best known database to be found in the
    pattern recognition literature.  Fisher's paper is a classic in the field and
    is referenced frequently to this day.  (See Duda & Hart, for example.)  The
    data set contains 3 classes of 50 instances each, where each class refers to a
    type of iris plant.  One class is linearly separable from the other 2; the
    latter are NOT linearly separable from each other.
    
    References
    ----------
       - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
         Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
         Mathematical Statistics" (John Wiley, NY, 1950).
       - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
         (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
       - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
         Structure and Classification Rule for Recognition in Partially Exposed
         Environments".  IEEE Transactions on Pattern Analysis and Machine
         Intelligence, Vol. PAMI-2, No. 1, 67-71.
       - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
         on Information Theory, May 1972, 431-433.
       - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
         conceptual clustering system finds 3 classes in the data.
       - Many, many more ...
Inspect the data to understand it
print(iris.data[0:10])
print(iris.target)
    [[5.1 3.5 1.4 0.2]
     [4.9 3.  1.4 0.2]
     [4.7 3.2 1.3 0.2]
     [4.6 3.1 1.5 0.2]
     [5.  3.6 1.4 0.2]
     [5.4 3.9 1.7 0.4]
     [4.6 3.4 1.4 0.3]
     [5.  3.4 1.5 0.2]
     [4.4 2.9 1.4 0.2]
     [4.9 3.1 1.5 0.1]]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
     2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
     2 2]
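
The labels printed above are the integer codes 0/1/2. As a quick sanity check we can map them back to species names and confirm that the three classes are balanced (a small sketch, not in the original post; it assumes the iris object loaded above and additionally imports numpy):
import numpy as np
species = iris.target_names[iris.target]  # map the 0/1/2 codes to 'setosa'/'versicolor'/'virginica'
print(dict(zip(*np.unique(species, return_counts=True))))  # expect 50 samples per class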
Exploratory plotting of the data
# extract two feature columns for plotting
def plt_data(i, j):
    x_vals = [x[i] for x in iris.data]
    y_vals = [x[j] for x in iris.data]
    return x_vals, y_vals

# scatter plot, one colour and marker per class
def plt_catter(x_axis, y_axis):
    plt.scatter(x_axis[:50], y_axis[:50], color='red', marker='o', label='setosa')            # first 50 samples
    plt.scatter(x_axis[50:100], y_axis[50:100], color='blue', marker='x', label='versicolor')  # middle 50 samples
    plt.scatter(x_axis[100:], y_axis[100:], color='orange', marker='+', label='virginica')     # last 50 samples
    plt.legend(loc=2)  # upper-left corner
    #plt.show()

plt.figure(figsize=(16, 4))
plt.subplot(121)
x_vals, y_vals = plt_data(0, 1)   # sepal length vs. sepal width
plt_catter(x_vals, y_vals)

plt.subplot(122)
x_vals, y_vals = plt_data(2, 3)   # petal length vs. petal width
plt_catter(x_vals, y_vals)
plt.show()

[Figure: scatter plots of sepal length vs. sepal width (left) and petal length vs. petal width (right), coloured by class]
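
The slicing in plt_catter relies on the samples being stored in class order (50 per species). A variant that is robust to row order, sketched here as an alternative rather than the original author's code, selects each class with a boolean mask on iris.target (it assumes the imports and the iris object from above):
# plot petal length vs. petal width, one colour per class, using masks instead of fixed slices
for cls, marker, color in zip(range(3), ['o', 'x', '+'], ['red', 'blue', 'orange']):
    mask = iris.target == cls
    plt.scatter(iris.data[mask, 2], iris.data[mask, 3],
                marker=marker, color=color, label=iris.target_names[cls])
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.legend(loc=2)
plt.show()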

Convert the data to pandas DataFrames
df_iris=pd.DataFrame(iris.data)
df_iris_origin=pd.DataFrame(iris.data)
df_iris_class=pd.DataFrame(iris.target)
df_iris.columns=['sepal_length','sepal_width','petal_length','petal_width']
df_iris_class.columns=['category']
print(df_iris.head(10))
       sepal_length  sepal_width  petal_length  petal_width
    0           5.1          3.5           1.4          0.2
    1           4.9          3.0           1.4          0.2
    2           4.7          3.2           1.3          0.2
    3           4.6          3.1           1.5          0.2
    4           5.0          3.6           1.4          0.2
    5           5.4          3.9           1.7          0.4
    6           4.6          3.4           1.4          0.3
    7           5.0          3.4           1.5          0.2
    8           4.4          2.9           1.4          0.2
    9           4.9          3.1           1.5          0.1
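
If you prefer to keep the features and the label together while exploring, the two DataFrames can be concatenated column-wise and summarised per class in one step (a small sketch, not part of the original flow, assuming df_iris and df_iris_class from above):
df_all = pd.concat([df_iris, df_iris_class], axis=1)  # 150 rows: 4 feature columns + 'category'
print(df_all.groupby('category').mean())              # per-class mean of each feature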
Split into training and test sets
x_train1,x_test1,y_train1,y_test1 = train_test_split(df_iris_origin,df_iris_class['category'].tolist(),test_size=0.3,random_state=0)
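
One optional refinement, not used in the original code, is to pass stratify= so that each class keeps its one-third share in both splits. Note that the confusion matrices shown below were produced with the plain split above:
# stratified variant of the same split (sketch; the results below use the unstratified split)
x_train1, x_test1, y_train1, y_test1 = train_test_split(
    df_iris_origin, df_iris_class['category'].tolist(),
    test_size=0.3, random_state=0, stratify=df_iris_class['category'])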
Train the model and predict on the test set
lr1 = LogisticRegression()
lr1.fit(x_train1,y_train1)
y_predict1=lr1.predict(x_test1)
Model evaluation:
#accuracy_score(y_test1, y_predict1)
#classification_report(y_test1, y_predict1)
print('########## Results without engineered features ##########')
print('----- Training set ---------')
print(confusion_matrix(y_train1, lr1.predict(x_train1)))
print('----- Test set ---------')
print(confusion_matrix(y_test1, y_predict1))
    ########## Results without engineered features ##########
    ----- Training set ---------
    [[34  0  0]
     [ 0 26  6]
     [ 0  0 39]]
    ----- Test set ---------
    [[16  0  0]
     [ 0 13  5]
     [ 0  0 11]]
print("因为模型效果不好,模型需要改进优化!!!")
print("因为模型效果不好,模型需要改进优化!!!")
print("因为模型效果不好,模型需要改进优化!!!")
    因为模型效果不好,模型需要改进优化!!!
    因为模型效果不好,模型需要改进优化!!!
    因为模型效果不好,模型需要改进优化!!!
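
The two commented-out lines above hint at richer metrics than the confusion matrix. A short sketch for the test set of this first model (assuming y_test1, y_predict1 and iris from the cells above):
print('test accuracy: %.3f' % accuracy_score(y_test1, y_predict1))
print(classification_report(y_test1, y_predict1, target_names=iris.target_names))
From the test matrix above, 40 of 45 samples are classified correctly, i.e. roughly 88.9% accuracy.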
Construct new features
df_iris['sepal_divide']=df_iris.sepal_length/df_iris.sepal_width
df_iris['petal_divide']=df_iris.petal_length/df_iris.petal_width
df_iris['sepal_multiply']=df_iris.sepal_length*df_iris.sepal_width
df_iris['petal_multiply']=df_iris.petal_length*df_iris.petal_width
df_iris['sepal_square']=df_iris.sepal_length*df_iris.sepal_length
df_iris['petal_square']=df_iris.petal_length*df_iris.petal_length
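
These hand-built columns are ratios, products and squares of the original measurements. The products and squares (though not the ratios) could also be generated systematically with scikit-learn's PolynomialFeatures; this is an alternative sketch, not what the original post does:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)  # all products of degree <= 2 of the 4 raw features
X_poly = poly.fit_transform(iris.data)                   # shape (150, 14): 4 original + 10 degree-2 terms
print(X_poly.shape)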

Split into training and test sets
x_train,x_test,y_train,y_test = train_test_split(df_iris,df_iris_class['category'].tolist(),test_size=0.3,random_state=0)
Train the model and predict on the test set
lr = LogisticRegression()
lr.fit(x_train,y_train)
y_predict=lr.predict(x_test)
print('########## Results with engineered features ##########')
print('----- Training set ---------')
print(confusion_matrix(y_train, lr.predict(x_train)))
print('----- Test set ---------')
print(confusion_matrix(y_test, y_predict))
    ########## Results with engineered features ##########
    ----- Training set ---------
    [[34  0  0]
     [ 0 31  1]
     [ 0  0 39]]
    ----- Test set ---------
    [[16  0  0]
     [ 0 16  2]
     [ 0  0 11]]
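
To put a single number on the improvement, compare the test accuracy of the two models (a sketch assuming lr1, lr and both splits from above; exact values may vary slightly with the scikit-learn version):
print('without engineered features: %.3f' % accuracy_score(y_test1, lr1.predict(x_test1)))
print('with engineered features:    %.3f' % accuracy_score(y_test, lr.predict(x_test)))
From the confusion matrices above this corresponds to 40/45 ≈ 0.889 without the new features and 43/45 ≈ 0.956 with them.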