Learning to solve a classification problem in Python with the Iris dataset

Load the required modules
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris   # load the iris dataset
from sklearn.linear_model import LogisticRegression  # logistic regression classifier
from sklearn.model_selection import train_test_split  # split into training and test sets
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report  # evaluation metrics
Load the dataset
iris = load_iris()
print(iris.DESCR)
    Iris Plants Database
    ====================
    
    Notes
    -----
    Data Set Characteristics:
        :Number of Instances: 150 (50 in each of three classes)
        :Number of Attributes: 4 numeric, predictive attributes and the class
        :Attribute Information:
            - sepal length in cm
            - sepal width in cm
            - petal length in cm
            - petal width in cm
            - class:
                    - Iris-Setosa
                    - Iris-Versicolour
                    - Iris-Virginica
        :Summary Statistics:
    
        ============== ==== ==== ======= ===== ====================
                        Min  Max   Mean    SD   Class Correlation
        ============== ==== ==== ======= ===== ====================
        sepal length:   4.3  7.9   5.84   0.83    0.7826
        sepal width:    2.0  4.4   3.05   0.43   -0.4194
        petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
        petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
        ============== ==== ==== ======= ===== ====================
    
        :Missing Attribute Values: None
        :Class Distribution: 33.3% for each of 3 classes.
        :Creator: R.A. Fisher
        :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
        :Date: July, 1988
    
    This is a copy of UCI ML iris datasets.
    http://archive.ics.uci.edu/ml/datasets/Iris
    
    The famous Iris database, first used by Sir R.A Fisher
    
    This is perhaps the best known database to be found in the
    pattern recognition literature.  Fisher's paper is a classic in the field and
    is referenced frequently to this day.  (See Duda & Hart, for example.)  The
    data set contains 3 classes of 50 instances each, where each class refers to a
    type of iris plant.  One class is linearly separable from the other 2; the
    latter are NOT linearly separable from each other.
    
    References
    ----------
       - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
         Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
         Mathematical Statistics" (John Wiley, NY, 1950).
       - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
         (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
       - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
         Structure and Classification Rule for Recognition in Partially Exposed
         Environments".  IEEE Transactions on Pattern Analysis and Machine
         Intelligence, Vol. PAMI-2, No. 1, 67-71.
       - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
         on Information Theory, May 1972, 431-433.
       - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
         conceptual clustering system finds 3 classes in the data.
       - Many, many more ...
Inspect the data to understand it
print(iris.data[0:10])
print(iris.target)
    [[5.1 3.5 1.4 0.2]
     [4.9 3.  1.4 0.2]
     [4.7 3.2 1.3 0.2]
     [4.6 3.1 1.5 0.2]
     [5.  3.6 1.4 0.2]
     [5.4 3.9 1.7 0.4]
     [4.6 3.4 1.4 0.3]
     [5.  3.4 1.5 0.2]
     [4.4 2.9 1.4 0.2]
     [4.9 3.1 1.5 0.1]]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
     2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
     2 2]
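
The labels printed above are the integer codes 0/1/2. As a quick sanity check we can map them back to species names and confirm that the three classes are balanced (a small sketch, not in the original post; it assumes the iris object loaded above and additionally imports numpy):
import numpy as np
species = iris.target_names[iris.target]  # map the 0/1/2 codes to 'setosa'/'versicolor'/'virginica'
print(dict(zip(*np.unique(species, return_counts=True))))  # expect 50 samples per class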
Exploratory plotting of the data
# extract two feature columns for plotting
def plt_data(i, j):
    x_vals = [x[i] for x in iris.data]
    y_vals = [x[j] for x in iris.data]
    return x_vals, y_vals

# scatter plot, one colour and marker per class
def plt_catter(x_axis, y_axis):
    plt.scatter(x_axis[:50], y_axis[:50], color='red', marker='o', label='setosa')            # first 50 samples
    plt.scatter(x_axis[50:100], y_axis[50:100], color='blue', marker='x', label='versicolor')  # middle 50 samples
    plt.scatter(x_axis[100:], y_axis[100:], color='orange', marker='+', label='virginica')     # last 50 samples
    plt.legend(loc=2)  # upper-left corner
    #plt.show()

plt.figure(figsize=(16, 4))
plt.subplot(121)
x_vals, y_vals = plt_data(0, 1)   # sepal length vs. sepal width
plt_catter(x_vals, y_vals)

plt.subplot(122)
x_vals, y_vals = plt_data(2, 3)   # petal length vs. petal width
plt_catter(x_vals, y_vals)
plt.show()

[Figure: scatter plots of sepal length vs. sepal width (left) and petal length vs. petal width (right), coloured by class]
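
The slicing in plt_catter relies on the samples being stored in class order (50 per species). A variant that is robust to row order, sketched here as an alternative rather than the original author's code, selects each class with a boolean mask on iris.target (it assumes the imports and the iris object from above):
# plot petal length vs. petal width, one colour per class, using masks instead of fixed slices
for cls, marker, color in zip(range(3), ['o', 'x', '+'], ['red', 'blue', 'orange']):
    mask = iris.target == cls
    plt.scatter(iris.data[mask, 2], iris.data[mask, 3],
                marker=marker, color=color, label=iris.target_names[cls])
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.legend(loc=2)
plt.show()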

Convert the data to pandas DataFrames
df_iris=pd.DataFrame(iris.data)
df_iris_origin=pd.DataFrame(iris.data)
df_iris_class=pd.DataFrame(iris.target)
df_iris.columns=['sepal_length','sepal_width','petal_length','petal_width']
df_iris_class.columns=['category']
print(df_iris.head(10))
       sepal_length  sepal_width  petal_length  petal_width
    0           5.1          3.5           1.4          0.2
    1           4.9          3.0           1.4          0.2
    2           4.7          3.2           1.3          0.2
    3           4.6          3.1           1.5          0.2
    4           5.0          3.6           1.4          0.2
    5           5.4          3.9           1.7          0.4
    6           4.6          3.4           1.4          0.3
    7           5.0          3.4           1.5          0.2
    8           4.4          2.9           1.4          0.2
    9           4.9          3.1           1.5          0.1
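
If you prefer to keep the features and the label together while exploring, the two DataFrames can be concatenated column-wise and summarised per class in one step (a small sketch, not part of the original flow, assuming df_iris and df_iris_class from above):
df_all = pd.concat([df_iris, df_iris_class], axis=1)  # 150 rows: 4 feature columns + 'category'
print(df_all.groupby('category').mean())              # per-class mean of each feature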
Split into training and test sets
x_train1,x_test1,y_train1,y_test1 = train_test_split(df_iris_origin,df_iris_class['category'].tolist(),test_size=0.3,random_state=0)
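
One optional refinement, not used in the original code, is to pass stratify= so that each class keeps its one-third share in both splits. Note that the confusion matrices shown below were produced with the plain split above:
# stratified variant of the same split (sketch; the results below use the unstratified split)
x_train1, x_test1, y_train1, y_test1 = train_test_split(
    df_iris_origin, df_iris_class['category'].tolist(),
    test_size=0.3, random_state=0, stratify=df_iris_class['category'])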
Train the model and predict on the test set
lr1 = LogisticRegression()
lr1.fit(x_train1,y_train1)
y_predict1=lr1.predict(x_test1)
Model evaluation:
#accuracy_score(y_test1, y_predict1)
#classification_report(y_test1, y_predict1)
print('########## Results without engineered features ##########')
print('----- Training set ---------')
print(confusion_matrix(y_train1, lr1.predict(x_train1)))
print('----- Test set ---------')
print(confusion_matrix(y_test1, y_predict1))
    ########## Results without engineered features ##########
    ----- Training set ---------
    [[34  0  0]
     [ 0 26  6]
     [ 0  0 39]]
    ----- Test set ---------
    [[16  0  0]
     [ 0 13  5]
     [ 0  0 11]]
print("因为模型效果不好,模型需要改进优化!!!")
print("因为模型效果不好,模型需要改进优化!!!")
print("因为模型效果不好,模型需要改进优化!!!")
    因为模型效果不好,模型需要改进优化!!!
    因为模型效果不好,模型需要改进优化!!!
    因为模型效果不好,模型需要改进优化!!!
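
The two commented-out lines above hint at richer metrics than the confusion matrix. A short sketch for the test set of this first model (assuming y_test1, y_predict1 and iris from the cells above):
print('test accuracy: %.3f' % accuracy_score(y_test1, y_predict1))
print(classification_report(y_test1, y_predict1, target_names=iris.target_names))
From the test matrix above, 40 of 45 samples are classified correctly, i.e. roughly 88.9% accuracy.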
Construct new features
df_iris['sepal_divide']=df_iris.sepal_length/df_iris.sepal_width
df_iris['petal_divide']=df_iris.petal_length/df_iris.petal_width
df_iris['sepal_multiply']=df_iris.sepal_length*df_iris.sepal_width
df_iris['petal_multiply']=df_iris.petal_length*df_iris.petal_width
df_iris['sepal_square']=df_iris.sepal_length*df_iris.sepal_length
df_iris['petal_square']=df_iris.petal_length*df_iris.petal_length
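
These hand-built columns are ratios, products and squares of the original measurements. The products and squares (though not the ratios) could also be generated systematically with scikit-learn's PolynomialFeatures; this is an alternative sketch, not what the original post does:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)  # all products of degree <= 2 of the 4 raw features
X_poly = poly.fit_transform(iris.data)                   # shape (150, 14): 4 original + 10 degree-2 terms
print(X_poly.shape)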

Split into training and test sets
x_train,x_test,y_train,y_test = train_test_split(df_iris,df_iris_class['category'].tolist(),test_size=0.3,random_state=0)
Train the model and predict on the test set
lr = LogisticRegression()
lr.fit(x_train,y_train)
y_predict=lr.predict(x_test)
print('########## Results with engineered features ##########')
print('----- Training set ---------')
print(confusion_matrix(y_train, lr.predict(x_train)))
print('----- Test set ---------')
print(confusion_matrix(y_test, y_predict))
    ########## Results with engineered features ##########
    ----- Training set ---------
    [[34  0  0]
     [ 0 31  1]
     [ 0  0 39]]
    ----- Test set ---------
    [[16  0  0]
     [ 0 16  2]
     [ 0  0 11]]
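
To put a single number on the improvement, compare the test accuracy of the two models (a sketch assuming lr1, lr and both splits from above; exact values may vary slightly with the scikit-learn version):
print('without engineered features: %.3f' % accuracy_score(y_test1, lr1.predict(x_test1)))
print('with engineered features:    %.3f' % accuracy_score(y_test, lr.predict(x_test)))
From the confusion matrices above this corresponds to 40/45 ≈ 0.889 without the new features and 43/45 ≈ 0.956 with them.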