《Python机器学习基础教程》 Study Notes (2)

Chapter 1

  • 1.7 A First Application: Classifying Iris Species

  • 1.7.1 Meet the Data

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from sklearn.datasets import load_iris

# TODO Meet the Data
"""
    Iris dataset
"""
iris_dataset = load_iris()
# The iris object returned by load_iris is a Bunch object, which is very similar to a dictionary: it contains keys and values


print(type(iris_dataset))  # <class 'sklearn.utils.Bunch'>
print(type(iris_dataset.keys()))  # <class 'dict_keys'>
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))
"""
    打印信息
    Keys of iris_dataset:
    dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
"""

keys_li = ['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']
for k in keys_li:
    print("键:{}\n".format(k))
    print("值:\n{}".format(iris_dataset[k]))
    print("-" * 80 + "\n")

"""
    打印信息
    键:data  data 里面是花萼长度、花萼宽度、花瓣长度、花瓣宽度的测量数据,格式为 NumPy 数组
             数组的每一行对应一朵花,列代表每朵花的四个测量数据:
             print("Shape of data: {}".format(iris_dataset['data'].shape))
             Shape of data: (150, 4)
             可以看出,数组中包含 150 朵不同的花的测量数据,150个样本,4个特征
    
    值:
    [[5.1 3.5 1.4 0.2]
     [4.9 3.  1.4 0.2]
     [4.7 3.2 1.3 0.2]
     [4.6 3.1 1.5 0.2]
     [5.  3.6 1.4 0.2]
     [5.4 3.9 1.7 0.4]
     [4.6 3.4 1.4 0.3]
     [5.  3.4 1.5 0.2]
     [4.4 2.9 1.4 0.2]
     [4.9 3.1 1.5 0.1]
     [5.4 3.7 1.5 0.2]
     [4.8 3.4 1.6 0.2]
     [4.8 3.  1.4 0.1]
     [4.3 3.  1.1 0.1]
     [5.8 4.  1.2 0.2]
     [5.7 4.4 1.5 0.4]
     [5.4 3.9 1.3 0.4]
     [5.1 3.5 1.4 0.3]
     [5.7 3.8 1.7 0.3]
     [5.1 3.8 1.5 0.3]
     [5.4 3.4 1.7 0.2]
     [5.1 3.7 1.5 0.4]
     [4.6 3.6 1.  0.2]
     [5.1 3.3 1.7 0.5]
     [4.8 3.4 1.9 0.2]
     [5.  3.  1.6 0.2]
     [5.  3.4 1.6 0.4]
     [5.2 3.5 1.5 0.2]
     [5.2 3.4 1.4 0.2]
     [4.7 3.2 1.6 0.2]
     [4.8 3.1 1.6 0.2]
     [5.4 3.4 1.5 0.4]
     [5.2 4.1 1.5 0.1]
     [5.5 4.2 1.4 0.2]
     [4.9 3.1 1.5 0.2]
     [5.  3.2 1.2 0.2]
     [5.5 3.5 1.3 0.2]
     [4.9 3.6 1.4 0.1]
     [4.4 3.  1.3 0.2]
     [5.1 3.4 1.5 0.2]
     [5.  3.5 1.3 0.3]
     [4.5 2.3 1.3 0.3]
     [4.4 3.2 1.3 0.2]
     [5.  3.5 1.6 0.6]
     [5.1 3.8 1.9 0.4]
     [4.8 3.  1.4 0.3]
     [5.1 3.8 1.6 0.2]
     [4.6 3.2 1.4 0.2]
     [5.3 3.7 1.5 0.2]
     [5.  3.3 1.4 0.2]
     [7.  3.2 4.7 1.4]
     [6.4 3.2 4.5 1.5]
     [6.9 3.1 4.9 1.5]
     [5.5 2.3 4.  1.3]
     [6.5 2.8 4.6 1.5]
     [5.7 2.8 4.5 1.3]
     [6.3 3.3 4.7 1.6]
     [4.9 2.4 3.3 1. ]
     [6.6 2.9 4.6 1.3]
     [5.2 2.7 3.9 1.4]
     [5.  2.  3.5 1. ]
     [5.9 3.  4.2 1.5]
     [6.  2.2 4.  1. ]
     [6.1 2.9 4.7 1.4]
     [5.6 2.9 3.6 1.3]
     [6.7 3.1 4.4 1.4]
     [5.6 3.  4.5 1.5]
     [5.8 2.7 4.1 1. ]
     [6.2 2.2 4.5 1.5]
     [5.6 2.5 3.9 1.1]
     [5.9 3.2 4.8 1.8]
     [6.1 2.8 4.  1.3]
     [6.3 2.5 4.9 1.5]
     [6.1 2.8 4.7 1.2]
     [6.4 2.9 4.3 1.3]
     [6.6 3.  4.4 1.4]
     [6.8 2.8 4.8 1.4]
     [6.7 3.  5.  1.7]
     [6.  2.9 4.5 1.5]
     [5.7 2.6 3.5 1. ]
     [5.5 2.4 3.8 1.1]
     [5.5 2.4 3.7 1. ]
     [5.8 2.7 3.9 1.2]
     [6.  2.7 5.1 1.6]
     [5.4 3.  4.5 1.5]
     [6.  3.4 4.5 1.6]
     [6.7 3.1 4.7 1.5]
     [6.3 2.3 4.4 1.3]
     [5.6 3.  4.1 1.3]
     [5.5 2.5 4.  1.3]
     [5.5 2.6 4.4 1.2]
     [6.1 3.  4.6 1.4]
     [5.8 2.6 4.  1.2]
     [5.  2.3 3.3 1. ]
     [5.6 2.7 4.2 1.3]
     [5.7 3.  4.2 1.2]
     [5.7 2.9 4.2 1.3]
     [6.2 2.9 4.3 1.3]
     [5.1 2.5 3.  1.1]
     [5.7 2.8 4.1 1.3]
     [6.3 3.3 6.  2.5]
     [5.8 2.7 5.1 1.9]
     [7.1 3.  5.9 2.1]
     [6.3 2.9 5.6 1.8]
     [6.5 3.  5.8 2.2]
     [7.6 3.  6.6 2.1]
     [4.9 2.5 4.5 1.7]
     [7.3 2.9 6.3 1.8]
     [6.7 2.5 5.8 1.8]
     [7.2 3.6 6.1 2.5]
     [6.5 3.2 5.1 2. ]
     [6.4 2.7 5.3 1.9]
     [6.8 3.  5.5 2.1]
     [5.7 2.5 5.  2. ]
     [5.8 2.8 5.1 2.4]
     [6.4 3.2 5.3 2.3]
     [6.5 3.  5.5 1.8]
     [7.7 3.8 6.7 2.2]
     [7.7 2.6 6.9 2.3]
     [6.  2.2 5.  1.5]
     [6.9 3.2 5.7 2.3]
     [5.6 2.8 4.9 2. ]
     [7.7 2.8 6.7 2. ]
     [6.3 2.7 4.9 1.8]
     [6.7 3.3 5.7 2.1]
     [7.2 3.2 6.  1.8]
     [6.2 2.8 4.8 1.8]
     [6.1 3.  4.9 1.8]
     [6.4 2.8 5.6 2.1]
     [7.2 3.  5.8 1.6]
     [7.4 2.8 6.1 1.9]
     [7.9 3.8 6.4 2. ]
     [6.4 2.8 5.6 2.2]
     [6.3 2.8 5.1 1.5]
     [6.1 2.6 5.6 1.4]
     [7.7 3.  6.1 2.3]
     [6.3 3.4 5.6 2.4]
     [6.4 3.1 5.5 1.8]
     [6.  3.  4.8 1.8]
     [6.9 3.1 5.4 2.1]
     [6.7 3.1 5.6 2.4]
     [6.9 3.1 5.1 2.3]
     [5.8 2.7 5.1 1.9]
     [6.8 3.2 5.9 2.3]
     [6.7 3.3 5.7 2.5]
     [6.7 3.  5.2 2.3]
     [6.3 2.5 5.  1.9]
     [6.5 3.  5.2 2. ]
     [6.2 3.4 5.4 2.3]
     [5.9 3.  5.1 1.8]]
    --------------------------------------------------------------------------------
    
    Key: target   target holds the species of each measured flower, also as a NumPy array,
                  with one entry per flower.
                  The species are encoded as integers from 0 to 2: 0 means setosa,
                  1 means versicolor, and 2 means virginica.

    Value:
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
     2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
     2 2]
    --------------------------------------------------------------------------------
    
    Key: target_names   target_names is an array of strings containing the species of flower we want to predict

    Value:
    ['setosa' 'versicolor' 'virginica']
    --------------------------------------------------------------------------------
    
    Key: DESCR   DESCR is a short description of the dataset

    Value:
    .. _iris_dataset:
    
    Iris plants dataset
    --------------------
    
    **Data Set Characteristics:**
    
        :Number of Instances: 150 (50 in each of three classes)
        :Number of Attributes: 4 numeric, predictive attributes and the class
        :Attribute Information:
            - sepal length in cm
            - sepal width in cm
            - petal length in cm
            - petal width in cm
            - class:
                    - Iris-Setosa
                    - Iris-Versicolour
                    - Iris-Virginica
                    
        :Summary Statistics:
    
        ============== ==== ==== ======= ===== ====================
                        Min  Max   Mean    SD   Class Correlation
        ============== ==== ==== ======= ===== ====================
        sepal length:   4.3  7.9   5.84   0.83    0.7826
        sepal width:    2.0  4.4   3.05   0.43   -0.4194
        petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
        petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
        ============== ==== ==== ======= ===== ====================
    
        :Missing Attribute Values: None
        :Class Distribution: 33.3% for each of 3 classes.
        :Creator: R.A. Fisher
        :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
        :Date: July, 1988
    
    The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
    from Fisher's paper. Note that it's the same as in R, but not as in the UCI
    Machine Learning Repository, which has two wrong data points.
    
    This is perhaps the best known database to be found in the
    pattern recognition literature.  Fisher's paper is a classic in the field and
    is referenced frequently to this day.  (See Duda & Hart, for example.)  The
    data set contains 3 classes of 50 instances each, where each class refers to a
    type of iris plant.  One class is linearly separable from the other 2; the
    latter are NOT linearly separable from each other.
    
    .. topic:: References
    
       - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
         Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
         Mathematical Statistics" (John Wiley, NY, 1950).
       - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
         (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
       - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
         Structure and Classification Rule for Recognition in Partially Exposed
         Environments".  IEEE Transactions on Pattern Analysis and Machine
         Intelligence, Vol. PAMI-2, No. 1, 67-71.
       - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
         on Information Theory, May 1972, 431-433.
       - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
         conceptual clustering system finds 3 classes in the data.
       - Many, many more ...
    --------------------------------------------------------------------------------
    
    Key: feature_names   feature_names is a list of strings giving a description of each feature

    Value:
    ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
    --------------------------------------------------------------------------------
    
    Key: filename

    Value:
    F:\apps\interpreters\3.6.4\lib\site-packages\sklearn\datasets\data\iris.csv
    --------------------------------------------------------------------------------

"""

  • 1.7.2 Measuring Success: Training and Testing Data

  We want to build a machine learning model from the 150 iris measurements described above, and the key question is whether the model actually works, i.e. whether we should trust its predictions. We need to know how well the model generalizes: given new data, can it predict correctly? So we need some new data. The usual approach is to split the labeled data we have already collected (the 150 irises of known species) into two parts. One part is used to build the machine learning model and is called the training data or training set. The rest is used to assess how well the model works and is called the test data, test set, or hold-out set.
  To summarize the paragraph above in one sentence: the data used to build the model's predictive ability is the training set; the model then needs new data to check that ability, and this data is the test set. In general, the collected data is split into these two parts: a training set and a test set.
  The train_test_split function in scikit-learn shuffles the dataset and splits it. By default it puts 75% of the rows and their labels into the training set and the remaining 25% of the rows and their labels into the test set. The training/test ratio can be chosen freely, but using 25% of the data as a test set is a good rule of thumb.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# TODO Measuring Success: Training and Testing Data
iris_dataset = load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0
)
"""
    scikit-learn 中的数据通常用大写的 X 表示,而标签用小写的 y 表示。这是受到了数学 标准公式 f(x)=y 的启发,
    其中 x 是函数的输入,y 是输出。我们用大写的 X 是因为数据是 一个二维数组(矩阵),用小写的 y 是因为目标是
    一个一维数组(向量),这也是数学 中的约定。
    
    为了确保多次运行同一函数能够得到相同的输出,我们利用 random_state 参数指定了 随机数生成器的种子。这样函数
    输出就是固定不变的,所以这行代码的输出始终相同。本 书用到随机过程时,都会用这种方法指定 random_state。
    
    train_test_split方法,默认 训练集 : 测试集 = 75% : 25%

"""

print("X_train shape: {}".format(X_train.shape))
print("X_train : \n{}".format(X_train))
print("-" * 80 + "\n")

print("y_train shape: {}".format(y_train.shape))
print("y_train : \n{}".format(y_train))
print("-" * 80 + "\n")


print("X_test shape: {}".format(X_test.shape))
print("X_test : \n{}".format(X_test))
print("-" * 80 + "\n")

print("y_test shape: {}".format(y_test.shape))
print("y_test : \n{}".format(y_test))
print("-" * 80 + "\n")

"""
    Output
    X_train shape: (112, 4)  # measurements for 112 different flowers: 112 samples, 4 features
    X_train : 
    [[5.9 3.  4.2 1.5]
     [5.8 2.6 4.  1.2]
     [6.8 3.  5.5 2.1]
     [4.7 3.2 1.3 0.2]
     [6.9 3.1 5.1 2.3]
     [5.  3.5 1.6 0.6]
     [5.4 3.7 1.5 0.2]
     [5.  2.  3.5 1. ]
     [6.5 3.  5.5 1.8]
     [6.7 3.3 5.7 2.5]
     [6.  2.2 5.  1.5]
     [6.7 2.5 5.8 1.8]
     [5.6 2.5 3.9 1.1]
     [7.7 3.  6.1 2.3]
     [6.3 3.3 4.7 1.6]
     [5.5 2.4 3.8 1.1]
     [6.3 2.7 4.9 1.8]
     [6.3 2.8 5.1 1.5]
     [4.9 2.5 4.5 1.7]
     [6.3 2.5 5.  1.9]
     [7.  3.2 4.7 1.4]
     [6.5 3.  5.2 2. ]
     [6.  3.4 4.5 1.6]
     [4.8 3.1 1.6 0.2]
     [5.8 2.7 5.1 1.9]
     [5.6 2.7 4.2 1.3]
     [5.6 2.9 3.6 1.3]
     [5.5 2.5 4.  1.3]
     [6.1 3.  4.6 1.4]
     [7.2 3.2 6.  1.8]
     [5.3 3.7 1.5 0.2]
     [4.3 3.  1.1 0.1]
     [6.4 2.7 5.3 1.9]
     [5.7 3.  4.2 1.2]
     [5.4 3.4 1.7 0.2]
     [5.7 4.4 1.5 0.4]
     [6.9 3.1 4.9 1.5]
     [4.6 3.1 1.5 0.2]
     [5.9 3.  5.1 1.8]
     [5.1 2.5 3.  1.1]
     [4.6 3.4 1.4 0.3]
     [6.2 2.2 4.5 1.5]
     [7.2 3.6 6.1 2.5]
     [5.7 2.9 4.2 1.3]
     [4.8 3.  1.4 0.1]
     [7.1 3.  5.9 2.1]
     [6.9 3.2 5.7 2.3]
     [6.5 3.  5.8 2.2]
     [6.4 2.8 5.6 2.1]
     [5.1 3.8 1.6 0.2]
     [4.8 3.4 1.6 0.2]
     [6.5 3.2 5.1 2. ]
     [6.7 3.3 5.7 2.1]
     [4.5 2.3 1.3 0.3]
     [6.2 3.4 5.4 2.3]
     [4.9 3.  1.4 0.2]
     [5.7 2.5 5.  2. ]
     [6.9 3.1 5.4 2.1]
     [4.4 3.2 1.3 0.2]
     [5.  3.6 1.4 0.2]
     [7.2 3.  5.8 1.6]
     [5.1 3.5 1.4 0.3]
     [4.4 3.  1.3 0.2]
     [5.4 3.9 1.7 0.4]
     [5.5 2.3 4.  1.3]
     [6.8 3.2 5.9 2.3]
     [7.6 3.  6.6 2.1]
     [5.1 3.5 1.4 0.2]
     [4.9 3.1 1.5 0.2]
     [5.2 3.4 1.4 0.2]
     [5.7 2.8 4.5 1.3]
     [6.6 3.  4.4 1.4]
     [5.  3.2 1.2 0.2]
     [5.1 3.3 1.7 0.5]
     [6.4 2.9 4.3 1.3]
     [5.4 3.4 1.5 0.4]
     [7.7 2.6 6.9 2.3]
     [4.9 2.4 3.3 1. ]
     [7.9 3.8 6.4 2. ]
     [6.7 3.1 4.4 1.4]
     [5.2 4.1 1.5 0.1]
     [6.  3.  4.8 1.8]
     [5.8 4.  1.2 0.2]
     [7.7 2.8 6.7 2. ]
     [5.1 3.8 1.5 0.3]
     [4.7 3.2 1.6 0.2]
     [7.4 2.8 6.1 1.9]
     [5.  3.3 1.4 0.2]
     [6.3 3.4 5.6 2.4]
     [5.7 2.8 4.1 1.3]
     [5.8 2.7 3.9 1.2]
     [5.7 2.6 3.5 1. ]
     [6.4 3.2 5.3 2.3]
     [6.7 3.  5.2 2.3]
     [6.3 2.5 4.9 1.5]
     [6.7 3.  5.  1.7]
     [5.  3.  1.6 0.2]
     [5.5 2.4 3.7 1. ]
     [6.7 3.1 5.6 2.4]
     [5.8 2.7 5.1 1.9]
     [5.1 3.4 1.5 0.2]
     [6.6 2.9 4.6 1.3]
     [5.6 3.  4.1 1.3]
     [5.9 3.2 4.8 1.8]
     [6.3 2.3 4.4 1.3]
     [5.5 3.5 1.3 0.2]
     [5.1 3.7 1.5 0.4]
     [4.9 3.1 1.5 0.1]
     [6.3 2.9 5.6 1.8]
     [5.8 2.7 4.1 1. ]
     [7.7 3.8 6.7 2.2]
     [4.6 3.2 1.4 0.2]]
    --------------------------------------------------------------------------------
    
    y_train shape: (112,)
    y_train : 
    [1 1 2 0 2 0 0 1 2 2 2 2 1 2 1 1 2 2 2 2 1 2 1 0 2 1 1 1 1 2 0 0 2 1 0 0 1
     0 2 1 0 1 2 1 0 2 2 2 2 0 0 2 2 0 2 0 2 2 0 0 2 0 0 0 1 2 2 0 0 0 1 1 0 0
     1 0 2 1 2 1 0 2 0 2 0 0 2 0 2 1 1 1 2 2 1 1 0 1 2 2 0 1 1 1 1 0 0 0 2 1 2
     0]
    --------------------------------------------------------------------------------
    
    X_test shape: (38, 4)
    X_test : 
    [[5.8 2.8 5.1 2.4]
     [6.  2.2 4.  1. ]
     [5.5 4.2 1.4 0.2]
     [7.3 2.9 6.3 1.8]
     [5.  3.4 1.5 0.2]
     [6.3 3.3 6.  2.5]
     [5.  3.5 1.3 0.3]
     [6.7 3.1 4.7 1.5]
     [6.8 2.8 4.8 1.4]
     [6.1 2.8 4.  1.3]
     [6.1 2.6 5.6 1.4]
     [6.4 3.2 4.5 1.5]
     [6.1 2.8 4.7 1.2]
     [6.5 2.8 4.6 1.5]
     [6.1 2.9 4.7 1.4]
     [4.9 3.6 1.4 0.1]
     [6.  2.9 4.5 1.5]
     [5.5 2.6 4.4 1.2]
     [4.8 3.  1.4 0.3]
     [5.4 3.9 1.3 0.4]
     [5.6 2.8 4.9 2. ]
     [5.6 3.  4.5 1.5]
     [4.8 3.4 1.9 0.2]
     [4.4 2.9 1.4 0.2]
     [6.2 2.8 4.8 1.8]
     [4.6 3.6 1.  0.2]
     [5.1 3.8 1.9 0.4]
     [6.2 2.9 4.3 1.3]
     [5.  2.3 3.3 1. ]
     [5.  3.4 1.6 0.4]
     [6.4 3.1 5.5 1.8]
     [5.4 3.  4.5 1.5]
     [5.2 3.5 1.5 0.2]
     [6.1 3.  4.9 1.8]
     [6.4 2.8 5.6 2.2]
     [5.2 2.7 3.9 1.4]
     [5.7 3.8 1.7 0.3]
     [6.  2.7 5.1 1.6]]
    --------------------------------------------------------------------------------
    
    y_test shape: (38,)
    y_test : 
    [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
     1]
    --------------------------------------------------------------------------------
"""

  • 1.7.3 First Things First: Look at Your Data

  One of the best ways to inspect data is to visualize it. One way to do this is with a scatter plot,
  coloring the data points by iris species. To draw the plot, we first convert the NumPy array into a pandas DataFrame. pandas has a function that draws a scatter-plot matrix, called scatter_matrix. The diagonal of this matrix is filled with histograms of each feature. (I did not fully understand this part at first.)

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import mglearn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# TODO Look at Your Data

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0
)
# Create a DataFrame from the data in X_train
# Label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o', hist_kwds={'bins': 20}, s=60,
                                 alpha=.8, cmap=mglearn.cm3)

plt.show()

Look at the data
  The figure above is the scatter-plot matrix of the training-set features (sepal length, sepal width, petal length, petal width), with the data points colored by iris species. It was drawn by converting the NumPy array into a pandas DataFrame and calling pandas' scatter_matrix function; the diagonal shows a histogram of each feature.
  Reading the figure:
  Each diagonal histogram counts samples per feature value (x-axis: the feature, y-axis: number of samples). Every other panel is a scatter plot of one pair of features, with point colors marking the iris species. There are 12 such panels: 6 above the diagonal and 6 below it, where each lower panel is the matching upper panel with the x and y axes swapped. For example, the bottom-left and top-right panels differ only by swapped axes, which is why they look so alike: they carry the same information.
  This leads to the following conclusion:
  The plots show that the three classes seem to be fairly well separated using the sepal and petal measurements. This means that a machine learning model will likely be able to learn to separate them.
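The separability visible in the plots can also be sanity-checked numerically: the per-class range of a single feature already shows how cleanly setosa splits off from the other two species. A small sketch using petal length (the third column of the data):

```python
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]  # third feature: petal length (cm)

# Print the per-class min/max of petal length
for label, name in enumerate(iris.target_names):
    values = petal_length[iris.target == label]
    print("{}: {:.1f} to {:.1f} cm".format(name, values.min(), values.max()))

# setosa (max 1.9) never overlaps versicolor (min 3.0), so a threshold on
# petal length alone already separates setosa from the other two classes
assert petal_length[iris.target == 0].max() < petal_length[iris.target == 1].min()
```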

Note

When running the code in PyCharm, you must add

plt.show()

or the figure will not appear in SciView. Additionally, the PyCharm run window may print messages like the following:

F:\apps\interpreters\3.6.4\lib\site-packages\sklearn\externals\six.py:31: DeprecationWarning: The module is deprecated 
    in version 0.21 and will be removed in version 0.23 since we've dropped support for Python 2.7. Please rely on the 
    official version of six (https://pypi.org/project/six/).
      "(https://pypi.org/project/six/).", DeprecationWarning)
F:\apps\interpreters\3.6.4\lib\site-packages\sklearn\externals\joblib\__init__.py:15: DeprecationWarning: 
    sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly 
    from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you
     may need to re-serialize those models with scikit-learn 0.21+.
      warnings.warn(msg, category=DeprecationWarning)

These messages only warn that some modules used by the code may no longer be maintained in the future; they can safely be ignored.
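If the warnings clutter the output too much, Python's standard warnings module can filter that category out. A minimal sketch (note that this also hides any future deprecation notices, so use it sparingly):

```python
import warnings

# Ignore DeprecationWarning for the rest of the process; place this before
# the sklearn imports so that import-time warnings are filtered as well
warnings.filterwarnings("ignore", category=DeprecationWarning)
```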

References

Link

《Python机器学习基础教程》 Study Notes (3)
