Machine Learning: Iris Classification with kNN

炼丹师666
Column: Machine Learning Cases (机器学习案例)
Article link: https://blog.csdn.net/wj1298250240/article/details/103404057
Machine learning: a first application, classifying iris species

Meet the data
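# 1.7.1 Meet the data
# In this example we use the Iris dataset, a classic dataset in machine learning and statistics. It is
# included in the datasets module of scikit-learn. We can load it by calling the load_iris function: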
from sklearn.datasets import load_iris
iris_dataset = load_iris()
print("Keys of iris_dataset: {}".format(iris_dataset.keys()))
Keys of iris_dataset: dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
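The object returned by load_iris is a scikit-learn Bunch, which behaves like a dictionary but also exposes its keys as attributes. A minimal sketch of the two access styles (the variable name `iris` is chosen here for illustration and does not come from the original text):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Dictionary-style and attribute-style access return the same underlying array.
data_by_key = iris['data']
data_by_attr = iris.data
print(data_by_key is data_by_attr)

# The keys can be listed like those of an ordinary dict.
print(sorted(iris.keys()))
```

Either style works; the rest of this post mostly uses the dictionary form.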
# The value of the DESCR key is a short description of the dataset. We show the beginning of the
# description here (you can look up the rest yourself):
print(iris_dataset['DESCR'][:193] + "\n...")
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, pre
...
# The value of the target_names key is an array of strings containing the species of flower we want to predict:
print("Target names: {}".format(iris_dataset['target_names']))
Target names: ['setosa' 'versicolor' 'virginica']
print("Feature names: {}".format(iris_dataset['feature_names']))
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print("Type of data: {}".format(type(iris_dataset['data'])))
Type of data: <class 'numpy.ndarray'>
# We can see that the array contains measurements for 150 different flowers. As noted earlier, the individual
# items in machine learning are called samples, and their properties are called features. The shape of the data
# array is the number of samples times the number of features. This is a convention in scikit-learn, and your
# data should always follow it. Here are the feature values for the first five samples:
print("Shape of data: {}".format(iris_dataset['data'].shape))
Shape of data: (150, 4)
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))
First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
# From this data we can see that all of the first five flowers have a petal width of 0.2 cm, and that the first
# flower has the longest sepal, at 5.1 cm.
# The target array contains the species of each measured flower, also as a NumPy array:
print("Type of target: {}".format(type(iris_dataset['target'])))
Type of target: <class 'numpy.ndarray'>
print("Shape of target: {}".format(iris_dataset['target'].shape))
Shape of target: (150,)
# The species are encoded as integers from 0 to 2:
print("Target:\n{}".format(iris_dataset['target']))
Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
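To make the integer encoding concrete, here is a small sketch that counts the samples per label and maps the labels back to species names by indexing target_names. The variable names `iris`, `counts`, and `species` are illustrative choices, not part of the original post.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many samples carry each integer label (0, 1, 2).
counts = np.bincount(iris.target)

# Indexing target_names with the integer labels turns them back into species names.
species = iris.target_names[iris.target]

for name, count in zip(iris.target_names, counts):
    print("{}: {} samples".format(name, count))
print(species[:3], species[-3:])
```

Note that the labels are sorted by class in the raw array, which is one reason the data should be shuffled before it is split.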
Measuring Success: Training and testing data
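# The train_test_split function in scikit-learn shuffles the dataset and splits it. It takes 75% of the
# rows, together with the corresponding labels, as the training set, and the remaining 25% of the data and labels
# as the test set. The ratio between training and test set is somewhat arbitrary, but using 25% of the data
# as the test set is a good rule of thumb.
# In scikit-learn, data is usually denoted with a capital X, while labels are denoted with a lowercase y. This is
# inspired by the standard mathematical formulation f(x)=y, where x is the input to a function and y is the output.
# We use a capital X because the data is a two-dimensional array (a matrix), and a lowercase y because the target
# is a one-dimensional array (a vector), which also follows mathematical convention.
# Call train_test_split on the data and assign the outputs using this naming scheme: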
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
X_train shape: (112, 4)
y_train shape: (112,)
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))
X_test shape: (38, 4)
y_test shape: (38,)
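The 75/25 split above relies on the function's defaults. The sketch below spells the test fraction out with test_size and additionally passes stratify, an optional argument that is not used in the original post, to keep the class proportions similar in both parts. Treat it as one possible variation rather than the method used in this post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# test_size=0.25 matches the default used when no size is given;
# stratify=... is an optional extra that keeps the class proportions
# roughly equal in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0,
    stratify=iris.target)

print(X_train.shape, X_test.shape)
print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```

Fixing random_state makes the shuffle reproducible, so repeated runs give the same split.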
First things first: Look at your data
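# 1.7.3 First things first: look at your data
# Before building a machine learning model, it is usually a good idea to inspect the data, to see whether the
# task can be solved easily without machine learning, or whether the required information is in the data at all.
# Inspecting the data is also a good way to find abnormalities and peculiarities. For example, some of the irises
# may have been measured in inches rather than centimeters. In the real world, inconsistent data and unexpected
# measurements are common.
# One of the best ways to inspect data is to visualize it. One way to do this is a scatter plot.
# A scatter plot puts one feature on the x-axis and another on the y-axis, and draws a dot for each data point.
# Unfortunately, computer screens have only two dimensions, so we can plot only two (or perhaps
# three) features at a time. It is difficult to plot datasets with more than three features this way. One way
# around this is a pair plot, which looks at all possible pairs of features. With a small number of features,
# such as the four we have here, this is quite reasonable. Keep in mind, however, that a pair plot does not show
# the interaction of all features at once, so this visualization may not reveal some interesting aspects of the data.
# Figure 1-3 is a pair plot of the features in the training set. The data points are colored according to the
# species of the iris. To create the plot, we first convert the NumPy array into a pandas DataFrame. pandas has
# a function for creating pair plots called scatter_matrix. The diagonal of the matrix shows a histogram of each feature: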
import pandas as pd
import mglearn  # helper package that accompanies the book; provides the cm3 colormap
# create a DataFrame from the data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatter matrix from the DataFrame, colored by y_train
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)
[Figure 1-3: scatter matrix (pair plot) of the Iris training set, colored by class label]

# The plot shows that the three classes are fairly well separated by the petal and sepal measurements. This
# suggests that a machine learning model will likely be able to learn to tell them apart.
iris_dataframe.head()
sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.9	3.0	4.2	1.5
1	5.8	2.6	4.0	1.2
2	6.8	3.0	5.5	2.1
3	4.7	3.2	1.3	0.2
4	6.9	3.1	5.1	2.3
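A numeric summary can complement the visual check. The sketch below rebuilds the training split, prints the minimum and maximum of each feature (values far outside the usual range would hint at unit mix-ups or outliers), and shows the petal-length range per species. The names `df` and `petal_range` are illustrative and do not appear in the original post.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

df = pd.DataFrame(X_train, columns=iris.feature_names)
df['species'] = iris.target_names[y_train]

# Min/max per numeric feature: a quick sanity check on units and outliers.
print(df.describe().loc[['min', 'max']])

# Range of petal length within each species.
petal_range = df.groupby('species')['petal length (cm)'].agg(['min', 'max'])
print(petal_range)
```

If the setosa range does not overlap the other two, a single threshold on petal length already isolates that class, which matches what the pair plot suggests.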
Building your first model: k nearest neighbors
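# 1.7.4 Building your first model: k-nearest neighbors
# Now we can start building the actual machine learning model. scikit-learn offers many classification algorithms.
# The k in k-nearest neighbors means that, instead of using only the closest neighbor of a new data point, we
# can consider any fixed number k of nearest neighbors in the training set (for example, the closest three or
# five). We then make a prediction using the majority class among these neighbors. Chapter 2 covers the algorithm
# in more detail; for now we use only a single neighbor.
# All machine learning models in scikit-learn are implemented in their own classes, which are called Estimator
# classes. The k-nearest neighbors classification algorithm is implemented in the KNeighborsClassifier class in
# the neighbors module. We need to instantiate the class into an object before we can use the model. This is
# when we set the model's parameters.
# The most important parameter of KNeighborsClassifier is the number of neighbors, which we set to 1:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

# To build the model on the training set, we call the fit method of the knn object, which takes X_train and
# y_train as arguments. Both are NumPy arrays: the former holds the training data, the latter the
# corresponding training labels: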
knn.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
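To illustrate the idea behind the classifier, here is a from-scratch sketch of k-nearest neighbors with Euclidean distance and majority voting. The function `knn_predict` is a hypothetical helper written for this illustration. It is not how scikit-learn implements KNeighborsClassifier, which uses optimized neighbor searches internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)


def knn_predict(X_train, y_train, X_query, k=1):
    """Predict a label for each row of X_query by majority vote among
    the k training points with the smallest Euclidean distance."""
    predictions = []
    for x in X_query:
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(distances)[:k]      # indices of the k closest points
        votes = np.bincount(y_train[nearest])    # votes per class label
        predictions.append(np.argmax(votes))     # most frequent label wins
    return np.array(predictions)


# A hypothetical query flower: sepal 5 x 2.9 cm, petal 1 x 0.2 cm.
X_query = np.array([[5, 2.9, 1, 0.2]])
print(knn_predict(X_train, y_train, X_query, k=1))
```

With k=1 the vote reduces to copying the label of the single closest training sample, which is the setting used in this post.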
Making predictions
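# 1.7.5 Making predictions
# We can now use this model to make predictions on new data for which we might not know the correct labels.
# Imagine we found an iris in the wild with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal length
# of 1 cm, and a petal width of 0.2 cm. What species of iris is it? We can put this data into a NumPy array
# and again check its shape, which is the number of samples (1) times the number of features (4):
# Note that we made the measurements of this single flower into a row of a two-dimensional NumPy array,
# because scikit-learn always expects two-dimensional arrays for the data.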

import numpy as np
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))
X_new.shape: (1, 4)
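If the measurements arrive as a flat, one-dimensional array, they can be turned into the required single-row 2D shape with reshape. A short sketch (the names `flower` and `X_single` are illustrative):

```python
import numpy as np

# A single flower as a flat, one-dimensional array of four measurements.
flower = np.array([5, 2.9, 1, 0.2])
print(flower.shape)

# reshape(1, -1) turns it into one row with as many columns as needed.
X_single = flower.reshape(1, -1)
print(X_single.shape)
```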
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
       iris_dataset['target_names'][prediction]))
Prediction: [0]
Predicted target name: ['setosa']
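To see where a prediction comes from, the fitted classifier can be asked which training sample it considers closest. The sketch below refits the same model and uses the kneighbors method, then looks up that neighbor's measurements and label. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

X_new = np.array([[5, 2.9, 1, 0.2]])

# kneighbors returns the distances to, and the training-set indices of,
# the nearest neighbors of each query row.
distances, indices = knn.kneighbors(X_new)
nearest_index = indices[0, 0]

print("Distance to nearest training sample:", distances[0, 0])
print("Its measurements:", X_train[nearest_index])
print("Its species:", iris.target_names[y_train[nearest_index]])
```

With a single neighbor, the predicted class is simply the label of this one training sample.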
Evaluating the model
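# 1.7.6 Evaluating the model
# This is where the test set that we created earlier comes in. This data was not used to build the model, but
# we do know the correct species of each iris in the test set.
# We can therefore make a prediction for each iris in the test data and compare it against its label (the
# known species). We can do so by computing the accuracy...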
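One common way to do this comparison is accuracy, the fraction of test flowers whose species is predicted correctly. The sketch below rebuilds the split and the model from the earlier sections so that it runs on its own, then computes accuracy both by hand and with the classifier's score method. The variable names `accuracy_manual` and `accuracy_score_method` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Predict a species for every flower in the held-out test set.
y_pred = knn.predict(X_test)

# Accuracy: fraction of test flowers whose prediction matches the known label.
accuracy_manual = np.mean(y_pred == y_test)

# The score method of the classifier computes the same quantity.
accuracy_score_method = knn.score(X_test, y_test)

print("Test set accuracy (manual): {:.2f}".format(accuracy_manual))
print("Test set accuracy (score):  {:.2f}".format(accuracy_score_method))
```

Because the test set was never shown to fit, this number estimates how the model would perform on new flowers.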