Machine learning: a first application: classifying iris species
Meet the data
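The data is included in scikit-learn and can be loaded with the load_iris function. The returned iris_dataset object behaves like a dictionary, with keys and values: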
from sklearn.datasets import load_iris
iris_dataset = load_iris()
print("Keys of iris_dataset: {}".format(iris_dataset.keys()))
Keys of iris_dataset: dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
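The value of the DESCR key is a short description of the dataset. We show the beginning of the description here (you can look at the rest yourself):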
print(iris_dataset['DESCR'][:193] + "\n...")
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, pre
...
The target_names key holds an array of strings with the names of the species we want to predict:
print("Target names: {}".format(iris_dataset['target_names']))
Target names: ['setosa' 'versicolor' 'virginica']
print("Feature names: {}".format(iris_dataset['feature_names']))
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print("Type of data: {}".format(type(iris_dataset['data'])))
Type of data: <class 'numpy.ndarray'>
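We can see that the array contains measurements for 150 different flowers. Recall that in machine learning the individual items are called samples and their properties are called features, so the shape of the data array is the number of samples by the number of features: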
print("Shape of data: {}".format(iris_dataset['data'].shape))
Shape of data: (150, 4)
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))
First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
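From this data, we can see that all of the first five flowers have a petal width of 0.2 cm, and that the first flower has the longest sepal of the five, at 5.1 cm.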
print("Type of target: {}".format(type(iris_dataset['target'])))
Type of target: <class 'numpy.ndarray'>
print("Shape of target: {}".format(iris_dataset['target'].shape))
Shape of target: (150,)
The species are encoded as integers from 0 to 2:
print("Target:\n{}".format(iris_dataset['target']))
Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
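To make the integer encoding concrete, here is a small sketch that maps the labels back to species names and counts the samples per class. It relies only on the target and target_names keys shown above; the variable names counts and species are ours, chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']

# Count how many samples carry each integer label (0, 1, 2).
counts = np.bincount(target)

# Indexing the array of names with the integer labels recovers one
# species string per sample.
species = iris_dataset['target_names'][target]

for label, name in enumerate(iris_dataset['target_names']):
    print("{} -> {} ({} samples)".format(label, name, counts[label]))
```

The counts should match the "50 in each of three classes" line in the dataset description.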
Measuring Success: Training and testing data
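train_test_split shuffles the data with a pseudorandom number generator before splitting it. Passing a fixed seed through the random_state parameter makes the split reproducible: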
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
X_train shape: (112, 4)
y_train shape: (112,)
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))
X_test shape: (38, 4)
y_test shape: (38,)
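The following sketch illustrates two properties of the split: the size of the held-out portion, and why shuffling matters for this dataset. It assumes the default behavior of train_test_split (25% test size when no size is given) and uses its shuffle parameter; the names ending in _again and _unshuffled are ours.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# With no test_size given, train_test_split holds out 25% of the rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("Test fraction: {:.2f}".format(X_test.shape[0] / X.shape[0]))

# Repeating the call with the same random_state reproduces the same split.
X_train_again, _, y_train_again, _ = train_test_split(X, y, random_state=0)
same_split = np.array_equal(X_train, X_train_again)
print("Same split with the same seed: {}".format(same_split))

# The labels are sorted by class, so without shuffling the test set
# (the last rows of the data) contains a single class only.
_, _, _, y_test_unshuffled = train_test_split(X, y, shuffle=False)
print("Classes in unshuffled test set: {}".format(np.unique(y_test_unshuffled)))
```

A test set with only one class would tell us nothing about how well the model handles the other two, which is why the data is shuffled before splitting.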
First things first: Look at your data
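import pandas as pd
import mglearn  # helper package that provides the cm3 colormap used below
# Label the columns of X_train with the feature names, then plot a scatter matrix colored by y_train
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)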
[Figure: scatter matrix of the four training-set features, 4 x 4 panels, points colored by class label, histograms on the diagonal]
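From the plots, we can see that the three classes are fairly well separated by the sepal and petal measurements. This suggests that a machine learning model will likely be able to learn to separate them.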
iris_dataframe.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.9               3.0                4.2               1.5
1                5.8               2.6                4.0               1.2
2                6.8               3.0                5.5               2.1
3                4.7               3.2                1.3               0.2
4                6.9               3.1                5.1               2.3
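The separation between classes can also be checked numerically, without a plot. The sketch below summarizes one feature, petal length, per class in the training set. The choice of this feature and the names setosa_max and others_min are ours, for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# Column 2 holds 'petal length (cm)' (see feature_names).
petal_length = X_train[:, 2]
for label, name in enumerate(iris_dataset['target_names']):
    values = petal_length[y_train == label]
    print("{:<10} petal length: min={:.1f}, mean={:.2f}, max={:.1f}".format(
        name, values.min(), values.mean(), values.max()))

# Compare the longest setosa petal with the shortest petal of the other classes.
setosa_max = petal_length[y_train == 0].max()
others_min = petal_length[y_train != 0].min()
print("setosa max: {}, others min: {}".format(setosa_max, others_min))
```

If the longest setosa petal is shorter than the shortest petal of the other two species, a single threshold on this feature already isolates setosa.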
Building your first model: k nearest neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
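To show what the classifier does with n_neighbors=1, here is a rough sketch of the same idea in plain NumPy: find the training point closest to a query point and return its label. The helper predict_1nn is hypothetical and written for illustration; it is not how scikit-learn implements the search. It uses Euclidean distance, which matches the metric='minkowski', p=2 settings in the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)


def predict_1nn(X_train, y_train, x):
    """Hypothetical helper: label of the training point closest to x."""
    # Euclidean distance from x to every training sample.
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    return y_train[np.argmin(distances)]


# Classify every test point by hand and with scikit-learn.
manual = np.array([predict_1nn(X_train, y_train, x) for x in X_test])
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
agreement = np.mean(manual == knn.predict(X_test))
print("Agreement with KNeighborsClassifier: {:.2f}".format(agreement))
```

The two sets of predictions should agree on all or nearly all test points. They can differ only where two training points of different classes are equally close to a test point, because the tie may be broken differently.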
Making predictions
The shape of X_new is the number of samples (1) by the number of features (4):
import numpy as np
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))
X_new.shape: (1, 4)
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
    iris_dataset['target_names'][prediction]))
Prediction: [0]
Predicted target name: ['setosa']
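To see where this prediction comes from, we can ask the fitted model which training sample is closest to the new flower. The sketch below uses the kneighbors method of KNeighborsClassifier, which returns the distances to the nearest training samples and their indices; the name nearest_index is ours.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

X_new = np.array([[5, 2.9, 1, 0.2]])

# Both returned arrays have shape (n_queries, n_neighbors), here (1, 1).
distances, indices = knn.kneighbors(X_new)
nearest_index = indices[0, 0]

print("Nearest training sample: {}".format(X_train[nearest_index]))
print("Distance to it: {:.3f}".format(distances[0, 0]))
print("Its species: {}".format(
    iris_dataset['target_names'][y_train[nearest_index]]))
```

With a single neighbor, the predicted label is the label of that one training sample.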
Evaluating the model
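We can now predict a label for each sample in X_test and compare the predictions with the known labels in y_test.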
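A minimal sketch of the evaluation step, assuming the same split and model as in the earlier sections: compute the accuracy, the fraction of test flowers for which the right species was predicted, once by hand and once with the score method of the classifier. The names y_pred, accuracy and score are ours.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Predict a label for every held-out sample.
y_pred = knn.predict(X_test)
print("Test set predictions:\n{}".format(y_pred))

# Accuracy by hand: fraction of predictions that match the true labels.
accuracy = np.mean(y_pred == y_test)
print("Test set accuracy: {:.2f}".format(accuracy))

# The score method of a classifier computes the same mean accuracy.
score = knn.score(X_test, y_test)
print("Test set score: {:.2f}".format(score))
```

Because the test samples were not used to fit the model, this number estimates how well the model should do on new flowers.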