Reading Notes on "Hands-On Machine Learning with Scikit-Learn & TensorFlow", Chapter 3: Classification

justry24

Article link: https://blog.csdn.net/justry24/article/details/80538554

Chapter 3: Classification

Classification

MNIST

In this chapter we use the MNIST dataset, a set of 70,000 small images of handwritten digits, written by US high school students and employees of the US Census Bureau.

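Scikit-Learn provides many helper functions to download popular datasets, and MNIST is one of them. The following code fetches MNIST. Note that fetch_mldata was deprecated in Scikit-Learn 0.20 and removed in 0.22; recent versions offer fetch_openml('mnist_784') instead.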

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')

This may fail with an error such as:
File "streams.pyx", line 181, in scipy.io.matlab.streams.FileStream.read_string (scipy\io\matlab\streams.c:2711)
IOError: could not read bytes
Fix:
Looks like the cached data are corrupted. Try removing them and download again (it takes a moment). If not specified differently the data for ‘MINST original’ should be in

>>>mnist
{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([ 0.,  0.,  0., ...,  9.,  9.,  9.])}

If the download still fails, fetch the file from a mirror instead:

from sklearn.datasets import fetch_mldata
try:
    mnist = fetch_mldata('MNIST original')
except Exception as ex:        
    from six.moves import urllib
    from scipy.io import loadmat
    import os

    mnist_path = os.path.join(".", "mnist-original.mat")

    # download dataset from github.
    mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
    response = urllib.request.urlopen(mnist_alternative_url)
    with open(mnist_path, "wb") as f:
        content = response.read()
        f.write(content)

    mnist_raw = loadmat(mnist_path)
    mnist = {
        "data": mnist_raw["data"].T,
        "target": mnist_raw["label"][0],
        "COL_NAMES": ["label", "data"],
        "DESCR": "mldata.org dataset: mnist-original",
    }
    print("Done!")

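View a sample instance

%matplotlib inline 
import matplotlib 
import matplotlib.pyplot as plt

# X holds the 70,000 images (784 pixels each), y holds the labels.
X, y = mnist["data"], mnist["target"]
some_digit = X[36000]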
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image,cmap = matplotlib.cm.binary,interpolation="nearest") 
plt.axis("off") 
plt.show()

View a batch of sample digits

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = matplotlib.cm.binary, **options)
    plt.axis("off")

import numpy as np
plt.figure(figsize=(9,9))
example_images = np.r_[X[:12000:600], X[13000:30600:600], X[30600:60000:590]]
plot_digits(example_images, images_per_row=10)
plt.show()

# Split into training and test sets
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# Shuffle the training set
import numpy as np
shuffle_index = np.random.permutation(60000) 
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

Training a Binary Classifier

Split the labels into two classes, 5 and not-5:

y_train_5 = (y_train == 5) 
y_test_5 = (y_test == 5)

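Use a Stochastic Gradient Descent (SGD) classifier. One advantage of this classifier is that it handles very large datasets efficiently. This is partly because SGD deals with training instances independently, one at a time, which also makes it well suited to online learning. SGDClassifier relies on randomness during training (hence the name "stochastic"). If you want reproducible results, fix the random_state parameter.

from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42) 
sgd_clf.fit(X_train, y_train_5)
# Changing random_state may change the predictions.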

Measuring Accuracy Using Cross-Validation

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

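Implementing cross_val_score() by hand:

from sklearn.model_selection import StratifiedKFold 
from sklearn.base import clone

# random_state only takes effect together with shuffle=True (recent
# Scikit-Learn versions raise an error otherwise); X_train is already shuffled.
skfolds = StratifiedKFold(n_splits=3)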
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf) 
    X_train_folds = X_train[train_index] 
    y_train_folds = (y_train_5[train_index]) 
    X_test_fold = X_train[test_index] 
    y_test_fold = (y_train_5[test_index])

    clone_clf.fit(X_train_folds, y_train_folds) 
    y_pred = clone_clf.predict(X_test_fold) 
    n_correct = sum(y_pred == y_test_fold) 
    print(n_correct / len(y_pred))

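Result:
array([ 0.9502 , 0.96565, 0.96495])

Although the accuracy is about 96%, this does not yet prove the classifier is good. Only about 10% of the images are 5s, so if you always guess that an image is not a 5, you will be right about 90% of the time.

This shows why accuracy is generally not a good performance measure for classifiers, especially on skewed datasets, where some classes are much more frequent than others.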
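To make the point concrete, here is a minimal sketch of a "classifier" that always answers "not 5". The label list is a made-up stand-in with 10% positives (an assumption for illustration, not the real MNIST labels).

```python
# Hypothetical labels: 10% of the instances are 5s (True), 90% are not.
labels = [True] * 100 + [False] * 900

# A "classifier" that never predicts the positive class.
predictions = [False] * len(labels)

correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)  # 0.9
```

The 90% accuracy here says nothing about the ability to find 5s, since not a single one is detected.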

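Confusion Matrix

A much better way to evaluate a classifier is to look at the confusion matrix. The general idea is to count how many times instances of class A are classified as class B. For example, to know how many times the classifier confused images of 5s with 3s, look at the row for class 5 and the column for class 3 of the confusion matrix.

To compute the confusion matrix, you first need a set of predictions that can be compared with the actual targets. You could make predictions on the test set, but let us keep it untouched for now (remember that you should use the test set only at the very end of a project, once you have a classifier you are ready to launch). Instead, use the cross_val_predict() function: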

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

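Just like cross_val_score(), cross_val_predict() performs K-fold cross-validation. Instead of returning evaluation scores, it returns the predictions made on each test fold. This means you get a clean prediction for every instance in the training set ("clean" meaning that the prediction is made by a model that never saw that instance during training).

Now call the confusion_matrix() function, passing it the target classes (y_train_5) and the predicted classes (y_train_pred).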

>>> from sklearn.metrics import confusion_matrix
>>> confusion_matrix(y_train_5, y_train_pred)
array([[53272, 1307],
        [ 1077, 4344]])
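Each row of the matrix is an actual class and each column a predicted class, so the output above reads as 53272 true negatives, 1307 false positives, 1077 false negatives and 4344 true positives. The sketch below rebuilds that layout by hand on a small toy example; the helper name binary_confusion_matrix and the data are invented for illustration.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for two equal-length lists of booleans."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        # Row index = actual class, column index = predicted class.
        matrix[int(actual)][int(predicted)] += 1
    return matrix

# Toy data, invented for illustration only.
y_true = [False, False, False, True, True, True, True, False]
y_pred = [False, True, False, True, False, True, True, False]

cm = binary_confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = cm
print(cm)  # [[3, 1], [1, 3]]
```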

Precision and Recall

precision = TP / (TP + FP)
recall = TP / (TP + FN)

Scikit-Learn provides several functions to compute classifier metrics, including precision and recall.

>>> from sklearn.metrics import precision_score, recall_score
>>> precision_score(y_train_5, y_train_pred) # == 4344 / (4344 + 1307)
0.76871350203503808

>>> recall_score(y_train_5, y_train_pred) # == 4344 / (4344 + 1077)
0.79136690647482011

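Now the "5-detector" does not look as good as it did when you looked at its accuracy. When it claims that an image is a 5, it is right only 77% of the time. Moreover, it detects only 79% of the 5s.

It is often convenient to combine precision and recall into a single metric called the F1 score, in particular when you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall. Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, a classifier gets a high F1 score only if both its recall and its precision are high.

To compute it, call f1_score():

>>> from sklearn.metrics import f1_score
>>> f1_score(y_train_5, y_train_pred)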
0.78468208092485547
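As a worked check, the three metrics can be recomputed from the counts in the confusion matrix shown earlier. This is only a sketch of the arithmetic, not Scikit-Learn's implementation. Note that 4344 / (4344 + 1077) evaluates to about 0.8013, slightly different from the recall of 0.7914 printed above, so that printed value may come from a different run than the matrix.

```python
# Counts taken from the confusion matrix shown earlier in these notes.
tp, fp, fn = 4344, 1307, 1077

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# For comparison: the arithmetic mean treats both values equally.
arithmetic_mean = (precision + recall) / 2

print(round(precision, 4), round(recall, 4), round(f1, 4))

# With very unequal values the harmonic mean stays close to the smaller one.
lopsided_f1 = 2 * 1.0 * 0.1 / (1.0 + 0.1)
print(round(lopsided_f1, 4))
```

The F1 value obtained this way agrees with the f1_score() output above.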

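The F1 score favors classifiers that have similar precision and recall.

In some contexts you mostly care about precision, and in others you really care about recall. For example, if you train a classifier to detect videos that are safe for kids, you would probably prefer one that rejects many good videos but keeps only good ones (high precision), rather than one with high recall that lets bad videos slip through (in which case you may want to add a human review step to check the videos the classifier selects). On the other hand, suppose you train a classifier to detect thieves in surveillance images: a classifier with 30% precision and 99% recall may be perfectly fine (the guards will get some false alerts, but almost all thieves will be caught). In such cases you need a custom evaluation function.

Precision/Recall Tradeoff

To understand this tradeoff, look at how SGDClassifier makes its classification decisions. For each instance, it computes a score with a decision function; if that score is greater than a threshold, it assigns the instance to the positive class, otherwise to the negative class. Figure 3-3 shows a few digits ordered from the lowest score on the left to the highest score on the right. Suppose the decision threshold is at the central arrow (between two 5s): there are 4 true positives (actual 5s) and one false positive (a 6) to the right of that threshold. With that threshold, precision is therefore 80% (4 out of 5). But there are 6 actual 5s and the classifier detects only 4 of them, so recall is 67% (4 out of 6). If you now raise the threshold (move it to the arrow on the right), the false positive (the 6) becomes a true negative, which increases precision (up to 100% in this case), but one true positive becomes a false negative, so recall drops to 50%. Conversely, lowering the threshold increases recall and reduces precision.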
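The arithmetic in this example can be reproduced with a short sketch. The scores below are hypothetical placeholders chosen only to preserve the ordering described above (six actual 5s among twelve digits); they are not real SGDClassifier outputs.

```python
# Hypothetical decision scores, sorted from lowest to highest, with a flag
# telling whether each image really is a 5 (6 of the 12 are).
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
is_five = [False, False, False, False, True, False,
           True, True, False, True, True, True]

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score > threshold."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and t for p, t in zip(predicted, is_five))
    fp = sum(p and not t for p, t in zip(predicted, is_five))
    fn = sum((not p) and t for p, t in zip(predicted, is_five))
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall_at(7.5))  # (0.8, 0.6666666666666666)
print(precision_recall_at(9.5))  # (1.0, 0.5)
```

Raising the threshold from 7.5 to 9.5 trades recall for precision, exactly as in the text; a lower threshold such as 3.5 pushes recall to 100% at the cost of precision.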

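Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores it uses to make predictions. Instead of calling the classifier's predict() method, call its decision_function() method, which returns a score for each instance, and then make predictions from those scores with any threshold you like.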

>>> y_scores = sgd_clf.decision_function([some_digit])
>>> y_scores
array([ 161855.74572176])
>>> threshold = 0
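>>> y_some_digit_pred = (y_scores > threshold)
>>> y_some_digit_pred
array([ True], dtype=bool)

SGDClassifier uses a threshold of 0, so the code above returns the same result as the predict() method (both return True). Now raise the threshold to 200,000, and the classifier returns False: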

>>> threshold = 200000
>>> y_some_digit_pred = (y_scores > threshold)
>>> y_some_digit_pred
array([False], dtype=bool)

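So how do you decide which threshold to use? First, get a score for every instance in the training set with cross_val_predict() again, but this time ask it to return decision scores instead of predictions.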

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, 
                            method="decision_function")

With these scores, you can compute precision and recall for every possible threshold using precision_recall_curve():

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Plot precision and recall as functions of the threshold with Matplotlib:

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds): 
    plt.figure(figsize=(10,5))
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision") 
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall") 
    plt.xlabel("Threshold") 
    plt.legend(loc="upper left") 
    plt.ylim([0, 1])

plot_precision_recall_vs_threshold(precisions, recalls, thresholds) 
plt.show()

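You may wonder why the precision curve is bumpier than the recall curve. The reason is that precision usually goes up when you raise the threshold, but it can sometimes go down. Recall, on the other hand, can only go down when the threshold is raised, which explains why its curve is smoother.

Another option is to plot precision directly against recall (the PR curve):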

def plot_precision_recall_curve(precisions, recalls, thresholds): 
    plt.plot(recalls[:-1],precisions[:-1], "b--", label="pr_curve")
    plt.xlabel("Recalls") 
    plt.ylabel("Precisions")
    plt.legend(loc="upper right") 
    plt.ylim([0, 1])

plot_precision_recall_curve(precisions, recalls, thresholds) 
plt.show()

You can now use these plots to pick the threshold that gives the best tradeoff for your task.
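One way to pick a threshold programmatically, sketched below in plain Python, is to scan the candidate cut-offs and keep the lowest one that reaches a target precision. The helper name and the toy data are assumptions made for illustration; with real data you would more likely search the arrays returned by precision_recall_curve().

```python
def lowest_threshold_for_precision(y_true, y_scores, target_precision):
    """Return the lowest score cut-off whose precision reaches the target.

    An instance is predicted positive when its score >= cut-off.
    Returns None when no cut-off reaches the target.
    """
    for cutoff in sorted(set(y_scores)):
        predicted = [s >= cutoff for s in y_scores]
        tp = sum(p and t for p, t in zip(predicted, y_true))
        fp = sum(p and not t for p, t in zip(predicted, y_true))
        if tp + fp > 0 and tp / (tp + fp) >= target_precision:
            return cutoff
    return None

# Toy scores and labels, invented for illustration.
toy_scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]
toy_labels = [False, False, True, False, True, True]

print(lowest_threshold_for_precision(toy_labels, toy_scores, 0.9))  # 2.0
```

Choosing the lowest qualifying cut-off keeps recall as high as possible for the requested precision.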

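The ROC Curve

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR). The FPR is the ratio of negative instances that are incorrectly classified as positive. It equals 1 minus the true negative rate (TNR), which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence the ROC curve plots recall versus 1 - specificity.

To plot the ROC curve, first compute the TPR and FPR for various threshold values with the roc_curve() function:

x-axis: FPR = FP / (FP + TN)
y-axis: TPR = TP / (TP + FN) (the same as recall)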

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label) 
    plt.plot([0, 1], [0, 1], 'k--') 
    plt.axis([0, 1, 0, 1]) 
    plt.xlabel('False Positive Rate') 
    plt.ylabel('True Positive Rate')

plot_roc_curve(fpr, tpr) 
plt.show()
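Each point on the curve corresponds to one threshold. The sketch below computes a single (FPR, TPR) point by hand on toy data; the helper roc_point and the data are hypothetical, and roc_curve() does this far more efficiently for all thresholds at once.

```python
def roc_point(y_true, y_scores, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for actual, score in zip(y_true, y_scores):
        predicted = score >= threshold
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)

# Toy labels and scores, invented for illustration.
toy_labels = [False, False, True, False, True, True]
toy_scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]

# Lowering the threshold moves the point from (0, 0) towards (1, 1).
for t in (5.0, 2.0, 0.5, -3.0):
    print(t, roc_point(toy_labels, toy_scores, t))
```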

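The area under the ROC curve is called the AUC (area under the ROC curve). In general, the larger the area, the better the classifier predicts: a perfect classifier has an AUC of 1, whereas a purely random one has an AUC of 0.5. The AUC measures how well the predicted scores rank the samples.

>>> from sklearn.metrics import roc_auc_score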
>>> roc_auc_score(y_train_5, y_scores)
0.96195841064232612
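The ranking interpretation can be made explicit: the AUC equals the fraction of (positive, negative) pairs in which the positive instance gets the higher score. The following is a minimal sketch of that definition on toy data (the helper name pairwise_auc is made up, and the quadratic loop is only practical for small inputs).

```python
def pairwise_auc(y_true, y_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for s, t in zip(y_scores, y_true) if t]
    negatives = [s for s, t in zip(y_scores, y_true) if not t]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy labels and scores, invented for illustration.
toy_labels = [False, False, True, False, True, True]
toy_scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]

# 8 of the 9 positive/negative pairs are ordered correctly.
auc = pairwise_auc(toy_labels, toy_scores)
print(auc)
```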

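Multiclass Classification

Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial classifiers) can distinguish between more than two.

Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) can handle multiple classes directly. Others (such as SVM classifiers or linear classifiers) are strictly binary classifiers, but several strategies let you perform multiclass classification with multiple binary classifiers.

For example, one way to build a system that classifies images into 10 classes (0 to 9) is to train 10 binary classifiers, one per digit (a 0-detector, a 1-detector, a 2-detector, and so on). To classify an image, get the decision score from each classifier for that image and select the class whose classifier outputs the highest score. This is called the one-versus-all (OvA) strategy (also called one-versus-the-rest).

Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This is called the one-versus-one (OvO) strategy. If there are N classes, you need to train N * (N - 1) / 2 classifiers. For MNIST, that means 45 binary classifiers! To classify an image, you have to run it through all 45 classifiers and see which class wins the most duels. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set that contains the two classes it must distinguish.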
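The count of 45 follows directly from enumerating the unordered pairs of classes, as this short sketch shows:

```python
from itertools import combinations

classes = list(range(10))  # the ten MNIST digits

# One binary classifier per unordered pair of classes.
pairs = list(combinations(classes, 2))

n = len(classes)
print(len(pairs), n * (n - 1) // 2)  # 45 45
print(pairs[:3])  # [(0, 1), (0, 2), (0, 3)]
```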

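Some algorithms (such as SVM classifiers) scale poorly with the size of the training set, so for them OvO is preferred: it is faster to train many classifiers on small training sets than a few classifiers on large ones. For most binary classification algorithms, however, OvA is preferred.

Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO).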

sgd_clf.fit(X_train, y_train)
sgd_clf.predict([X[48000]])

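Under the hood, Scikit-Learn trained 10 binary classifiers, got their decision scores for the image, and selected the class with the highest score.
You can verify this by calling decision_function(): instead of one score per instance, it now returns 10 scores, one per class: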

some_digit_scores = sgd_clf.decision_function([X[48000]])
>>>some_digit_scores
array([[-338449.0100089 , -542077.71379419, -301613.15143963,
        -162382.42178286, -419133.03820512, -239228.19316064,
        -625940.25295827,  129878.66137418, -138179.51048083,
         -98982.25970337]])
>>>np.argmax(some_digit_scores)
7

If you want to force Scikit-Learn to use one-versus-one or one-versus-all, use the OneVsOneClassifier or OneVsRestClassifier classes.

from sklearn.multiclass import OneVsOneClassifier

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
>>>ovo_clf.predict([X[48000]])
array([ 7.])
# Number of underlying binary classifiers
>>>len(ovo_clf.estimators_)
45

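Error Analysis

Confusion matrix

# X_train_scaled was not defined earlier in these notes; standardizing the
# inputs with StandardScaler is assumed here, as in the book.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)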

conf_mx = confusion_matrix(y_train, y_train_pred)
>>>conf_mx
array([[5739,    2,   27,   13,   10,   39,   43,    8,   38,    4],
       [   1, 6494,   51,   24,    5,   39,    5,   11,  103,    9],
       [  57,   40, 5341,   99,   82,   21,   91,   54,  150,   23],
       [  56,   44,  137, 5335,    5,  217,   37,   57,  141,  102],
       [  21,   30,   31,    9, 5367,   10,   52,   29,   92,  201],
       [  82,   42,   40,  196,   78, 4581,  105,   32,  176,   89],
       [  37,   24,   55,    2,   44,   84, 5620,    3,   48,    1],
       [  24,   24,   71,   28,   54,   10,    8, 5805,   14,  227],
       [  49,  155,   72,  160,   15,  158,   55,   27, 5014,  146],
       [  44,   33,   20,   87,  157,   44,    2,  203,   73, 5286]])

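Plotting the error rates: divide each value by the number of images in its class, then zero the diagonal so that only the errors remain.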

row_sums = conf_mx.sum(axis=1, keepdims=True) 
norm_conf_mx = conf_mx / row_sums
np.fill_diagonal(norm_conf_mx, 0) 
plt.matshow(norm_conf_mx, cmap=plt.cm.gray) 
plt.show()
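The same normalization can be written out step by step in plain Python. This sketch uses a small made-up 3-class matrix (not the MNIST result above), and error_rates is a hypothetical helper name.

```python
def error_rates(conf_matrix):
    """Row-normalize a confusion matrix and zero the diagonal."""
    rates = []
    for i, row in enumerate(conf_matrix):
        total = sum(row)  # number of instances whose actual class is i
        rates.append([0.0 if i == j else count / total
                      for j, count in enumerate(row)])
    return rates

# Small made-up confusion matrix (rows: actual, columns: predicted).
toy_conf = [[90, 5, 5],
            [10, 80, 10],
            [0, 50, 50]]

for row in error_rates(toy_conf):
    print(row)
```

Dividing by the row sums compares error rates rather than absolute error counts, so frequent classes do not look unfairly bad.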

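The brighter a cell, the more errors it represents (rows are actual classes, columns are predicted classes).

The plot shows that 3s and 5s, as well as 7s and 9s, are confused relatively often.
Let us analyze the 3s and 5s: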

cl_a, cl_b = 3, 5 
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)] 
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)] 
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)] 
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]
plt.figure(figsize=(8,8)) 
plt.subplot(221)
plot_digits(X_aa[:25], images_per_row=5) 
plt.subplot(222)
plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223)
plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224)
plot_digits(X_bb[:25], images_per_row=5) 
plt.show()