Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow 2 (2nd Edition), Chapter 3: Classification

Latest recommended article published on 2022-06-10 21:54:26

bigcindy

Category: Hands On ML. Tags: TensorFlow2, sklearn, MNIST, binary classification, multiclass classification

Article link: https://blog.csdn.net/Jwenxue/article/details/107006484

0. Importing the required libraries

import sklearn
import matplotlib as mpl
from matplotlib import pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei'] # font that can render Chinese labels
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly
import numpy as np

for i in [mpl, np, sklearn]:
    print(i.__name__,": ",i.__version__)

Output:

matplotlib :  3.1.2
numpy :  1.17.4
sklearn :  0.21.3

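1. The MNIST dataset

The MNIST dataset contains 70,000 images, each a 28×28 grayscale image of a digit handwritten by high school students and employees of the US Census Bureau. It is used so often as an introductory machine learning example that it is commonly called the "hello world" of machine learning.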

from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784",version=1)
mnist.keys()

Output:

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url'])

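The object returned by fetch_openml is dictionary-like, with the keys shown above. The images are stored under "data" and the labels under "target". (The code here was run with sklearn 0.21.3, which returns NumPy arrays; with scikit-learn 0.24 or later, pass as_frame=False to fetch_openml to get arrays instead of a DataFrame.)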

X,y = mnist["data"], mnist["target"]
X.shape, y.shape

Output:

((70000, 784), (70000,))

mnist.DESCR

Output:

"**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges  \n**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown  \n**Please cite**:  \n\nThe MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples  \n\nIt is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.  \n\nWith some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass. If you do this kind of pre-processing, you should report it in your publications. The MNIST database was constructed from NIST's NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this can be found on the fact that SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test among the complete set of samples. 
Therefore it was necessary to build a new database by mixing NIST's datasets.  \n\nThe MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. Our test set was composed of 5,000 patterns from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training set contained examples from approximately 250 writers. We made sure that the sets of writers of the training set and test set were disjoint. SD-1 contains 58,527 digit images written by 500 different writers. In contrast to SD-3, where blocks of data from each writer appeared in sequence, the data in SD-1 is scrambled. Writer identities for SD-1 is available and we used this information to unscramble the writers. We then split SD-1 in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each. The new training set was completed with enough examples from SD-3, starting at pattern # 0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with SD-3 examples starting at pattern # 35,000 to make a full set with 60,000 test patterns. Only a subset of 10,000 test images (5,000 from SD-1 and 5,000 from SD-3) is available on this site. The full 60,000 sample training set is available.\n\nDownloaded from openml.org."

Display a few images:

plt.figure(figsize=(10,5))
for i in range(50):
    plt.subplot(5,10,i+1)
    image = X[i].reshape(28,28)
    plt.imshow(image, cmap="binary", interpolation="nearest")
    plt.axis("off")
    plt.title(y[i])
    

plt.tight_layout()
plt.show()

Output (figure):

type(y[0])

Output:

str

Note: as the output above shows, the labels are strings at this point, so they need to be converted to numbers. uint8 covers the range 0-255:

y = y.astype(np.uint8)  
type(y[0])

Output:

numpy.uint8

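Before training, split the dataset into a training set and a test set. The conventional MNIST split uses the first 60,000 images for training and the remaining 10,000 for testing: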

X_train, X_test, y_train, y_test = X[:60000],X[60000:],y[:60000],y[60000:]

for i in (X_train, X_test, y_train, y_test):
    print(i.shape)

Output:

(60000, 784)
(10000, 784)
(60000,)
(10000,)

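2. Binary classification

Now turn the 10-class MNIST problem into a binary one. Taking the digit 5 as the target, relabel the data into two classes: samples that are a 5 and samples that are not.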

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

y_train_5[:10]

Output:

array([ True, False, False, False, False, False, False, False, False,
       False])

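A good starting point for classification is a stochastic gradient descent (SGD) classifier, which sklearn provides as the SGDClassifier class. Its main advantage is that it handles very large datasets efficiently: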

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train_5)

Output:

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=42, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)
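
To make the idea behind the classifier concrete, here is a minimal pure-Python sketch of stochastic gradient descent on the hinge loss (the loss='hinge' shown in the output above) with an L2 penalty. It is only an illustration under simplifying assumptions: it uses a constant learning rate rather than sklearn's 'optimal' schedule, the toy dataset is made up, and train_sgd_hinge and predict are hypothetical helper names, not sklearn APIs.

```python
import random

def train_sgd_hinge(X, y, lr=0.1, alpha=0.0001, epochs=20, seed=42):
    """Fit a linear classifier with per-sample SGD on the hinge loss.

    X: list of feature lists; y: list of labels in {-1, +1}.
    Returns (weights, bias).
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # The L2 penalty shrinks the weights slightly on every step.
            w = [wj * (1.0 - lr * alpha) for wj in w]
            if y[i] * score < 1.0:  # inside the margin: the hinge loss is active
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the sign of the decision score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# Hypothetical, linearly separable toy data (not MNIST).
X_toy = [[2, 2], [3, 3], [2, 3], [3, 2], [-2, -2], [-3, -3], [-2, -3], [-3, -2]]
y_toy = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_sgd_hinge(X_toy, y_toy)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X_toy, y_toy)) / len(y_toy)
print(accuracy)
```

Because each update looks at a single sample, the cost per step does not depend on the dataset size, which is why this family of methods scales to large datasets.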

With the model trained, predict the first 50 test samples and see how it does:

plt.figure(figsize=(10,5))
for i in range(50):
    plt.subplot(5,10,i+1)
    image = X_test[i].reshape(28,28)
    plt.imshow(image, cmap="binary", interpolation="nearest")
    plt.axis("off")
    plt.title("5" if sgd_clf.predict([X_test[i]])[0] else "Not 5")
    

plt.tight_layout()
plt.show()

Output (figure):

As the figure shows, most of the predictions are correct.

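3. Model evaluation

Evaluating a classifier is considerably trickier than evaluating a regressor:

3.1 Measuring accuracy with cross-validation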

from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Output:

array([0.95035, 0.96035, 0.9604 ])

A manual implementation of what cross_val_score does in sklearn:

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)  # no shuffling; recent sklearn versions reject random_state when shuffle=False
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred==y_test_fold)
    print(n_correct/len(y_pred))

Output:

0.95035
0.96035
0.9604
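
What "stratified" means can be sketched in a few lines of plain Python: every fold should keep roughly the same class ratio as the full dataset. The helper below is a simplified illustration that deals the indices of each class out round-robin; it is not sklearn's exact assignment scheme, and stratified_folds and the toy labels are hypothetical.

```python
def stratified_folds(labels, n_splits=3):
    """Assign each sample index to a fold so every fold keeps the class ratio."""
    folds = [[] for _ in range(n_splits)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)  # deal out round-robin
    return [sorted(fold) for fold in folds]

labels = [True] * 3 + [False] * 27   # hypothetical data with 10% positives
folds = stratified_folds(labels)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)      # fold size and number of positives in it
```

Each fold ends up with the same share of positives, so the accuracy measured on each fold is computed on a comparable class mix.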

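Judging by these outputs, the model reaches over 95% accuracy with almost no effort. To check whether that number means much, train another model for comparison: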

from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
    def fit(self,X,y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X),1),dtype=bool)  # all-False array: the model makes no real prediction and always returns False
    
never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Output:

array([0.91125, 0.90855, 0.90915])

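This model ignores its input entirely and always returns False, that is, it classifies every image as "not 5" without any analysis. In MNIST only about 10% of the images are 5s, so about 90% are not. Always predicting "not 5" is therefore already enough to exceed 90% accuracy.

Accuracy is therefore a poor indicator of model quality, especially on imbalanced datasets like this one.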
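
The effect is easy to reproduce on made-up data. The sketch below assumes a hypothetical dataset with exactly 10% positives and a classifier that always answers "not 5"; it then computes accuracy next to the fraction of actual 5s that were found.

```python
y_true = [True] * 10 + [False] * 90      # hypothetical labels: 10% are "5"
y_pred = [False] * len(y_true)           # always predict "not 5"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
found = sum(t and p for t, p in zip(y_true, y_pred))
found_fraction = found / sum(y_true)     # share of real 5s that were detected
print(accuracy, found_fraction)
```

The accuracy looks respectable even though not a single 5 is detected, which is why accuracy alone says little here.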

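3.2 Confusion matrix

A better way to evaluate a classifier is the confusion matrix, which counts how many times samples of class A were classified as class B.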

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3) # cross_val_predict returns the predictions rather than scores

from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)

Output:

array([[53892,   687],
       [ 1891,  3530]], dtype=int64)

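Precision is the fraction of positive predictions that are correct: precision = TP/(TP+FP)

Recall, also called sensitivity or the true positive rate (TPR), is the fraction of actual positives that the model finds: recall = TP/(TP+FN)

TP: positives predicted as positive
TN: negatives predicted as negative
FP: negatives predicted as positive
FN: positives predicted as negative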
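
The four counts and the two ratios can be computed directly from a list of true labels and a list of predictions. This is a small standalone sketch on hypothetical labels; confusion_counts is an illustrative helper, not an sklearn function.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for two sequences of boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

# Hypothetical labels: 3 positives, 5 negatives.
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, True, False, True, False, False, False, False]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # how many predicted positives are real
recall = tp / (tp + fn)      # how many real positives were found
print(tp, tn, fp, fn, precision, recall)
```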

sklearn provides functions to compute precision and recall:

from sklearn.metrics import precision_score, recall_score

print("Precision Score: ",precision_score(y_train_5, y_train_pred))
print("Recall Score: ",recall_score(y_train_5, y_train_pred))

Output:

Precision Score:  0.8370879772350012
Recall Score:  0.6511713705958311

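The calculation can be checked against the confusion matrix:

Precision: TP/(TP+FP) = 3530/(3530+687) = 0.8370879772350012

Recall: TP/(TP+FN) = 3530/(3530+1891) = 0.6511713705958311

To take both precision and recall into account, the two are often combined into the F1 score, their harmonic mean:

F1 = 2 * (precision * recall) / (precision + recall)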

from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)

Output:

0.7325171197343846
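
The same value can be reproduced by hand from the confusion matrix printed earlier. The sketch below also computes the plain arithmetic mean for comparison, to show that the harmonic mean is pulled toward the smaller of the two values.

```python
# Counts taken from the confusion matrix shown earlier in this section.
tn, fp, fn, tp = 53892, 687, 1891, 3530

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
arithmetic_mean = (precision + recall) / 2           # for comparison only
print(f1, arithmetic_mean)
```

A classifier therefore only gets a high F1 score when both precision and recall are high.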

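The F1 score balances precision and recall and favors classifiers where the two are similar. Sometimes precision matters more and sometimes recall does, depending on where the model is used and how costly each kind of error is. A cancer screening model, for example, should prioritize recall, because the goal is to find as many patients as possible.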

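3.3 The precision/recall trade-off

SGDClassifier computes a score for each sample with decision_function and predicts the positive class when the score exceeds a threshold (0 by default). Moving the threshold trades precision against recall: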

y_scores = sgd_clf.decision_function([X[0]])
y_scores

Output:

array([2164.22030239])

threshold = 0
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

Output:

array([ True])

threshold = 8000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

Output:

array([False])

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function") # returns decision scores instead of predictions

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1],"b--",label="Precision",linewidth=2)
    plt.plot(thresholds, recalls[:-1],"g-",label="Recall",linewidth=2)
    plt.legend(fontsize=16)
    plt.xlabel("Threshold",fontsize=16)
    plt.grid(True)
    plt.axis([-50000,50000,0,1])
    
plt.figure(figsize=(10,5))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)


plt.show()

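Output (figure):

The precision curve is bumpier than the recall curve because raising the threshold can occasionally lower precision, whereas recall can only fall or stay the same as the threshold rises.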
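
A tiny example shows how precision can drop when the threshold goes up. The scores and labels below are hypothetical, and precision_recall_at is an illustrative helper that predicts positive when the score is at least the threshold.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0  # no predictions: define as 1
    recall = tp / (tp + fn)
    return precision, recall

# Six samples sorted by score; one negative (score 4) sits among the positives.
scores = [1, 2, 3, 4, 5, 6]
labels = [False, True, True, False, True, True]

results = {t: precision_recall_at(scores, labels, t) for t in (2, 3, 5)}
for threshold, (precision, recall) in results.items():
    print(threshold, precision, recall)
```

Moving the threshold from 2 to 3 drops a true positive while the false positive at score 4 stays, so precision falls from 0.8 to 0.75 before rising again at threshold 5. Recall only moves downward.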

(y_train_pred == (y_scores > 0)).all()

Output:

True

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-",linewidth=2)
    plt.xlabel("Recall",fontsize=16)
    plt.ylabel("Precision",fontsize=16)
    plt.axis([0,1,0,1])
    plt.grid(True)

plt.figure(figsize=(10,5))
plot_precision_vs_recall(precisions, recalls)

plt.show()

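Output (figure):

Precision starts to fall sharply at around 80% recall, so a reasonable trade-off lies just before that drop. The exact operating point depends on the application.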

threshold_90_precision = thresholds[np.argmax(precisions>=0.90)]
threshold_90_precision

Output:

3370.0194991439557

y_train_pred_90 = (y_scores >= threshold_90_precision)

precision_score(y_train_5, y_train_pred_90)

Output:

0.9000345901072293

recall_score(y_train_5, y_train_pred_90)

Output:

0.4799852425751706

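With a suitable threshold, the classifier can be tuned to almost any precision. But higher precision usually means lower recall, so the two have to be traded off against each other.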
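
The threshold search done with np.argmax(precisions>=0.90) can be sketched without NumPy: walk through the candidate thresholds in ascending order and stop at the first one whose precision reaches the target. The data and the helper name lowest_threshold_for_precision are hypothetical.

```python
def lowest_threshold_for_precision(scores, labels, target):
    """Return (threshold, precision, recall) for the smallest threshold
    whose precision reaches the target, or None if no threshold does."""
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and l for p, l in zip(predicted, labels))
        fp = sum(p and not l for p, l in zip(predicted, labels))
        if tp + fp and tp / (tp + fp) >= target:
            return threshold, tp / (tp + fp), tp / sum(labels)
    return None

scores = [1, 2, 3, 4, 5, 6]
labels = [False, True, True, False, True, True]

choice = lowest_threshold_for_precision(scores, labels, 0.90)
print(choice)
```

On this toy data, reaching 90% precision costs half of the recall, the same kind of trade seen with the SGD classifier above.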

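3.4 The ROC curve

ROC stands for receiver operating characteristic. The ROC curve is a common tool for binary classifiers; it plots the true positive rate against the false positive rate:

TPR: true positive rate, i.e. recall
FPR: false positive rate, the fraction of negative samples incorrectly classified as positive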

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0,1],[0,1],"k--")
    plt.axis([0,1,0,1])
    plt.xlabel("False Positive Rate(Fall-Out)",fontsize=16)
    plt.ylabel("True Positive Rate(Recall)",fontsize=16)
    plt.grid(True)
    
plt.figure(figsize=(8,8))
plot_roc_curve(fpr, tpr)
plt.show()

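Output (figure):

The higher the recall (TPR), the higher the false positive rate. The black dashed line is the ROC curve of a purely random classifier.

A common way to compare models with ROC curves is the area under the curve (AUC). The ideal area is 1, so the closer a model's AUC is to 1, the better the model. The random classifier shown by the black dashed line has an AUC of 0.5, which corresponds to scores that carry no information about the class.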
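
One way to build intuition for the AUC is its ranking interpretation: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below computes it that way on hypothetical scores; roc_auc here is an illustrative pure-Python helper, not sklearn's roc_auc_score.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    positives = [s for s, l in zip(scores, labels) if l]
    negatives = [s for s, l in zip(scores, labels) if not l]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [False, True, True, False, True, True]
auc_informative = roc_auc([1, 2, 3, 4, 5, 6], labels)  # mostly correct ranking
auc_constant = roc_auc([0, 0, 0, 0, 0, 0], labels)     # no ranking ability
auc_perfect = roc_auc([0, 1, 1, 0, 1, 1], labels)      # every pair ranked correctly
print(auc_informative, auc_constant, auc_perfect)
```

The nested loop is quadratic in the number of samples, so it is only meant for small illustrations.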

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)

Output:

0.9604938554008616

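ROC curves and precision/recall (PR) curves are similar, so which one should be used in practice?

Use the PR curve when positive samples are rare, or when false positives matter more than false negatives.
Otherwise, use the ROC curve.

Now train a random forest classifier and compare its ROC curve and AUC with those of the SGD classifier: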

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")

y_score_forest = y_probas_forest[:,1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_score_forest)

plt.figure(figsize=(8,8))
plt.plot(fpr, tpr, "b:",linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.grid(True)
plt.legend(fontsize=16)
plt.show()

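Output (figure):

The random forest's ROC curve encloses a larger area than the SGD classifier's, which indicates that the random forest performs better.

# Area under the ROC curve: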
roc_auc_score(y_train_5, y_score_forest)

Output:

0.9983436731328145

y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
precision_score(y_train_5, y_train_pred_forest)

Output:

0.9905083315756169

recall_score(y_train_5, y_train_pred_forest)

Output:

0.8662608374838591

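The random forest beats the SGD classifier on both precision and recall.

4. Multiclass classification

Some algorithms, such as SGD classifiers, random forests, and naive Bayes classifiers, can handle multiple classes directly, while others, such as logistic regression and SVMs, are strictly binary classifiers. Several binary classifiers can still be combined to solve a multiclass task, in one of two ways:

One-versus-the-rest (OvR): train one binary classifier per class, treating that class as positive and all others as negative, so the number of binary classifiers equals the number of classes. MNIST with 10 classes needs 10 binary classifiers.
One-versus-one (OvO): train one binary classifier for every pair of classes. With N classes this takes N*(N-1)/2 binary classifiers, so MNIST with 10 classes needs 45.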
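
The two counts can be checked by simply enumerating the binary sub-problems for the ten digits. This is only a counting sketch; the task tuples are a hypothetical representation and no classifier is trained.

```python
from itertools import combinations

classes = list(range(10))                    # the ten MNIST digits

ovr_tasks = [(c, "rest") for c in classes]   # one "class c vs. everything else" task each
ovo_tasks = list(combinations(classes, 2))   # one task per unordered pair of classes

print(len(ovr_tasks), len(ovo_tasks))
print(ovo_tasks[:3])
```

OvO needs more classifiers, but each one is trained only on the samples of its two classes, which helps algorithms that scale poorly with training set size.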

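sklearn detects when a binary classifier is used for a multiclass task and automatically applies one of these two strategies, depending on the algorithm. For algorithms such as SVMs it prefers OvO; for most other binary algorithms it prefers OvR.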

from sklearn.svm import SVC

svm_clf = SVC(gamma="auto", random_state=42)
svm_clf.fit(X_train[:1000], y_train[:1000])  # training on the full set is slow, so use only the first 1000 samples to speed things up
svm_clf.predict([X[0]])

Output:

array([5], dtype=uint8)

some_dicit = X[0]
some_digit_scores = svm_clf.decision_function([some_dicit])
some_digit_scores

Output:

array([[ 2.81585438,  7.09167958,  3.82972099,  0.79365551,  5.8885703 ,
         9.29718395,  1.79862509,  8.10392157, -0.228207  ,  4.83753243]])

np.argmax(some_digit_scores)

Output:

5

svm_clf.classes_

Output:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8)

svm_clf.classes_[5]

Output:

5

To force OvO or OvR instead of letting sklearn choose, use the OneVsOneClassifier or OneVsRestClassifier class:

from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SVC(gamma="auto",random_state=42))
ovr_clf.fit(X_train[:1000],y_train[:1000])
ovr_clf.predict([some_dicit])

Output:

array([5], dtype=uint8)

len(ovr_clf.estimators_)

Output:

10

SGDClassifier can also be trained directly on the multiclass task:

sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_dicit])

Output:

array([3], dtype=uint8)

sgd_clf.decision_function([some_dicit])

Output:

array([[-31893.03095419, -34419.69069632,  -9530.63950739,
          1823.73154031, -22320.14822878,  -1385.80478895,
        -26188.91070951, -16147.51323997,  -4604.35491274,
        -12050.767298  ]])

cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

Output:

array([0.87082583, 0.87089354, 0.88628294])

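The cross-validation results show an average accuracy of about 87%, which is decent for a 10-class task (random guessing would get about 10%).

These results use the raw inputs with no preprocessing. Standardizing the pixel values may give better results: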

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

Output:

array([0.89957009, 0.89344467, 0.89963495])

After standardizing the inputs, accuracy rises to over 89%.
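
What StandardScaler does to each feature can be sketched with the standard library: subtract the column mean and divide by the column's population standard deviation. The pixel values below are hypothetical and standardize is an illustrative helper; the handling of constant columns (dividing by 1 instead of 0) mirrors sklearn's behavior as far as I know, but treat that detail as an assumption.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale values to zero mean and unit variance (population std)."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:          # constant feature: center it, but do not divide by 0
        sigma = 1.0
    return [(value - mu) / sigma for value in column]

pixel_column = [0, 0, 128, 255, 255]     # hypothetical intensities of one pixel
scaled = standardize(pixel_column)
print([round(v, 3) for v in scaled])

always_black = standardize([0, 0, 0])    # e.g. a border pixel that is never set
print(always_black)
```

Putting all features on a comparable scale tends to help gradient-based learners such as SGD, since no single large-valued feature dominates the updates.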

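5. Error analysis

Once a promising model has been selected, the next step is to look for ways to improve it, which requires understanding where its errors come from.

Start with the confusion matrix: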

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx

Output:

array([[5576,    0,   21,    6,    9,   43,   37,    6,  224,    1],
       [   0, 6398,   38,   23,    4,   44,    4,    8,  213,   10],
       [  26,   27, 5242,   90,   71,   26,   62,   36,  371,    7],
       [  24,   17,  117, 5220,    2,  208,   28,   40,  405,   70],
       [  12,   14,   48,   10, 5192,   10,   36,   26,  330,  164],
       [  28,   15,   33,  166,   55, 4437,   76,   14,  538,   59],
       [  30,   14,   41,    2,   43,   95, 5560,    4,  128,    1],
       [  21,    9,   52,   27,   51,   12,    3, 5693,  188,  209],
       [  17,   63,   46,   90,    3,  125,   25,   10, 5429,   43],
       [  23,   18,   31,   66,  116,   32,    1,  179,  377, 5106]],
      dtype=int64)

Viewed directly, the confusion matrix is just a grid of numbers; matplotlib can visualize it:

def plot_confusion_matrix(matrix):
    fig = plt.figure(figsize=(5,5))
    ax = fig.add_subplot(111)
    cax = ax.matshow(matrix)
    fig.colorbar(cax)

plot_confusion_matrix(conf_mx)
plt.show()

Output (figure):

In grayscale:

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

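Output (figure):

Each row of the confusion matrix is a true class and each column a predicted class; the brighter a diagonal cell, the more samples of that class were classified correctly. The diagonal here is bright, so most images were classified correctly. The cell for the digit 5 is somewhat darker, which could mean either that the dataset contains fewer 5s or that the classifier recognizes 5s less well. To tell which, divide each row of the confusion matrix by the number of images in that class: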

row_sums = conf_mx.sum(axis=1,keepdims=True)
row_sums

Output:

array([[5923],
       [6742],
       [5958],
       [6131],
       [5842],
       [5421],
       [5918],
       [6265],
       [5851],
       [5949]], dtype=int64)

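This output gives the number of training images for each digit. Every class has roughly 6,000 images; the digit 1 has the most (6,742) and the digit 5 the fewest (5,421). This is consistent with the 1 cell being the brightest and the 5 cell the darkest on the diagonal of the confusion matrix above.

Next, divide the confusion matrix by these class counts and fill the diagonal with zeros so that only the errors remain: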

norm_conf_mx = conf_mx / row_sums

np.fill_diagonal(norm_conf_mx, 0)
plot_confusion_matrix(norm_conf_mx)
plt.show()

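Output (figure):

Column 8 is bright, and since columns are predicted classes, many images were misclassified as 8. Row 8, on the other hand, is fairly dark, so actual 8s are mostly classified correctly. In other words, the classifier mistakes many other digits for 8, but rarely mistakes an 8 for something else.

The digits 3 and 5 are also confused in both directions: some 3s are classified as 5s and some 5s as 3s, with similar brightness.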
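
The row normalization used for this plot can be illustrated on a small made-up matrix. The 3-class counts below are hypothetical; the point is that raw error counts and per-class error rates can rank the errors differently when class sizes differ.

```python
conf = [            # hypothetical 3-class confusion matrix (rows: true class)
    [90, 4, 6],
    [10, 35, 5],
    [2, 3, 45],
]

error_rates = []
for i, row in enumerate(conf):
    total = sum(row)   # number of samples whose true class is i
    # Divide by the class size and zero the diagonal so only errors remain.
    error_rates.append([0.0 if i == j else count / total
                        for j, count in enumerate(row)])

for row in error_rates:
    print([round(rate, 2) for rate in row])
```

Class 0 has more samples misread as class 2 (6) than class 1 does (5), yet the rate for class 1 is higher (0.1 versus 0.06) because class 1 is half the size. Comparing rates instead of counts keeps large classes from looking worse than they are.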

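The confusion matrix suggests where to improve the model, namely by reducing the number of digits wrongly classified as 8:

Collect more images of digits that look like 8 but are not, so the model learns to tell them apart.
Engineer new features that help the classifier, such as the number of closed loops in a digit (8 has two, 6 has one, 9 has one, and so on).
Preprocess the images with Scikit-Image, Pillow, or OpenCV to make such patterns, for example closed loops, stand out more.

To study the errors within individual classes, plot the original images grouped by true and predicted label:

# Helper that displays a grid of digit images: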
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = mpl.cm.binary, **options)
    plt.axis("off")

def compare_origin_pred(a,b):
    cl_a, cl_b = a,b
    X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
    X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
    X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
    X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

    plt.figure(figsize=(10,10))
    X_list = [X_aa, X_ab, X_ba, X_bb]
    X_title = ["True label: {}  Predicted label: {}".format(a,a),
               "True label: {}  Predicted label: {}".format(a,b),
               "True label: {}  Predicted label: {}".format(b,a),
               "True label: {}  Predicted label: {}".format(b,b)]
    for i in range(len(X_list)):
        plt.subplot(2,2,i+1)
        plot_digits(X_list[i][:25],images_per_row=5)
        plt.title(X_title[i])

    plt.tight_layout()
    plt.show()

compare_origin_pred(3,5)

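Output (figure):

In the output above, many images labeled 3 were predicted as 5. Most of these are clear mistakes, but some are written so much like a 5 (for example row 5, column 3) that they may actually be 5s mislabeled as 3. Likewise, many images labeled 5 were wrongly predicted as 3, but some (for example row 1, column 2) look so much like a 3 that a human would likely read them as a 3 too.

Next, look at the brightest column of the confusion matrix by comparing some predictions for 5 and 8: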

compare_origin_pred(5,8)

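Output (figure):

Compared with the 3 versus 5 case, most 5s predicted as 8 and 8s predicted as 5 look like obvious errors caused by a weak model; the handwriting itself seems less ambiguous than for 3 and 5.

Part of the reason for these misclassifications is that the model is too simple: SGDClassifier is just a plain linear model.

6. Multilabel classification

Some applications need a classifier that outputs several labels per sample. In face recognition, for example, a model applied to a group photo has to return several predictions, one for each person it looks for.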

from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

Output:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

knn_clf.predict([some_dicit])

Output:

array([[False,  True]])

The result is correct: the digit in some_dicit is a 5, which is less than 7 and odd.
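
The two target labels are simple functions of the digit, so the expected answer for any digit can be worked out by hand. The helper multilabel_target below is a hypothetical standalone restatement of how y_multilabel is built.

```python
def multilabel_target(digit):
    """Return (is_large, is_odd) for a digit label, mirroring y_multilabel."""
    return (digit >= 7, digit % 2 == 1)

targets = {digit: multilabel_target(digit) for digit in (5, 8, 9)}
for digit, target in targets.items():
    print(digit, target)
```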

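A common way to evaluate a multilabel classifier is to compute the F1 score for each label and then average the scores:

# This step is very slow and may take an hour or two: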
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")

Output:

0.976410265560605
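
The average="macro" option means an unweighted mean of the per-label F1 scores. The sketch below spells that out on four hypothetical samples; f1 is an illustrative pure-Python helper, not sklearn's f1_score.

```python
def f1(y_true, y_pred):
    """F1 score for one binary label, written as 2TP / (2TP + FP + FN)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Each row is one sample: (is_large, is_odd). All values are hypothetical.
y_true = [(False, True), (True, False), (True, True), (False, False)]
y_pred = [(False, True), (True, False), (False, True), (False, True)]

# Score each label column separately, then take the plain mean.
per_label = [f1([row[k] for row in y_true], [row[k] for row in y_pred])
             for k in range(2)]
macro_f1 = sum(per_label) / len(per_label)
print(per_label, macro_f1)
```

Macro averaging treats every label as equally important regardless of how many samples carry it.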

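7. Multioutput classification

To illustrate multioutput classification, build a model that removes noise from images: it takes a noisy image as input and outputs a clean one. The output is one value per pixel, and each pixel value ranges from 0 to 255, so this is a typical multioutput multiclass task.

First, add noise to the images: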

noise = np.random.randint(0,100,(len(X_train),784))
X_train_mod = X_train + noise
noise = np.random.randint(0,100,(len(X_test),784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test

some_index = 0
plt.subplot(1,2,1)
plot_digits([X_test_mod[some_index]])
plt.subplot(1,2,2)
plot_digits([y_test_mod[some_index]])
plt.show()

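Output (figure):

The left image is the noisy one, used as the model input; the right image is the clean one, which serves as the label.

Train the model with KNN: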

knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digits(clean_digit)

Output (figure):

plot_digits([y_test_mod[some_index]])

Output (figure):

The model removes the noise, although its output differs slightly from the original image (compare the edges and corners).