The "Lizard Book": Lecture Notes and Source Code Walkthrough, Part 03

webufoqiu

Category: AI. Tags: sklearn, python

Article link: https://blog.csdn.net/webufoqiu/article/details/120834875

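This article walks through the MNIST handwritten digit dataset: fetching the data, preprocessing, model training, and evaluation. Using a binary task that detects the digit 5, it covers precision, recall, the F1 score, the confusion matrix, the PR curve, and the ROC curve, and shows a stochastic gradient descent classifier and cross-validation in use. It closes with an error analysis that exposes where the model struggles when predicting the digit 5.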

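The `MNIST` dataset

MNIST contains 70,000 handwritten digit images, each 28×28 pixels, written by US high school students and employees of the US Census Bureau. It is the "Hello World" of machine learning: sooner or later every beginner works with MNIST. Scikit-Learn provides many helper functions for downloading popular datasets, and MNIST is one of them:

More about this dataset: https://www.openml.org/d/554

Fetching the dataset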

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.keys()

#dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

X, y = mnist["data"], mnist["target"]
X.shape,y.shape,mnist.frame, mnist.categories,mnist.target_names

# shapes of data and target, then the frame, categories, and target names
#((70000, 784), (70000,), None, {}, ['class'])

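On Windows, the data is downloaded by default to C:\Users\admin\scikit_learn_data (for a user named admin).

On Linux/Mac, the data is downloaded by default to $HOME/scikit_learn_data.

Datasets loaded by sklearn share a similar dictionary structure, including:

the DESCR key, which describes the dataset
the data key, which holds an array with one row per instance and one column per feature
the target key, which holds an array of labels

For details on the sklearn.datasets.fetch_openml function, see:

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml

Taking a look at the data: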

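%matplotlib inline
import numpy as np  # used by the plotting helpers and label conversion below
import matplotlib as mpl
import matplotlib.pyplot as plt

some_digit = X[0]  # samples 0 and 11 are both handwritten 5s
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap=mpl.cm.binary)  # show as a grayscale image (binary colormap)
plt.show()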


def plot_digit(data):
    image = data.reshape(28, 28)
    # grayscale image; interpolation is set explicitly to nearest neighbor
    plt.imshow(image, cmap = mpl.cm.binary, interpolation="nearest")
    plt.axis("off")

# Show several handwritten digits in a grid, 10 per row by default
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)  # fewer images than one full row
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1  # integer division, rounded up
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = mpl.cm.binary, **options)
    plt.axis("off")

# Show the first 25 handwritten digits
plt.figure(figsize=(9,9))
example_images = X[:25]
plot_digits(example_images, images_per_row=5)
plt.show()

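Preparing the training and test sets

y = y.astype(np.uint8)  # labels are strings by default; convert them to numbers
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

# X_train, y_train: features and labels of the training set (60000 instances)
# X_test, y_test: features and labels of the test set (10000 instances)

Training a binary classifier

Simplify the task of recognizing all 10 handwritten digits to recognizing a single digit.

Take the digit 5: there are only two outcomes, 5 or not-5, corresponding to True or False.

So the labels are converted from the 10 digit classes 0-9 to True (5) and False (not 5):

y_train_5 = (y_train == 5)  # labels used for training
y_test_5 = (y_test == 5)    # labels used for testing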

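Stochastic gradient descent classifier

from sklearn.linear_model import SGDClassifier
# max_iter: maximum number of passes over the training data; tol: stopping tolerance
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train_5)
# training takes a while

some_digit = X[11]
sgd_clf.predict([some_digit])
# check whether the image of a 5 is classified as 5 or not-5; the result is True, which looks promising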

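Stochastic gradient descent (SGD) is a simple and efficient way to fit linear classifiers under a convex loss function (such as a linear SVM or logistic regression). SGD has been applied successfully to large-scale and sparse machine learning problems, and is often used in text classification and natural language processing.

Advantages of SGD: it handles large datasets efficiently; it is easy to implement (with plenty of room for code tuning).

Disadvantages of SGD: it requires a number of hyperparameters, such as the regularization parameter and the number of iterations; it is sensitive to feature scaling.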
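To make the idea concrete, here is a minimal sketch of per-sample gradient updates on the hinge loss (the loss a linear SVM uses). It is an illustration under simplifying assumptions, not how `SGDClassifier` is implemented: there is no regularization, the learning rate is fixed, and the function names and toy data are hypothetical.

```python
import random

def sgd_hinge_fit(samples, labels, epochs=5, lr=1.0, seed=42):
    """Fit weights w and bias b of a linear classifier with per-sample hinge-loss updates."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            x, y = samples[i], labels[i]  # y is +1 or -1
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # hinge loss is non-zero: step along its subgradient
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b

def sgd_predict(w, b, x):
    """Return +1 or -1 depending on the side of the decision boundary."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy data (not MNIST).
toy_X = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -1), (-1, -3)]
toy_y = [1, 1, 1, -1, -1, -1]
w, b = sgd_hinge_fit(toy_X, toy_y)
toy_pred = [sgd_predict(w, b, x) for x in toy_X]
```

Each update looks at a single sample, which is why the method scales to large datasets, and the step size depends directly on the feature values, which is one way to see why unscaled features hurt.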

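Measuring accuracy with cross-validation

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

#array([0.95035, 0.96035, 0.9604 ])
# cv=3: the training data is split into three folds; each round predicts on one fold and trains on the other two
# Three rounds, all above 95% accuracy. Impressive?
# The catch: only about 10% of the training images are 5s and about 90% are not, so predicting "not 5" for everything already gives about 90% accuracy!
# Conclusions: 1. accuracy alone is not a reliable measure of classifier performance; 2. this is a skewed dataset, which is exactly where accuracy misleads the most.

Accuracy: $\frac{TP+TN}{TP+FN+FP+TN}=\frac{\text{true positives}+\text{true negatives}}{\text{all samples}}$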
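The "about 90%" baseline can be checked by plugging numbers into the accuracy formula. The sketch below assumes a hypothetical classifier that always answers "not 5"; the class counts (54579 non-5s and 5421 5s) are the ones this post reports for the training labels.

```python
n_negative = 54579  # training images that are not a 5
n_positive = 5421   # training images that are a 5

# A classifier that always answers "not 5":
# every negative becomes a true negative, every positive a false negative.
tp, fp = 0, 0
tn, fn = n_negative, n_positive

baseline_accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"always-not-5 accuracy: {baseline_accuracy:.3f}")
```

The result is just under 0.91, so the 95-96% scores above are only a few points better than a model that never detects a single 5.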

Using the confusion matrix

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

confusion_matrix_plt=confusion_matrix(y_train_5, y_train_pred)
confusion_matrix_plt
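#array([[53892,   687],                # true negatives (correct not-5), false positives (wrongly predicted 5)
#       [ 1891,  3530]], dtype=int64)  # false negatives (wrongly predicted not-5), true positives (correct 5)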

plt.matshow(confusion_matrix_plt)   # draws a matrix or array as an image
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

[Figure: confusion matrix]

A pretend-perfect confusion matrix

y_train_perfect_predictions = y_train_5  # pretend we reached perfection
confusion_matrix(y_train_5, y_train_perfect_predictions)
#array([[54579,     0],
#       [    0,  5421]], dtype=int64)
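# 60000 samples: 54579 true negatives and 5421 true positives. Perfect, but only pretend.

Precision: $Precision_{\text{positive class}} = \frac{TP}{TP+FP} = \frac{\text{true positives}}{\text{true positives}+\text{false positives}}$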

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)
# equivalent to
cm = confusion_matrix(y_train_5, y_train_pred)
cm[1, 1] / (cm[0, 1] + cm[1, 1])

#0.8370879772350012

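Recall (true positive rate): $Recall_{\text{positive class}} = \frac{TP}{TP+FN} = \frac{\text{true positives}}{\text{true positives}+\text{false negatives}}$

recall_score(y_train_5, y_train_pred)
# equivalent to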
cm[1, 1] / (cm[1, 0] + cm[1, 1])
#0.6511713705958311

`F1_SCORE` : $F1\_Score =\frac{2\cdot Precision \cdot Recall}{Precision + Recall}=\frac{TP}{TP+\frac{FN+FP}{2}}$

from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)
# equivalent to:
cm[1, 1] / (cm[1, 1] + (cm[1, 0] + cm[0, 1]) / 2)
#0.7325171197343846

The F1 score favors classifiers whose precision and recall are similar.
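As a worked check, the three metrics can be recomputed by hand from the four cells of the confusion matrix reported above (TN=53892, FP=687, FN=1891, TP=3530). The variable names are only for illustration.

```python
# Cells of the confusion matrix from the cross-validated SGD predictions in this post.
tn, fp, fn, tp = 53892, 687, 1891, 3530

precision = tp / (tp + fp)  # of everything predicted "5", how much really is a 5
recall = tp / (tp + fn)     # of all real 5s, how many were found

# Two equivalent forms of the F1 score.
f1_harmonic = 2 * precision * recall / (precision + recall)
f1_counts = tp / (tp + (fn + fp) / 2)

print(precision, recall, f1_harmonic)
```

Both F1 forms agree, and the values match the `precision_score`, `recall_score`, and `f1_score` outputs quoted above.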

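The important $F_\beta$ score:

Precision and recall pull against each other. In general, when precision is high, recall tends to be low, and when recall is high, precision tends to be low, so the two need to be balanced. $F_\beta$ is a weighted harmonic mean of precision and recall in which recall is weighted $\beta$ times as heavily as precision.
$Precision$ and $Recall$ describe performance on a single class (here the positive class, 5), whereas $Accuracy$ summarizes the model over all classes. The $F1$ score folds the precision and recall of that class into a single number that is convenient for comparing models.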
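The text does not spell out the formula, so here is a small sketch using the standard definition $F_\beta = \frac{(1+\beta^2)\cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$. The helper name `f_beta` is hypothetical; the precision and recall values are the ones reported earlier in this post.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # both metrics are zero
    return (1 + b2) * precision * recall / denom

p, r = 0.8370879772350012, 0.6511713705958311  # values reported above

f1 = f_beta(p, r, beta=1.0)      # equal weight: the ordinary F1 score
f2 = f_beta(p, r, beta=2.0)      # recall counts more
f_half = f_beta(p, r, beta=0.5)  # precision counts more
print(f2, f1, f_half)
```

Because recall is the weaker of the two metrics here, weighting it more heavily (`beta=2`) lowers the score, and weighting precision more heavily (`beta=0.5`) raises it.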
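Scikit-Learn does not let you set the threshold directly, but it does give access to the decision scores it uses to make predictions. Call the `decision_function()` method, which returns a decision score for each instance, and then make predictions from those scores with any threshold you like: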

y_scores = sgd_clf.decision_function([some_digit])   
# decision score of the trained sgd_clf for the handwritten 5
y_scores
#array([4742.52813158])  # 4742.5 is the decision score

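Trading off precision and recall

With the decision scores above, how can a threshold be used to influence the predictions?

Get the decision score for every instance, then compute precision and recall for all possible thresholds: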

y_scores = cross_val_predict(sgd_clf,X_train,y_train_5,cv=3,method="decision_function")
# Same cross_val_predict function as before, but this time it returns decision scores instead of predictions.

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
# precision_recall_curve computes precision and recall for all possible thresholds.
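To show what "precision and recall for every threshold" means, here is a hand-rolled sketch on six made-up scores. It mirrors the idea behind `precision_recall_curve` but is not its implementation; the helper name and the toy data are hypothetical.

```python
def precision_recall_at(labels, scores, threshold):
    """Predict positive when score >= threshold, then count TP, FP, and FN."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is predicted positive
    recall = tp / (tp + fn)
    return precision, recall

toy_labels = [False, False, True, False, True, True]
toy_scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]

# One (threshold, precision, recall) triple per candidate threshold.
curve = [(t, *precision_recall_at(toy_labels, toy_scores, t)) for t in sorted(toy_scores)]
for t, p, r in curve:
    print(f"threshold={t:5.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold can only lower recall, but precision is not guaranteed to rise at every step: in this toy data it drops from 0.75 to about 0.67 when the threshold moves from 0.5 to 1.0.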

Plotting precision and recall against the threshold with `Matplotlib`

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.legend(loc="center right", fontsize=16)
    plt.xlabel("Threshold", fontsize=16)
    plt.grid(True)                              # grid
    plt.axis([-50000, 50000, 0, 1])             # axis limits

    # Plots thresholds on the x-axis, precisions and recalls on the y-axis


recall_90_precision = recalls[np.argmax(precisions >= 0.90)]
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]

print(recall_90_precision,threshold_90_precision)
# 0.4799852425751706 3370.0194991439557 
# recall and threshold at the first point where precision reaches 0.9

plt.figure(figsize=(8, 4))                                                               
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
# call the plotting function

plt.plot([threshold_90_precision, threshold_90_precision], [0., 0.9], "r:")                   # red dotted line from (3370, 0) to (3370, 0.9)
plt.plot([-50000, threshold_90_precision], [0.9, 0.9], "r:")                                  # red dotted line from (-50000, 0.9) to (3370, 0.9)
plt.plot([-50000, threshold_90_precision], [recall_90_precision, recall_90_precision], "r:")  # red dotted line from (-50000, 0.48) to (3370, 0.48)
plt.plot([threshold_90_precision], [0.9], "ro")                                               # red dot
plt.plot([threshold_90_precision], [recall_90_precision], "ro")                               # red dot
#save_fig("precision_recall_vs_threshold_plot")
plt.show()

The PR curve: plotting precision directly against recall

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.grid(True)

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
plt.plot([recall_90_precision, recall_90_precision], [0., 0.9], "r:")
plt.plot([0.0, recall_90_precision], [0.9, 0.9], "r:")
plt.plot([recall_90_precision], [0.9], "ro")

#save_fig("precision_vs_recall_plot")
plt.show()

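Choosing the threshold to fit the requirement

As the figure above shows, if the project needs precision around 0.9, choose the threshold as follows: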

threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
#3370.0194991439557

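Applying the threshold to get a new prediction vector

y_train_pred_90 = (y_scores >= threshold_90_precision)
# An instance is predicted True (is a 5) when its decision score is at least the chosen threshold (3370 here).
precision_score(y_train_5, y_train_pred_90)
#0.9000345901072293
# Precision re-evaluated on y_train_pred_90 is 0.9000, up from 0.837 before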

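The `ROC` curve

The receiver operating characteristic (ROC) curve is another common tool for evaluating binary classifiers.

It plots the false positive rate (FPR) on the x-axis against recall (the true positive rate, TPR) on the y-axis.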

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)


def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal
    plt.axis([0, 1, 0, 1])                                    
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16) 
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)    
    plt.grid(True)                                            

plt.figure(figsize=(8, 6))                                    
plot_roc_curve(fpr, tpr)
fpr_90 = fpr[np.argmax(tpr >= recall_90_precision)]           
plt.plot([fpr_90, fpr_90], [0., recall_90_precision], "r:")   
plt.plot([0.0, fpr_90], [recall_90_precision, recall_90_precision], "r:")  
plt.plot([fpr_90], [recall_90_precision], "ro")               
#save_fig("roc_curve_plot")                                    # Not shown
plt.show()

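The higher the recall (TPR), the more false positives (FPR) the classifier produces. The dashed line in the figure is the ROC curve of a purely random classifier; a good classifier's ROC curve stays as far from that line as possible, toward the top-left corner.

One way to compare classifiers is to measure the area under the ROC curve (AUC).

A perfect classifier has a ROC_AUC of 1, while a purely random classifier has a ROC_AUC of 0.5.

Scikit-Learn provides a function to compute the ROC_AUC: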

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)

#0.9604938554008616
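The AUC has a convenient interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counting as one half. The sketch below computes it that way on made-up data; the function name is hypothetical, and the quadratic loop is only meant for small examples, not as a replacement for `roc_auc_score`.

```python
def roc_auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # a tie counts as half
    return wins / (len(pos) * len(neg))

toy_labels = [False, False, True, False, True, True]

auc_toy = roc_auc_by_pairs(toy_labels, [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])
auc_perfect = roc_auc_by_pairs(toy_labels, [0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
auc_uninformative = roc_auc_by_pairs(toy_labels, [0.0] * 6)  # same score for everything
print(auc_toy, auc_perfect, auc_uninformative)
```

The perfect ranking gives 1 and the uninformative scores give 0.5, matching the two reference values stated above.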

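Choosing between the `PR` curve and the `ROC` curve

Prefer the PR curve when positives are rare, or when you care more about false positives than false negatives. This example is such a case: positives are few, negatives (not-5 images) are many, and the false positives (not-5 images wrongly recognized as 5) matter more.
Use the ROC curve otherwise. For example, looking back at the ROC curve and ROC AUC above, you might think this classifier is excellent. But that is almost entirely because there are few positives ("is 5") compared with negatives ("not 5"). The PR curve, in contrast, makes it clear that the classifier has plenty of room for improvement (the PR curve should be as close to the top-right corner as possible).
So the binary task above is retried with a random forest classifier, and its ROC AUC, precision, and recall are compared with the SGD results: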

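# Assumed setup (the post omits it), so the exact numbers below may differ.
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
y_scores_forest = y_probas_forest[:, 1]  # probability of the positive class, used as the score
y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
roc_auc_score(y_train_5, y_scores_forest)
#0.9983436731328145
precision_score(y_train_5, y_train_pred_forest)
#0.9905083315756169
recall_score(y_train_5, y_train_pred_forest)
#0.8662608374838591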

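Multiclass classification

A binary classifier distinguishes between only two classes, while a multiclass classifier (also called a multinomial classifier) can distinguish between more than two. Some algorithms, such as random forest classifiers or naive Bayes classifiers, handle multiclass problems directly.

Others, such as support vector machine (SVM) classifiers or linear classifiers, are strictly binary, but several strategies let them perform multiclass classification.

One strategy is OvR (one-versus-rest): in this example, train 10 binary classifiers, one per digit, and assign each image to the class whose classifier scores highest.
The other strategy is OvO (one-versus-one): in this example, train one classifier for every pair of digits (0 vs 1, 0 vs 2, 0 vs 3, ..., 1 vs 2, 1 vs 3, and so on), n*(n-1)/2 in total, which is 45 classifiers!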
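The classifier counts for the two strategies are easy to verify by enumerating the pairs. This is a small counting sketch with illustrative variable names, not scikit-learn code.

```python
from itertools import combinations

digits = range(10)  # the 10 MNIST classes

# OvO: one binary classifier per unordered pair of classes.
ovo_pairs = list(combinations(digits, 2))

# OvR: one binary classifier per class.
n_ovr = len(digits)

n = len(digits)
print(len(ovo_pairs), n * (n - 1) // 2, n_ovr)
```

OvO trains many more classifiers, but each one only sees the training images of its two classes, which is why it suits algorithms that scale poorly with the size of the training set.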

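Scikit-Learn detects when you try to use a binary classifier for a multiclass task and automatically runs OvR or OvO, depending on the algorithm.

For example, SVC uses OvO by default: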

from sklearn.svm import SVC

svm_clf = SVC(gamma="auto", random_state=42)
svm_clf.fit(X_train[:1000], y_train[:1000]) # y_train, not y_train_5
svm_clf.predict([some_digit])  # 5
svm_clf.predict([X[12]])       # 3
svm_clf.predict([X[13]])       # 6

some_digit_scores = svm_clf.decision_function([some_digit])
some_digit_scores

#array([[ 2.81585438,  7.09167958,  3.82972099,  0.79365551,  5.8885703 ,
#         9.29718395,  1.79862509,  8.10392157, -0.228207  ,  4.83753243]])
# the decision score for class 5, 9.29718395, is the highest, so the prediction is 5

svm_clf.classes_
#array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8)

Forcing the OvR strategy

from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SVC(gamma="auto", random_state=42))
ovr_clf.fit(X_train[:1000], y_train[:1000])
ovr_clf.predict([some_digit])
#5

Trying stochastic gradient descent once more:

sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])
sgd_clf.decision_function([some_digit])
# output below; the score for digit 5 is 3839.6369
#array([[-30446.28807622, -20771.5971377 ,  -4069.97324426,
#         -7297.18084344,  -2335.35560779,   3839.63697115,
#        -27845.48223256, -11660.53889926,   -540.54962964,
#        -11918.2854924 ]])

# 3-fold cross-validation
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

#array([0.87365, 0.85835, 0.8689 ])  not bad

# scale the input features, then cross-validate again
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
#array([0.8983, 0.891 , 0.9018])  accuracy improves after feature scaling
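For intuition, standardization rescales each feature column to zero mean and unit variance. The sketch below does this for one column in plain Python; the helper name is hypothetical, and it assumes the population standard deviation, with constant columns mapped to zeros.

```python
from statistics import fmean, pstdev

def standardize_column(values):
    """Rescale one feature column to zero mean and unit (population) variance."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(v - mean) / std for v in values]

pixel_column = [0, 0, 255, 255]  # made-up intensities of one pixel across four images
scaled = standardize_column(pixel_column)
constant = standardize_column([7, 7, 7])
print(scaled, constant)
```

After scaling, a pixel that ranges over 0-255 and one that barely varies contribute on a comparable scale, which keeps the SGD update steps better balanced.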

Error analysis

Compute the confusion matrix for the stochastic gradient descent model above:

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)

#array([[5577,    0,   22,    5,    8,   43,   36,    6,  225,    1],
#       [   0, 6400,   37,   24,    4,   44,    4,    7,  212,   10],
#       [  27,   27, 5220,   92,   73,   27,   67,   36,  378,   11],
#       [  22,   17,  117, 5227,    2,  203,   27,   40,  403,   73],
#       [  12,   14,   41,    9, 5182,   12,   34,   27,  347,  164],
#       [  27,   15,   30,  168,   53, 4444,   75,   14,  535,   60],
#       [  30,   15,   42,    3,   44,   97, 5552,    3,  131,    1],
#       [  21,   10,   51,   30,   49,   12,    3, 5684,  195,  210],
#       [  17,   63,   48,   86,    3,  126,   25,   10, 5429,   44],
#       [  25,   18,   30,   64,  118,   36,    1,  179,  371, 5107]])

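It is a 10×10 matrix. The diagonal is what to look at, but the raw numbers are still hard to read, so plot the confusion matrix with matplotlib's matshow() function.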

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

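In the plot, the diagonal cell for 5 is slightly darker, which suggests two possibilities:

the dataset contains fewer images of the digit 5
the classifier performs worse on the digit 5 than on the other digits

That is only a guess. A better approach is to divide each value in the confusion matrix by the number of images in the corresponding class, which gives error rates. Then fill the diagonal with 0 so that only the errors remain, and redraw this error-focused "confusion matrix".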

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
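The same normalization can be read numerically instead of visually. The sketch below repeats the steps in plain Python on the confusion matrix printed earlier and then looks for the largest error rate; the variable names are illustrative.

```python
# Confusion matrix printed above (rows: true digit, columns: predicted digit).
conf_mx = [
    [5577, 0, 22, 5, 8, 43, 36, 6, 225, 1],
    [0, 6400, 37, 24, 4, 44, 4, 7, 212, 10],
    [27, 27, 5220, 92, 73, 27, 67, 36, 378, 11],
    [22, 17, 117, 5227, 2, 203, 27, 40, 403, 73],
    [12, 14, 41, 9, 5182, 12, 34, 27, 347, 164],
    [27, 15, 30, 168, 53, 4444, 75, 14, 535, 60],
    [30, 15, 42, 3, 44, 97, 5552, 3, 131, 1],
    [21, 10, 51, 30, 49, 12, 3, 5684, 195, 210],
    [17, 63, 48, 86, 3, 126, 25, 10, 5429, 44],
    [25, 18, 30, 64, 118, 36, 1, 179, 371, 5107],
]

# Divide each row by the number of images of that true class to get rates.
norm = [[cell / sum(row) for cell in row] for row in conf_mx]
# Zero the diagonal so that only the errors remain.
for i in range(len(norm)):
    norm[i][i] = 0.0

# Largest single error rate, as (rate, true digit, predicted digit).
worst_rate, worst_true, worst_pred = max(
    (norm[i][j], i, j) for i in range(10) for j in range(10)
)
# Predicted digit that attracts the most errors overall (largest column sum).
col_err = [sum(norm[i][j] for i in range(10)) for j in range(10)]
worst_column = col_err.index(max(col_err))

print(worst_true, worst_pred, round(worst_rate, 3), worst_column)  # prints: 5 8 0.099 8
```

On these numbers, the single largest error is 5s being predicted as 8 (about 9.9% of all 5s), and column 8 collects more errors than any other column, so that is where the brightest cells of the normalized plot should appear.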