Machine Learning: Model Evaluation Methods

Nikko6688

Published 2024-08-02 17:53:15


Article link: https://blog.csdn.net/weixin_57205312/article/details/140875230


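This article uses the MNIST dataset in the form of mnist-original.mat, the file that older versions of scikit-learn downloaded through fetch_mldata.

The first step is to import the packages: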

import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)

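Next, load the dataset.

The older approach was to download the data directly through scikit-learn, but the download kept failing (perhaps the address changed), and fetch_mldata is no longer available (fetch_openml has replaced it). For reference, the old code looked like this: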

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
mnist

Since that method no longer works, download the dataset file first and then load it into the project with loadmat from scipy.io:

Download URL for mnist-original.mat: https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat

from sklearn.datasets import fetch_openml
from scipy.io import loadmat
mnist = loadmat('C:/Users/13491/datasets/openml/mnist-original.mat')
mnist

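Output: a dictionary that includes the 'data' and 'label' arrays used below.

X = mnist['data'].T  # transpose: the file stores samples as columns, so rows and columns are swapped
y = mnist['label'].T.flatten()  # flatten the labels into a 1-D array
y = y.astype(np.uint8)  # convert the labels to uint8

Then check the shape of X. Note that this cell reassigns X and y from the raw arrays, so y becomes a column vector of shape (70000, 1) again instead of the flattened uint8 array built above: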

X,y = mnist['data'],mnist['label']
X = X.T
y = y.T
X.shape

Output: (70000, 784)

y.shape

Output: (70000, 1)

Then split the data into a training set and a test set:
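The layout handling above can be checked on a tiny synthetic array. This is a minimal sketch: fake_data and fake_label are hypothetical stand-ins for mnist['data'] and mnist['label'], assuming the same features-by-samples layout as the .mat file. It shows the difference between a column-vector label array and a flattened one.

```python
import numpy as np

# Hypothetical stand-ins for the .mat contents (not the real MNIST arrays):
# 4 features x 3 samples, i.e. samples stored as columns.
fake_data = np.arange(12).reshape(4, 3)
fake_label = np.array([[0.0, 5.0, 9.0]])  # shape (1, 3): a single row of labels

X_demo = fake_data.T                               # (3, 4): one row per sample
y_col = fake_label.T                               # (3, 1): column vector
y_flat = fake_label.T.flatten().astype(np.uint8)   # (3,): 1-D integer labels

print(X_demo.shape, y_col.shape, y_flat.shape)

# Comparing against 5 keeps the shape of the label array,
# so a column vector gives a 2-D boolean mask and a flat array gives a 1-D one.
print((y_col == 5).shape, (y_flat == 5).shape)
```

Most scikit-learn estimators expect 1-D targets, so the flattened form is usually the more convenient one to pass to fit.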

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Then shuffle the training set:

# Shuffle the training set
import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

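shuffle_index holds a random permutation of the training indices; indexing X_train and y_train with it shuffles both in the same order:

shuffle_index  # the randomly permuted indices used to shuffle the training set

Output: array([58201, 51940, 46388, ..., 13474, 47342, 44629])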
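A small sketch of why a single shared permutation matters: indexing both arrays with the same index array keeps every sample paired with its label. The arrays X_small and y_small and the seed are illustrative assumptions, not data from the article.

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed

# Toy data where row [i, i] belongs to label i, so alignment is easy to check.
X_small = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
y_small = np.array([0, 1, 2, 3])

perm = rng.permutation(len(X_small))          # one permutation of 0..3
X_shuf, y_shuf = X_small[perm], y_small[perm]  # same order applied to both

# Every shuffled row still matches its label.
aligned = bool(np.all(X_shuf[:, 0] == y_shuf))
print(perm, aligned)
```

Shuffling X and y with two separate permutations would break this pairing and silently corrupt the labels.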

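Next comes cross-validation (the original post includes a diagram of cross-validation here). The task is first reduced to a binary one, detecting the digit 5:

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
# Convert the label arrays y_train and y_test into boolean arrays; each element says whether the label equals 5

Look at the first ten values:

y_train_5[:10]

Output: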

array([[False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False]])

from sklearn.linear_model import SGDClassifier  # linear classifier trained with stochastic gradient descent (SGD)
# SGDClassifier is a linear model for classification that iteratively updates its parameters to minimize a loss function.
sgd_clf = SGDClassifier(max_iter=5, random_state=42)
# Fit the model on the training data X_train with the binary labels y_train_5. After training, the fitted model can predict on new data.
sgd_clf.fit(X_train, y_train_5)

Output: the fitted SGDClassifier is displayed.

Use the classifier to make a prediction:

sgd_clf.predict([X[35000]])

Output: array([ True])

Run cross-validation with cross_val_score from sklearn.model_selection:

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf,X_train,y_train_5,cv=3,scoring='accuracy')

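Output: array([0.95555, 0.95465, 0.9365 ])

Confusion Matrix

cross_val_predict produces a cross-validated prediction for every input sample: each sample is predicted by a model trained on the folds that do not contain it.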
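To make the cross_val_score call less of a black box, here is a minimal sketch of the same procedure written as an explicit loop. It runs on a small synthetic dataset from make_classification rather than MNIST, and it assumes that scikit-learn uses an unshuffled StratifiedKFold when cv is an integer and the estimator is a classifier. The names cv_X, cv_y, and manual_scores are made up for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small synthetic binary problem (a stand-in for the MNIST 5-vs-rest task).
cv_X, cv_y = make_classification(n_samples=300, n_features=10, random_state=42)
clf = SGDClassifier(random_state=42)

skf = StratifiedKFold(n_splits=3)
manual_scores = []
for train_idx, val_idx in skf.split(cv_X, cv_y):
    fold_clf = clone(clf)                        # fresh, unfitted copy per fold
    fold_clf.fit(cv_X[train_idx], cv_y[train_idx])
    fold_pred = fold_clf.predict(cv_X[val_idx])  # predict on the held-out fold
    manual_scores.append(np.mean(fold_pred == cv_y[val_idx]))  # fold accuracy

library_scores = cross_val_score(clf, cv_X, cv_y, cv=3, scoring="accuracy")
print(manual_scores)
print(library_scores)
```

Writing the loop by hand is mainly useful when more control is needed than cross_val_score offers, for example to inspect each fold's model.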

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf,X_train,y_train_5,cv=3)

After predicting, check the shape of y_train_pred:

y_train_pred.shape

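Output: (60000,)

The shape of the training set X:

X_train.shape

Output: (60000, 784)

Compute the confusion matrix from the true labels and the cross-validated predictions: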

from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5,y_train_pred)

Output:

array([[52323,  2256],
       [  810,  4611]], dtype=int64)
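In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. A short sketch of reading the four cells, using the numbers printed above (the variable names are chosen for this example):

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[52323, 2256],
               [810, 4611]])

# Row 0 = true "not 5", row 1 = true "5"; columns follow the same order for predictions.
tn, fp, fn, tp = cm.ravel()

total = cm.sum()
accuracy = (tp + tn) / total  # fraction of all samples on the diagonal

print(f"TN={tn} FP={fp} FN={fn} TP={tp} total={total}")
print(f"accuracy={accuracy:.4f}")
```

The accuracy recovered this way is 0.9489, which agrees with the mean of the three fold accuracies reported earlier.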

Compute precision and recall:

from sklearn.metrics import precision_score   # precision
from sklearn.metrics import recall_score      # recall
precision_score(y_train_5,y_train_pred)

Output: 0.6714722586282219

recall_score(y_train_5,y_train_pred)

Output: 0.8505810736026563

Compute the F1 score:

from sklearn.metrics import f1_score
f1_score(y_train_5,y_train_pred)
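As a cross-check, the three metrics can be recomputed by hand from the confusion matrix counts shown earlier. This is a sketch of the standard definitions; the F1 value it prints is derived from those counts, not copied from the original post, which does not show the output of f1_score.

```python
# Counts taken from the confusion matrix above.
tp, fp, fn = 4611, 2256, 810

precision = tp / (tp + fp)  # of everything predicted "5", how much really is a 5
recall = tp / (tp + fn)     # of all real 5s, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both high, which is why it falls closer to the weaker of the two.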

Effect of the Threshold on the Results

y_scores = sgd_clf.decision_function([X[35000]])
y_scores

Output: array([115085.0473917])

# Get decision scores from cross_val_predict; with cv=3 and a classifier, scikit-learn uses stratified k-fold
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5.ravel(), cv=3,
                            method="decision_function")

Look at the first ten decision scores:

y_scores[:10]

Output:

array([-536036.00668134,  -89458.01375601, -681816.53295261,
       -739117.46479804, -856179.53973373, -305344.01318884,
       -293899.72739638, -264011.91700926, -465360.1824235 ,
        -44909.83192426])
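A decision score becomes a class prediction only once it is compared with a threshold. The sketch below applies two thresholds to the ten scores printed above. predict_with_threshold is a hypothetical helper written for this example, and the value -100000 is an arbitrary threshold chosen for illustration; a binary SGDClassifier's own predict corresponds to a threshold of 0.

```python
import numpy as np

# The ten decision scores printed above.
scores = np.array([-536036.00668134, -89458.01375601, -681816.53295261,
                   -739117.46479804, -856179.53973373, -305344.01318884,
                   -293899.72739638, -264011.91700926, -465360.1824235,
                   -44909.83192426])

def predict_with_threshold(scores, threshold):
    """Hypothetical helper: label a sample positive when its score exceeds the threshold."""
    return scores > threshold

at_zero = predict_with_threshold(scores, 0)         # default decision boundary
at_lower = predict_with_threshold(scores, -100000)  # arbitrary lower threshold

print(at_zero.sum(), at_lower.sum())
```

Lowering the threshold turns more samples into positive predictions, which tends to raise recall at the expense of precision; raising it does the opposite.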

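Compute the precision and recall curves along with the corresponding thresholds. The precision_recall_curve function from sklearn.metrics evaluates a classifier at different decision thresholds. It takes two arguments here: the true labels (y_train_5) and the predicted scores (y_scores).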

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Plot precision and recall as functions of the threshold. The code defines a function named plot_precision_recall_vs_threshold that takes three arguments: the precision values, the recall values, and the thresholds.

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

plt.figure(figsize=(8,4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([-700000,700000])
plt.show()
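The [:-1] slices in the plotting function above are needed because precision_recall_curve returns one more precision and recall value than thresholds: the final pair (precision 1, recall 0) has no threshold attached. A minimal sketch on four made-up samples (pr_y and pr_scores are illustrative, not from the article):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Tiny made-up example: two negatives and two positives.
pr_y = np.array([0, 0, 1, 1])
pr_scores = np.array([0.1, 0.4, 0.35, 0.8])

p, r, t = precision_recall_curve(pr_y, pr_scores)

# p and r are one element longer than t.
print(len(p), len(r), len(t))
# The extra last point is always precision 1, recall 0.
print(p[-1], r[-1])
```

Dropping that last element lines the precision and recall arrays up with the thresholds so they can be plotted against the same x-axis.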

Plotting the ROC Curve

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate',fontsize=16)
    plt.ylabel('True Positive Rate',fontsize=16)

plt.figure(figsize=(8,6))
plot_roc_curve(fpr, tpr)
plt.show()

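Compute the area under the ROC curve (AUC), a single-number summary of classifier performance. The roc_auc_score function from sklearn.metrics takes two arguments here: the true labels (y_train_5) and the predicted scores (y_scores).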

from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)

Output:

0.9609693199230956
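One way to read an AUC value is as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The sketch below checks this on four made-up samples by counting correctly ordered pairs and comparing with roc_auc_score; auc_y and auc_scores are illustrative values, not data from the article.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Tiny made-up example: two negatives and two positives.
auc_y = np.array([0, 0, 1, 1])
auc_scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = auc_scores[auc_y == 1]
neg = auc_scores[auc_y == 0]

# Score difference for every (positive, negative) pair.
diff = pos[:, None] - neg[None, :]
# A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise.
pairwise = np.mean((diff > 0) + 0.5 * (diff == 0))

auc_value = roc_auc_score(auc_y, auc_scores)
print(pairwise, auc_value)
```

Under this reading, the value of about 0.96 reported above means the classifier ranks a real 5 above a non-5 in roughly 96% of such pairs.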
