Machine Learning: Model Evaluation

Latest recommended article published on 2024-09-14 20:08:20

阿楷不当程序员

Category: ML. Tags: machine learning, python, artificial intelligence

Article link: https://blog.csdn.net/nivegiveup/article/details/126934777

Model Evaluation

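Model selection: estimate the generalization error of each candidate model, then choose the model with the smallest generalization error.

Reading the Dataset

MNIST is image data: 28×28 grayscale images of shape (28,28,1). fetch_openml returns each image flattened into 784 pixel features.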

import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)

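# Load the MNIST dataset
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', as_frame=True) # as_frame=True returns pandas objects; the code below relies on .iloc/.loc
mnist

# Extract the features and labels
X, y = mnist["data"],mnist["target"]
X.shape # (70000, 784)
y.shape # (70000,)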

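# Build the training and test sets: the first 60,000 samples and the last 10,000
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

# Shuffle the training set so that sample order carries no information (i.i.d. assumption)
shuffle_index = np.random.permutation(60000) # random permutation of the row positions
X_train, y_train = X_train.iloc[shuffle_index],y_train.iloc[shuffle_index] # reorder features and labels together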

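Cross-Validation

When building a model, the original dataset is split into two parts: a training set and a test set. The training set is used to fit the model, and the test set is used to evaluate it.

"Cross-validation" first partitions the dataset D into k mutually exclusive subsets of similar size. Each round uses the union of k-1 subsets as the training set and the remaining subset as the test set, so every subset serves as the test set exactly once.

Common values of k are 5, 10, and 20.

k-fold cross-validation is usually repeated p times with different random partitions, and the final estimate is the mean of the results from those p runs of k-fold cross-validation.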
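To make the partitioning concrete, here is a minimal pure-Python sketch of k-fold splitting repeated p times. The helper names (`k_fold_splits`, `repeated_k_fold_score`) and the toy "majority class" model are illustrative assumptions, not part of scikit-learn or of the original post.

```python
import random

def k_fold_splits(n_samples, k, seed):
    """Yield (train_idx, test_idx) pairs for one shuffled k-fold partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def repeated_k_fold_score(n_samples, k, p, score_fn):
    """Average score_fn(train_idx, test_idx) over p repetitions of k-fold."""
    scores = []
    for repeat in range(p):
        # A different seed per repetition gives a different random partition.
        for train_idx, test_idx in k_fold_splits(n_samples, k, seed=repeat):
            scores.append(score_fn(train_idx, test_idx))
    return sum(scores) / len(scores)

# Toy data (hypothetical): every fifth sample is positive.
labels = [1 if i % 5 == 0 else 0 for i in range(100)]

def majority_class_accuracy(train_idx, test_idx):
    # "Train": find the majority label in the training folds.
    ones = sum(labels[i] for i in train_idx)
    majority = 1 if ones * 2 > len(train_idx) else 0
    # "Test": accuracy of always predicting that label on the held-out fold.
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

mean_accuracy = repeated_k_fold_score(len(labels), k=5, p=3,
                                      score_fn=majority_class_accuracy)
print(mean_accuracy)
```

In practice scikit-learn's own splitters do this work; the sketch only shows that each sample lands in exactly one test fold per repetition.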

# Build a model that decides whether a digit is a 5
y_train_5 = (y_train=='5')
y_test_5 = (y_test=='5') # test labels must come from y_test, not y_train

y_train_5[:10]
'''
Output:
28004    False
30680    False
49100     True
48650    False
7057     False
35832    False
41897    False
3656     False
52633    False
24449    False
'''
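# Import the stochastic gradient descent classifier
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(max_iter=5,random_state=42) # max_iter: maximum passes over the training data; random_state: fixed seed for reproducible results
sgd_clf.fit(X_train,y_train_5) # train

# Predict one sample
sgd_clf.predict(X.loc[[49100]]) # X is a DataFrame; .loc with a list of labels returns a one-row DataFrame

# Import the cross-validation helper
from sklearn.model_selection import cross_val_score
# Note: y_train holds the 10-class digit labels, so this call scores 10-class classification; pass y_train_5 to score the "is it a 5" detector
cross_val_score(sgd_clf,X_train,y_train,cv=3,scoring='accuracy') # arguments: model, training features, labels, 3 folds, accuracy as the metric
'''
Output: array([0.8709, 0.8549, 0.8543])	accuracy on each of the three folds
'''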

What matters most in cross-validation is how the model is scored: choosing a different evaluation metric leads to a different assessment.
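A quick way to see why the metric matters is to score a trivial baseline. The sketch below assumes that 5,421 of the 60,000 training images are 5s, which is what the confusion matrix later in this article implies (1297 + 4124); the variable names are illustrative.

```python
# Assumed class counts, taken from the confusion matrix shown later in the article.
n_total = 60000
n_fives = 1297 + 4124  # actual 5s = false negatives + true positives

# A "classifier" that always answers "not 5" is right on every non-5 image.
baseline_accuracy = (n_total - n_fives) / n_total
# It never finds a single 5, so its recall on the positive class is zero.
baseline_recall = 0 / n_fives

print(baseline_accuracy, baseline_recall)
```

An accuracy of about 0.91 with a recall of 0 shows that accuracy alone can look good on imbalanced data, which is the motivation for the metrics in the following sections.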

Cross-Validation by Hand

# Hand-written equivalent of cross_val_score
from sklearn.model_selection import StratifiedKFold # stratified fold splitter
from sklearn.base import clone # copies an estimator without its fitted state

# Instantiate the splitter
skfolds = StratifiedKFold(n_splits=3,shuffle=True,random_state=42) # arguments: 3 folds, shuffle the samples, fixed seed
# Iterate over the folds
for train_index,test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf) # fresh, unfitted copy of the classifier
    # Training and validation parts of this fold
    X_train_folds = X_train.iloc[train_index] # validation happens entirely inside the training set
    y_train_folds = y_train_5.iloc[train_index]
    X_test_folds = X_train.iloc[test_index]
    y_test_folds = y_train_5.iloc[test_index]
    
    # Train
    clone_clf.fit(X_train_folds, y_train_folds) 
    # Predict
    y_pred = clone_clf.predict(X_test_folds) 
    n_correct = sum(y_pred == y_test_folds) # number of predictions that match the true labels
    print(n_correct/len(y_pred)) # accuracy
    '''
    Output: 0.90865	0.96565		 0.959
    '''

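Confusion Matrix

A confusion matrix is a table that summarizes a classifier's predictions: it tallies the records in a dataset by two criteria, their true class and the class the model predicted. Rows correspond to true classes and columns to predicted classes.

Take picking out the girls from a class of students as an example, with girls as the positive class:

true_positives: girls correctly selected as the positive class.

false_positives: boys wrongly selected as the positive class.

false_negatives: girls wrongly assigned to the negative class.

true_negatives: boys correctly assigned to the negative class.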

from sklearn.model_selection import cross_val_predict
# Get out-of-fold predictions for every training sample
y_train_pred = cross_val_predict(sgd_clf,X_train,y_train_5,cv=3) # arguments: classifier, training features, labels, 3 folds

y_train_pred.shape # (60000,)
X_train.shape # (60000, 784)

# Import the confusion matrix function
from sklearn.metrics import confusion_matrix 
confusion_matrix(y_train_5,y_train_pred) # arguments: true labels, predicted labels
'''
Output:
array([[53655,   924],
       [ 1297,  4124]], dtype=int64)
'''

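Interpreting the result:

negative class [[ true negatives , false positives ],

positive class [ false negatives , true positives ]]

true_positives: 4124 images were correctly classified as 5
false_positives: 924 images were wrongly classified as 5 (not a 5, but predicted as 5)
false_negatives: 1297 images were wrongly classified as non-5 (actually a 5, but predicted as not 5)
true_negatives: 53655 images were correctly classified as non-5

A perfect classifier would have only true positives and true negatives: nonzero entries on the main diagonal and zeros everywhere else.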
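The layout above can be reproduced by hand on a tiny example. This is a minimal sketch with made-up labels; `confusion_counts` is a hypothetical helper, not a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for 0/1 (or boolean) labels."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        # Row index = true class, column index = predicted class.
        matrix[int(truth)][int(pred)] += 1
    return matrix

# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

matrix = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = matrix
print(matrix)
```

Unpacking the rows as `(tn, fp), (fn, tp)` mirrors the row/column convention described in the text: rows are true classes, columns are predicted classes.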

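Precision and Recall

Precision: analogous to "what fraction of the retrieved information is actually of interest to the user" (of the samples predicted as 5, how many really are 5).

Recall: analogous to "how much of the information the user is interested in was retrieved" (of the samples that really are 5, how many were predicted as 5).

$precision = \frac{TP}{TP+FP}$, $recall = \frac{TP}{TP+FN}$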

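# Import the precision and recall functions
from sklearn.metrics import precision_score,recall_score
# Note: the values recorded below do not match the confusion matrix printed earlier
# (4124/5048 ≈ 0.817 and 4124/5421 ≈ 0.761); they appear to come from a different run
precision_score(y_train_5,y_train_pred) # precision
# 0.8711352955725946
recall_score(y_train_5,y_train_pred) # recall
# 0.6496956281128943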
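As a cross-check, both formulas can be applied directly to the four counts of the confusion matrix printed earlier in this article. This is a plain-arithmetic sketch; the variable names are illustrative.

```python
# Counts copied from the confusion matrix shown earlier in the article.
tn, fp = 53655, 924
fn, tp = 1297, 4124

precision = tp / (tp + fp)  # of everything predicted as 5, the share that is really 5
recall = tp / (tp + fn)     # of all real 5s, the share that was predicted as 5

print(round(precision, 4), round(recall, 4))
```

Working from the counts makes it easy to see which error type each metric penalizes: false positives lower precision, false negatives lower recall.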

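A Metric That Combines Precision and Recall

Precision and recall can be combined into a single metric called the F1 score, which is their harmonic mean. Because the harmonic mean is dominated by the smaller value, a classifier gets a high F1 score only if both precision and recall are high.
$F_1 = \frac{2}{\frac{1}{precision}+\frac{1}{recall}}=2\times\frac{precision\times recall}{precision+recall}=\frac{TP}{TP+\frac{FN+FP}{2}}$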

from sklearn.metrics import f1_score
f1_score(y_train_5,y_train_pred) # F1
# 0.7442941673710904
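The harmonic-mean form can be checked against the precision and recall values recorded above. The sketch below is plain arithmetic; `f1_from_rates` is an illustrative helper name, and the lopsided pair at the end is a made-up example.

```python
def f1_from_rates(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as recorded in this article.
precision = 0.8711352955725946
recall = 0.6496956281128943
f1 = f1_from_rates(precision, recall)
print(f1)

# Made-up lopsided case: perfect precision, very poor recall.
lopsided_f1 = f1_from_rates(1.0, 0.1)
lopsided_mean = (1.0 + 0.1) / 2
print(lopsided_f1, lopsided_mean)
```

The second pair shows the design choice behind F1: the arithmetic mean of 1.0 and 0.1 is 0.55, while the harmonic mean stays close to the weaker of the two values.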

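Effect of the Threshold

The threshold is the decision boundary: it determines when y is predicted as the positive class and when as the negative class.

Lowering the threshold admits more predicted positives: FP increases, so precision generally drops, while TP increases, so recall rises.

Raising the threshold has the opposite effect: FP decreases, so precision generally rises, while TP decreases, so recall drops.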

# Confidence score (signed distance to the decision boundary) for one sample
y_scores = sgd_clf.decision_function(X.loc[[35000]])
y_scores
# array([-232764.96231592])

# Decision boundary
t = 0 # custom threshold
y_pred = (y_scores >t)
y_pred
# array([False])

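Precision and Recall across Thresholds

# Import the precision-recall curve function
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve
# Decision scores for the whole training set (the y_scores computed above covers a single sample only)
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method='decision_function')
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
# Every distinct score is tried as a threshold; precisions and recalls have one more entry than thresholds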
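The trade-off can be seen on a handful of made-up scores. The following pure-Python sketch sweeps three thresholds; `precision_recall_at` is a hypothetical helper, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall_at(threshold, labels, scores):
    """Precision and recall when samples with score > threshold are predicted positive."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
    fn = sum(1 for y, p in zip(labels, predicted) if not p and y == 1)
    # Assumed convention: precision is 1.0 when no sample is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Made-up labels and decision scores for illustration only.
labels = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [-3.0, -1.5, -0.5, 0.2, 0.8, 1.1, 2.4, 3.0]

results = {t: precision_recall_at(t, labels, scores) for t in (-2.0, 0.0, 2.0)}
for t, (p, r) in results.items():
    print(f"threshold={t:+.1f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, raising the threshold from -2 to 2 moves precision up and recall down, matching the behaviour described in the previous section.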

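ROC Curve

The receiver operating characteristic (ROC) curve is a common evaluation tool for binary classification.

It is very similar to the precision/recall curve, but instead of plotting precision against recall, it plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR).
To plot the ROC curve, first use the roc_curve() function to compute the TPR and FPR at various thresholds:

$TPR = \frac{TP}{TP+FN}$, $FPR = \frac{FP}{TN+FP}$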
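Each point on the ROC curve is one (FPR, TPR) pair. As a sketch, the pair for the default threshold can be computed from the confusion matrix counts given earlier in this article; `roc_point` is an illustrative helper name.

```python
def roc_point(tn, fp, fn, tp):
    """Return (FPR, TPR) for one confusion matrix."""
    tpr = tp / (tp + fn)  # share of actual positives that were caught
    fpr = fp / (tn + fp)  # share of actual negatives that were flagged by mistake
    return fpr, tpr

# Counts copied from the confusion matrix shown earlier in the article.
fpr_point, tpr_point = roc_point(tn=53655, fp=924, fn=1297, tp=4124)
print(round(fpr_point, 4), round(tpr_point, 4))
```

Because negatives vastly outnumber positives here, 924 false positives translate into a very small FPR, which is one reason ROC curves can look optimistic on imbalanced data.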

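# Import the ROC function
from sklearn.metrics import roc_curve
# y_scores must hold decision scores for all of X_train, e.g. from cross_val_predict(..., method='decision_function')
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

# Plot the ROC curve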
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
plt.show()

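[Figure: ROC curve of the SGD classifier, True Positive Rate against False Positive Rate, with the diagonal drawn as a dashed line]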

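The dashed line is the ROC curve of a purely random classifier; a good classifier stays as far from that line as possible (toward the top-left corner).

One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier has a ROC AUC of 1, while a purely random classifier has a ROC AUC of 0.5. Scikit-Learn provides a function to compute the ROC AUC:

# Import the AUC function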
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
# 0.9598058535696421
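To show what the area measures, here is a minimal pure-Python sketch that builds ROC points by lowering the threshold one sample at a time and integrates them with the trapezoid rule. It assumes no tied scores; `roc_points` and `auc_trapezoid` are hypothetical helpers, not scikit-learn functions, and the data is made up.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists, admitting one more predicted positive per step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Visit samples from highest to lowest score (assumes all scores are distinct).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
    return area

# Made-up labels and scores for illustration only.
labels = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [-3.0, -1.5, -0.5, 0.2, 0.8, 1.1, 2.4, 3.0]
toy_auc = auc_trapezoid(*roc_points(labels, scores))

# A classifier that ranks every positive above every negative.
perfect_auc = auc_trapezoid(*roc_points([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
print(toy_auc, perfect_auc)
```

The toy AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which is a common way to interpret the metric.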
