机器学习笔记：模型评估

最新推荐文章于 2024-07-01 16:45:28 发布

lambda666

最新推荐文章于 2024-07-01 16:45:28 发布

阅读量2.1k

点赞数

分类专栏：机器学习文章标签：机器学习深度学习人工智能模型评估

本文链接：https://blog.csdn.net/u012386878/article/details/88884762

版权

机器学习专栏收录该内容

3 篇文章 0 订阅

订阅专栏

分类模型性能评估指标

混淆矩阵

混淆矩阵就是把模型对样本的预测结果统计成如下表格的形式

混淆矩阵一般都是针对二分类问题，如果是多分类问题，则可以把需要关注的那个类别作为正类，其他类别作为负类，就可转化为二分类问题

	actual positive	actual negative
predicted positive	TP	FP
predicted negative	FN	TN

混淆矩阵中的四个值：

True Positive(TP)：被模型预测为正的正样本数；
False Positive(FP)：被模型预测为正的负样本数；
False Negative(FN)：被模型预测为负的负样本数；
True Negative(TN)：被模型预测为负的正样本数；

其中T代表预测正确，F代表预测错误，P代表预测为正例，N代表预测为负例

四个值之和为样本总数： $T P + F P + F N + T N = 样本总数$

混淆矩阵可以用韦恩图表示如下：

混淆矩阵

从韦恩图中可以看出，最理想的情况是黑色虚线框与红色实线框完全重合；

基本指标

下面4个评估指标都是依据混淆矩阵得出

Recall

$\frac{TP}{TP+FN}$ (1)

查全率、召回率：真实正样本中，被预测为正的样本所占的比率；

“宁可错杀一千，不可放过一个”体现的就是高查全率。一点机会都不放过，尽可能把样本预测为正样本，减少遗漏，这样可以提高查全率；

有些疾病的筛查时可能更看重查全率，漏诊会耽误最佳治疗时机，后果严重；
在上面的韦恩图中，要提高查全率，可以把黑色虚线框放大，让它完全包含红色实线框；

Precision

$\frac{TP}{TP+FP}$ (2)

查准率、精准率：预测为正的样本中，真实正样本所占的比率；

“宁缺毋滥”体现的就是高查准率。除非很有把握，否则就不要把样本预测为正样本，这样可以提高查准率；

嫌疑人定罪时可能更加看重查准率，需要证据确凿，不能冤枉好人；
在上面的韦恩图中，要提高查准率，可以把黑色虚线框缩小，让它只在红色实线框以内；

True Positive Rate

$\frac{TP}{TP+FN}$ (3)

真正率：与 $R e c a l l$ 等价

False Positive Rate

$\frac{FP}{FP+TN}$ (4)

假正率：真实负样本中，被预测为正的样本所占的比率；

假正即把假的当成真的；

假正率和查准率在意思上成反比的关系，查准率高，自然假正率就低。但是两者在公式上没有严格的反比关系；
在上面的韦恩图中，要降低假正率，可以把黑色虚线框缩小，让它只在红色实线框以内；

注意， $P r e c i s i o n$ 和 $A c c u r a c y$ (准确率)不一样， $P r e c i s i o n$ 针对部分样本，而 $A c c u r a c y$ 针对所有样本。 $A c c u r a c y$ 指标较为笼统，而 $R e c a l l$ 和 $P r e c i s i o n$ 指标则对不同性质的问题更加具有针对性

$\frac{TP+TN}{TP+FN+FP+TN}$ (5)

查全率和查准率是一对矛盾。一般而言，查准率高时，查全率往往偏低；而查全率高时，查准率往往偏低。如果想要提高查全率，可以尽量把样本预测为正样本，最极端的情况是把所有的样本都预测为正样本，这样任何正样本都不会被遗漏；但是如此一来，很多真实负样本也被错误地预测为正样本了，查准率会降低。如果想要提高查准率，可以只在最有把握的情况下才把样本预测为正样本，保证预测出的正样本尽可能准确；但是这样也会导致很多没有大把握的真实正样本被预测为负样本，查全率降低了。

PRC

由于查全率和查准率是一对矛盾，想要同时提高查全率和查准率一般是不可能的，需要综合两者之间的平衡。Precision Recall Curve(PRC，PRC曲线)，它以查准率为Y轴，以查全率为X轴，它可以综合评价整体的评估指标。

PRC曲线

PR曲线是一条总体趋势递减的曲线，它可以直观地展现出一个模型在样本总体上的查全率和查准率的关系。不同模型可以得出不同的PR曲线，通过PR曲线可以比较模型的优略，PR曲线越靠近右上角的点(1,1)处说明模型的性能越好。

如上图，模型B的PR曲线被模型C的PR曲线完全包围，则说明模型B更优。模型A和模型B的PR曲线有交叉，则不能直接判断两个模型的优略，这时可以用PR曲线下的面积来评定，该指标一定程度上表征了查准率和查全率取得“双高”的的程度。

但是，一般PR曲线下面积不易计算。这时人们引入了平衡点来度量，它是PR曲线与斜率为1的直线的交点，即“查全率=查准率”的点，在平衡点处的查全率查准率取值越大，说明模型性能越好。由此上图中模型A优于模型B。

F1-score

另外一个综合考虑查全率和查准率的指标是F1-score，它比平衡点更加常用。它是查全率和查准率的调和平均数（倒数的平均值的倒数）。

$\frac{1} {\frac{1}{2}(\frac{1}{Recall} + \frac{1}{Precision})} = \frac{2 * Recall * Precision} {Recall + Precision}$

另外还有F2-score和F0.5-score，F1-score认为查全率和查准率同等重要，F2-score认为查全率的重要程度是查准率的2倍，F0.5-score认为查全率的重要程度是查准率的一半。它们都是查全率和查准率的加权调和平均数。

$F_\beta=(1+\beta^2)\frac{Recall * Precision}{Recall + \beta^2 * Precision}$

思考：调和平均数的意义是什么？

调和平均值的意义

另外还有G-score，它是查全率和查准率的几何平均数。

$\sqrt{Recall * Precision}$

ROC

ROC全称是“受试者工作特征”（Receiver Operating Characteristic）曲线，它以“真正率”(TPR)为Y轴，以“假正率”(FPR)为X轴,它是一条总体趋势向上的曲线。真正率和查全率等价。

ROC曲线

可以这样理解ROC曲线：模型具有较高查全率的时候，其假正率越低越好。就是尽可能不让正例遗漏，但又不能把太多负例误判为正例。

对角线上的真正率和假正率持平，对应随机猜测模型，(0,1)点对应理想模型。

PRC由查准率和查全率得到，ROC由真正率和假正率得到，又查准率等价于真正率，查全率“反比”于假正率，所以可以认为PRC与ROC等价，他们都在说明同一个问题。

在模型比较时，与PR曲线类似，当一个模型的ROC曲线被另一个模型的曲线包住，则后者性能优于前者；若两个模型的ROC曲线有交叉，则可比较ROC曲线下面积。

AUC

Area Under Curve，ROC曲线下面积。AUC的值越高，模型性能越好。若模型分类性能极好，则AUC为1。但现实中不存在这么完美的模型，一般AUC在0.5到1之间。随机猜测时AUC的值为0.5左右，如果AUC值小于0.5，则模型比随机猜测还差，有可能是分类标签取反了。

IOU

在目标检测任务中常用的评价指标是IOU，即Intersection over Union（交并比）。IOU的计算方法：是用预测框与标注框的交集面积除以其并集面积，IOU越大，说明目标检测效果越好，如下图所示。

IOU

这张图可以对比上面的韦恩图，即可得出IOU的原理其实和混淆矩阵相似，以上的分析同样适用于IOU。

代码分析

在scikit-learn官网教程找到一个例子。使用SVM来训练鸢尾花数据集，最终结果有3个分类。对每个分类可做出ROC曲线。

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle

from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2]) #将类标签转换为one-hot编码
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)] #原样本的特征维数是n_features，现在原维数的基础上在后面增加了200*n_features维数的特征，特征取随机值

# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)

# Learn to predict each class against the other
# OneVsRestClassifier函数把三分类问题转为3个二分类问题
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes): # 计算每个类的tpr、tpr、auc
    # roc_curve函数第一个参数为真实标签，第二个参数为分类器输出的分类概率值
    # roc_curve函数返回值依次为fpr序列、tpr序列、分类阈值序列，三个值的关系是模型在某分类阈值下的fpr与tpr值，函数自动使用不同阈值来计算fpr和tpr，最后得到值序列
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i]) # fpr和tpr是序列，但是auc是一个值

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel()) # ravel函数把多维数组变为一维数组，即把三个类别混在一起，计算平均的fpr、tpr
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

对特定分类画出ROC曲线。

plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

二分类ROC实例

对多分类画出ROC曲线。

# Compute macro-average ROC curve and ROC area

# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at this points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute AUC
mean_tpr /= n_classes

fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)

plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)

colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()

多分类ROC实例

roc_curve函数核心代码分析。

def roc_curve(y_true, y_score, pos_label=None, sample_weight=None,
    drop_intermediate=True):

    fps, tps, thresholds = _binary_clf_curve(
        y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)

    if drop_intermediate and len(fps) > 2:
        optimal_idxs = np.where(np.r_[True,
                                      np.logical_or(np.diff(fps, 2),
                                                    np.diff(tps, 2)),
                                      True])[0]
        fps = fps[optimal_idxs]
        tps = tps[optimal_idxs]
        thresholds = thresholds[optimal_idxs]

    if tps.size == 0 or fps[0] != 0 or tps[0] != 0:
        # Add an extra threshold position if necessary
        # to make sure that the curve starts at (0, 0)
        tps = np.r_[0, tps]
        fps = np.r_[0, fps]
        thresholds = np.r_[thresholds[0] + 1, thresholds]

    if fps[-1] <= 0:
        warnings.warn("No negative samples in y_true, "
                      "false positive value should be meaningless",
                      UndefinedMetricWarning)
        fpr = np.repeat(np.nan, fps.shape)
    else:
        fpr = fps / fps[-1]

    if tps[-1] <= 0:
        warnings.warn("No positive samples in y_true, "
                      "true positive value should be meaningless",
                      UndefinedMetricWarning)
        tpr = np.repeat(np.nan, tps.shape)
    else:
        tpr = tps / tps[-1]

    return fpr, tpr, thresholds


def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None):
    # sort scores and corresponding truth values
    desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1] # 根据模型分类概率来排序
    y_score = y_score[desc_score_indices]
    y_true = y_true[desc_score_indices]

    if sample_weight is not None:
        weight = sample_weight[desc_score_indices]
    else:
        weight = 1.

    # y_score typically has many tied values. Here we extract
    # the indices associated with the distinct values. We also
    # concatenate a value for the end of the curve.
    distinct_value_indices = np.where(np.diff(y_score))[0] # 分类阈值序列根据y_score来取
    threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

    # accumulate the true positives with decreasing threshold
    tps = stable_cumsum(y_true * weight)[threshold_idxs] # 用分类阈值来划分已经排好序的y_score序列，然后统计
    if sample_weight is not None:
        # express fps as a cumsum to ensure fps is increasing even in
        # the presence of floating point errors
        fps = stable_cumsum((1 - y_true) * weight)[threshold_idxs]
    else:
        fps = 1 + threshold_idxs - tps
return fps, tps, y_score[threshold_idxs]