Scikit-learn: Classification Model Evaluation

http://blog.csdn.net/pipisorry/article/details/52250760

Model evaluation: quantifying the quality of predictions

There are 3 different APIs for evaluating the quality of a model's predictions: the estimator score method, the scoring parameter of cross-validation tools, and the metric functions in the sklearn.metrics module.

Finally, Dummy estimators are useful to get a baseline value of those metrics for random predictions.

For “pairwise” metrics, between samples and not estimators or predictions, see the Pairwise metrics, Affinities and Kernels section.

The scoring parameter: defining model evaluation rules

Scoring                        Function                           Comment
Classification
'accuracy'                     metrics.accuracy_score
'average_precision'            metrics.average_precision_score
'f1'                           metrics.f1_score                   for binary targets
'f1_micro'                     metrics.f1_score                   micro-averaged
'f1_macro'                     metrics.f1_score                   macro-averaged
'f1_weighted'                  metrics.f1_score                   weighted average
'f1_samples'                   metrics.f1_score                   by multilabel sample
'neg_log_loss'                 metrics.log_loss                   requires predict_proba support
'precision' etc.               metrics.precision_score            suffixes apply as with 'f1'
'recall' etc.                  metrics.recall_score               suffixes apply as with 'f1'
'roc_auc'                      metrics.roc_auc_score
Clustering
'adjusted_rand_score'          metrics.adjusted_rand_score
Regression
'neg_mean_absolute_error'      metrics.mean_absolute_error
'neg_mean_squared_error'       metrics.mean_squared_error
'neg_median_absolute_error'    metrics.median_absolute_error
'r2'                           metrics.r2_score

Usage example: cross_val_score(clf, X, y, scoring='neg_log_loss')
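
For example, a minimal runnable sketch of passing a scoring string to cross_val_score (the synthetic data and LogisticRegression here are illustrative, not from the original post):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000)
# 'neg_log_loss' needs predict_proba; higher (less negative) is better
print(cross_val_score(clf, X, y, cv=5, scoring='neg_log_loss'))
print(cross_val_score(clf, X, y, cv=5, scoring='roc_auc'))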

Note:

1. When computing AUC, note that sklearn.metrics.roc_auc_score(y_actual, y_pred) is the function that computes ROC AUC, whereas sklearn.metrics.auc(x, y) computes the area between a polyline and the x-axis, where x and y are the coordinates of the points on the polyline. Also note that y_actual must be binary, not continuous; if it is continuous, you will have to write your own area computation based on sklearn.metrics.auc.

2. In precision_score, recall_score and f1_score, y_pred must consist of 0/1 values, not continuous values, otherwise an error is raised. Conversely, roc_auc_score expects continuous scores.
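
A small sketch illustrating both notes (toy values chosen for illustration):

import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # continuous scores are fine for ROC AUC

print(metrics.roc_auc_score(y_true, y_score))   # 0.75

# metrics.auc only integrates a curve given its x/y coordinates
fpr, tpr, _ = metrics.roc_curve(y_true, y_score)
print(metrics.auc(fpr, tpr))                    # 0.75 again, the area under the ROC polyline

# precision_score needs hard 0/1 predictions, not scores
print(metrics.precision_score(y_true, (y_score > 0.5).astype(int)))   # 1.0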

[Common cases: predefined values]

Custom scoring functions

>>> import numpy as np
>>> def my_custom_loss_func(ground_truth, predictions):
...     diff = np.abs(ground_truth - predictions).max()
...     return np.log(1 + diff)
...
>>> # loss_func will negate the return value of my_custom_loss_func,
>>> #  which will be np.log(2), 0.693, given the values for ground_truth
>>> #  and predictions defined below.
>>> from sklearn.metrics import make_scorer
>>> loss  = make_scorer(my_custom_loss_func, greater_is_better=False)
>>> score = make_scorer(my_custom_loss_func, greater_is_better=True)
>>> ground_truth = [[1, 1]]
>>> predictions  = [0, 1]
>>> from sklearn.dummy import DummyClassifier
>>> clf = DummyClassifier(strategy='most_frequent', random_state=0)
>>> clf = clf.fit(ground_truth, predictions)
>>> loss(clf,ground_truth, predictions) 
-0.69...
>>> score(clf,ground_truth, predictions) 
0.69...

Custom scoring functions for cross-validation

def rocAucScorer(*args):
    '''
    Custom ROC-AUC scorer: rocAucScorer(clf, x_test, y_true)
    :param y_true: ground-truth labels of the test set
    :param x_test: test features
    '''
    import numpy as np
    from sklearn import metrics
    # compare predictions against the ground truth:
    # binarize y_true and clip the predictions into [0, 1]
    fun = lambda yt, ys: metrics.roc_auc_score([1.0 if _ > 0.0 else 0.0 for _ in yt],
                                               np.select([ys < 0.0, ys > 1.0, True],
                                                         [0.0, 1.0, ys]))

    return metrics.make_scorer(fun, greater_is_better=True)(*args)

[Defining your scoring strategy from metric functions]

Alternatively, you can generate even more flexible model scorers by constructing your own scoring object from scratch, without using the make_scorer factory. For a callable to be a scorer, it needs to meet the protocol specified by the following two rules:

  • It can be called with parameters (estimator, X, y), where estimator is the model that should be evaluated, X is validation data, and y is the ground truth target for X (in the supervised case) or None (in the unsupervised case).
  • It returns a floating point number that quantifies the estimator prediction quality on X, with reference to y. Again, by convention higher numbers are better, so if your scorer returns loss, that value should be negated.
def rocAucScorer(clf, x_test, y_true):
    '''
    Custom ROC-AUC scorer following the (estimator, X, y) protocol
    :param y_true: ground-truth labels of the test set
    :param x_test: test features
    '''
    import numpy as np
    from sklearn import metrics
    ys = clf.predict(x_test)
    # binarize y_true and clip the predictions into [0, 1]
    score = metrics.roc_auc_score([1.0 if _ > 0.0 else 0.0 for _ in y_true],
                                  np.select([ys < 0.0, ys > 1.0, True],
                                            [0.0, 1.0, ys]))
    return score
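
Such a callable can be passed directly as the scoring argument. A usage sketch (synthetic data and LogisticRegression are illustrative; it assumes the rocAucScorer defined above is in scope):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=3, scoring=rocAucScorer))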

[Implementing your own scoring object]


Classification model evaluation

These metrics mainly apply to binary classification; some also support multi-class and multi-label classification.

Classification metrics

Some of these are restricted to the binary classification case:

matthews_corrcoef(y_true, y_pred[, ...]): Compute the Matthews correlation coefficient (MCC) for binary classes
precision_recall_curve(y_true, probas_pred): Compute precision-recall pairs for different probability thresholds
roc_curve(y_true, y_score[, pos_label, ...]): Compute Receiver operating characteristic (ROC)

Others also work in the multiclass case:

cohen_kappa_score(y1, y2[, labels, weights]): Cohen's kappa: a statistic that measures inter-annotator agreement
confusion_matrix(y_true, y_pred[, labels, ...]): Compute confusion matrix to evaluate the accuracy of a classification
hinge_loss(y_true, pred_decision[, labels, ...]): Average hinge loss (non-regularized)

Some also work in the multilabel case:

accuracy_score(y_true, y_pred[, normalize, ...]): Accuracy classification score
classification_report(y_true, y_pred[, ...]): Build a text report showing the main classification metrics
f1_score(y_true, y_pred[, labels, ...]): Compute the F1 score, also known as balanced F-score or F-measure
fbeta_score(y_true, y_pred, beta[, labels, ...]): Compute the F-beta score
hamming_loss(y_true, y_pred[, labels, ...]): Compute the average Hamming loss
jaccard_similarity_score(y_true, y_pred[, ...]): Jaccard similarity coefficient score
log_loss(y_true, y_pred[, eps, normalize, ...]): Log loss, aka logistic loss or cross-entropy loss
precision_recall_fscore_support(y_true, y_pred): Compute precision, recall, F-measure and support for each class
precision_score(y_true, y_pred[, labels, ...]): Compute the precision
recall_score(y_true, y_pred[, labels, ...]): Compute the recall
zero_one_loss(y_true, y_pred[, normalize, ...]): Zero-one classification loss

And some work with binary and multilabel (but not multiclass) problems:

average_precision_score(y_true, y_score[, ...]): Compute average precision (AP) from prediction scores
roc_auc_score(y_true, y_score[, average, ...]): Compute Area Under the Curve (AUC) from prediction scores

[Classification metrics]

Parameters

y_true

1d array-like, or label indicator array / sparse matrix
Ground truth (correct) target values.

y_pred

1d array-like, or label indicator array / sparse matrix
Estimated targets as returned by a classifier.

y_true and y_pred can be given either as labels, e.g. ['a', 'b', 'b', 'b', 'c', 'c'], or as indicator (multi-hot) arrays, e.g. [[0, 1, 0, 0, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]]. For a multi-class y_true given in one-hot form, this is equivalent to passing y_true.argmax(axis=1), i.e. the class-id (or class-string) form. (A multi-hot input should likewise be equivalent to the multi-label class-list form, e.g. [['a', 'b'], ['b'], ['b', 'c'], ['b'], ['c'], ['c']], possibly requiring alignment?)
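
A quick check of the one-hot equivalence claim (toy data): passing a one-hot indicator matrix gives the same macro F1 as passing the argmax class ids.

import numpy as np
from sklearn import metrics

y_true_1h = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_pred_1h = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 1], [0, 1, 0]])

# one-hot indicator input (treated as multilabel-indicator) ...
print(metrics.f1_score(y_true_1h, y_pred_1h, average='macro'))
# ... gives the same value as the class-id form
print(metrics.f1_score(y_true_1h.argmax(axis=1), y_pred_1h.argmax(axis=1), average='macro'))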

zero_division

{“warn”, 0.0, 1.0, np.nan}, default=”warn”
Sets the value to return when there is a zero division.

Notes:
  • If set to "warn", this acts like 0, but a warning is also raised.
  • If set to np.nan, such values will be excluded from the average.
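
A tiny sketch of the zero_division behaviour (toy values): when the classifier never predicts the positive class, precision is undefined and this parameter decides what is returned.

from sklearn import metrics

y_true = [1, 0, 1]
y_pred = [0, 0, 0]   # no positive predictions at all

print(metrics.precision_score(y_true, y_pred, zero_division=0))    # 0.0, without the warning
print(metrics.precision_score(y_true, y_pred, zero_division=1.0))  # 1.0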

Parameters available for multiclass and multilabel problems:

The average parameter

average : {'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary'
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

        average='binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

When a binary metric is extended to a multiclass or multilabel problem, the data are treated as a collection of binary problems, one for each class. There are then several ways to average the binary metric calculations across the set of classes, each of which may be useful in some scenario. Where available, you should select among them using the average parameter.

        average="macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, so macro-averaging will over-emphasize the typically low performance on an infrequent class.
        average="micro" gives each sample-class pair an equal contribution to the overall metric (except as a result of sample_weight). Rather than summing the metric per class, it sums the dividends and divisors that make up the per-class metrics to calculate a single overall quotient. Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.
        average="weighted" accounts for class imbalance by computing the average of binary metrics in which each class's score is weighted by its presence in the true data sample, i.e. by its support (the number of true instances of each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
        average="samples" applies only to multilabel problems. It does not calculate a per-class measure; instead it calculates the metric over the true and predicted classes for each sample in the evaluation data and returns their (sample_weight-weighted) average. (In my multilabel tests, 'samples' and 'micro' gave the same values; I am not sure whether they are equivalent.)
        average=None returns an array with the score for each class.

The labels parameter: array-like, default=None
The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

Use labels to select the classes you want included in the computation. For example, with original labels = [0, 1, 2, 3], setting labels = [2, 3] makes accuracy, micro_*, macro_*, weighted_*, per-class (average=None) scores and classification_report consider only the specified labels and ignore the rest. Note that the confusion matrix over all labels is presumably still computed internally; only the acc/micro/macro aggregation is restricted to the specified labels, and average=None simply displays only the specified ones.

This is useful when you want to compute metrics over every class except some large class. For example, with labels=[2, 3, 4], precision metrics such as macro and per-class (None) are computed only over classes 2-4 (say, when you do not want to look at labels 0 and 1); see the sketch below.
When y_* are in multi-hot form, the elements of the labels list correspond to the values obtained when the multi-hot columns are mapped to single class ids (i.e. the column indices).
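
A small sketch of restricting the computation with labels (toy values): macro precision over all four classes versus over classes 2 and 3 only.

from sklearn import metrics

y_true = [0, 1, 0, 2, 2, 3]
y_pred = [0, 0, 1, 2, 3, 3]

print(metrics.precision_score(y_true, y_pred, average='macro'))                 # over classes 0-3: 0.5
print(metrics.precision_score(y_true, y_pred, average='macro', labels=[2, 3]))  # over classes 2-3 only: 0.75
print(metrics.precision_score(y_true, y_pred, average=None, labels=[2, 3]))     # [1.  0.5]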

Examples

# Example: average=None vs average='macro'

With average=None, the multilabel precision is computed per class (here the precision of each of the 3 classes, over the positions where y_pred is 1) and returned as a list.

average='macro' is simply the mean of that list (the same holds for precision_score, recall_score and f1_score).

from sklearn.metrics import precision_score

y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1], [0, 1, 1]]
y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0], [1, 1, 0]]
print(precision_score(y_true, y_pred, average=None))
# [0.33333333 1.  1. ]
print(precision_score(y_true, y_pred, average='macro'))
# 0.7777777777777777

# Example: average='micro'

from sklearn import metrics

# The commented-out input pair below produces the first output line; the active pair produces the second.
# y_true = [[1, 1, 0], [1, 1, 1], [0, 1, 1], [0, 1, 1]]
# y_pred = [[0, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 0]]
y_true = [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
y_pred = [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]
acc = metrics.accuracy_score(y_true, y_pred)
micro_p = metrics.precision_score(y_true, y_pred, average='micro')
micro_r = metrics.recall_score(y_true, y_pred, average='micro')
micro_f = metrics.f1_score(y_true, y_pred, average='micro')
print("acc:{}, micro_p:{}, micro_r:{}, micro_f:{}".format(acc, micro_p, micro_r, micro_f))
# with the commented-out (multi-label) inputs: acc:0.25, micro_p:0.6666666666666666, micro_r:0.6666666666666666, micro_f:0.6666666666666666
# with the active (one-hot) inputs: acc:0.25, micro_p:0.25, micro_r:0.25, micro_f:0.25

Note: Micro P / Micro R / Micro F1 all work out equal to accuracy, unless y_pred contains labels that never appear in y_true, or contains invalid non-one-hot rows such as [0 0 0], or possibly in the multilabel case(?).

[Macro-average, micro-average and weighted F1]

Example: computing multi-class classification metrics

[Multi-class classification metric computation]

accuracy_score: classification accuracy

Classification accuracy is the percentage of samples that are classified correctly.

In multilabel classification, the function returns the subset accuracy.

# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
0.692708333333

Null accuracy

Null accuracy is the accuracy that could be achieved by always predicting the most frequent class.

# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()
0    130
1     62

# calculate null accuracy(for binary classification problems coded as 0/1)
max(y_test.mean(), 1-y_test.mean())
0.67708333333333326

# calculate null accuracy (for multi-class classification problems)
y_test.value_counts().head(1) / len(y_test)
0    0.677083

We see that the null accuracy is 68% while the classification accuracy is 69%, which shows that classification accuracy is not a very informative measure of this model. One drawback of classification accuracy is that it reveals nothing about the underlying distribution of the test data.

Comparing the true and predicted class values:

# print the first 25 true and predicted responses
print "True:", y_test.values[0:25]
print "Pred:", y_pred_class[0:25]
True: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0]
Pred: [0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]

Comparing the true and predicted values above, we can see that when the true class is 0 the predictions are mostly 0, but when the true class is 1 the predictions are mostly not 1. In other words, the trained model does well on the majority class and fails on the other class, and classification accuracy alone cannot reveal this problem.

Evaluation metrics based on the confusion matrix

Confusion matrix: confusion_matrix

Applicable to binary and multi-class classification. Because of the format of the y arguments it does not apply to multi-label classification (see the dedicated multilabel_confusion_matrix below).

The confusion matrix gives a much fuller picture of a classifier's performance, and the metrics derived from it can guide model selection.

sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)

Parameters:

y_true    array-like of shape (n_samples,)
Ground truth (correct) target values.

y_pred    array-like of shape (n_samples,)
Estimated targets as returned by a classifier.

Returns: C, ndarray of shape (n_classes, n_classes)
Confusion matrix whose i-th row and j-th column entry indicates the number of samples with true label being i-th class and predicted label being j-th class, as illustrated below:

from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])


y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
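
The snippets below use TP, TN, FP and FN counts. For a binary problem these can be read directly off the confusion matrix; a minimal sketch (the labels here are hypothetical stand-ins for the y_test / y_pred_class used below):

from sklearn import metrics

y_test = [0, 1, 0, 1, 1, 0, 1, 0]
y_pred_class = [0, 1, 1, 1, 0, 0, 1, 0]

# for a 2x2 confusion matrix, ravel() returns TN, FP, FN, TP in that order
TN, FP, FN, TP = metrics.confusion_matrix(y_test, y_pred_class).ravel()
print(TN, FP, FN, TP)   # 3 1 1 3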

Classification accuracy (recognition rate): the proportion of samples the classifier classifies correctly

print((TP + TN) / (TP + TN + FN + FP))
print(metrics.accuracy_score(y_test, y_pred_class))

Classification error (misclassification rate): the proportion of samples the classifier misclassifies

print((FP + FN) / (TP + TN + FN + FP))
print(1 - metrics.accuracy_score(y_test, y_pred_class))

Now consider class imbalance, where the class of interest is rare: the data are dominated by the negative class and the positive class is a small minority. In this situation we need other metrics that measure how well the classifier recognizes positive examples and how well it recognizes negative examples.

Recall, also called the true positive rate, sensitivity, or completeness: the percentage of actual positive samples that are correctly identified

print(TP / (TP + FN))
recall = metrics.recall_score(y_test, y_pred_class)
print(recall)

Specificity, also called the true negative rate: the percentage of actual negative samples that are correctly identified

print(TN / (TN + FP))

False positive rate: the percentage of actual negative samples that are incorrectly predicted as positive

print(FP / (TN + FP))
specificity = TN / (TN + FP)
print(1 - specificity)

Precision: a measure of exactness, i.e. the percentage of samples labeled positive that actually are positive

print(TP / (TP + FP))
precision = metrics.precision_score(y_test, y_pred_class)
print(precision)

The F-measure (also called the F1 score or F score) combines precision and recall into a single metric

$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$

The F-measure is the harmonic mean of precision and recall and gives them equal weight.

The Fβ measure is a weighted measure in which recall is given β times as much weight as precision.

print((2 * precision * recall) / (precision + recall))
print(metrics.f1_score(y_test, y_pred_class))
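
For the Fβ variant, fbeta_score can be checked against the formula in the same way (same assumed y_test, y_pred_class, precision and recall as above; β=2 weights recall more heavily):

beta = 2
print(metrics.fbeta_score(y_test, y_pred_class, beta=beta))
print((1 + beta**2) * precision * recall / (beta**2 * precision + recall))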

Classification report: classification_report

metrics.classification_report(y_true, y_pred, *, labels=None, target_names=None, sample_weight=None, digits=2, output_dict=False, zero_division='warn')

Parameters

labels: list of the label indices to include in the report

target_names: display names corresponding to labels

digits: int, default=2. Number of digits for formatting output floating point values. When output_dict is True, this is ignored and the returned values are not rounded.

Note: the per-class values in the report are identical to those obtained by calling the metric functions directly with average=None.

Multi-class example

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
print(metrics.classification_report(y_true, y_pred))
# target_names = ['class 0', 'class 1', 'class 2']
# print(metrics.classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       0.00      0.00      0.00         1
           2       1.00      0.50      0.67         2

    accuracy                           0.60         5
   macro avg       0.56      0.50      0.49         5
weighted avg       0.67      0.60      0.59         5

Multi-label example

y_true = [[0, 1], [0, 1], [1, 0], [0, 1], [1, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [0, 1], [1, 1]]
print(metrics.classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.50      0.50      0.50         2
           1       0.50      0.67      0.57         3

   micro avg       0.50      0.60      0.55         5
   macro avg       0.50      0.58      0.54         5
weighted avg       0.50      0.60      0.54         5
 samples avg       0.50      0.60      0.53         5
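
When the numbers are needed programmatically, output_dict=True returns a nested dict; a small sketch reusing the multi-label example above:

report = metrics.classification_report(y_true, y_pred, output_dict=True)
print(report['macro avg'])        # {'precision': 0.5, 'recall': 0.58..., 'f1-score': 0.53..., 'support': 5}
print(report['0']['precision'])   # 0.5, the same value as in the table above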

 

Precision-Recall curve

sklearn.metrics.precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weight=None)

Parameters

y_true : array, shape = [n_samples]

True targets of binary classification in range {-1, 1} or {0, 1}, i.e. binary labels only.

probas_pred : array, shape = [n_samples]

Estimated probabilities or decision function; must be continuous scores.

def plotPR(yt, ys, title=None):
    '''
    Plot the precision-recall curve
    :param yt: ground-truth y values
    :param ys: predicted y scores
    '''
    import os
    import seaborn
    from sklearn import metrics
    from matplotlib import pyplot as plt
    precision, recall, thresholds = metrics.precision_recall_curve(yt, ys)

    plt.plot(precision, recall, 'darkorange', lw=1, label='x=precision')
    plt.plot(recall, precision, 'blue', lw=1, label='x=recall')
    plt.legend(loc='best')
    plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
    plt.title('Precision-Recall curve for %s' % title)
    plt.ylabel('Recall')
    plt.xlabel('Precision')
    # save before show(); CWD is a project directory assumed to be defined elsewhere
    plt.savefig(os.path.join(CWD, 'middlewares/pr-' + title + '.png'))
    plt.show()
[ sklearn.metrics.precision_recall_curve]

ROC curve and AUC

[Evaluation metrics and methods for machine-learning models: ROC curve and AUC]

sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)

Parameters

y_true : array, shape = [n_samples]

True targets of binary classification in range {-1, 1} or {0, 1}, i.e. binary labels only.

y_score : array, shape = [n_samples]

Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by "decision_function" on some classifiers). These must also be continuous scores.

Plotting implementation

def plotRUC(yt, ys, title=None):
    '''
    Plot the ROC curve and compute the AUC
    :param yt: ground-truth y values
    :param ys: predicted y scores
    '''
    import os
    from sklearn import metrics
    from matplotlib import pyplot as plt
    f_pos, t_pos, thresh = metrics.roc_curve(yt, ys)
    auc_area = metrics.auc(f_pos, t_pos)
    print('auc_area: {}'.format(auc_area))

    plt.plot(f_pos, t_pos, 'darkorange', lw=2, label='AUC = %.2f' % auc_area)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
    plt.title('ROC-AUC curve for %s' % title)
    plt.ylabel('True Pos Rate')
    plt.xlabel('False Pos Rate')
    # save before show(); CWD is a project directory assumed to be defined elsewhere
    plt.savefig(os.path.join(CWD, 'middlewares/roc-' + title + '.png'))
    plt.show()
[ roc_curve(y_true, y_score[, pos_label, ...])]

[roc_auc_score(y_true, y_score[, average, ...])]

Which metric to use depends on the business requirements:

  • Spam filter: optimize precision first, because false positives (a legitimate email sent to the spam folder) are more costly for this application than false negatives (a spam email landing in the inbox).
  • Fraudulent transaction detector: optimize recall first, because false negatives (fraud that goes undetected) are more costly than false positives (a normal transaction flagged as fraud).

[scikit-learn: metrics for evaluating classifier performance, such as the confusion matrix, ROC, AUC, etc.]

Multi-label classification metrics (multilabel ranking metrics)

General metrics

The accuracy_score, precision_score, recall_score, f1_score and micro/macro averages shown in the earlier multi-class examples can also be used for multi-label classification. See [the average parameter for multi-class classification] above.

Note: computing precision and similar metrics for multilabel data only requires the multilabel_confusion_matrix, not a full confusion_matrix, because precision only needs tp_sum and pred_sum.
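
A quick check of that claim (toy multilabel data): micro precision can be recomputed from the per-label TP and predicted-positive counts of the multilabel confusion matrix described in the next subsection.

import numpy as np
from sklearn import metrics

y_true = np.array([[0, 1, 1], [1, 0, 1], [0, 1, 0]])
y_pred = np.array([[0, 1, 0], [1, 0, 1], [1, 1, 0]])

mcm = metrics.multilabel_confusion_matrix(y_true, y_pred)   # one 2x2 matrix per label
tp_sum = mcm[:, 1, 1].sum()                                 # true positives over all labels
pred_sum = (mcm[:, 1, 1] + mcm[:, 0, 1]).sum()              # predicted positives over all labels

print(tp_sum / pred_sum)                                         # 0.8
print(metrics.precision_score(y_true, y_pred, average='micro'))  # 0.8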

The dedicated multilabel_confusion_matrix

metrics.multilabel_confusion_matrix(y_true, y_pred, *, sample_weight=None, labels=None, samplewise=False)

metrics.multilabel_confusion_matrix was added in scikit-learn 0.21. It computes confusion matrices for multi-label data, and it can also be used for multi-class data.
The MCM reduces the problem to a set of binary problems using a one-vs-rest strategy: each class in turn is treated as the positive class with all other classes as negative, a 2x2 confusion matrix is computed for it, and all matrices are returned in label order.
However, as the example below shows, it cannot tell you which class a sample was misclassified as, unlike metrics.confusion_matrix, which shows which classes are most easily confused.

Example

import numpy as np
from sklearn.metrics import multilabel_confusion_matrix
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1]])
multilabel_confusion_matrix(y_true, y_pred)

array([[[1, 0],
        [0, 1]],

       [[1, 0],
        [0, 1]],

       [[0, 1],
        [1, 0]]])

A well-defined multi-label confusion matrix

On closer thought, a confusion matrix is not easy to define for multi-label data; it may need additional rows or columns.

Approach 1:
For example, with y_true = [0 1 0 1] and y_pred = [0 1 0 0], we can only decompose this as: [0 1 0 0] was recognized as [0 1 0 0], and [0 0 0 1] was recognized as [0 0 0 0] (that is essentially the only sensible definition; we can hardly say, as in the multi-class case, that it was recognized as [0 1 0 0]), and increment the corresponding cells of the matrix.

The example code below shows the details.

import numpy as np


def multilabel_confusion_matrix(label_true, label_pred):
    num_classes = len(label_pred[0])  # number of all classes
    num_instances = len(label_pred)  # number of instances (input)
    # initializing the confusion matrix
    conf_mat = np.zeros((num_classes + 1, num_classes + 1), dtype=np.int64)

    for row_id in range(num_instances):

        num_of_true_labels = np.sum(label_true[row_id])
        num_of_pred_labels = np.sum(label_pred[row_id])

        if num_of_true_labels == 0:
            print('num_of_true_labels=0, something maybe wrong!!!')
            if num_of_pred_labels == 0:
                conf_mat[num_classes][num_classes] += 1
            else:
                for k in range(num_classes):
                    if label_pred[row_id][k] == 1:
                        conf_mat[num_classes][k] += 1  # NTL


        elif num_of_true_labels == 1:
            for j in range(num_classes):
                if label_true[row_id][j] == 1:
                    if num_of_pred_labels == 0:
                        conf_mat[j][num_classes] += 1  # NPL
                    else:
                        for k in range(num_classes):
                            if label_pred[row_id][k] == 1:
                                conf_mat[j][k] += 1

        else:
            if num_of_pred_labels == 0:
                for j in range(num_classes):
                    if label_true[row_id][j] == 1:
                        conf_mat[j][num_classes] += 1  # NPL
            else:
                true_checked = np.zeros((num_classes, 1), dtype=int)
                pred_checked = np.zeros((num_classes, 1), dtype=int)
                # Check for correct prediction
                # exact matches (true=1 and pred=1): increment the diagonal
                for j in range(num_classes):
                    if label_true[row_id][j] == 1 and label_pred[row_id][j] == 1:
                        conf_mat[j][j] += 1
                        true_checked[j] = 1
                        pred_checked[j] = 1

                # check for incorrect prediction(s)
                # a predicted label not in true, while unmatched true labels remain: increment (true_id, pred_id) for every unmatched true label
                # e.g. in [1, 1, 0] -> [1, 0, 1], the remaining part [0, 1, 0] -> [0, 0, 1]
                for k in range(num_classes):
                    if (label_pred[row_id][k] == 1) and (pred_checked[k] != 1):
                        for j in range(num_classes):
                            if (label_true[row_id][j] == 1) and (true_checked[j] != 1):
                                conf_mat[j][k] += 1
                                pred_checked[k] = 1
                                true_checked[j] = 1
                # check for incorrect prediction(s) while all True labels were predicted correctly
                # a predicted label not in true, with all true labels already matched: increment (true_id, pred_id) for every (matched) true label
                # e.g. in [1, 1, 0] -> [1, 1, 1]: [1, 0, 0] -> [0, 0, 1] and [0, 1, 0] -> [0, 0, 1]
                for k in range(num_classes):
                    if (label_pred[row_id][k] == 1) and (pred_checked[k] != 1):
                        for j in range(num_classes):
                            if label_true[row_id][j] == 1:
                                conf_mat[j][k] += 1
                                pred_checked[k] = 1
                                true_checked[j] = 1

                # check for cases with True label(s) and no predicted label [110]>[100]
                # a true label with no predicted label left: use the extra column, increment (true_id, num_classes)
                # e.g. in [1, 1, 0] -> [1, 0, 0]: [1, 0, 0] -> [0, 0, 0] and [0, 1, 0] -> [0, 0, 0]
                for k in range(num_classes):
                    if (label_true[row_id][k] == 1) and (true_checked[k] != 1):
                        conf_mat[k][num_classes] += 1  # NPL

    return conf_mat


y_true = np.array([[0, 0, 1], [1, 1, 0], [1, 1, 0], [1, 1, 0], [0, 1, 0]])
y_pred = np.array([[0, 0, 1], [1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 0]])

conf_mat = multilabel_confusion_matrix(y_true, y_pred)
print('Multi-label confusion Matrix:\n{}'.format(conf_mat))
# [[3 0 1 0]
#  [1 1 2 1]
#  [0 0 1 0]
#  [0 0 0 0]]

The code is mainly adapted from [mlcm · PyPI], but that code is quite old and has bugs.

Approach 2:
If y_true and y_pred become one-hot after being split (for example after flattening a hierarchical encoding), you can split them into several y_true/y_pred pairs, take argmax(1), and compute the ordinary multi-class metrics.confusion_matrix directly.
You may also need to handle rows like [0 0 0], e.g. by turning them into [0 0 0 1], i.e. adding one extra class dimension.
Approach 3:
Of course, it should also be possible to convert a multi-hot vector such as [0 1 0 1] into a one-hot vector over an expanded label space, e.g. [0 0 0 0 1] (after first working out the maximum dimensionality), and then inspect metrics.confusion_matrix; see the sketch below.
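
A sketch of Approach 3 (toy data): mapping each distinct multi-hot combination to its own class id is equivalent to the expanded one-hot encoding, after which the ordinary confusion_matrix applies.

import numpy as np
from sklearn import metrics

y_true = np.array([[0, 1, 0, 1], [1, 0, 0, 0], [0, 1, 0, 1]])
y_pred = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 1]])

# give every distinct label combination its own expanded class id
combos = sorted({tuple(row) for row in np.vstack([y_true, y_pred]).tolist()})
combo_id = {c: i for i, c in enumerate(combos)}
t = [combo_id[tuple(row)] for row in y_true.tolist()]
p = [combo_id[tuple(row)] for row in y_pred.tolist()]

print(metrics.confusion_matrix(t, p))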

Other specialized metrics

[Coverage error]

[Label ranking average precision]

[Ranking loss]

[Normalized Discounted Cumulative Gain]

[Multilabel ranking metrics]

from: Scikit-learn: Model evaluation (pipi blog, CSDN)

ref: [Evaluation metrics and methods for machine-learning models]

[scikit-learn User Guide][3.3. Model evaluation: quantifying the quality of predictions - sklearn]

[Model selection and evaluation]

[3.3. Model evaluation: quantifying the quality of predictions]*

[3.5. Validation curves: plotting scores to evaluate models]*
