A Detailed Walkthrough of Macro-F1/Precision/Recall Computation for Multiclass Models

ybdesire

Article link: https://blog.csdn.net/ybdesire/article/details/96507733

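Introduction

I previously wrote an article [1] on how accuracy, precision, recall, and F1 are computed.

From the formulas in [1], precision, recall, and F1 are defined for a binary classifier. They depend only on y_true/y_pred, and they require y_true/y_pred to contain only the two values 0 and 1.

For a binary model, you can call f1_score, precision_score, and recall_score from sklearn.metrics directly. For a multiclass model, however, y_true/y_pred may contain more than two labels (for example y_true=[1,2,3]). How should F1/P/R be computed in that case?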

Macro Average

A Google search turns up the following description:

The F1 measure is widely used to evaluate the success of a binary classifier when one class is rare. Micro average, macro average, and per instance average F1 measures are used in multilabel classification.

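So the traditional F1 formula [1] applies only to a binary model. For a multiclass model, the per-class scores have to be combined with an averaging rule; this article uses the Macro Average rule to compute F1 (and likewise P and R).

As an example, take a classifier with three classes: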

y_true=[1,2,3]
y_pred=[1,1,3]

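According to the P/R rules in [1],

Precision = (samples predicted as 1 and predicted correctly)/(all samples predicted as 1) = TP/(TP+FP)
Recall = (samples predicted as 1 and predicted correctly)/(all samples whose true label is 1) = TP/(TP+FN)
F1 = 2*(Precision*Recall)/(Precision+Recall)

In the calculations below, whenever a division has both numerator and denominator equal to zero, the result is taken to be 0.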
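
The three formulas and the 0/0 convention can be sketched in plain Python as follows. This is only an illustration: the helper names `binary_prf` and `safe_div` are made up for this example and are not part of sklearn.

```python
def binary_prf(y_true, y_pred):
    """Precision, recall and F1 for 0/1 labels; a zero denominator yields 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def safe_div(num, den):
        # The denominator is zero only when the numerator is zero too,
        # so this implements the "0/0 counts as 0" convention.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f1


# TP=2, FP=1, FN=1, so precision and recall are both 2/3
print(binary_prf([1, 0, 1, 1], [1, 1, 0, 1]))
```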

The Macro Average F1 is then computed as follows:

(1) Treat class 1 as True (1) and every other class as False (0), then compute P1 and R1:

y_true=[1,0,0]
y_pred=[1,1,0]

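P1 = (samples predicted as 1 and predicted correctly)/(all samples predicted as 1) = TP/(TP+FP) = 1/(1+1) = 0.5
R1 = (samples predicted as 1 and predicted correctly)/(all samples whose true label is 1) = TP/(TP+FN) = 1/1 = 1.0
F1_1 = 2*(Precision*Recall)/(Precision+Recall) = 2*0.5*1.0/(0.5+1.0) = 0.6666667

(2) Treat class 2 as True (1) and every other class as False (0), then compute P2 and R2: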

y_true=[0,1,0]
y_pred=[0,0,0]

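P2 = (samples predicted as 1 and predicted correctly)/(all samples predicted as 1) = TP/(TP+FP) = 0/0, taken as 0.0
R2 = (samples predicted as 1 and predicted correctly)/(all samples whose true label is 1) = TP/(TP+FN) = 0/1 = 0.0
F1_2 = 2*(Precision*Recall)/(Precision+Recall) = 0/0, taken as 0

(3) Treat class 3 as True (1) and every other class as False (0), then compute P3 and R3: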

y_true=[0,0,1]
y_pred=[0,0,1]

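P3 = (samples predicted as 1 and predicted correctly)/(all samples predicted as 1) = TP/(TP+FP) = 1/1 = 1.0
R3 = (samples predicted as 1 and predicted correctly)/(all samples whose true label is 1) = TP/(TP+FN) = 1/1 = 1.0
F1_3 = 2*(Precision*Recall)/(Precision+Recall) = 2*1.0*1.0/(1.0+1.0) = 1.0

(4) Average P1/P2/P3 to get P, average R1/R2/R3 to get R, and average F1_1/F1_2/F1_3 to get F1: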

P=(P1+P2+P3)/3=(0.5+0.0+1.0)/3=0.5
R=(R1+R2+R3)/3=(1.0+0.0+1.0)/3=0.6666666
F1 = (0.6666667+0.0+1.0)/3=0.5556

These averaged P, R, and F1 values are the P, R, and F1 values under the macro rule.

For this 3-class model, the macro F1 is therefore 0.5556.
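
Steps (1) to (4) can be reproduced with a short one-vs-rest loop. The sketch below is plain Python, not sklearn's implementation. The function name `macro_prf` is hypothetical, and it assumes the set of classes is the union of the labels seen in y_true and y_pred.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall and F1 via one-vs-rest counting."""
    labels = sorted(set(y_true) | set(y_pred))  # assumption: classes = all labels seen
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        pred_count = sum(1 for p in y_pred if p == label)  # TP + FP
        true_count = sum(1 for t in y_true if t == label)  # TP + FN
        precision = tp / pred_count if pred_count else 0.0  # 0/0 counts as 0
        recall = tp / true_count if true_count else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    n = len(per_class)
    # Unweighted mean over classes, separately for P, R and F1
    return tuple(sum(scores[i] for scores in per_class) / n for i in range(3))


p, r, f1 = macro_prf([1, 2, 3], [1, 1, 3])
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.5 0.6667 0.5556
```

Note that the macro F1 is the mean of the per-class F1 values, not the F1 of the averaged P and R.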

Computing with sklearn (macro)

The program below uses sklearn to compute multiclass F1/P/R directly. Setting the average parameter to 'macro' is all that is needed.

from sklearn.metrics import f1_score, precision_score, recall_score

y_true=[1,2,3]
y_pred=[1,1,3]

f1 = f1_score( y_true, y_pred, average='macro' )
p = precision_score(y_true, y_pred, average='macro')
r = recall_score(y_true, y_pred, average='macro')

print(f1, p, r)
# output: 0.555555555556 0.5 0.666666666667

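The P/R/F1 values printed here match the manual calculation above.

Static analysis of the macro-F1 source code in sklearn

(1) Locate the third-party library
First use the following Python code to find where the sklearn source lives. On my machine it is at /root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn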

import sklearn, os
path = os.path.dirname(sklearn.__file__)
print(path)  # directory that contains the installed sklearn package

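(2) Locate the source code to debug

The goal of debugging is to see how f1_score() arrives at its result, so the code to debug is the source of f1_score.

The question is how to find the source of f1_score().

The sklearn API documentation (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) gives a source link under "[source]" for each function.

Following that "[source]" link (https://github.com/scikit-learn/scikit-learn/blob/b7b4d3e2f/sklearn/metrics/classification.py#L950) shows that the code to debug is in sklearn/metrics/classification.py.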
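
As an alternative to following the documentation link, the standard-library `inspect` module can report where a pure-Python function is defined in the installed copy. The sketch below uses `json.dumps` as a stand-in so that it runs without sklearn. The helper name `locate_source` is made up for this example. Swapping in `f1_score` imported from `sklearn.metrics` should work the same way, assuming sklearn is installed.

```python
import inspect
import json  # stand-in module; replace json.dumps with sklearn.metrics.f1_score


def locate_source(func):
    """Return (file path, first line number) of a pure-Python function."""
    path = inspect.getsourcefile(func)
    _, first_line = inspect.getsourcelines(func)
    return path, first_line


path, first_line = locate_source(json.dumps)
print(path, first_line)
```

The reported file and line refer to the version actually installed, which may differ from the version the documentation links to.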

(3) Analyze the F1 function f1_score() in classification.py

f1_score() mainly calls fbeta_score(), the function that computes the F-score, with beta set to 1, which yields F1. The most important function inside fbeta_score() is [precision_recall_fscore_support()](https://github.com/scikit-learn/scikit-learn/blob/b7b4d3e2f1a65bcb6d40431d3b61ed1d563c9dab/sklearn/metrics/classification.py#L1263), whose core logic is as follows:

# Parameters:
# y_true: ground-truth labels
# y_pred: predicted labels
# beta=1.0: computes F1 by default
# the other parameters are not used here
def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None,
                                    pos_label=1, average=None,
                                    warn_for=('precision', 'recall',
                                              'f-score'),
                                    sample_weight=None):
    # beta of the F-score must be greater than 0
    if beta <= 0:
        raise ValueError("beta should be >0 in the F-beta score")
    
    # Compute one 2x2 confusion matrix per class
    samplewise = average == 'samples'
    MCM = multilabel_confusion_matrix(y_true, y_pred,
                                      sample_weight=sample_weight,
                                      labels=labels, samplewise=samplewise)
    tp_sum = MCM[:, 1, 1]
    pred_sum = tp_sum + MCM[:, 0, 1]
    true_sum = tp_sum + MCM[:, 1, 0]
    
    # Under the micro rule, sum TP etc. over all classes before dividing
    # With micro, tp_sum/pred_sum/true_sum each shrink from one value per class to a single value
    if average == 'micro':
        tp_sum = np.array([tp_sum.sum()])
        pred_sum = np.array([pred_sum.sum()])
        true_sum = np.array([true_sum.sum()])

    beta2 = beta ** 2  # beta=1 here, so beta2 is also 1 and the F-score is F1

    # Compute precision and recall
    precision = _prf_divide(tp_sum, pred_sum,
                            'precision', 'predicted', average, warn_for)
    recall = _prf_divide(tp_sum, true_sum,
                         'recall', 'true', average, warn_for)

    # Compute f_score
    denom = beta2 * precision + recall
    denom[denom == 0.] = 1  # avoid division by 0
    f_score = (1 + beta2) * precision * recall / denom

    # For weighted averaging, set the weights variable (used by np.average below)
    if average == 'weighted':
        weights = true_sum
        if weights.sum() == 0:
            return 0, 0, 0, None
    elif average == 'samples':
        weights = sample_weight
    else:
        weights = None

    # Both macro and micro reach this point
    if average is not None:
        assert average != 'binary' or len(precision) == 1
        # Average the per-class precision to get the multiclass precision
        precision = np.average(precision, weights=weights)
        # Average the per-class recall to get the multiclass recall
        recall = np.average(recall, weights=weights)
        # Average the per-class f_score to get the multiclass f_score
        f_score = np.average(f_score, weights=weights)
        true_sum = None  # return no support

    return precision, recall, f_score, true_sum

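Because of the logic under if average == 'micro', the per-class tp_sum/pred_sum/true_sum arrays, which hold one value per class, are collapsed into arrays holding a single value.

As a result, when precision and recall are computed afterwards, they are single values under 'micro'. Otherwise (under macro), precision and recall are arrays with one value per class.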
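
To make the difference concrete, here is a small pure-Python analogue of that branch, restricted to precision. It is a sketch of the idea, not sklearn's code. The function names are made up, and the class set is assumed to be the union of labels in y_true and y_pred.

```python
def per_class_counts(y_true, y_pred):
    """Per-class true positives and predicted counts, ordered by label."""
    labels = sorted(set(y_true) | set(y_pred))  # assumption: all labels seen
    tp_sum = [sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
              for label in labels]
    pred_sum = [sum(1 for p in y_pred if p == label) for label in labels]
    return tp_sum, pred_sum


def precision_by_rule(y_true, y_pred, average):
    tp_sum, pred_sum = per_class_counts(y_true, y_pred)
    if average == "micro":
        # Pool the counts first: one entry instead of one entry per class
        tp_sum, pred_sum = [sum(tp_sum)], [sum(pred_sum)]
    per_entry = [tp / pred if pred else 0.0 for tp, pred in zip(tp_sum, pred_sum)]
    # Macro: mean over classes. Micro: mean over the single pooled entry.
    return sum(per_entry) / len(per_entry)


y_true, y_pred = [1, 2, 3], [1, 1, 3]
print(round(precision_by_rule(y_true, y_pred, "macro"), 4))  # 0.5
print(round(precision_by_rule(y_true, y_pred, "micro"), 4))  # 0.6667
```

For the running example, macro averages the per-class precisions (0.5, 0, 1), while micro divides the pooled counts, 2 correct out of 3 predictions.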

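Dynamic analysis

(1) Locate the source file to debug, following the steps above

On my machine it is at /root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/classification.py

(2) Delete Python's precompiled bytecode

Go to the following directory and delete the __pycache__ directory.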

/root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/
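
The same deletion can be scripted. The helper below is a hypothetical convenience function, not something sklearn provides. It removes only the __pycache__ directory directly under the directory you pass in.

```python
import shutil
from pathlib import Path


def remove_pycache(package_dir):
    """Delete package_dir/__pycache__ if it exists; return True if removed."""
    cache_dir = Path(package_dir) / "__pycache__"
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False


# Example call, using the directory from this article (adjust to your install):
# remove_pycache("/root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/")
```

Python recreates the directory the next time the modules are imported, so deleting it is safe.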

(3) Add a breakpoint in the third-party source

The key computation of f1_score() sits at the following place in precision_recall_fscore_support(). Insert a pdb breakpoint there, as shown below:

import pdb;pdb.set_trace()

beta2 = beta ** 2

# Divide, and on zero-division, set scores to 0 and warn:

precision = _prf_divide(tp_sum, pred_sum,
                        'precision', 'predicted', average, warn_for)
recall = _prf_divide(tp_sum, true_sum,
                     'recall', 'true', average, warn_for)
# Don't need to warn for F: either P or R warned, or tp == 0 where pos
# and true are nonzero, in which case, F is well-defined and zero
denom = beta2 * precision + recall
denom[denom == 0.] = 1  # avoid division by 0
f_score = (1 + beta2) * precision * recall / denom
if average is not None:
    assert average != 'binary' or len(precision) == 1
    precision = np.average(precision, weights=weights)
    recall = np.average(recall, weights=weights)
    f_score = np.average(f_score, weights=weights)
    true_sum = None  # return no support

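Reading this code statically is already enough to understand roughly what it does. Next, following the procedure in [2], we debug the program dynamically to see the actual values of the variables at run time.

Run the sklearn macro program shown earlier and single-step through it: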

(Pdb) n
> /root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/classification.py(1107)precision_recall_fscore_support()
-> with np.errstate(divide='ignore', invalid='ignore'):
(Pdb) n
> /root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/classification.py(1112)precision_recall_fscore_support()
-> precision = _prf_divide(tp_sum, pred_sum,
(Pdb) n
> /root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/classification.py(1113)precision_recall_fscore_support()
-> 'precision', 'predicted', average, warn_for)
(Pdb) n
> /root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/classification.py(1114)precision_recall_fscore_support()
-> recall = _prf_divide(tp_sum, true_sum,
(Pdb) n
> /root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/classification.py(1115)precision_recall_fscore_support()
-> 'recall', 'true', average, warn_for)
(Pdb) precision
array([ 0.5,  0. ,  1. ])
(Pdb) recall
*** NameError: name 'recall' is not defined
(Pdb) n
> /root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/classification.py(1118)precision_recall_fscore_support()
-> f_score = ((1 + beta2) * precision * recall /
(Pdb) recall
array([ 1.,  0.,  1.])

The recall and precision computed by the sklearn source match our manual calculation. Continue stepping:

(Pdb) n
> /root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/classification.py(1128)precision_recall_fscore_support()
-> elif average == 'samples':
(Pdb) f_score
array([ 0.66666667,  0.        ,  1.        ])

f_score holds this array, which matches the values we computed by hand. Continue stepping through the program:

(Pdb) n
> /root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/classification.py(1137)precision_recall_fscore_support()
-> f_score = np.average(f_score, weights=weights)
(Pdb) f_score
array([ 0.66666667,  0.        ,  1.        ])
(Pdb) n
> /root/anaconda3/envs/envtf/lib/python3.5/site-packages/sklearn/metrics/classification.py(1138)precision_recall_fscore_support()
-> true_sum = None  # return no support
(Pdb) f_score
0.55555555555555547
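
The values seen in the debugger can be reproduced outside sklearn. The sketch below is a list-based analogue of the denom guard and the final averaging step. The function name `fbeta_per_class` is made up, and plain lists stand in for the NumPy arrays used by the library.

```python
def fbeta_per_class(precision, recall, beta=1.0):
    """Per-class F-beta with the same zero-denominator guard as the source."""
    beta2 = beta ** 2
    scores = []
    for p, r in zip(precision, recall):
        denom = beta2 * p + r
        if denom == 0.0:
            denom = 1.0  # the numerator is 0 as well, so the score becomes 0
        scores.append((1 + beta2) * p * r / denom)
    return scores


precision = [0.5, 0.0, 1.0]  # per-class values printed by pdb above
recall = [1.0, 0.0, 1.0]
f_score = fbeta_per_class(precision, recall)
macro_f1 = sum(f_score) / len(f_score)  # unweighted mean, as with weights=None
print([round(s, 4) for s in f_score], round(macro_f1, 4))
# [0.6667, 0.0, 1.0] 0.5556
```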