Evaluation Metrics for Classification



An evaluation metric is a measure that we use to evaluate different models. Choosing an appropriate evaluation metric is a decision problem that requires a thorough understanding of the project's goal, and it is a fundamental step before all of the modeling that follows. So why is it so important, and how should we choose?


Why is it important?

Let’s say you live in a city where it is sunny about 300 days a year. During the other 65 days, it gets very uncomfortable with heavy rain or snow. In this imaginary world, there are only two weather apps. One app predicts the weather correctly 80% of the time, while the other one is right about 70% of the time. Which app would you choose?


Photo by Sora Sagano on Unsplash

If an app always predicts sunny in this city, this app will be 80% correct, even though it never predicts bad weather accurately. In this case, this app is useless! On the other hand, the other app might fail to predict the sunny weather about 30% of the time, but it will tell you about bad weather accurately over 80% of the time. In this case, this app that correctly predicts the weather 70% of the time is much more useful than the 80% app.


App 1 is correct 80% of the time but misses all bad weather. App 2 is correct 70% of the time but does not miss any bad weather.
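
To make the comparison concrete, here is a minimal sketch in plain Python. The per-app hit rates are the made-up numbers from the story above, not measured data:

# Hypothetical year: 300 sunny days, 65 bad-weather days
n_sunny, n_bad = 300, 65
n_total = n_sunny + n_bad

# App 1 always predicts "sunny": right on every sunny day, wrong on every bad day
app1_accuracy = n_sunny / n_total                    # ~0.82, the "80% correct" app
app1_bad_days_caught = 0 / n_bad                     # misses all bad weather

# App 2 (assumed) is right on ~65% of sunny days but catches every bad day
app2_accuracy = (0.65 * n_sunny + n_bad) / n_total   # ~0.71, the "70% correct" app
app2_bad_days_caught = n_bad / n_bad                 # catches all bad weather

print(app1_accuracy, app1_bad_days_caught)           # high accuracy, useless warnings
print(app2_accuracy, app2_bad_days_caught)           # lower accuracy, reliable warnings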

As we can see here, how we choose to evaluate our model can reveal vastly different information about our data. The choice must be determined by the goal of the project and the properties of the dataset. No data is without variance, and no model can explain the world perfectly, so we must decide where we would rather have prediction errors happen.


True or False, Positive or Negative


When a model makes a binary prediction, there are 4 possible outcomes. A model can predict yes (positive) or no (negative), and it can either be correct (true) or wrong (false). The same logic applies when a model is making a multi-class prediction. The model predicts yes or no for an individual class, and it can be right or it can be wrong, and we decide how to combine these individual outcomes.

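As a quick illustration, scikit-learn's confusion_matrix lays these four outcomes out directly (the labels below are toy values made up for this example):

from sklearn.metrics import confusion_matrix

# Toy binary labels: 1 = positive, 0 = negative (assumed for illustration)
actual_y    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_y = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary input the matrix is [[TN, FP], [FN, TP]]:
# rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(actual_y, predicted_y).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3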

Probability of Correct Prediction

Cases of being right

Precision & Recall

But how we interpret a model goes beyond these four outcomes. For one, a true positive (TP) involves two events: the model's prediction and the actual state. Our interpretation can change based on how we prioritize these two events. We can ask whether the model predicted positive when the actual value was positive, or whether the actual value was positive when the model predicted positive.


Let's say that when the weather is sunny, there is an 80% chance that the app predicts it to be sunny. But this does not tell us anything useful in this context. Instead, we should look at how often the weather is actually sunny when the app says so. This tells us how precise the app is in predicting sunny days. This is the precision metric, representing the probability of the model being correct out of all the times the model said yes.


from sklearn.metrics import precision_score
precision_score(actual_y, predicted_y)
# scoring parameter: 'precision'
Photo by Chinh Le Duc on Unsplash

But sometimes knowing the probability of positive prediction out of all positive cases might be more important. Imagine we are predicting whether a food is poisonous. The precision of the model (the chance that food is poisonous when it is predicted to be) is not as important as flagging food as poisonous when it actually is. If the model says it's poisonous, we can simply not eat it; but if it fails to tell us, we risk eating poisonous food. So we want a model that has a high probability of predicting positive when food is poisonous. This is a case for the recall or sensitivity metric. It's also called the true positive rate. It represents how sensitive our model is in detecting positive cases.


from sklearn.metrics import recall_score
recall_score(actual_y, predicted_y)
# scoring parameter: 'recall'

F1 Score

But what if both of these are important? In other words, what should we do when not missing positive cases and not missing negative cases are equally important? Imagine you are predicting a disease that can be cured by a certain treatment. But this treatment can be detrimental to those without the disease. In this case, we need a model that is sensitive to detecting positive cases and equally precise in its detection. That’s when the F1 Score comes into play. F1 Score is the harmonic mean of precision and recall, an average between precision and recall ratios.


from sklearn.metrics import f1_score
f1_score(actual_y, predicted_y)
# scoring parameter: 'f1'
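
Since the F1 Score is the harmonic mean of precision and recall, a quick sanity check is to compute it by hand. The labels below are the same toy values used earlier, assumed only for illustration:

from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels, assumed for illustration
actual_y    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_y = [1, 0, 1, 0, 1, 0, 0, 1]

p = precision_score(actual_y, predicted_y)  # TP / (TP + FP)
r = recall_score(actual_y, predicted_y)     # TP / (TP + FN)

# Harmonic mean of precision and recall
print(2 * p * r / (p + r))                  # 0.75
print(f1_score(actual_y, predicted_y))      # 0.75, same value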

Accuracy

Precision and recall metrics focus on predicting positive cases: saying yes to the question of ‘Is this ___?’. But as we saw earlier there are two different ways a model can be correct: it can say yes when it should say yes, and it can say no when it should say no. Accuracy is a probability of being correct in both of these cases. We can use accuracy when classes are equally balanced. For example, let’s say we want to predict whether a picture is a dog or not. We just want the model to say yes when it’s a dog, and no when it’s not. Saying that it’s a dog when it is a cat does not have any different consequences than saying that it’s not a dog when it is a dog. In this case, we can use accuracy.


from sklearn.metrics import accuracy_score
accuracy_score(actual_y, predicted_y)
# scoring parameter: 'accuracy'
Is this a dog? Photo by Karsten Winegeart on Unsplash

Specificity

Lastly, specificity is the sensitivity of predicting negative cases (the probability of predicting negative out of all negative cases). In other words, we can use specificity when it is more important not to miss negative cases than to be wrong. Let's say you want to know if the water from a well is drinkable or not. You'd rather mark drinkable water as undrinkable than wrongly mark undrinkable water as 'drinkable'. If we swap our positive and negative and ask 'is this water contaminated?', we would be using sensitivity instead of specificity.


Specificity does not have a built-in function in the scikit-learn package. Instead, you can either use sensitivity (recall) with the positive and negative cases swapped, or calculate it through a custom function using the confusion matrix.


# custom function to calculate specificity for binary classification
from sklearn.metrics import confusion_matrix

def specificity_score(y_true, y_predicted):
    cm = confusion_matrix(y_true, y_predicted)
    # specificity = TN / (TN + FP), from the first (negative-class) row
    return cm[0, 0] / (cm[0, 0] + cm[0, 1])

print(specificity_score(y_true, y_predicted))
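
As a cross-check on the 'swap positive and negative cases' route, recall_score takes a pos_label argument, and the recall of the negative class is exactly specificity. A small sketch with toy labels (assumed for illustration):

from sklearn.metrics import recall_score

# Toy labels, assumed for illustration
y_true      = [1, 1, 1, 0, 0, 0, 0, 1]
y_predicted = [1, 0, 1, 0, 1, 0, 0, 1]

# Recall of the negative class (pos_label=0) is TN / (TN + FP) = specificity
print(recall_score(y_true, y_predicted, pos_label=0))  # 0.75
print(specificity_score(y_true, y_predicted))          # 0.75, same value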

Problem of Multi-Class

Photo by Markus Spiske on Unsplash

Micro Average

We reviewed how to choose evaluation metrics for binary classification data. But what do we do when our target is not yes or no but is composed of multiple categories? One way is to count each outcome globally, regardless of the within-class distribution, and calculate the metric from those global counts. We can accomplish this by using the micro average.


Let’s think about what it means to use global counts. If we just look at the global outcome, we don’t have the four outcomes as before (TP, TN, FP, FN). Instead we have cases that are true (prediction = actual class) or false (prediction != actual class). Therefore, the micro precision, micro recall, and accuracy all represent the probability of accurate prediction and are equal. Also since the F1 Score is the harmonic mean of precision and recall, the micro F1 Score is the same as other metrics.


# these are all the same
recall_score(actual_y, predicted_y, average='micro')
precision_score(actual_y, predicted_y, average='micro')
accuracy_score(actual_y, predicted_y)
f1_score(actual_y, predicted_y, average='micro')
# scoring parameter: 'f1_micro', 'recall_micro', 'precision_micro'
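
A quick demonstration on made-up three-class labels shows this equality. With single-label multi-class data, every false positive for one class is a false negative for another, so the global counts make all four scores identical:

from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Toy three-class labels (e.g. cat=0, dog=1, bird=2), assumed for illustration
actual_y    = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
predicted_y = [0, 1, 0, 1, 2, 2, 2, 0, 2, 2]

print(accuracy_score(actual_y, predicted_y))                    # 0.7
print(precision_score(actual_y, predicted_y, average='micro'))  # 0.7
print(recall_score(actual_y, predicted_y, average='micro'))     # 0.7
print(f1_score(actual_y, predicted_y, average='micro'))         # 0.7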

Counting global outcomes disregards the distribution of predictions within each class (it just counts how many got it right). This can be helpful when the dataset has classes that are highly imbalanced and when we do not care about controlling prediction errors in any specific class. But as we discussed above, many of our problems do involve having specific control over where prediction errors come from.


Macro Average

Another method to deal with multi-class is to simply calculate the binary measures for each class. For example, if our target variable can be either cat, dog, or bird, we get a binary yes or no answer for each prediction: Is this a cat? Is this a dog? Is this a bird? This leads to as many scores as there are target classes. Then we can aggregate these scores into a single metric using a macro average or weighted average.


Macro average calculates the metric for each individual class and then computes the average of those scores. This means it gives equal weight to the results from each class regardless of the class's overall size. So it is not sensitive to the size of each class, and it takes the performance of an individual class seriously even when that class is a minority. Therefore, the macro average is a good measure if predicting the minority class well is as important as overall accuracy, and we believe the minority class contains a reliable amount of information to represent the ground-truth pattern accurately.


The macro average recall score is the same as the balanced accuracy in multi-class problems.


print(f1_score(actual_y, predicted_y, average='macro'))
print(precision_score(actual_y, predicted_y, average='macro'))

# below two are the same measures
from sklearn.metrics import balanced_accuracy_score, recall_score
print(recall_score(actual_y, predicted_y, average='macro'))
print(balanced_accuracy_score(actual_y, predicted_y))
# scoring parameter: 'f1_macro', 'recall_macro', ...etc.
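
To see the contrast with the micro average on imbalanced data, consider a toy classifier (assumed for illustration) that always predicts the majority class: its micro recall stays high while its macro recall exposes the ignored minority class:

from sklearn.metrics import recall_score

# Toy imbalanced labels: class 1 is a small minority (assumed for illustration)
actual_y    = [0] * 8 + [1] * 2
predicted_y = [0] * 10  # always predicts the majority class

print(recall_score(actual_y, predicted_y, average='micro'))  # 0.8
print(recall_score(actual_y, predicted_y, average='macro'))  # 0.5: (1.0 + 0.0) / 2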
Photo by Markus Spiske on Unsplash

Weighted Average

To recap, the micro average gives equal weight to individual observations, and the macro average gives equal weight to individual classes. The macro average by definition does not care about how much data each class has. So even when the minority class does not have enough data to show a reliable pattern, it will still weigh the minority class equally. If this is the case, we might want to consider the overall size of each class by using a weighted average.


# we can also compute precision, recall, and f1 at the same time
from sklearn.metrics import precision_recall_fscore_support
precision_recall_fscore_support(actual_y, predicted_y, average='weighted')
# scoring parameter: 'f1_weighted', 'recall_weighted', ...etc.
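
In other words, the weighted average is the support-weighted mean of the per-class scores, where support is the number of actual samples in each class. A minimal sketch (reusing the toy three-class labels from above) makes this explicit:

import numpy as np
from sklearn.metrics import f1_score

# Toy three-class labels, assumed for illustration
actual_y    = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
predicted_y = [0, 1, 0, 1, 2, 2, 2, 0, 2, 2]

per_class_f1 = f1_score(actual_y, predicted_y, average=None)  # one score per class
support = np.bincount(actual_y)                               # class sizes: [3, 2, 5]

# Weighted average = sum(per-class score * class size) / total size
print(np.sum(per_class_f1 * support) / np.sum(support))       # ~0.7
print(f1_score(actual_y, predicted_y, average='weighted'))    # ~0.7, same value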

Cohen's Kappa

Each of the metrics we discussed so far individually tells a part of a story, and that was why we wanted to make sure our goal is clear and aligns with the metric we choose. But what if we just want a measure that can holistically tell us how our model classification is doing relative to a random chance? Cohen’s Kappa compares the classifier’s performance to that of a random classifier. Naturally, it takes the class imbalance into account as the random classifier will rely on the distribution of each class. Unlike other measures that are probabilities, Cohen’s Kappa ranges from -1 to 1.


from sklearn.metrics import cohen_kappa_score
print(cohen_kappa_score(actual_y, predicted_y))
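
Concretely, kappa compares the observed agreement p_o (plain accuracy) against the agreement p_e expected from a random classifier that matches each class's marginal frequency: kappa = (p_o - p_e) / (1 - p_e). A hand computation on toy labels (assumed for illustration) matches the library:

import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Toy three-class labels, assumed for illustration
actual_y    = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
predicted_y = np.array([0, 1, 0, 1, 2, 2, 2, 0, 2, 2])

p_o = accuracy_score(actual_y, predicted_y)      # observed agreement: 0.7

# Chance agreement: product of actual and predicted class frequencies, summed
n = len(actual_y)
p_e = sum((np.sum(actual_y == c) / n) * (np.sum(predicted_y == c) / n)
          for c in np.unique(actual_y))          # 0.38

print((p_o - p_e) / (1 - p_e))                   # ~0.516
print(cohen_kappa_score(actual_y, predicted_y))  # same value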

We looked at how the evaluation metrics for binary and multi-class classification are each uniquely suited to individual problems, and why it's important to care about the metrics we choose. I tried to go over the most foundational measures in plain English. But I still left out another very important metric, the ROC/AUC score, which I will discuss in another post sometime.


Just a puppy, just because. Photo by Chris Leipelt on Unsplash

Translated from: https://towardsdatascience.com/evaluation-metrics-for-classification-cdea775c97d4
