Introduction to Classification Evaluation Methods -- Part 1

1. Classification problem

Classification is a kind of supervised learning, which leverages labeled data (ground truth) to guide model training. Unlike regression, the label here is categorical (qualitative) rather than numeric (quantitative).
The most general form is multi-label classification. It means there are multiple classes in the ground truth (a class is a possible value of the label) and one data point (i.e. observation or record) can belong to more than one class.
More specifically, if we require that each data point belongs to exactly one class, it becomes multi-class classification (or multinomial classification). In most practical cases, we encounter multi-class classification.
Moreover, if we further require that there are only two classes in the ground truth, it becomes binary classification, which is the most common case.
The above definitions can be summarized by the following table:

type of classification           | multiple classes (in the ground truth) | two classes (in the ground truth)
multiple labels (one record has) | multi-label classification             |
single label (one record has)    | multi-class classification             | binary classification

Binary classification is simple and popular, since we often encounter detection problems that determine whether a signal exists or not, e.g. in face detection, whether a face exists in an image, or in lead generation, whether a company is qualified as a lead. Therefore, we start our introduction with binary classification.

2. Metrics for binary classification

2.1. Binary classification

In binary classification, given a data point x with its label y (y ∈ {0, 1}), the classifier scores the data point as f (we assume f ∈ [0, 1]). By comparing f to a threshold t, we can get the predicted label ŷ: if f ≥ t, then ŷ = 1 (positive); otherwise ŷ = 0 (negative).
For a set of N data points X with labels y, the corresponding predicted scores and labels are f and ŷ, respectively.

2.2. Examples

Here we illustrate three example label vectors that we try to predict, i.e. y1, y2, y3. We assume that we have 10 data points i = 0...9, whose scores fi are sorted in descending order by index i. To get ŷ, we use 0.5 as the threshold.

i    | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9
y1,i | 1    | 1    | 1    | 1    | 1    | 0    | 0    | 0    | 0    | 0
y2,i | 1    | 0    | 1    | 0    | 1    | 0    | 0    | 1    | 1    | 0
y3,i | 0    | 0    | 0    | 0    | 0    | 1    | 1    | 1    | 1    | 1
fi   | 0.96 | 0.91 | 0.75 | 0.62 | 0.58 | 0.52 | 0.45 | 0.28 | 0.17 | 0.13
ŷi   | 1    | 1    | 1    | 1    | 1    | 1    | 0    | 0    | 0    | 0

From the data above, we can see that fi predicts y1,i the best, because fi is above 0.5 when y1,i = 1; however, fi predicts y3,i poorly.
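
As a quick check, here is a minimal sketch (plain NumPy; the arrays simply restate the table above) that thresholds the scores at 0.5 to obtain the predicted labels ŷ:

import numpy as np

# scores fi in descending order, as listed in the table above
f = np.array([0.96, 0.91, 0.75, 0.62, 0.58, 0.52, 0.45, 0.28, 0.17, 0.13])

t = 0.5                          # decision threshold
y_hat = (f >= t).astype(int)     # y_hat = 1 if f >= t, else 0
print(y_hat)                     # -> [1 1 1 1 1 1 0 0 0 0]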

2.3. Metrics based on labels and predicted labels

Based on the labels, the data points can be divided into positive and negative. Based on the predicted labels, they can be divided into predicted positive and predicted negative. We make the following basic definitions:

  • P : the number of positive points, i.e. #(y=1)
  • N : the number of negative points, i.e. #(y=0)
  • P̂ : the number of predicted positive points, i.e. #(ŷ=1)
  • N̂ : the number of predicted negative points, i.e. #(ŷ=0)
  • TP : the number of predicted positive points that are actually positive, i.e. #(y=1, ŷ=1) (aka. True Positive, Hit)
  • FP : the number of predicted positive points that are actually negative, i.e. #(y=0, ŷ=1) (aka. False Positive, False Alarm, Type I Error)
  • TN : the number of predicted negative points that are actually negative, i.e. #(y=0, ŷ=0) (aka. True Negative, Correct Rejection)
  • FN : the number of predicted negative points that are actually positive, i.e. #(y=1, ŷ=0) (aka. False Negative, Miss, Type II Error)

The above definitions can be summarized as the following confusion matrix:

confusion matrix    | P̂ (predicted positive) | N̂ (predicted negative)
P (actual positive) | TP                      | FN
N (actual negative) | FP                      | TN
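
As an illustration, the following sketch (reusing y1 and ŷ from the example in section 2.2) counts TP, FP, TN and FN directly, and cross-checks the result with scikit-learn's confusion_matrix:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])   # y1 from section 2.2
y_hat  = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])   # predicted labels at threshold 0.5

TP = int(np.sum((y_true == 1) & (y_hat == 1)))   # hits
FP = int(np.sum((y_true == 0) & (y_hat == 1)))   # false alarms
TN = int(np.sum((y_true == 0) & (y_hat == 0)))   # correct rejections
FN = int(np.sum((y_true == 1) & (y_hat == 0)))   # misses
print(TP, FP, TN, FN)                            # -> 5 1 4 0

# scikit-learn returns the matrix as [[TN, FP], [FN, TP]] for labels [0, 1]
print(confusion_matrix(y_true, y_hat))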

Based on TP, FP, FN, TN, we can define the following metrics:

  • Recall: recall = TP/P = TP/(TP+FN), the percentage of positive points that are predicted as positive (aka. Hit Rate, Sensitivity, True Positive Rate, TPR)
  • Precision: precision = TP/P̂ = TP/(TP+FP), the percentage of predicted positive points that are actually positive (aka. Positive Predictive Value, PPV)
  • False Alarm Rate: fa = FP/N = FP/(FP+TN), the percentage of negative points that are predicted as positive (aka. False Positive Rate, FPR)
  • F score: f1 = 2·precision·recall / (precision+recall), the harmonic mean of precision and recall. It is the special case of the fβ score with β = 1 (aka. F1 Score)
  • Accuracy: accuracy = (TP+TN)/(P+N), the percentage of correctly predicted points out of all points
  • Matthews Correlation Coefficient: MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) (aka. MCC)
  • Mean Consequential Error: MCE = (1/n)·Σ|yi - ŷi| = (FP+FN)/(P+N) = 1 - accuracy, where n = P+N is the total number of points
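
A small sketch of these formulas (a hypothetical helper written for this post, not part of any library), applied to example 1 at threshold 0.5, reproduces the metrics listed for example 1 in the table below:

import math

def binary_metrics(TP, FP, TN, FN):
    # direct implementation of the definitions above; when a denominator is 0
    # the corresponding metric is undefined (shown as NaN in the tables below)
    recall    = TP / (TP + FN)
    precision = TP / (TP + FP)
    f1        = 2 * precision * recall / (precision + recall)
    fa        = FP / (FP + TN)
    accuracy  = (TP + TN) / (TP + FP + TN + FN)
    mcc       = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    mce       = (FP + FN) / (TP + FP + TN + FN)
    return recall, precision, f1, fa, accuracy, mcc, mce

# example 1 at threshold 0.5: TP=5, FP=1, TN=4, FN=0
print(binary_metrics(5, 1, 4, 0))
# -> (1.0, 0.833, 0.909, 0.2, 0.9, 0.817, 0.1) up to rounding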

In the above 3 examples, we can calculate these metrics as follows:

example | TP | FP | TN | FN | recall | precision | f1    | fa  | accuracy | MCC    | MCE
1       | 5  | 1  | 4  | 0  | 1      | 0.833     | 0.909 | 0.2 | 0.9      | 0.817  | 0.1
2       | 3  | 3  | 2  | 2  | 0.6    | 0.5       | 0.545 | 0.6 | 0.5      | 0      | 0.5
3       | 1  | 5  | 0  | 4  | 0.2    | 0.167     | 0.182 | 1   | 0.1      | -0.817 | 0.9

The following table lists the value range of each metric

Metrics | TP    | FP    | TN    | FN    | recall | precision | f1    | fa    | accuracy | MCC    | MCE
Range   | [0,P] | [0,N] | [0,N] | [0,P] | [0,1]  | [0,1]     | [0,1] | [0,1] | [0,1]    | [-1,1] | [0,1]
Best    | P     | 0     | N     | 0     | 1      | 1         | 1     | 0     | 1        | 1      | 0
Worst*  | 0     | N     | 0     | P     | 0      | 0         | 0     | 1     | 0        | -1     | 1

* There are different ways to define the worst case. For example, a prediction that is equivalent to random guessing could be defined as the worst, since it doesn't provide any useful information. Here we define the worst case as the prediction being totally opposite to the ground truth.

2.4. Metrics based on labels and predicted scores

The above metrics depend on the threshold, i.e. if the threshold varies, the above metrics vary accordingly. To build a threshold-free measurement, we can define metrics based on labels and predicted scores (instead of predicted labels).
The rationale is that we can measure the overall performance as the threshold sweeps through its value range. Here we introduce 2 curves:

  • ROC Curve: recall (y-axis) vs. fa (x-axis) curve as threshold varies
  • Precision-Recall Curve: precision (y-axis) vs. recall (x-axis) curve as threshold varies

Then we can define the following metrics:

  • AUC: area under ROC (Receiver Operating Characteristic) curve
  • Average Precision: area under precision-recall curve
  • Precision-Recall Breakeven Point: precision (or recall or f1 score) when precision=recall
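
As a sketch of how these curves and areas can be obtained in practice (using scikit-learn, the same library used later in this post), here is the computation for example 1; the breakeven point is read off the precision-recall curve where precision and recall cross:

import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score, average_precision_score

y1 = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])                                # labels of example 1
f  = np.array([0.96, 0.91, 0.75, 0.62, 0.58, 0.52, 0.45, 0.28, 0.17, 0.13])  # predicted scores

fpr, tpr, _ = roc_curve(y1, f)                        # points of the ROC curve: fa (x) vs. recall (y)
precision, recall, _ = precision_recall_curve(y1, f)  # points of the precision-recall curve

print(roc_auc_score(y1, f))             # AUC, area under the ROC curve            -> 1.0 here
print(average_precision_score(y1, f))   # average precision, area under the PR curve -> 1.0 here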

To describe the characteristics of these curves and metrics, let’s do some case study first.

Example 1:

index | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9
fi    | 0.96 | 0.91 | 0.75 | 0.62 | 0.58 | 0.52 | 0.45 | 0.28 | 0.17 | 0.13
y1,i  | 1    | 1    | 1    | 1    | 1    | 0    | 0    | 0    | 0    | 0

threshold range | TP | FP | TN | FN | recall | precision | f1    | fa  | accuracy | MCC   | MCE
(0.96, 1]       | 0  | 0  | 5  | 5  | 0      | NaN       | NaN   | 0   | 0.5      | NaN   | 0.5
(0.91, 0.96]    | 1  | 0  | 5  | 4  | 0.2    | 1         | 0.333 | 0   | 0.6      | 0.333 | 0.4
(0.75, 0.91]    | 2  | 0  | 5  | 3  | 0.4    | 1         | 0.571 | 0   | 0.7      | 0.5   | 0.3
(0.62, 0.75]    | 3  | 0  | 5  | 2  | 0.6    | 1         | 0.75  | 0   | 0.8      | 0.655 | 0.2
(0.58, 0.62]    | 4  | 0  | 5  | 1  | 0.8    | 1         | 0.889 | 0   | 0.9      | 0.816 | 0.1
(0.52, 0.58]    | 5  | 0  | 5  | 0  | 1      | 1         | 1     | 0   | 1        | 1     | 0
(0.45, 0.52]    | 5  | 1  | 4  | 0  | 1      | 0.833     | 0.909 | 0.2 | 0.9      | 0.816 | 0.1
(0.28, 0.45]    | 5  | 2  | 3  | 0  | 1      | 0.714     | 0.833 | 0.4 | 0.8      | 0.655 | 0.2
(0.17, 0.28]    | 5  | 3  | 2  | 0  | 1      | 0.625     | 0.769 | 0.6 | 0.7      | 0.5   | 0.3
(0.13, 0.17]    | 5  | 4  | 1  | 0  | 1      | 0.556     | 0.714 | 0.8 | 0.6      | 0.333 | 0.4
[0, 0.13]       | 5  | 5  | 0  | 0  | 1      | 0.5       | 0.667 | 1   | 0.5      | NaN   | 0.5
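
The rows of this table can be reproduced with a short threshold sweep; the sketch below (plain NumPy, one representative threshold per range) recomputes the four counts for example 1:

import numpy as np

f  = np.array([0.96, 0.91, 0.75, 0.62, 0.58, 0.52, 0.45, 0.28, 0.17, 0.13])
y1 = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# one representative threshold per range: any t > max(f), then each score itself
for t in [1.0] + list(f):
    y_hat = (f >= t).astype(int)
    TP = int(np.sum((y1 == 1) & (y_hat == 1)))
    FP = int(np.sum((y1 == 0) & (y_hat == 1)))
    TN = int(np.sum((y1 == 0) & (y_hat == 0)))
    FN = int(np.sum((y1 == 1) & (y_hat == 0)))
    print(f"t={t:.2f}  TP={TP} FP={FP} TN={TN} FN={FN}")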

Example 2:

index | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9
fi    | 0.96 | 0.91 | 0.75 | 0.62 | 0.58 | 0.52 | 0.45 | 0.28 | 0.17 | 0.13
y2,i  | 1    | 0    | 1    | 0    | 1    | 0    | 0    | 1    | 1    | 0

threshold range | TP | FP | TN | FN | recall | precision | f1    | fa  | accuracy | MCC    | MCE
(0.96, 1]       | 0  | 0  | 5  | 5  | 0      | NaN       | NaN   | 0   | 0.5      | NaN    | 0.5
(0.91, 0.96]    | 1  | 0  | 5  | 4  | 0.2    | 1         | 0.333 | 0   | 0.6      | 0.333  | 0.4
(0.75, 0.91]    | 1  | 1  | 4  | 4  | 0.2    | 0.5       | 0.286 | 0.2 | 0.5      | 0      | 0.5
(0.62, 0.75]    | 2  | 1  | 4  | 3  | 0.4    | 0.667     | 0.5   | 0.2 | 0.6      | 0.218  | 0.4
(0.58, 0.62]    | 2  | 2  | 3  | 3  | 0.4    | 0.5       | 0.444 | 0.4 | 0.5      | 0      | 0.5
(0.52, 0.58]    | 3  | 2  | 3  | 2  | 0.6    | 0.6       | 0.6   | 0.4 | 0.6      | 0.2    | 0.4
(0.45, 0.52]    | 3  | 3  | 2  | 2  | 0.6    | 0.5       | 0.545 | 0.6 | 0.5      | 0      | 0.5
(0.28, 0.45]    | 3  | 4  | 1  | 2  | 0.6    | 0.429     | 0.5   | 0.8 | 0.4      | -0.218 | 0.6
(0.17, 0.28]    | 4  | 4  | 1  | 1  | 0.8    | 0.5       | 0.615 | 0.8 | 0.5      | 0      | 0.5
(0.13, 0.17]    | 5  | 4  | 1  | 0  | 1      | 0.556     | 0.714 | 0.8 | 0.6      | 0.333  | 0.4
[0, 0.13]       | 5  | 5  | 0  | 0  | 1      | 0.5       | 0.667 | 1   | 0.5      | NaN    | 0.5

Example 3:

index | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9
fi    | 0.96 | 0.91 | 0.75 | 0.62 | 0.58 | 0.52 | 0.45 | 0.28 | 0.17 | 0.13
y3,i  | 0    | 0    | 0    | 0    | 0    | 1    | 1    | 1    | 1    | 1

threshold range | TP | FP | TN | FN | recall | precision | f1    | fa  | accuracy | MCC    | MCE
(0.96, 1]       | 0  | 0  | 5  | 5  | 0      | NaN       | NaN   | 0   | 0.5      | NaN    | 0.5
(0.91, 0.96]    | 0  | 1  | 4  | 5  | 0      | 0         | NaN   | 0.2 | 0.4      | -0.333 | 0.6
(0.75, 0.91]    | 0  | 2  | 3  | 5  | 0      | 0         | NaN   | 0.4 | 0.3      | -0.5   | 0.7
(0.62, 0.75]    | 0  | 3  | 2  | 5  | 0      | 0         | NaN   | 0.6 | 0.2      | -0.655 | 0.8
(0.58, 0.62]    | 0  | 4  | 1  | 5  | 0      | 0         | NaN   | 0.8 | 0.1      | -0.816 | 0.9
(0.52, 0.58]    | 0  | 5  | 0  | 5  | 0      | 0         | NaN   | 1   | 0        | -1     | 1
(0.45, 0.52]    | 1  | 5  | 0  | 4  | 0.2    | 0.167     | 0.182 | 1   | 0.1      | -0.816 | 0.9
(0.28, 0.45]    | 2  | 5  | 0  | 3  | 0.4    | 0.286     | 0.333 | 1   | 0.2      | -0.655 | 0.8
(0.17, 0.28]    | 3  | 5  | 0  | 2  | 0.6    | 0.375     | 0.462 | 1   | 0.3      | -0.5   | 0.7
(0.13, 0.17]    | 4  | 5  | 0  | 1  | 0.8    | 0.444     | 0.571 | 1   | 0.4      | -0.333 | 0.6
[0, 0.13]       | 5  | 5  | 0  | 0  | 1      | 0.5       | 0.667 | 1   | 0.5      | NaN    | 0.5

The above three tables list the calculation details of recall, precision, fa and the other metrics under different thresholds for the three examples. The corresponding ROC curves and precision-recall curves are plotted as follows:
[Figure: precision-recall curves of the three examples]
[Figure: ROC curves of the three examples]
From the above figures, we can summarize the characteristics of the precision-recall curve:

  • the curve is usually not monotonic
  • sometimes the curve is not defined at recall = 0, since precision is NaN there (when the data point with the highest score is positive)
  • usually, as recall increases, precision decreases with some fluctuation
  • the curve intersects the line precision = recall
  • in the ideal case, the area under the curve is 1

The characteristics of the ROC curve can be summarized as follows:

  • the curve is always monotonic (flat or increasing)
  • in the best case (every positive data point has a higher score than every negative data point), the area under the curve is 1
  • in the worst case (every positive data point has a lower score than every negative data point), the area under the curve is 0
  • in the random case (random scoring), the area under the curve is 0.5 in expectation
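
The random case can be checked empirically; the following small experiment (illustrative only, not from the original examples) scores a balanced sample with uniform random numbers and obtains an AUC close to 0.5:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.concatenate([np.ones(5000, dtype=int), np.zeros(5000, dtype=int)])  # balanced labels
f = rng.random(len(y))                                                     # scores independent of the labels
print(roc_auc_score(y, f))    # close to 0.5, as expected for random scoring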

The following table summarizes auc, average precision and breakeven point in these three examples

example | auc   | average precision | breakeven point
1       | 1.000 | 1.000             | 1.000
2       | 0.565 | 0.467             | 0.600
3       | 0.000 | 0.304             | 0.000

The following table lists the value range of each metric

metrics | auc   | average precision | breakeven point
Range   | [0,1] | [0,1]             | [0,1]
Best    | 1     | 1                 | 1
Worst   | 0     | 0*                | 0

* In the case where there is no positive sample, average precision can reach 0.

2.5. Metrics selection

We have introduced several metrics to measure the performance of binary classification. In a practical case, which metrics should we adopt and which should be avoided?
Usually, there are two cases we can encounter: balanced and unbalanced.

  • In the balanced case, the number of positive samples is close to the number of negative samples
  • In the unbalanced case, the numbers of positive and negative samples differ by orders of magnitude
    • In practical cases, the number of positives is usually much smaller than the number of negatives

The conclusions are:

  • In the balanced case, all the above metrics can be used
  • In the unbalanced case, precision, recall, f1 score, average precision and breakeven point are preferred over fa, accuracy, MCC, MCE, auc and ATOP*

*ATOP is another metric, similar to AUC, in that it also cares about the ordering of positive and negative data points.
The main reason is that

  • precision, recall, f1 score, average precision, breakeven point focus on the correctness of the positive samples (related to TP , but not TN )
  • fa, accuracy, MCC, MCE, auc, ATOP are related to the correctness of the negative samples ( TN )

In the unbalanced case, TN is usually huge compared to TP. Therefore, fa → 0, accuracy → 1, MCC → 1, MCE → 0, auc → 1, ATOP → 1. However, these “amazing” values don’t make any sense.
Let’s consider the following 4 examples:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def atop(y_sorted):
    # ATOP: assumes the labels are already sorted by descending score;
    # equals 1 minus the mean normalized rank of the positive samples
    num = len(y_sorted)
    index = np.arange(num)
    return 1 - np.sum(y_sorted * index) / np.sum(y_sorted) / num

# example 1: "balanced" case, 5 positives among 20 points
y1 = np.array([1,0,1,0,1,0,0,1,1,0] + [0] * 10)
f1 = np.array([i / len(y1) for i in range(len(y1), 0, -1)])   # strictly decreasing scores
auc1 = roc_auc_score(y1, f1)
ap1 = average_precision_score(y1, f1)
atop1 = atop(y1)
print(auc1, atop1, ap1)

# example 2: same ordering of the positives as example 1, but heavily unbalanced
y2 = np.array([1,0,1,0,1,0,0,1,1,0] + [0] * 990)
f2 = np.array([i / len(y2) for i in range(len(y2), 0, -1)])
auc2 = roc_auc_score(y2, f2)
atop2 = atop(y2)
ap2 = average_precision_score(y2, f2)
print(auc2, atop2, ap2)

# example 3: extremely unbalanced, positives ranked 100~199
y3 = np.array([0] * 100 + [1] * 100 + [0] * 999800)
f3 = np.array([i / len(y3) for i in range(len(y3), 0, -1)])
auc3 = roc_auc_score(y3, f3)
atop3 = atop(y3)
ap3 = average_precision_score(y3, f3)
print(auc3, atop3, ap3)

# example 4: extremely unbalanced, positives ranked 0~99 (top of the list)
y4 = np.array([1] * 100 + [0] * 999900)
f4 = np.array([i / len(y4) for i in range(len(y4), 0, -1)])
auc4 = roc_auc_score(y4, f4)
atop4 = atop(y4)
ap4 = average_precision_score(y4, f4)
print(auc4, atop4, ap4)

example | auc     | atop    | average precision
1       | 0.85333 | 0.79000 | 0.62508
2       | 0.99779 | 0.99580 | 0.62508
3       | 0.99990 | 0.99985 | 0.30685
4       | 1.00000 | 0.99995 | 1.00000

Consider examples 1 and 2. The former is balanced and the latter is unbalanced. The ordering of the positive samples is the same in both examples.

  • However, auc and atop change a lot (0.85333 vs. 0.99779, 0.79000 vs. 0.99580). The more unbalanced the case is, the higher auc and atop tend to be.
  • The average precision is the same in both examples (0.62508 vs. 0.62508).

Consider examples 3 and 4. Both are extremely unbalanced. The difference is that the positive samples in example 3 are ranked at positions 100~199, while in example 4 they are ranked at positions 0~99.

  • However, auc and atop are nearly the same in both cases (0.99990 vs. 1.00000, 0.99985 vs. 0.99995).
  • The average precision distinguishes the two cases clearly (0.30685 vs. 1.00000).