Suppose we have 12 samples belonging to three classes A, B, and C, and a classifier produces the following results:
True | Predicted |
---|---|
A | B |
A | A |
A | A |
A | C |
B | B |
B | B |
B | A |
B | C |
C | C |
C | C |
C | A |
C | C |
For multi-class classification, the P value, R value, and F1 value are all defined per class. For class A, A is the positive class and every other class is negative.
Confusion matrix for class A:
 | Predicted positive | Predicted negative |
---|---|---|
Actual positive | 2 (TP) | 2 (FN) |
Actual negative | 2 (FP) | 6 (TN) |
Confusion matrix for class B:
 | Predicted positive | Predicted negative |
---|---|---|
Actual positive | 2 (TP) | 2 (FN) |
Actual negative | 1 (FP) | 7 (TN) |
Confusion matrix for class C:
 | Predicted positive | Predicted negative |
---|---|---|
Actual positive | 3 (TP) | 1 (FN) |
Actual negative | 2 (FP) | 6 (TN) |
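The three one-vs-rest confusion matrices above can be cross-checked with a short script; this is just a sketch that recomputes the counts from the twelve (true, predicted) pairs in the sample table:

```python
# One-vs-rest confusion counts for each class, from the sample table above.
y_true = ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"]
y_pred = ["B", "A", "A", "C", "B", "B", "A", "C", "C", "C", "A", "C"]

counts = {}
for cls in ["A", "B", "C"]:
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))  # true positives
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))  # false negatives
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))  # false positives
    tn = sum(t != cls and p != cls for t, p in zip(y_true, y_pred))  # true negatives
    counts[cls] = (tp, fn, fp, tn)
    print(cls, counts[cls])
```

Note that the four counts for each class always sum to 12, the total number of samples.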
P value
precision
Among the samples predicted as class A, the fraction that are actually class A:
$P_{A}=\frac{TP}{TP+FP}=\frac{2}{2+2}=\frac{1}{2}$
Among the samples predicted as class B, the fraction that are actually class B:
$P_{B}=\frac{2}{2+1}=\frac{2}{3}$
Among the samples predicted as class C, the fraction that are actually class C:
$P_{C}=\frac{3}{3+2}=\frac{3}{5}$
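The three precision values can be recomputed in one pass; a minimal sketch using only the standard library (the two lists repeat the sample table above):

```python
from collections import Counter

y_true = ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"]
y_pred = ["B", "A", "A", "C", "B", "B", "A", "C", "C", "C", "A", "C"]

predicted = Counter(y_pred)                                 # TP + FP per class
tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)   # TP per class
precision = {c: tp[c] / predicted[c] for c in ["A", "B", "C"]}
print(precision)  # {'A': 0.5, 'B': 0.666..., 'C': 0.6}
```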
R value
recall
Among the samples actually in class A, the fraction predicted as class A:
$R_{A}=\frac{TP}{TP+FN}=\frac{2}{2+2}=\frac{1}{2}$
Among the samples actually in class B, the fraction predicted as class B:
$R_{B}=\frac{2}{2+2}=\frac{1}{2}$
Among the samples actually in class C, the fraction predicted as class C:
$R_{C}=\frac{3}{3+1}=\frac{3}{4}$
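Recall is the mirror image of the precision computation: divide each class's TP by the number of samples that actually belong to the class. A sketch:

```python
from collections import Counter

y_true = ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"]
y_pred = ["B", "A", "A", "C", "B", "B", "A", "C", "C", "C", "A", "C"]

actual = Counter(y_true)                                    # TP + FN per class
tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)   # TP per class
recall = {c: tp[c] / actual[c] for c in ["A", "B", "C"]}
print(recall)  # {'A': 0.5, 'B': 0.5, 'C': 0.75}
```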
F1 value
The F1 value is the harmonic mean of the P value and the R value:
$\frac{2}{F_{1}}=\frac{1}{P}+\frac{1}{R}$
$F_{1}=\frac{2*P*R}{P+R}$
$F_{A}=\frac{2*\frac{1}{2}*\frac{1}{2}}{\frac{1}{2}+\frac{1}{2}}=0.5$
$F_{B}=\frac{2*\frac{2}{3}*\frac{1}{2}}{\frac{2}{3}+\frac{1}{2}}=\frac{4}{7}\approx 0.571$
$F_{C}=\frac{2*\frac{3}{5}*\frac{3}{4}}{\frac{3}{5}+\frac{3}{4}}=\frac{2}{3}\approx 0.667$
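The three per-class F1 values follow directly from the P and R values computed above; a sketch:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Per-class F1 from the P and R values derived above.
print(round(f1(1/2, 1/2), 3))   # class A -> 0.5
print(round(f1(2/3, 1/2), 3))   # class B -> 0.571
print(round(f1(3/5, 3/4), 3))   # class C -> 0.667
```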
A quick extension: the F1 value treats the P value and the R value as equally important. When one of the two matters more, the formula generalizes slightly:
$F_{\beta}=\frac{(1+\beta^{2})*P*R}{\beta^{2}*P+R}$
When $\beta=1$, this is exactly the $F_{1}$ value; $\beta<1$ puts more weight on the P value, and $\beta>1$ puts more weight on the R value.
In grammatical error correction, the P value usually matters more, so $F_{0.5}$ is used: among the items flagged as errors, the fraction of genuine errors should be high, because false alarms feel bad to users.
$F_{0.5}=\frac{(1+0.5^{2})*P*R}{0.5^{2}*P+R}$
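The F-beta formula can be sketched the same way; class B's P and R values from above are used here purely as an illustration of how $\beta=0.5$ pulls the score from F1 toward the P value:

```python
def f_beta(p, r, beta):
    """Weighted harmonic mean: beta < 1 favors precision, beta > 1 favors recall."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = 2/3, 1/2            # class B's precision and recall from above
print(f_beta(p, r, 1.0))   # beta=1 reduces to F1 (4/7 ~ 0.571)
print(f_beta(p, r, 0.5))   # 0.625, closer to P (2/3) than F1 is
```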
Micro-F1
Ignores class distinctions and computes Micro-F1 from the overall P and R values.
P: among all samples predicted as A, B, or C, the fraction predicted correctly
R: among all samples actually in A, B, or C, the fraction predicted correctly
$P=R=F_{1}=\frac{7}{12}\approx 0.583$
These P, R, and F1 values are always equal here, because every sample receives exactly one prediction (each false positive for one class is a false negative for another); Micro-F1 is therefore equivalent to multi-class accuracy.
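Micro-averaging can be verified by pooling the counts across classes; a sketch showing that pooled P and pooled R come out identical:

```python
y_true = ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"]
y_pred = ["B", "A", "A", "C", "B", "B", "A", "C", "C", "C", "A", "C"]

tp_total = sum(t == p for t, p in zip(y_true, y_pred))  # 7 pooled true positives
micro_p = tp_total / len(y_pred)   # pooled TP / (TP + FP): every prediction counted
micro_r = tp_total / len(y_true)   # pooled TP / (TP + FN): every true label counted
print(micro_p, micro_r)            # both 7/12 ~ 0.583, i.e. plain accuracy
```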
Macro-F1
The macro average can be computed in two ways:
- Average the per-class F1 values directly:
$\text{Macro-F1(type1)}=\frac{1}{C}\sum_{i=1}^{C}F_{i}$
- Average the per-class P and R values first, then compute F1 from the averages:
$\text{Macro-P}=\frac{1}{C}\sum_{i=1}^{C}P_{i}$
$\text{Macro-R}=\frac{1}{C}\sum_{i=1}^{C}R_{i}$
$\text{Macro-F1(type2)}=\frac{2*\text{Macro-P}*\text{Macro-R}}{\text{Macro-P}+\text{Macro-R}}$
sklearn computes the first one.
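The two macro variants genuinely differ on this example; a sketch computing both from the per-class values above:

```python
P = [1/2, 2/3, 3/5]   # per-class precision values from above
R = [1/2, 1/2, 3/4]   # per-class recall values from above
F = [2 * p * r / (p + r) for p, r in zip(P, R)]

macro_f1_type1 = sum(F) / len(F)                            # average of per-class F1
macro_p, macro_r = sum(P) / len(P), sum(R) / len(R)
macro_f1_type2 = 2 * macro_p * macro_r / (macro_p + macro_r)
print(round(macro_f1_type1, 4))  # 0.5794, matches sklearn's average="macro"
print(round(macro_f1_type2, 4))  # 0.5861, a slightly different number
```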
Verification in code:
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score

y_true = ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"]
y_pred = ["B", "A", "A", "C", "B", "B", "A", "C", "C", "C", "A", "C"]
print(classification_report(y_true=y_true, y_pred=y_pred, labels=["A", "B", "C"]))
print(f1_score(y_true=y_true, y_pred=y_pred, labels=["A", "B", "C"], average="micro"))
print(f1_score(y_true=y_true, y_pred=y_pred, labels=["A", "B", "C"], average="macro"))
Output:
precision recall f1-score support
A 0.50 0.50 0.50 4
B 0.67 0.50 0.57 4
C 0.60 0.75 0.67 4
avg / total 0.59 0.58 0.58 12
0.5833333333333334
0.5793650793650794