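1. Overview
This post covers the following performance measures:
- Error rate and accuracy
- Precision, recall, and F1
- ROC and AUC
Using MNIST as the example, we train and save a model, restart the session, then reload the model and preprocess the test data before running the experiments.
Train and save the model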
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
# Load MNIST and flatten each 28x28 image into a 784-dimensional vector
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(len(X_train), -1)
X_test = X_test.reshape(len(X_test), -1)
# Scale pixel values to [0, 1]
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
# One-hot encode the digit labels
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# A small fully connected network: 784 -> 20 -> 10
model = Sequential()
model.add(Dense(20, input_shape=(784,), activation='relu'))
# Note: softmax is the usual output activation with categorical_crossentropy;
# sigmoid is left unchanged here so the code matches the results reported below
model.add(Dense(10, activation='sigmoid'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=64, verbose=1, validation_split=0.05)
model.save('mnistmodel.h5')
Data preprocessing
from keras.datasets import mnist
from keras.models import load_model
from keras.utils import to_categorical
(_, _), (X_test, y_test) = mnist.load_data()
X_test = X_test.reshape(len(X_test), -1)
X_test = X_test.astype('float32') / 255
y_test = to_categorical(y_test)
model = load_model('mnistmodel.h5')
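2. Performance measures
Performance measure: a criterion for evaluating a model's generalization ability.
3. Error rate and accuracy
Error rate: the fraction of all samples that are classified incorrectly.
Accuracy: the fraction of all samples that are classified correctly.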
$$Accuracy=\frac{TP+TN}{P+N}$$
$$ErrorRate=1-Accuracy$$
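As a quick illustration of the two formulas above, here is a minimal sketch in plain Python. The function name `accuracy_and_error_rate` and the toy labels are made up for this example and are not part of the MNIST experiment.

```python
def accuracy_and_error_rate(y_true, y_pred):
    """Return (accuracy, error_rate) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    # Count the samples whose predicted label equals the true label
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    return accuracy, 1 - accuracy


# Hypothetical toy labels: 8 of the 10 predictions are correct
y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
y_pred = [0, 1, 2, 1, 1, 0, 1, 2, 2, 1]

acc, err = accuracy_and_error_rate(y_true, y_pred)
print('Accuracy: {:.2f}%'.format(acc * 100))
print('Error Rate: {:.2f}%'.format(err * 100))
```

The same computation works for any number of classes, because only label equality is checked.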
3.1 Code
- Keras `evaluate` function
_, accuracy = model.evaluate(X_test, y_test)
print('Accuracy: {:.2f}%'.format(accuracy * 100))
print('Error Rate: {:.2f}%'.format((1 - accuracy) * 100))
Accuracy: 95.86%
Error Rate: 4.14%
- sklearn `accuracy_score` function
import numpy as np
from sklearn.metrics import accuracy_score
y_true = np.argmax(y_test, axis=1)
y_pred = model.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)
accuracy = accuracy_score(y_true, y_pred)
print('Accuracy: {:.2f}%'.format(accuracy * 100))
print('Error rate: {:.2f}%'.format((1 - accuracy) * 100))
Accuracy: 95.86%
Error rate: 4.14%
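4. Precision, recall, confusion matrix, and F1
Precision: of the samples predicted positive, the fraction that are truly positive. This is not the same as accuracy. In the watermelon example: of the melons picked out as good, how many really are good.
Recall: of the samples that are truly positive, the fraction that are predicted positive. In the watermelon example: of all the good melons, how many were picked out.
Precision and recall pull against each other. In general, when precision is high, recall tends to be low, and vice versa.
A product recommender wants to disturb users as little as possible and to recommend only what they are likely to care about, so precision matters more. A fugitive lookup system wants to miss as few fugitives as possible, so recall matters more.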
$$Precision=\frac{TP}{TP+FP}$$
$$Recall=\frac{TP}{TP+FN}$$
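Confusion matrix
Ground truth | Predicted positive | Predicted negative |
---|---|---|
Actual positive | TP (true positive) | FN (false negative) |
Actual negative | FP (false positive) | TN (true negative) |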
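To make the four cells of the confusion matrix concrete, the sketch below counts them for a binary problem and computes precision and recall at two thresholds. The scores and labels are invented toy data, and the helper names `confusion_counts` and `precision_recall` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, FN, FP, TN) for binary labels, where 1 = positive and 0 = negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn


def precision_recall(tp, fn, fp):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: 4 positives, 6 negatives, with a predicted score each
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.40, 0.70, 0.50, 0.30, 0.20, 0.10, 0.05]

for threshold in (0.8, 0.45):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    p, r = precision_recall(tp, fn, fp)
    print('threshold={:.2f} TP={} FN={} FP={} TN={} precision={:.2f} recall={:.2f}'.format(
        threshold, tp, fn, fp, tn, p, r))
```

On this toy data, lowering the threshold from 0.8 to 0.45 raises recall while lowering precision, which is the trade-off described above.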
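P-R curve: the precision-recall curve, with precision on the vertical axis and recall on the horizontal axis.
Break-Even Point (BEP): the value at which precision equals recall.
F1: defined as the harmonic mean of precision and recall.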
$$F1=\frac{2 \times Precision \times Recall}{Precision+Recall}=\frac{2 \times TP}{\text{total samples}+TP-TN}$$
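The following sketch walks down a ranking of scores to trace P-R points, picks out the break-even point, and checks that the harmonic-mean form of F1 agrees with the count-based form. The toy data and the helper names (`pr_points`, `f1_harmonic`, `f1_from_counts`) are assumptions made for illustration.

```python
def pr_points(y_true, scores):
    """Return one (precision, recall) pair per cut point in the score ranking."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    points = []
    tp = 0
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label  # the top-k samples are treated as predicted positives
        points.append((tp / k, tp / n_pos))
    return points


def f1_harmonic(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp, fn, fp, tn):
    """F1 written with counts: 2*TP / (total samples + TP - TN)."""
    total = tp + fn + fp + tn
    return 2 * tp / (total + tp - tn)


# Hypothetical toy data: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.40, 0.70, 0.50, 0.30, 0.20, 0.10, 0.05]

points = pr_points(y_true, scores)
bep = [p for p, r in points if abs(p - r) < 1e-12]
print('Break-even point(s):', bep)

# Cut after the top 5 samples: TP=3, FP=2, FN=1, TN=4
precision, recall = points[4]
print('F1 (harmonic mean): {:.4f}'.format(f1_harmonic(precision, recall)))
print('F1 (from counts):   {:.4f}'.format(f1_from_counts(3, 1, 2, 4)))
```

Since precision is TP/k and recall is TP/(number of positives) at cut k, the two are equal exactly when k equals the number of positives, which is why a single break-even point appears here.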
4.1 Code
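from sklearn.metrics import precision_score, recall_score, f1_score
# y_true and y_pred are the integer class labels computed in section 3.1.
# Each call below uses a different averaging mode:
# macro = unweighted mean over classes, micro = computed from global counts,
# weighted = mean over classes weighted by class support
precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='micro')
f1 = f1_score(y_true, y_pred, average='weighted')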
print('Precision: {:.2f}%'.format(precision * 100))
print('Recall: {:.2f}%'.format(recall * 100))
print('F1: {:.2f}%'.format(f1 * 100))
Precision: 95.87%
Recall: 95.86%
F1: 95.86%
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_pred)
array([[ 958, 0, 5, 3, 1, 2, 5, 3, 3, 0],
[ 0, 1120, 5, 2, 0, 1, 3, 0, 4, 0],
[ 2, 2, 1000, 8, 3, 1, 3, 3, 9, 1],
[ 0, 0, 9, 981, 1, 6, 0, 6, 4, 3],
[ 2, 0, 8, 3, 930, 0, 8, 5, 3, 23],
[ 6, 3, 2, 25, 1, 834, 6, 2, 9, 4],
[ 11, 4, 1, 1, 5, 8, 925, 0, 3, 0],
[ 1, 5, 19, 3, 3, 1, 0, 982, 4, 10],
[ 5, 5, 4, 26, 6, 7, 4, 4, 906, 7],
[ 5, 6, 2, 10, 14, 6, 0, 6, 10, 950]],
dtype=int64)
import seaborn as sn
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6), dpi=100)
# fmt='d' prints the counts as integers instead of scientific notation
sn.heatmap(confusion_matrix(y_true, y_pred), annot=True, fmt='d')
plt.show()
Classification report: per-class precision, recall, F1, and support (the number of samples in each class)
from sklearn.metrics import classification_report
target_names = ['0', '1', '2','3','4','5','6','7','8','9']
print(classification_report(y_true, y_pred, target_names=target_names))
precision recall f1-score support
0 0.97 0.98 0.97 980
1 0.98 0.99 0.98 1135
2 0.95 0.97 0.96 1032
3 0.92 0.97 0.95 1010
4 0.96 0.95 0.96 982
5 0.96 0.93 0.95 892
6 0.97 0.97 0.97 958
7 0.97 0.96 0.96 1028
8 0.95 0.93 0.94 974
9 0.95 0.94 0.95 1009
micro avg 0.96 0.96 0.96 10000
macro avg 0.96 0.96 0.96 10000
weighted avg 0.96 0.96 0.96 10000
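The per-class and averaged numbers can be recomputed by hand from the confusion matrix printed earlier, which may help show what the `macro`, `micro`, and `weighted` options mean. The sketch below copies that matrix as a plain list (rows are true digits, columns are predicted digits) and assumes the usual definitions of the three averaging modes; the variable names are made up for this example.

```python
# Confusion matrix copied from the output above: rows = true digit, columns = predicted digit
cm = [
    [958, 0, 5, 3, 1, 2, 5, 3, 3, 0],
    [0, 1120, 5, 2, 0, 1, 3, 0, 4, 0],
    [2, 2, 1000, 8, 3, 1, 3, 3, 9, 1],
    [0, 0, 9, 981, 1, 6, 0, 6, 4, 3],
    [2, 0, 8, 3, 930, 0, 8, 5, 3, 23],
    [6, 3, 2, 25, 1, 834, 6, 2, 9, 4],
    [11, 4, 1, 1, 5, 8, 925, 0, 3, 0],
    [1, 5, 19, 3, 3, 1, 0, 982, 4, 10],
    [5, 5, 4, 26, 6, 7, 4, 4, 906, 7],
    [5, 6, 2, 10, 14, 6, 0, 6, 10, 950],
]
n_classes = len(cm)

# Row sums = samples that truly belong to each class (the "support" column)
support = [sum(row) for row in cm]
# Column sums = samples predicted as each class
predicted = [sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)]
total = sum(support)

# Per-class metrics, treating each digit in turn as the positive class
precision = [cm[i][i] / predicted[i] for i in range(n_classes)]
recall = [cm[i][i] / support[i] for i in range(n_classes)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

# macro: plain mean over classes
macro_precision = sum(precision) / n_classes
# micro: pool all classes; for single-label data micro recall equals accuracy
micro_recall = sum(cm[i][i] for i in range(n_classes)) / total
# weighted: mean over classes, weighted by support
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total

print('Macro precision: {:.2f}%'.format(macro_precision * 100))
print('Micro recall:    {:.2f}%'.format(micro_recall * 100))
print('Weighted F1:     {:.2f}%'.format(weighted_f1 * 100))
```

These three values should line up with the Precision, Recall, and F1 figures printed in section 4.1, and the row sums match the support column of the report.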
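5. ROC and AUC
Many learners produce a probability-like score for each test sample and compare it with a classification threshold: a score above the threshold means positive, otherwise negative. Sort the test samples by score, with the most likely positives first and the least likely last. Classification then amounts to choosing a cut point in this ranking: everything before it is labeled positive, everything after it negative.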
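The equivalence between "compare with a threshold" and "cut the ranking at some position" can be sketched in a few lines. The scores are invented, and `classify_by_cut` and `classify_by_threshold` are hypothetical helper names.

```python
def classify_by_threshold(scores, threshold):
    """Label a sample positive (1) when its score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]


def classify_by_cut(scores, cut):
    """Rank samples by descending score and label the first `cut` of them positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [0] * len(scores)
    for i in order[:cut]:
        labels[i] = 1
    return labels


# Hypothetical scores for five test samples
scores = [0.10, 0.80, 0.35, 0.65, 0.50]

# Cutting after the top 2 samples gives the same labels as any threshold
# between the 2nd-highest score (0.65) and the 3rd-highest score (0.50)
by_cut = classify_by_cut(scores, 2)
by_threshold = classify_by_threshold(scores, 0.6)
print(by_cut)
print(by_threshold)
```

Moving the cut point down the ranking corresponds to lowering the threshold, which is what tracing an ROC curve does.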
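ROC (Receiver Operating Characteristic): a curve with the true positive rate (TPR) on the vertical axis and the false positive rate (FPR) on the horizontal axis. The farther the curve lies above the diagonal, toward the top-left corner, the better.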
$$TPR=\frac{TP}{TP+FN}$$
$$FPR=\frac{FP}{TN+FP}$$
AUC (Area Under ROC Curve): the area under the ROC curve.
$$AUC=\frac{1}{2}\sum_{i=1}^{m-1}(x_{i+1}-x_{i})\cdot(y_{i}+y_{i+1})$$
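The AUC formula above is the trapezoidal rule applied to consecutive ROC points (x = FPR, y = TPR). Below is a minimal plain-Python sketch for a binary problem. The toy data and the names `roc_points` and `auc_trapezoid` are assumptions for illustration, not the sklearn implementation.

```python
def roc_points(y_true, scores):
    """Return ROC points (FPR, TPR), starting at (0, 0), one per distinct score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all samples tied at this score are counted
        is_last = idx == len(ranked) - 1
        if is_last or ranked[idx + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """AUC = 1/2 * sum over i of (x_{i+1} - x_i) * (y_i + y_{i+1})."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1)
    return area / 2


# Hypothetical toy data: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.40, 0.70, 0.50, 0.30, 0.20, 0.10, 0.05]

points = roc_points(y_true, scores)
auc_value = auc_trapezoid(points)
print('AUC: {:.4f}'.format(auc_value))
```

On this toy data, 21 of the 24 positive-negative pairs are ranked in the right order, and the trapezoidal area comes out to the same fraction, 21/24.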
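5.1 Code
- sklearn `roc_auc_score` function: not used here. Older scikit-learn releases did not accept multiclass targets in this function; newer releases do, through the `multi_class` parameter.
- sklearn `roc_curve` and `auc` functions: `roc_curve` handles one binary problem at a time, so the code below applies it to each class one-vs-rest and then computes micro and macro averages.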
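Applying a binary ROC routine to a 10-class problem relies on the one-hot labels: column i is a binary "is this class i or not" target, and flattening the whole matrix pools every (sample, class) pair for the micro-average. The sketch below shows the idea with plain lists; the labels and the `one_hot` helper are hypothetical.

```python
def one_hot(labels, n_classes):
    """Turn integer labels into one-hot rows, similar in spirit to to_categorical."""
    return [[1 if label == k else 0 for k in range(n_classes)] for label in labels]


# Hypothetical labels for four samples in a 3-class problem
labels = [0, 2, 1, 2]
y_onehot = one_hot(labels, 3)

# One-vs-rest target for class 2: the analogue of y_test[:, 2]
column_2 = [row[2] for row in y_onehot]

# Micro-averaging pools every (sample, class) pair: the analogue of y_test.ravel()
flat = [value for row in y_onehot for value in row]

print(y_onehot)
print(column_2)
print(flat)
```

Each per-class curve then uses one column of labels with the matching column of scores, while the micro-average treats the flattened labels and flattened scores as one large binary problem.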
from sklearn.metrics import roc_curve, auc
y_scores = model.predict(X_test)
# AUC of each class
n_classes = 10
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    # One-vs-rest: column i of the one-hot labels against the score for class i
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_scores[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
    print('AUC class {}: {:.4f}'.format(i, roc_auc[i]))
# AUC of micro-average
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_scores.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
print('AUC micro-average: {:.4f}'.format(roc_auc["micro"]))
# AUC of macro-average
# Collect every FPR value seen across classes, interpolate each class's TPR there, then average
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
print('AUC macro-average: {:.4f}'.format(roc_auc["macro"]))
AUC class 0: 0.9921
AUC class 1: 0.9928
AUC class 2: 0.9786
AUC class 3: 0.9373
AUC class 4: 0.9849
AUC class 5: 0.9616
AUC class 6: 0.9897
AUC class 7: 0.9879
AUC class 8: 0.8285
AUC class 9: 0.9470
AUC micro-average: 0.9676
AUC macro-average: 0.9601
plt.figure(figsize=(8, 6), dpi=100)
plt.plot([0, 1], [0, 1])  # diagonal: the ROC of random guessing
plt.plot(fpr['micro'], tpr['micro'], label='ROC micro-average (area = {:.4})'.format(roc_auc['micro']), linestyle='--')
plt.plot(fpr['macro'], tpr['macro'], label='ROC macro-average (area = {:.4})'.format(roc_auc['macro']), linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.title('ROC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend(loc='best')
plt.show()
plt.figure(figsize=(8, 6), dpi=100)
plt.plot([0, 1], [0, 1])  # diagonal: the ROC of random guessing
for i in range(n_classes):
    plt.plot(fpr[i], tpr[i], label='ROC class {} (area = {:.4})'.format(i, roc_auc[i]), linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.title('ROC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend(loc='best')
plt.show()
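6. References
- Zhou Zhihua. Machine Learning [M]. Beijing: Tsinghua University Press, 2016: 28.
- python + sklearn | Evaluating classification performance: acc, recall, F1, ROC, regression, distance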