III. Evaluation Metrics and Scoring
2. Metrics for binary classification
(3) Confusion matrices
from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(y_test, pred_logreg)
print("Confusion matrix:\n{}".format(confusion))
Confusion matrix:
[[401   2]
 [  8  39]]
mglearn.plots.plot_confusion_matrix_illustration()
That is, the confusion matrix for binary classification:
mglearn.plots.plot_binary_confusion_matrix()
Use the confusion matrix to compare the models fitted earlier (the two dummy models, the decision tree, and logistic regression):
print("Most frequent class:")
print(confusion_matrix(y_test, pred_most_frequent))
print("\nDummy model:")
print(confusion_matrix(y_test, pred_dummy))
print("\nDecision tree:")
print(confusion_matrix(y_test, pred_tree))
print("\nLogistic Regression")
print(confusion_matrix(y_test, pred_logreg))
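Most frequent class:
[[403   0]
 [ 47   0]]

Dummy model:
[[369  34]
 [ 43   4]]

Decision tree:
[[390  13]
 [ 24  23]]

Logistic Regression
[[401   2]
 [  8  39]]
This comparison makes clear that only the decision tree and logistic regression give reasonable results, and that logistic regression does better than the decision tree in every respect.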
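A few formulas, written in terms of the confusion-matrix entries TP (true positives), TN (true negatives), FP (false positives), and FN (false negatives):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
These are the formulas for accuracy, precision, recall, and the f-score.
Comparing the f-scores of the models: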
from sklearn.metrics import f1_score
print("f1 score most frequent: {:.2f}".format(
    f1_score(y_test, pred_most_frequent)))
print("f1 score dummy: {:.2f}".format(f1_score(y_test, pred_dummy)))
print("f1 score tree: {:.2f}".format(f1_score(y_test, pred_tree)))
print("f1 score logistic regression: {:.2f}".format(
    f1_score(y_test, pred_logreg)))
f1 score most frequent: 0.00
f1 score dummy: 0.09
f1 score tree: 0.55
f1 score logistic regression: 0.89
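The f1 values printed here can be reproduced by hand from the four confusion matrices shown earlier. The sketch below is plain Python, not scikit-learn's implementation; the helper name `binary_metrics` and the zero-division guards are my own choices, and it assumes the matrices are laid out as [[TN, FP], [FN, TP]] with "nine" as the positive class.

```python
# Confusion matrices copied from the output earlier in these notes,
# assumed layout: [[TN, FP], [FN, TP]] with "nine" as the positive class.
matrices = {
    "most frequent": [[403, 0], [47, 0]],
    "dummy": [[369, 34], [43, 4]],
    "tree": [[390, 13], [24, 23]],
    "logistic regression": [[401, 2], [8, 39]],
}


def binary_metrics(matrix):
    """Return (accuracy, precision, recall, f1) for a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against 0/0 when the model never predicts the positive class.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


for name, matrix in matrices.items():
    acc, prec, rec, f1 = binary_metrics(matrix)
    print("{:<20} accuracy={:.2f} precision={:.2f} recall={:.2f} f1={:.2f}"
          .format(name, acc, prec, rec, f1))
```

Note that the "most frequent" model still reaches an accuracy of about 0.90 while its f1 is 0, which is why accuracy alone is a poor guide on imbalanced data.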
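To get a more comprehensive summary of precision, recall, and f1-score, use the classification_report function, which computes all three at once and prints them in a readable format.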
from sklearn.metrics import classification_report
print(classification_report(y_test, pred_most_frequent,
target_names=["not nine", "nine"]))
             precision    recall  f1-score   support

   not nine       0.90      1.00      0.94       403
       nine       0.00      0.00      0.00        47

avg / total       0.80      0.90      0.85       450
print(classification_report(y_test, pred_dummy,
target_names=["not nine", "nine"]))
             precision    recall  f1-score   support

   not nine       0.90      0.92      0.91       403
       nine       0.11      0.09      0.09        47

avg / total       0.81      0.83      0.82       450
print(classification_report(y_test, pred_logreg,
target_names=["not nine", "nine"]))
             precision    recall  f1-score   support

   not nine       0.98      1.00      0.99       403
       nine       0.95      0.83      0.89        47

avg / total       0.98      0.98      0.98       450
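The "avg / total" row of these reports appears to be the per-class values averaged with each class weighted by its support. The following sketch checks that reading against the logistic regression confusion matrix from earlier; the variable and function names are made up for illustration.

```python
# Logistic regression confusion matrix, assumed layout [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = [[401, 2], [8, 39]]

# Per-class precision, recall, and support (number of true samples per class).
rows = {
    "not nine": {"precision": tn / (tn + fn), "recall": tn / (tn + fp),
                 "support": tn + fp},
    "nine": {"precision": tp / (tp + fp), "recall": tp / (tp + fn),
             "support": tp + fn},
}

total = sum(r["support"] for r in rows.values())


def weighted_average(metric):
    """Average a per-class metric, weighting each class by its support."""
    return sum(r[metric] * r["support"] for r in rows.values()) / total


weighted_precision = weighted_average("precision")
weighted_recall = weighted_average("recall")
print("avg / total precision: {:.2f}".format(weighted_precision))
print("avg / total recall: {:.2f}".format(weighted_recall))
```

Because the majority class dominates the weights, this row can look good even when the minority class is handled badly, as in the "most frequent" report.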
(4) Taking uncertainty into account
Below is an imbalanced binary classification task:
from mglearn.datasets import make_blobs
X, y = make_blobs(n_samples=(400, 50), centers=2, cluster_std=[7.0, 2],
random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
svc = SVC(gamma=.05).fit(X_train, y_train)
print(classification_report(y_test, svc.predict(X_test)))
             precision    recall  f1-score   support

          0       0.97      0.89      0.93       104
          1       0.35      0.67      0.46         9

avg / total       0.92      0.88      0.89       113
y_pred_lower_threshold = svc.decision_function(X_test) > -.8
print(classification_report(y_test, y_pred_lower_threshold))
             precision    recall  f1-score   support

          0       1.00      0.82      0.90       104
          1       0.32      1.00      0.49         9

avg / total       0.95      0.83      0.87       113
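The main idea here is the trade-off between recall and precision, which is adjusted by choosing the decision threshold. In the output above, lowering the threshold from the default 0 to -0.8 raised the recall of class 1 from 0.67 to 1.00, while its precision dropped from 0.35 to 0.32. Consult further material when this comes up in practice.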
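To make the effect of the threshold concrete, here is a small plain-Python sketch. The scores and labels are hypothetical (they are not the SVC outputs from these notes), and `precision_recall_at` is an illustrative helper, not a library function.

```python
# Hypothetical decision scores and true labels (1 = positive class).
scores = [-2.1, -1.4, -0.9, -0.6, -0.3, 0.2, 0.4, 1.1]
labels = [0, 0, 1, 0, 1, 0, 1, 1]


def precision_recall_at(threshold):
    """Classify as positive when score > threshold; return (precision, recall)."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.0, -0.8):
    p, r = precision_recall_at(threshold)
    print("threshold={:+.1f} precision={:.2f} recall={:.2f}".format(threshold, p, r))
```

On this toy data, moving the threshold from 0 to -0.8 catches one more positive (recall rises from 0.50 to 0.75) at the cost of one more false positive (precision falls from about 0.67 to 0.60).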
(5) Precision-recall curves
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(
y_test, svc.decision_function(X_test))
# create a similar dataset as before, but with more samples
# to get a smoother curve
X, y = make_blobs(n_samples=(4000, 500), centers=2, cluster_std=[7.0, 2],
random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
svc = SVC(gamma=.05).fit(X_train, y_train)
precision, recall, thresholds = precision_recall_curve(
y_test, svc.decision_function(X_test))
# find threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))
plt.plot(precision[close_zero], recall[close_zero], 'o', markersize=10,
label="threshold zero", fillstyle="none", c='k', mew=2)
plt.plot(precision, recall, label="precision recall curve")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc="best")
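We can see that a recall of 0.4 can be achieved at a precision of about 0.75. The black circle marks the point that corresponds to a threshold of 0, the default threshold for decision_function. This point is the trade-off chosen when calling the predict method.
The closer a curve stays to the upper right corner, the better the classifier. A point at the upper right means high precision and high recall for the same threshold.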
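What a precision-recall curve contains can be sketched without any library: sweep the threshold over every distinct score and record precision and recall at each step. The data below is hypothetical, the function name is made up, and the ">= threshold" convention is an assumption of this sketch.

```python
def precision_recall_points(labels, scores):
    """Return (threshold, precision, recall) for each distinct score.

    A sample is predicted positive when its score is >= the threshold.
    Assumes at least one positive label.
    """
    n_pos = sum(labels)
    points = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        # tp + fp >= 1 because the threshold is itself one of the scores.
        points.append((threshold, tp / (tp + fp), tp / n_pos))
    return points


# Hypothetical decision scores and true labels (1 = positive class).
scores = [-2.1, -1.4, -0.9, -0.6, -0.3, 0.2, 0.4, 1.1]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

for threshold, precision, recall in precision_recall_points(labels, scores):
    print("threshold={:+.1f} precision={:.2f} recall={:.2f}".format(
        threshold, precision, recall))
```

Raising the threshold never increases recall, while precision can move in either direction, which is why the curve is usually jagged on small datasets.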
Below we compare the precision-recall curves of the SVM and a random forest:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=0, max_features=2)
rf.fit(X_train, y_train)
# RandomForestClassifier has predict_proba, but not decision_function
precision_rf, recall_rf, thresholds_rf = precision_recall_curve(
y_test, rf.predict_proba(X_test)[:, 1])
plt.plot(precision, recall, label="svc")
plt.plot(precision[close_zero], recall[close_zero], 'o', markersize=10,
label="threshold zero svc", fillstyle="none", c='k', mew=2)
plt.plot(precision_rf, recall_rf, label="rf")
close_default_rf = np.argmin(np.abs(thresholds_rf - 0.5))
plt.plot(precision_rf[close_default_rf], recall_rf[close_default_rf], '^', c='k',
markersize=10, label="threshold 0.5 rf", fillstyle="none", mew=2)
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc="best")
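This comparison plot shows that the random forest performs better at the extremes, that is, when very high recall or very high precision is required. Around the middle (precision of about 0.7), the SVM performs better.
Comparing f1-score and average precision: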
print("f1_score of random forest: {:.3f}".format(
    f1_score(y_test, rf.predict(X_test))))
print("f1_score of svc: {:.3f}".format(f1_score(y_test, svc.predict(X_test))))
f1_score of random forest: 0.610
f1_score of svc: 0.656
from sklearn.metrics import average_precision_score
ap_rf = average_precision_score(y_test, rf.predict_proba(X_test)[:, 1])
ap_svc = average_precision_score(y_test, svc.decision_function(X_test))
print("Average precision of random forest: {:.3f}".format(ap_rf))
print("Average precision of svc: {:.3f}".format(ap_svc))
Average precision of random forest: 0.660
Average precision of svc: 0.666
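Average precision summarizes the whole precision-recall curve in one number, whereas f1 only looks at the default threshold. One common way to compute it, sketched here in plain Python on hypothetical data, is to rank samples by score and average the precision measured at the rank of each positive sample. This assumes no tied scores; the function name is illustrative and the code is not scikit-learn's implementation.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample.

    Assumes no tied scores and at least one positive label.
    """
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)  # precision among the top `rank`
    return sum(precisions) / len(precisions)


# Hypothetical decision scores and true labels (1 = positive class).
scores = [-2.1, -1.4, -0.9, -0.6, -0.3, 0.2, 0.4, 1.1]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

print("Average precision: {:.3f}".format(average_precision(labels, scores)))
```

A ranking that places every positive ahead of every negative gives an average precision of 1, regardless of where the default threshold falls.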
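(6) Receiver operating characteristics (ROC) and AUC
The receiver operating characteristics curve is called the ROC curve for short. Like the precision-recall curve, the ROC curve considers all possible thresholds for a given classifier, but instead of reporting precision and recall it shows the false positive rate (FPR) against the true positive rate (TPR). The true positive rate is simply another name for recall, while the false positive rate is the fraction of false positives out of all negative samples:
$$\text{FPR} = \frac{FP}{FP + TN}$$
The ROC curve can be computed with the roc_curve function: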
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, svc.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# find threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10,
label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
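For the ROC curve, the ideal curve is close to the top left: you want a classifier that produces a high recall while keeping the false positive rate low. The curve shows that, compared with the default threshold of 0, we can reach a noticeably higher recall (around 0.9) while the FPR increases only slightly. The point closest to the top left may be a better operating point than the one chosen by default. Again, note that the threshold should not be chosen on the test set, but on a separate validation set.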
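A rough sketch of how one might pick the operating point closest to the top left corner. The labels and scores stand in for a hypothetical validation set, `roc_points` is an illustrative helper rather than scikit-learn's roc_curve, and Euclidean distance to the corner (FPR=0, TPR=1) is just one possible selection rule.

```python
import math


def roc_points(labels, scores):
    """Return (threshold, fpr, tpr) for each distinct score, highest first.

    A sample is predicted positive when its score is >= the threshold.
    Assumes both classes are present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points


# Hypothetical validation-set scores and labels (1 = positive class).
scores = [-2.1, -1.4, -0.9, -0.6, -0.3, 0.2, 0.4, 1.1]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

points = roc_points(labels, scores)
# Pick the operating point nearest the ideal corner (FPR=0, TPR=1).
best = min(points, key=lambda p: math.hypot(p[1], 1.0 - p[2]))
print("threshold={:+.1f} FPR={:.2f} TPR={:.2f}".format(*best))
```

The threshold found this way would then be applied unchanged to the test set, so that the test set plays no part in choosing it.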
Below is a comparison of the ROC curves of the random forest and the SVM:
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr, label="ROC Curve SVC")
plt.plot(fpr_rf, tpr_rf, label="ROC Curve RF")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10,
label="threshold zero SVC", fillstyle="none", c='k', mew=2)
close_default_rf = np.argmin(np.abs(thresholds_rf - 0.5))
# use tpr_rf (not the SVC's tpr) so the marker lies on the random forest curve
plt.plot(fpr_rf[close_default_rf], tpr_rf[close_default_rf], '^', markersize=10,
label="threshold 0.5 RF", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
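As with the precision-recall curve, we often want to summarize the ROC curve with a single number, the area under the curve (commonly called the AUC, where the curve in question is understood to be the ROC curve). We can compute the area under the ROC curve with the roc_auc_score function: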
from sklearn.metrics import roc_auc_score
rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
svc_auc = roc_auc_score(y_test, svc.decision_function(X_test))
print("AUC for Random Forest: {:.3f}".format(rf_auc))
print("AUC for SVC: {:.3f}".format(svc_auc))
AUC for Random Forest: 0.937
AUC for SVC: 0.916
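A useful way to read the AUC is as a ranking statistic: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on hypothetical data; it is quadratic in the number of samples and only meant as an illustration, not as a replacement for roc_auc_score.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. Assumes both classes are present.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical decision scores and true labels (1 = positive class).
scores = [-2.1, -1.4, -0.9, -0.6, -0.3, 0.2, 0.4, 1.1]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

print("AUC: {:.4f}".format(auc_by_ranking(labels, scores)))
```

Here 13 of the 16 positive/negative pairs are ordered correctly, giving 0.8125. Random scores would land near 0.5, and a perfect ranking gives 1.0.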
An example comparing the ROC curves of SVMs with different gamma values:
y = digits.target == 9
X_train, X_test, y_train, y_test = train_test_split(
digits.data, y, random_state=0)
plt.figure()
for gamma in [1, 0.05, 0.01]:
    svc = SVC(gamma=gamma).fit(X_train, y_train)
    accuracy = svc.score(X_test, y_test)
    auc = roc_auc_score(y_test, svc.decision_function(X_test))
    fpr, tpr, _ = roc_curve(y_test, svc.decision_function(X_test))
    print("gamma = {:.2f} accuracy = {:.2f} AUC = {:.2f}".format(
        gamma, accuracy, auc))
    plt.plot(fpr, tpr, label="gamma={:.3f}".format(gamma))
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.xlim(-0.01, 1)
plt.ylim(0, 1.02)
plt.legend(loc="best")
gamma = 1.00 accuracy = 0.90 AUC = 0.50
gamma = 0.05 accuracy = 0.90 AUC = 1.00
gamma = 0.01 accuracy = 0.90 AUC = 1.00
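All three settings report the same accuracy of 0.90 yet very different AUC values. One plausible reading is that every model predicts "not nine" at the default threshold, while some of them still rank the nines above the rest. The toy example below, with made-up scores and illustrative helper names, shows how identical accuracy can coexist with an AUC of 0.5 or 1.0.

```python
def accuracy_at_zero(labels, scores):
    """Accuracy when predicting positive for score > 0."""
    correct = sum(1 for y, s in zip(labels, scores) if (s > 0) == (y == 1))
    return correct / len(labels)


def auc(labels, scores):
    """Pairwise AUC: share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Nine negatives and one positive, mimicking a 90/10 class imbalance.
labels = [0] * 9 + [1]
# Hypothetical scores: everything lies below the default threshold of 0.
uninformative = [-1.0] * 10                          # same score for every sample
well_ranked = [-3.0 - i for i in range(9)] + [-0.5]  # the positive scores highest

for name, scores in [("uninformative", uninformative), ("well ranked", well_ranked)]:
    print("{:<14} accuracy={:.2f} AUC={:.2f}".format(
        name, accuracy_at_zero(labels, scores), auc(labels, scores)))
```

In the well-ranked case a different threshold would separate the classes perfectly, which accuracy at the default threshold cannot reveal.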
3. Metrics for multiclass classification
Confusion matrix for the 10-digit classification task:
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(
digits.data, digits.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
pred = lr.predict(X_test)
print("Accuracy: {:.3f}".format(accuracy_score(y_test, pred)))
print("Confusion matrix:\n{}".format(confusion_matrix(y_test, pred)))
Accuracy: 0.953
Confusion matrix:
[[37 0 0 0 0 0 0 0 0 0]
[ 0 39 0 0 0 0 2 0 2 0]
[ 0 0 41 3 0 0 0 0 0 0]
[ 0 0 1 43 0 0 0 0 0 1]
[ 0 0 0 0 38 0 0 0 0 0]
[ 0 1 0 0 0 47 0 0 0 0]
[ 0 0 0 0 0 0 52 0 0 0]
[ 0 1 0 1 1 0 0 45 0 0]
[ 0 3 1 0 0 0 0 0 43 1]
[ 0 0 0 1 0 1 0 0 1 44]]
scores_image = mglearn.tools.heatmap(
confusion_matrix(y_test, pred), xlabel='Predicted label',
ylabel='True label', xticklabels=digits.target_names,
yticklabels=digits.target_names, cmap=plt.cm.gray_r, fmt="%d")
plt.title("Confusion matrix")
plt.gca().invert_yaxis()
print(classification_report(y_test, pred))
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        37
          1       0.89      0.91      0.90        43
          2       0.95      0.93      0.94        44
          3       0.90      0.96      0.92        45
          4       0.97      1.00      0.99        38
          5       0.98      0.98      0.98        48
          6       0.96      1.00      0.98        52
          7       1.00      0.94      0.97        48
          8       0.93      0.90      0.91        48
          9       0.96      0.94      0.95        47

avg / total       0.95      0.95      0.95       450
print("Micro average f1 score: {:.3f}".format(
    f1_score(y_test, pred, average="micro")))
print("Macro average f1 score: {:.3f}".format(
    f1_score(y_test, pred, average="macro")))
Micro average f1 score: 0.953
Macro average f1 score: 0.954
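Both averages can be recomputed from the 10x10 confusion matrix printed earlier (rows are true labels, columns are predicted labels). Macro averaging takes the unweighted mean of the per-class f1 scores; micro averaging pools the true positives, false positives, and false negatives of all classes first. This is a hand calculation in plain Python, not scikit-learn's code path.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
confusion = [
    [37, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 39, 0, 0, 0, 0, 2, 0, 2, 0],
    [0, 0, 41, 3, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 43, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 38, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 47, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 52, 0, 0, 0],
    [0, 1, 0, 1, 1, 0, 0, 45, 0, 0],
    [0, 3, 1, 0, 0, 0, 0, 0, 43, 1],
    [0, 0, 0, 1, 0, 1, 0, 0, 1, 44],
]
n = len(confusion)
tp = [confusion[k][k] for k in range(n)]
support = [sum(confusion[k]) for k in range(n)]  # row sums: true count per class
predicted = [sum(confusion[i][k] for i in range(n)) for k in range(n)]  # column sums

# Per-class f1 = 2*TP / (2*TP + FP + FN) = 2*TP / (predicted + support).
per_class_f1 = [2 * tp[k] / (predicted[k] + support[k]) for k in range(n)]
macro_f1 = sum(per_class_f1) / n

# Micro averaging pools TP, FP, and FN over all classes before computing f1.
total_tp = sum(tp)
total_fp = sum(predicted) - total_tp
total_fn = sum(support) - total_tp
micro_f1 = 2 * total_tp / (2 * total_tp + total_fp + total_fn)

print("Micro average f1 score: {:.3f}".format(micro_f1))
print("Macro average f1 score: {:.3f}".format(macro_f1))
```

For single-label multiclass problems like this one, the micro-averaged f1 coincides with plain accuracy (429 correct out of 450), which is why it matches the 0.953 printed earlier.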
4. Regression metrics
In general, we consider R^2 the more intuitive metric for evaluating regression models.
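For reference, R^2 compares the model's squared error with that of a baseline that always predicts the mean of the targets. A minimal sketch on made-up numbers follows; the function name `r2` is illustrative.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot would be zero).
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]   # hypothetical targets
y_pred = [2.5, 5.5, 7.0, 9.5]   # hypothetical predictions

print("R^2 of the model: {:.4f}".format(r2(y_true, y_pred)))
print("R^2 of predicting the mean: {:.4f}".format(r2(y_true, [6.0] * 4)))
```

A value of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible for models that do worse than that baseline.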
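5. Using evaluation metrics in model selection
scikit-learn provides a very simple way to do this, the scoring argument, which can be used in both GridSearchCV and cross_val_score. You only need to provide a string describing the evaluation metric you want to use.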
# default scoring for classification is accuracy
print("Default scoring: {}".format(
    cross_val_score(SVC(), digits.data, digits.target == 9)))
# providing scoring="accuracy" doesn't change the results
explicit_accuracy = cross_val_score(SVC(), digits.data, digits.target == 9,
                                    scoring="accuracy")
print("Explicit accuracy scoring: {}".format(explicit_accuracy))
roc_auc = cross_val_score(SVC(), digits.data, digits.target == 9,
                          scoring="roc_auc")
print("AUC scoring: {}".format(roc_auc))
Default scoring: [0.9 0.9 0.9]
Explicit accuracy scoring: [0.9 0.9 0.9]
AUC scoring: [0.994 0.99 0.996]
X_train, X_test, y_train, y_test = train_test_split(
digits.data, digits.target == 9, random_state=0)
# we provide a somewhat bad grid to illustrate the point:
param_grid = {'gamma': [0.0001, 0.01, 0.1, 1, 10]}
# using the default scoring of accuracy:
grid = GridSearchCV(SVC(), param_grid=param_grid)
grid.fit(X_train, y_train)
print("Grid-Search with accuracy")
print("Best parameters:", grid.best_params_)
print("Best cross-validation score (accuracy)): {:.3f}".format(grid.best_score_))
print("Test set AUC: {:.3f}".format(
    roc_auc_score(y_test, grid.decision_function(X_test))))
print("Test set accuracy: {:.3f}".format(grid.score(X_test, y_test)))
# using AUC scoring instead:
grid = GridSearchCV(SVC(), param_grid=param_grid, scoring="roc_auc")
grid.fit(X_train, y_train)
print("\nGrid-Search with AUC")
print("Best parameters:", grid.best_params_)
print("Best cross-validation score (AUC): {:.3f}".format(grid.best_score_))
print("Test set AUC: {:.3f}".format(
    roc_auc_score(y_test, grid.decision_function(X_test))))
print("Test set accuracy: {:.3f}".format(grid.score(X_test, y_test)))
Grid-Search with accuracy
Best parameters: {'gamma': 0.0001}
Best cross-validation score (accuracy)): 0.970
Test set AUC: 0.992
Test set accuracy: 0.973
Grid-Search with AUC
Best parameters: {'gamma': 0.01}
Best cross-validation score (AUC): 0.997
Test set AUC: 1.000
Test set accuracy: 1.000
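# Note: this import path is from an older scikit-learn release; newer releases
# list the available scorer names via sklearn.metrics.get_scorer_names() instead.
from sklearn.metrics.scorer import SCORERS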
print("Available scorers:\n{}".format(sorted(SCORERS.keys())))
Available scorers:
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score', 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']