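1. Understand why accuracy alone does not give a complete picture of a classifier's performance
2. Understand the motivation for and definition of the various evaluation metrics used in machine learning, and how to interpret results reported with a given metric
3. Optimize a machine learning algorithm using an evaluation metric suited to the task at hand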
Limitations of accuracy
accuracy = # correct predictions / # total instances
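On a classification task with balanced positive and negative classes, accuracy is a reasonable measure of model performance.
On an imbalanced task, however, accuracy alone is not enough. Suppose the training set has 1000 samples, 1 positive and 999 negative. A classifier that simply takes a majority vote over the training set (always predicting the most frequent training label) reaches an accuracy of 999/1000 = 99.9%, yet it has learned nothing about the relationships in the data and tells you nothing about which factors separate positive from negative samples. The following sections look at how to address this.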
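The majority-vote example can be checked with a minimal sketch. The label list below is made up to mirror the 1-positive, 999-negative case; it is not a real dataset, and the baseline is written by hand rather than taken from any library.

```python
from collections import Counter

# Hypothetical labels mirroring the example: 1 positive, 999 negatives.
y_true = [1] + [0] * 999

# Majority-vote baseline: always predict the most frequent training label.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the positive class: fraction of actual positives that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.999
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

The baseline scores almost perfectly on accuracy while never finding the single positive sample, which is exactly the gap the other metrics are meant to expose.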
Evaluation metrics
Taking binary classification as an example:
            | Predicted: 0 | Predicted: 1 |
Actual: 0   | TN           | FP           |
Actual: 1   | FN           | TP           |
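As a rough illustration of how the four cells are filled in, the sketch below counts them by hand. The function name `confusion_counts` and the eight labels are invented for this example; scikit-learn's `sklearn.metrics.confusion_matrix` produces the same row and column layout for labels 0 and 1.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels encoded as 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1  # negative, correctly predicted as negative
        elif actual == 0 and predicted == 1:
            fp += 1  # negative, wrongly predicted as positive
        elif actual == 1 and predicted == 0:
            fn += 1  # positive, wrongly predicted as negative
        else:
            tp += 1  # positive, correctly predicted as positive
    return tn, fp, fn, tp


# Made-up labels and predictions for eight instances.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"Actual: 0 | TN={tn} | FP={fp}")  # Actual: 0 | TN=4 | FP=1
print(f"Actual: 1 | FN={fn} | TP={tp}")  # Actual: 1 | FN=1 | TP=2
```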
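Accuracy = (TN + TP) / (TN + TP + FN + FP), the fraction of instances classified correctly;
Classification error = 1 - Accuracy, the fraction of instances classified incorrectly;
Recall (TPR) = TP / (TP + FN), the fraction of actual positives that are predicted as positive;
FPR = FP / (TN + FP), the fraction of actual negatives that are predicted incorrectly (that is, predicted as 1);
Precision = TP / (TP + FP), the fraction of predicted positives that are actually positive;
F1 score = 2 * Precision * Recall / (Precision + Recall), the harmonic mean of precision and recall;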
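The formulas above can be worked through on a small set of counts. This is only a sketch: the helper `binary_metrics` and the counts (TN=90, FP=5, FN=2, TP=3) are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the definitions prescribe.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute the metrics defined above from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Assumption: a ratio with a zero denominator is reported as 0.0.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "classification_error": 1 - accuracy,
        "recall": recall,
        "fpr": fpr,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for an imbalanced problem with 100 instances.
metrics = binary_metrics(tn=90, fp=5, fn=2, tp=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 0.930, while recall is 0.600, precision is 0.375 and F1 is about 0.462, so the positive class is handled far worse than the accuracy figure suggests.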
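Precision-Recall curve: the x-axis is precision and the y-axis is recall. The top-right corner is the ideal point, where precision and recall are both 1. In the plot, the red circle marks the point at which the decision threshold is 0;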
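A precision-recall curve is traced by sweeping the decision threshold over the classifier's scores and recording precision and recall at each step. The sketch below does this by hand on made-up labels and scores; `precision_recall_points` is a name chosen for this example, and scikit-learn's `sklearn.metrics.precision_recall_curve` is the library counterpart.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score used as a threshold.

    Assumes y_true contains at least one positive (label 1).
    """
    n_pos = sum(1 for t in y_true if t == 1)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Predict positive when the score reaches the threshold.
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        precision = tp / (tp + fp)  # tp + fp >= 1: the threshold is itself a score
        recall = tp / n_pos
        points.append((threshold, precision, recall))
    return points


# Made-up decision scores for six instances.
y_true = [0, 0, 1, 0, 1, 1]
scores = [-2.0, -0.5, -0.1, 0.3, 0.8, 1.5]

points = precision_recall_points(y_true, scores)
for threshold, precision, recall in points:
    print(f"threshold={threshold:5.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold can only keep recall the same or raise it, while precision tends to fall; here it drops from 1.00 at the strictest threshold to 0.50 at the loosest.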
ROC curve: the x-axis is FPR and the y-axis is TPR. The top-left corner is the ideal point, where FPR = 0 and TPR = 1.
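The points of a ROC curve can be produced the same way, by sweeping a threshold over the scores and recording (FPR, TPR). This is a hand-written sketch on made-up data; `roc_points` is a hypothetical helper, and `sklearn.metrics.roc_curve` is the library function that serves this purpose.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points of the ROC curve, starting at (0, 0).

    Assumes y_true contains at least one positive and one negative label.
    """
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up decision scores for six instances.
y_true = [0, 0, 1, 0, 1, 1]
scores = [-2.0, -0.5, -0.1, 0.3, 0.8, 1.5]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always ends at (1, 1), where every instance is predicted positive; the closer the intermediate points come to the top-left corner, the better the classifier separates the classes.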
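AUC: area under the curve. It is the area under the ROC curve, and it can also be read as the probability that the classifier gives a randomly chosen positive instance a higher score than a randomly chosen negative one.
Advantages:
- It is a single real number, so classifiers are easy to compare;
- It does not require choosing a decision threshold;
Disadvantages:
- Compared with the ROC curve it loses information, such as the shape of the curve and the trade-offs it reveals;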
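AUC can be read either as the area under the ROC curve or as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes both on the same made-up scores to show that they agree; the two helper names are invented for this example, and counting ties as half a win is the usual convention assumed here. In scikit-learn the corresponding function is `sklearn.metrics.roc_auc_score`.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(pos) * len(neg))


def auc_by_trapezoid(y_true, scores):
    """Area under the ROC curve, integrated with the trapezoidal rule."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up decision scores for six instances.
y_true = [0, 0, 1, 0, 1, 1]
scores = [-2.0, -0.5, -0.1, 0.3, 0.8, 1.5]

rank_auc = auc_by_ranking(y_true, scores)
area_auc = auc_by_trapezoid(y_true, scores)
print(f"AUC (ranking)   = {rank_auc:.4f}")  # AUC (ranking)   = 0.8889
print(f"AUC (trapezoid) = {area_auc:.4f}")  # AUC (trapezoid) = 0.8889
```

Eight of the nine positive-negative pairs are ordered correctly, which gives 8/9 under both readings.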
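Application scenarios
Recall-oriented machine learning tasks:
- Search and information extraction in legal discovery
- Tumor detection
- These are often paired with a human expert who filters out the false positives
Precision-oriented machine learning tasks:
- Search engine ranking, query suggestions
- Document classification
- Many customer-facing tasks (users remember failures!)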
Optimizing machine learning algorithms
1. Use cross-validation with a specified evaluation metric
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
dataset = load_digits()
# make this a binary problem with 'digit 1' as the positive class
# and 'not 1' as the negative class
X, y = dataset.data, dataset.target == 1
clf = SVC(kernel='linear', C=1)
# accuracy is the default scoring metric
print('Cross-validation (accuracy)', cross_val_score(clf, X, y, cv=5))
# use AUC as the scoring metric
print('Cross-validation (AUC)', cross_val_score(clf, X, y, cv=5, scoring='roc_auc'))
# use recall as the scoring metric
print('Cross-validation (recall)', cross_val_score(clf, X, y, cv=5, scoring='recall'))
# Output:
# Cross-validation (accuracy) [0.91944444 0.98611111 0.97214485 0.97493036
# 0.96935933]
# Cross-validation (AUC) [0.9641871 0.9976571 0.99372205 0.99699002 0.98675611]
# Cross-validation (recall) [0.81081081 0.89189189 0.83333333 0.83333333
# 0.83333333]
2. Use GridSearchCV with a specified evaluation metric
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score
dataset = load_digits()
# binary problem: 'digit 1' is the positive class, everything else is negative
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel='rbf')
# candidate values of the RBF kernel width to search over
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}
# default metric to optimize over grid parameters: accuracy
grid_clf_acc = GridSearchCV(clf, param_grid=grid_values)
grid_clf_acc.fit(X_train, y_train)
y_decision_fn_scores_acc = grid_clf_acc.decision_function(X_test)
# attributes of the fitted grid search
print('Grid best parameter (max. accuracy): ', grid_clf_acc.best_params_)
print('Grid best score (accuracy): ', grid_clf_acc.best_score_)
# alternative metric to optimize over grid parameters: AUC
grid_clf_auc = GridSearchCV(clf, param_grid=grid_values, scoring='roc_auc')
grid_clf_auc.fit(X_train, y_train)
y_decision_fn_scores_auc = grid_clf_auc.decision_function(X_test)
# AUC on the held-out test set, computed from the decision scores
print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_auc))
print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
print('Grid best score (AUC): ', grid_clf_auc.best_score_)
# Output:
# Grid best parameter (max. accuracy): {'gamma': 0.001}
# Grid best score (accuracy): 0.9962880475129918
# Test set AUC: 0.99982858122393
# Grid best parameter (max. AUC): {'gamma': 0.001}
# Grid best score (AUC): 0.9998741278302142