# [scikit-learn] Metrics for Evaluating Classifier Performance: Confusion Matrix, ROC, AUC, and More


## Overview

• The purpose of model evaluation and the general evaluation procedure
• What classification accuracy is useful for, and its limitations
• How a confusion matrix describes the performance of a classifier
• How the metrics derived from a confusion matrix are computed
• Adjusting classifier performance by changing the classification threshold
• What ROC curves are useful for
• How Area Under the Curve (AUC) differs from classification accuracy

## 2. Classification accuracy

In [1]:
# read the data into a Pandas DataFrame
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(url, header=None, names=col_names)

In [2]:
# print the first 5 rows of data
pima.head()

Out[2]:
pregnant glucose bp skin insulin bmi pedigree age label
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

In [3]:
# define X and y
feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
X = pima[feature_cols]
y = pima.label

In [4]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [5]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

Out[5]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr',
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0)
In [6]:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)

In [7]:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.692708333333


In [8]:
# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()

Out[8]:
0    130
1     62
dtype: int64
In [9]:
# calculate the percentage of ones
y_test.mean()

Out[9]:
0.32291666666666669
In [10]:
# calculate the percentage of zeros
1 - y_test.mean()

Out[10]:
0.67708333333333326
In [11]:
# calculate null accuracy (for binary classification problems coded as 0/1)
max(y_test.mean(), 1-y_test.mean())

Out[11]:
0.67708333333333326

In [12]:
# calculate null accuracy (for multi-class classification problems)
y_test.value_counts().head(1) / len(y_test)

Out[12]:
0    0.677083
dtype: float64
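The null accuracy computed by hand above can also be obtained with scikit-learn's `DummyClassifier`, which works for both the binary and multi-class cases. A minimal sketch on synthetic labels (the 65/35 split here is illustrative, not the Pima data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# synthetic imbalanced labels: 65% zeros, 35% ones
y = np.array([0] * 65 + [1] * 35)
X = np.zeros((100, 1))  # features are ignored by this strategy

# always predicts the most frequent class, so its accuracy is the null accuracy
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)
null_accuracy = dummy.score(X, y)
print(null_accuracy)  # 0.65
```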

In [13]:
# print the first 25 true and predicted responses
print("True:", y_test.values[0:25])
print("Pred:", y_pred_class[0:25])

True: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0]
Pred: [0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]


## 3. Confusion matrix

In [14]:
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred_class))

[[118  12]
 [ 47  15]]


• True Positive (TP): positive examples correctly classified as positive
• True Negative (TN): negative examples correctly classified as negative
• False Positive (FP): negative examples incorrectly labeled as positive
• False Negative (FN): positive examples incorrectly labeled as negative
In [15]:
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
print("TP:", TP)
print("TN:", TN)
print("FP:", FP)
print("FN:", FN)

TP: 15
TN: 118
FP: 12
FN: 47

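In recent scikit-learn versions the four cells can be unpacked in one line: for binary 0/1 labels, `confusion_matrix` lists true classes as rows and predicted classes as columns in sorted label order, so `ravel()` yields TN, FP, FN, TP. A small sketch on toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0])

# rows are true classes, columns are predicted classes (labels sorted ascending),
# so ravel() flattens the 2x2 matrix into TN, FP, FN, TP
TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
print(TN, FP, FN, TP)  # 2 1 1 2
```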

## 4. Metrics computed from the confusion matrix

In [16]:
print((TP+TN) / float(TP+TN+FN+FP))
print(metrics.accuracy_score(y_test, y_pred_class))

0.692708333333
0.692708333333


In [17]:
print((FP+FN) / float(TP+TN+FN+FP))
print(1 - metrics.accuracy_score(y_test, y_pred_class))

0.307291666667
0.307291666667


In [18]:
print(TP / float(TP+FN))
recall = metrics.recall_score(y_test, y_pred_class)
print(recall)

0.241935483871
0.241935483871


In [19]:
print(TN / float(TN+FP))

0.907692307692


In [20]:
print(FP / float(TN+FP))
specificity = TN / float(TN+FP)
print(1 - specificity)

0.0923076923077
0.0923076923077


In [21]:
print(TP / float(TP+FP))
precision = metrics.precision_score(y_test, y_pred_class)
print(precision)

0.555555555556
0.555555555556


The F-measure (also called the F1 score or F-score) combines precision and recall into a single metric:

$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

$$F_\beta = \frac{(1+\beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

The F-measure is the harmonic mean of precision and recall, weighting the two equally.

The $F_\beta$ measure is a weighted combination of precision and recall that gives recall $\beta$ times as much weight as precision.
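The $F_\beta$ formula above corresponds to `metrics.fbeta_score`: $\beta = 1$ recovers F1, $\beta > 1$ favors recall, and $\beta < 1$ favors precision. A sketch on toy labels (not the Pima data):

```python
import numpy as np
from sklearn.metrics import fbeta_score, f1_score, precision_score, recall_score

# toy labels: 4 positives, 4 negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])

p = precision_score(y_true, y_pred)  # 2 of 3 predicted positives are correct -> 2/3
r = recall_score(y_true, y_pred)     # 2 of 4 actual positives are found -> 1/2

# direct evaluation of F_beta = (1 + b^2) * p * r / (b^2 * p + r)
beta = 2.0
manual_f2 = (1 + beta**2) * p * r / (beta**2 * p + r)

print(manual_f2)                              # 10/19 ~ 0.526
print(fbeta_score(y_true, y_pred, beta=2.0))  # same value
print(f1_score(y_true, y_pred))               # beta=1 special case: 4/7 ~ 0.571
```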

In [22]:
print((2*precision*recall) / (precision+recall))
print(metrics.f1_score(y_test, y_pred_class))

0.337078651685
0.337078651685


• Spam filter: optimize for precision or specificity, because false positives (legitimate mail sent to the spam folder) are more costly than false negatives (spam reaching the inbox)
• Fraud transaction detector: optimize for sensitivity, because false negatives (fraud going undetected) are more costly than false positives (legitimate transactions flagged as fraud)
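When weighing these tradeoffs it helps to see all the per-class metrics at once; `metrics.classification_report` prints precision, recall, F1, and support for each class in one call. A sketch on toy labels (not the Pima data):

```python
from sklearn.metrics import classification_report

# toy binary labels with class imbalance
y_true = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

# per-class precision, recall, F1, and support in one summary table
print(classification_report(y_true, y_pred, digits=3))
```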

## 5. Adjusting the classification threshold

In [23]:
# print the first 10 predicted responses
logreg.predict(X_test)[0:10]

Out[23]:
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=int64)
In [24]:
y_test.values[0:10]

Out[24]:
array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0], dtype=int64)
In [25]:
# print the first 10 predicted probabilities of class membership
logreg.predict_proba(X_test)[0:10, :]

Out[25]:
array([[ 0.63247571,  0.36752429],
[ 0.71643656,  0.28356344],
[ 0.71104114,  0.28895886],
[ 0.5858938 ,  0.4141062 ],
[ 0.84103973,  0.15896027],
[ 0.82934844,  0.17065156],
[ 0.50110974,  0.49889026],
[ 0.48658459,  0.51341541],
[ 0.72321388,  0.27678612],
[ 0.32810562,  0.67189438]])

In [26]:
# print the first 10 predicted probabilities for class 1
logreg.predict_proba(X_test)[0:10, 1]

Out[26]:
array([ 0.36752429,  0.28356344,  0.28895886,  0.4141062 ,  0.15896027,
0.17065156,  0.49889026,  0.51341541,  0.27678612,  0.67189438])

In [27]:
# store the predicted probabilities for class 1
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

In [28]:
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

In [29]:
# histogram of predicted probabilities
plt.hist(y_pred_prob, bins=8)
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability of diabetes')
plt.ylabel('Frequency')

Out[29]:
<matplotlib.text.Text at 0x76853b0>

In [30]:
# predict diabetes if the predicted probability is greater than 0.3
from sklearn.preprocessing import binarize
y_pred_class = binarize(y_pred_prob.reshape(1, -1), threshold=0.3)[0]

In [31]:
# print the first 10 predicted probabilities
y_pred_prob[0:10]

Out[31]:
array([ 0.36752429,  0.28356344,  0.28895886,  0.4141062 ,  0.15896027,
0.17065156,  0.49889026,  0.51341541,  0.27678612,  0.67189438])
In [32]:
# print the first 10 predicted classes with the lower threshold
y_pred_class[0:10]

Out[32]:
array([ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,  0.,  1.])
In [33]:
y_test.values[0:10]

Out[33]:
array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0], dtype=int64)

In [34]:
# previous confusion matrix (default threshold of 0.5)
print(confusion)

[[118  12]
 [ 47  15]]

In [35]:
# new confusion matrix (threshold of 0.3)
print(metrics.confusion_matrix(y_test, y_pred_class))

[[80 50]
 [16 46]]

In [36]:
# sensitivity has increased (used to be 0.24)
print(46 / float(46 + 16))
print(metrics.recall_score(y_test, y_pred_class))

0.741935483871
0.741935483871

In [37]:
# specificity has decreased (used to be 0.91)
print(80 / float(80 + 50))

0.615384615385


• 0.5 is the default threshold
• Adjusting the threshold changes sensitivity and specificity
• Sensitivity and specificity pull in opposite directions
• Threshold tuning should be one of the last steps in improving classification performance; focus first on choosing or building a better classifier
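The tradeoff described above can be made concrete by sweeping the threshold over a set of predicted probabilities and recomputing sensitivity and specificity at each value. A sketch on synthetic scores (the score model here is illustrative, not the Pima classifier):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# synthetic binary labels and scores that correlate with them
rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, 200)
scores = np.clip(y_true * 0.35 + rng.rand(200) * 0.6, 0.0, 1.0)

sensitivities, specificities = [], []
for threshold in [0.3, 0.5, 0.7]:
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivities.append(tp / float(tp + fn))
    specificities.append(tn / float(tn + fp))
    print(threshold, round(sensitivities[-1], 3), round(specificities[-1], 3))

# raising the threshold trades sensitivity away for specificity
```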

## 6. ROC curves and AUC

An ROC curve is drawn by sweeping over a series of binary classification cutoffs (decision thresholds), plotting the true positive rate (TPR, i.e. sensitivity) on the y-axis against the false positive rate (FPR, i.e. 1 - specificity) on the x-axis.

The ROC curve shows the tradeoff between the fraction of positives the model identifies correctly and the fraction of negatives it mistakes for positives: increasing TPR comes at the cost of increasing FPR. The area under the ROC curve is one measure of the model's accuracy.

In [38]:
# IMPORTANT: first argument is true values, second argument is predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for diabetes classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)


Each point on the ROC curve corresponds to one threshold: for a given classifier, every threshold yields one (FPR, TPR) pair. At the highest threshold, TP = FP = 0, which corresponds to the origin; at the lowest threshold, TN = FN = 0, which corresponds to the upper-right point (1, 1).

In [39]:
# define a function that accepts a threshold and prints sensitivity and specificity
def evaluate_threshold(threshold):
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])

In [40]:
evaluate_threshold(0.5)

Sensitivity: 0.241935483871
Specificity: 0.907692307692

In [41]:
evaluate_threshold(0.3)

Sensitivity: 0.741935483871
Specificity: 0.615384615385


AUC (Area Under the Curve) is defined as the area under the ROC curve, or equivalently the fraction of the unit square lying below the curve, so its value cannot exceed 1. Because the ROC curve of a useful classifier generally lies above the line y = x, AUC typically falls between 0.5 and 1.
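AUC also has a useful probabilistic reading: it equals the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one (ties counted as half). A sketch verifying this equivalence on toy scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.7])

# fraction of (positive, negative) pairs ranked correctly; ties count half
pos = scores[y_true == 1]
neg = scores[y_true == 0]
manual_auc = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])

print(manual_auc)                     # 7/9 ~ 0.778
print(roc_auc_score(y_true, scores))  # same value
```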

In [42]:
# IMPORTANT: first argument is true values, second argument is predicted probabilities
print(metrics.roc_auc_score(y_test, y_pred_prob))

0.724565756824

In [43]:
# calculate cross-validated AUC
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()

Out[43]:
0.73782336182336183