Python: Plotting P-R Curves and ROC Curves for Several Classifiers

小山风

Article link: https://blog.csdn.net/weixin_46974686/article/details/115029826

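Preface: P-R curves and ROC curves are used to compare the performance of learners.

If the P-R curve (or ROC curve) of one learner is completely enclosed by the curve of another learner, the latter performs better than the former.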

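I. P-R Curve

(1) Confusion matrix

First, we need the confusion matrix, which is used to measure the performance of a classifier:

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | TP (true positive) | FN (false negative) |
| Actual negative | FP (false positive) | TN (true negative) |

TP, FN, FP and TN can be remembered as follows. The first letter says whether the prediction is correct: True (T) if it is correct, False (F) if it is wrong. The second letter is the predicted result: Positive (P) if the sample is predicted to be positive, Negative (N) if it is predicted to be negative.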
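To make the four counts concrete, here is a minimal sketch that tallies them from two label lists. The function name `confusion_counts` and the toy labels are made up for illustration, and the labels are assumed to be encoded as 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels, assuming 1 = positive, 0 = negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positive, predicted positive
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positive, predicted negative
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negative, predicted positive
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negative, predicted negative
    return tp, fn, fp, tn


# Toy labels, invented for this example
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```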

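(2) Precision and recall

· Precision (P) is defined as:

$$P = \frac{TP}{TP + FP}$$

It can be understood as the proportion of samples predicted to be positive that really are positive.

· Recall (R) is defined as:

$$R = \frac{TP}{TP + FN}$$

It can be understood as the proportion of actual positive samples that are predicted correctly.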
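As a small worked example, precision and recall can be computed directly from the confusion-matrix counts. This is only a sketch: the helper name `precision_recall` and the counts are invented, and returning 0.0 when a denominator is zero is an assumption made here, not something the definitions prescribe.

```python
def precision_recall(tp, fn, fp):
    """Compute (precision, recall) from confusion-matrix counts."""
    # Assumption: report 0.0 when a denominator is zero (the ratio is undefined there)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Invented counts: 40 samples predicted positive, 50 samples actually positive
p, r = precision_recall(tp=30, fn=20, fp=10)
print(p, r)  # 0.75 0.6
```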

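(3) Drawing the P-R curve

In many cases we can rank the samples by the learner's predictions: the samples the learner considers "most likely" to be positive come first, and the ones it considers "least likely" to be positive come last. Going down this ranking and, at each step, predicting every sample up to and including the current one as positive, we can compute the current recall and precision. Plotting precision on the vertical axis against recall on the horizontal axis gives the precision-recall curve, or "P-R curve" for short.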
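The ranking procedure just described can be sketched in a few lines of plain Python. The function `pr_points` and the toy scores are hypothetical, labels are assumed to be 1/0, and tied scores are not given any special treatment here (library routines such as scikit-learn's `precision_recall_curve` handle thresholds more carefully).

```python
def pr_points(y_true, scores):
    """Return a list of (recall, precision) pairs, one per cut-off position in the ranking."""
    # Rank sample indices from the highest score to the lowest
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)  # number of actual positives (labels are 1/0)
    tp = 0
    points = []
    for k, i in enumerate(order, start=1):
        tp += y_true[i]  # the top-k samples are predicted positive
        points.append((tp / total_pos, tp / k))
    return points


# Toy data, invented for this example
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
points = pr_points(y_true, scores)
for recall, precision in points:
    print(f"recall={recall:.2f}  precision={precision:.2f}")
```

Plotting the second element of each pair against the first gives the P-R curve for this toy ranking.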

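(4) Performance measures

When the P-R curves of two learners cross, it is generally hard to say which learner is better. For this reason, several performance measures that take both precision and recall into account have been designed, including the break-even point (BEP, the value at which precision equals recall) and the F1 score.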
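Both measures are easy to compute once precision and recall are known. The sketch below uses the standard definition F1 = 2PR / (P + R); the BEP helper is only an approximation that picks the sampled curve point where precision and recall are closest, and both function names and the sample points are made up for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: F1 = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # assumption: report 0.0 when both are zero
    return 2 * precision * recall / (precision + recall)


def approximate_bep(points):
    """Return the (recall, precision) pair on a sampled P-R curve where the two are closest."""
    return min(points, key=lambda rp: abs(rp[0] - rp[1]))


# (recall, precision) pairs of a toy P-R curve, invented for this example
curve = [(1 / 3, 1.0), (1 / 3, 0.5), (2 / 3, 2 / 3), (1.0, 0.75), (1.0, 0.6)]

print(round(f1_score(0.75, 0.6), 4))  # 0.6667
print(approximate_bep(curve))
```

A higher F1 or BEP indicates a better trade-off between precision and recall, which gives a single number to compare when the curves themselves cross.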

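II. ROC and AUC

(1) TPR and FPR

As with the P-R curve, we rank the samples by the learner's predictions and then go down the ranking, at each step predicting every sample up to the current one as positive. The "true positive rate" (TPR) is plotted on the vertical axis and the "false positive rate" (FPR) on the horizontal axis. The two are defined as:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{TN + FP}$$

Note: ROC and AUC are largely insensitive to class imbalance, because TPR is computed only from the actual positives and FPR only from the actual negatives, so neither depends on the ratio between the two classes.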
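A quick numerical check illustrates the note on imbalance. In this sketch the counts are invented: the second data set has ten times as many negatives as the first, while the classifier is assumed to behave identically on each class (same fraction of positives found, same fraction of negatives misclassified). Precision is included only for contrast.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # uses actual positives only
    fpr = fp / (fp + tn)        # uses actual negatives only
    precision = tp / (tp + fp)  # mixes both classes
    return tpr, fpr, precision


# Invented counts: 100 positives and 100 negatives
balanced = rates(tp=80, fn=20, fp=10, tn=90)
# Same per-class behaviour, but 1000 negatives instead of 100
imbalanced = rates(tp=80, fn=20, fp=100, tn=900)

print(balanced)
print(imbalanced)
```

TPR and FPR are the same in both cases, while precision drops from about 0.89 to about 0.44, which is why the P-R curve reacts to the class ratio and the ROC curve does not.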

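(2) AUC

For ROC curves as well, if the curves of two learners cross, it is generally hard to say which learner is better. In that case the comparison can be based on the AUC (Area Under ROC Curve): the larger the area, the better the learner.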
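To show what the area means in practice, here is a minimal sketch that builds the ROC points by walking down the ranking and then integrates them with the trapezoidal rule. The helper names and the toy data are hypothetical, labels are assumed to be 1/0, and tied scores are not handled specially.

```python
def roc_points(y_true, scores):
    """Return the ROC curve as a list of (FPR, TPR) points, starting at (0, 0)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if y_true[i] == 1:
            tp += 1  # one more positive ranked above the cut-off
        else:
            fp += 1  # one more negative ranked above the cut-off
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data, invented for this example
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
roc = roc_points(y_true, scores)
auc = auc_trapezoid(roc)
print(round(auc, 4))  # 0.6667
```

The result matches the pairwise interpretation of AUC: of the 6 (positive, negative) pairs in the toy data, 4 have the positive sample scored higher, and 4/6 is about 0.667.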

III. Code

The Python code below uses a simple binary classification task as an example and compares the performance of RF, LR, GaussianNB, SVC and KNN:

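# Split the data set
# (x is the feature matrix and y the binary labels; both are assumed to be loaded already)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)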

# RF
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier(max_depth=5, n_estimators=5)
estimator.fit(x_train, y_train)

# One row per algorithm; each cell holds an array of curve coordinates or the AUC value
algorithm = ['RF', 'LR', 'GaussianNB', 'SVC', 'KNN']
evaluation = pd.DataFrame(index=algorithm, columns=['fpr', 'tpr', 'pre', 'rec', 'auc'])

# Predicted probability of the positive class; .at stores a whole array in a single cell
y_score = estimator.predict_proba(x_test)[:, 1]
evaluation.at['RF', 'fpr'], evaluation.at['RF', 'tpr'], thresholds = metrics.roc_curve(y_test, y_score)
evaluation.at['RF', 'auc'] = metrics.auc(evaluation.at['RF', 'fpr'], evaluation.at['RF', 'tpr'])
evaluation.at['RF', 'pre'], evaluation.at['RF', 'rec'], thresholds = metrics.precision_recall_curve(y_test, y_score)

# LR
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression(solver='liblinear', penalty='l2', C=1.0)
estimator.fit(x_train, y_train)
y_score = estimator.predict_proba(x_test)[:, 1]  # predicted probability of the positive class
evaluation.at['LR', 'fpr'], evaluation.at['LR', 'tpr'], thresholds = metrics.roc_curve(y_test, y_score)
evaluation.at['LR', 'auc'] = metrics.auc(evaluation.at['LR', 'fpr'], evaluation.at['LR', 'tpr'])
evaluation.at['LR', 'pre'], evaluation.at['LR', 'rec'], thresholds = metrics.precision_recall_curve(y_test, y_score)

# GaussianNB
from sklearn.naive_bayes import GaussianNB
# The main parameter is the class prior; by default it is estimated as P(Y=C_k) = m_k / m
estimator = GaussianNB()
estimator.fit(x_train, y_train)
y_score = estimator.predict_proba(x_test)[:, 1]  # predicted probability of the positive class
evaluation.at['GaussianNB', 'fpr'], evaluation.at['GaussianNB', 'tpr'], thresholds = metrics.roc_curve(y_test, y_score)
evaluation.at['GaussianNB', 'auc'] = metrics.auc(evaluation.at['GaussianNB', 'fpr'], evaluation.at['GaussianNB', 'tpr'])
evaluation.at['GaussianNB', 'pre'], evaluation.at['GaussianNB', 'rec'], thresholds = metrics.precision_recall_curve(y_test, y_score)

# SVC
from sklearn.svm import SVC
# probability=True is required so that predict_proba is available
estimator = SVC(kernel='rbf', random_state=0, probability=True)
estimator.fit(x_train, y_train)
y_score = estimator.predict_proba(x_test)[:, 1]  # predicted probability of the positive class
evaluation.at['SVC', 'fpr'], evaluation.at['SVC', 'tpr'], thresholds = metrics.roc_curve(y_test, y_score)
evaluation.at['SVC', 'auc'] = metrics.auc(evaluation.at['SVC', 'fpr'], evaluation.at['SVC', 'tpr'])
evaluation.at['SVC', 'pre'], evaluation.at['SVC', 'rec'], thresholds = metrics.precision_recall_curve(y_test, y_score)

# KNN
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()
# Choose n_neighbors by 5-fold cross-validation; the best model is refit on the full training set
estimator = GridSearchCV(estimator, param_grid={"n_neighbors": [3, 4, 5, 6]}, cv=5)
estimator.fit(x_train, y_train)
y_score = estimator.predict_proba(x_test)[:, 1]  # predicted probability of the positive class
evaluation.at['KNN', 'fpr'], evaluation.at['KNN', 'tpr'], thresholds = metrics.roc_curve(y_test, y_score)
evaluation.at['KNN', 'auc'] = metrics.auc(evaluation.at['KNN', 'fpr'], evaluation.at['KNN', 'tpr'])
evaluation.at['KNN', 'pre'], evaluation.at['KNN', 'rec'], thresholds = metrics.precision_recall_curve(y_test, y_score)

# Visualization
import matplotlib.pyplot as plt

# Figure 1: ROC curves
plt.figure(1, figsize=(6, 4))
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')  # chance-level diagonal
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')

for i in range(len(evaluation)):
    fpr = evaluation.iloc[i]['fpr']
    tpr = evaluation.iloc[i]['tpr']
    auc = evaluation.iloc[i]['auc']
    plt.plot(fpr, tpr, lw=2, label='%s (AUC = %0.4f)' % (evaluation.index[i], auc))
plt.legend()

# Figure 2: P-R curves
plt.figure(2, figsize=(6, 4))
plt.xlim([0.0, 1.05])
plt.ylim([0, 1.05])
plt.title('Precision/Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')

for i in range(len(evaluation)):
    pre = evaluation.iloc[i]['pre']
    rec = evaluation.iloc[i]['rec']
    # Recall goes on the x-axis and precision on the y-axis, matching the axis labels
    plt.plot(rec, pre, lw=2, label=evaluation.index[i])
plt.legend()
plt.show()