Practical Insights: ROC Curves and Imbalanced Datasets

The ROC curve is simply a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) as the discrimination threshold of a classifier is varied.
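
To make the threshold sweep concrete, here is a minimal sketch using scikit-learn's `roc_curve` on made-up labels and scores (the data is purely illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])    # ground-truth labels
y_score = np.array([0.1, 0.4, 0.35, 0.8,
                    0.05, 0.9, 0.5, 0.7])      # classifier scores

# roc_curve sweeps the threshold over the scores and returns
# one (FPR, TPR) pair per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```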

FPR is the probability of classifying a data point as positive when it is actually negative. It is essentially the probability that your classifier raises a false alarm, and it is defined as follows:

$$\mathrm{FPR} = \frac{FP}{N} = \frac{FP}{FP + TN}$$

where N is the total number of negatives, which is equal to the sum of false positives (FP) and true negatives (TN).

Similarly, TPR is the probability of correctly classifying a data point as positive when it is actually positive.

$$\mathrm{TPR} = \frac{TP}{P} = \frac{TP}{TP + FN}$$

where P is the total number of positives, which is equal to the sum of true positives (TP) and false negatives (FN).
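
Both rates can be read directly off a confusion matrix. A minimal sketch with scikit-learn, using made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])   # actual labels
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0])   # predicted labels

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

fpr = fp / (fp + tn)   # false alarms among all actual negatives
tpr = tp / (tp + fn)   # hits among all actual positives
print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```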

The area under the ROC curve (AUC) can be interpreted as the probability that the classification model ranks a randomly chosen positive example higher than a randomly chosen negative example. An AUC close to 1 is therefore often taken as confirmation that the model is good. However, as we will see, that is not always the case.
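
This ranking interpretation is easy to check numerically. The sketch below, on synthetic scores, compares the fraction of correctly ordered positive/negative pairs with scikit-learn's `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)        # random binary labels
y_score = rng.random(200) + 0.5 * y_true     # scores mildly correlated with labels

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs where the positive scores higher;
# ties count as half, matching the usual AUC convention.
diffs = pos[:, None] - neg[None, :]
rank_prob = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

print(f"pairwise ranking probability: {rank_prob:.4f}")
print(f"roc_auc_score:                {roc_auc_score(y_true, y_score):.4f}")
```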

Looking at the curve below, which has an AUC of 0.93, one might naively conclude that this model is excellent at classifying the underlying dataset.

[Figure: ROC curve with an AUC of 0.93]

But let's look at an example of a dataset that could give rise to this excellent ROC curve even though the underlying classifier is of poor quality.

In the image below, the red dots represent the positive class and the blue dots the negative class. We can assume we are solving a binary classification problem using logistic regression.

[Figure: scatter plot of the dataset; red dots are positives, blue dots are negatives]

The threshold of our classifier can be set to different values, resulting in, for example, the two different decision boundaries shown in the following figure:

[Figure: the same dataset with two decision boundaries corresponding to different thresholds]

The classifier on the right has a TPR of about 0.5 but a very low FPR (<0.05). The classifier on the left already has a TPR of 1.0 and still a very low FPR (~0.1). When we move the boundary threshold from right to left, we classify more data points as positive, increasing the number of false positives and true positives and tracing out the ROC curve shown above.

It's important to note that for both of these classifiers we can let the FPR approach 0 while keeping the TPR constant simply by adding more negative (blue) points to the left part of the figure. Thus we can push the AUC as close to 1 as we like by making the dataset more imbalanced.
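
A sketch of this effect, assuming a synthetic 1-D dataset shaped like the figure (the exact numbers will vary with the random seed): padding the negative class with easy points drives the ROC AUC toward 1, while average precision, which tracks the precision-recall tradeoff discussed next, stays much lower:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)

# Synthetic 1-D dataset mimicking the figure: a small positive class
# overlapping a small cluster of negatives, plus a large mass of easy
# negatives far to the left.
pos = rng.normal(5.0, 0.5, size=50)           # positives (red)
hard_neg = rng.normal(4.7, 0.5, size=100)     # negatives mixed in with the positives
easy_neg = rng.normal(0.0, 1.0, size=10_000)  # easy negatives (blue, far left)

X = np.concatenate([pos, hard_neg, easy_neg]).reshape(-1, 1)
y = np.concatenate([np.ones(50), np.zeros(10_100)])

scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# The flood of easy negatives keeps the FPR tiny at every threshold, so the
# ROC AUC comes out near 1 even though precision among the positives is poor.
print(f"ROC AUC:           {roc_auc_score(y, scores):.3f}")
print(f"Average precision: {average_precision_score(y, scores):.3f}")
```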

We see that by changing the threshold we would get great results as measured by the AUC. However, this metric completely misses the fact that the classifier is very poor when it comes to precision:

[Figure: the two decision boundaries from above, both yielding a precision below 0.4]

Indeed, both of the thresholds in the picture above result in a precision of less than 0.4, but this shortcoming is of course not indicated by the AUC. Thus, one should not blindly trust the AUC metric, but should also investigate other statistical measures that allow for a better judgement of the outcome of the analysis. In this example, the ideal way to capture the performance is the precision-recall curve, which shows the tradeoff between precision and recall (TPR) at different thresholds.
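
A minimal sketch of drawing that curve with scikit-learn's `precision_recall_curve`; the labels and scores here are stand-ins for a real classifier's predicted probabilities:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)

# Stand-in labels and scores; in practice use your classifier's predicted
# probabilities, e.g. clf.predict_proba(X)[:, 1].
y = np.concatenate([np.ones(50), np.zeros(1_000)])
scores = np.concatenate([rng.normal(0.7, 0.15, 50),
                         rng.normal(0.4, 0.15, 1_000)]).clip(0, 1)

# One (precision, recall) pair per threshold
precision, recall, thresholds = precision_recall_curve(y, scores)

plt.plot(recall, precision)
plt.xlabel("Recall (TPR)")
plt.ylabel("Precision")
plt.title("Precision-recall curve")
plt.show()
```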

[Figure: precision-recall curve for this classifier]

All in all, we see that it's critical to keep in mind the meaning behind the metrics used to evaluate model performance, especially when dealing with imbalanced datasets.

Translated from: https://medium.com/swlh/practical-insights-roc-curves-and-imbalanced-datasets-9f1e7cac4a46

Here are the steps for plotting ROC curves and computing AUC values for the iris dataset:

1. Import the dataset

First, we need to import the iris dataset. We can use the load_iris function from the sklearn library.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
```

2. Data preprocessing

To make it easy to plot per-class ROC curves, we split the dataset into training and test sets and convert the class labels into binary (one-vs-rest) labels.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Convert the class labels into binary one-vs-rest labels
y_train_binary = label_binarize(y_train, classes=[0, 1, 2])
y_test_binary = label_binarize(y_test, classes=[0, 1, 2])
```

3. Train the model

Here we use a Random Forest classifier. Wrapping it in OneVsRestClassifier fits one forest per class and makes predict_proba return a single (n_samples, n_classes) array, which the plotting code below relies on:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# Create the classifier (one forest per class)
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=50, random_state=42))

# Train the model
clf.fit(X_train, y_train_binary)
```

4. Plot the ROC curves and compute AUC

Next, we can use the roc_curve and auc functions from the sklearn library to plot the ROC curves and compute the AUC values:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Predict class probabilities for the test set
y_score = clf.predict_proba(X_test)

# Compute the ROC curve and AUC value for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_test_binary[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute the micro-averaged ROC curve and AUC value
fpr["micro"], tpr["micro"], _ = roc_curve(y_test_binary.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Plot the ROC curves
plt.figure(figsize=(8, 6))
for i in range(3):
    plt.plot(fpr[i], tpr[i], lw=2,
             label='ROC curve of class %d (AUC = %0.2f)' % (i + 1, roc_auc[i]))
plt.plot(fpr["micro"], tpr["micro"],
         label='Micro-Avg ROC curve (AUC = {0:0.2f})'.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve for iris dataset')
plt.legend(loc="lower right")
plt.show()
```

Running the code above produces the ROC curves and AUC values for the iris dataset.

[Figure: per-class and micro-averaged ROC curves for the iris dataset]

The figure shows that the ROC curves of all three classes stay close to the top-left corner, indicating that the model separates the classes well. The micro-averaged AUC of 0.98 likewise indicates good overall classification performance.