Probability Calibration Curves in Scikit-learn

python收藏家

Published 2024-03-27 18:43:29

Article link: https://blog.csdn.net/qq_42034590/article/details/135254033

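Probability calibration is a technique for converting the output scores of a binary classifier into probabilities that match the actual probability of the target class. In this article, we discuss probability calibration curves and how to plot them with Scikit-learn.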

Probability Calibration

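A probability calibration curve plots, for a binary classification problem, the predicted probability of the positive class against the frequency with which the positive class is actually observed. It is used to check how well a classifier is calibrated, that is, how closely its predicted probabilities match the actual ones. A perfectly calibrated classifier has a calibration curve that follows the 45-degree line, meaning the predicted probabilities equal the observed frequencies.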

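Probability calibration itself is the process of adjusting the predicted probabilities of a binary classification model so that they match the actual probabilities of the target class. The goal is well-calibrated predictions: among samples given a predicted probability of 0.8, for example, about 80% should actually belong to the positive class. In other words, a predicted probability should be an accurate estimate of the actual likelihood.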

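In many classification tasks, the model outputs a score between 0 and 1 that indicates its confidence that a given sample belongs to a particular class. These predicted probabilities do not always match the true class probabilities, however, especially when the model is trained on imbalanced data or has a complex decision boundary.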
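To make the idea of "a prediction of 0.8 should be right about 80% of the time" concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the labels are simulated, and the "overconfident" model is a hypothetical transformation that pushes scores toward 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

# True probability of the positive class for each synthetic sample.
p_true = rng.uniform(0.0, 1.0, size=n_samples)
# Labels drawn so that P(y = 1) really equals p_true.
y = rng.binomial(1, p_true)

# A perfectly calibrated model reports p_true itself.
calibrated_scores = p_true
# A hypothetical overconfident model pushes its scores toward 0 and 1.
overconfident_scores = p_true**2 / (p_true**2 + (1.0 - p_true) ** 2)


def observed_frequency(scores, labels, low=0.75, high=0.85):
    """Fraction of positives among samples whose score lies in [low, high]."""
    mask = (scores >= low) & (scores <= high)
    return labels[mask].mean()


freq_calibrated = observed_frequency(calibrated_scores, y)
freq_overconfident = observed_frequency(overconfident_scores, y)

print(f"Calibrated model, scores near 0.8:    {freq_calibrated:.3f} positive")
print(f"Overconfident model, scores near 0.8: {freq_overconfident:.3f} positive")
```

For the calibrated scores, the observed fraction should land close to 0.8. For the overconfident scores it should come out noticeably lower, which on a calibration curve shows up as a point below the 45-degree line.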

Key terms:

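Binary classifier: a classifier trained to distinguish between two classes or categories, usually labeled 0 and 1. It is used in many applications, including spam filtering and fraud detection.
Predicted probabilities: the probabilities that the classification model predicts from a given set of input features.
True probabilities: the actual probabilities of the underlying classes. They cannot be observed directly and are estimated from the observed frequency of the positive class.
Calibration curve: a plot of the predicted probabilities against the observed frequencies. It shows how well the classification model is calibrated and reveals whether the model is overconfident or underconfident.
Reliability diagram: when the predicted probabilities are divided into discrete bins and, for each bin, the mean predicted probability is plotted against the observed frequency of the positive class, the plot is called a reliability diagram.
Brier score: the mean squared error between the predicted probabilities and the actual outcomes (0 or 1). It ranges from 0 (best) to 1 (worst). Lower is better, and it reflects both calibration and how well the model separates the classes.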
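The Brier score and the reliability diagram are both simple enough to compute by hand. The sketch below does so on synthetic data (an assumption made for illustration, as are the variable names) and compares the results with Scikit-learn's `brier_score_loss` and `calibration_curve`.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(42)
# Synthetic predicted probabilities and labels (illustrative data only).
y_prob = rng.uniform(0.0, 1.0, size=1000)
y_true = rng.binomial(1, y_prob)

# Brier score: mean squared difference between probability and outcome.
brier_manual = np.mean((y_prob - y_true) ** 2)
brier_sklearn = brier_score_loss(y_true, y_prob)
print("Brier score (manual): ", brier_manual)
print("Brier score (sklearn):", brier_sklearn)

# Reliability diagram points using 5 equal-width bins.
n_bins = 5
edges = np.linspace(0.0, 1.0, n_bins + 1)
mean_predicted, fraction_positive = [], []
for low, high in zip(edges[:-1], edges[1:]):
    in_bin = (y_prob > low) & (y_prob <= high)
    if in_bin.any():  # empty bins are skipped
        mean_predicted.append(y_prob[in_bin].mean())
        fraction_positive.append(y_true[in_bin].mean())

# The same points computed by scikit-learn, for comparison.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
for manual, lib in zip(fraction_positive, prob_true):
    print(f"fraction of positives: manual={manual:.3f}  sklearn={lib:.3f}")
```

Note the order of the return values: `calibration_curve` returns the fraction of positives first and the mean predicted probability second.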

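Scikit-learn provides two main tools for probability calibration and calibration curves:

CalibratedClassifierCV - a meta-estimator that trains a classifier on a dataset and uses cross-validation to calibrate its predicted probabilities. It does not plot anything itself.
calibration_curve - a function that, given the true labels and the predicted probabilities, computes the fraction of positives and the mean predicted probability in each bin.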
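As a rough sketch of how the two tools fit together, the code below calibrates a classifier with `CalibratedClassifierCV` and then evaluates both the raw and the calibrated probabilities with `calibration_curve`. The choice of `GaussianNB`, the synthetic dataset, and the `sigmoid` method are assumptions made for illustration; they are not taken from the article.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary dataset (illustrative only).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Uncalibrated baseline.
raw_model = GaussianNB().fit(X_train, y_train)
raw_prob = raw_model.predict_proba(X_test)[:, 1]

# The same kind of model wrapped in a calibrator; 5-fold cross-validation
# is used to fit the sigmoid (Platt) mapping on held-out folds.
calibrated_model = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated_model.fit(X_train, y_train)
calibrated_prob = calibrated_model.predict_proba(X_test)[:, 1]

print("Brier score (raw):       ", brier_score_loss(y_test, raw_prob))
print("Brier score (calibrated):", brier_score_loss(y_test, calibrated_prob))

# Points for the two calibration curves (fraction of positives first).
raw_true, raw_pred = calibration_curve(y_test, raw_prob, n_bins=10)
cal_true, cal_pred = calibration_curve(y_test, calibrated_prob, n_bins=10)
```

Calibration is not guaranteed to help every model, so it is worth comparing the two Brier scores and the two curves rather than assuming an improvement.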

Now let's see how to plot a probability calibration curve with Scikit-learn.

Example

# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

# Load the Breast Cancer Wisconsin (Diagnostic) dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=23)

# Train a logistic regression classifier
clf = LogisticRegression(max_iter=1000, random_state=23)
clf.fit(X_train, y_train)

# Predicted probability of the positive class (label 1)
prob_pos = clf.predict_proba(X_test)[:, 1]

# Brier Score
b_score = brier_score_loss(y_test, prob_pos)
print("Brier Score :",b_score)

# Fraction of positives and mean predicted probability in each bin
true_pos, pred_pos = calibration_curve(y_test, prob_pos, n_bins=10)

# Plot the calibration curve of the classifier
plt.plot(pred_pos,
         true_pos,
         marker='o',
         linewidth=1,
         label='Logistic Regression')

# Add the 45-degree line that a perfectly calibrated classifier would follow
plt.plot([0, 1],
         [0, 1],
         linestyle='--',
         label='Perfectly Calibrated')


# Set the title and axis labels for the plot
plt.title('Probability Calibration Curve')
plt.xlabel('Predicted Probability')
plt.ylabel('True Probability')

# Add a legend to the plot
plt.legend(loc='best')

# Show the plot
plt.show()
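With a test set this small, some of the 10 equal-width bins may contain only a handful of samples, which makes the curve noisy. One possible variation, sketched below under the same dataset and model settings as the example above, is to inspect the bin counts and to use the `strategy="quantile"` option of `calibration_curve`, which places roughly the same number of samples in each bin. The choice of 5 bins is an assumption made for illustration.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data split and model settings as in the example above.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=23)

clf = LogisticRegression(max_iter=1000, random_state=23)
clf.fit(X_train, y_train)
prob_pos = clf.predict_proba(X_test)[:, 1]

# How many test samples fall into each of the 10 equal-width bins?
counts, _ = np.histogram(prob_pos, bins=10, range=(0.0, 1.0))
print("Samples per uniform bin:", counts)

# Quantile bins: bin edges follow the distribution of the predictions.
n_bins = 5
true_q, pred_q = calibration_curve(y_test, prob_pos,
                                   n_bins=n_bins, strategy="quantile")
for frac, mean_pred in zip(true_q, pred_q):
    print(f"mean predicted={mean_pred:.3f}  fraction of positives={frac:.3f}")
```

The resulting `pred_q` and `true_q` arrays can be passed to `plt.plot` in the same way as `pred_pos` and `true_pos` in the example.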