This example illustrates how sigmoid calibration changes predicted probabilities for a 3-class classification problem. Illustrated is the standard 2-simplex, where the three corners correspond to the three classes. Arrows point from the probability vectors predicted by an uncalibrated classifier to the probability vectors predicted by the same classifier after sigmoid calibration on a hold-out validation set. Colors indicate the true class of an instance (red: class 1, green: class 2, blue: class 3).
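As a rough illustration of what each arrow represents, the sketch below applies a per-class sigmoid to a few made-up probability vectors and renormalizes the result so that it lies on the 2-simplex again. The function name `sigmoid_calibrate`, the parameters `a` and `b`, and the input vectors are all hypothetical values chosen for illustration. In practice the parameters would be estimated from the validation set, and the exact procedure used by the library may differ.

```python
import numpy as np


def sigmoid_calibrate(proba, a, b):
    """Apply a per-class sigmoid 1 / (1 + exp(a * p + b)), then renormalize rows."""
    calibrated = 1.0 / (1.0 + np.exp(a * proba + b))
    # Renormalize so every row sums to 1 and stays on the simplex.
    return calibrated / calibrated.sum(axis=1, keepdims=True)


# Hypothetical uncalibrated probability vectors (each row sums to 1).
uncalibrated = np.array(
    [
        [0.80, 0.10, 0.10],
        [0.40, 0.40, 0.20],
        [0.05, 0.15, 0.80],
    ]
)

# Hypothetical sigmoid parameters, one pair per class (assumed, not fitted).
a = np.array([-4.0, -4.0, -4.0])
b = np.array([2.0, 2.0, 2.0])

calibrated = sigmoid_calibrate(uncalibrated, a, b)

# Each arrow points from the uncalibrated to the calibrated vector.
arrows = calibrated - uncalibrated
print(calibrated)
print(arrows)
```

Because both endpoints of an arrow sum to 1, the components of every arrow sum to 0, which is why the arrows stay within the plane of the simplex.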
Data
Below, we generate a classification dataset with 2000 samples, 2 features and 3 target classes. We then split the data as follows:
train: 600 samples (for training the classifier)
valid: 400 samples (for calibrating predicted probabilities)
test: 1000 samples
Note that we also create X_train_valid and y_train_valid, which consist of both the train and valid subsets. They are used when we only want to train the classifier without calibrating the predicted probabilities.
# Author: Jan Hendrik Metzen <[email protected]>
# License: BSD Style.
import numpy as np
from sklearn.datasets import make_blobs
np.random.seed(0)
X, y = make_blobs(
    n_samples=2000, n_features=2, centers=3, random_state=42, cluster_std=5.0
)
X_train, y_train = X[:600], y[:600]
X_valid, y_valid = X[600:1000], y[600:1000]
X_train_valid, y_train_valid = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]
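To double-check the split described above, a minimal sketch like the following regenerates the same dataset and prints the shape and per-class counts of each subset. The `splits` dictionary is a hypothetical helper introduced only for this check and is not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Regenerate the dataset with the same settings as the example.
X, y = make_blobs(
    n_samples=2000, n_features=2, centers=3, random_state=42, cluster_std=5.0
)

# Hypothetical helper mapping subset names to their slices.
splits = {
    "train": (X[:600], y[:600]),
    "valid": (X[600:1000], y[600:1000]),
    "train_valid": (X[:1000], y[:1000]),
    "test": (X[1000:], y[1000:]),
}

for name, (X_part, y_part) in splits.items():
    # np.bincount counts how many samples of each class fall in the subset.
    counts = np.bincount(y_part, minlength=3)
    print(f"{name}: X shape {X_part.shape}, class counts {counts.tolist()}")
```

Slicing by position is reasonable here because make_blobs shuffles the samples by default, so each subset should contain a mix of all three classes.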
Fitting and calibration
First, we train a RandomForestClassifier with 25 base estimators (trees) on the concatenated train and validation data (1000 samples). This is the uncalibrated classifier.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train_valid, y_train_valid)
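To see what the uncalibrated classifier produces, the following self-contained sketch refits the same kind of forest and inspects its predicted probability vectors on the test set. The variable name `clf_probs`, the fixed `random_state=0` on the forest, and the use of log loss as a summary score are assumptions made for this illustration rather than part of the code above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

# Same dataset and split as in the example.
X, y = make_blobs(
    n_samples=2000, n_features=2, centers=3, random_state=42, cluster_std=5.0
)
X_train_valid, y_train_valid = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]

# random_state is fixed only to make this sketch reproducible (an assumption).
clf = RandomForestClassifier(n_estimators=25, random_state=0)
clf.fit(X_train_valid, y_train_valid)

# One probability vector per test sample; each row lies on the 2-simplex.
clf_probs = clf.predict_proba(X_test)
print(clf_probs[:5])

# Log loss penalizes confident but wrong probability estimates.
print("log loss (uncalibrated):", log_loss(y_test, clf_probs))
```

Each row of `clf_probs` corresponds to the starting point of one arrow in the simplex plot.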
To train the calibrated classifier, we start with the same