Function Definition
model_selection.learning_curve(estimator, X, y, *, groups=None, train_sizes=array([0.1, 0.325, 0.55, 0.775, 1.]), cv=None, scoring=None, exploit_incremental_learning=False, n_jobs=None, pre_dispatch='all', verbose=0, shuffle=False, random_state=None, error_score=nan, return_times=False, fit_params=None)
Main Parameters
estimator
object type that implements the “fit” and “predict” methods
The model to evaluate; any object that implements the fit() and predict() methods
X
array-like of shape (n_samples, n_features)
The training data
y
array-like of shape (n_samples,) or (n_samples, n_outputs)
The target values (labels) corresponding to X
train_sizes
array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5)
The relative or absolute numbers of training samples used to generate the learning curve. The default np.linspace(0.1, 1.0, 5) = [0.1, 0.325, 0.55, 0.775, 1.0]: each fraction in the sequence is taken as a proportion of the available training set, and the model is trained and scored at each resulting training-set size (see the sketch after this parameter list)
n_jobs
int, default=None
Number of jobs to run in parallel
shuffle
bool, default=False
Whether to shuffle the training data
return_times
bool, default=False
Whether to return the time spent fitting and the time spent scoring
cv
int, cross-validation generator or an iterable, default=None
Determines the cross-validation splitting strategy. Optional; possible inputs:
Value | Description |
---|---|
None | Use the default 5-fold cross-validation (KFold) |
int | Use n-fold cross-validation with the given number of folds |
CV splitter | Use the given cross-validation generator as-is |
When this parameter is an int or None, the estimator is a classifier, and y is a binary or multiclass label, StratifiedKFold is used as the splitting strategy; in all other cases KFold is used
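As a quick sketch of how cv and train_sizes behave (the digits dataset and DecisionTreeClassifier below are just illustrative choices, not part of the API):
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# cv=None with a classifier and multiclass y: StratifiedKFold(5) is used internally
sizes, train_scores, test_scores = learning_curve(clf, X, y)
print(sizes)               # absolute sample counts derived from the default fractions
print(train_scores.shape)  # (n_ticks, n_cv_folds) = (5, 5)

# cv as an int selects the number of folds
learning_curve(clf, X, y, cv=3)

# a CV splitter object is used exactly as given
learning_curve(clf, X, y, cv=KFold(n_splits=4, shuffle=True, random_state=0))

# train_sizes can also be absolute sample counts instead of fractions
learning_curve(clf, X, y, train_sizes=[100, 500, 1000])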
Return Values
train_sizes_abs
array of shape (n_unique_ticks,)
The numbers of training samples actually used to generate the learning curve (duplicate sizes are removed, hence the n_unique_ticks shape)
train_scores
array of shape (n_ticks, n_cv_folds)
Scores on the training sets
test_scores
array of shape (n_ticks, n_cv_folds)
Scores on the test sets
fit_times
array of shape (n_ticks, n_cv_folds)
Time spent fitting the estimator; only returned when return_times is True
score_times
array of shape (n_ticks, n_cv_folds)
Time spent scoring; only returned when return_times is True (see the sketch after this list)
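Since the number of returned arrays depends on return_times, here is a minimal sketch of unpacking all five return values (GaussianNB on digits is just a convenient, fast choice):
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# with return_times=True the function returns five arrays instead of three
sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
    GaussianNB(), X, y, cv=5, return_times=True
)
print(fit_times.shape, score_times.shape)  # both (n_ticks, n_cv_folds) = (5, 5)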
A Learning-Curve Plotting Wrapper
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=None):
    '''
    Plot the learning curve of a model.
    '''
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs
    )
    # mean and standard deviation of the scores across the CV folds
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    # shade a one-standard-deviation band around each mean score curve
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, "o-", color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt
When the model's score is computed, the training and validation samples are drawn from the dataset at random, so the split varies between runs: two models trained on the same number of samples can yield different scores purely because of how the data happened to be partitioned. To mitigate this, the score at each training-set size should be computed over many splits and reported as a mean together with its standard deviation, as sketched below.
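One way to apply this, sketched here under the assumption that many random splits are affordable, is to pass a ShuffleSplit with a large n_splits as cv, so every point on the curve is averaged over many random train/validation partitions:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# 50 random 80/20 splits: each column of the score arrays is one split,
# so the row-wise statistics below average away the split-to-split noise
cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
sizes, train_scores, test_scores = learning_curve(GaussianNB(), X, y, cv=cv)

print(test_scores.shape)             # (5, 50): n_ticks x n_splits
print(np.mean(test_scores, axis=1))  # stable mean score per training size
print(np.std(test_scores, axis=1))   # spread caused by the random splits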
Usage Example
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
# load the digits dataset
X, y = load_digits(return_X_y=True)
cv1 = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
cv2 = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
estimator1 = GaussianNB()
estimator2 = SVC(gamma=0.001)
title1 = "(Naive Bayes)"
title2 = r"(SVM, RBF kernel, $\gamma=0.001$)"
plt.figure(figsize=(10, 4))  # make room for the two side-by-side panels
plt.subplot(121)
plot_learning_curve(estimator1, title1, X, y, ylim=(0.7, 1.01), cv=cv1, n_jobs=4)
plt.subplot(122)
plot_learning_curve(estimator2, title2, X, y, ylim=(0.7, 1.01), cv=cv2, n_jobs=4)
plt.show()
Output
The script produces a figure with two learning-curve panels: Naive Bayes on the left and the RBF-kernel SVM (gamma=0.001) on the right.