Wrapping the learning curve: the learning_curve() function

Author: 夺笋123

Last modified: 2022-05-11 22:10:25

First published: 2022-05-11 22:08:40

Article link: https://blog.csdn.net/m0_54510474/article/details/124599713

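This article introduces the `sklearn.model_selection.learning_curve` function, which is used to plot and analyze learning curves in order to understand how a model performs at different training-set sizes. Its main parameters are `estimator` (for example GaussianNB or SVC), `X` (the training data), `y` (the training labels), and `train_sizes` (the training-set sizes, given as fractions). A learning curve shows how the model's score on the training set and on the validation set changes as the number of training samples grows, which helps identify overfitting or underfitting. The `cv` parameter selects the cross-validation strategy, and the `shuffle` parameter controls whether the data is shuffled. The example plots learning curves for a GaussianNB model and an SVC model and compares how the two perform.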

Function signature

model_selection.learning_curve(estimator, X, y, *, groups=None, train_sizes=array([0.1, 0.33, 0.55, 0.78, 1.]), cv=None, scoring=None, exploit_incremental_learning=False, n_jobs=None, pre_dispatch='all', verbose=0, shuffle=False, random_state=None, error_score=nan, return_times=False, fit_params=None)

Main parameters

estimator

object type that implements the “fit” and “predict” methods
An estimator object that provides fit() and predict() methods; it is cloned for each validation run

X

array-like of shape (n_samples, n_features)
The training data (feature matrix)

y

array-like of shape (n_samples,) or (n_samples, n_outputs)
The labels (targets) for the training data

train_sizes

array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5)
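Relative or absolute numbers of training samples used to generate the learning curve. Float values are interpreted as fractions of the maximum training-set size (which is determined by the chosen cross-validation strategy) and must lie in (0, 1]; integer values are interpreted as absolute sample counts. The default, np.linspace(0.1, 1.0, 5), equals [0.1, 0.325, 0.55, 0.775, 1.0]. For each value in the sequence, a model is trained on that many samples and then scored.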
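To make the fraction-to-count conversion concrete, here is a minimal sketch using only NumPy. It assumes that fractions are multiplied by the maximum training-set size and truncated to integers, and the value 1437 for `n_max_training_samples` is a hypothetical figure (what remains of 1797 samples after holding out 20% for validation); it illustrates the idea rather than reproducing sklearn's internal code.

```python
import numpy as np

# The default train_sizes: five evenly spaced fractions from 10% to 100%
fractions = np.linspace(0.1, 1.0, 5)

# Hypothetical maximum training-set size: 1797 samples with 20% held out
n_max_training_samples = 1437

# Assumption: each fraction is scaled by the maximum size and truncated
abs_sizes = (fractions * n_max_training_samples).astype(int)

print(fractions)
print(abs_sizes)
```

Each entry of `abs_sizes` is one tick on the x-axis of the learning curve.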

n_jobs

int, default=None
Number of jobs to run in parallel

shuffle

bool, default=False
Whether to shuffle the training data before taking prefixes of it according to train_sizes

return_times

bool, default=False
Whether to return the time spent fitting and the time spent scoring

cv

int, cross-validation generator or an iterable, default=None
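The cross-validation splitting strategy. Possible values:

Value	Description
None	Use the default 5-fold cross-validation
int	Use K-fold cross-validation with the given number of folds
CV splitter	Use the given cross-validation splitter object (for example ShuffleSplit)

When this parameter is an int or None, the estimator is a classifier, and y is binary or multiclass, StratifiedKFold is used as the splitting strategy; in all other cases KFold is used.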
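The selection rule for `cv` can be checked with sklearn's `check_cv` helper, which resolves a `cv` argument into a splitter object. The sketch below is only an illustration: the label array is made up, and whether `learning_curve` calls `check_cv` in exactly this way internally is an assumption.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, check_cv

# Made-up multiclass labels: 30 samples, 3 classes
y_multiclass = np.array([0, 1, 2] * 10)

# None + classifier + multiclass labels -> stratified 5-fold
cv_default = check_cv(None, y_multiclass, classifier=True)

# int + classifier + multiclass labels -> stratified 3-fold
cv_int_clf = check_cv(3, y_multiclass, classifier=True)

# int + non-classifier -> plain 3-fold
cv_int_reg = check_cv(3, y_multiclass, classifier=False)

# A splitter object is passed through unchanged
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cv_obj = check_cv(splitter, y_multiclass, classifier=True)

for name, cv in [("None", cv_default), ("3, classifier", cv_int_clf),
                 ("3, regressor", cv_int_reg), ("ShuffleSplit", cv_obj)]:
    print(name, "->", type(cv).__name__, "with", cv.get_n_splits(), "splits")
```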

Return values

train_sizes_abs

array of shape (n_unique_ticks,)

The numbers of training samples that were used to generate the learning curve

train_scores

array of shape (n_ticks, n_cv_folds)
Scores on the training sets

test_scores

array of shape (n_ticks, n_cv_folds)

Scores on the test (validation) sets

fit_times

array of shape (n_ticks, n_cv_folds)
Time spent fitting; returned only when return_times is set to True

score_times

array of shape (n_ticks, n_cv_folds)
Time spent scoring; returned only when return_times is set to True
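As a quick check of the shapes listed above, the following sketch calls `learning_curve` with `return_times=True` on a small synthetic dataset. The data (200 samples, 4 features, alternating binary labels) is invented purely for illustration.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic data, invented for this example: 200 samples, 2 balanced classes
rng = np.random.default_rng(0)
y = np.tile([0, 1], 100)
X = rng.normal(size=(200, 4)) + y[:, None]

# With return_times=True, five arrays come back instead of three
results = learning_curve(GaussianNB(), X, y, cv=5, return_times=True)
train_sizes_abs, train_scores, test_scores, fit_times, score_times = results

print(train_sizes_abs.shape)  # one entry per unique tick
print(train_scores.shape)     # (n_ticks, n_cv_folds)
print(test_scores.shape)
print(fit_times.shape)
print(score_times.shape)
```

With 5-fold cross-validation on 200 samples, each training fold holds 160 samples, so the last tick (fraction 1.0) corresponds to 160 training samples.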

Wrapping the learning curve in a function

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=None):
    """Plot the learning curve of a model."""
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs)

    # Mean and standard deviation of the scores across the CV splits
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Shade the band of one standard deviation around each mean score
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
    # Draw the mean scores as lines on top of the bands
    plt.plot(train_sizes, train_scores_mean, "o-", color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt

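When a model's accuracy is computed, the training samples and the cross-validation samples are drawn at random from the dataset, so the data may not be evenly distributed between them. As a result, models trained on the same number of samples can produce a different accuracy on every run. To address this, the accuracy should be computed several times and summarized by its mean and its spread; the function above uses the standard deviation (np.std) across the cross-validation splits.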
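The effect of random splitting can be seen directly in the raw scores. The sketch below is an illustration under assumed settings (20 random splits and a single training size of 50%, both arbitrary choices): it scores GaussianNB on the digits dataset and summarizes the per-split scores by their mean and standard deviation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# 20 independent random 80/20 splits; the split count is an arbitrary choice
cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)

# A single tick: train on half of the available training samples
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, cv=cv, train_sizes=[0.5]
)

# One validation score per random split, all at the same training size
per_split = test_scores[0]
print("training samples:", train_sizes[0])
print("min / max over splits:", per_split.min(), per_split.max())
print("mean:", per_split.mean(), "std:", per_split.std())
```

The gap between the minimum and maximum shows how much a single split can mislead, while the mean and standard deviation give a more stable summary.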

Example: using the learning curve

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits

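# Load the dataset
X, y = load_digits(return_X_y=True)
# 50 random 80/20 splits for the Naive Bayes model
cv1 = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
# Only 5 random 80/20 splits for SVC, which is more expensive to fit
cv2 = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
estimator1 = GaussianNB()
estimator2 = SVC(gamma=0.001)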


title1 = "(Naive Bayes)"
title2 = r"(SVM, RBF kernel, $\gamma=0.001$)"
plt.subplot(121)
plot_learning_curve(estimator1, title1, X, y, ylim=(0.7, 1.01), cv=cv1, n_jobs=4)
plt.subplot(122)
plot_learning_curve(estimator2, title2, X, y, ylim=(0.7, 1.01), cv=cv2, n_jobs=4)
plt.show()