【sklearn library】【Installation】【Cross-validation】

This article gives a detailed introduction to cross-validation in the sklearn library, covering the idea behind cross-validation and the different cross-validation iterators such as K-fold and Leave-One-Out. It focuses on the role of cross-validation in evaluating model performance and preventing overfitting, and on how to use the `cross_val_score` and `cross_validate` functions for multi-metric evaluation. It also introduces the `TimeSeriesSplit` strategy for time-series data, stressing the importance of using shuffling correctly with ordered data, and finally describes the `permutation_test_score` function for assessing the significance of a classifier's performance via permutation tests.

        sklearn is short for scikit-learn, a third-party Python module.

        The sklearn library bundles many commonly used machine learning methods; for most machine learning tasks you do not need to implement the algorithms yourself, you simply call the modules that sklearn provides.
        sklearn is built on top of NumPy, SciPy and matplotlib, so before installing sklearn these dependencies need to be installed first.

pip install scikit-learn
# The old sklearn.cross_validation module was deprecated and removed
# (scikit-learn 0.20+); its helpers now live in sklearn.model_selection.
# from sklearn import cross_validation
from sklearn.model_selection import cross_val_score

sklearn website: scikit-learn: machine learning in Python — scikit-learn 1.4.1 documentation

Cross-validation: 3.1. Cross-validation: evaluating estimator performance — scikit-learn 1.4.1 documentation

3.1. Cross-validation: evaluating estimator performance

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally. Here is a flowchart of typical cross validation workflow in model training. The best parameters can be determined by grid search techniques.

[Figure: Grid Search Workflow]
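As a rough sketch of that workflow (not part of the original scikit-learn example), the snippet below tunes the SVM's C value with GridSearchCV on the training portion of the iris data and then scores the selected model once on a held-out test set; the candidate C values are arbitrary choices for illustration.

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split, GridSearchCV

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Grid search with 5-fold cross-validation on the training data only;
# the candidate C values are arbitrary and used purely for illustration.
search = GridSearchCV(svm.SVC(kernel='linear'), {'C': [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # hyperparameters chosen by cross-validation
print(search.score(X_test, y_test))  # final evaluation on the held-out test set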

In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function. Let’s load the iris data set to fit a linear support vector machine on it:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> from sklearn import svm

>>> X, y = datasets.load_iris(return_X_y=True)
>>> X.shape, y.shape
((150, 4), (150,))

We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.4, random_state=0)

>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))

>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...

When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
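As a minimal sketch of that three-way split (assuming the iris data and linear SVM used throughout this article), train_test_split can simply be applied twice; the 60/20/20 proportions below are an arbitrary choice:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# First hold out a final test set (20% of the data).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into training and validation sets
# (0.25 of the remaining 80% gives roughly a 60/20/20 split overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_val, y_val))    # used while tuning hyperparameters
print(clf.score(X_test, y_test))  # touched only once, for the final report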

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

  • A model is trained using \(k-1\) of the folds as training data;

  • the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
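The loop below is a hand-written sketch of that procedure using KFold (cross_val_score, introduced next, does the same thing in a single call); the shuffling and random_state are arbitrary choices for illustration:

import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

X, y = datasets.load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_index, test_index in kf.split(X):
    # Train on k-1 folds, evaluate on the remaining fold.
    clf = svm.SVC(kernel='linear', C=1).fit(X[train_index], y[train_index])
    scores.append(clf.score(X[test_index], y[test_index]))

# The reported performance is the average over the k folds.
print(np.mean(scores))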

[Figure: 5-fold cross-validation on a training set, with a test set held out]

3.1.1. Computing cross-validated metrics


The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.

The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1, random_state=42)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores
array([0.96..., 1. , 0.96..., 0.96..., 1. ])

The mean score and the standard deviation are hence given by:

>>> print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
0.98 accuracy with a standard deviation of 0.02

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:

>>> from sklearn import metrics
>>> scores = cross_val_score(
...     clf, X, y, cv=5, scoring='f1_macro')
>>> scores
array([0.96..., 1.  ..., 0.96..., 0.96..., 1.        ])

See The scoring parameter: defining model evaluation rules for details. In the case of the Iris dataset, the samples are balanced across target classes hence the accuracy and the F1-score are almost equal.

When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default, the latter being used if the estimator derives from ClassifierMixin.
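To make that default explicit, the same call can be written with a StratifiedKFold object passed as cv (a sketch reusing clf, X and y from the snippets above; for a classifier this is equivalent to cv=5):

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Each fold preserves the class proportions of the full dataset.
skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(clf, X, y, cv=skf)
print(scores)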

It is also possible to use other cross validation strategies by passing a cross validation iterator instead, for instance:

>>> from sklearn.model_selection import ShuffleSplit
>>> n_samples = X.shape[0]
>>> cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
>>> cross_val_score(clf, X, y, cv=cv)
array([0.977..., 0.977..., 1.  ..., 0.955..., 1.        ])

Another option is to use an iterable yielding (train, test) splits as arrays of indices, for example:

>>> def custom_cv_2folds(X):
...     n = X.shape[0]
...     i = 1
...     while i <= 2:
...         idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int)
...         yield idx, idx
...         i += 1
...
>>> custom_cv = custom_cv_2folds(X)
>>> cross_val_score(clf, X, y, cv=custom_cv)
array([1.        , 0.973...])

Data transformation with held out data

Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction:

>>> from sklearn import preprocessing
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.4, random_state=0)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train_transformed = scaler.transform(X_train)
>>> clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test)
0.9333...

Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:

>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_val_score(clf, X, y, cv=cv)
array([0.977..., 0.933..., 0.955..., 0.933..., 0.977...])