Python Implementation: Hold-Out, k-Fold Cross-Validation, Stratified k-Fold Cross-Validation, and Leave-One-Out Cross-Validation

一只干巴巴的海绵

Article link: https://blog.csdn.net/Hanx09/article/details/104641221

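Models are central to statistics: a model describes the internal relationships in a dataset, and understanding those relationships helps predict future observations. A single model can describe different datasets under different parameter settings; some parameters are estimated from the data, while others (hyperparameters) must be set by hand. Conversely, one dataset can be described by several models, and it is rarely the case that one model is simply the best and all others are worthless.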

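A dataset can be viewed as a concrete realization of a set of variables. Describing the relationships within the dataset therefore amounts to describing the relationships among the variables, which in turn lets us predict the variables we care about.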

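A reasonable model must both fit the original data well and predict accurately, so it has to capture most of the main information in the dataset without absorbing too much of its noise. A model that fails at the former is underfitting; a model that fails at the latter is overfitting, and it will not fit a new dataset (another realization of the variables) well.
Figure: Underfitting and overfitting
Cross-validation is a model validation technique used to assess how well a statistical model generalizes to an independent dataset. Its goal is to set aside part of the original data so that problems arising during training, such as overfitting and underfitting, can be exposed, and to provide a criterion for measuring how well the model generalizes to independent data.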

Hold-Out Method

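This method simply divides the dataset into two parts: a training set and a test set. The training set is used to fit the model, and the test set is used to evaluate it.

The training set and the test set must not share any samples, and both subsets should be drawn uniformly from the full dataset. The usual approach is random sampling; with enough samples it approximates uniform sampling.
The training set must be large enough, generally more than 50% of all samples.
Once the hyperparameters have been chosen, the final model can be retrained on the whole dataset before it is put to use.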

Figure: Train/test split

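This approach has one major drawback: the data is split only once, so the result depends on chance, especially when the split is not random. If a particular split puts only easy samples in the training set and only hard samples in the test set, the test score is pessimistic and the model appears to underfit; in the opposite case the score is optimistic and overfitting can go unnoticed.

Strictly speaking, the hold-out method is not cross-validation, because the two subsets never swap roles.
It is a good choice only when plenty of data is available.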
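
The sensitivity of a single split can be seen by repeating the hold-out split with different random seeds. The following is a minimal sketch; the iris dataset, the logistic regression model, and the 70/30 ratio are illustrative assumptions and do not come from the original article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative data and model (assumptions, not part of the original text).
X, y = load_iris(return_X_y=True)

scores = []
for seed in range(5):
    # Each seed produces a different random 70/30 partition of the same data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))  # accuracy on the held-out part

print(scores)
print("spread:", max(scores) - min(scores))
```

The spread between the best and worst score gives a rough idea of how much a single hold-out estimate depends on the particular split.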

Python-sklearn

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

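Args:
*arrays: the data to split; lists, NumPy arrays, SciPy sparse matrices and pandas DataFrames are accepted;
test_size: proportion of the data assigned to the test set; 0.25 is used when neither test_size nor train_size is given;
train_size: proportion of the data assigned to the training set;
random_state: random seed that makes the shuffling repeatable, so that experiments can be reproduced;
shuffle: whether to shuffle the data before splitting.

train_test_split() is a function.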

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

$k$-Fold Cross-Validation (Groups=k)

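Concretely, the dataset is first shuffled and then divided evenly into $k$ parts. In turn, $k - 1$ parts are used as the training set and the remaining part is used for validation, with the model scored by some scoring function. After $k$ iterations, the $k$ scores are averaged, and this average is the basis for choosing the best model.
Every sample appears in the validation set exactly once and in the training set $k - 1$ times. This markedly reduces underfitting, because most of the data is used for training in each round; it also makes overfitting less likely to go unnoticed, because every sample is eventually used for validation.
Figure: k-fold

$k$-fold is well suited to statistical analysis of small datasets: the $k$ rounds reveal how much the model quality and the best parameters vary from split to split.
$k$ must be at least 2. In practice one usually starts from 3 and tries 2 only when the original dataset is small.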
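
The procedure described above can be sketched by hand with NumPy. This is only an illustration of the idea, not scikit-learn's implementation; the helper name `k_fold_indices` and its parameters are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split.

    Hypothetical helper written for illustration only.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle the sample indices once
    folds = np.array_split(indices, k)     # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]                 # one part is held out for validation
        train_idx = np.concatenate(
            [folds[j] for j in range(k) if j != i])  # the other k - 1 parts
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=12, k=5))

# Count how often each sample is used for validation and for training.
val_counts = np.zeros(12, dtype=int)
train_counts = np.zeros(12, dtype=int)
for train_idx, val_idx in splits:
    val_counts[val_idx] += 1
    train_counts[train_idx] += 1

print(val_counts)    # every sample is validated exactly once
print(train_counts)  # and used for training k - 1 = 4 times
```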

Python-sklearn

class sklearn.model_selection.KFold(n_splits=5, shuffle=False, random_state=None)

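Args:

n_splits: number of folds; default 5, must be at least 2;

shuffle: whether to shuffle the data before splitting it into batches; default False;
random_state: random seed, used only when shuffle=True.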

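Methods:

get_n_splits(X=None, y=None, groups=None): returns the number of splitting iterations, i.e. the value of n_splits.
split(X, y=None, groups=None): splits the data into training and test sets and returns a generator of (train, test) index arrays.

KFold() is a class: create an instance first (for example kf), then call its methods.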

## With shuffle=False, two runs give identical results
In [1]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=False)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
train_index:[ 3  4  5  6  7  8  9 10 11] , test_index: [0 1 2]
train_index:[ 0  1  2  6  7  8  9 10 11] , test_index: [3 4 5]
train_index:[ 0  1  2  3  4  5  8  9 10 11] , test_index: [6 7]
train_index:[ 0  1  2  3  4  5  6  7 10 11] , test_index: [8 9]
train_index:[0 1 2 3 4 5 6 7 8 9] , test_index: [10 11]

In [2]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=False)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
train_index:[ 3  4  5  6  7  8  9 10 11] , test_index: [0 1 2]
train_index:[ 0  1  2  6  7  8  9 10 11] , test_index: [3 4 5]
train_index:[ 0  1  2  3  4  5  8  9 10 11] , test_index: [6 7]
train_index:[ 0  1  2  3  4  5  6  7 10 11] , test_index: [8 9]
train_index:[0 1 2 3 4 5 6 7 8 9] , test_index: [10 11]

## With shuffle=True, two runs give different results
In [3]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=True)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
train_index:[ 0  1  2  4  5  6  7  8 10] , test_index: [ 3  9 11]
train_index:[ 0  1  2  3  4  5  9 10 11] , test_index: [6 7 8]
train_index:[ 2  3  4  5  6  7  8  9 10 11] , test_index: [0 1]
train_index:[ 0  1  3  4  5  6  7  8  9 11] , test_index: [ 2 10]
train_index:[ 0  1  2  3  6  7  8  9 10 11] , test_index: [4 5]
 
In [4]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=True)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
train_index:[ 0  1  2  3  4  5  7  8 11] , test_index: [ 6  9 10]
train_index:[ 2  3  4  5  6  8  9 10 11] , test_index: [0 1 7]
train_index:[ 0  1  3  5  6  7  8  9 10 11] , test_index: [2 4]
train_index:[ 0  1  2  3  4  6  7  9 10 11] , test_index: [5 8]
train_index:[ 0  1  2  4  5  6  7  8  9 10] , test_index: [ 3 11]
## With shuffle=True and random_state set to an integer, every run gives the same result
In [5]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=True,random_state=0)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
train_index:[ 0  1  2  3  5  7  8  9 10] , test_index: [ 4  6 11]
train_index:[ 0  1  3  4  5  6  7  9 11] , test_index: [ 2  8 10]
train_index:[ 0  2  3  4  5  6  8  9 10 11] , test_index: [1 7]
train_index:[ 0  1  2  4  5  6  7  8 10 11] , test_index: [3 9]
train_index:[ 1  2  3  4  6  7  8  9 10 11] , test_index: [0 5]
 
In [6]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=True,random_state=0)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
train_index:[ 0  1  2  3  5  7  8  9 10] , test_index: [ 4  6 11]
train_index:[ 0  1  3  4  5  6  7  9 11] , test_index: [ 2  8 10]
train_index:[ 0  2  3  4  5  6  8  9 10 11] , test_index: [1 7]
train_index:[ 0  1  2  4  5  6  7  8 10 11] , test_index: [3 9]
train_index:[ 1  2  3  4  6  7  8  9 10 11] , test_index: [0 5]
## Ways to read the value of n_splits
In [8]: kf.split(X)
Out[8]: <generator object _BaseKFold.split at 0x00000000047FF990> 
In [9]: kf.get_n_splits()
Out[9]: 5 
In [10]: kf.n_splits
Out[10]: 5

sklearn.model_selection.cross_val_score(estimator, X, y=None,
 groups=None, scoring=None, cv=None, n_jobs=None, verbose=0,
 fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)

Evaluates a score by cross-validation.
Returns: scores, an array of float with shape (len(list(cv)),), holding the estimator's score for each run of the cross-validation.

>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_val_score
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()
>>> print(cross_val_score(lasso, X, y, cv=3))
[0.33150734 0.08022311 0.03531764]
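
To connect the two APIs, a KFold instance can be passed to cross_val_score through the cv argument, which should give the same scores as fitting and scoring the model fold by fold. The sketch below reuses the Lasso and diabetes data from the example above; the choice of 5 shuffled folds with random_state=0 is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
X, y = X[:150], y[:150]

# A fixed random_state makes kf.split() return the same folds on every call.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: fit on k - 1 folds, score (R^2) on the remaining fold.
manual_scores = []
for train_index, test_index in kf.split(X):
    model = Lasso()
    model.fit(X[train_index], y[train_index])
    manual_scores.append(model.score(X[test_index], y[test_index]))

# The same evaluation in one call.
cv_scores = cross_val_score(Lasso(), X, y, cv=kf)

print(cv_scores)
print("mean: %.3f, std: %.3f" % (np.mean(cv_scores), np.std(cv_scores)))
```

Reporting the mean together with the standard deviation shows both the typical score and how much it varies between folds.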

Leave-One-Out Cross-Validation (Groups=len(train))

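Leave-one-out is a special case of cross-validation. If the sample size is n, then k = n: n-fold cross-validation is performed, and each round holds out a single sample for validation. It is aimed mainly at small datasets.

It is useful when only a little data is available and the model is cheap to retrain.
In every round almost all samples are used to train the model, so the training set is as close as possible to the distribution of the original sample and the resulting estimate has little bias.
No random factor affects the split, so the experiment can be reproduced exactly.
The computational cost is high, because the model is trained n times.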

Python-sklearn

sklearn.model_selection.LeaveOneOut()
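
A minimal usage sketch, assuming a toy array of 4 samples (the data is made up for illustration), shows that LeaveOneOut produces one split per sample:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(8).reshape(4, 2)   # toy data: 4 samples, 2 features

loo = LeaveOneOut()
print(loo.get_n_splits(X))       # 4, one split per sample

splits = list(loo.split(X))
for train_index, test_index in splits:
    # Each round leaves exactly one sample out for testing.
    print('train_index: %s , test_index: %s' % (train_index, test_index))
```

The split is deterministic: the i-th round always holds out the i-th sample, so LeaveOneOut takes no shuffle or random_state argument.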

Stratified $k$-Fold Cross-Validation

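$k$-fold cross-validation divides the data into equal parts at each split. Imagine a dataset with 5 classes in which the folds happen to coincide with the classes: the first fold is all class 0, the second is all class 1, and so on. With such a split the model never sees, during training, the kind of data it is tested on, so it scores very poorly.
Although the data is usually shuffled before $k$-fold cross-validation to obtain random subsets, the results are better still if the split also takes the proportion of each class into account. This is the idea behind stratified $k$-fold cross-validation.
Figure: Stratification
Stratification means that every fold preserves the class proportions of the original data. For example, suppose a dataset has 20 samples in three classes with proportions 4:3:3 (8, 6 and 6 samples). To draw 10 samples as a training or validation set, take 4, 3 and 3 samples from the respective classes; validation results obtained this way are more trustworthy.

Stratification works well for small datasets, imbalanced datasets and multi-class problems, because it takes the internal distribution of the data into account, so the resulting subsets (training set and test set) inherit the distribution of the full dataset.
In general, for a large balanced dataset, a stratified split and a split made after random shuffling give essentially the same result.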
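
The 4:3:3 example can be checked with a few lines of code. The sketch below uses the stratify argument of train_test_split rather than StratifiedKFold, simply because it is the shortest way to draw one stratified subset of 10 samples; the toy labels are an assumption built to match the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 samples in three classes with proportions 4:3:3 (8, 6 and 6 samples).
y = np.array([0] * 8 + [1] * 6 + [2] * 6)
X = np.arange(20).reshape(20, 1)   # placeholder features

# Draw 10 samples for each subset while preserving the class proportions.
X_a, X_b, y_a, y_b = train_test_split(
    X, y, test_size=10, stratify=y, random_state=0)

print(np.bincount(y_a))   # [4 3 3]
print(np.bincount(y_b))   # [4 3 3]
```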

Python-sklearn

sklearn.model_selection.StratifiedKFold()

Args:

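n_splits: number of folds; default 5 (it was 3 before scikit-learn 0.22), must be at least 2;
shuffle: whether to shuffle the samples of each class before splitting them into batches; default False;
random_state: random seed, used only when shuffle=True.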
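
The difference from plain KFold is easiest to see on labels that are sorted by class, which is the worst case described earlier. The following sketch assumes a made-up dataset of 5 classes with 5 samples each; the helper `class_counts_per_fold` is hypothetical and only counts the classes in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 25 samples sorted by class: five samples each of classes 0 to 4.
y = np.repeat(np.arange(5), 5)
X = np.zeros((25, 1))   # the features do not matter for the split itself

def class_counts_per_fold(splitter):
    """Return the per-class sample counts of every test fold (illustrative helper)."""
    return [np.bincount(y[test_index], minlength=5).tolist()
            for _, test_index in splitter.split(X, y)]

kf_counts = class_counts_per_fold(KFold(n_splits=5))
skf_counts = class_counts_per_fold(StratifiedKFold(n_splits=5))

print(kf_counts)    # each test fold contains a single class
print(skf_counts)   # each test fold contains one sample of every class
```

Without shuffling, KFold tests each model on a class it never saw in training, while StratifiedKFold keeps all five classes in every training and test fold.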

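Summary

Training set: used to fit the model;
Validation set: used to validate the model structure and the hyperparameters;
Test set: data unseen by the model, used to evaluate its generalization ability.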
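
One simple way to obtain all three subsets is to call train_test_split twice. The sketch below is an assumption about how this might be done, not a method given in the original article; the iris data and the 90/30/30 sizes are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # 150 samples

# Step 1: set aside 30 samples (20%) as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=30, stratify=y, random_state=0)

# Step 2: split the remaining 120 samples into 90 for training
# and 30 for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=30, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 90 30 30
```

The test set is touched only once, after the model structure and hyperparameters have been settled on the validation set.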

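When to use each method:

If there is enough data, and different splits give similar model scores and similar best parameters, a train/test split is a good choice.
If different splits keep giving different model scores and different best parameters, KFold is a good choice.
For small datasets, leave-one-out (LeaveOneOut) is the better choice.
Stratification makes the validation set more stable and works remarkably well on small, imbalanced datasets.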

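Uses of cross-validation:

Cross-validation gives an effective estimate of model quality;
Cross-validation helps detect overfitting and underfitting;
Cross-validation can be used to select hyperparameters, and hyperparameters chosen this way tend to give a small test error.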
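
Hyperparameter selection by cross-validation can be sketched with GridSearchCV, which scores every candidate value on the same folds and keeps the best one. The Lasso model, the diabetes data, and the candidate alpha values below are illustrative assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

X, y = load_diabetes(return_X_y=True)

# Candidate values for the regularization strength (chosen for illustration).
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0]}

# Every candidate is evaluated on the same 5 shuffled folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=cv)
search.fit(X, y)

print(search.best_params_)   # the alpha with the highest mean CV score
print(search.best_score_)    # its mean R^2 across the 5 folds
```

Because the folds were also used to choose alpha, the best cross-validated score is slightly optimistic; a separate test set gives a cleaner final estimate.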
