Cross Validation in Python

Cross validation refers to any of various model validation techniques used to assess how well a predictive model will generalize to an independent dataset that the model has not seen before. It is typically employed when we are predicting something and want a rough estimate of how well our predictive model will perform in practice.

“Before Creating any Machine Learning Model, we must know what Cross Validation is and how to choose the best Cross Validation” — Abhishek Thakur, Kaggle’s first 4x Grandmaster

By the end of this post you will have a good understanding of the popular cross validation techniques, how to implement them using scikit-learn, and how to select the correct CV for a specific problem.

Popular Cross Validation Techniques

Essentially, selecting the correct cross validation technique boils down to the data we have on hand, which is why one choice of cross validation may or may not work for another set of data. However, the goal of employing a cross validation technique remains constant: we want to estimate the expected level of fit of a predictive model on unseen data, since with this information we can make the necessary adaptations to our predictive model (if required) or decide to use a totally different one.

Hold-Out Based Cross Validation

I may receive abuse from some experienced Data Science, Machine Learning, and/or Deep Learning practitioners for improper terminology, because cross validation often allows the predictive model to train and test on various splits whereas hold-out sets do not. Regardless, a hold-out based cross validation is when we split our data into a train set and a test set. This is often the first validation technique you’d have implemented and the easiest to get your head around. It consists, quite literally, of dividing your data into separate portions, so that you may train your predictive model on one portion and test it on the other.

Note: Some people take this further and will have a training dataset, a validation dataset and a test dataset. The validation dataset will be used to tune the predictive model and the test set will be used to test how well the model generalizes.
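For illustration, here is a minimal sketch of that three-way split using two successive calls to scikit-learn’s train_test_split (the 60/20/20 proportions and the toy arrays are assumptions for illustration, not from the original post):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(20).reshape((10, 2)), np.arange(10)

# First carve out a 20% test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ...then split the remainder into train/validation.
# 0.25 of the remaining 80% gives a 60/20/20 overall split.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)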

When dividing the data, something to keep in mind is that you have to determine what proportion of the data is for training and what is for testing. I’ve seen various splits, from 60% (train) / 40% (test) to 80% (train) / 20% (test). It is safe to say that 60%-80% of your data should go towards training your predictive model and the remainder may go directly to the test set (or be split again into validation and test sets).

[Image: Adi Bronshtein, Train/Test Split and Cross Validation in Python]
# https://bit.ly/3fUuyOy
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)

The hold-out based validation technique is most effective when we have a very large dataset. Since it does not require us to test on various splits, this technique uses much less computational power, making it the go-to strategy for validation on large datasets.

K-Fold Cross Validation

I briefly touched on what cross validation consists of above: “cross validation often allows the predictive model to train and test on various splits whereas hold-out sets do not.” In other words, cross validation is a resampling procedure. When “k” is present in machine learning discussions, it is often used to represent a constant value; for instance, k in k-means clustering refers to the number of clusters, and k in k-Nearest Neighbors refers to the number of neighbors to consider when performing a plurality vote (for classification). This pattern holds true for k-fold cross validation as well, where k refers to the number of groups that a given data sample should be split into.

In k-fold cross validation, the original sample is randomly partitioned into k equal-sized groups. From the k groups, one group is removed to serve as a hold-out set and the remaining groups become the training data. The predictive model is then fit on the training data and evaluated on the hold-out set. This procedure is repeated k times so that every group serves exactly once as the hold-out set.

[Image: Scikit-Learn Documentation]
# https://bit.ly/2POmqVb
>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4])
>>> kf = KFold(n_splits=2)
>>> print(kf.get_n_splits(X))
2
>>> print(kf)
KFold(n_splits=2, random_state=None, shuffle=False)
>>> for train_index, test_index in kf.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]

Due to its ease of comprehension, k-fold cross validation is quite popular, and it often results in a less biased estimate than hold-out based validation. This technique is often a good starting point for regression problems, although it may be better to use stratified k-fold if the distribution of the target variable is not consistent, which would require binning the target variable. Something to consider is the configuration of k, which must split the data so that each train/test split of the data samples is large enough to be statistically representative of the broader dataset.
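In practice, the per-fold evaluation loop is usually wrapped up by scikit-learn’s cross_val_score helper. Here is a minimal sketch of k-fold used to estimate a model’s expected fit (the synthetic data and the LinearRegression estimator are illustrative assumptions, not from the original post):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# 5 folds is a common choice; each fold of 20 samples stays reasonably
# representative of this (synthetic) dataset.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")

# The mean approximates the expected fit on unseen data; the standard
# deviation hints at how sensitive that estimate is to the split.
print(scores.mean(), scores.std())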

Stratified K-Fold

Both of the techniques we have covered so far are relatively effective in many scenarios, although misleading results (and potentially overall failure) can arise when the target data has imbalanced labels. I’ve been careful not to mention this only as a problem for classification tasks, because we can adjust a regression task in some ways so that we are able to perform stratified k-fold for validation. A better solution is to split the data randomly in such a way that we maintain the same class distribution in each subset; this is what we refer to as stratification.

Note: Other than the way we randomly split the data, stratified k-fold cross validation is the same as simple k-fold cross validation.

[Image: Scikit-Learn Documentation]
# https://bit.ly/3iCHavo
>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> skf = StratifiedKFold(n_splits=2)
>>> print(skf.get_n_splits(X, y))
2
>>> print(skf)
StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
>>> for train_index, test_index in skf.split(X, y):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]

According to the first 4x Kaggle Grandmaster, Abhishek Thakur, it is safe to say that if we have a standard classification task, then applying stratified k-fold cross validation blindly is not a bad idea at all.
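As mentioned above, a regression task can also be stratified by first binning the continuous target. Here is a minimal sketch of that idea (the quantile bins, the bin count, and the synthetic data are assumptions for illustration, not a prescription from the original post):

import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = rng.rand(100)  # continuous regression target

# Discretize y into 5 quantile bins so stratification has "classes" to balance.
bins = np.digitize(y, np.quantile(y, [0.2, 0.4, 0.6, 0.8]))

skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, bins):
    # Fit a regressor on X[train_idx], y[train_idx] and evaluate on the rest;
    # every fold now sees a similar spread of target values.
    pass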

Leave-One-Out Cross Validation

Leave-one-out cross validation may be thought of as a special case of k-fold cross validation where k = n and n is the number of samples in the original dataset. In other words, the model is trained on n - 1 samples and used to predict the sample that was left out, and this is repeated n times so that each sample serves once as the left-out sample.

[Image: DataCamp]
# https://bit.ly/2POw0HU
>>> import numpy as np
>>> from sklearn.model_selection import LeaveOneOut
>>> X = np.array([[1, 2], [3, 4]])
>>> y = np.array([1, 2])
>>> loo = LeaveOneOut()
>>> print(loo.get_n_splits(X))
2
>>> print(loo)
LeaveOneOut()
>>> for train_index, test_index in loo.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
...     print(X_train, X_test, y_train, y_test)
TRAIN: [1] TEST: [0]
[[3 4]] [[1 2]] [2] [1]
TRAIN: [0] TEST: [1]
[[1 2]] [[3 4]] [1] [2]

This technique may require a large amount of computation time, in which case k-fold cross validation may be a better solution. Alternatively, if the dataset is small, then holding out a sizable portion of it would starve the predictive model of training data, which makes leave-one-out a good solution in this situation.
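To make the computational cost concrete: leave-one-out performs one model fit per sample. A minimal sketch (the diabetes dataset and LinearRegression estimator are illustrative assumptions, not from the original post):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)  # 442 samples

# One fit and one single-sample prediction per row of X.
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
print(len(scores))     # 442 fits in total
print(-scores.mean())  # mean absolute error across all left-out samples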

Group K-Fold Cross Validation

GroupKFold cross validation is another variation of k-fold cross validation which ensures that the same group is not represented in both the train and the test set. For instance, if we would like to build a predictive model that classifies malignant or benign from images of patients’ skin, it is likely that we would have multiple images from the same patient. Since we do not want to split a single patient across the training and test sets, we revert to GroupKFold instead of k-fold (or stratified k-fold, for that matter); the patients would then be considered as groups.

[Image: Scikit-Learn Documentation]
# https://scikit-learn.org/stable/modules/cross_validation.html
>>> from sklearn.model_selection import GroupKFold
>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]

Note: Another variant of this technique is stratified group k-fold cross validation, which is performed when we want to preserve the class distribution and we don’t want the same group to appear in two different folds. There is no inbuilt solution for this in scikit-learn, but there is a nice implementation in this Kaggle Notebook.
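For what it’s worth, scikit-learn releases from 1.0 onwards (after this article was written) do ship a StratifiedGroupKFold class. A minimal sketch, assuming scikit-learn >= 1.0 and toy data of my own invention:

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

X = np.ones((12, 1))
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

sgkf = StratifiedGroupKFold(n_splits=3)
for train, test in sgkf.split(X, y, groups=groups):
    # Each fold aims to preserve the class balance, and no group straddles a split.
    print("TRAIN:", train, "TEST:", test)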

Wrap Up

Cross validation is the first step to building Machine Learning models, and it’s extremely important that we consider the data we have when deciding which technique to employ. In some cases, it may even be necessary to adopt new forms of cross validation depending on the data.

“If you have a good Cross Validation scheme in which validation data is representative of the training and real-world data, you will be able to build a good Machine Learning Model which is highly generalizable” — Abhishek Thakur

This story was highly inspired by the book Approaching (Almost) Any Machine Learning Problem (the link is not an affiliate link and I have not been asked to promote the book) by the first 4x Kaggle Grandmaster Abhishek Thakur. If you already have some experience with Machine Learning and want more practical advice, then I’d highly recommend this book.

If you’d like to get in contact with me, I am most reachable on LinkedIn.

Translated from: https://towardsdatascience.com/cross-validation-c4fae714f1c5
