A Simple Introduction to Validating and Testing a Model (Part 1)

Decoding the importance of the validation and test sets for everyone.

In this article, we will learn the importance of the validation set and the techniques used to split the original dataset into subsets (train, validation, and test). We will first understand how each technique works and then walk through the code for a better learning experience.

It is essential to test our model on unseen data to check if it will generalize to new cases.

There are two ways to check the performance of a model:

  1. Build the model and put it directly into production. This way we can see how it performs on new (unseen) data, but if the model is not good, the users will not be happy.

  2. The other, smarter way is to split the data into two parts, use one to train the model, and keep the other for testing. The error rate produced on the test set is also called the generalization error.

We usually go with the second method as it is safer and more reliable.

In the figure below, we can see how we keep one chunk of the data for training and the other for testing. Usually, we take 20% of the original data as the test data, but this can be changed as per the requirement.

[Figure: an 80/20 split of the dataset into training and test sets]
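
As a quick illustration, here is a minimal sketch of such an 80/20 split using scikit-learn's train_test_split on a tiny made-up dataset (the data and values below are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A tiny synthetic dataset just to illustrate the split (10 samples, 3 features)
X = np.arange(30).reshape(10, 3)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the data as the test set
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
print(train_x.shape, test_x.shape)  # (8, 3) (2, 3)
```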

Model building is an iterative process; once we build our model, we keep improving it.

The steps involved in model building are:

  1. Hypothesis Generation
  2. Dataset Creation
  3. Modeling
  4. Evaluation

We have discussed these in the previous article; here's the link to it.

[Link to the previous article]

How can we decide if the model fits the data?

Model evaluation is based on the performance of the model on the test data. Therefore, in some cases, we might need to tune some hyperparameters to change our model and make it perform better.

Evaluation metrics such as the generalization error help us compare different models and decide which one fits the data better.

Based on the model fit, we can classify a model into 3 different categories:

  1. Under-fit
  2. Over-fit
  3. Best-fit

We will discuss these in detail in coming articles. For now, all we need to know is that we want to achieve a good fit, and that neither overfitting nor underfitting is desirable.

Problems linked with splitting the dataset into two: train and test

Suppose we change our model multiple times and finally achieve a low generalization error (for example, 5%). We then launch our model, and it ends up not performing well.

What do you think went wrong here?

Well, we made changes to our model multiple times in order to lower the generalization error on our test data. This means the model has, in effect, been tuned to the test set, so we are no longer testing it on completely new (unseen) data.

This might lead to a state where the model will not generalize well.

Solution: Creating a Validation Set

To solve this issue, we will use a Validation Set.

We can split the existing dataset into three parts: train, validation, and test.

Now that we have three sets, we will use the training set to train the model, the validation set to optimize the model, and the test set to check how the model performs on unseen data.

In the figure below, we can see how we split the data into training, validation, and test sets. Usually, we use a proportion like the one shown, but it can be changed as per the requirement.

[Figure: splitting the dataset into training, validation, and test sets]

How to create a Validation Set?

Techniques used to generate the validation set:

  • Hold-out Validation
  • Stratified Hold-out Validation
  • k-fold cross-validation
  • Leave-one-out validation

In this article, we will learn the first two techniques; the rest we will cover in future articles.

Hold-out Validation

The steps involved in carrying out this technique are:

  1. Take the data and shuffle it (randomize the order of the rows)
  2. Split the data into train and test sets
  3. Split the training data further into train and validation sets

This technique is simple: all we need to do is set aside parts of the original dataset and use them for testing and validation. The splitting can easily be done in Python using libraries such as sklearn.

Issues linked with Hold-out Validation

With a purely random split, the distribution of the variables in each set (train, validation, test) can end up being different, and as a result our model will not be able to generalize well.

Stratified Hold-out Validation

This technique solves the issues related to the hold-out validation technique.

Here we make sure that each set has a similar distribution, which eventually helps us generate a better model.

Now that we know what these two techniques are, let’s have a look at the code.

We will be using Python 3.

Libraries used:

  • Pandas
  • Numpy
  • Matplotlib
  • Sklearn

We will be using a preprocessed Titanic dataset here to understand how the hold-out and stratified hold-out techniques work.

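A minimal sketch of this step, assuming the cleaned data has been saved locally as titanic_preprocessed.csv (the file name is an assumption):

```python
import pandas as pd

# Load the preprocessed Titanic dataset (the file name is an assumption)
df = pd.read_csv("titanic_preprocessed.csv")
df.head()
```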

Here df will now have the dataset that we want to use.

In the preview we can see the first 5 rows of the data and its 25 columns, where Survived is our target (dependent) variable and the rest are the independent variables.

Even though we are working with clean data, we will still check whether there are any missing values. As we can see, there are no missing values in our dataset.

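One simple way to run that check (a sketch, continuing with the df loaded above):

```python
# Count the missing values in each column; for this cleaned dataset every count should be 0
print(df.isnull().sum())
```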

We will store the independent variables (features) as df_x and the dependent (target) variable as df_y.

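For example, with Survived as the target column, a sketch of that split looks like this:

```python
# Independent variables (features) and the dependent (target) variable
df_x = df.drop("Survived", axis=1)
df_y = df["Survived"]
```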

We will now import the train_test_split function from the sklearn library, as it provides a very simple way to split our data.

Here, for plain hold-out validation, we will not use stratification. We set a random state so that each time we run the code we get the same splits.

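A minimal sketch of that first split; the 80/20 ratio and the random_state value here are assumptions chosen for illustration:

```python
from sklearn.model_selection import train_test_split

# First split: hold out 20% of the data as the test set (no stratification yet);
# a fixed random_state makes the split reproducible (both values are assumptions)
train_x, test_x, train_y, test_y = train_test_split(
    df_x, df_y, test_size=0.2, random_state=42
)
```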

Now, we will use train_test_split again, but this time on the training data, splitting it into training and validation sets.

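Continuing the sketch (again, the split ratio and random_state are illustrative assumptions):

```python
# Second split: carve a validation set out of the remaining training data
train_x, val_x, train_y, val_y = train_test_split(
    train_x, train_y, test_size=0.25, random_state=42
)
```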

We now have train, validation, and test sets. Let’s check the distribution of our target class in all three.

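One simple way to inspect those distributions (a sketch):

```python
# Compare the class proportions of the target variable across the three sets
print(train_y.value_counts(normalize=True))
print(val_y.value_counts(normalize=True))
print(test_y.value_counts(normalize=True))
```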

As we can see, the distribution in each set is not similar; therefore our model will not be able to generalize well.

The solution to this problem is Stratified Hold-out Validation.

Let’s see how it works.

We will use the same lines of code here as well. The only difference is that this time we use stratification.

In this case, we stratify the data with respect to our target variable; as we saw, df_y is the subset of the data containing the target class variable.

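A sketch of the stratified version, passing the target variable to the stratify parameter (ratios and random_state are still illustrative assumptions):

```python
# Stratified hold-out: stratify on the target so every split keeps similar class proportions
train_x, test_x, train_y, test_y = train_test_split(
    df_x, df_y, test_size=0.2, random_state=42, stratify=df_y
)
train_x, val_x, train_y, val_y = train_test_split(
    train_x, train_y, test_size=0.25, random_state=42, stratify=train_y
)

# The target-class proportions should now be nearly identical across the sets
print(train_y.value_counts(normalize=True))
print(val_y.value_counts(normalize=True))
print(test_y.value_counts(normalize=True))
```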

It can be seen that the distribution of the target class is now similar across the sets, which is a good thing, as our model will now be able to generalize better.

Issues with the Stratified Hold-out Validation Technique

The problem with the hold-out and stratified hold-out validation techniques is that, in order to generate a validation set, we take a subset of the training set which we can then no longer use for training. Therefore we have less data for training, which can be a disadvantage.

Also, since we will be working with a single validation set, the model might overfit to it again.

Solution:

This issue can be resolved by using K-fold Cross-Validation. We will talk about the K-fold Cross-Validation and Leave-One-Out Validation techniques in the next article.

REFERENCES

  • Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 1st Edition, by Aurélien Géron, Chapter 1.

  • Applied Machine Learning Course - Analytics Vidhya

Congratulations! You just finished learning the following topics:

  • Importance of the train/test split
  • Importance of the validation set
  • Techniques used to generate a validation set and their disadvantages

In case you have any questions, you can post them in the comments; I would be more than happy to address them.

You can also find me on LinkedIn.

Any suggestions for improvement and feedback will be appreciated.

If you like my work, please consider following me; I will be writing more articles on Data Science.

Translated from: https://medium.com/analytics-vidhya/a-simple-introduction-to-validating-and-testing-a-model-part-1-2a0765deb198
