An Overview of the Training Set, Validation Set, and Test Set

I am just a beginner in machine learning, and while working on some small competitions I wasn't clear about several questions around model evaluation and selection, so I have reposted the following two blog posts. Huge thanks to the original authors for writing things up in such detail! (To be honest, I still have some questions after reading them, so I'll need to go through them a few more times to let everything sink in.)
The content of blog post 1 follows:
Original address: http://www.cnblogs.com/xfzhang/archive/2013/05/24/3096412.html

A Few Small Questions about Model Evaluation and Selection in Machine Learning

Part 1 Training Set, Validation Set, and Test Set

During model training, what is the difference between validation and testing, and what exactly is the validation set for?
Suppose we have 100 samples, which have to serve both for training and for testing. One option for model evaluation is the hold-out method.
The hold-out method works as follows:

Split the dataset D = {(x1, y1), (x2, y2), ..., (x100, y100)} into two mutually exclusive subsets, a training set S and a test set T. Train the model on S, then evaluate its test error on T as an estimate of the generalization error.

For example, from a remote-sensing image we pick 100 sample regions: 80 of them are used as training samples and the remaining 20 as test samples. But the model obtained this way, and its reported accuracy, may not be accepted. A few days ago a classmate's submission came back with the editor's comment that no validation set had been used!
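
As a side note, here is a minimal sketch of the 80/20 hold-out evaluation described above, using scikit-learn; the synthetic dataset and the logistic-regression model are illustrative placeholders, not part of the original post.

# A minimal hold-out sketch: 80% training / 20% test split.
# X, y and the choice of LogisticRegression are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Split D into a mutually exclusive training set S and test set T.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on T serves as an estimate of the generalization performance.
print("hold-out test accuracy:", model.score(X_test, y_test))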
However, on page 14 of Dr. Li Hang's Statistical Learning Methods there is the following passage:

If sufficient samples are available, a simple approach to model selection is to randomly split the dataset into three parts: a training set, a validation set, and a test set. The training set is used to train the models, the validation set is used for model selection, and the test set is used for the final evaluation of the learning method. Among the learned models of different complexity, choose the one with the smallest prediction error on the validation set. Since the validation set has enough data, using it for model selection is effective.

Li Hang's point is that when samples are plentiful, splitting them into three parts is the reasonable choice. But my classmate's paper actually had only 80 samples; splitting those into training, validation, and test sets would hardly be reasonable. In the end, my classmate used the cross-validation described in Part 2 below to assess the model's accuracy, rather than splitting the data into three parts.

Setting aside the editor's objection and whether it was justified, what I want to work out here is how the validation set is actually used during training and what role it plays.
Embarrassingly, I have been at machine learning for quite a while and still wasn't clear on this. That is hard to excuse; I simply haven't practiced enough. So let's get it straight here.
Both the training set and the validation set are used during model training. The training workflow is:

for each epoch
    for each training data instance
        propagate error through the network
        adjust the weights
        calculate the accuracy over training data
    for each validation data instance
        calculate the accuracy over the validation data
    if the threshold validation accuracy is met
        exit training
    else
        continue training
 
 

From the workflow above we can see:
1. The training set is used to adjust the weights of the neural network during training.
2. The validation set is not used to adjust the weights; it is there to prevent overfitting. If the accuracy obtained on the training set keeps rising as training goes on, while the accuracy computed on the validation set stays flat or even drops, then overfitting has set in and training should be stopped. In other words, the model has to strike a trade-off between the training set and the validation set and stay balanced between the two.

To make sure you don't overfit the network you need to input the validation dataset to the network and check if the error is within some range. Because the validation set is not being used directly to adjust the weights of the network, a good error on the validation set and also on the test set indicates that the network predicts well for the training set examples and is also expected to perform well when new examples, which were not used in the training process, are presented to it.

3. The test set is used only to test the model, i.e., to assess how good it is, which means evaluating its generalization ability. The accuracy the model achieves on the test set is then a representative accuracy: when the model is later run on new data, its accuracy should not stray far from this range.

Here is an additional note from Wikipedia on validation-based early stopping.

These early stopping rules work by splitting the original training set into a new training set and a validation set. The error on the validation set is used as a proxy for the generalization error in determining when overfitting has begun. These methods are most commonly employed in the training of neural networks.

Prechelt gives the following summary of a naive implementation of holdout-based early stopping:
1. Split the training data into a training set and a validation set, e.g. in a 2-to-1 proportion.
2. Train only on the training set and evaluate the per-example error on the validation set once in a while, e.g. after every fifth epoch.
3. Stop training as soon as the error on the validation set is higher than it was the last time it was checked.
4. Use the weights the network had in that previous step as the result of the training run.
 Lutz Prechelt, Early Stopping – But When?

More sophisticated forms use cross-validation – multiple partitions of the data into training set and validation set – instead of a single partition into a training set and validation set. Even this simple procedure is complicated in practice by the fact that the validation error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad-hoc rules for deciding when overfitting has truly begun.
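
As a concrete illustration of Prechelt's naive rule, here is a minimal sketch; the tiny numpy logistic-regression model, the learning rate, and the five-epoch evaluation interval are illustrative assumptions, and only the stopping rule itself follows the summary above.

# Minimal sketch of Prechelt's naive holdout-based early stopping.
# The tiny numpy logistic-regression model and its hyper-parameters are
# illustrative assumptions; only the stopping rule follows the text above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X @ rng.normal(size=5) + 0.5 * rng.normal(size=300) > 0).astype(float)

# 1. Split the training data 2-to-1 into a training set and a validation set.
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

def val_error(w):
    p = 1.0 / (1.0 + np.exp(-(X_val @ w)))
    return np.mean((p - y_val) ** 2)

w = np.zeros(5)
best_w, last_err = w.copy(), np.inf
for epoch in range(1, 501):
    # 2. Train only on the training set (one gradient step per epoch here).
    p = 1.0 / (1.0 + np.exp(-(X_tr @ w)))
    w -= 0.1 * X_tr.T @ (p - y_tr) / len(y_tr)

    if epoch % 5 == 0:                        # evaluate every fifth epoch
        err = val_error(w)
        if err > last_err:                    # 3. stop once validation error rises
            break
        best_w, last_err = w.copy(), err

w = best_w                                    # 4. keep the weights from the previous check
print("stopped at epoch", epoch, "with validation error", last_err)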

One more thing needs to be added: the validation set is not only used during training to prevent overfitting and to balance training accuracy against validation accuracy; it is also used to "compare their performances and decide which one to take", i.e., for model selection. I found the following sentence in the Wikipedia article on the validation set:

Validation set: A set of examples used to tune the parameters (i.e., architecture, not weights) of a classifier, for example to choose the number of hidden units in a neural network.

The basic process of using a validation set for model selection (as part of training set, validation set, and test set) is:
Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set.

As for the model-selection role of the validation set, there is an example explained by Andrew Ng in his Coursera course; here is a summary:

In practice, we learn the parameters from the training set, compute the error on the (cross-)validation set, pick the model with the smallest validation error, and finally estimate that model's generalization error on the test set.
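
A minimal sketch of this procedure in the spirit of Ng's example, choosing a polynomial degree by validation error; the synthetic data, the candidate degrees, and the ridge regressor are illustrative assumptions.

# Sketch: pick the polynomial degree with the smallest validation error,
# then report the generalization error of that single model on the test set.
# The synthetic data and candidate degrees are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=300)

# 60% training / 20% validation / 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_degree, best_err = None, np.inf
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1e-3))
    model.fit(X_train, y_train)
    err = mean_squared_error(y_val, model.predict(X_val))
    if err < best_err:                       # model selection on the validation set
        best_degree, best_err = degree, err

final = make_pipeline(PolynomialFeatures(best_degree), Ridge(alpha=1e-3))
final.fit(X_train, y_train)
# The test set is touched only once, for the final generalization estimate.
print("chosen degree:", best_degree,
      "test MSE:", mean_squared_error(y_test, final.predict(X_test)))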

Part 2 Cross-validation

With cross-validation it is easy to fall into a mistaken line of questioning (it came from the classmate mentioned above, who ended up choosing cross-validation because he had too few samples to split into three parts; at first I couldn't see the answer either... =_=).

"After cross-validation, how is the final model determined?" For instance, with 10-fold cross-validation every fold yields a model, so 10 folds yield 10 models; how do we pick the final one?

In fact, this way of thinking already has it backwards. The purpose of k-fold cross-validation is not to pick one of those trained instances; rather, we first have a model and then use cross-validation to assess its accuracy.
I found an excellent explanation on StackExchange: How to choose a predictive model after k-fold cross-validation (my English is shaky, so the original is worth reading):

When we talk about a "model", we usually mean a particular method of describing how input data relate to the predicted output, and generally not the different instances of that method trained on different data. So you can say "I have a linear regression model", but you would not call two neural networks trained on two different datasets two different models. This way of defining "model" holds at least in the context of model selection.

So when you do k-fold cross-validation, you train a model on some of the data and use the data that was held out of training to test how good that model is. We use cross-validation because, if you trained on all the data you have, there would be none left for testing.
You have probably done this before: use 80% of the dataset for training and the remaining 20% for testing. But what if the 20% you chose for testing happens to contain points that are especially easy to predict, or especially hard to predict? Then the assessment of the model we obtain may not be a fair one.

So what we would really like is to use all of the data for training. Continuing the 80/20 example, we can use 5-fold cross-validation: five times we train on 80% of the data and test on the remaining 20%, making sure a different 20% is used for testing each time. In this way every data point gets used to check how good the model is.

But the purpose of cross-validation is not to build the final model. We do not use these five trained model instances to make actual predictions, because we want to train on all of the data to get the best model we can. The purpose of cross-validation is checking and evaluation, not model building.

Now suppose we have two models, say a linear regression model and a neural network. How do we tell which one is better? We can do k-fold cross-validation and see which one performs better on the held-out folds. But once we have used cross-validation to pick the better-performing model, we train that model, whether it is the linear regression or the neural network, on all of the data. Once again, we do not use the specific model instances trained during cross-validation as the final predictive model.
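
A minimal sketch of that comparison with scikit-learn; the synthetic dataset and the two estimators are illustrative assumptions. The cross-validation scores are used only to choose between the two model families, and the chosen family is then refit on all of the data.

# Sketch: use k-fold cross-validation only to compare two model families,
# then refit the winner on ALL of the data as the final predictive model.
# The synthetic dataset and the two chosen estimators are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "linear": LogisticRegression(max_iter=1000),
    "neural net": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}

# 5-fold CV gives each family an accuracy estimate; no fold model is kept.
scores = {name: cross_val_score(est, X, y, cv=5).mean()
          for name, est in candidates.items()}
best_name = max(scores, key=scores.get)

final_model = candidates[best_name].fit(X, y)   # refit on all the data
print(scores, "-> final model:", type(final_model).__name__)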

Prof. Zhou Zhihua's Machine Learning explains cross-validation as follows:

The "cross-validation" method first partitions the dataset D into k mutually exclusive subsets of similar size, each obtained by stratified sampling so as to preserve the data distribution as far as possible. Each round, the union of k-1 subsets is used as the training set and the remaining subset as the test set; this yields k rounds of training and testing, and the final result returned is the mean of the k test results. Since the stability and fidelity of the estimate depend heavily on the value of k, the method is usually called "k-fold cross-validation". The most common value of k is 10, in which case it is called 10-fold cross-validation.

As with the hold-out method, there are many ways to partition D into k subsets. To reduce the differences introduced by different partitions, k-fold cross-validation is usually repeated p times with different random partitions, and the final evaluation result is the mean of these p runs of k-fold cross-validation; a common choice is "10 times 10-fold cross-validation", which performs 100 rounds of training and testing in total.

Suppose the dataset D contains m samples. If we let k = m, we obtain a special case of cross-validation: the leave-one-out method (LOO). Its evaluation results are often considered fairly accurate, but when the dataset is large, the computational cost of training m models can be enormous (for example, a dataset of one million samples requires training one million models), and this is before any algorithm tuning is taken into account. Moreover, leave-one-out estimates are not necessarily more accurate than those of other evaluation methods; the "No Free Lunch" theorem applies to experimental evaluation as well.
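
For reference, here is a minimal sketch of 10-fold cross-validation, "10 times 10-fold" cross-validation, and leave-one-out with scikit-learn; the synthetic dataset and the ridge classifier are illustrative assumptions.

# Sketch: 10-fold CV, "10 times 10-fold" CV, and leave-one-out (k = m),
# all estimating the accuracy of one fixed learning method.
# The synthetic dataset and RidgeClassifier are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import (KFold, LeaveOneOut, RepeatedKFold,
                                     cross_val_score)

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = RidgeClassifier()

ten_fold = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
ten_by_ten = cross_val_score(clf, X, y, cv=RepeatedKFold(n_splits=10, n_repeats=10, random_state=0))
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())           # trains m = 200 models

print("10-fold mean accuracy:      ", ten_fold.mean())
print("10x10-fold mean accuracy:   ", ten_by_ten.mean())     # mean over 100 runs
print("leave-one-out mean accuracy:", loo.mean())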

References:

  1. Zhou Zhihua, Machine Learning (《机器学习》)
  2. Li Hang, Statistical Learning Methods (《统计学习方法》)
  3. How to choose a predictive model after k-fold cross-validation
  4. whats is the difference between train, validation and test set, in neural networks?
  5. What is the difference between test set and validation set?
  6. Why only three partitions? (training, validation, test)
  7. Coursera open-course notes: Stanford Machine Learning, Lecture 10, "Advice for Applying Machine Learning"
  8. Wikipedia: Validation set
  9. Wikipedia: Early stopping

The content of blog post 2 follows:
Original address: http://www.cnblogs.com/xfzhang/archive/2013/05/24/3096412.html

[Overview] Training Set (train set), Validation Set (validation set), Test Set (test set)

In supervised machine learning, the dataset is usually split into two or three parts: a training set, a validation set, and a test set.

http://blog.sina.com.cn/s/blog_4d2f6cf201000cjx.html

In general the samples need to be divided into three independent parts: a training set, a validation set, and a test set. The training set is used to fit the model, the validation set is used to determine the network structure or the parameters that control model complexity, and the test set is used to check how well the finally chosen, best model performs. A typical split puts 50% of the samples in the training set and 25% in each of the other two, with all three parts drawn at random from the sample.
When samples are scarce, this split is no longer appropriate. A common practice is to hold out a small portion as the test set and apply K-fold cross-validation to the remaining N samples: shuffle the samples, divide them evenly into K parts, train on K-1 parts in turn while validating on the remaining part, compute the sum of squared prediction errors each time, and finally average the K results as the basis for choosing the best model structure. In the special case K = N this becomes the leave-one-out method. A sketch of the procedure follows.
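
Here is a minimal sketch of that procedure; the small synthetic regression problem and the candidate ridge penalties standing in for "model structures" are illustrative assumptions.

# Sketch: hold out a small test set, then use K-fold CV on the remaining N
# samples, averaging the per-fold sum of squared prediction errors to pick
# a model structure. The data and candidate alphas are illustrative.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=120)

# Keep a small test set aside; cross-validate on the remaining N samples.
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=20, random_state=0)

def cv_sse(alpha, K=5):
    """Average over K folds of the sum of squared prediction errors."""
    sses = []
    for tr, va in KFold(n_splits=K, shuffle=True, random_state=0).split(X_cv):
        model = Ridge(alpha=alpha).fit(X_cv[tr], y_cv[tr])
        sses.append(np.sum((model.predict(X_cv[va]) - y_cv[va]) ** 2))
    return np.mean(sses)

alphas = [0.01, 0.1, 1.0, 10.0]                  # candidate model complexities
best_alpha = min(alphas, key=cv_sse)             # choose the best structure
# (setting K = len(X_cv) above would give the leave-one-out special case)

final = Ridge(alpha=best_alpha).fit(X_cv, y_cv)  # refit on all non-test data
print("best alpha:", best_alpha,
      "held-out test SSE:", np.sum((final.predict(X_test) - y_test) ** 2))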

http://www.cppblog.com/guijie/archive/2008/07/29/57407.html

These three terms appear constantly in machine-learning papers, yet many people are not entirely clear about what they mean, and the latter two in particular are often used interchangeably. Ripley, B.D. (1996) gives definitions of all three in his classic monograph Pattern Recognition and Neural Networks.
Training set: A set of examples used for learning, which is to fit the parameters [i.e., weights] of the classifier.
Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.
Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.
Clearly, the training set is used to train the model and determine its parameters, such as the weights in an ANN; the validation set is used for model selection, i.e., the final tuning and determination of the model, such as the structure of an ANN; and the test set is used purely to test the generalization ability of the trained model. Of course, the test set cannot guarantee that the model is correct; it only says that similar data will give similar results with this model. In practice, however, datasets are usually split into just two parts, a training set and a test set, and most papers do not involve a validation set.
Ripley also discusses why separate test and validation sets are needed:
1. The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model.
2. After assessing the final model with the test set, YOU MUST NOT tune the model any further.
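
Point 1 can be seen in a small, entirely illustrative simulation: when many equally useless models are compared on a validation set, the winner's validation accuracy overstates its accuracy on a fresh test set.

# Illustrative simulation of point 1: selecting the best of many models by
# validation accuracy makes that validation accuracy an optimistic (biased)
# estimate; a separate test set reveals the true (chance-level) accuracy.
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_models, d = 100, 100, 50, 10

# Features and labels are independent noise: no model can beat 50% accuracy.
X_val, y_val = rng.normal(size=(n_val, d)), rng.integers(0, 2, n_val)
X_test, y_test = rng.normal(size=(n_test, d)), rng.integers(0, 2, n_test)

# 50 random linear classifiers; select the one best on the validation set.
W = rng.normal(size=(n_models, d))
val_acc = ((X_val @ W.T > 0) == y_val[:, None]).mean(axis=0)
best = int(np.argmax(val_acc))
test_acc = ((X_test @ W[best] > 0) == y_test).mean()

print("validation accuracy of selected model:", val_acc[best])  # optimistic, ~0.6
print("test accuracy of the same model:      ", test_acc)       # ~0.5 (chance)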

http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set

Step 1) Training: Each type of algorithm has its own parameter options (the number of layers in a Neural Network, the number of trees in a Random Forest, etc). For each of your algorithms, you must pick one option. That’s why you have a validation set.

Step 2) Validating: You now have a collection of algorithms. You must pick one algorithm. That’s why you have a test set. Most people pick the algorithm that performs best on the validation set (and that's ok). But, if you do not measure your top-performing algorithm’s error rate on the test set, and just go with its error rate on the validation set, then you have blindly mistaken the “best possible scenario” for the “most likely scenario.” That's a recipe for disaster.

Step 3) Testing: I suppose that if your algorithms did not have any parameters then you would not need a third step. In that case, your validation step would be your test step. Perhaps Matlab does not ask you for parameters or you have chosen not to use them and that is the source of your confusion.
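
A minimal sketch of these three steps; the two algorithm families, their parameter options, and the synthetic data are all illustrative assumptions. Each family is tuned on the validation set, the best tuned model overall is picked on the validation set, and only then is the test set consulted once.

# Sketch of Steps 1-3: tune each algorithm's options on the validation set,
# pick the best-performing tuned algorithm, then report its test-set accuracy.
# Algorithm families, parameter options, and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Step 1 (training): each family has its own parameter options, fit on the training set.
candidates = [MLPClassifier(hidden_layer_sizes=(h,), max_iter=2000, random_state=0)
              for h in (8, 32, 128)]
candidates += [RandomForestClassifier(n_estimators=n, random_state=0)
               for n in (50, 200)]
for model in candidates:
    model.fit(X_tr, y_tr)

# Step 2 (validating): pick the single model that does best on the validation set.
best = max(candidates, key=lambda m: m.score(X_val, y_val))

# Step 3 (testing): measure the chosen model's performance on the untouched test set.
print(type(best).__name__, "test accuracy:", best.score(X_test, y_test))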

My idea is that those options in the neural network toolbox are there to avoid overfitting. Without them, the weights would be fitted to the training data only and would not capture the global trend. With a validation set, the iterations can adapt: as long as decreases in the training-data error are accompanied by decreases in the validation-data error, training continues; an increase in the validation-data error together with a decreasing training-data error demonstrates the overfitting phenomenon.

http://blog.sciencenet.cn/blog-397960-666113.html

http://stackoverflow.com/questions/2976452/whats-is-the-difference-between-train-validation-and-test-set-in-neural-networ

for each epoch
    for each training data instance
        propagate error through the network
        adjust the weights
        calculate the accuracy over training data
    for each validation data instance
        calculate the accuracy over the validation data
    if the threshold validation accuracy is met
        exit training
    else
        continue training

Once you're finished training, then you run against your testing set and verify that the accuracy is sufficient.

Training Set: this data set is used to adjust the weights on the neural network.

Validation Set: this data set is used to minimize overfitting. You're not adjusting the weights of the network with this data set; you're just verifying that any increase in accuracy over the training data set actually yields an increase in accuracy over a data set the network has not been shown before, or at least has not trained on (i.e., the validation data set). If the accuracy over the training data set increases but the accuracy over the validation data set stays the same or decreases, then you're overfitting your neural network and you should stop training.

Testing Set: this data set is used only for testing the final solution in order to confirm the actual predictive power of the network.

The validation set is used in the process of training; the test set is not. The test set allows you
1) to see whether the training set was sufficient, and
2) to see whether the validation set did its job of preventing overfitting.
If you use the test set in the process of training, it becomes just another validation set and no longer shows what happens when new data are fed into the network.

 

 

Training set: A set of examples used for learning, that is to fit the parameters [i.e., weights] of the classifier.

Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.

Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.

The error surface will be different for different sets of data from your data set (batch learning). Therefore, if you find a very good local minimum for your test set data, that may not be a very good point, and may even be a very bad point, in the surface generated by some other set of data for the same problem. Therefore you need to build a model which not only finds a good weight configuration for the training set but is also able to predict new data (data not in the training set) with low error. In other words, the network should be able to generalize from the examples, so that it learns the data and does not simply memorize the training set by overfitting it.

The validation data set is a set of data for the function you want to learn which you are not using directly to train the network. You train the network with a set of data that you call the training data set. If you are using a gradient-based algorithm to train the network, then the error surface and the gradient at any point depend entirely on the training data set, so the training data set is used directly to adjust the weights. To make sure you don't overfit the network, you need to feed the validation dataset to the network and check that its error stays within some range. Because the validation set is not being used directly to adjust the weights of the network, a good error on the validation set, and also on the test set, indicates that the network predicts well for the training set examples and is also expected to perform well when new examples, which were not used in the training process, are presented to it.

Early stopping is a way of deciding when to stop training. There are different variations, but the main outline is this: the errors on both the training set and the validation set are monitored; the training error decreases at each iteration (backprop and brothers), and at first the validation error decreases too. Training is stopped the moment the validation error starts to rise. The weight configuration at that point gives a model that predicts the training data well, as well as data the network has not seen. But because the validation data has indirectly affected the choice of the weight configuration, this is where the test set comes in. That set of data is never used in the training process. Once a model has been selected based on the validation set, the test set is run through the model and its error is computed. This error is representative of the error we can expect from completely new data for the same problem.
