A Few Small Questions on Model Evaluation and Selection in Machine Learning

License: CC 4.0 BY-SA

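This article discusses the differences and connections among the training set, validation set, and test set, and explains the role each plays in the machine learning workflow, including preventing overfitting, selecting a model, and assessing a model's generalization ability. It also introduces cross-validation and how it is used in model evaluation.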

Part 1 Training set, Validation set, and Testing set

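When training on data, how do validation and testing differ, and what exactly is validation for?
Suppose there are 100 samples, and these 100 samples have to serve for both training and testing. One option is to evaluate the model with the hold-out method.
The hold-out method works as follows: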

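Split the dataset $D=\{ (x_1,y_1),(x_2,y_2),...,(x_{100},y_{100})\}$ directly into two mutually exclusive sets, one used as the training set $S$ and the other as the test set $T$, so that $D=S\cup T$ and $S\cap T = \varnothing$. After training a model on $S$, use $T$ to measure its test error, which serves as an estimate of the generalization error.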
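
The hold-out split described above can be sketched in a few lines. This is a minimal illustration, assuming plain Python lists and a shuffled index split; `holdout_split`, its parameters, and the toy data are hypothetical names chosen for this example.

```python
import random

def holdout_split(dataset, test_fraction=0.2, seed=0):
    """Split dataset into two disjoint lists: S (training) and T (test)."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)          # shuffle before cutting
    n_test = int(round(len(dataset) * test_fraction))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    S = [dataset[i] for i in train_idx]
    T = [dataset[i] for i in test_idx]
    return S, T

# Toy dataset D of 100 (x, y) pairs; the values are made up for illustration.
D = [(i, 2 * i + 1) for i in range(100)]
S, T = holdout_split(D, test_fraction=0.2)
print(len(S), len(T))  # 80 20
```

Because every index goes to exactly one of the two lists, $S\cup T = D$ and $S\cap T = \varnothing$ hold by construction.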

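For example, take a remote sensing image and select 100 sample regions, 80 of them used as training samples and the remaining 20 as test samples. Done this way, however, the resulting model and its reported accuracy may not be accepted! A few days ago a classmate submitted a paper and the editor replied that there was "no Validation Set"!
Yet in Dr. Li Hang's 《统计学习方法》 (Statistical Learning Methods), on page 14, I found this passage: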

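If enough samples are available, a simple way to do model selection is to randomly split the dataset into three parts: a training set, a validation set, and a testing set. The training set is used to train the models, the validation set is used for model selection, and the testing set is used for the final evaluation of the learning method. Among the learned models of differing complexity, choose the one with the smallest prediction error on the validation set. Since the validation set contains enough data, using it for model selection is also effective.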

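Li Hang's point is that splitting the samples into three parts is reasonable when enough samples are available. But my classmate's paper actually had only 80 samples. Is splitting them further into training, validation, and testing sets still reasonable? In the end, my classmate validated the model's accuracy with the cross validation described in Part 2 below rather than splitting the dataset into three parts.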

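Setting aside for now whether the editor's objection was reasonable, what I want to pin down is how the validation set is actually used during training and how it takes effect.
I am a little embarrassed: I have been doing machine learning for quite a while and still was not clear on this, which really means I have not practiced enough. So I want to sort it out here.
Both the training set and the validation set are used during the training process. The training workflow is: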

for each epoch
    for each training data instance
        propagate error through the network
        adjust the weights
        calculate the accuracy over training data
    for each validation data instance
        calculate the accuracy over the validation data
    if the threshold validation accuracy is met
        exit training
    else
        continue training
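
The workflow above can be made concrete with a small runnable sketch. It rests on several assumptions that are not in the original: a one-variable linear model trained by per-sample gradient steps stands in for the neural network, mean squared error replaces accuracy, and a "patience" rule (stop after several epochs without improvement) replaces the fixed accuracy threshold. All names are hypothetical.

```python
import random

def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train_with_early_stopping(train, val, lr=0.01, max_epochs=200, patience=5):
    w, b = 0.0, 0.0
    best = (float("inf"), w, b)      # (validation error, w, b) of best epoch
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        # Training set: the only data that ever changes the weights.
        for x, y in train:
            err = w * x + b - y
            w -= lr * err * x
            b -= lr * err
        # Validation set: evaluated only, never used for a weight update.
        val_err = mse(w, b, val)
        if val_err < best[0]:
            best = (val_err, w, b)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                # validation error stopped improving
    return best

# Made-up data: y = 3x + 1 plus Gaussian noise.
rng = random.Random(0)
data = [(i / 50.0, 3.0 * (i / 50.0) + 1.0 + rng.gauss(0, 0.1)) for i in range(100)]
rng.shuffle(data)
train, val = data[:80], data[80:]
best_val_err, w, b = train_with_early_stopping(train, val)
print(best_val_err, w, b)
```

The function returns the weights from the epoch with the lowest validation error, not the weights from the last epoch, which is what makes the validation set act as a brake on overfitting.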

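From the workflow above:
1. The training set is used to adjust the weights of the neural network during training.
2. The validation set is not used to adjust the weights. It is used to guard against overfitting. If accuracy on the training set keeps rising as training proceeds while accuracy on the validation set stays flat or even falls, the model has started to overfit and training should stop. In other words, the model should strike a trade-off between the training set and the validation set.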

To make sure you don't overfit the network you need to input the validation dataset to the network and check if the error is within some range. Because the validation set is not being used directly to adjust the weights of the network, a good error for the validation and also the test set indicates that the network predicts well for the train set examples, and it is also expected to perform well when new examples that were not used in the training process are presented to the network.

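3. The testing set is used only to test the final model, to see how good it really is, that is, to assess its generalization ability. Because the testing set played no part in training or model selection, the accuracy the model achieves on it is representative: when the model is later run on new data drawn from the same distribution, its accuracy should stay close to this figure.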

Here is Wikipedia's description of validation-based early stopping as a supplement:

These early stopping rules work by splitting the original training set into a new training set and a validation set. The error on the validation set is used as a proxy for the generalization error in determining when overfitting has begun. These methods are most commonly employed in the training of neural networks.

Prechelt gives the following summary of a naive implementation of holdout-based early stopping as follows:
1. Split the training data into a training set and a validation set, e.g. in a 2-to-1 proportion.
2. Train only on the training set and evaluate the per-example error on the validation set once in a while, e.g. after every fifth epoch.
3. Stop training as soon as the error on the validation set is higher than it was the last time it was checked.
4. Use the weights the network had in that previous step as the result of the training run.
— Lutz Prechelt, Early Stopping – But When?

More sophisticated forms use cross-validation – multiple partitions of the data into training set and validation set – instead of a single partition into a training set and validation set. Even this simple procedure is complicated in practice by the fact that the validation error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad-hoc rules for deciding when overfitting has truly begun.
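
The naive rule summarized above (stop at the first check where the validation error rises, keep the previous checkpoint) can be sketched as a function over a list of recorded validation errors. The function name and the error values are made up for illustration.

```python
def naive_early_stopping(val_errors):
    """Return the index of the checkpoint kept by the naive rule.

    val_errors[i] is the validation error measured at the i-th check.
    Training stops at the first check whose error is higher than the
    previous one, and the previous checkpoint is kept. If the error
    never rises, the last checkpoint is kept.
    """
    if not val_errors:
        raise ValueError("need at least one validation measurement")
    for i in range(1, len(val_errors)):
        if val_errors[i] > val_errors[i - 1]:
            return i - 1
    return len(val_errors) - 1

errors = [0.90, 0.55, 0.41, 0.43, 0.38, 0.45]  # made-up values
print(naive_early_stopping(errors))  # 2
```

In this made-up sequence the rule keeps checkpoint 2 (error 0.41) even though checkpoint 4 (error 0.38) would have been better. This is the fluctuation problem the quoted passage mentions, and it is why practical rules usually tolerate a few non-improving checks before stopping.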

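One more point: the validation set does more than prevent overfitting during training by balancing training accuracy against validation accuracy. It is also used to "compare their performances and decide which one to take". I found this sentence in Wikipedia's entry on the validation set: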

Validation set: A set of examples used to tune the parameters (i.e., architecture, not weights) of a classifier, for example to choose the number of hidden units in a neural network.

The basic process of using a validation set for model selection (as part of training set, validation set, and test set) is:
Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set.

As an example of the validation set's role in model selection, Andrew Ng covers this in his Coursera course. A summary:

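In practice, we learn the parameters on the training set, compute the error on the cross-validation set, pick the model with the smallest error on the validation set, and finally estimate that model's generalization error on the test set.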
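
The train / validate / test procedure summarized above can be sketched as follows. As an assumption for illustration, the candidate models are one-dimensional k-nearest-neighbour regressors that differ only in the hyperparameter `k`, which plays the same role as "the number of hidden units" in the quoted definition. The data, the split sizes, and all function names are hypothetical.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def knn_mse(train, data, k):
    """Mean squared error on `data` of a k-NN model fitted on `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

# Made-up data: y = sin(x) plus Gaussian noise.
rng = random.Random(0)
points = []
for _ in range(300):
    x = rng.uniform(0, 6)
    points.append((x, math.sin(x) + rng.gauss(0, 0.3)))

# Three disjoint parts: training, validation, test.
train, val, test = points[:180], points[180:240], points[240:]

# Step 1: fit each candidate on the training set, score it on the validation set.
candidates = [1, 5, 15, 60]
val_errors = {k: knn_mse(train, val, k) for k in candidates}

# Step 2: select the candidate with the smallest validation error.
best_k = min(val_errors, key=val_errors.get)

# Step 3: touch the test set once, only for the selected candidate.
test_error = knn_mse(train, test, best_k)
print(best_k, val_errors[best_k], test_error)
```

The test set is consulted exactly once, after the choice is made, so `test_error` is not biased by the selection step the way `val_errors[best_k]` is.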

Part 2 Cross-validation

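With cross-validation it is easy to fall into a mistaken line of questioning. (This came from the classmate mentioned above, who had too few samples to split three ways and so chose cross-validation. I did not see the problem at first either.)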

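"After cross-validation, how do you settle on the final model?" For example, in 10-fold cross-validation each fold yields a model, so 10 folds give 10 models. How do you choose the final one?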

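Thinking this way already has things backwards. The purpose of k-fold cross validation is not to pick one of the k fitted models. You start with a model and use cross validation to assess its accuracy.
I found an excellent explanation on StackExchange, "How to choose a predictive model after k-fold cross-validation" (what follows is my rough paraphrase, so the original is the better read):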

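When we say "a model", we usually mean a particular method for describing how some input data relates to what we are trying to predict. We generally do not refer to particular instances of that method as different models. So you might say "I have a linear regression model", but you would not call two neural networks of the same kind, trained on two different datasets, two different models. At least in the context of model selection, this is how "model" is defined.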

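So when you do k-fold cross validation, you are testing how well your model can be trained on some data and then predict data it has not seen. We use cross validation because if we trained on all the data we have, none would be left for testing.
You could do this once, say by using 80% of the data for training and the remaining 20% for testing. But what if the 20% you happened to pick contains a bunch of points that are particularly easy, or particularly hard, to predict? Then the resulting estimate of how well the model learns and predicts would not be reliable.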

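So we want to use all of the data. Continuing the 80/20 example, 5-fold cross validation trains on 80% of the data 5 times and tests on the other 20% each time, making sure every data point lands in the 20% test set exactly once. Every data point is therefore used to assess how well the model performs.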
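
A minimal sketch of the 5-fold split just described, working on sample indices only. `k_fold_indices` is a hypothetical helper written for this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]     # k disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(100, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))  # 80 20 on each of the 5 rounds
```

Concatenating the five test folds gives back all 100 indices with no repeats, which is what "every data point is used to assess the model" means here.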

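But the purpose of cross validation is not to build the final model. We do not use these 5 trained instances to make real predictions. For that we want to train on all of the data so as to get the best model possible. The purpose of cross validation is model checking, not model building.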

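Now suppose we have two models, a linear regression model and a neural network model. How can we tell which is better? We can run k-fold cross-validation and see which one performs better on the test folds. But once cross-validation has picked the better-performing model, we train that model (whether it is the linear regression or the neural network) on all of the data. To repeat, we do not use the specific model instances trained during cross-validation as the final predictive model.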
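
The "compare by cross-validation, then refit on everything" procedure can be sketched as below. To keep the example dependency-free, two simple stand-ins replace the models named in the text: a constant (mean) predictor and a least-squares line. That substitution, the data, and all names are assumptions made for illustration.

```python
import random

def fit_mean(data):
    """Model family 1: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def fit_line(data):
    """Model family 2: least-squares line y = slope * x + intercept."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def cv_error(fit, data, k=5):
    """Mean test-fold MSE of a model family under k-fold cross-validation."""
    folds = [data[i::k] for i in range(k)]
    fold_errors = []
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = fit(train)           # temporary instance, discarded after scoring
        fold_errors.append(sum((model(x) - y) ** 2 for x, y in test) / len(test))
    return sum(fold_errors) / k

# Made-up data: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(0, 5)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)))

families = {"mean": fit_mean, "line": fit_line}
scores = {name: cv_error(fit, data) for name, fit in families.items()}
best_name = min(scores, key=scores.get)

# The final model is the winning family refit on ALL of the data.
final_model = families[best_name](data)
print(best_name, scores)
```

Cross-validation only ranks the two families. None of the ten instances fitted inside `cv_error` survives, and `final_model` is trained once on the full dataset.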

Professor Zhou Zhihua's 《机器学习》 (Machine Learning) explains cross validation as follows:

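Cross validation first partitions the dataset $D$ into $k$ mutually exclusive subsets of similar size, i.e. $D=D_1\cup D_2 \cup ... \cup D_k$, $D_i \cap D_j = \varnothing$ $(i \neq j)$. Each subset $D_i$ preserves the data distribution as far as possible, that is, it is obtained from $D$ by stratified sampling. Then, each round uses the union of $k-1$ subsets as the training set and the remaining subset as the test set. This yields $k$ training/test pairs, so $k$ rounds of training and testing can be carried out, and the final result returned is the mean of the $k$ test results. Clearly, the stability and fidelity of the cross validation estimate depend to a large extent on the value of $k$. To emphasize this, cross validation is usually called "$k$-fold cross validation". The most common value of $k$ is 10, giving 10-fold cross validation.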
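
The stratified partition mentioned above (each $D_i$ keeps roughly the class proportions of $D$) can be sketched like this. It is one simple way to do it, dealing the samples of each class round-robin into the folds. The helper name and the 70/30 label mix are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_k_fold(labels, k, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label, idxs in by_class.items():
        rng.shuffle(idxs)
        # Deal this class's samples round-robin so each fold gets its share.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

labels = ["a"] * 70 + ["b"] * 30   # made-up 70/30 class mix
folds = stratified_k_fold(labels, k=10)
for fold in folds:
    print(Counter(labels[i] for i in fold))  # 7 of "a" and 3 of "b" per fold
```

When a class size is not a multiple of `k`, this simple dealing scheme gives the first few folds one extra sample of that class, so the proportions are preserved only approximately.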

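There are many ways to partition $D$ into $k$ subsets. To reduce the variation introduced by different partitions, $k$-fold cross validation is usually repeated $p$ times with different random partitions, and the final estimate is the mean of the $p$ runs of $k$-fold cross validation. A common example is "10 times 10-fold cross validation", which performs 100 rounds of training and testing.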

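Suppose the dataset $D$ contains $m$ samples. Setting $k=m$ gives a special case of cross validation: Leave-One-Out (LOO). Clearly, LOO is unaffected by how the samples are randomly partitioned, because there is only one way to split $m$ samples into $m$ subsets, each containing a single sample. The training set used in LOO has only one sample fewer than the original dataset, so in the vast majority of cases the model actually evaluated is very similar to the model one wants to evaluate, the one trained on $D$. For this reason LOO estimates are often considered relatively accurate. LOO has drawbacks, though. When the dataset is large, the computational cost of training $m$ models can be prohibitive (a dataset of 1 million samples requires training 1 million models), and that is before any hyperparameter tuning. Moreover, LOO estimates are not guaranteed to be more accurate than those of other evaluation methods. The "no free lunch" theorem applies to experimental evaluation as well.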
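
Leave-one-out is simple enough to sketch directly on sample indices. `leave_one_out` is a hypothetical helper, and a tiny `m = 5` is used so the splits can be read at a glance.

```python
def leave_one_out(n_samples):
    """Yield (train_idx, test_idx) with exactly one held-out sample per round."""
    for i in range(n_samples):
        train_idx = [j for j in range(n_samples) if j != i]
        yield train_idx, [i]

splits = list(leave_one_out(5))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

No shuffling or seed is involved, which reflects the point that the partition is unique. The number of rounds equals the number of samples, which is where the cost on large datasets comes from.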