Question: The sample used for training (or testing) might not be representative.
If, by bad luck, all examples with a certain class were omitted from
the training set, you could hardly expect a classifier learned from that
data to perform well on examples of that class—and the situation would
be exacerbated by the fact that the class would necessarily be
overrepresented in the test set because none of its instances made it into
the training set! (Because the training set contains no examples of some
class, the learned classifier performs poorly on it; worse still, the test
set then holds a large share of exactly the examples the training set lacks.)
Solution 1: stratification
The random sampling is done in a way that guarantees that each class is
properly represented in both training and test sets.
(That is, to ensure that both the training and the test portions of the
split are representative, the random sampling is constrained so that
within each split, every class appears in roughly the same proportion as
in the overall dataset; this is stratification.)
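As a minimal sketch (pure Python, with a made-up label list; not from the book), a stratified holdout split can shuffle the indices of each class separately and then move the same fraction of each class into the test set:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) such that each class appears in the
    test set in roughly the same proportion as in the full dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # random sampling within the class
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])              # same fraction from every class
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# A 2:1 class imbalance is preserved in both halves:
labels = ["a"] * 20 + ["b"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
```

A plain (unstratified) random split could, by bad luck, put all ten "b" examples into one side; sampling per class rules that out.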
Solution 2: cross-validation (CV)
CV is a statistical method for evaluating a classifier's performance. The
basic idea is to partition the original dataset into groups: one part
serves as the training set and another as the validation (test) set. The
classifier is first trained on the training set, and the resulting model
is then evaluated on the validation set; that evaluation serves as the
performance measure of the classifier.
Cross-validation should generally satisfy two conditions: 1) the training
set should be large enough, usually more than half the data; 2) the
training and test sets should be sampled uniformly from the data.
Common CV methods are as follows:
(1) Double cross-validation (2-CV)
The dataset is split into two equal-sized subsets and the classifier is
trained in two rounds. In the first round, one subset serves as the
training set and the other as the test set; in the second round, the
training set and test set are swapped and the classifier is trained again.
What we mainly care about is the accuracy on the two test sets. In
practice, however, 2-CV is seldom used: the main reason is that the
training set then contains too few samples to represent the distribution
of the whole population, which makes the test-stage accuracy unreliable.
It is better to use more than half the data for training even at the expense of test data.
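The two-round procedure can be sketched as follows, assuming a hypothetical `evaluate(train_idx, test_idx)` callback (not from the book) that trains on one index set and returns accuracy on the other:

```python
import random

def double_cv(n_examples, evaluate, seed=0):
    """2-CV sketch: split the data in half, evaluate, swap the halves,
    evaluate again, and average the two test accuracies."""
    idxs = list(range(n_examples))
    random.Random(seed).shuffle(idxs)
    half_a, half_b = idxs[: n_examples // 2], idxs[n_examples // 2 :]
    acc1 = evaluate(half_a, half_b)   # round 1: train on A, test on B
    acc2 = evaluate(half_b, half_a)   # round 2: roles swapped
    return (acc1 + acc2) / 2
```

With only half the data available for training in each round, both models tend to underfit, which is exactly the weakness noted above.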
(2)Threefold cross-validation
In cross-validation, you decide on a fixed number of folds, or partitions,
of the data. Suppose we use three. Then the data is split into three
approximately equal partitions; each in turn is used for testing and the
remainder is used for training. That is, use two-thirds of the data for
training and one-third for testing, and repeat the procedure three times
so that in the end, every instance has been used exactly once for testing.
This is called threefold cross-validation,
and if stratification is adopted as well—which it often is—
it is stratified threefold cross-validation.
(3) K-fold cross-validation
The dataset is split into k equal-sized subsets; each subset in turn
serves as the test set while the remaining samples form the training set,
so one run of k-CV builds k models in total. When there are enough
samples, k = 10 is generally considered quite sufficient.
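A minimal sketch of the fold bookkeeping (pure Python; the model-building step is left out), assuming indices 0..n-1 identify the examples:

```python
import random

def kfold_indices(n_examples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs: the data is shuffled once and
    partitioned into k roughly equal folds; each fold serves exactly
    once as the test set, the other k-1 folds as the training set."""
    idxs = list(range(n_examples))
    random.Random(seed).shuffle(idxs)
    folds = [idxs[i::k] for i in range(k)]     # k roughly equal slices
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

For the stratified variant, the same slicing would simply be applied within each class separately before the folds are assembled.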
Why 10? Extensive tests on numerous different datasets, with different
learning techniques, have shown that 10 is about the right number of folds
to get the best estimate of error, and there is also some theoretical evidence
that backs this up.
The data is divided randomly into 10 parts in which the class is
represented in approximately the same proportions as in the full dataset.
Each part is held out in turn and the learning scheme trained on the
remaining nine-tenths; then its error rate is calculated on the holdout
set. Thus, the learning procedure is executed a total of 10 times on
different training sets (each set has a lot in common with the others).
Finally, the 10 error estimates are averaged to yield an overall error
estimate.
When seeking an accurate error estimate, it is standard procedure to
repeat the cross-validation process 10 times—that is, 10 times tenfold
cross-validation—and average the results.
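That repetition is just an outer loop: run the whole cross-validation with a different shuffle each time and average the per-run estimates. A sketch, assuming a hypothetical `run_cv(seed)` that performs one complete tenfold cross-validation and returns its mean error rate:

```python
def repeated_cv(run_cv, repeats=10):
    """10 x 10-fold CV sketch: average the error estimates of `repeats`
    independent cross-validation runs, each seeded differently."""
    errors = [run_cv(seed) for seed in range(repeats)]
    return sum(errors) / len(errors)
```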
(In summary, 10-fold cross-validation: split the dataset into ten parts;
in turn use nine of them as training data and one as test data, and run
the experiment. Each run yields an accuracy (or error rate), and the
average over the 10 runs is taken as the estimate of the algorithm's
accuracy. It is usually necessary to perform several rounds of 10-fold
cross-validation (for example, 10 times 10-fold cross-validation) and
take the mean as the estimate of the algorithm's accuracy. The dataset is
split into 10 parts because extensive experiments on many datasets, with
different learning techniques, have shown that 10 folds is about the
right choice for the best error estimate, and there is also some
theoretical evidence to back this up. This is not the final verdict,
however; debate continues, and 5 folds or 20 folds seem to give results
much the same as 10 folds.)
References: Data Mining: Practical Machine Learning Tools and Techniques
http://blog.sina.com.cn/wjw84221
http://blog.csdn.net/daringpig/article/details/8053681
http://blog.csdn.net/xywlpo/article/details/6531128