5.3 CROSS-VALIDATION

   Question: the sample used for training (or testing) might not be representative.

If, by bad luck, all examples with a certain class were omitted from the training set, you could hardly expect a classifier learned from that data to perform well on examples of that class, and the situation would be exacerbated by the fact that the class would necessarily be overrepresented in the test set because none of its instances made it into the training set! (In other words, a class absent from the training set is not only impossible to learn; its instances also make up a disproportionately large share of the test set.)

    Solution 1: stratification [ˌstrætɪfɪˈkeɪʃn]

    The random sampling is done in a way that guarantees that each class is properly represented in both training and test sets. (To make sure the data used for training and for testing are both representative, the random split is drawn so that within each partition every class appears in the same proportion as in the overall data; this is stratification.)
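
    As a quick illustration, here is a minimal sketch of a stratified holdout split (assuming scikit-learn and its bundled iris data; all names here are purely illustrative), which checks that the class proportions are preserved:

        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split

        X, y = load_iris(return_X_y=True)

        # stratify=y keeps each class in (almost) the same proportion
        # in the training and test sets as in the full dataset
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=0)

        print(np.bincount(y) / len(y))        # class proportions, full data
        print(np.bincount(y_tr) / len(y_tr))  # ... in the training set
        print(np.bincount(y_te) / len(y_te))  # ... in the test set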

    Solution 2: cross-validation

    Cross-validation (CV) is a statistical method for assessing classifier performance. The basic idea is to partition the original dataset: one part serves as the training set and another as the validation/test set. The classifier is first trained on the training set, and the resulting model is then evaluated on the validation set; that result serves as the performance measure. Cross-validation should generally satisfy two conditions: 1) the training set is large enough, usually more than half the data; 2) the training and test sets are sampled uniformly from the data. A sketch of this protocol follows.
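
    The basic train-then-evaluate protocol might look like this minimal sketch (again assuming scikit-learn; the decision tree is just an arbitrary placeholder classifier):

        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)

        # more than half the data (here 70%) goes to training, sampled uniformly
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=0)

        model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        print(model.score(X_te, y_te))  # accuracy on the held-out test set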

    Common CV methods include the following:

    (1) Double cross-validation (2-CV)

    Split the dataset into two equal-sized subsets and run two rounds of classifier training. In the first round, one subset serves as the training set and the other as the test set; in the second round the two are swapped and the classifier is trained again. What we mainly care about is the recognition rate on the two test sets. In practice, however, 2-CV is not commonly used: the training set then has too few samples, usually too few to represent the distribution of the population, so the recognition rate in the test stage tends to drop noticeably. It is better to use more than half the data for training, even at the expense of test data.
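
    A minimal 2-CV sketch along these lines (assumed setup as above: scikit-learn, iris data, an arbitrary classifier):

        from sklearn.datasets import load_iris
        from sklearn.model_selection import KFold
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)
        halves = KFold(n_splits=2, shuffle=True, random_state=0)

        scores = []
        for train_idx, test_idx in halves.split(X):
            clf = DecisionTreeClassifier(random_state=0)
            clf.fit(X[train_idx], y[train_idx])  # train on one half ...
            scores.append(clf.score(X[test_idx], y[test_idx]))  # ... test on the other
        print(scores)  # the two test-set recognition rates

    Each half is used once for testing while the other trains the model, which is exactly the two-round swap described above.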

    (2) Threefold cross-validation

    In cross-validation, you decide on a fixed number of folds, or partitions, of the data. Suppose we use three. Then the data is split into three approximately equal partitions; each in turn is used for testing and the remainder is used for training. That is, use two-thirds of the data for training and one-third for testing, and repeat the procedure three times so that in the end, every instance has been used exactly once for testing. This is called threefold cross-validation, and if stratification is adopted as well (which it often is) it is stratified threefold cross-validation.
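
    Stratified threefold cross-validation can be sketched as follows (StratifiedKFold performs the per-fold stratification; the setup is the same assumed one as above):

        from sklearn.datasets import load_iris
        from sklearn.model_selection import StratifiedKFold, cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)
        cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

        # each third is held out once; the other two thirds train the model
        scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
        print(scores, scores.mean())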

    (3) K-fold cross-validation

    Split the dataset into k equal-sized subsets; each subset in turn serves as the test set, with the remaining samples as the training set. A single run of k-CV therefore builds k models and averages the recognition rates over the k test sets. In practice, k must be large enough that each round's training set contains enough samples; generally k = 10 is quite sufficient.
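
    The same sketch generalizes to any k. With scikit-learn, for instance, passing an integer cv to cross_val_score runs k-fold directly (stratified by default when the estimator is a classifier):

        from sklearn.datasets import load_iris
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)

        # k = 10: builds 10 models, each tested on a different held-out tenth
        scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
        print(scores.mean())  # average recognition rate over the 10 test sets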

   Why 10? Extensive tests on numerous different datasets, with different learning techniques, have shown that 10 is about the right number of folds to get the best estimate of error, and there is also some theoretical evidence that backs this up.

    The data is divided randomly into 10 parts in which the class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme is trained on the remaining nine-tenths; then its error rate is calculated on the holdout set. Thus, the learning procedure is executed a total of 10 times on different training sets (each set has a lot in common with the others). Finally, the 10 error estimates are averaged to yield an overall error estimate.

     When seeking an accurate error estimate, it is standard procedure to repeat the cross-validation process 10 times (that is, 10 times tenfold cross-validation) and average the results.
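
    Ten times tenfold cross-validation can be sketched with RepeatedStratifiedKFold, which simply reshuffles and reruns stratified 10-fold CV (assumed setup as above; 100 fits in total):

        from sklearn.datasets import load_iris
        from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

        # 10 repetitions of stratified 10-fold CV = 100 train/test runs
        scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
        print(scores.mean())  # averaged over all 100 estimates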

   In summary, 10-fold cross-validation: split the dataset into ten parts and, in turn, use nine parts as training data and one part as test data; each of the ten runs yields an accuracy (or error rate), and the average of the ten results is taken as the estimate of the algorithm's accuracy. Usually several further rounds of 10-fold cross-validation are performed as well (for example, ten times 10-fold cross-validation) and their mean is taken as the estimate. Ten is chosen because extensive experiments on many datasets with different learning techniques have shown that 10 folds gives roughly the best error estimate, and there is some theoretical backing for this. This is not the final word, however; debate continues, and 5-fold or 20-fold cross-validation appears to give results much the same as 10-fold.

 

   http://bioinfo.cipf.es/wikigepas/_media/cross_validation.jpg


   References:
      Witten, I. H., & Frank, E. Data Mining: Practical Machine Learning Tools and Techniques.
      http://blog.sina.com.cn/wjw84221
      http://blog.csdn.net/daringpig/article/details/8053681
      http://blog.csdn.net/xywlpo/article/details/6531128

 

 
