Question: The sample used for training (or testing) might not be representative.
If, by bad luck, all examples with a certain class were omitted from
the training set, you could hardly expect a classifier learned from that
data to perform well on examples of that class—and the situation would
be exacerbated by the fact that the class would necessarily be
overrepresented in the test set because none of its instances made it into
the training set! (Because the training set contains no examples of some
class, the learned classifier performs poorly on it; worse still, the test
set then holds a large share of exactly the examples the training set lacks.)
Solution 1: stratification
The random sampling is done in a way that guarantees that each class is
properly represented in both training and test sets.
(That is, to ensure that both the training and the test portions of the
split are representative, the random sampling is constrained so that
within each split, every class appears in roughly the same proportion as
in the overall dataset; this is stratification.)
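As a minimal sketch (pure Python, with a made-up label list; not from the book), a stratified holdout split can shuffle the indices of each class separately and then move the same fraction of each class into the test set:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) such that each class appears in the
    test set in roughly the same proportion as in the full dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # random sampling within the class
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])              # same fraction from every class
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# A 2:1 class imbalance is preserved in both halves:
labels = ["a"] * 20 + ["b"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
```

A plain (unstratified) random split could, by bad luck, put all ten "b" examples into one side; sampling per class rules that out.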
Solution 2: cross-validation (CV)
CV is a statistical method for evaluating a classifier's performance. The
basic idea is to partition the original dataset into groups: one part
serves as the training set and another as the validation (test) set. The
classifier is first trained on the training set, and the resulting model
is then evaluated on the validation set; that evaluation serves as the
performance measure of the classifier.
Cross-validation should generally satisfy two conditions: 1) the training
set should be large enough, usually more than half the data; 2) the
training and test sets should be sampled uniformly from the data.
Common CV methods are as follows:
(1) Double cross-validation (2-CV)
The dataset is split into two equal-sized subsets and the classifier is
trained in two rounds. In the first round, one subset serves as the
training set and the other as the test set; in the second round, the
training set and test set are swapped and the classifier is trained again.
What we mainly care about is the accuracy on the two test sets. In
practice, however, 2-CV is seldom used: the main reason is that the
training set then contains too few samples to represent the distribution
of the whole population, which makes the test-stage accuracy unreliable.
It is better to use more than half the data for training even at the expense of test data.
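The two-round procedure can be sketched as follows, assuming a hypothetical `evaluate(train_idx, test_idx)` callback (not from the book) that trains on one index set and returns accuracy on the other:

```python
import random

def double_cv(n_examples, evaluate, seed=0):
    """2-CV sketch: split the data in half, evaluate, swap the halves,
    evaluate again, and average the two test accuracies."""
    idxs = list(range(n_examples))
    random.Random(seed).shuffle(idxs)
    half_a, half_b = idxs[: n_examples // 2], idxs[n_examples // 2 :]
    acc1 = evaluate(half_a, half_b)   # round 1: train on A, test on B
    acc2 = evaluate(half_b, half_a)   # round 2: roles swapped
    return (acc1 + acc2) / 2
```

With only half the data available for training in each round, both models tend to underfit, which is exactly the weakness noted above.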
(2)Threefold cross-validation
In cross-validation, you decide on a fixed number of folds, or partitions,
of the data. Suppose we use three. Then the data is split into three
approximately equal partitions; each in turn is used for testing and the
remainder is used for training. That is, use two-thirds of the data for
training and one-third for testing, and repeat the procedure three times
so that in the end, every instance has been used exactly once for testing.
This is called threefold cross-validation,
and if stratification is adopted as well—which it often is—
it is stratified threefold cross-validation.
(3) K-fold cross-validation
The dataset is split into k equal-sized subsets; each subset in turn
serves as the test set while the remaining samples form the training set,
so one run of k-CV builds k models in total. When there are enough
samples, k = 10 is generally considered quite sufficient.
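A minimal sketch of the fold bookkeeping (pure Python; the model-building step is left out), assuming indices 0..n-1 identify the examples:

```python
import random

def kfold_indices(n_examples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs: the data is shuffled once and
    partitioned into k roughly equal folds; each fold serves exactly
    once as the test set, the other k-1 folds as the training set."""
    idxs = list(range(n_examples))
    random.Random(seed).shuffle(idxs)
    folds = [idxs[i::k] for i in range(k)]     # k roughly equal slices
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

For the stratified variant, the same slicing would simply be applied within each class separately before the folds are assembled.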
Why 10? Extensive tests on numerous different datasets, with different
learning techniques, have shown that 10 is about the right number of folds
to get the best estimate of error, and there is also some theoretical evidence
that backs this up.
The data is divided randomly into 10 parts in which the class is
represented in approximately the same proportions as in the full dataset.
Each part is held out in turn and the learning scheme trained on the
remaining nine-tenths; then its error rate is calculated on the holdout
set. Thus, the learning procedure is executed a total of 10 times on
different training sets (each set has a lot in common with the others).
Finally, the 10 error estimates are averaged to yield an overall error
estimate.
When seeking an accurate error estimate, it is standard procedure to
repeat the cross-validation process 10 times—that is, 10 times tenfold
cross-validation—and average the results.
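That repetition is just an outer loop: run the whole cross-validation with a different shuffle each time and average the per-run estimates. A sketch, assuming a hypothetical `run_cv(seed)` that performs one complete tenfold cross-validation and returns its mean error rate:

```python
def repeated_cv(run_cv, repeats=10):
    """10 x 10-fold CV sketch: average the error estimates of `repeats`
    independent cross-validation runs, each seeded differently."""
    errors = [run_cv(seed) for seed in range(repeats)]
    return sum(errors) / len(errors)
```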
(In summary, 10-fold cross-validation: split the dataset into ten parts;
in turn use nine of them as training data and one as test data, and run
the experiment. Each run yields an accuracy (or error rate), and the
average over the 10 runs is taken as the estimate of the algorithm's
accuracy. It is usually necessary to perform several rounds of 10-fold
cross-validation (for example, 10 times 10-fold cross-validation) and
take the mean as the estimate of the algorithm's accuracy. The dataset is
split into 10 parts because extensive experiments on many datasets, with
different learning techniques, have shown that 10 folds is about the
right choice for the best error estimate, and there is also some
theoretical evidence to back this up. This is not the final verdict,
however; debate continues, and 5 folds or 20 folds seem to give results
much the same as 10 folds.)
References: Data Mining: Practical Machine Learning Tools and Techniques
http://blog.sina.com.cn/wjw84221
http://blog.csdn.net/daringpig/article/details/8053681
http://blog.csdn.net/xywlpo/article/details/6531128