How large a training set is needed?

Is there a common method used to determine how many training samples are required to train a classifier (an LDA in this case) so that it reaches a minimum threshold of generalization accuracy?

I am asking because I would like to minimize the calibration time usually required in a brain-computer interface.

The search term you are looking for is "learning curve", which gives the (average) model performance as a function of the training sample size.

Learning curves depend on a lot of things, e.g.

  • classification method
  • complexity of the classifier
  • how well the classes are separated.

(I think for two-class LDA you may be able to derive some theoretical power calculations, but the crucial question is always whether your data actually meet the "equal-covariance multivariate normal" assumption. I'd go for some simulation based on the LDA assumptions as well as resampling of your already existing data; sketches of both follow below.)
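For illustration, here is a minimal sketch in Python of the simulation route: two classes are drawn from multivariate normals with a shared covariance (exactly the LDA assumption), an LDA is trained on increasingly many cases, and its accuracy is measured on a large independent test set. The class means, covariance, dimensionality, sample sizes and the use of scikit-learn are all assumptions made for the example, not part of any particular study:

    # Simulated learning curve for two-class LDA under its own assumptions:
    # equal-covariance multivariate normal classes. All parameters below
    # (means, covariance, sample sizes) are made up for illustration.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)

    def simulate(n_per_class, mean0, mean1, cov):
        """Draw n_per_class cases per class from the assumed model."""
        X = np.vstack([rng.multivariate_normal(mean0, cov, n_per_class),
                       rng.multivariate_normal(mean1, cov, n_per_class)])
        y = np.repeat([0, 1], n_per_class)
        return X, y

    mean0, mean1 = np.zeros(4), np.full(4, 1.0)    # assumed class means
    cov = np.eye(4)                                # assumed common covariance

    # large independent test set so the test error itself is nearly noise-free
    X_test, y_test = simulate(5000, mean0, mean1, cov)

    for n in [5, 10, 20, 50, 100, 200]:
        accs = [LinearDiscriminantAnalysis()
                .fit(*simulate(n, mean0, mean1, cov))
                .score(X_test, y_test)
                for _ in range(100)]               # average over training sets
        print(f"n per class = {n:3d}: mean accuracy = {np.mean(accs):.3f}")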

There are two aspects of the performance of a classifier trained on a finite sample size n (as usual):

  • bias, i.e. on average a classifier trained on n training samples is worse than the classifier trained on n = ∞ training cases (this is usually what is meant by the learning curve), and
  • variance: a given training set of n cases may lead to quite different model performance.
    Even with few cases, you may be lucky and get good results. Or you may have bad luck and get a really bad classifier.
    As usual, this variance decreases with increasing training sample size n (the resampling sketch below illustrates both aspects).
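Both aspects can be read off the same resampling experiment: for each candidate training size n, draw many random subsets of your existing calibration data, refit the classifier, and look at both the average and the spread of the test performance. The sketch below uses a synthetic data set from scikit-learn's make_classification purely as a stand-in for your recorded data; the subset sizes and repetition counts are arbitrary choices:

    # Learning curve with variance from resampling an existing data set.
    # X, y stand in for already recorded calibration data.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                               random_state=1)    # placeholder for real data
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=200, stratify=y, random_state=1)

    for n in [20, 50, 100, 200, 400]:
        accs = []
        for _ in range(200):
            idx = rng.choice(len(X_pool), size=n, replace=False)  # random subset
            lda = LinearDiscriminantAnalysis().fit(X_pool[idx], y_pool[idx])
            accs.append(lda.score(X_test, y_test))
        # mean ~ learning curve (bias aspect), std ~ training-set-to-training-set variance
        print(f"n = {n:3d}: accuracy {np.mean(accs):.3f} +/- {np.std(accs):.3f}")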

Another aspect that you may need to take into account is that it is usually not enough to train a good classifier; you also need to show that the classifier is good (or good enough). So you should also plan the sample size needed for validation with a given precision. If you need to report these results as a fraction of successes among so many test cases (e.g. producer's or consumer's accuracy / precision / sensitivity / positive predictive value), and the underlying classification task is rather easy, this can require more independent cases than training a good model does.
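To get a feeling for the numbers on the validation side, the required number of independent test cases for a given precision can be sketched with a normal-approximation confidence interval for a proportion. The expected sensitivity of 0.9 and the target half-width of 0.05 below are assumed example values, not recommendations:

    # How many independent test cases are needed so that the confidence
    # interval for a proportion (e.g. sensitivity) has a given half-width?
    from scipy.stats import norm

    def test_cases_needed(p_expected, half_width, confidence=0.95):
        """Normal-approximation sample size for estimating a proportion."""
        z = norm.ppf(1 - (1 - confidence) / 2)
        return (z / half_width) ** 2 * p_expected * (1 - p_expected)

    # e.g. expecting ~90 % sensitivity, wanting it known to within +/- 5 %:
    print(round(test_cases_needed(0.9, 0.05)))   # ~138 cases of that class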

As a rule of thumb, for training, the sample size is usually discussed in relation to model complexity (number of cases : number of variates), whereas absolute bounds on the test sample size can be given for a required precision of the performance measurement.

Here's a paper where we explain these things in more detail and also discuss how to construct learning curves:
Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33.
DOI: 10.1016/j.aca.2012.11.007
Accepted manuscript on arXiv: 1211.1323

The "teaser" figure in that paper shows an easy classification problem (we actually have one easy distinction like this in our classification problem, but other classes are far more difficult to distinguish).

We did not try to extrapolate to larger training sample sizes to determine how many more training cases are needed, because the test sample sizes are our bottleneck, and larger training sample sizes would let us construct more complex models, so extrapolation is questionable. For the kind of data sets I have, I'd approach this iteratively: measure a bunch of new cases, show how much things improved, measure more cases, and so on.

This may be different for you, but the paper contains literature references to papers that use extrapolation to higher sample sizes in order to estimate the required number of samples.

Asking about training sample size implies you are going to hold back data for model validation. This is an unstable process requiring a huge sample size. Strong internal validation with the bootstrap is often preferred. If you choose that path, you only need to compute one sample size. As @cbeleites so nicely stated, this is often an "events per candidate variable" assessment, but you need a minimum of 96 observations to accurately predict the probability of a binary outcome even if there are no features to be examined [this is to achieve a 0.95 confidence margin of error of 0.1 in estimating the actual marginal probability that Y = 1].
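The 96-observation figure can be reproduced with the usual worst-case sample size calculation for a proportion (the binomial variance p(1 - p) is largest at p = 0.5); here is that arithmetic in Python:

    # Worst-case sample size for estimating a marginal probability with
    # 0.95 confidence and a margin of error of 0.1.
    from scipy.stats import norm

    z = norm.ppf(0.975)                  # ~1.96
    n = (z / 0.1) ** 2 * 0.5 * 0.5       # p(1 - p) is maximal at p = 0.5
    print(round(n))                      # 96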

It is important to consider proper scoring rules for accuracy assessment (e.g., Brier score and log likelihood/deviance). Also make sure you really want to classify observations as opposed to estimating membership probability. The latter is almost always more useful as it allows a gray zone.
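If you work with predicted class-membership probabilities, proper scoring rules are straightforward to compute; the snippet below uses scikit-learn's brier_score_loss and log_loss on a small made-up set of validation labels and predicted probabilities:

    # Proper scoring rules on predicted membership probabilities.
    # y_true / y_prob are placeholder validation labels and predictions.
    import numpy as np
    from sklearn.metrics import brier_score_loss, log_loss

    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
    y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.9, 0.3, 0.55, 0.2])

    print("Brier score:", brier_score_loss(y_true, y_prob))   # lower is better
    print("log loss   :", log_loss(y_true, y_prob))           # lower is better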
