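Why are you asking this question?
First, it is important to understand why you are asking how large a training dataset you need.
You may be in one of the following situations: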
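- You have too much data. Consider building learning curves to estimate how large a representative sample needs to be, or use a big data framework so that you can use all of the available data.
- You have too little data. First confirm that you really do have too little data. Then consider collecting more data, or using data augmentation methods to artificially increase the size of your sample.
- You have not collected any data yet. Start collecting data and evaluate whether it is enough. If this is for a study, or data collection is expensive, consider talking to a domain expert or a statistician.
- In my own practice, I regularly use learning curves, apply resampling methods such as k-fold cross-validation and the bootstrap on small datasets, and add confidence intervals to final results.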
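As a rough illustration of the last point, the sketch below attaches a percentile bootstrap confidence interval to a classification accuracy. It uses only the Python standard library. The function name `bootstrap_accuracy_ci`, its parameters, and the toy labels are assumptions made for this example, not something taken from a particular library.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    rng = random.Random(seed)
    # 1 where the prediction is correct, 0 where it is wrong.
    hits = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(hits)
    point = sum(hits) / n

    # Resample the per-example outcomes with replacement and re-score each time.
    scores = []
    for _ in range(n_resamples):
        sample = rng.choices(hits, k=n)
        scores.append(sum(sample) / n)
    scores.sort()

    # Read the interval off the empirical percentiles of the resampled scores.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, scores[lo_idx], scores[hi_idx]


# Hypothetical toy data: 40 labels, with every fifth prediction flipped.
y_true = [0, 1] * 20
y_pred = list(y_true)
for i in range(0, 40, 5):
    y_pred[i] = 1 - y_pred[i]

point, lo, hi = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

A wide interval is one sign that the evaluation set may be too small to support a confident conclusion.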
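So, how much training data do you actually need?
1. It depends; there is no one-size-fits-all answer
No one can tell you how much training data you need without knowing your project. This is a difficult question, and you will often have to find the answer through empirical investigation.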
The amount of data required for machine learning depends on many factors, such as:
- The complexity of the problem, nominally the unknown underlying function that best relates your input variables to the output variable.
- The complexity of the learning algorithm, nominally the algorithm used to inductively learn the unknown underlying mapping function from specific examples.
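One common form of empirical investigation is a learning curve: train the same model on progressively larger samples and track its score on a fixed held-out set. The following is a minimal sketch using only the Python standard library. The synthetic two-class data (`make_blobs`), the nearest-centroid model, and the chosen training sizes are all hypothetical stand-ins for your own data and algorithm.

```python
import random
import statistics


def make_blobs(n, rng):
    """Hypothetical synthetic data: two overlapping 2-D Gaussian classes."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        center = 0.0 if label == 0 else 1.5
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data


def fit_centroids(train):
    """Nearest-centroid 'model': the mean point of each class seen in training."""
    centroids = {}
    for label in {y for _, y in train}:
        pts = [x for x, y in train if y == label]
        centroids[label] = tuple(statistics.fmean(c) for c in zip(*pts))
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def accuracy(centroids, data):
    return sum(predict(centroids, x) == y for x, y in data) / len(data)


def learning_curve(train_sizes, n_repeats=20, seed=0):
    """Mean and standard deviation of held-out accuracy per training size."""
    rng = random.Random(seed)
    test = make_blobs(2000, rng)  # fixed held-out set shared by all sizes
    curve = []
    for n in train_sizes:
        scores = [
            accuracy(fit_centroids(make_blobs(n, rng)), test)
            for _ in range(n_repeats)
        ]
        curve.append((n, statistics.fmean(scores), statistics.stdev(scores)))
    return curve


curve = learning_curve([5, 20, 80, 320])
for n, mean_acc, sd in curve:
    print(f"n={n:4d}  accuracy={mean_acc:.3f} +/- {sd:.3f}")
```

If the score has stopped improving as the training size grows, more data of the same kind is unlikely to help much with this model. If it is still climbing, collecting more data may be worthwhile.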
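2. Reason by analogy from the experience of others
Many people have worked on machine learning problems before you, and some of them have published papers on their work. You may be able to find studies on problems similar to yours and use them to estimate how much data you need.
You can also look for studies on how algorithm performance changes with dataset size. You can search for such papers on Google, Google Scholar, and arXiv.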
3. Use your domain expertise
You need a sample of data from your problem that is representative of the problem you are trying to solve.
In general, the examples must be independent and identically distributed.
Remember, in machine learning we are learning a function to map input data to output data. The mapping function learned will only be as good as the data you provide it from which to learn.
This means that there needs to be enough data to reasonably capture the relationships that may exist both between input features and between input features and output features.
Use your domain knowledge, or find a domain expert, and reason about the domain and the scale of data that may be required to reasonably capture the useful complexity in the problem.
4. Use a statistical heuristic
There are statistical heuristic methods available that allow you to calculate a suitable sample size.
Most of the heuristics I have seen have been for classification problems as a function of the number of classes, input features or model parameters. Some heuristics seem rigorous, others seem completely ad hoc.
Here are some examples you may consider:
- Factor of the number of classes: T