Machine Learning Yearning, Chapter 7

How large do the dev/test sets need to be?

The dev set should be large enough to detect differences between the algorithms you are trying out. For example, if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, a dev set of 100 examples cannot detect this 0.1% difference. Compared to other machine learning problems I've seen, a 100-example dev set is small. Dev sets with sizes from 1,000 to 10,000 examples are common. With 10,000 examples, you will have a good chance of detecting an improvement of 0.1%.[1]
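The arithmetic behind the 100-example case can be sketched in a few lines of Python. This is only an illustration of measurement granularity, not a statistical test: the function name `dev_set_resolution` is made up for this example, and it simply reports how much accuracy one dev example is worth and how many examples a given accuracy gap corresponds to.

```python
def dev_set_resolution(n_examples: int, accuracy_gap: float) -> tuple[float, float]:
    """Return (accuracy change caused by one example, examples covered by the gap).

    Hypothetical helper for illustration only; it looks at granularity,
    not at sampling noise.
    """
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    step = 1.0 / n_examples                      # one example flips accuracy by this much
    examples_in_gap = n_examples * accuracy_gap  # how many examples the gap represents
    return step, examples_in_gap


if __name__ == "__main__":
    gap = 0.001  # the 0.1% difference between classifiers A and B
    for n in (100, 1_000, 10_000):
        step, k = dev_set_resolution(n, gap)
        print(f"n={n:>6}: one example = {step:.2%} of accuracy, "
              f"a 0.1% gap = {k:g} examples")
```

With 100 examples, a single example moves accuracy by a full percentage point, so a 0.1% gap is a fraction of one example and cannot show up at all. With 10,000 examples the same gap corresponds to 10 examples.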

For mature and important applications (for example, advertising, web search, and product recommendations), I have also seen teams that are highly motivated to eke out even a 0.01% improvement, since it has a direct impact on the company's profits. In this case, the dev set could be much larger than 10,000, in order to detect even smaller improvements.
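As a rough sketch of how the required size scales with the improvement you care about, the snippet below inverts the same arithmetic. The rule it uses is an assumption, not something the chapter states: it asks that the target gap correspond to at least 10 dev examples, which is simply the ratio implied by pairing a 0.1% gap with a 10,000-example dev set. The name `min_dev_size` and its parameters are hypothetical.

```python
import math
from fractions import Fraction


def min_dev_size(accuracy_gap: str, min_examples_in_gap: int = 10) -> int:
    """Smallest dev set in which `accuracy_gap` spans `min_examples_in_gap` examples.

    `accuracy_gap` is passed as a decimal string (e.g. "0.0001" for 0.01%)
    so the division is exact. The default of 10 examples is an assumed
    rule of thumb chosen for illustration.
    """
    gap = Fraction(accuracy_gap)
    if gap <= 0:
        raise ValueError("accuracy_gap must be positive")
    return math.ceil(min_examples_in_gap / gap)


if __name__ == "__main__":
    print(min_dev_size("0.001"))   # gap of 0.1%
    print(min_dev_size("0.0001"))  # gap of 0.01%
```

Under this assumption, shrinking the gap of interest by a factor of ten multiplies the dev set size by ten, which is why teams chasing 0.01% gains end up with dev sets far above 10,000 examples.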

How about the size of the test set? It should be large enough to give high confidence in the overall performance of your system. One popular heuristic had been to use 30% of your data for your test set. This works well when you have a modest number of examples, say 100 to 10,000. But in the era of big data, where machine learning problems sometimes have more than a billion examples, the fraction of data allocated to dev/test sets has been shrinking, even as the absolute number of examples in the dev/test sets has been growing. There is no need to have dev/test sets that are excessively large beyond what is needed to evaluate the performance of your algorithms.
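One way to picture the shrinking fraction is a sizing rule that follows the 30% heuristic on small datasets and stops growing once the evaluation set is large enough. The sketch below is illustrative only: `eval_set_size` is a hypothetical helper, and the default cap of 10,000 examples is an assumed placeholder, since the right number depends on how small an improvement you need to detect.

```python
def eval_set_size(n_total: int, percent: int = 30, cap: int = 10_000) -> int:
    """Size of a dev or test set: `percent` of the data, but never more than `cap`.

    Both defaults are assumptions for illustration; pick `cap` from the
    confidence you need, not from this example.
    """
    if n_total < 0 or not 0 <= percent <= 100 or cap < 0:
        raise ValueError("invalid arguments")
    return min(n_total * percent // 100, cap)


if __name__ == "__main__":
    for n_total in (1_000, 10_000, 1_000_000, 1_000_000_000):
        size = eval_set_size(n_total)
        share = size / n_total
        print(f"total={n_total:>13,}  eval set={size:>6,}  share={share:.4%}")
```

For the small datasets the share stays at 30%, while for a billion examples the same capped set is a tiny fraction of the data, even though its absolute size is larger than a 30% split of a 1,000-example dataset.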


My ability is limited; if you find any mistakes, corrections and advice are welcome.

Translator: wexin_42141390, email: 1259975740@qq.com


[1] In theory, one could also test whether a change to an algorithm makes a statistically significant difference on the dev set. In practice, most teams don't bother with this (unless they are publishing academic research papers), and I usually do not find statistical significance tests useful for measuring interim progress.
