LR与batchsize的关系_batchsize和lr-CSDN博客

批量大小（batchsize）对模型训练时间和稳定性有直接影响，大batchsize能减少训练时间并使梯度计算更稳定，但在某些情况下可能导致泛化能力下降。研究指出，大的batchsize可能由于训练迭代次数不足而影响性能。适当地调整batchsize和学习率的组合对于优化模型的性能至关重要。

摘要由CSDN通过智能技术生成

batchsize越大，LR越大。

batchsize增大K倍，可将lr增加sqrt(K)倍，提供训练速度。实际中，直接将lr增大K倍效果更好。

从上面的结果可以看出，对于采用非自适应学习率变换的方法，学习率的绝对值对模型的性能有较大影响，研究者常使用step变化策略。

3 Batchsize如何影响模型性能？
模型性能对batchsize虽然没有学习率那么敏感，但是在进一步提升模型性能时，batchsize就会成为一个非常关键的参数。

3.1、大的batchsize减少训练时间，提高稳定性
这是肯定的，同样的epoch数目，大的batchsize需要的batch数目减少了，所以可以减少训练时间，目前已经有多篇公开论文在1小时内训练完ImageNet数据集。另一方面，大的batch size梯度的计算更加稳定，因为模型训练曲线会更加平滑。在微调的时候，大的batch size可能会取得更好的结果。

3.2、大的batchsize导致模型泛化能力下降
在一定范围内，增加batchsize有助于收敛的稳定性，但是随着batchsize的增加，模型的性能会下降，如下图，来自于文[5]。

Hoffer[7]等人的研究表明，大的batchsize性能下降是因为训练时间不够长，本质上并不少batchsize的问题，在同样的epochs下的参数更新变少了，因此需要更长的迭代次数。

3.3、小结

batchsize在变得很大(超过一个临界点)时，会降低模型的泛化能力。在此临界点之下，模型的性能变换随batch size通常没有学习率敏感。

参考文献
[1] Smith L N. Cyclical learning rates for training neural networks[C]//2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017: 464-472.

[2] Loshchilov I, Hutter F. Sgdr: Stochastic gradient descent with warm restarts[J]. arXiv preprint arXiv:1608.03983, 2016.

[3] Reddi S J, Kale S, Kumar S. On the convergence of adam and beyond[J]. 2018.

[4] Keskar N S, Socher R. Improving generalization performance by switching from adam to sgd[J]. arXiv preprint arXiv:1712.07628, 2017.

[5] Goyal P, Dollar P, Girshick R B, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.[J]. arXiv: Computer Vision and Pattern Recognition, 2017.

[6] Keskar N S, Mudigere D, Nocedal J, et al. On large-batch training for deep learning: Generalization gap and sharp minima[J]. arXiv preprint arXiv:1609.04836, 2016.

[7] Hoffer E, Hubara I, Soudry D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks[C]//Advances in Neural Information Processing Systems. 2017: 1731-1741.

[8] Smith S L, Kindermans P J, Ying C, et al. Don’t decay the learning rate, increase the batch size[J]. arXiv preprint arXiv:1711.00489, 2017.

[9] Smith S L, Le Q V. A bayesian perspective on generalization and stochastic gradient descent[J]. arXiv preprint arXiv:1710.06451, 2017.
————————————————
版权声明：本文为CSDN博主「Dontla」的原创文章，遵循CC 4.0 BY-SA版权协议，转载请附上原文出处链接及本声明。
原文链接：https://blog.csdn.net/Dontla/article/details/104288945