This article questions the performance of adaptive optimization algorithms, comparing the generalization performance of SGD and SGD with momentum against AdaGrad, RMSProp, and Adam. The claims are supported by a carefully constructed convex binary-classification problem and by experiments on four deep networks.
The main conclusions are:
- SGD and SGD with momentum outperform adaptive methods on the development/test set.
- Adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development/test set.
- The same amount of hyperparameter tuning was required for all methods, including the adaptive ones.
On the constructed binary-classification problem, the paper finds that adaptive methods tend to give undue influence to spurious features that have no effect on out-of-sample generalization.
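The intuition behind this finding can be illustrated with a minimal NumPy sketch (not the paper's exact construction, and the gradient values below are made up for illustration): because adaptive methods normalize each coordinate's step by its own gradient history, a weakly informative (spurious) feature can receive a step just as large as a strongly predictive one.

```python
import numpy as np

# Hypothetical gradient for a 2-feature problem:
# coordinate 0 is strongly informative, coordinate 1 is a spurious feature.
grad = np.array([10.0, 0.01])
lr = 0.1

# SGD step: proportional to gradient magnitude, so the spurious
# feature barely moves.
sgd_step = lr * grad                          # sgd_step ≈ [1.0, 0.001]

# First AdaGrad step: the accumulator equals grad**2, so the update
# reduces to lr * sign(grad) -- both coordinates move equally.
adagrad_step = lr * grad / np.sqrt(grad**2)   # adagrad_step ≈ [0.1, 0.1]
```

This per-coordinate rescaling is what lets spurious features accumulate weight under adaptive methods while plain SGD largely ignores them.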
The paper also notes that how fast an optimizer drives down the training loss affects generalization: "sharp" minimizers generalize poorly, whereas "flat" ones generalize well. It has been shown empirically that Adam converges to sharper minimizers as the batch size increases, and that it can reach sharp minimizers even with small batch sizes.
The paper further shows that the choice and scheduling of the learning rate strongly affect an optimizer's performance in deep learning, even for Adam. Two decay schemes are proposed: a development-based decay scheme (dev-decay) and a fixed-frequency decay scheme (fixed-decay).
- For dev-decay, we keep track of the best validation performance so far and, at each epoch, decay the learning rate by a constant factor if the model does not attain a new best value.
- For fixed-decay, we decay the learning rate by a constant factor every k epochs.
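The two schemes above can be sketched as simple training-loop wrappers. This is a minimal sketch, not the paper's code; `train_one_epoch` and `evaluate` are hypothetical placeholders for one epoch of training at a given learning rate and for computing a validation score.

```python
def dev_decay(lr, decay_factor, num_epochs, train_one_epoch, evaluate):
    """dev-decay: shrink lr by decay_factor whenever the validation
    score fails to beat the best value seen so far."""
    best = float("-inf")
    for epoch in range(num_epochs):
        train_one_epoch(lr)
        val = evaluate()
        if val > best:
            best = val
        else:
            lr *= decay_factor  # no new best: decay the learning rate
    return lr

def fixed_decay(lr, decay_factor, num_epochs, k, train_one_epoch):
    """fixed-decay: shrink lr by decay_factor every k epochs."""
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(lr)
        if epoch % k == 0:
            lr *= decay_factor
    return lr
```

In PyTorch terms, dev-decay corresponds roughly to `ReduceLROnPlateau` and fixed-decay to `StepLR`, though the sketch above keeps the logic explicit.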