Paper notes: The Marginal Value of Adaptive Gradient Methods in Machine Learning

This paper questions the value of adaptive optimization algorithms. It compares the generalization performance of SGD and SGD with momentum against AdaGrad, RMSProp, and Adam, and supports its argument both with a specially constructed convex binary-classification problem and with experiments on four deep networks.

The main conclusions are:

  1. SGD and SGD with momentum outperform adaptive methods on the development/test set.
  2. Adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development/test set.
  3. The same amount of hyperparameter tuning was required for all methods, including adaptive methods.

On the constructed binary-classification problem, the authors find that adaptive methods tend to give undue influence to spurious features that have no effect on out-of-sample generalization.

The paper also notes that the way an optimizer descends affects generalization: "sharp" minimizers generalize poorly, whereas "flat" minimizers generalize well. It has been shown empirically that Adam converges to sharper minimizers as the batch size increases, and this effect appears even at relatively small batch sizes.

The paper also shows that the choice and tuning of the learning rate has a large effect on an optimizer's performance in deep learning, even for Adam. Two learning-rate decay schemes are proposed: a development-based decay scheme (dev-decay) and a fixed-frequency decay scheme (fixed-decay), sketched in code after the list below.

  1. For dev-decay, we keep track of the best validation performance so far, and at each epoch decay the learning rate by a constant factor if the model does not attain a new best value.
  2. For fixed-decay, we decay the learning rate by a constant factor every k epochs.
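
Below is a minimal Python sketch of the two schemes as I understand them from the description above. The class names (`DevDecay`, `FixedDecay`), the decay factor of 0.9, and the assumption that the validation metric is a loss (lower is better) are my own illustrative choices, not taken from the paper's code.

```python
class DevDecay:
    """dev-decay: decay the learning rate by a constant factor whenever the
    development-set metric fails to improve on its best value so far."""

    def __init__(self, lr, decay_factor=0.9):
        self.lr = lr
        self.decay_factor = decay_factor
        self.best = float("inf")  # assumes a loss-like metric (lower is better)

    def step(self, dev_loss):
        if dev_loss < self.best:
            self.best = dev_loss          # new best: keep the current learning rate
        else:
            self.lr *= self.decay_factor  # no improvement: decay
        return self.lr


class FixedDecay:
    """fixed-decay: decay the learning rate by a constant factor every k epochs."""

    def __init__(self, lr, decay_factor=0.9, k=10):
        self.lr = lr
        self.decay_factor = decay_factor
        self.k = k

    def step(self, epoch):
        if epoch > 0 and epoch % self.k == 0:
            self.lr *= self.decay_factor
        return self.lr


if __name__ == "__main__":
    # Toy usage with a made-up validation-loss curve.
    dev = DevDecay(lr=0.1)
    for loss in [1.0, 0.8, 0.85, 0.7, 0.72]:
        print("dev-decay lr:", dev.step(loss))

    fixed = FixedDecay(lr=0.1, k=2)
    for epoch in range(6):
        print("fixed-decay lr:", fixed.step(epoch))
```

In the paper, the initial learning rate and the decay settings are treated as hyperparameters to tune for each optimizer; the constants above are placeholders.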