TensorFlow optimizer summary

跬步达千里

Article link: https://blog.csdn.net/LIYUAN123ZHOUHUI/article/details/68946448

Optimizers in TensorFlow:

Image source:

https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/3-06-speed-up-learning/

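SGD: stochastic gradient descent. The gradient is computed on one mini-batch of samples at a time rather than on the whole dataset, so the update direction can differ from batch to batch; that randomness is what "stochastic" refers to.

Pros: it solved the training problem in the early days. Plain SGD is used less often now.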
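
As a rough illustration of the mini-batch idea, here is a minimal sketch in plain Python (not TensorFlow code). The function name `sgd_fit`, the toy model y = w * x, and all hyperparameter values are assumptions made for this example only.

```python
import random

def sgd_fit(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x by mini-batch SGD on the squared error (toy example)."""
    rng = random.Random(seed)
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # a different random batch composition every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) w.r.t. w, estimated on this batch only,
            # so its value changes from batch to batch.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

xs = [0.1 * i for i in range(1, 21)]
ys = [3.0 * x for x in xs]   # data generated with w = 3
w_hat = sgd_fit(xs, ys)
print(w_hat)
```

Each update uses only the gradient of the current batch, which is why consecutive updates can disagree with one another.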

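Momentum: because the samples are drawn at random, SGD's gradient updates are noisy, which hurts convergence, and with very large datasets convergence takes longer. Momentum takes the previous update direction into account and blends it with the current gradient to form a new direction, so that the updates move toward the optimum overall.
Pros: when successive gradients agree, learning speeds up; when they disagree, it avoids detours.
Cons: it appears to cause divergence on stock data.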
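
The blending of the previous direction with the current gradient can be sketched as follows. This is plain Python on a one-dimensional toy function, not the TensorFlow implementation; `momentum_minimize`, the momentum coefficient `mu`, and the learning rate are illustrative assumptions.

```python
def momentum_minimize(grad_fn, x0, lr=0.1, mu=0.9, steps=300):
    """Gradient descent with a momentum accumulator (toy, one parameter)."""
    x = x0
    accum = 0.0  # running blend of past gradients
    for _ in range(steps):
        g = grad_fn(x)
        # Keep a fraction mu of the previous direction and add the new gradient.
        accum = mu * accum + g
        x -= lr * accum
    return x

# Minimise f(x) = (x - 5)^2, whose gradient is 2 * (x - 5).
x_min = momentum_minimize(lambda x: 2.0 * (x - 5.0), x0=0.0)
print(x_min)
```

When consecutive gradients point the same way, `accum` grows and the steps get longer; when they alternate in sign, the contributions partly cancel.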

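AdaGrad: intended for large sparse matrices. AdaGrad adapts the learning rate automatically. You still set a global learning rate a, but that is not the rate actually applied: the effective rate is inversely proportional to v, the square root of the accumulated sum of squared past gradients.
This gives every parameter its own learning rate. The usual explanation is that it adds resistance against wrong directions: a parameter whose gradients have been large or frequent accumulates a large v and takes smaller steps, while a rarely updated parameter keeps a larger step.
Pros: appears to be quite useful on stock data.
Cons: v only grows, so the effective learning rate keeps shrinking, and in very deep networks training may stop early.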
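
A minimal sketch of the per-parameter scaling, in plain Python rather than TensorFlow. The function name `adagrad_minimize`, the two-parameter toy objective, and the hyperparameter values are assumptions for illustration.

```python
import math

def adagrad_minimize(grad_fn, x0, lr=0.5, eps=1e-8, steps=100):
    """AdaGrad on a list of parameters; returns parameters and accumulators."""
    x = list(x0)
    accum = [0.0] * len(x)  # per-parameter sum of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        for i in range(len(x)):
            accum[i] += g[i] ** 2
            # Effective rate for parameter i is lr / sqrt(accum[i]).
            x[i] -= lr / (math.sqrt(accum[i]) + eps) * g[i]
    return x, accum

# f(x) = x0^2 + 100 * x1^2: the second parameter sees much larger gradients.
x_final, accum = adagrad_minimize(lambda x: [2.0 * x[0], 200.0 * x[1]], [1.0, 1.0])
print(x_final, accum)
```

The accumulator never decreases, which is the source of the shrinking effective rate mentioned in the cons.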

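Adadelta: an extension of AdaGrad. Instead of accumulating all past squared gradients, Adadelta restricts the accumulation to a fixed-size window of recent gradients (in practice, an exponentially decaying average). It uses only first-order information, so the computational cost is small.
Pros: fairly robust to gradient noise, to different model architectures, to different kinds of data, and to the choice of hyperparameters; once a global learning rate is set, the per-parameter rates adapt automatically.
Cons: in practice it can still be sensitive to the global learning rate, and on stock data it performs worse than AdaGrad.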
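
The following is a sketch of the commonly published Adadelta rule in plain Python, for one parameter. It is not taken from TensorFlow's source; `adadelta_minimize`, the decay `rho`, `eps`, and the extra `lr` multiplier are assumptions chosen for the example.

```python
import math

def adadelta_minimize(grad_fn, x0, lr=1.0, rho=0.95, eps=1e-6, steps=200):
    """Adadelta with decaying averages of squared gradients and squared updates."""
    x = x0
    avg_sq_grad = 0.0   # decaying average of squared gradients
    avg_sq_delta = 0.0  # decaying average of squared updates
    for _ in range(steps):
        g = grad_fn(x)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # The step size is the ratio of the two running RMS values.
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta * delta
        x += lr * delta
    return x

# Minimise f(x) = (x - 5)^2 starting from 0.
x_final = adadelta_minimize(lambda x: 2.0 * (x - 5.0), x0=0.0)
print(x_final)
```

Because both averages start at zero, the first steps are very small; on this toy problem the parameter moves toward 5 but has not reached it after 200 steps.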

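RMSProp: similar to Adam, but it keeps a moving average of the squared gradient only, without Adam's moving average of the gradient itself. RMSProp adds a decay coefficient b1 to AdaGrad, changing AdaGrad's v to v = b1*v + (1-b1)*(dx^2), where dx is the gradient.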
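
The formula v = b1*v + (1-b1)*(dx^2) can be tried out with a small plain-Python sketch. The function name `rmsprop_minimize`, the toy objective, and the values of `lr`, `b1`, and `eps` are assumptions for illustration, not TensorFlow defaults.

```python
import math

def rmsprop_minimize(grad_fn, x0, lr=0.01, b1=0.9, eps=1e-8, steps=1000):
    """RMSProp for one parameter, using the decayed accumulator v."""
    x, v = x0, 0.0
    for _ in range(steps):
        dx = grad_fn(x)
        v = b1 * v + (1.0 - b1) * dx ** 2   # decayed average instead of a plain sum
        x -= lr * dx / (math.sqrt(v) + eps)
    return x

# Minimise f(x) = (x - 5)^2 starting from 0.
x_final = rmsprop_minimize(lambda x: 2.0 * (x - 5.0), x0=0.0)
print(x_final)
```

Unlike AdaGrad's accumulator, v can shrink again when recent gradients are small, so the effective learning rate does not decay toward zero.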

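Adam (Adaptive Moment Estimation): combines Momentum and RMSProp. For each parameter it keeps an exponentially decaying average of past gradients and one of past squared gradients, which together give that parameter its own effective learning rate.
Because only these running averages are stored, rather than any gradient history, the memory cost is fixed at two extra values per parameter, the same as Adadelta (AdaGrad needs only one).
Pros: not sensitive to the global learning rate. Some argue that for architectures such as RNNs Adam is fast and works well, while for architectures such as CNNs SGD with momentum works better.
It is also generally considered effective for sparse matrices and noisy data, and it works fairly well on stock data.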
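
A minimal sketch of the Adam update in plain Python, showing the two running averages kept per parameter. The name `adam_minimize`, the toy objective, and the learning rate of 0.1 are assumptions for this example; `b1`, `b2`, and `eps` use the values commonly quoted for Adam.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    """Adam for one parameter, with bias-corrected moment estimates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1.0 - b1) * g        # momentum-like average of gradients
        v = b2 * v + (1.0 - b2) * g * g    # RMSProp-like average of squared gradients
        m_hat = m / (1.0 - b1 ** t)        # correct the bias from starting at zero
        v_hat = v / (1.0 - b2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimise f(x) = (x - 5)^2 starting from 0.
x_final = adam_minimize(lambda x: 2.0 * (x - 5.0), x0=0.0)
print(x_final)
```

The only state carried between steps is `m` and `v`, which is what keeps the memory cost fixed.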

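FtrlOptimizer: mainly used for ad click-through prediction. Such models typically have tens of millions of dimensions, and therefore a huge number of sparse weights. Its main feature is that weights close to 0 are set to exactly 0, so they can be skipped during computation, which simplifies it. This method has been verified to work fairly well on stock data.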
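
How weights end up at exactly 0 can be sketched with the per-coordinate FTRL-Proximal rule as it is usually published. This is plain Python and may differ in detail from TensorFlow's FtrlOptimizer; the function names, the hyperparameters `alpha`, `beta`, `l1`, `l2`, and the two gradient streams are all assumptions made for illustration.

```python
import math

def ftrl_weight(z, n, alpha=1.0, beta=1.0, l1=1.0, l2=0.0):
    """Closed-form weight for one coordinate; exactly 0 while |z| <= l1."""
    if abs(z) <= l1:
        return 0.0
    sign = 1.0 if z > 0 else -1.0
    return -(z - sign * l1) / ((beta + math.sqrt(n)) / alpha + l2)

def ftrl_update(z, n, w, g, alpha=1.0):
    """Fold one gradient g into the per-coordinate state (z, n)."""
    sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / alpha
    return z + g - sigma * w, n + g * g

def run_ftrl(grads):
    """Feed a stream of gradients to a single coordinate and return its weight."""
    z, n, w = 0.0, 0.0, 0.0
    for g in grads:
        z, n = ftrl_update(z, n, w, g)
        w = ftrl_weight(z, n)
    return w

weak = run_ftrl([0.3, -0.2, 0.1, -0.1])     # small, inconsistent gradients
strong = run_ftrl([-1.0, -1.0, -1.0, -1.0])  # consistent gradients
print(weak, strong)
```

The coordinate with weak, inconsistent gradients never pushes |z| past the L1 threshold, so its weight stays at exactly 0.0 and can be skipped; the coordinate with consistent gradients gets a nonzero weight.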

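Summary: choosing a TensorFlow optimizer takes some skill. It depends mainly on the quality and quantity of your data, on the size of the model, and on the contents of the weight matrices.
Most TensorFlow optimizers were tuned for specific problems, such as image recognition or ad click-through prediction.
If your problem is unusual, you will have to try different methods to find the most suitable one.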

To get better results, changing the random initial values and regularly massaging your data (preprocessing it) is another direction worth trying.
massaging your data:

Sometimes the whole process of moving data is referred to as "ETL" meaning "Extract, Transform, Load". Massaging the data is the "transform" step, but it implies ad-hoc fixes that you have to do to smooth out problems that you have encountered (like a massage does to your muscles) rather than transformations between well-known formats.

Things that you might do to "massage" data include:

Change formats from what the source system emits to what the target system expects, e.g. change date format from d/m/y to m/d/y.
Replace missing values with defaults, e.g. supply "0" when a quantity is not given.
Filter out records that are not needed in the target system.
Check validity of records, and ignore or report on rows that would cause an error if you tried to insert them.
Normalise data to remove variations that should be the same, e.g. replace upper case with lower case, replace "01" with "1".
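
The steps listed above can be sketched in a few lines of Python. The record layout (the `date`, `name`, `quantity`, and `status` fields) and the function name `massage` are hypothetical, chosen only to make the example concrete.

```python
from datetime import datetime

def massage(records):
    """Apply the listed clean-up steps to a list of dict records."""
    cleaned = []
    for rec in records:
        # Filter out records the target system does not need (hypothetical flag).
        if rec.get("status") == "deleted":
            continue
        # Check validity: skip rows whose date is missing or impossible.
        try:
            date = datetime.strptime(rec["date"], "%d/%m/%Y")
        except (KeyError, ValueError):
            continue
        cleaned.append({
            # Change format from d/m/y to m/d/y.
            "date": date.strftime("%m/%d/%Y"),
            # Normalise case and surrounding whitespace.
            "name": rec.get("name", "").strip().lower(),
            # Default to 0 when no quantity is given; int() also turns "01" into 1.
            "quantity": int(rec.get("quantity") or 0),
        })
    return cleaned

records = [
    {"date": "25/12/2016", "name": " ALICE ", "quantity": "01"},
    {"date": "30/02/2009", "name": "Bob"},                      # invalid date
    {"date": "01/02/2017", "name": "Carol", "quantity": None},  # missing quantity
    {"date": "03/04/2017", "name": "Dave", "status": "deleted"},
]
result = massage(records)
print(result)
```

Here invalid rows are silently skipped; a real pipeline might report them instead, as the list suggests.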

And finally there is the less savory practice of massaging the data by throwing out data (or adjusting the numbers) when they don't give you the answer you want. Unfortunately, people doing statistical analysis often massage the data to get rid of those pesky outliers which disprove their theory. Because of this practice, referring to data cleaning as massaging the data is inappropriate. Cleaning the data to make it something that can go into your system (getting rid of meaningless dates like 02/30/2009 because someone else stored them in varchar instead of as dates, separating first and last names into separate fields, fixing all uppercase data, adding default values for fields that require data when the supplied data isn't given, etc.) is one thing; massaging the data implies a practice of adjusting the data inappropriately.
