Coursera | Andrew Ng (02-week-2-2.8) - Adam Optimization Algorithm

This series only adds personal study notes and supplementary derivations to some points of the original course; corrections and criticism are welcome. After studying Andrew Ng's course, I organized it into text to make later review easier. Since I have been working on my English, the series is primarily in English, and I suggest readers also rely mainly on the English with the Chinese as support, as preparation for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom (网易云课堂)


Please credit the author and source when reposting: ZJ, WeChat public account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79105995


2.8 Adam Optimization Algorithm

(Subtitle source: NetEase Cloud Classroom 网易云课堂)


During the history of deep learning, many researchers, including some very well-known researchers, sometimes proposed optimization algorithms and showed that they worked well on a few problems. But those optimization algorithms were subsequently shown not to really generalize that well to the wide range of neural networks you might want to train. So over time, I think the deep learning community actually developed some amount of skepticism about new optimization algorithms. And a lot of people felt that gradient descent with Momentum really works well, and it was difficult to propose things that work much better.


So RMSprop and the Adam optimization algorithm, which we'll talk about in this video, are among those rare algorithms that have really stood up, and have been shown to work well across a wide range of deep learning architectures. So this is one of the algorithms that I wouldn't hesitate to recommend you try, because many people have tried it and seen it work well on many problems. And the Adam optimization algorithm is basically taking Momentum and RMSprop and putting them together. So let's see how that works. To implement Adam you would initialize $V_{dW}=0$, $S_{dW}=0$, and similarly $V_{db}=0$, $S_{db}=0$. And then on iteration t, you would compute the derivatives: compute dW, db using the current mini-batch. So usually, you do this with mini-batch gradient descent. And then you do the Momentum exponentially weighted average, so $V_{dW}=\beta_1 V_{dW}+(1-\beta_1)dW$. But now I'm going to call this $\beta_1$ to distinguish it from the hyperparameter $\beta_2$ we'll use for the RMSprop portion of this. So this is exactly what we had when we were implementing Momentum, except it's now called hyperparameter $\beta_1$ instead of $\beta$. And similarly, you have $V_{db}=\beta_1 V_{db}+(1-\beta_1)db$. And then you do the RMSprop update as well. So now you have a different hyperparameter $\beta_2$: $S_{dW}=\beta_2 S_{dW}+(1-\beta_2)dW^2$. Again, the squaring there is element-wise squaring of your derivatives dW. And then $S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2$. So this is the Momentum-like update with hyperparameter $\beta_1$, and this is the RMSprop-like update with hyperparameter $\beta_2$.
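To make these two running averages concrete, here is a minimal numpy sketch for a single weight matrix (the shapes and variable names are illustrative choices, not the course's code; the $db$ terms follow exactly the same pattern):

```python
import numpy as np

# Illustrative shapes for one layer's weight matrix (not from the lecture).
W = np.random.randn(5, 3)
dW = np.random.randn(5, 3)           # gradient computed on the current mini-batch

V_dW = np.zeros_like(W)              # Momentum-like average of dW    (uses beta1)
S_dW = np.zeros_like(W)              # RMSprop-like average of dW**2  (uses beta2)

beta1, beta2 = 0.9, 0.999
V_dW = beta1 * V_dW + (1 - beta1) * dW             # first-moment estimate
S_dW = beta2 * S_dW + (1 - beta2) * np.square(dW)  # second moment, element-wise square
```

In a real training loop these two averages would persist and keep being updated across iterations rather than being recomputed from zero each time.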


In the typical implementation of Adam, you do implement bias correction. So you're going to have $V^{corrected}$, where corrected means after bias correction: $V^{corrected}_{dW}=V_{dW}/(1-\beta_1^t)$ if you've done t iterations. And similarly, $V^{corrected}_{db}=V_{db}/(1-\beta_1^t)$. And then you implement this bias correction on S as well, so $S^{corrected}_{dW}=S_{dW}/(1-\beta_2^t)$ and $S^{corrected}_{db}=S_{db}/(1-\beta_2^t)$. Finally, you perform the update. So W gets updated as W minus alpha times: if you were just implementing Momentum you'd use $V_{dW}$, or maybe $V^{corrected}_{dW}$, but now we add in the RMSprop portion of this, so we also divide by the square root of $S^{corrected}_{dW}$ plus epsilon, giving $W:=W-\alpha\frac{V^{corrected}_{dW}}{\sqrt{S^{corrected}_{dW}}+\varepsilon}$. And similarly, b gets updated with a similar formula, $b:=b-\alpha\frac{V^{corrected}_{db}}{\sqrt{S^{corrected}_{db}}+\varepsilon}$. And so, this algorithm combines the effect of gradient descent with Momentum together with gradient descent with RMSprop. And this is a commonly used learning algorithm that has proven to be very effective for many different neural networks with a very wide variety of architectures.
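Putting the pieces above together, here is a sketch of one complete Adam iteration. The function name, the dict bookkeeping for V and S, and the default values are my own illustrative choices rather than the course's notation:

```python
import numpy as np

def adam_update(W, b, dW, db, V, S, t,
                alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step for parameters W, b given mini-batch gradients dW, db.

    V and S are dicts holding the running averages, e.g.
    V = {"dW": np.zeros_like(W), "db": np.zeros_like(b)} (same for S),
    and t is the 1-based iteration count used for bias correction.
    """
    # Momentum-like exponentially weighted average of the gradients (beta1).
    V["dW"] = beta1 * V["dW"] + (1 - beta1) * dW
    V["db"] = beta1 * V["db"] + (1 - beta1) * db

    # RMSprop-like exponentially weighted average of the squared gradients (beta2).
    S["dW"] = beta2 * S["dW"] + (1 - beta2) * np.square(dW)
    S["db"] = beta2 * S["db"] + (1 - beta2) * np.square(db)

    # Bias correction, which matters mostly during the first iterations.
    V_dW_corr = V["dW"] / (1 - beta1 ** t)
    V_db_corr = V["db"] / (1 - beta1 ** t)
    S_dW_corr = S["dW"] / (1 - beta2 ** t)
    S_db_corr = S["db"] / (1 - beta2 ** t)

    # Parameter update: Momentum term in the numerator, RMSprop term in the denominator.
    W = W - alpha * V_dW_corr / (np.sqrt(S_dW_corr) + epsilon)
    b = b - alpha * V_db_corr / (np.sqrt(S_db_corr) + epsilon)
    return W, b
```

You would call this once per mini-batch, incrementing t each time, with V and S initialized to zeros before training starts.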


So, this algorithm has a number of hyperparameters. The learning rate hyperparameter alpha is still important and usually needs to be tuned, so you just have to try a range of values and see what works. A common choice, really the default choice, for $\beta_1$ is 0.9. So this is the moving weighted average of dW, right? This is the Momentum-like term. For the hyperparameter $\beta_2$, the authors of the Adam paper, the inventors of the Adam algorithm, recommend 0.999. Again, this is computing the moving weighted average of $dW^2$ as well as $db^2$. And then for epsilon, the choice of epsilon doesn't matter very much, but the authors of the Adam paper recommend $10^{-8}$. You really don't need to set this parameter, and it doesn't affect performance much at all. But when implementing Adam, what people usually do is just use the default values for $\beta_1$ and $\beta_2$ as well as epsilon. I don't think anyone ever really tunes epsilon. And then, try a range of values of alpha to see what works best. You could also tune $\beta_1$ and $\beta_2$, but it's not done that often among the practitioners I know.
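As a practical aside that goes slightly beyond the lecture: most frameworks ship with exactly these defaults, so in day-to-day use you normally leave $\beta_1$, $\beta_2$ and $\varepsilon$ alone and only search over the learning rate. For example, a sketch with PyTorch's built-in Adam optimizer (assuming PyTorch is installed; the tiny model is just a placeholder):

```python
import torch

model = torch.nn.Linear(10, 1)   # any model; a one-layer placeholder here

# betas=(beta1, beta2) and eps already default to the paper's recommendations;
# lr (the alpha of this lecture) is the one value you would normally tune.
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,             # try a range, e.g. 1e-4 up to 1e-1
                             betas=(0.9, 0.999),
                             eps=1e-8)
```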


So, where does the term "Adam" come from? Adam stands for Adaptive Moment Estimation. So $\beta_1$ is computing the mean of the derivatives; this is called the first moment. And $\beta_2$ is used to compute the exponentially weighted average of the squares, and that's called the second moment. So that gives rise to the name adaptive moment estimation. But everyone just calls it the Adam optimization algorithm. And, by the way, one of my long-term friends and collaborators is called Adam Coates. As far as I know, this algorithm doesn't have anything to do with him, except for the fact that I think he uses it sometimes. But sometimes I get asked that question, so just in case you're wondering. So, that's it for the Adam optimization algorithm. With it, I think you can really train your neural networks much more quickly. But before we wrap up for this week, let's keep talking about hyperparameter tuning, as well as gain some more intuition about what the optimization problem for neural networks looks like. In the next video, we'll talk about learning rate decay.



Summary of key points:

Adam optimization algorithm

The basic idea of the Adam optimization algorithm is to combine Momentum and RMSprop into a single optimizer that works well across many different deep learning architectures.

Algorithm implementation

  • Initialize: $V_{dW}=0$, $S_{dW}=0$, $V_{db}=0$, $S_{db}=0$
  • On iteration $t$:
    • Compute $dW$, $db$ on the current mini-batch
    • $V_{dW}=\beta_1 V_{dW}+(1-\beta_1)dW$,  $V_{db}=\beta_1 V_{db}+(1-\beta_1)db$
    • $S_{dW}=\beta_2 S_{dW}+(1-\beta_2)(dW)^2$,  $S_{db}=\beta_2 S_{db}+(1-\beta_2)(db)^2$
    • $V^{corrected}_{dW}=V_{dW}/(1-\beta_1^t)$,  $V^{corrected}_{db}=V_{db}/(1-\beta_1^t)$
    • $S^{corrected}_{dW}=S_{dW}/(1-\beta_2^t)$,  $S^{corrected}_{db}=S_{db}/(1-\beta_2^t)$
    • $W:=W-\alpha\dfrac{V^{corrected}_{dW}}{\sqrt{S^{corrected}_{dW}}+\varepsilon}$,  $b:=b-\alpha\dfrac{V^{corrected}_{db}}{\sqrt{S^{corrected}_{db}}+\varepsilon}$
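As a quick sanity check of the bias-correction step, consider the very first iteration with $\beta_1=0.9$:

$$V_{dW} = 0.9 \cdot 0 + 0.1\,dW = 0.1\,dW, \qquad V^{corrected}_{dW} = \frac{0.1\,dW}{1-0.9^1} = dW,$$

so the corrected estimate is already on the scale of the actual gradient instead of being biased toward the zero initialization; as $t$ grows, $1-\beta_1^t \to 1$ and the correction fades away.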

Choosing the hyperparameters

  • $\alpha$: needs to be tuned;
  • $\beta_1$: common default is 0.9 (the weighted average of $dW$);
  • $\beta_2$: recommended value is 0.999 (the weighted average of $dW^2$);
  • $\varepsilon$: recommended value is $10^{-8}$.

Adam stands for Adaptive Moment Estimation.



PS: You are welcome to scan the QR code and follow the public account 「SelfImprovementLab」, which focuses on deep learning, machine learning, and artificial intelligence, and occasionally organizes check-in and mutual-support groups for early rising, reading, exercise, English, and more.
