Coursera | Andrew Ng (02-week-2-2.8) - Adam Optimization Algorithm

This series only adds personal study notes and supplementary derivations to some points of the original course; corrections and criticism are welcome. After studying Andrew Ng's course, I organized it into text to make later review easier. Since I have been working on my English, the series is primarily in English, and I suggest readers also rely mainly on the English with the Chinese as support, as preparation for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom (网易云课堂)


Please credit the author and source when reposting: ZJ, WeChat public account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79105995


2.8 Adam Optimization Algorithm

(Subtitle source: NetEase Cloud Classroom 网易云课堂)


During the history of deep learning, many researchers, including some very well-known researchers, sometimes proposed optimization algorithms and showed that they worked well on a few problems. But those optimization algorithms were subsequently shown not to really generalize that well to the wide range of neural networks you might want to train. So over time, I think the deep learning community actually developed some amount of skepticism about new optimization algorithms. And a lot of people felt that gradient descent with Momentum really works well, and it was difficult to propose things that work much better.


So RMSprop and the Adam optimization algorithm, which we'll talk about in this video, are among those rare algorithms that have really stood up, and have been shown to work well across a wide range of deep learning architectures. So this is one of the algorithms that I wouldn't hesitate to recommend you try, because many people have tried it and seen it work well on many problems. And the Adam optimization algorithm is basically taking Momentum and RMSprop and putting them together. So let's see how that works. To implement Adam you would initialize $V_{dW}=0$, $S_{dW}=0$, and similarly $V_{db}=0$, $S_{db}=0$. And then on iteration t, you would compute the derivatives: compute dW, db using the current mini-batch. So usually, you do this with mini-batch gradient descent. And then you do the Momentum exponentially weighted average, so $V_{dW}=\beta_1 V_{dW}+(1-\beta_1)dW$. But now I'm going to call this $\beta_1$ to distinguish it from the hyperparameter $\beta_2$ we'll use for the RMSprop portion of this. So this is exactly what we had when we were implementing Momentum, except it's now called hyperparameter $\beta_1$ instead of $\beta$. And similarly, you have $V_{db}=\beta_1 V_{db}+(1-\beta_1)db$. And then you do the RMSprop update as well. So now you have a different hyperparameter $\beta_2$: $S_{dW}=\beta_2 S_{dW}+(1-\beta_2)dW^2$. Again, the squaring there is element-wise squaring of your derivatives dW. And then $S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2$. So this is the Momentum-like update with hyperparameter $\beta_1$, and this is the RMSprop-like update with hyperparameter $\beta_2$.
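To make these two running averages concrete, here is a minimal numpy sketch for a single weight matrix (the shapes and variable names are illustrative choices, not the course's code; the $db$ terms follow exactly the same pattern):

```python
import numpy as np

# Illustrative shapes for one layer's weight matrix (not from the lecture).
W = np.random.randn(5, 3)
dW = np.random.randn(5, 3)           # gradient computed on the current mini-batch

V_dW = np.zeros_like(W)              # Momentum-like average of dW    (uses beta1)
S_dW = np.zeros_like(W)              # RMSprop-like average of dW**2  (uses beta2)

beta1, beta2 = 0.9, 0.999
V_dW = beta1 * V_dW + (1 - beta1) * dW             # first-moment estimate
S_dW = beta2 * S_dW + (1 - beta2) * np.square(dW)  # second moment, element-wise square
```

In a real training loop these two averages would persist and keep being updated across iterations rather than being recomputed from zero each time.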


In the typical implementation of Adam, you do implement bias correction. So you're going to have $V^{corrected}$, where corrected means after bias correction: $V^{corrected}_{dW}=V_{dW}/(1-\beta_1^t)$ if you've done t iterations. And similarly, $V^{corrected}_{db}=V_{db}/(1-\beta_1^t)$. And then you implement this bias correction on S as well, so $S^{corrected}_{dW}=S_{dW}/(1-\beta_2^t)$ and $S^{corrected}_{db}=S_{db}/(1-\beta_2^t)$. Finally, you perform the update. So W gets updated as W minus alpha times: if you were just implementing Momentum you'd use $V_{dW}$, or maybe $V^{corrected}_{dW}$, but now we add in the RMSprop portion of this, so we also divide by the square root of $S^{corrected}_{dW}$ plus epsilon, giving $W:=W-\alpha\frac{V^{corrected}_{dW}}{\sqrt{S^{corrected}_{dW}}+\varepsilon}$. And similarly, b gets updated with a similar formula, $b:=b-\alpha\frac{V^{corrected}_{db}}{\sqrt{S^{corrected}_{db}}+\varepsilon}$. And so, this algorithm combines the effect of gradient descent with Momentum together with gradient descent with RMSprop. And this is a commonly used learning algorithm that has proven to be very effective for many different neural networks with a very wide variety of architectures.
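Putting the pieces above together, here is a sketch of one complete Adam iteration. The function name, the dict bookkeeping for V and S, and the default values are my own illustrative choices rather than the course's notation:

```python
import numpy as np

def adam_update(W, b, dW, db, V, S, t,
                alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step for parameters W, b given mini-batch gradients dW, db.

    V and S are dicts holding the running averages, e.g.
    V = {"dW": np.zeros_like(W), "db": np.zeros_like(b)} (same for S),
    and t is the 1-based iteration count used for bias correction.
    """
    # Momentum-like exponentially weighted average of the gradients (beta1).
    V["dW"] = beta1 * V["dW"] + (1 - beta1) * dW
    V["db"] = beta1 * V["db"] + (1 - beta1) * db

    # RMSprop-like exponentially weighted average of the squared gradients (beta2).
    S["dW"] = beta2 * S["dW"] + (1 - beta2) * np.square(dW)
    S["db"] = beta2 * S["db"] + (1 - beta2) * np.square(db)

    # Bias correction, which matters mostly during the first iterations.
    V_dW_corr = V["dW"] / (1 - beta1 ** t)
    V_db_corr = V["db"] / (1 - beta1 ** t)
    S_dW_corr = S["dW"] / (1 - beta2 ** t)
    S_db_corr = S["db"] / (1 - beta2 ** t)

    # Parameter update: Momentum term in the numerator, RMSprop term in the denominator.
    W = W - alpha * V_dW_corr / (np.sqrt(S_dW_corr) + epsilon)
    b = b - alpha * V_db_corr / (np.sqrt(S_db_corr) + epsilon)
    return W, b
```

You would call this once per mini-batch, incrementing t each time, with V and S initialized to zeros before training starts.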


So, this algorithm has a number of hyperparameters. The learning rate hyperparameter alpha is still important and usually needs to be tuned, so you just have to try a range of values and see what works. A common choice, really the default choice, for $\beta_1$ is 0.9. So this is the moving weighted average of dW, right? This is the Momentum-like term. For the hyperparameter $\beta_2$, the authors of the Adam paper, the inventors of the Adam algorithm, recommend 0.999. Again, this is computing the moving weighted average of $dW^2$ as well as $db^2$. And then for epsilon, the choice of epsilon doesn't matter very much, but the authors of the Adam paper recommend $10^{-8}$. You really don't need to set this parameter, and it doesn't affect performance much at all. But when implementing Adam, what people usually do is just use the default values for $\beta_1$ and $\beta_2$ as well as epsilon. I don't think anyone ever really tunes epsilon. And then, try a range of values of alpha to see what works best. You could also tune $\beta_1$ and $\beta_2$, but it's not done that often among the practitioners I know.
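As a practical aside that goes slightly beyond the lecture: most frameworks ship with exactly these defaults, so in day-to-day use you normally leave $\beta_1$, $\beta_2$ and $\varepsilon$ alone and only search over the learning rate. For example, a sketch with PyTorch's built-in Adam optimizer (assuming PyTorch is installed; the tiny model is just a placeholder):

```python
import torch

model = torch.nn.Linear(10, 1)   # any model; a one-layer placeholder here

# betas=(beta1, beta2) and eps already default to the paper's recommendations;
# lr (the alpha of this lecture) is the one value you would normally tune.
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,             # try a range, e.g. 1e-4 up to 1e-1
                             betas=(0.9, 0.999),
                             eps=1e-8)
```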


So, where does the term "Adam" come from? Adam stands for Adaptive Moment Estimation. So $\beta_1$ is computing the mean of the derivatives; this is called the first moment. And $\beta_2$ is used to compute the exponentially weighted average of the squares, and that's called the second moment. So that gives rise to the name adaptive moment estimation. But everyone just calls it the Adam optimization algorithm. And, by the way, one of my long-term friends and collaborators is called Adam Coates. As far as I know, this algorithm doesn't have anything to do with him, except for the fact that I think he uses it sometimes. But sometimes I get asked that question, so just in case you're wondering. So, that's it for the Adam optimization algorithm. With it, I think you can really train your neural networks much more quickly. But before we wrap up for this week, let's keep talking about hyperparameter tuning, as well as gain some more intuition about what the optimization problem for neural networks looks like. In the next video, we'll talk about learning rate decay.



Summary of key points:

Adam optimization algorithm

The basic idea of the Adam optimization algorithm is to combine Momentum and RMSprop into a single optimizer that works well across many different deep learning architectures.

Algorithm implementation

  • Initialize: $V_{dW}=0$, $S_{dW}=0$, $V_{db}=0$, $S_{db}=0$
  • On iteration $t$:
    • Compute $dW$, $db$ on the current mini-batch
    • $V_{dW}=\beta_1 V_{dW}+(1-\beta_1)dW$,  $V_{db}=\beta_1 V_{db}+(1-\beta_1)db$
    • $S_{dW}=\beta_2 S_{dW}+(1-\beta_2)(dW)^2$,  $S_{db}=\beta_2 S_{db}+(1-\beta_2)(db)^2$
    • $V^{corrected}_{dW}=V_{dW}/(1-\beta_1^t)$,  $V^{corrected}_{db}=V_{db}/(1-\beta_1^t)$
    • $S^{corrected}_{dW}=S_{dW}/(1-\beta_2^t)$,  $S^{corrected}_{db}=S_{db}/(1-\beta_2^t)$
    • $W:=W-\alpha\dfrac{V^{corrected}_{dW}}{\sqrt{S^{corrected}_{dW}}+\varepsilon}$,  $b:=b-\alpha\dfrac{V^{corrected}_{db}}{\sqrt{S^{corrected}_{db}}+\varepsilon}$
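As a quick sanity check of the bias-correction step, consider the very first iteration with $\beta_1=0.9$:

$$V_{dW} = 0.9 \cdot 0 + 0.1\,dW = 0.1\,dW, \qquad V^{corrected}_{dW} = \frac{0.1\,dW}{1-0.9^1} = dW,$$

so the corrected estimate is already on the scale of the actual gradient instead of being biased toward the zero initialization; as $t$ grows, $1-\beta_1^t \to 1$ and the correction fades away.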

Choosing the hyperparameters

  • $\alpha$: needs to be tuned;
  • $\beta_1$: common default is 0.9 (the weighted average of $dW$);
  • $\beta_2$: recommended value is 0.999 (the weighted average of $dW^2$);
  • $\varepsilon$: recommended value is $10^{-8}$.

Adam stands for Adaptive Moment Estimation.



PS: You are welcome to scan the QR code and follow the public account 「SelfImprovementLab」, which focuses on deep learning, machine learning, and artificial intelligence, and occasionally organizes check-in and mutual-support groups for early rising, reading, exercise, English, and more.
