Today's post is about the optimization of deep learning. What does optimization actually mean here? The following ppt shows us the answer.
1. SGD with Momentum (SGDM)
Just as the name suggests, SGDM combines SGD with momentum. The parameter update process for SGDM is shown in the following ppt. What we should pay attention to, and what helps us better understand SGDM, is that $v^i$ is actually a weighted sum of all the previous gradients, and the more recent a gradient is, the more influence it has on the current momentum.
What is the advantage of adding momentum to SGD? Plain SGD can easily lead us to a local minimum rather than the global minimum. Adding momentum takes the history of the gradients into account, which means, to put it more vividly, that SGDM gives us the ability to ask whether the point we are standing at is only a local minimum.
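To make the update rule concrete, here is a minimal NumPy sketch of one SGDM step; the function name, the toy loss, and the hyperparameter values are just illustrative choices of mine, not anything fixed by the method.

```python
import numpy as np

def sgdm_step(theta, v, grad, lr=0.01, lam=0.9):
    """One SGDM update: the new movement v blends the previous movement
    with the current gradient, so v is a decaying weighted sum of all
    past gradients (recent ones count more)."""
    v = lam * v - lr * grad   # v^i = lambda * v^{i-1} - eta * g^{i-1}
    theta = theta + v         # theta^i = theta^{i-1} + v^i
    return theta, v

# toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, v = sgdm_step(theta, v, grad=2 * theta)
print(theta)  # approaches 0
```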
2. Adagrad
Adagrad was introduced in the last blog post, so I will not explain it again. (I am a little bit lazy, haha.)
3. RMSProp
RMSProp makes a small change to Adagrad's formula. In Adagrad, $v_t$ is the sum of the squares of all past gradients. In RMSProp, however, $v_t = \alpha v_{t-1} + (1-\alpha)(g_{t-1})^2$. We can change the value of $\alpha$ to give $v_{t-1}$ more or less influence on the current step. In practice, $\alpha$ is usually set to a large value, so that a single large $g_{t-1}$ cannot push $\frac{\eta}{\sqrt{v_t}}$ too close to zero.
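As a rough sketch of this formula in code (the parameter names and default values below are my own illustrative choices), one RMSProp step could look like this:

```python
import numpy as np

def rmsprop_step(theta, v, grad, lr=0.001, alpha=0.9, eps=1e-8):
    """One RMSProp update: v is an exponential moving average of squared
    gradients, so old gradients fade away instead of piling up forever
    as they do in Adagrad."""
    v = alpha * v + (1 - alpha) * grad ** 2  # v_t = alpha*v_{t-1} + (1-alpha)*g_{t-1}^2
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v
```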
4. Adam
If we ignore some small differences, Adam can be seen as the combination of SGDM and RMSProp. The small change is in the form of $m_t$, which is called de-biasing. The reason for this change is that $m_t$ starts from zero and is therefore too close to zero at the beginning of the updates.
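Putting the two pieces together, a minimal Adam sketch might look like the following; the defaults follow the commonly used $\beta_1 / \beta_2 / \epsilon$ convention, and the names are again just my own illustration.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: m is the SGDM-style momentum, v is the RMSProp-style
    squared-gradient average, and both are de-biased because they start at
    zero and would otherwise be too small during the first few steps.
    t is the step count, starting from 1."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (RMSProp part)
    m_hat = m / (1 - beta1 ** t)             # de-biasing
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```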