I will try to use English to write down the knowledge learned in this class, with the aim of making sure I will not forget this important language tool. (We do not have any English classes this semester, so my worry is, emmmm, reasonable, right?) I hope I can achieve this target. Maybe I will give up one day, haha. If there are any mistakes in my notes, not only in the knowledge but also in the English grammar, please point them out with no hesitation. I will be so thankful for that.
1. Tip 1 for Gradient Descent
The first tip taught by Prof. Li is tuning the learning rate of our gradient descent program. Judging from the following figures, too small a learning rate will make our program run slowly, but too large a learning rate will make it fail to find the minimum point of the curve. When we are using gradient descent to find the minimum point, we can draw a figure, similar to the right figure in the picture below, showing the relationship between the loss and the number of parameter updates. Obviously, this figure will tell us the most suitable learning rate.
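To make this concrete, here is a minimal sketch (my own toy example on the loss $L(w)=w^2$, not code from the slides) of how different learning rates change the loss over parameter updates:

```python
# Toy comparison of learning rates on L(w) = w**2, whose gradient is 2*w.
def gradient_descent(lr, steps=20, w0=5.0):
    """Run plain gradient descent on L(w) = w**2 and record the loss."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2 * w           # dL/dw
        w = w - lr * grad      # the update rule
        losses.append(w ** 2)  # loss after the update
    return losses

small = gradient_descent(lr=0.01)  # converges, but very slowly
good = gradient_descent(lr=0.3)    # converges quickly
large = gradient_descent(lr=1.1)   # overshoots and diverges
```

Plotting these three loss lists against the update index reproduces the kind of diagnostic figure described above.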
1.1 Adaptive Learning Rates
However, it is very hard to find the most suitable value of the learning rate. One popular and simple idea is to reduce the learning rate by some factor every few epochs. There are two targets we want to achieve.
- Learning rates cannot be one-size-fits-all.
- Give different parameters different learning rates.
Prof. Li gives us a more detailed explanation.
To fit in with the above thought, a popular method called Adagrad was proposed.
However, there is a contradiction, which is shown in the following ppt: one part of the formula tells us that a larger gradient leads to a larger step, but the other part tells us the opposite.
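For reference, the Adagrad update on the slide can be written (from the standard definition of Adagrad, with $\eta$ the learning rate and $g^t$ the gradient at step $t$) as:

$$w^{t+1} = w^{t} - \frac{\eta}{\sqrt{\sum_{i=0}^{t}\left(g^{i}\right)^{2}}}\, g^{t}$$

The $g^t$ in the numerator says a larger gradient gives a larger step, while the accumulated squared gradients in the denominator say the opposite, which is exactly the contradiction.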
Comparing different parameters will show us the reason for the above formula. In our common mind, a larger first-order derivative means being farther from the minimum. However, for point a of $w_1$ and point c of $w_2$ (the first derivative at a is obviously smaller than that at c), point c is closer to the minimum. This phenomenon tells us we should take the second derivative into consideration, which is similar to the statistical distance we learn in Multivariate Statistical Analysis. So the formula we use to find the point which is closest to the minimum is:

$$\frac{\text{First Derivative}}{\text{Second Derivative}}$$
In practice, calculating the second derivative is always a difficult task, so we use $\Sigma (g^i)^2$ in its place. Why does this make sense? Looking at the following ppt, we sample some points randomly from different distributions. For the distribution with the larger second derivative, the value of $\sqrt{(\text{first derivative})^2}$ is larger; for the distribution with the smaller second derivative, that value is smaller. In the light of this statement, we can use $\Sigma (g^i)^2$ to take the place of the second derivative.
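This sampling argument can be checked with a tiny sketch (my own toy, assuming quadratic losses $L(w) = c\,w^2$, whose second derivative is $2c$):

```python
# For a sharper quadratic L(w) = c * w**2, the first derivative
# g(w) = 2*c*w is on average larger in magnitude, so the accumulated
# squared first derivatives reflect the size of the second derivative 2*c.
def mean_abs_gradient(c, points):
    return sum(abs(2 * c * w) for w in points) / len(points)

points = [x / 10.0 for x in range(-10, 11)]  # samples around the minimum
sharp = mean_abs_gradient(c=2.0, points=points)  # second derivative is 4
flat = mean_abs_gradient(c=0.5, points=points)   # second derivative is 1
```

The sharper curve gives an average gradient magnitude four times larger, matching the ratio of the second derivatives.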
With all of the above, we can explain the formula of Adagrad successfully.
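As a sketch, the Adagrad update might be implemented like this (my own toy example on $L(w)=w^2$, not code from the course):

```python
import math

# Adagrad on L(w) = w**2: the learning rate for the parameter is divided
# by the root of its accumulated squared gradients, so frequently-large
# gradients automatically get smaller steps.
def adagrad(lr=1.0, steps=50, w0=5.0):
    w = w0
    sum_sq = 0.0
    for _ in range(steps):
        g = 2 * w                          # gradient of w**2
        sum_sq += g ** 2                   # accumulate squared gradients
        w -= lr / math.sqrt(sum_sq) * g    # Adagrad update
    return w
```

Running it, `adagrad()` approaches the minimum at `w = 0`, with the step size shrinking as the squared gradients accumulate.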
2. Tip 2 for Gradient Descent
The second tip for gradient descent taught by Prof. Li is using Stochastic Gradient Descent, usually known as SGD, to make the training faster.
We can also see the differences from the following ppt, which shows the different processes of updating the parameters for GD and SGD.
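A minimal sketch of that difference (my own toy 1-D linear regression $y = w x$, not from the slides): GD sums the gradient over ALL examples for one update, while SGD updates after EACH example, so it performs many more updates per pass over the data.

```python
# Toy data generated with the true w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def gd_update(w, lr=0.01):
    # One GD update: gradient of the total loss sum((w*x - y)**2).
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    return w - lr * grad

def sgd_epoch(w, lr=0.01):
    # One SGD pass: an update per example, so 4 updates while GD does 1.
    for x, y in zip(xs, ys):
        grad = 2 * (w * x - y) * x
        w -= lr * grad
    return w
```

Starting from `w = 0`, both move toward the true value 2, but SGD has already adjusted `w` four times after one pass over the data.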
3. Tip 3 for Gradient Descent
The third tip we learn in the class is feature scaling. What is the meaning of feature scaling? In this class, feature scaling means we want to make different features have the same scale, which is similar to the definition of normalization. So the next question is the reason for this behaviour. The following figure will tell us the reason. For the left part, $x_1$ and $x_2$ have different scales, which leads to the result that a change of $w_2$ will influence the loss in a more obvious way. The red line stands for the path of $w_1$ and $w_2$ in the process of gradient descent, and it contains two main problems: 1) the convergence direction is not straight towards the minimum point, which will make the program waste a lot of time; 2) setting the same learning rate for $w_1$ and $w_2$ is apparently not suitable (if you do not use Adagrad).
The method for feature scaling is quite easy: for each dimension $i$, compute the mean $m_i$ and standard deviation $\sigma_i$ over all examples, and replace each value $x$ with $(x - m_i)/\sigma_i$.
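As a sketch of this recipe (plain Python, with my own helper name `feature_scaling`):

```python
# Scale every feature dimension to mean 0 and standard deviation 1
# by subtracting the column mean and dividing by the column std.
def feature_scaling(data):
    """data: list of examples, each a list of feature values."""
    n_features = len(data[0])
    scaled = [row[:] for row in data]
    for i in range(n_features):
        column = [row[i] for row in data]
        mean = sum(column) / len(column)
        var = sum((v - mean) ** 2 for v in column) / len(column)
        std = var ** 0.5
        for row in scaled:
            row[i] = (row[i] - mean) / std
    return scaled
```

After scaling, every feature has mean 0 and standard deviation 1, so the loss contours become closer to circles and gradient descent heads more directly toward the minimum.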
4. Theory Behind Gradient Descent
Given a random point $\theta^0$, we can draw a small circle with center $\theta^0$ and radius $\epsilon$. Then we can find a new point inside the circle with a smaller loss value and move from $\theta^0$ to this new point $\theta^1$.
So how do we find this new point? Before we explain the reason, we have to introduce the Taylor series, which is shown as follows. (We need to note that the radius of the circle, which corresponds to the learning rate, should be small enough.)
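Concretely, the first-order Taylor expansion of the loss around the current point $(a, b)$ is:

$$L(\theta) \approx L(a,b) + \frac{\partial L(a,b)}{\partial \theta_1}(\theta_1 - a) + \frac{\partial L(a,b)}{\partial \theta_2}(\theta_2 - b)$$

where, in the slide's notation, $s = L(a,b)$, $u = \partial L(a,b)/\partial \theta_1$ and $v = \partial L(a,b)/\partial \theta_2$.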
For $L(\theta)$, the quantities $s$, $u$ and $v$ are constant numbers, vectors, or matrices because their values only depend on the current point $(a,b)$. So the value of $L(\theta)$ is determined by the inner product of the vector $(u,v)$ and the vector $(\theta_1-a,\theta_2-b)$. Finding the $\theta_1$ and $\theta_2$ in the red circle that minimize $L(\theta)$ can be converted to finding a new point $(\theta_1,\theta_2)$ which lies in the red circle and makes the inner product of the vector $(u,v)$ and the vector $(\theta_1-a,\theta_2-b)$ as small as possible. It is easy to find the final answer, right? That is the opposite direction of the vector $(u,v)$, i.e., the opposite direction of the gradient.
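This conclusion can be verified numerically with a small sketch (my own toy, with an assumed gradient $(u,v)=(3,4)$): among points on the circle of radius $\epsilon$ around $(a,b)$, the one minimizing the inner product lies opposite to $(u,v)$.

```python
import math

# Search the circle of radius eps for the displacement (d1, d2) that
# minimizes the inner product u*d1 + v*d2, then compare its direction
# with the gradient direction (u, v) / |(u, v)|.
u, v = 3.0, 4.0
eps = 0.1
angles = [2 * math.pi * k / 360 for k in range(360)]
candidates = [(eps * math.cos(t), eps * math.sin(t)) for t in angles]
best = min(candidates, key=lambda d: u * d[0] + v * d[1])

norm = math.hypot(*best)
step_dir = (best[0] / norm, best[1] / norm)  # direction of the best step
grad_dir = (u / 5.0, v / 5.0)                # (u, v) normalized, |(3,4)| = 5
```

The best step direction comes out (up to the 1-degree sampling resolution) as the exact negation of the gradient direction, which is why gradient descent moves against the gradient.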