Notes for the Deep Learning Lessons of Prof. Hung-yi Lee (1)

I will try to write down the knowledge learned in this class in English, simply to make sure I do not forget this important language tool. (We do not have any English classes this semester, so my worry is, emmmm, reasonable, right?) I hope I can achieve this goal. Maybe I will give up one day, haha. If there is anything wrong in my notes, not only in the knowledge but also in the English grammar, please point it out without hesitation. I will be very thankful for that.

1. Tip 1 for Gradient Descent

The first tip taught by Prof. Lee is tuning the learning rate of our gradient descent program. Judging from the following figures, too small a learning rate will make our program run very slowly, while too large a learning rate will make our program fail to find the minimum point of the curve. When we are using gradient descent to find the minimum point, we can draw a figure, similar to the right part of the following picture, showing the relationship between the loss and the number of parameter updates. This picture will clearly tell us which learning rate is the most suitable.

(figure)
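To produce the curve on the right yourself, you can simply record the loss after every update. Here is a minimal sketch with a toy loss $L(w) = (w-3)^2$ (the learning rates are arbitrary examples of "too small", "just right" and "too large"):

```python
import numpy as np

def loss_curve(eta, steps=30):
    """Record the loss L(w) = (w - 3)^2 after each gradient descent update."""
    w, losses = 0.0, []
    for _ in range(steps):
        w -= eta * 2 * (w - 3)        # gradient of (w - 3)^2 is 2 * (w - 3)
        losses.append((w - 3) ** 2)
    return losses

# Compare a very small, a reasonable, and a too-large learning rate.
for eta in (0.01, 0.3, 1.1):
    print(eta, loss_curve(eta)[-1])   # tiny eta: still far away; eta = 1.1: diverges
```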

1.1 Adaptive Learning Rates

However, it is very hard to find the most suitable value of the learning rate. One popular and simple idea is to reduce the learning rate by some factor every few epochs. There are two goals we want to achieve (a small sketch of the decay idea is given after the list below).

  1. Learning rates cannot be one-size-fits-all.
  2. Different parameters should be given different learning rates.
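As a small illustration of the decay idea, here is a minimal sketch, assuming the schedule $\eta^t = \eta / \sqrt{t+1}$ (the same time decay that appears later inside Adagrad); the toy loss and function names are my own:

```python
import numpy as np

def decayed_lr(eta0, t):
    """Time-decayed learning rate: eta^t = eta0 / sqrt(t + 1)."""
    return eta0 / np.sqrt(t + 1)

# Vanilla gradient descent with a decaying learning rate on the toy loss
# L(w) = w^2, whose gradient is 2 * w.
w, eta0 = 4.0, 0.3
for t in range(100):
    grad = 2 * w
    w -= decayed_lr(eta0, t) * grad

print(w)  # close to the minimum at w = 0
```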

Prof. Lee gives us a more detailed explanation.
(figure)
To meet the above requirements, a popular method called Adagrad was proposed.

(figure)

(figures)
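If I read the slides correctly, the Adagrad update for a single parameter is $w^{t+1} \leftarrow w^{t} - \frac{\eta^{t}}{\sigma^{t}} g^{t}$, and after the time-decay factors cancel it becomes $w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^{i})^{2}}}\, g^{t}$. A minimal sketch of this update (the toy loss and names are my own):

```python
import numpy as np

def adagrad_step(w, grad, grad_history, eta=1.0, eps=1e-8):
    """One Adagrad update: w <- w - eta / sqrt(sum of squared past gradients) * grad."""
    grad_history.append(grad)
    sigma = np.sqrt(np.sum(np.square(grad_history))) + eps
    return w - eta / sigma * grad

# Toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, history = 0.0, []
for _ in range(200):
    grad = 2 * (w - 3)
    w = adagrad_step(w, grad, history)

print(w)  # close to the minimum at w = 3
```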
However, there is a contradiction, which is shown on the following slide: one part of the formula tells us that a larger gradient leads to a larger step, while the other part tells us the opposite.
(figure)
Comparing different parameters shows us the reason for the above formula. In our common intuition, a larger first derivative means being farther from the minimum. However, comparing point $a$ on $w_1$ with point $c$ on $w_2$ (the first derivative at $a$ is obviously smaller than that at $c$), point $c$ is actually closer to its minimum. This phenomenon tells us that we should also take the second derivative into consideration, which is similar to the statistical distance we learned in Multivariate Statistical Analysis. So the formula we use to measure how close a point is to the minimum is:
$$\frac{\text{First Derivative}}{\text{Second Derivative}}$$
(figure)
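To see why this ratio is the right measure of the distance to the minimum, here is a quick derivation for a one-dimensional quadratic (my own supplement to the slide): for $y = ax^2 + bx + c$ with $a > 0$, the minimum sits at $x^{*} = -\frac{b}{2a}$, so the best step from a point $x_0$ is
$$\left| x_0 + \frac{b}{2a} \right| = \frac{|2ax_0 + b|}{2a} = \frac{|y'(x_0)|}{y''(x_0)},$$
which is exactly the first derivative divided by the second derivative.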
In practice, calculating the second derivative is often a difficult task, so we use $\sqrt{\sum_i (g^i)^2}$ in its place. Why does this make sense? Looking at the following slide, we sample a number of points randomly from different distributions. For the distribution with the larger second derivative, the sampled first derivatives are larger on average, so $\sqrt{\sum_i (g^i)^2}$ is larger; for the distribution with the smaller second derivative, it is smaller. In light of this observation, we can use $\sqrt{\sum_i (g^i)^2}$ of the past gradients to take the place of the second derivative.
(figure)
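A quick numerical check of this argument (my own toy example, not from the slide): sample points from two quadratics with different curvatures and compare the root mean square of their first derivatives.

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_gradient(curvature, n_samples=10000):
    """Root mean square of the first derivative of y = curvature * x^2,
    evaluated at points sampled around the minimum."""
    x = rng.uniform(-1.0, 1.0, size=n_samples)
    grad = 2 * curvature * x            # first derivative of curvature * x^2
    return np.sqrt(np.mean(grad ** 2))

# The sharper curve (larger second derivative) yields larger gradients on average.
print(rms_gradient(curvature=1.0))      # smaller second derivative -> smaller value
print(rms_gradient(curvature=4.0))      # larger second derivative  -> larger value
```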
With all of the above, we can explain the formula of Adagrad successfully.

2. Tip 2 for Gradient Descent

The second tip for gradient descent taught by Prof. Lee is using Stochastic Gradient Descent, usually known as SGD, to make the training faster.

(figure)

We can also see the difference from the following slide, which shows the different processes of updating the parameters for GD and SGD.
(figure)
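A minimal sketch of that contrast on a toy linear regression problem (the data and names are my own, not from the slide): gradient descent computes the gradient over all examples before each update, while SGD updates after every single example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + rng.normal(scale=0.1, size=100)   # toy data: y ~ 3x + noise

def gd(epochs=50, eta=0.1):
    """Gradient descent: one update per pass over the whole dataset."""
    w = 0.0
    for _ in range(epochs):
        grad = np.mean(2 * (w * x - y) * x)   # gradient of the mean squared error
        w -= eta * grad
    return w

def sgd(epochs=50, eta=0.1):
    """Stochastic gradient descent: one update per training example."""
    w = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            grad = 2 * (w * x[i] - y[i]) * x[i]   # gradient on a single example
            w -= eta * grad
    return w

print(gd(), sgd())   # both approach w = 3, but SGD makes far more updates per epoch
```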

3. Tip 3 for Gradient Descent

The third tip we learn in the class is feature scaling. What does feature scaling mean? In this class, feature scaling means making different features have the same scale, which is similar to the definition of normalization. So the next question is why we do this, and the following figure tells us the reason. In the left part, $x_1$ and $x_2$ have different scales, which means that a change of $w_2$ influences the loss much more strongly than a change of $w_1$. The red line stands for the path of $w_1$ and $w_2$ during gradient descent, and it shows two main problems: 1) the update direction does not point straight towards the minimum, which makes the program waste a lot of time; 2) giving $w_1$ and $w_2$ the same learning rate is apparently not suitable (if you do not use Adagrad).
(figure)
The method for feature scaling is quite easy, as shown below:
(figure)
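If I remember the slide correctly, the recipe is the usual standardization: for each dimension $i$, compute the mean $m_i$ and the standard deviation $\sigma_i$ over all training examples and replace $x_i^r$ by $\frac{x_i^r - m_i}{\sigma_i}$. A minimal sketch, assuming the data is stored as a NumPy array with one example per row:

```python
import numpy as np

def feature_scaling(X, eps=1e-8):
    """Standardize each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)          # m_i: mean of the i-th dimension
    std = X.std(axis=0) + eps      # sigma_i: standard deviation of the i-th dimension
    return (X - mean) / std

# Two features on very different scales, as in the left part of the figure above.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
print(feature_scaling(X))          # every column now has mean 0 and std 1
```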

4. Theory Behind Gradient Descent

Given a random point $\theta^0$, we can draw a small circle with center $\theta^0$ and radius $\epsilon$. Then we find a new point inside the circle with a smaller loss value and move $\theta^0$ to this new point $\theta^1$.
(figure)
So how do we find this new point? Before explaining that, we have to introduce the Taylor series, which is shown as follows. (We need to note that the radius of the circle, which corresponds to the learning rate, should be small enough.)
(figure)
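Writing out the slide's notation for reference (so that the $s$, $u$ and $v$ in the next paragraph are explicit), the first-order Taylor approximation of the loss around the current point $(a, b)$ is
$$L(\theta) \approx s + u\,(\theta_1 - a) + v\,(\theta_2 - b), \qquad s = L(a, b),\quad u = \left.\frac{\partial L}{\partial \theta_1}\right|_{(a,b)},\quad v = \left.\frac{\partial L}{\partial \theta_2}\right|_{(a,b)}.$$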
For $L(\theta)$, the quantities $s$, $u$ and $v$ are constants (numbers or vectors) because their values only depend on the current point $(a, b)$. So the value of $L(\theta)$ is determined by the inner product of the vector $(u, v)$ and the vector $(\theta_1 - a, \theta_2 - b)$. Finding the $\theta_1$ and $\theta_2$ inside the red circle that minimize $L(\theta)$ can therefore be converted into finding a new point $(\theta_1, \theta_2)$ which lies inside the red circle and makes this inner product as small as possible. It is easy to find the final answer, right? We should move in the opposite direction of the vector $(u, v)$, i.e. the opposite direction of the gradient.
(figure)
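To complete the argument: picking the point on the red circle in the opposite direction of $(u, v)$ means setting
$$\begin{pmatrix}\theta_1 - a \\ \theta_2 - b\end{pmatrix} = -\eta \begin{pmatrix} u \\ v \end{pmatrix}
\;\Longrightarrow\;
\begin{pmatrix}\theta_1 \\ \theta_2\end{pmatrix} = \begin{pmatrix}a \\ b\end{pmatrix} - \eta \begin{pmatrix} \partial L/\partial \theta_1 \\ \partial L/\partial \theta_2 \end{pmatrix},$$
which is exactly the gradient descent update. It is only guaranteed to decrease the loss when $\eta$ (the radius of the red circle) is small enough for the Taylor approximation to hold.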
