I will try to use English to write down the knowledge learned in this class, with the aim of making sure I will not forget this important language tool. (We do not have any English classes this semester, so my worry is, emmmm, reasonable, right?) I hope I can achieve this target. Maybe I will give up one day, haha. If there are any mistakes in my notes, not only in the knowledge but also in the English grammar, please point them out with no hesitation. I will be so thankful for that.
1. Tip 1 for Gradient Descent
The first tip taught by Prof. Li is tuning the learning rate of our gradient descent program. Judging from the following figures, too small a learning rate will make our program run slowly, but too large a learning rate will make it fail to find the minimum point of the curve. When we are using gradient descent to find the minimum point, we can draw a figure, similar to the right figure in the picture below, showing the relationship between the loss and the number of parameter updates. Obviously, this figure will tell us the most suitable learning rate.
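To make this concrete, here is a minimal sketch (my own toy example on the loss $L(w)=w^2$, not code from the slides) of how different learning rates change the loss over parameter updates:

```python
# Toy comparison of learning rates on L(w) = w**2, whose gradient is 2*w.
def gradient_descent(lr, steps=20, w0=5.0):
    """Run plain gradient descent on L(w) = w**2 and record the loss."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2 * w           # dL/dw
        w = w - lr * grad      # the update rule
        losses.append(w ** 2)  # loss after the update
    return losses

small = gradient_descent(lr=0.01)  # converges, but very slowly
good = gradient_descent(lr=0.3)    # converges quickly
large = gradient_descent(lr=1.1)   # overshoots and diverges
```

Plotting these three loss lists against the update index reproduces the kind of diagnostic figure described above.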
1.1 Adaptive Learning Rates
However, it is very hard to find the most suitable value of the learning rate. One popular and simple idea is to reduce the learning rate by some factor every few epochs. There are two targets we want to achieve.
- Learning rates cannot be one-size-fits-all.
- Give different parameters different learning rates.
Prof. Li gives us a more detailed explanation.
To fit in with the above thought, a popular method called Adagrad was proposed.
However, there is a contradiction, which is shown in the following ppt: one part of the formula tells us that a larger gradient leads to a larger step, but the other part tells us the opposite.
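For reference, the Adagrad update on the slide can be written (from the standard definition of Adagrad, with $\eta$ the learning rate and $g^t$ the gradient at step $t$) as:

$$w^{t+1} = w^{t} - \frac{\eta}{\sqrt{\sum_{i=0}^{t}\left(g^{i}\right)^{2}}}\, g^{t}$$

The $g^t$ in the numerator says a larger gradient gives a larger step, while the accumulated squared gradients in the denominator say the opposite, which is exactly the contradiction.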
Comparing different parameters will show us the reason for the above formula. In our common mind, a larger first-order derivative means being farther from the minimum. However, for point a of $w_1$ and point c of $w_2$ (the first derivative at a is obviously smaller than that at c), point c is closer to the minimum. This phenomenon tells us we should take the second derivative into consideration, which is similar to the statistical distance we learn in Multivariate Statistical Analysis. So the formula we use to find the point which is closest to the minimum is:

$$\frac{\text{First Derivative}}{\text{Second Derivative}}$$
In practice, calculating the second derivative is always a difficult task, so we use $\Sigma (g^i)^2$ in its place. Why does this make sense? Looking at the following ppt, we sample some points randomly from different distributions. For the distribution with the larger second derivative, the value of $\sqrt{(\text{first derivative})^2}$ is larger; for the distribution with the smaller second derivative, that value is smaller. In the light of this statement, we can use $\Sigma (g^i)^2$ to take the place of the second derivative.
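This sampling argument can be checked with a tiny sketch (my own toy, assuming quadratic losses $L(w) = c\,w^2$, whose second derivative is $2c$):

```python
# For a sharper quadratic L(w) = c * w**2, the first derivative
# g(w) = 2*c*w is on average larger in magnitude, so the accumulated
# squared first derivatives reflect the size of the second derivative 2*c.
def mean_abs_gradient(c, points):
    return sum(abs(2 * c * w) for w in points) / len(points)

points = [x / 10.0 for x in range(-10, 11)]  # samples around the minimum
sharp = mean_abs_gradient(c=2.0, points=points)  # second derivative is 4
flat = mean_abs_gradient(c=0.5, points=points)   # second derivative is 1
```

The sharper curve gives an average gradient magnitude four times larger, matching the ratio of the second derivatives.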
With all of the above, we can explain the formula of Adagrad successfully.
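As a sketch, the Adagrad update might be implemented like this (my own toy example on $L(w)=w^2$, not code from the course):

```python
import math

# Adagrad on L(w) = w**2: the learning rate for the parameter is divided
# by the root of its accumulated squared gradients, so frequently-large
# gradients automatically get smaller steps.
def adagrad(lr=1.0, steps=50, w0=5.0):
    w = w0
    sum_sq = 0.0
    for _ in range(steps):
        g = 2 * w                          # gradient of w**2
        sum_sq += g ** 2                   # accumulate squared gradients
        w -= lr / math.sqrt(sum_sq) * g    # Adagrad update
    return w
```

Running it, `adagrad()` approaches the minimum at `w = 0`, with the step size shrinking as the squared gradients accumulate.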
2. Tip 2 for Gradient Descent
The second tip for gradient descent taught by Prof. Li is using Stochastic Gradient Descent, usually known as SGD, to make the training faster.
We can also see the differences from the following ppt, which shows the different processes of updating the parameters for GD and SGD.
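A minimal sketch of that difference (my own toy 1-D linear regression $y = w x$, not from the slides): GD sums the gradient over ALL examples for one update, while SGD updates after EACH example, so it performs many more updates per pass over the data.

```python
# Toy data generated with the true w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def gd_update(w, lr=0.01):
    # One GD update: gradient of the total loss sum((w*x - y)**2).
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    return w - lr * grad

def sgd_epoch(w, lr=0.01):
    # One SGD pass: an update per example, so 4 updates while GD does 1.
    for x, y in zip(xs, ys):
        grad = 2 * (w * x - y) * x
        w -= lr * grad
    return w
```

Starting from `w = 0`, both move toward the true value 2, but SGD has already adjusted `w` four times after one pass over the data.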
3. Tip 3 for Gradient Descent
The third tip we learn in the class is feature scaling. What is the meaning of feature scaling? In this class, feature scaling means we want to make different features have the same scale, which is similar to the definition of normalization. So the next question is the reason for this behaviour. The following figure will tell us the reason. For the left part, $x_1$ and $x_2$ have different scales, which leads to the result that a change of $w_2$ will influence the loss in a more obvious way. The red line stands for the path of $w_1$ and $w_2$ in the process of gradient descent, and it contains two main problems: 1) the convergence direction is not straight towards the minimum point, which will make the program waste a lot of time; 2) setting the same learning rate for $w_1$ and $w_2$ is apparently not suitable (if you do not use Adagrad).
The method for feature scaling is quite easy: for each dimension $i$, compute the mean $m_i$ and standard deviation $\sigma_i$ over all examples, and replace each value $x$ with $(x - m_i)/\sigma_i$.
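As a sketch of this recipe (plain Python, with my own helper name `feature_scaling`):

```python
# Scale every feature dimension to mean 0 and standard deviation 1
# by subtracting the column mean and dividing by the column std.
def feature_scaling(data):
    """data: list of examples, each a list of feature values."""
    n_features = len(data[0])
    scaled = [row[:] for row in data]
    for i in range(n_features):
        column = [row[i] for row in data]
        mean = sum(column) / len(column)
        var = sum((v - mean) ** 2 for v in column) / len(column)
        std = var ** 0.5
        for row in scaled:
            row[i] = (row[i] - mean) / std
    return scaled
```

After scaling, every feature has mean 0 and standard deviation 1, so the loss contours become closer to circles and gradient descent heads more directly toward the minimum.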
4. Theory Behind Gradient Descent
Given a random point $\theta^0$, we can draw a small circle with center $\theta^0$ and radius $\epsilon$. Then we can find a new point inside the circle with a smaller loss value and move from $\theta^0$ to this new point $\theta^1$.
So how do we find this new point? Before we explain the reason, we have to introduce the Taylor series, which is shown as follows. (We need to note that the radius of the circle, which corresponds to the learning rate, should be small enough.)
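Concretely, the first-order Taylor expansion of the loss around the current point $(a, b)$ is:

$$L(\theta) \approx L(a,b) + \frac{\partial L(a,b)}{\partial \theta_1}(\theta_1 - a) + \frac{\partial L(a,b)}{\partial \theta_2}(\theta_2 - b)$$

where, in the slide's notation, $s = L(a,b)$, $u = \partial L(a,b)/\partial \theta_1$ and $v = \partial L(a,b)/\partial \theta_2$.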
For $L(\theta)$, the quantities $s$, $u$ and $v$ are constant numbers, vectors, or matrices because their values only depend on the current point $(a,b)$. So the value of $L(\theta)$ is determined by the inner product of the vector $(u,v)$ and the vector $(\theta_1-a,\theta_2-b)$. Finding the $\theta_1$ and $\theta_2$ in the red circle that minimize $L(\theta)$ can be converted to finding a new point $(\theta_1,\theta_2)$ which lies in the red circle and makes the inner product of the vector $(u,v)$ and the vector $(\theta_1-a,\theta_2-b)$ as small as possible. It is easy to find the final answer, right? That is the opposite direction of the vector $(u,v)$, i.e., the opposite direction of the gradient.
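This conclusion can be verified numerically with a small sketch (my own toy, with an assumed gradient $(u,v)=(3,4)$): among points on the circle of radius $\epsilon$ around $(a,b)$, the one minimizing the inner product lies opposite to $(u,v)$.

```python
import math

# Search the circle of radius eps for the displacement (d1, d2) that
# minimizes the inner product u*d1 + v*d2, then compare its direction
# with the gradient direction (u, v) / |(u, v)|.
u, v = 3.0, 4.0
eps = 0.1
angles = [2 * math.pi * k / 360 for k in range(360)]
candidates = [(eps * math.cos(t), eps * math.sin(t)) for t in angles]
best = min(candidates, key=lambda d: u * d[0] + v * d[1])

norm = math.hypot(*best)
step_dir = (best[0] / norm, best[1] / norm)  # direction of the best step
grad_dir = (u / 5.0, v / 5.0)                # (u, v) normalized, |(3,4)| = 5
```

The best step direction comes out (up to the 1-degree sampling resolution) as the exact negation of the gradient direction, which is why gradient descent moves against the gradient.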