Today's post is about the optimization of deep learning. What does optimization actually mean here? The following ppt shows us the answer.
1. SGD with Momentum (SGDM)
Just as the name suggests, SGDM combines SGD with momentum. The parameter update process for SGDM is shown in the following ppt. What we should pay attention to, and what helps us better understand SGDM, is that $v^i$ is actually a weighted sum of all the previous gradients, and the more recent a gradient is, the more influence it has on the current momentum.
What is the advantage of adding momentum to SGD? Plain SGD can easily lead us to a local minimum rather than the global minimum. Adding momentum takes the history of the gradients into account, which means, to put it more vividly, that SGDM gives us the ability to ask whether the point we are standing at is only a local minimum.
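To make the update rule concrete, here is a minimal NumPy sketch of one SGDM step; the function name, the toy loss, and the hyperparameter values are just illustrative choices of mine, not anything fixed by the method.

```python
import numpy as np

def sgdm_step(theta, v, grad, lr=0.01, lam=0.9):
    """One SGDM update: the new movement v blends the previous movement
    with the current gradient, so v is a decaying weighted sum of all
    past gradients (recent ones count more)."""
    v = lam * v - lr * grad   # v^i = lambda * v^{i-1} - eta * g^{i-1}
    theta = theta + v         # theta^i = theta^{i-1} + v^i
    return theta, v

# toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, v = sgdm_step(theta, v, grad=2 * theta)
print(theta)  # approaches 0
```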
2. Adagrad
Adagrad was introduced in the last blog post, so I will not explain it again. (I am a little bit lazy, haha.)
3. RMSProp
RMSProp makes a small change to Adagrad's formula. In Adagrad, $v_t$ is the sum of the squares of all past gradients. In RMSProp, however, $v_t = \alpha v_{t-1} + (1-\alpha)(g_{t-1})^2$. We can change the value of $\alpha$ to give $v_{t-1}$ more or less influence on the current step. In practice, $\alpha$ is usually set to a large value, so that a single large $g_{t-1}$ cannot push $\frac{\eta}{\sqrt{v_t}}$ too close to zero.
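As a rough sketch of this formula in code (the parameter names and default values below are my own illustrative choices), one RMSProp step could look like this:

```python
import numpy as np

def rmsprop_step(theta, v, grad, lr=0.001, alpha=0.9, eps=1e-8):
    """One RMSProp update: v is an exponential moving average of squared
    gradients, so old gradients fade away instead of piling up forever
    as they do in Adagrad."""
    v = alpha * v + (1 - alpha) * grad ** 2  # v_t = alpha*v_{t-1} + (1-alpha)*g_{t-1}^2
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v
```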
4. Adam
If we ignore some small differences, Adam can be seen as the combination of SGDM and RMSProp. The small change is in the form of $m_t$, which is called de-biasing. The reason for this change is that $m_t$ starts from zero and is therefore too close to zero at the beginning of the updates.
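Putting the two pieces together, a minimal Adam sketch might look like the following; the defaults follow the commonly used $\beta_1 / \beta_2 / \epsilon$ convention, and the names are again just my own illustration.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: m is the SGDM-style momentum, v is the RMSProp-style
    squared-gradient average, and both are de-biased because they start at
    zero and would otherwise be too small during the first few steps.
    t is the step count, starting from 1."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (RMSProp part)
    m_hat = m / (1 - beta1 ** t)             # de-biasing
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```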