# Hung-yi Lee ML Notes 3: Gradient Descent


Gradient descent updates all parameters simultaneously. With two parameters $\theta_1$, $\theta_2$:

$$\theta^{0} = \begin{bmatrix} \theta_{1}^{0} \\ \theta_{2}^{0} \\ \end{bmatrix}$$

$$\theta^{1} = \theta^{0} - \eta\nabla L\left( \theta^{0} \right)$$

$$\nabla L\left( \theta \right) = \begin{bmatrix} {{\partial L\left( \theta \right)}/{\partial\theta_{1}}} \\ {{\partial L\left( \theta \right)}/{\partial\theta_{2}}} \\ \end{bmatrix}$$

$$\begin{bmatrix} \theta_{1}^{1} \\ \theta_{2}^{1} \\ \end{bmatrix} = \begin{bmatrix} \theta_{1}^{0} \\ \theta_{2}^{0} \\ \end{bmatrix} - \eta\begin{bmatrix} {{\partial L\left( \theta^{0} \right)}/{\partial\theta_{1}}} \\ {{\partial L\left( \theta^{0} \right)}/{\partial\theta_{2}}} \\ \end{bmatrix}$$

# Carefully Tune the Learning Rate

Tip 1: Tuning your learning rates

## Automatically Adjusting the Learning Rate

Here w is a single parameter (not the whole parameter set). The basic idea is to let the learning rate η decay with the iteration number t (time-dependent); g is the gradient (partial derivative):

$$\eta^{t} = \frac{\eta}{\sqrt{t + 1}}$$

$$g^{t} = \frac{\partial L\left( \theta^{t} \right)}{\partial w}$$

$$w^{t + 1}\leftarrow w^{t} - \eta^{t}g^{t}$$

Adagrad goes further and also divides by $\sigma^t$, the root mean square of all past gradients of the parameter:

$$w^{t + 1}\leftarrow w^{t} - \frac{\eta^{t}}{\sigma^{t}}g^{t}$$

$$w^{1}\leftarrow w^{0} - \frac{\eta^{0}}{\sigma^{0}}g^{0},\qquad \sigma^{0} = \sqrt{\left( g^{0} \right)^{2}}$$

$$\sigma^{1} = \sqrt{\frac{1}{2}\left\lbrack {\left\lbrack g^{0} \right\rbrack^{2} + \left\lbrack g^{1} \right\rbrack^{2}} \right\rbrack}$$

$$\sigma^{t} = \sqrt{\frac{1}{t + 1}{\sum_{i = 0}^{t}\left( g^{i} \right)^{2}}}$$

Substituting $\eta^t$ and $\sigma^t$, the $\sqrt{t+1}$ factors cancel, leaving:

$$w^{t + 1}\leftarrow w^{t} - \frac{\eta}{\sqrt{\sum_{i = 0}^{t}\left( g^{i} \right)^{2}}}g^{t}$$
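The Adagrad update above can be sketched in a few lines of NumPy; the toy loss $L(w)=w^2$, the initial point, and the hyperparameter values below are illustrative assumptions:

```python
import numpy as np

def adagrad(grad_fn, w0, eta=1.0, steps=100, eps=1e-8):
    """Adagrad on a single parameter w: divide the learning rate by
    the root of the accumulated squared gradients of all past steps."""
    w = w0
    acc = 0.0  # running sum of (g^i)^2 for i = 0..t
    for _ in range(steps):
        g = grad_fn(w)
        acc += g ** 2
        w -= eta / (np.sqrt(acc) + eps) * g
    return w

# Toy example: L(w) = w^2, so dL/dw = 2w, with the minimum at w = 0.
w_star = adagrad(lambda w: 2 * w, w0=3.0)
```

The step sizes shrink automatically as gradients accumulate, which is why a comparatively large base η is tolerable.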

With multiple parameters, gradient magnitude alone is misleading for comparing distances to the minimum: a point c can be closer than a point a even though

- at a the first derivative (gradient) is smaller, but so is the second derivative;
- at c the first derivative (gradient) is larger, but so is the second derivative.

The best step size is proportional to |first derivative| / second derivative, and Adagrad's denominator

$$\sqrt{\sum_{i = 0}^{t}\left( g^{i} \right)^{2}}$$

acts as a cheap estimate of the second derivative built only from first-order information.

# Stochastic Gradient Descent (SGD)

The loss sums the error over all training examples, after which one gradient-descent update is made. For linear regression the loss is:

$$L = {\sum_{n}\left( {{\hat{y}}^{n} - \left( {b + ~{\sum{w_{i}x_{i}^{n}}}} \right)} \right)^{2}}$$

Stochastic gradient descent is different: each update draws just one example (in order or at random; random helps), and both the loss and the parameter update consider only that single example.

$$L^{n} = \left( {{\hat{y}}^{n} - \left( {b + ~{\sum{w_{i}x_{i}^{n}}}} \right)} \right)^{2}$$

$$\theta^{i} = \theta^{i - 1} - \eta\nabla L^{n}\left( \theta^{i - 1} \right)$$
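A minimal sketch of this single-example update for the linear-regression loss above; the toy data from y = 2x + 1, the learning rate, and the iteration count are assumptions for illustration:

```python
import random

# Noiseless toy data generated from y = 2x + 1.
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b, eta = 0.0, 0.0, 0.01
random.seed(0)
for _ in range(2000):
    x, y = random.choice(data)   # draw a single example
    err = y - (b + w * x)        # residual of this example only
    w += eta * 2 * err * x       # -d/dw of (y - (b + w x))^2
    b += eta * 2 * err           # -d/db of the same single-example loss
```

Each individual update is noisy, but over many draws (w, b) approaches (2, 1).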

# Feature Normalization

Feature Scaling makes the value ranges of different input dimensions the same, so that no single dimension dominates the shape of the loss surface.

## Feature Scaling Method

For each dimension i, compute the mean $m_i$ and standard deviation $\sigma_i$ over all examples r, then standardize:

$$x_{i}^{r}\leftarrow\frac{x_{i}^{r} - m_{i}}{\sigma_{i}}$$

After scaling, every dimension has mean 0 and variance 1.
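The standardization formula in NumPy; the small matrix below is an assumed example with two dimensions of very different scales:

```python
import numpy as np

# Rows are examples r, columns are feature dimensions i.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
m = X.mean(axis=0)          # m_i: mean of each dimension
sigma = X.std(axis=0)       # sigma_i: standard deviation of each dimension
X_scaled = (X - m) / sigma  # every dimension now has mean 0 and std 1
```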

# Why Gradient Descent Works

Taylor series: around $x_0$, a sufficiently smooth $h(x)$ can be expanded as

$$h\left( x \right) = {\sum_{k = 0}^{\infty}\frac{h^{(k)}\left( x_{0} \right)}{k!}}\left( {x - x_{0}} \right)^{k} \\= h\left( x_{0} \right) + h^{'}\left( x_{0} \right)\left( {x - x_{0}} \right) + \frac{h^{''}\left( x_{0} \right)}{2!}\left( {x - x_{0}} \right)^{2} + \ldots$$

When $x$ is close to $x_0$, keeping only the first-order term suffices:

$$h(x) \approx h\left( x_{0} \right) + h^{'}\left( x_{0} \right)\left( {x - x_{0}} \right)$$

The two-variable version:

$$h({x,y}) \approx h\left( {x_{0},y_{0}} \right) + \frac{\partial h\left( {x_{0},y_{0}} \right)}{\partial x}\left( {x - x_{0}} \right) + \frac{\partial h\left( {x_{0},y_{0}} \right)}{\partial y}\left( {y - y_{0}} \right)$$

Applying this to the loss around the current point $(a, b)$; the approximation holds while $(\theta_1, \theta_2)$ stays inside a small circle centered at $(a, b)$:

$$L\left( \theta \right) \approx L\left( {a,b} \right) + \frac{\partial L\left( {a,b} \right)}{\partial\theta_{1}}\left( {\theta_{1} - a} \right) + \frac{\partial L\left( {a,b} \right)}{\partial\theta_{2}}\left( {\theta_{2} - b} \right)$$

$$s = L\left( {a,b} \right)\\u = \frac{\partial L\left( {a,b} \right)}{\partial\theta_{1}},v = \frac{\partial L\left( {a,b} \right)}{\partial\theta_{2}}\\L\left( \theta \right) \approx s + u\left( {\theta_{1} - a} \right) + v\left( {\theta_{2} - b} \right)$$

$$\left( {\theta_{1} - a} \right)^{2} + \left( {\theta_{2} - b} \right)^{2} \leq d^{2}\\ \Delta\theta_{1} = \theta_{1} - a\\ \Delta\theta_{2} = \theta_{2} - b$$

Minimizing L inside the circle then amounts to minimizing the inner product of $(u, v)$ with $(\Delta\theta_1, \Delta\theta_2)$: the inner product is smallest when $(\Delta\theta_1, \Delta\theta_2)$ points opposite to $(u, v)$ and extends to the circle's edge. The learning rate η plays the role of the circle's radius:

$$\begin{bmatrix} {\Delta\theta_{1}} \\ {\Delta\theta_{2}} \\ \end{bmatrix} = - \eta\begin{bmatrix} u \\ v \\ \end{bmatrix}\\ \begin{bmatrix} \theta_{1} \\ \theta_{2} \\ \end{bmatrix} = \begin{bmatrix} a \\ b \\ \end{bmatrix} - \eta\begin{bmatrix} u \\ v \\ \end{bmatrix} = \begin{bmatrix} a \\ b \\ \end{bmatrix} - \eta\begin{bmatrix} \frac{\partial L\left( {a,b} \right)}{\partial\theta_{1}} \\ \frac{\partial L\left( {a,b} \right)}{\partial\theta_{2}} \\ \end{bmatrix}$$
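The derivation says the update is only valid while η is small enough for the linear approximation to hold. A minimal check on an assumed toy loss $L(\theta_1,\theta_2)=\theta_1^2+10\theta_2^2$ (starting point and step count are also assumptions):

```python
import numpy as np

def grad(theta):
    # Gradient of L(theta) = theta_1^2 + 10 * theta_2^2
    return np.array([2 * theta[0], 20 * theta[1]])

theta = np.array([3.0, 2.0])
eta = 0.05  # small radius: keeps the first-order Taylor approximation valid
for _ in range(200):
    theta = theta - eta * grad(theta)  # step opposite the gradient
```

With η this small both coordinates shrink toward the minimum at (0, 0); anything above η = 0.1 would overshoot in the steep θ2 direction and diverge.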

# Optimization

## Background

Concepts used in convergence analyses: μ-strong convexity, Lipschitz continuity, and the Bregman proximal inequality.

- $\theta_t$: the model parameters at step t (the thing being optimized)
- $\nabla L(\theta_t)$ or $g_t$: the gradient of the loss at $\theta_t$, used to compute $\theta_{t+1}$
- $m_{t+1}$: momentum, a (compressed) record of the gradient information from step 0 to step t, also used to compute $\theta_{t+1}$

An input $x_t$ is fed through the model with parameters $\theta_t$ to get a prediction $y_t$ (perhaps a probability vector after softmax, or a generated image).

On-line: look at one pair $(x_t, y_t)$ at a time.

Off-line: look at all the data at every step.

Off-line is somewhat easier, but there may not be enough resources to hold all the data. The rest of these notes focus on off-line training.

## Optimization Methods

### SGD

SGD with Momentum (SGDM)

Momentum can also be written as an exponential moving average of the past gradients:

$$m_t=\beta_1 m_{t-1}+(1-\beta_1)g_{t-1}$$

-> Define the movement $v^0=0$; compute the gradient at $\theta^0$ and update the movement (the movement accumulates the directions of all past updates): $v^1=\lambda v^0-\eta\nabla L(\theta^0)$

-> Move in the movement direction: $\theta^1=\theta^0+v^1$
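The two SGDM steps above as a NumPy sketch; the quadratic toy loss and the values of η and λ are assumptions:

```python
import numpy as np

def sgdm(grad_fn, theta0, eta=0.1, lam=0.9, steps=200):
    """SGD with momentum: the movement v accumulates the
    directions of all past updates."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)                # movement v^0 = 0
    for _ in range(steps):
        v = lam * v - eta * grad_fn(theta)  # v^{t+1} = lam v^t - eta grad
        theta = theta + v                   # theta^{t+1} = theta^t + v^{t+1}
    return theta

# Toy loss L(theta) = sum(theta^2), gradient 2 * theta.
theta_star = sgdm(lambda th: 2 * th, [3.0, -2.0])
```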

Adagrad divides the learning rate by the root of the sum of all past squared gradients:

$$\theta_t=\theta_{t-1} - \frac{\eta}{\sqrt{\sum_{i = 0}^{t-1}\left( g_{i} \right)^{2}}}g_{t-1}$$

### RMSProp

RMSProp replaces Adagrad's ever-growing sum with an exponential moving average of squared gradients, so recent gradients matter more:

$$\theta_t=\theta_{t-1} - \frac{\eta}{\sqrt{v_t}}g_{t-1}\\ v_1=g_0^2\\ v_t=\alpha v_{t-1}+(1-\alpha)(g_{t-1})^2$$
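The RMSProp recursion as code; the toy loss and hyperparameters are assumptions, with α matching the α in the formula above:

```python
import numpy as np

def rmsprop(grad_fn, theta0, eta=0.01, alpha=0.9, eps=1e-8, steps=1000):
    """RMSProp: scale steps by an exponential moving average of
    squared gradients rather than Adagrad's ever-growing sum."""
    theta = np.asarray(theta0, dtype=float)
    v = None
    for _ in range(steps):
        g = grad_fn(theta)
        v = g ** 2 if v is None else alpha * v + (1 - alpha) * g ** 2
        theta = theta - eta / (np.sqrt(v) + eps) * g
    return theta

theta_star = rmsprop(lambda th: 2 * th, [3.0, -2.0])
```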

### Adam

Adam combines SGDM's momentum $m_t$ with RMSProp's scaling $v_t$, each bias-corrected for its zero initialization:

$$m_t=\beta_1 m_{t-1}+(1-\beta_1)g_{t-1}\\ v_t=\beta_2 v_{t-1}+(1-\beta_2)(g_{t-1})^2\\ \hat m_t=\frac{m_t}{1-\beta_1^t},\quad \hat v_t=\frac{v_t}{1-\beta_2^t}\\ \theta_t=\theta_{t-1}-\frac{\eta}{\sqrt{\hat v_t}+\varepsilon}\hat m_t$$
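The Adam update translates directly into code; the toy loss and all hyperparameters below are assumptions, with β1 = 0.9 and β2 = 0.999 being the common defaults:

```python
import numpy as np

def adam(grad_fn, theta0, eta=0.01, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=2000):
    """Adam: momentum m plus RMSProp-style scaling v,
    both bias-corrected for their zero initialization."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta_star = adam(lambda th: 2 * th, [3.0, -2.0])
```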

Adam: trains faster, but with a larger generalization gap and less stability.

SGDM: stable, smaller generalization gap, and the final convergence may be better.

SWATS: start training with Adam (fast), then switch to SGDM (stable) near the end.

### Improving SGDM

Cyclical LR(learning rate)/ SGDR

One-cycle LR

1. The usual approach is to define a schedule where the learning rate first increases and then slowly decreases (a common engineering practice).

2. Another approach:

For comparison, Adam's update is

$$\theta_t=\theta_{t-1}-\frac{\eta}{\sqrt{\hat v_t}+\varepsilon}\hat m_t$$

while an SGDM-style update drops the adaptive denominator:

$$\theta_t=\theta_{t-1}-\eta\hat m_t$$

RAdam-style warm-up multiplies the adaptive step by a rectification factor $r_t$ that is small early in training:

$$\theta_t=\theta_{t-1}-\frac{\eta r_t}{\sqrt{\hat v_t}+\varepsilon}\hat m_t$$

Lookahead ("k steps forward, 1 step back") keeps slow weights φ, runs any inner optimizer for k steps, then interpolates φ toward the result:

    For t = 1, 2, ... (outer loop)
        θ_(t,0) = φ_(t-1)
        For i = 1, 2, ..., k (inner loop)
            θ_(t,i) = θ_(t,i-1) + Optim(Loss, data, θ_(t,i-1))
        φ_t = φ_(t-1) + α(θ_(t,k) - φ_(t-1))
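A sketch of that outer/inner loop with plain SGD as the inner optimizer; the inner optimizer, toy loss, and the values of k and α are assumptions:

```python
import numpy as np

def lookahead(inner_step, phi0, k=5, alpha=0.5, outer_steps=50):
    """Lookahead: run k inner-optimizer steps starting from the slow
    weights phi, then move phi part of the way toward the result."""
    phi = np.asarray(phi0, dtype=float)
    for _ in range(outer_steps):
        theta = phi.copy()
        for _ in range(k):                  # k steps forward
            theta = inner_step(theta)
        phi = phi + alpha * (theta - phi)   # 1 step back
    return phi

# Inner optimizer: one plain SGD step on L(theta) = sum(theta^2).
sgd_step = lambda th: th - 0.1 * (2 * th)
phi_star = lookahead(sgd_step, [3.0, -2.0])
```

The interpolation back toward φ makes the trajectory more stable than running the inner optimizer alone.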

SGDM

$$\theta_t=\theta_{t-1}-m_t=\theta_{t-1}-\lambda m_{t-1}-\eta \nabla L(\theta_{t-1})\\ m_t=\lambda m_{t-1}+\eta \nabla L(\theta_{t-1})$$

NAG

$$m_t=\lambda m_{t-1}+\eta \nabla L(\theta_{t-1}-\lambda m_{t-1})$$

(Originally two copies of θ_(t-1) would have to be stored; a change of variables substitutes one away to avoid wasting memory, but honestly the resulting formula is painful to read.)

$$m_t=\lambda m_{t-1}+\eta \nabla L(\theta_{t-1}')\\ \theta_t'=\theta_{t-1}'-\lambda m_t-\eta \nabla L(\theta_{t-1}')$$
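The first (two-copy) form of NAG is easy to code directly; the toy quadratic loss and the hyperparameters are assumptions:

```python
import numpy as np

def nag(grad_fn, theta0, eta=0.05, lam=0.9, steps=300):
    """Nesterov accelerated gradient: take the gradient at the
    look-ahead point theta - lam * m before updating the momentum."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta - lam * m)  # gradient at the look-ahead point
        m = lam * m + eta * g
        theta = theta - m
    return theta

theta_star = nag(lambda th: 2 * th, [3.0, -2.0])
```

The reformulated version above follows an equivalent trajectory while storing only the shifted variable θ′.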

Difference between the SGDM and NAG formulas: NAG evaluates the gradient at the look-ahead point $\theta_{t-1}-\lambda m_{t-1}$ instead of at $\theta_{t-1}$. Nadam applies the same look-ahead idea to Adam's update

$$\theta_t=\theta_{t-1}-\frac{\eta}{\sqrt{\hat v_t}+\varepsilon}\hat m_t$$

by replacing $\hat m_t$ with

$$\hat m_t=\frac{\beta_1 m_t}{1-\beta_1^{t+1}}+\frac{(1-\beta_1)g_{t-1}}{1-\beta_1^t}$$

## Things That Help Optimization

### Add Randomness to the Model

Shuffling

Shuffle the order in which the training data is fed in.

Dropout

### Learning Rate

- Warm-up
- Curriculum learning
- Fine-tuning
- Normalization
- Regularization

## Summary

| | SGDM | Adam |
| --- | --- | --- |
| Speed | slow | fast |
| Convergence | converges better | may not converge |
| Stability | stable | unstable |
| Generalization | good | poor |

Typical choices by field:

| SGDM | Adam |
| --- | --- |
| Computer vision: image classification, segmentation, object detection | NLP: QA, machine translation, summarization |
| | Speech synthesis |
| | GAN |
| | Reinforcement learning |
