Datawhale Study Check-in LeeML-Task03 (Gradient Descent)

三万万万万

Category: Datawhale ML

Article link: https://blog.csdn.net/weixin_43873591/article/details/122500959

Review of Gradient Descent
Some Tips
Related Theoretical Foundations
Limitations of Gradient Descent

Reference: LeeML-Chapter6

Review of Gradient Descent

We want to solve the following optimization problem:
$\theta^{*}=\arg \min _{\theta} L(\theta)$, where $L$ is the loss function and $\theta$ denotes the parameters.
Suppose $\theta$ has two components $\left\{\theta_{1}, \theta_{2}\right\}$. Randomly initialize $\theta^{0}$ and then update it repeatedly:
$\begin{aligned} &\theta^{0}=\left[\begin{array}{l} \theta_{1}^{0} \\ \theta_{2}^{0} \end{array}\right] \\ &\left[\begin{array}{l} \theta_{1}^{1} \\ \theta_{2}^{1} \end{array}\right]=\left[\begin{array}{l} \theta_{1}^{0} \\ \theta_{2}^{0} \end{array}\right]-\eta\left[\begin{array}{l} \partial L\left(\theta^{0}\right) / \partial \theta_{1} \\ \partial L\left(\theta^{0}\right) / \partial \theta_{2} \end{array}\right] \rightarrow \theta^{1}=\theta^{0}-\eta \nabla L\left(\theta^{0}\right) \\ &\left[\begin{array}{l} \theta_{1}^{2} \\ \theta_{2}^{2} \end{array}\right]=\left[\begin{array}{l} \theta_{1}^{1} \\ \theta_{2}^{1} \end{array}\right]-\eta\left[\begin{array}{l} \partial L\left(\theta^{1}\right) / \partial \theta_{1} \\ \partial L\left(\theta^{1}\right) / \partial \theta_{2} \end{array}\right] \rightarrow \theta^{2}=\theta^{1}-\eta \nabla L\left(\theta^{1}\right) \end{aligned}$
Here the gradient is written compactly as: $\nabla L(\theta)=\left[\begin{array}{l} \partial L(\theta) / \partial \theta_{1} \\ \partial L(\theta) / \partial \theta_{2} \end{array}\right]$
The gradient descent computation can be visualized as follows:
[Figure: visualization of the gradient descent updates]
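The update rule above can be sketched in a few lines of NumPy. This is only a minimal illustration: the quadratic loss, its minimum at (3, -1), the learning rate, and the step count are all assumptions chosen for the example, not values from the course material.

```python
import numpy as np

def loss(theta):
    # Hypothetical convex loss with its minimum at (3, -1)
    return (theta[0] - 3.0) ** 2 + 2.0 * (theta[1] + 1.0) ** 2

def grad(theta):
    # Analytic gradient: both partial derivatives are evaluated at the same theta
    return np.array([2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)])

def gradient_descent(theta0, eta=0.1, steps=100):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        # theta^{t+1} = theta^t - eta * grad L(theta^t)
        theta = theta - eta * grad(theta)
    return theta

theta_star = gradient_descent([0.0, 0.0])
print(theta_star, loss(theta_star))
```

With these assumed settings the iterate ends up very close to the minimum of the toy loss.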

Some Tips

Tip 1: Set the learning rate carefully

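[Figure: gradient descent with different learning rates]
Plotting the loss over the parameters like this is intuitive, but it only works when there are one or two parameters; in higher dimensions it can no longer be drawn.

The way around this is the plot on the right of the figure, which shows how the loss changes as the parameters are updated. If the learning rate is too small (blue line), the loss decreases very slowly. If it is too large (green line), the loss drops quickly but soon gets stuck and stops decreasing. If it is far too large (yellow line), the loss blows up. The red line is about right and gives a good result.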
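The idea of tracking the loss per update can be tried on a toy problem. The sketch below assumes a hypothetical one-dimensional loss L(w) = w^2 and three hand-picked learning rates. A plain quadratic does not reproduce the "drops fast, then gets stuck" curve, so only the too-small, reasonable, and diverging cases are shown.

```python
def loss_curve(eta, w0=5.0, steps=20):
    # Hypothetical 1-D loss L(w) = w^2, whose gradient is 2w
    w = w0
    losses = [w * w]
    for _ in range(steps):
        w = w - eta * 2.0 * w
        losses.append(w * w)  # record the loss after every update
    return losses

for eta in (0.01, 0.3, 1.1):
    curve = loss_curve(eta)
    print(f"eta={eta}: first loss {curve[0]:.3g}, last loss {curve[-1]:.3g}")
```

Printing or plotting `curve` for each rate gives the same kind of diagnostic as the right-hand plot, and it works no matter how many parameters the model has.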

Adaptive learning rates

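Reduce the learning rate by some factor as the number of updates grows.

At the start, the initial point is usually far from the minimum, so use a larger learning rate.
After some epochs the parameters are closer to the minimum, so reduce the learning rate, for example $\eta^{t}=\eta / \sqrt{t+1}$, where $t$ is the number of updates; $\eta^{t}$ decreases as $t$ grows.
Different parameters should be given different learning rates.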
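The decay schedule $\eta^{t}=\eta/\sqrt{t+1}$ is easy to write down directly. The following is a rough sketch; the function names `decayed_lr` and `gd_with_decay`, the toy gradient, and the base rate 0.4 are assumptions made for illustration.

```python
import math

def decayed_lr(eta, t):
    # eta^t = eta / sqrt(t + 1), with t counting updates from 0
    return eta / math.sqrt(t + 1)

def gd_with_decay(grad, w0, eta=0.4, steps=50):
    w = w0
    for t in range(steps):
        w = w - decayed_lr(eta, t) * grad(w)
    return w

# Hypothetical loss (w - 3)^2 with gradient 2 * (w - 3)
w_final = gd_with_decay(lambda w: 2.0 * (w - 3.0), w0=0.0)
print([round(decayed_lr(1.0, t), 3) for t in range(4)])
print(w_final)
```

Note that this schedule still uses one learning rate for all parameters; it only changes over time.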

The Adagrad algorithm

Adagrad divides the learning rate of each parameter by the root mean square of that parameter's previous derivatives.
Ordinary gradient descent is:
$\begin{gathered} w^{t+1} \leftarrow w^{t}-\eta^{t} g^{t} \\ \eta^{t}=\frac{\eta}{\sqrt{t+1}} \end{gathered}$

where $w$ is a single parameter.
Adagrad can do better:
$\begin{aligned} w^{t+1} & \leftarrow w^{t}-\frac{\eta^{t}}{\sigma^{t}} g^{t} \\ g^{t} &=\frac{\partial L\left(\theta^{t}\right)}{\partial w} \end{aligned}$
$\sigma^{t}$ is the root mean square of all previous derivatives of the parameter, so it is different for each parameter.
For example:
$\begin{aligned} w^{1} \leftarrow w^{0}-\frac{\eta^{0}}{\sigma^{0}} g^{0} \quad & \sigma^{0}=\sqrt{\left(g^{0}\right)^{2}} \\ w^{2} \leftarrow w^{1}-\frac{\eta^{1}}{\sigma^{1}} g^{1} \quad & \sigma^{1}=\sqrt{\frac{1}{2}\left[\left(g^{0}\right)^{2}+\left(g^{1}\right)^{2}\right]} \\ w^{3} \leftarrow w^{2}-\frac{\eta^{2}}{\sigma^{2}} g^{2} \quad & \sigma^{2}=\sqrt{\frac{1}{3}\left[\left(g^{0}\right)^{2}+\left(g^{1}\right)^{2}+\left(g^{2}\right)^{2}\right]} \\ \vdots & \\ w^{t+1} \leftarrow w^{t}-\frac{\eta^{t}}{\sigma^{t}} g^{t} \quad & \sigma^{t}=\sqrt{\frac{1}{t+1} \sum_{i=0}^{t}\left(g^{i}\right)^{2}} \end{aligned}$
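Simplifying the expression: with $\eta^{t}=\eta / \sqrt{t+1}$ and $\sigma^{t}=\sqrt{\frac{1}{t+1} \sum_{i=0}^{t}\left(g^{i}\right)^{2}}$, the factors of $\sqrt{t+1}$ cancel, leaving
$w^{t+1} \leftarrow w^{t}-\frac{\eta}{\sqrt{\sum_{i=0}^{t}\left(g^{i}\right)^{2}}} g^{t}$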
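The simplified Adagrad update, which divides $\eta$ by the square root of the running sum of squared gradients, can be sketched as below. The toy loss, the small constant `eps` (added here only to avoid division by zero), and the hyperparameters are assumptions for illustration and are not part of the formula above.

```python
import numpy as np

def adagrad(grad, w0, eta=0.5, steps=100, eps=1e-8):
    w = np.asarray(w0, dtype=float)
    sum_sq = np.zeros_like(w)  # running sum of squared gradients, one entry per parameter
    for _ in range(steps):
        g = grad(w)
        sum_sq += g ** 2
        # w^{t+1} = w^t - eta / sqrt(sum_i (g^i)^2) * g^t, applied per parameter
        w = w - eta / (np.sqrt(sum_sq) + eps) * g
    return w

# Hypothetical loss w1^2 + 100 * w2^2: the second direction is 100x steeper
grad = lambda w: np.array([2.0 * w[0], 200.0 * w[1]])

w_few = adagrad(grad, [1.0, 1.0], steps=3)
w_end = adagrad(grad, [1.0, 1.0])
print(w_few, w_end)
```

On this toy loss both coordinates move at practically the same pace even though their gradients differ by a factor of 100, because each one is normalized by its own gradient history.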

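Question: in Adagrad, a larger gradient should mean a larger step, yet the denominator makes the step smaller when the gradients are larger.

Explanation: for details, see LeeML-Chapter6.
The best step is therefore: $\text{best step}=\frac{\mid \text{first derivative} \mid}{\text{second derivative}}$
That is, the best step is not only proportional to the first derivative but also inversely proportional to the second derivative, so a good step size should take the second derivative into account.

The term $\sqrt{\sum_{i=0}^{t}\left(g^{i}\right)^{2}}$ is meant to approximate the second derivative without adding much computation. (Computing the second derivative directly can add a lot of running time in practice.)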
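The claim about the best step can be checked on a one-dimensional quadratic, where the answer is known in closed form. The sketch below assumes $y=ax^{2}+bx+c$ with $a>0$; the specific coefficients and starting point are made up for the example.

```python
def best_step(a, b, x0):
    # For y = a*x^2 + b*x + c with a > 0
    first = 2.0 * a * x0 + b   # first derivative at x0
    second = 2.0 * a           # second derivative (constant for a quadratic)
    return abs(first) / second

a, b, x0 = 2.0, -8.0, 10.0
x_min = -b / (2.0 * a)         # location of the minimum
step = best_step(a, b, x0)
print(step, abs(x0 - x_min))   # the two numbers agree
```

For a quadratic, |first derivative| / second derivative is exactly the distance from the current point to the minimum, which is why it is the ideal step length in this simple case.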

Tip 2: Stochastic Gradient Descent

Gradient descent as used so far:
$\begin{gathered} L=\sum_{n}\left(\hat{y}^{n}-\left(b+\sum w_{i} x_{i}^{n}\right)\right)^{2} \\ \theta^{i}=\theta^{i-1}-\eta \nabla L\left(\theta^{i-1}\right) \end{gathered}$
Stochastic gradient descent is **faster**:
The loss does not need to cover the whole training set. Pick a single example $x^{n}$:
$\begin{gathered} L^{n}=\left(\hat{y}^{n}-\left(b+\sum w_{i} x_{i}^{n}\right)\right)^{2} \\ \theta^{i}=\theta^{i-1}-\eta \nabla L^{n}\left(\theta^{i-1}\right) \end{gathered}$
There is no need to process all the data as before. Computing the loss $L^{n}$ for one example is enough to update the parameters right away.
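As a rough sketch of the per-example update, the code below fits the linear model $b+\sum w_{i} x_{i}$ with one parameter update per training example. The synthetic noiseless data, the "true" weights, the learning rate, and the number of epochs are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                # 200 hypothetical examples, 2 features
true_w, true_b = np.array([2.0, -3.0]), 0.5
y = X @ true_w + true_b                      # noiseless targets, for illustration only

def sgd(X, y, eta=0.01, epochs=20):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for n in rng.permutation(len(X)):    # visit the examples in random order
            err = y[n] - (b + X[n] @ w)
            # Gradient of L^n = err^2 is -2 * err * x^n (and -2 * err for b)
            w += eta * 2.0 * err * X[n]
            b += eta * 2.0 * err
    return w, b

w, b = sgd(X, y)
print(w, b)
```

One pass over this data set makes 200 parameter updates, whereas full-batch gradient descent would make a single update after looking at all 200 examples.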

Comparison:
[Figure: comparison of gradient descent and stochastic gradient descent]

Tip 3: Feature scaling

Make different features have the same scale.
[Figure: feature scaling]
Why?

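On the left of the figure above, the scale of $x_{1}$ is much smaller than that of $x_{2}$, so when $w_{1}$ and $w_{2}$ change by the same amount, the change in $w_{1}$ has a small effect on $y$, while the change in $w_{2}$ has a large effect on $y$.

The axes show the error surface over the two parameters (consider the blue one on the left first). Because $w_{1}$ has little effect on $y$, it has little effect on the loss, the derivative of the loss with respect to $w_{1}$ is small, and the surface is fairly flat along the $w_{1}$ direction. Likewise, $w_{2}$ has a large effect on $y$ and therefore on the loss, so there is a sharp, narrow valley along the $w_{2}$ direction.

On the right of the figure, the two features have similar scales, so the green contours are close to circles.

As discussed above, the elongated case on the left is hard to handle without Adagrad: the two directions need different learning rates, and a single shared learning rate cannot cope with it. In the case on the right, updating the parameters is much easier. On the left, gradient descent does not head toward the minimum; it moves along the normal to the contour lines. On the green contours it can head straight toward the center (the minimum), so the parameter updates are more efficient.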
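One common way to bring features onto the same scale is standardization: subtract each feature's mean and divide by its standard deviation. The text above does not fix a particular method, so treat the following as an assumed example; the function name `standardize` and the tiny data set are hypothetical.

```python
import numpy as np

def standardize(X):
    # Scale each column (feature) to zero mean and unit standard deviation
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

# Hypothetical data: the second feature is 100x larger than the first
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])
X_scaled = standardize(X)
print(X_scaled)
```

After scaling, both columns cover the same range, so equal changes in $w_{1}$ and $w_{2}$ affect $y$ to a similar degree and the loss contours become closer to circles.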