[Hung-yi Lee Machine Learning] 04: Gradient Descent

BkbK-

Last modified 2022-11-07 15:43:18


First published 2021-01-30 22:00:09

Article link: https://blog.csdn.net/BlacKingZ/article/details/113407285


ML Lecture 3-1 Gradient Descent

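I. The Gradient Descent Method

Review: Hung-yi Lee Machine Learning 02: Regression

In step 3 of the regression problem, we need to solve the following optimization problem: find a set of parameters $\theta$ that makes the loss function as small as possible.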

$\theta^*=\argmin_\theta L(\theta)$

$L$: loss function
$\theta$: parameters

Gradient descent proceeds as follows:

Suppose $\theta$ has two components $\{\theta_1,\theta_2\}$. Randomly choose an initial value $\theta^0=\begin{bmatrix} \theta_1^0 \\ \theta_2^0 \end{bmatrix}$

The gradient of the loss function is $\nabla L(\theta)=\Large\begin{bmatrix} \frac{\partial L(\theta)}{\partial \theta_1} \\ \frac{\partial L(\theta)}{\partial \theta_2} \end{bmatrix}$

Update the parameters repeatedly:

$\begin{bmatrix} \theta_1^1 \\ \theta_2^1 \end{bmatrix} = \begin{bmatrix} \theta_1^0 \\ \theta_2^0 \end{bmatrix} -\eta\Large\begin{bmatrix} \frac{\partial L(\theta^0)}{\partial \theta_1} \\ \frac{\partial L(\theta^0)}{\partial \theta_2} \end{bmatrix}$ $\large\Longrightarrow$ $\theta^1=\theta^0-\eta\nabla L(\theta^0)$
$\begin{bmatrix} \theta_1^2 \\ \theta_2^2 \end{bmatrix} = \begin{bmatrix} \theta_1^1 \\ \theta_2^1 \end{bmatrix} -\eta\Large\begin{bmatrix} \frac{\partial L(\theta^1)}{\partial \theta_1} \\ \frac{\partial L(\theta^1)}{\partial \theta_2} \end{bmatrix}$ $\large\Longrightarrow$ $\theta^2=\theta^1-\eta\nabla L(\theta^1)$
… …
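
The update rule above can be sketched in a few lines of Python. This is only a minimal illustration: the loss function, its minimum at $(3, -1)$, the starting point, and the values of `eta` and `steps` are all assumptions chosen for the example, not taken from the lecture.

```python
import numpy as np

def loss(theta):
    # Hypothetical convex loss with its minimum at (3, -1).
    return (theta[0] - 3.0) ** 2 + 2.0 * (theta[1] + 1.0) ** 2

def grad(theta):
    # Analytic gradient [dL/dtheta_1, dL/dtheta_2] of the loss above.
    return np.array([2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)])

def gradient_descent(theta0, eta=0.1, steps=200):
    """Repeat theta <- theta - eta * grad(theta) for a fixed number of steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

theta_star = gradient_descent([0.0, 0.0])
print(theta_star, loss(theta_star))
```

With these assumed settings the iterates approach $(3, -1)$; a different loss would need its own gradient function and probably a different `eta`.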

II. Tips for Improving Gradient Descent

Tip 1: Tuning your learning rates

$\theta^{i}=\theta^{i-1}-\eta\nabla L(\theta^{i-1})$
$\eta$ is called Learning Rate

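1. Effect of the learning rate on gradient descent

Learning rate very large
If the step is far too large, the loss increases instead of decreasing.
Learning rate small
If the step is small, the loss decreases too slowly.
Learning rate just right
If the learning rate is appropriate, the loss converges to the minimum.
Learning rate large
If the learning rate is somewhat too large, the loss may never reach the lowest point.

In practice, plot the loss against the number of parameter updates to judge whether the learning rate is appropriate.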
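
The four cases above can be reproduced on a toy problem. The sketch below assumes the one-parameter loss $L(w)=w^2$, a starting point $w_0=5$, and four illustrative learning rates; none of these values come from the lecture.

```python
def loss_curve(eta, w0=5.0, steps=20):
    """Run gradient descent on L(w) = w**2 and record the loss after every update."""
    w = w0
    losses = [w ** 2]
    for _ in range(steps):
        w = w - eta * 2.0 * w   # dL/dw = 2w
        losses.append(w ** 2)
    return losses

# Hypothetical learning rates, one per case described in the text.
curves = {
    "small (0.001)": loss_curve(0.001),
    "just right (0.3)": loss_curve(0.3),
    "large (1.0)": loss_curve(1.0),
    "very large (1.1)": loss_curve(1.1),
}
for name, losses in curves.items():
    print(f"{name:>18}: start {losses[0]:.3g}, end {losses[-1]:.3g}")
```

For this particular loss, `eta = 1.0` makes $w$ jump between $5$ and $-5$ so the loss never decreases, and `eta = 1.1` makes it grow at every step. Plotting each list in `curves` gives the loss-versus-updates picture described above.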

2. Adaptive Learning Rates

Popular & Simple Idea: Reduce the learning rate by some factor every few epochs
- At the beginning, we are far from the destination, so we use a larger learning rate
- After several epochs, we are close to the destination, so we reduce the learning rate
- E.g.: 1/t decay $\eta^t=\Large\frac{\eta}{\sqrt{\smash[b]{t+1}}}$
  Here $\eta^t$ is the step size at update $t$; it shrinks as the number of updates grows.
Learning rate cannot be one-size-fits-all
- Giving different parameters different learning rates
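
As a small illustration of the 1/t decay schedule mentioned above, the sketch below evaluates $\eta^t=\eta/\sqrt{t+1}$ for the first few updates. The base rate `1.0` and the function name are assumptions made for the example.

```python
import math

def decayed_learning_rate(eta, t):
    """1/t decay as written in the text: eta^t = eta / sqrt(t + 1)."""
    return eta / math.sqrt(t + 1)

# Hypothetical base learning rate eta = 1.0, evaluated for t = 0..4.
schedule = [decayed_learning_rate(1.0, t) for t in range(5)]
print(schedule)
```

The schedule starts at the base rate for $t=0$ and has halved by $t=3$, since $\sqrt{3+1}=2$.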

3. The Adagrad algorithm

(1) What is Adagrad

Divide the learning rate of each parameter by the root mean square of its previous derivatives

Vanilla Gradient descent
$w^{t+1} \gets w^t - \eta^t g^t$
Adagrad
$w^{t+1} \gets w^t - \Large\frac{\eta^t}{\sigma^t} \normalsize g^t$
Here $\eta^t$ and $g^t$ denote the learning rate and the partial derivative at update $t$:
$\eta^t=\Large\frac{\eta}{\sqrt{\smash[b]{t+1}}}$ ; $g^t=\Large\frac{\partial L(\theta^t)}{\partial w}$
- $w$ is one parameter
- $\sigma^t$: root mean square of the previous derivatives of parameter $w$ (parameter dependent)

(2) Adagrad example

$w^1 \gets w^0 - \Large\frac{\eta^0}{\sigma^0}\normalsize g^0$ , where $\sigma^0 =\sqrt{(g^0)^2}$
$w^2 \gets w^1 -\Large\frac{\eta^1}{\sigma^1}\normalsize g^1$ , where $\sigma^1 =\sqrt{\frac1 2[(g^0)^2+(g^1)^2]}$
$w^3 \gets w^2 - \Large\frac{\eta^2}{\sigma^2}\normalsize g^2$ , where $\sigma^2 =\sqrt{\frac1 3[(g^0)^2+(g^1)^2+(g^2)^2]}$

… …

$w^{t+1} \gets w^t - \Large\frac{\eta^t}{\sigma^t}\normalsize g^t$ , where $\sigma^t =\sqrt{\frac{1}{t+1}\displaystyle\sum_{i=0}^t(g^i)^2}$
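
To make the definition of $\sigma^t$ concrete, the sketch below computes it for a short gradient history. The three gradient values are made up for illustration only.

```python
import math

def adagrad_sigmas(grads):
    """sigma^t = sqrt( (1 / (t + 1)) * sum_{i=0..t} (g^i)^2 ) for each t."""
    sigmas = []
    running_sum = 0.0
    for t, g in enumerate(grads):
        running_sum += g ** 2
        sigmas.append(math.sqrt(running_sum / (t + 1)))
    return sigmas

grads = [3.0, 4.0, 0.0]          # hypothetical g^0, g^1, g^2
sigmas = adagrad_sigmas(grads)
print(sigmas)
```

Here $\sigma^0=\sqrt{3^2}=3$, $\sigma^1=\sqrt{\frac12(9+16)}$ and $\sigma^2=\sqrt{\frac13(9+16+0)}$, matching the pattern in the example above.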

(3) Understanding Adagrad

Simplify the ratio of the learning rate to the root mean square of the derivatives.
The original update is: $w^{t+1} \gets w^t - \Large\frac{\eta^t}{\sigma^t}\normalsize g^t$
where: $\eta^t=\Large\frac{\eta}{\sqrt{\smash[b]{t+1}}}$ ; $\sigma^t =\sqrt{\frac{1}{t+1}\displaystyle\sum_{i=0}^t(g^i)^2}=\frac{1}{\sqrt{t+1}}\sqrt{\displaystyle\sum_{i=0}^t(g^i)^2}$

Substituting $\eta^t$ and $\sigma^t$: $\Large\frac{\eta^t}{\sigma^t}\normalsize=\Large\frac{\frac{\eta}{\sout{\sqrt{\smash[b]{t+1}}}}}{\frac{1}{\sout{\sqrt{t+1}}}\small\sqrt{\displaystyle\sum_{i=0}^t(g^i)^2}}\normalsize=\frac{\large\eta}{\small\sqrt{\displaystyle\sum_{i=0}^t(g^i)^2}}$

This gives: $w^{t+1} \gets w^t - \Large\frac{\eta}{\small\sqrt{\displaystyle\sum_{i=0}^t(g^i)^2}}\normalsize g^t$
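
The simplified update can be sketched as a small optimizer that keeps one running sum of squared gradients per parameter. The elongated loss $L(w)=w_1^2+100w_2^2$, the starting point, `eta = 0.5`, and the small constant added to the denominator are all assumptions for illustration; the constant is a common guard against division by zero and is not part of the formula above.

```python
import numpy as np

def adagrad(grad_fn, w0, eta=0.5, steps=200):
    """Adagrad in the simplified form: w <- w - eta / sqrt(sum of g^2) * g."""
    w = np.asarray(w0, dtype=float)
    sq_sum = np.zeros_like(w)      # per-parameter running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        sq_sum += g ** 2
        # Assumption: 1e-12 only protects against a zero denominator.
        w = w - eta / (np.sqrt(sq_sum) + 1e-12) * g
    return w

def grad_fn(w):
    # Gradient of the hypothetical loss L(w) = w_1^2 + 100 * w_2^2.
    return np.array([2.0 * w[0], 200.0 * w[1]])

w_final = adagrad(grad_fn, [2.0, 1.0])
print(w_final)
```

Both coordinates shrink toward zero even though their curvatures differ by a factor of 100. Plain gradient descent with the same `eta = 0.5` would diverge along $w_2$ on this loss, because each step multiplies $w_2$ by $1-0.5\cdot200=-99$.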
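Contradiction?

Compare the vanilla gradient descent update with Adagrad.

The gradient affects the size of the update in two opposite ways. The current gradient is in the numerator, so a larger gradient gives a larger update. The root of the summed squares of the past gradients is in the denominator, so larger gradients give a smaller update.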

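Ⅰ Intuitive Reason

When the gradient changes a lot between updates, dividing by the root mean square of the previous gradients creates a contrast effect: the update reflects how large the current gradient is relative to the earlier ones.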

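Ⅱ Mathematical Reason

Take the quadratic function $y=ax^2+bx+c\ (a>0)$ as an example:

Its minimum is at $x=-\large\frac{b}{2a}$.

For any point $x_0$, the distance from $x_0$ to the minimizer is $|x_0+\large\frac{b}{2a}\normalsize|$, so

Best step $=|x_0+\large\frac{b}{2a}\normalsize|=\large\frac{|2ax_0+b|}{2a}$

The numerator is the absolute value of the first derivative of $y=ax^2+bx+c\ (a>0)$ at $x_0$.

From this we can conclude:

Larger 1st order derivative means far from the minima
However, this conclusion no longer holds when comparing across several parameters.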

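For example, in the figure, the absolute value of the first derivative of parameter $w_1$ at point a is smaller than that of parameter $w_2$ at point c, yet c is actually closer to its minimum.
Now consider the denominator of $\large\frac{|2ax_0+b|}{2a}$ as well:
the denominator $2a$ is the second derivative of the quadratic function $y=ax^2+bx+c\ (a>0)$:

$\large\frac{\partial^2 y}{\partial x^2}=\normalsize2a$

The best step is $\large\frac{|\text{First derivative}|}{\text{Second derivative}}$

So the best step is the ratio of the absolute value of the first derivative to the second derivative.
Compare with the formula: $w^{t+1} \gets w^t - \Large\frac{\eta}{\small\sqrt{\displaystyle\sum_{i=0}^t(g^i)^2}}\normalsize g^t$

$g^t$ is the first derivative. The second derivative is expensive to compute, so $\sqrt{\displaystyle\sum_{i=0}^t(g^i)^2}$ is used to estimate it.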
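
The "best step" claim for a quadratic is easy to check numerically. The coefficients and the starting point below are hypothetical values chosen so that the minimum sits at $x=2$.

```python
def best_step(a, b, x0):
    """|first derivative| / second derivative for y = a*x**2 + b*x + c with a > 0."""
    first = 2.0 * a * x0 + b     # y'(x0)
    second = 2.0 * a             # y''(x0), constant for a quadratic
    return abs(first) / second

a, b, x0 = 2.0, -8.0, 5.0        # hypothetical coefficients; minimum at x = -b / (2a) = 2
minimum = -b / (2.0 * a)
step = best_step(a, b, x0)
print(step, abs(x0 - minimum))
```

Both printed numbers are the same distance: one step of that size from $x_0$ lands exactly on the minimizer. This holds for quadratics; for other functions the ratio is only a local estimate.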

Tip 2: Stochastic Gradient Descent

Gradient Descent (vanilla)
Loss is the summation over all training examples
Loss function:
$L(w,b)=\sum_n\big(\hat{y}^n-(b+\sum_i w_i x_i^n)\big)^2$
Update rule:
$\theta^{i}=\theta^{i-1}-\eta\nabla L(\theta^{i-1})$
Stochastic Gradient Descent
Loss for only one example
Each update uses a single example $x^n$.
Loss function:
$L^n(w,b)=\big(\hat{y}^n-(b+\sum_i w_i x_i^n)\big)^2$
Update rule (same form as vanilla gradient descent, but using the loss $L^n$ of one example):
$\theta^{i}=\theta^{i-1}-\eta\nabla L^n(\theta^{i-1})$
Advantage of Stochastic Gradient Descent:

It is faster: the parameters are updated after every single example instead of once per pass over all the data. As the saying goes, "of all martial arts, only speed is unbeatable."
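
The difference between the two update schemes can be sketched on a tiny linear regression. Everything here is assumed for illustration: the synthetic noise-free data with $w=2$, $b=1$, the learning rate, and the number of epochs. One deliberate deviation from the formulas above is that the batch version uses the mean of the squared errors rather than the sum, so that the same `eta` is usable for both functions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0                      # hypothetical noise-free data: w = 2, b = 1

def batch_gd(x, y, eta=0.1, epochs=50):
    """One parameter update per pass over the whole data set."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        err = y - (b + w * x)
        # Gradient step on the mean squared error over all examples.
        w += eta * 2.0 * np.mean(err * x)
        b += eta * 2.0 * np.mean(err)
    return w, b

def sgd(x, y, eta=0.1, epochs=50, seed=0):
    """One parameter update per example, visiting examples in random order."""
    order_rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for n in order_rng.permutation(len(x)):
            err = y[n] - (b + w * x[n])
            # Gradient step on the squared error of example n only.
            w += eta * 2.0 * err * x[n]
            b += eta * 2.0 * err
    return w, b

w_gd, b_gd = batch_gd(x, y)
w_sgd, b_sgd = sgd(x, y)
print("batch GD:", w_gd, b_gd)
print("SGD:     ", w_sgd, b_sgd)
```

After the same number of passes over the data, the stochastic version has made 100 times as many updates and, on this easy noise-free problem, ends up much closer to $(2, 1)$. With noisy data SGD would fluctuate around the solution rather than settle on it exactly.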

Tip 3: Feature Scaling

1. What is `Feature Scaling`?

Consider the function $y=b+w_1x_1+w_2x_2$
[Figure: the inputs $x_1$ and $x_2$ after feature scaling]

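2. Why `Feature Scaling`?

Make different features have the same scaling

After feature scaling, gradient descent becomes easier:
In the figure, the right-hand plot shows two features with similar scales, so its green contours are close to circles.
In the left-hand plot the contours are long and narrow. This case is hard to handle without Adagrad, because the two directions need different learning rates, whereas on the right the parameters are much easier to update. On the left, gradient descent does not head toward the lowest point; it moves along the normal to the contour lines. On the green contours it can head straight for the center (the lowest point), so the parameter updates are more efficient.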

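3. How to do `Feature Scaling`?

One approach is similar to standardizing a normal distribution in probability and statistics.
The procedure is as follows:

There are $R$ examples in total: $\{x^1,x^2,x^3,\dots,x^r,\dots,x^R\}$
For the $r$-th example, the value of its $i$-th feature $x_i^r$ is replaced by:
$x_i^r \gets \large\frac{x_i^r-m_i}{\sigma_i}$

where $m_i$ is the mean of dimension $i$ and $\sigma_i$ is the standard deviation of dimension $i$.

After feature scaling, every dimension has mean 0 and standard deviation 1.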
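
The standardization step above can be sketched with NumPy. The data matrix is hypothetical, with one example per row and one feature per column. Note that `numpy.std` uses the population standard deviation (`ddof=0`) by default; that choice is an assumption here, since the text does not say which definition is meant.

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to mean 0 and standard deviation 1."""
    m = X.mean(axis=0)        # m_i: mean of feature i over the R examples
    sigma = X.std(axis=0)     # sigma_i: standard deviation of feature i
    return (X - m) / sigma

# Hypothetical data: R = 4 examples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0],
              [4.0, 400.0]])
X_scaled = standardize(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

When scaling new data later, the same `m` and `sigma` computed from the training set would normally be reused rather than recomputed.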

III. Gradient Descent Theory

Gradient descent can be viewed as repeatedly finding the lowest point inside a small circle and then moving the center of the circle to that point.

Basic principle 1: Taylor series

Single-variable Taylor series:
$h(x)=\displaystyle\sum_{k=0}^{\infty}\frac{h^{(k)}(x_0)}{k!}(x-x_0)^k$
$=h(x_0)+h'(x_0)(x-x_0)+\frac{h''(x_0)}{2!}(x-x_0)^2+\dots$
When $x\to x_0$, we have $h(x)\approx h(x_0)+h'(x_0)(x-x_0)$
Two-variable Taylor series:
$h(x,y)=h(x_0,y_0)+\frac{\partial h(x_0,y_0)}{\partial x}(x-x_0)+\frac{\partial h(x_0,y_0)}{\partial y}(y-y_0)+\dots$
The terms in "…" are negligible as $x\to x_0, y \to y_0$, so:
$h(x,y)\approx h(x_0,y_0)+\frac{\partial h(x_0,y_0)}{\partial x}(x-x_0)+\frac{\partial h(x_0,y_0)}{\partial y}(y-y_0)$

Using the Taylor series, a loss function with two parameters $\{\theta_1,\theta_2\}$ can be written near the point $(a, b)$ as:
$L(\theta)\approx L(a,b)+\frac{\partial L(a,b)}{\partial \theta_1}(\theta_1-a)+\frac{\partial L(a,b)}{\partial \theta_2}(\theta_2-b)$
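
A quick numerical check shows how the first-order approximation improves as $\theta$ approaches $(a,b)$. The loss below and the expansion point $(1,1)$ are hypothetical, chosen only because the partial derivatives are easy to write down.

```python
def L(t1, t2):
    # Hypothetical smooth loss, used only to illustrate the approximation.
    return t1 ** 2 + 3.0 * t1 * t2 + 2.0 * t2 ** 2

def linear_approx(t1, t2, a, b):
    """First-order Taylor approximation of L around (a, b)."""
    dL_dt1 = 2.0 * a + 3.0 * b      # partial derivative w.r.t. theta_1 at (a, b)
    dL_dt2 = 3.0 * a + 4.0 * b      # partial derivative w.r.t. theta_2 at (a, b)
    return L(a, b) + dL_dt1 * (t1 - a) + dL_dt2 * (t2 - b)

a, b = 1.0, 1.0
errors = []
for d in (0.1, 0.01, 0.001):
    exact = L(a + d, b + d)
    approx = linear_approx(a + d, b + d, a, b)
    errors.append(abs(exact - approx))
print(errors)
```

For this loss the error is exactly the dropped second-order term, $6d^2$, so it shrinks a hundredfold each time the displacement $d$ shrinks tenfold.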

Basic principle 2: dot product with the gradient vector

$L(\theta)\approx L(a,b)+\frac{\partial L(a,b)}{\partial \theta_1}(\theta_1-a)+\frac{\partial L(a,b)}{\partial \theta_2}(\theta_2-b)$

Let $s = L (a, b)$,
$u=\frac{\partial L(a,b)}{\partial \theta_1}$ , $v=\frac{\partial L(a,b)}{\partial \theta_2}$

Then $L(\theta)\approx s+u(\theta_1-a)+v(\theta_2-b)$
Next let $\Delta\theta_1=\theta_1-a$ and $\Delta\theta_2=\theta_2-b$,
with $(\Delta\theta_1)^2+(\Delta\theta_2)^2\leqslant d^2$, that is, $\theta$ stays inside a circle of radius $d$ centered at $(a, b)$.
To minimize $L(\theta)$, consider the vector $(\Delta\theta_1,\Delta\theta_2)$:
$L(\theta)\approx s+(\Delta\theta_1,\Delta\theta_2)\cdot(u,v)$
Clearly, $L(\theta)$ is smallest when $(\Delta\theta_1,\Delta\theta_2)$ points in the direction opposite to $(u, v)$.

Therefore, let $\begin{bmatrix} \Delta\theta_1 \\ \Delta\theta_2 \end{bmatrix}=-\eta\begin{bmatrix} u \\ v \end{bmatrix}$ ,

that is, $\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}=\begin{bmatrix} a \\ b \end{bmatrix}-\eta\begin{bmatrix} u \\ v \end{bmatrix}=\begin{bmatrix} a \\ b \end{bmatrix}-\eta\begin{bmatrix} \frac{\partial L(a,b)}{\partial \theta_1} \\ \frac{\partial L(a,b)}{\partial \theta_2} \end{bmatrix}$
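
The claim that the best step points opposite to $(u,v)$ can be checked by brute force. The sketch below assumes a gradient $(3,-4)$ and a radius $d=0.1$, both made up, and searches only the boundary of the circle, since a linear function over a disc attains its minimum on the boundary.

```python
import numpy as np

u, v = 3.0, -4.0            # hypothetical gradient (u, v) at (a, b)
d = 0.1                     # hypothetical radius of the circle

# Candidate steps on the circle of radius d, and the change u*dtheta_1 + v*dtheta_2 for each.
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
steps = d * np.stack([np.cos(angles), np.sin(angles)], axis=1)
change = steps @ np.array([u, v])

best = steps[np.argmin(change)]
expected = -d * np.array([u, v]) / np.hypot(u, v)   # opposite to the gradient, length d
print(best, expected)
```

Up to the resolution of the angle grid, the sampled minimizer coincides with the step of length $d$ in the direction $-(u,v)$, which corresponds to choosing $\eta=d/\sqrt{u^2+v^2}$ in the update above.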


IV. Limitations of Gradient Descent

(1) Very slow at the plateau
(2) Stuck at saddle point
(3) Stuck at local minima
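
The saddle-point case can be illustrated with a toy example. The loss $L=\theta_1^2-\theta_2^2$, the starting point on the line $\theta_2=0$, and the learning rate are assumptions chosen so that the effect is easy to see.

```python
import numpy as np

def grad(theta):
    # Gradient of the hypothetical loss L = theta_1^2 - theta_2^2 (saddle point at (0, 0)).
    return np.array([2.0 * theta[0], -2.0 * theta[1]])

theta = np.array([1.0, 0.0])     # starts exactly on the line theta_2 = 0
for _ in range(100):
    theta = theta - 0.1 * grad(theta)

print(theta, np.linalg.norm(grad(theta)))
```

The iterates converge to $(0,0)$, where the gradient vanishes, even though the loss could still be lowered by moving along $\theta_2$. Gradient descent only sees the gradient, so it cannot tell this point apart from a minimum.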


ML Lecture 3-2 Gradient Descent (Demo by AOE)

ML Lecture 3-3 Gradient Descent -Demo by Minecraft-
