Optimization Algorithm

sam-X

Category: Deep Learning. Tags: optimization algorithms, adam, momentum, RMSprop

Article link: https://blog.csdn.net/u010945683/article/details/77894226

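This article describes the variants of gradient descent in detail: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent. It compares their strengths and weaknesses and then covers exponentially weighted averages, momentum, RMSprop, and Adam.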

Partly based on material from deeplearning.ai

GD, SGD, and Mini-Batch Gradient Descent

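The relationship among the three, in terms of how many samples each update uses:

$SGD\in MiniBatch\ Gradient\ Descent\in GD$
GD computes the gradient over all samples, SGD computes it on a single randomly drawn sample, and Mini-Batch Gradient Descent is the compromise that computes it on a randomly drawn subset. Because each SGD step minimizes the loss of a single sample rather than moving in the globally optimal direction, its trajectory fluctuates strongly. Batch GD is more accurate, but its cost per step becomes too high when the number of samples is very large.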
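The three variants differ only in how many samples feed each gradient estimate. The following is a minimal Python sketch of that idea, assuming a hypothetical one-parameter model $f(x)=wx$ with squared loss; the model, the data, and the helper name `grad_estimate` are illustrative and not from the original text.

```python
import random

def grad_estimate(w, xs, ys, batch_size, rng):
    """Average gradient of 0.5 * (w*x - y)^2 w.r.t. w over a random batch."""
    idx = rng.sample(range(len(xs)), batch_size)  # sample without replacement
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / batch_size

# Toy data generated by y = 2x (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
rng = random.Random(0)

full_grad = grad_estimate(0.0, xs, ys, len(xs), rng)  # batch GD: all samples, -15.0
sgd_grad = grad_estimate(0.0, xs, ys, 1, rng)         # SGD: one sample, noisy
mini_grad = grad_estimate(0.0, xs, ys, 2, rng)        # mini-batch: a subset
```

The full-batch value is the same on every call, while the one-sample and two-sample estimates change with the draw, which is the fluctuation described above.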

Mini-Batch Gradient Descent

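Draw a mini-batch of $m$ samples $\{x^{(1)},...,x^{(m)}\}$ from the training set, with corresponding targets $y^{(i)}$.
Gradient estimate:
$\hat g\leftarrow{1\over m}\nabla_\theta\sum_iL(f(x^{(i)}),y^{(i)})$
Update:
$\theta\leftarrow \theta-\epsilon\hat g$
In Mini-Batch Gradient Descent, random sampling introduces a source of noise [1], so the gradient estimate does not vanish even at a minimum. With the full training set, by contrast, the true gradient of the cost function becomes very small, and eventually 0, as the descent approaches a minimum, so batch GD can use a fixed learning rate. Mini-batch training therefore needs a learning rate that decreases over time. In practice, the learning rate is usually decayed linearly until iteration $\tau$:
$\epsilon_k=(1-\alpha)\epsilon_0+\alpha\epsilon_\tau$, where $\alpha={k\over \tau}$
The parameters to choose are $\epsilon_0, \epsilon_\tau, \tau$. Usually $\tau$ is set to the number of iterations needed for repeated passes over the training set, and $\epsilon_\tau$ is set to $1\%$ of $\epsilon_0$. The main question is how to set $\epsilon_0$. If $\epsilon_0$ is too large, the learning curve oscillates violently and the cost function increases noticeably. If the learning rate is too small, learning proceeds slowly, and if the initial learning rate is too low, learning may get stuck at a fairly high loss value. In terms of total training time and final loss, the best initial learning rate is usually higher than the learning rate that gives the best results after roughly the first 100 iterations. It is therefore usually best to monitor the first few rounds of iterations and pick a learning rate higher than the one performing best at that point, but not so high that it causes severe instability. References [2, 3] analyze stochastic gradient descent in more detail.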
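The linear decay schedule can be sketched in a few lines of Python. The function name `linear_decay_lr` and the example values are hypothetical, and holding the rate constant at $\epsilon_\tau$ after iteration $\tau$ is an assumption (a common convention) rather than something stated in the text.

```python
def linear_decay_lr(k, eps0, eps_tau, tau):
    """Learning rate at iteration k: linear from eps0 to eps_tau over tau steps."""
    # Assumption: after iteration tau the rate is held constant at eps_tau.
    alpha = min(k / tau, 1.0)
    return (1.0 - alpha) * eps0 + alpha * eps_tau

eps0 = 0.1               # hypothetical initial learning rate
eps_tau = 0.01 * eps0    # 1% of eps0, as suggested above
tau = 100                # hypothetical decay horizon

schedule = [linear_decay_lr(k, eps0, eps_tau, tau) for k in range(0, 201, 50)]
```

Here `schedule` samples the rate every 50 iterations: it starts at `eps0`, reaches `eps_tau` at iteration 100, and stays there.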

mini-batch size

Small data set (size < 2000): use batch gradient descent
Typical mini-batch sizes: 64, 128, 256, … (powers of 2, $2^n$, work better)
Make sure a mini-batch fits in CPU/GPU memory
Treat the mini-batch size as a hyperparameter and search for a better value
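As an illustration of the points above, the following Python sketch shuffles a data set and cuts it into mini-batches of a chosen size. The helper `make_mini_batches`, the seed, and the data set size of 5000 are hypothetical choices for the example.

```python
import random

def make_mini_batches(samples, batch_size, seed=0):
    """Shuffle the samples and split them into consecutive mini-batches."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # The last batch is smaller when batch_size does not divide the data set size.
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]

batches = make_mini_batches(range(5000), batch_size=64)
print(len(batches), len(batches[-1]))  # 79 batches, the last one holds 8 samples
```

Reshuffling with a different seed at every pass over the data is a common design choice, so that the batches differ from epoch to epoch.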

Some algorithms that are faster than gradient descent

Exponentially weighted averages

Original data: $\{\theta_t\}$

$V_{t}=\alpha V_{t-1}+(1-\alpha)\theta_{t}$
This is approximately the average of the most recent ${1\over{1-\alpha}}$ values, for example about 10 values for $\alpha=0.9$:

$V_{t}\approx(1-\alpha)\sum_{i=t-{1\over{1-\alpha}}}^t\theta_i$
Computation and memory efficiency are much better than for an explicit windowed average, because only the single running value $V_t$ is stored, but the explicit average is more accurate.
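A minimal Python sketch of the exponentially weighted average, in which each new value is computed from the previous running value. Starting from $V_0=0$ and the function name `ewa` are assumptions made for the example.

```python
def ewa(values, alpha):
    """Exponentially weighted average: V_t = alpha * V_{t-1} + (1 - alpha) * theta_t."""
    v = 0.0          # assumption: V_0 = 0
    out = []
    for theta in values:
        v = alpha * v + (1.0 - alpha) * theta  # only one running value is kept
        out.append(v)
    return out

# A constant series of 10.0: the average starts near 1.0 and climbs toward 10.0.
smoothed = ewa([10.0] * 50, alpha=0.9)
```

The slow start on a constant series shows why the early values are underestimates when the running value is initialized to 0.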

Bias Correction

To correct the early values during the warm-up phase, which are biased toward 0 when $V_0=0$,

$V_t^{corrected}={{V_t}\over{1-\alpha^t}}$
As $t$ grows, $\alpha^t\to 0$ and the correction fades out.
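The correction can be sketched in Python as follows. The sketch assumes the running value starts at 0 and reports a corrected copy at each step without feeding it back into the recursion; the function name `ewa_bias_corrected` is hypothetical.

```python
def ewa_bias_corrected(values, alpha):
    """Exponentially weighted average with bias correction V_t / (1 - alpha^t)."""
    v = 0.0          # assumption: the running value starts at 0
    out = []
    for t, theta in enumerate(values, start=1):
        v = alpha * v + (1.0 - alpha) * theta   # uncorrected running value
        out.append(v / (1.0 - alpha ** t))      # corrected copy; v is not overwritten
    return out

# On a constant series the corrected average equals the series from the first step.
corrected = ewa_bias_corrected([10.0] * 20, alpha=0.9)
```

Keeping the uncorrected value for the recursion matters: dividing the stored value in place would apply the correction repeatedly.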

Gradient descent with momentum

On iteration $t$:
Compute $dw,db$ on the current batch (mini-batch)

$V_{dw}=\alpha V_{dw}+(1-\alpha)dw \\ w=w-\beta V_{dw}$
Here $\alpha$ is the averaging coefficient and $\beta$ is the learning rate; $b$ is updated in the same way using $db$.
This damps the oscillations and speeds up progress toward the optimum (a larger learning rate can be used).
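A minimal Python sketch of one momentum update for a scalar parameter. The learning rate gets its own name `lr` to keep it apart from the averaging coefficient `alpha`; the objective $f(w)=w^2$, the starting point, and the step count are hypothetical.

```python
def momentum_step(w, v, dw, lr, alpha=0.9):
    """One gradient-descent-with-momentum update for a scalar parameter."""
    v = alpha * v + (1.0 - alpha) * dw   # exponentially weighted average of gradients
    w = w - lr * v                       # step along the averaged gradient
    return w, v

# Hypothetical objective f(w) = w^2 with gradient 2w, minimized at w = 0.
w, v = 5.0, 0.0
for _ in range(300):
    dw = 2.0 * w
    w, v = momentum_step(w, v, dw, lr=0.1)
```

The velocity `v` has to be carried from one iteration to the next; resetting it each step would reduce the method to plain gradient descent with a scaled learning rate.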

RMSprop

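RMSprop stands for root mean square propagation.

On iteration $t$:
Compute $dw,db$ on the current batch (mini-batch)

$S_{dw}=\alpha S_{dw}+(1-\alpha)(dw)^2 \\ w=w-\beta {dw\over \sqrt {S_{dw}+\epsilon}}$
Here $(dw)^2$ is the element-wise square, $\beta$ is the learning rate, and $b$ is updated in the same way using $db$.
The effect is similar to momentum: oscillations are damped.
The small constant $\epsilon$ (here not a learning rate) avoids dividing by 0.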
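A minimal Python sketch of one RMSprop update for a scalar parameter, assuming the standard form in which the gradient is divided by the root of the running average of its square. The names `lr` and `eps` and the default values are assumptions for the example.

```python
import math

def rmsprop_step(w, s, dw, lr, alpha=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter."""
    s = alpha * s + (1.0 - alpha) * dw * dw   # running average of squared gradients
    w = w - lr * dw / math.sqrt(s + eps)      # eps keeps the denominator away from 0
    return w, s

# The first step is nearly the same size for a small and a large gradient,
# because each gradient is normalized by its own running magnitude.
w_small, _ = rmsprop_step(1.0, 0.0, 2.0, lr=0.01)
w_large, _ = rmsprop_step(1.0, 0.0, 200.0, lr=0.01)
```

This normalization is what lets directions with large gradients take smaller effective steps, which damps oscillations in those directions.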

Adam

Adam stands for adaptive moment estimation.

It combines momentum and RMSprop.

On iteration $t$:
Compute $dw,db$ on the current batch (mini-batch)

$V_{dw}=\alpha_1 V_{dw}+(1-\alpha_1)dw \\ S_{dw}=\alpha_2 S_{dw}+(1-\alpha_2)(dw)^2 \\ V_{dw}^{corrected}={{V_{dw}}\over{1-\alpha_1^t}}\\ S_{dw}^{corrected}={{S_{dw}}\over{1-\alpha_2^t}}\\ w=w-\beta {V_{dw}^{corrected}\over \sqrt {S_{dw}^{corrected}+\epsilon}}$
Here $\alpha_1$ and $\alpha_2$ are the averaging coefficients of the two moments, $\beta$ is the learning rate, $\epsilon$ is a small constant that avoids dividing by 0, and $b$ is updated in the same way using $db$.
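A minimal Python sketch of one Adam update for a scalar parameter, keeping $\epsilon$ inside the square root as in the formula. The default values 0.9, 0.999, and 1e-8 are commonly used settings and are an assumption here, as are the name `lr` for the learning rate and the objective $f(w)=w^2$.

```python
import math

def adam_step(w, v, s, dw, t, lr=0.001, alpha1=0.9, alpha2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based iteration count."""
    v = alpha1 * v + (1.0 - alpha1) * dw          # momentum term (first moment)
    s = alpha2 * s + (1.0 - alpha2) * dw * dw     # RMSprop term (second moment)
    v_hat = v / (1.0 - alpha1 ** t)               # bias-corrected copies;
    s_hat = s / (1.0 - alpha2 ** t)               # v and s stay uncorrected
    w = w - lr * v_hat / math.sqrt(s_hat + eps)
    return w, v, s

# Hypothetical objective f(w) = w^2 with gradient 2w, minimized at w = 0.
w, v, s = 5.0, 0.0, 0.0
for t in range(1, 2001):
    dw = 2.0 * w
    w, v, s = adam_step(w, v, s, dw, t, lr=0.01)
```

Thanks to the bias correction, the very first step has a magnitude close to `lr` regardless of the gradient scale.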

Hyperparameters choice