Computing sqrt with Adagrad: derivations of SGD, Momentum, Adagrad, Adam, AdamW, RMSProp, LAMB, and Lion

taoqick

Last modified: 2024-02-04 11:56:55


First published: 2020-06-27 16:47:53

Article link: https://blog.csdn.net/taoqick/article/details/106981213


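Stochastic Gradient Descent (SGD)

Classic gradient descent has to go through all of the training data for every update of the model parameters. Stochastic gradient descent instead uses the loss on a single training sample to approximate the average loss.
$\theta_{t+1} = \theta_{t}-\eta g_t \quad \text{(Eq. 1)}$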

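Mini-Batch Gradient Descent

Instead of the single training sample used in SGD, use m samples per step. In practice this raises a few questions:

How to choose m: m is usually a power of 2 (for example 32, 64, or 128), which tends to be convenient for the hardware and for vectorized computation
How to pick the m training samples: to keep a particular ordering of the data from affecting convergence, the data is usually shuffled before each pass, and then m samples at a time are taken in the shuffled order
How to choose the learning rate (lr) $\eta$: start with a larger learning rate, and switch to a smaller one once training reaches a plateau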
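The three choices above can be sketched in a few lines. This is a minimal illustration, not code from the original post: the function name `minibatch_sgd`, the one-parameter model y ≈ w·x, the toy data, and the per-epoch decay factor are all assumptions made for the example.

```python
import random


def minibatch_sgd(data, m=4, lr=0.5, epochs=50, decay=0.95, seed=0):
    """Fit y ~ w * x with mini-batch SGD (hypothetical toy setup)."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)                      # reshuffle before every pass
        for start in range(0, len(data), m):   # take m samples at a time
            batch = data[start:start + m]
            # gradient of the batch mean of 0.5 * (w*x - y)^2 with respect to w
            g = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * g                        # Eq. 1 with a mini-batch gradient
        lr *= decay                            # simple decay: larger lr early, smaller later
    return w


# Noise-free toy data generated from y = 3x
points = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 17)]
w_hat = minibatch_sgd(points)
print(w_hat)
```

The decay here is a fixed per-epoch factor for simplicity; a schedule that lowers the rate only after the loss plateaus would match the text more closely.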

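Momentum: keep part of the previous direction

Put simply, the current gradient step is combined with a weighted copy of the previous step, so the update keeps some of its earlier velocity. Building on Eq. 1 above,

$v_t = \gamma v_{t-1} + \eta g_t \\ \theta_{t+1} = \theta_{t}-v_t \quad \text{(Eq. 2)}$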
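As a rough sketch of the momentum update above on a scalar parameter: the function name `momentum_descent` and the toy objective $J(\theta) = (\theta-2)^2$ are assumptions chosen for illustration.

```python
def momentum_descent(grad, theta0, lr=0.1, gamma=0.9, steps=500):
    """v_t = gamma * v_{t-1} + lr * g_t; theta_{t+1} = theta_t - v_t."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        v = gamma * v + lr * g   # keep a fraction gamma of the previous velocity
        theta = theta - v
    return theta


# Toy objective J(theta) = (theta - 2)^2, whose gradient is 2 * (theta - 2)
theta_star = momentum_descent(lambda t: 2.0 * (t - 2.0), theta0=10.0)
print(theta_star)
```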

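Adagrad: lower the learning rate using the accumulated sum of squared gradients

Well suited as the optimizer for the dense part of fine-ranking models in advertising and recommendation.
$\theta_{t+1} = \theta_{t}-\frac{\eta}{\sqrt{\text{sum of squared past gradients along this coordinate}}} g_t \quad \text{(Eq. 3)}$
Written out more precisely:
$\theta_{t+1} = \theta_{t}- \eta \frac{1}{\sqrt{v_t+\epsilon}}g_t \\ v_t = v_{t-1}+g_t^2 \\ g_t = \nabla_\theta J(\theta_t)+\alpha \theta_t$
where $\alpha \theta_t$ is the gradient of the L2 regularization term.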
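A minimal per-coordinate sketch of the expanded Adagrad update, in plain Python. The function name `adagrad` and the badly scaled toy objective are assumptions for illustration; `alpha` plays the role of the L2 coefficient in the formula.

```python
import math


def adagrad(grad, theta0, lr=0.5, eps=1e-10, alpha=0.0, steps=500):
    """Per-coordinate Adagrad; alpha is the L2 coefficient."""
    theta = list(theta0)
    v = [0.0] * len(theta)                  # running sum of squared gradients
    for _ in range(steps):
        g = [gi + alpha * ti for gi, ti in zip(grad(theta), theta)]
        for i, gi in enumerate(g):
            v[i] += gi * gi                 # v_t = v_{t-1} + g_t^2
            theta[i] -= lr * gi / math.sqrt(v[i] + eps)
    return theta


# Toy objective with very different curvature per coordinate:
# J = 0.5 * (100 * x^2 + y^2), so the gradient is (100x, y)
result = adagrad(lambda th: [100.0 * th[0], th[1]], [1.0, 1.0])
print(result)
```

Because each coordinate is divided by its own gradient history, the steep and the flat coordinate follow almost the same trajectory here, even though their gradients differ by a factor of 100.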

RMSProp (Root Mean Square Propagation): can be seen as a simplified Adam; it fixes Adagrad's second-moment accumulator growing too large

RMSProp is an unpublished adaptive learning rate optimizer proposed by Geoff Hinton.

$\theta_{t+1} = \theta_{t}-\eta \frac{g_t}{\sqrt{E(g_t^2)}} \quad \text{(Eq. 5)}$
Written out more precisely:
$\theta_{t+1} = \theta_{t}- \eta \frac{1}{\sqrt{v_t+\epsilon}}g_t \\ v_t = \beta v_{t-1}+(1-\beta)g_t^2 \\ g_t = \nabla_\theta J(\theta_t)+\alpha \theta_t$
where $\alpha \theta_t$ is the gradient of the L2 regularization term.
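The RMSProp update can be sketched on a scalar parameter as follows. The function name `rmsprop`, the hyperparameter values, and the toy objective are assumptions for illustration only.

```python
import math


def rmsprop(grad, theta0, lr=0.01, beta=0.9, eps=1e-8, alpha=0.0, steps=2000):
    """Scalar RMSProp; alpha is the L2 coefficient."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        g = grad(theta) + alpha * theta
        v = beta * v + (1.0 - beta) * g * g     # moving average of g^2, not a running sum
        theta -= lr * g / math.sqrt(v + eps)
    return theta


# Toy objective J(theta) = (theta - 2)^2
theta_hat = rmsprop(lambda t: 2.0 * (t - 2.0), theta0=5.0)
print(theta_hat)
```

Since $v_t \ge (1-\beta) g_t^2$, every step is at most $\eta/\sqrt{1-\beta}$ in magnitude, so with a fixed learning rate the iterate ends up hovering within roughly that distance of the minimum rather than converging exactly.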

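RMSPropV2: RMSProp with the $(1-\beta)$ factor dropped

Well suited as the optimizer for the dense part of fine-ranking models in advertising and recommendation.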
$\theta_{t+1} = \theta_{t}- \eta \frac{1}{\sqrt{v_t+\epsilon}}g_t \\ v_t = \beta v_{t-1}+g_t^2 \\ g_t = \nabla_\theta J(\theta_t)+\alpha \theta_t$
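To see what dropping $(1-\beta)$ changes, the small experiment below feeds a constant gradient into both accumulators. The helper name `accumulator_limit` and the values of `beta` and `g` are assumptions for illustration.

```python
def accumulator_limit(beta, g, keep_factor, steps=1000):
    """Run v = beta*v + c*g^2 with c = (1-beta) if keep_factor else 1."""
    c = (1.0 - beta) if keep_factor else 1.0
    v = 0.0
    for _ in range(steps):
        v = beta * v + c * g * g
    return v


v_rmsprop = accumulator_limit(0.9, 2.0, keep_factor=True)   # tends to g^2
v_v2 = accumulator_limit(0.9, 2.0, keep_factor=False)       # tends to g^2 / (1 - beta)
print(v_rmsprop, v_v2, v_v2 / v_rmsprop)
```

Under a steady gradient the variant's accumulator settles at $1/(1-\beta)$ times the RMSProp value, so for the same $\eta$ its steps are smaller by a factor of about $\sqrt{1-\beta}$.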

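Adam: keep the direction in a Momentum-like way and lower the learning rate in an Adagrad-like way

The direction adjustment from Momentum is expressed by the first moment $m_t=E(g_t)$, and the learning-rate adjustment from Adagrad by the second moment $v_t = E(g_t^2)$. How should the two moments be understood? The first moment estimates $E(g_t)$: since the current gradient $g_t$ is itself an estimate obtained from random sampling, what matters more is its expectation in the statistical sense. The second moment estimates $E(g_t^2)$; unlike AdaGrad, this is not the sum of $g_t^2$ from the start of training up to now, but its expectation. Their physical meaning is as follows. When $||m_t||$ is large and $v_t$ is large, the gradient is large and stable, which means a clear steep slope with an unambiguous direction to move in. When $||m_t||$ is near zero and $v_t$ is large, the gradient is unstable, which may indicate a ravine that easily causes bouncing back and forth. $||m_t||$ large with $v_t$ near zero cannot happen. When both $||m_t||$ and $v_t$ are near zero, the gradient is near zero: this may be a local minimum, or a nearly flat region, where the optimizer should avoid getting stuck on the plateau. In addition, Adam applies a bias correction to $m_t$ and $v_t$ to account for their zero initialization. **Adam's update can be viewed as a kind of signal-to-noise ratio!** Concretely, Adam's update rule is
$\theta_{t+1} = \theta_{t}-\eta \frac{E(g_t)}{\sqrt{E(g_t^2)}} \quad \text{(Eq. 4)}$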
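A minimal scalar sketch of the Adam update, including the bias correction for the zero-initialized moments. The function name `adam`, the hyperparameter values, and the toy objective are assumptions for illustration, and $\epsilon$ is added outside the square root, which is one common convention.

```python
import math


def adam(grad, theta0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Scalar Adam with bias-corrected first and second moments."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment, estimates E[g]
        v = beta2 * v + (1 - beta2) * g * g      # second moment, estimates E[g^2]
        m_hat = m / (1 - beta1 ** t)             # undo the bias from m_0 = 0
        v_hat = v / (1 - beta2 ** t)             # undo the bias from v_0 = 0
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective J(theta) = (theta - 2)^2
theta_hat = adam(lambda t: 2.0 * (t - 2.0), theta0=5.0)
print(theta_hat)
```

On the first step the ratio `m_hat / sqrt(v_hat)` is exactly the sign of the gradient, so the initial move has size `lr` regardless of the gradient's scale.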
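For an estimate of Adam's GPU memory footprint under mixed precision, see: https://zhuanlan.zhihu.com/p/624740065
Note: with small batches, mixed precision may bring no speedup and can even be slower. The computation on a small batch is already fast, so the bottleneck is I/O (transferring data between GPUs), while mixed precision adds FP16/FP32 conversions that cost extra time.
In PyTorch you can use NVIDIA's open-source APEX library, which supports mixed precision and distributed training; the snippet below uses PyTorch's built-in torch.cuda.amp.autocast context instead: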

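import contextlib
import torch

def maybe_autocast(enable_autocast, dtype=torch.float16):  # illustrative wrapper name
    # Return an autocast context when enabled, otherwise a no-op context.
    if enable_autocast:
        return torch.cuda.amp.autocast(dtype=dtype)
    return contextlib.nullcontext()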

TensorFlow is even simpler because it is officially supported: just set this environment variable before training:
export TF_ENABLE_AUTO_MIXED_PRECISION=1

AdamW

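AdamW is Adam + weight decay; SGD can of course use weight decay as well. The paper Decoupled Weight Decay Regularization points out that L2 regularization and weight decay are equivalent for SGD (up to rescaling the coefficient by the learning rate), whereas for Adam, weight decay guards against overfitting better than the corresponding L2 regularization. The screenshot comes from that paper:
[Image: screenshot from the paper]
Written out more precisely:
$\theta_{t+1} = \theta_{t}- \eta\left( \frac{\alpha \hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}+\lambda\theta_t\right) \\ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\ m_t = \beta_1 m_{t-1}+(1-\beta_1)g_t \\ v_t = \beta_2 v_{t-1}+(1-\beta_2)g_t^2 \\ g_t = \nabla_\theta J(\theta_t)$
Here $g_t$ contains only the loss gradient, and the decay term $\lambda\theta_t$ is applied directly in the parameter update. Adding $\lambda \theta_t$ to $g_t$ instead gives Adam with L2 regularization.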
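The difference between decoupled weight decay and L2 regularization inside Adam can be sketched with a single scalar step. The helper `adam_step`, its `state` dictionary, and the numbers are assumptions for illustration; the extra step-size factor $\alpha$ is taken to be 1.

```python
import math


def adam_step(theta, g_loss, state, lr=0.01, beta1=0.9, beta2=0.999,
              eps=1e-8, lam=0.1, decoupled=True):
    """One scalar update: decoupled=True is AdamW, False is Adam with L2."""
    g = g_loss if decoupled else g_loss + lam * theta   # L2 enters through the gradient
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    update = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        update += lam * theta                           # decay bypasses the moment estimates
    return theta - lr * update


# With a zero loss gradient, only the regularizer moves the weight.
w_adamw = adam_step(2.0, 0.0, {"t": 0, "m": 0.0, "v": 0.0}, decoupled=True)
w_l2 = adam_step(2.0, 0.0, {"t": 0, "m": 0.0, "v": 0.0}, decoupled=False)
print(w_adamw, w_l2)
```

In the decoupled case the weight shrinks by `lr * lam * theta`. In the L2 case the regularizer's gradient is normalized by the second moment, so this first step has size about `lr` no matter how small `lam` is.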

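Amos: ideas for improving Adam

For ideas on improving Adam, see the discussion of the Amos paper (https://kexue.fm/archives/9344#mjx-eqn-eq%3Aalpha-rho-3). There are two main ways to reduce an optimizer's memory footprint: drop the momentum, or apply a low-rank factorization to the second moment. Amos essentially follows these same two ideas.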

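LAMB (Layer-wise Adaptive Moments optimizer for Batch training): the update magnitude of each layer's parameters should be governed by the norm of $\theta_t$

This formula is taken from 科学空间 (kexue.fm), but it is not fully accurate:
$g_t = \nabla_{\theta} L\\ h_t = f(g_{\le t}) \\ \theta_{t+1} = \theta_{t}- \eta \cdot g_t \cdot \frac{||\theta_t||_2}{||h_t||_2} \quad \text{(Eq. 6)}$
The paper Large Batch Optimization for Deep Learning: Training BERT in 76 minutes reports that LAMB works better than Adam when the batch size is large (in the thousands to tens of thousands).
The precise form is taken from the paper:
[Image: screenshot of the LAMB algorithm from the paper]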
A few notes on this figure. About LARS, the paper says: "The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains RESNET on ImageNet in a few minutes. However, LARS performs poorly for attention models like BERT."
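On the LAMB formula: $x_t$ is the model parameters, i.e. $\theta_t$, and $S_t$ is a sampled batch. Seen this way, the term $\frac{r_t^{(i)}+\lambda x_t^{(i)}}{||r_t^{(i)}+\lambda x_t^{(i)}||}$ in LAMB simply normalizes Adam's whole update expression. The extra $\phi(||x_t^{(i)}||)$ in LAMB is explained in the paper as shown in the figure below; in effect it clips its argument between a minimum and a maximum value.
[Image: screenshot of the paper's explanation of $\phi$]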
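The layer-wise normalization described above can be sketched for a single layer as follows. The function names, the clipping bounds `lo` and `hi`, the zero-norm guard, and the sample vectors are assumptions for illustration; `r` stands for the Adam-style ratio $r_t^{(i)}$ and is passed in rather than computed.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def lamb_layer_update(x, r, lr=0.1, lam=0.01, lo=0.0, hi=10.0):
    """x <- x - lr * phi(||x||) * (r + lam*x) / ||r + lam*x|| for one layer."""
    u = [ri + lam * xi for ri, xi in zip(r, x)]   # Adam ratio plus weight decay
    phi = min(max(l2_norm(x), lo), hi)            # phi as a clip of the weight norm (assumed form)
    u_norm = l2_norm(u)
    scale = phi / u_norm if u_norm > 0 else 1.0   # guard added for this sketch
    return [xi - lr * scale * ui for xi, ui in zip(x, u)]


x = [3.0, 4.0]                                    # one layer's weights, ||x|| = 5
small = lamb_layer_update(x, [0.001, 0.002])
large = lamb_layer_update(x, [100.0, 200.0])
step_small = l2_norm([a - b for a, b in zip(x, small)])
step_large = l2_norm([a - b for a, b in zip(x, large)])
print(step_small, step_large)
```

Both calls move the layer by `lr * ||x||`, whatever the scale of `r`: the direction comes from the Adam-style update, while the step length is set by the norm of the layer's weights.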
On what mixed batch means, here is a passage from the paper: To obtain further improvements, we use the Mixed-Batch Training procedure with LAMB. Recall that BERT training involves two stages: the first 9/10 of the total epochs use a sequence length of 128, while the last 1/10 of the total epochs use a sequence length of 512. For the second stage training, which involves a longer sequence length, due to memory limits, a maximum batch size of only 32768 can be used on a TPUv3 Pod. However, we can potentially use a larger batch size for the first stage because of a shorter sequence length. In particular, the batch size can be increased to 131072 for the first stage. However, we did not observe any speedup by increasing the batch size from 65536 to 131072 for the first stage, thus, we restrict the batch size to 65536 for this stage. By using this strategy, we are able to make full utilization of the hardware resources throughout the training.

Lion

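Lion (which still keeps the momentum idea) improves on Adam in two ways:

It drops the second moment
It introduces the sign function (1 for values greater than 0, -1 for values less than 0)
Lion's drawback is that it performs worse than AdamW at small batch sizes (below 64)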
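A minimal scalar sketch of a Lion-style update, written from the two points above: momentum is the only state, and the step direction is a sign. The function name `lion`, the interpolation coefficients `beta1` and `beta2`, and the toy objective are assumptions for illustration.

```python
def sign(z):
    return (z > 0) - (z < 0)


def lion(grad, theta0, lr=0.01, beta1=0.9, beta2=0.99, lam=0.0, steps=50):
    """Return the list of iterates of a scalar Lion-style optimizer."""
    theta, m = theta0, 0.0
    path = [theta]
    for _ in range(steps):
        g = grad(theta)
        c = beta1 * m + (1 - beta1) * g          # mix momentum with the current gradient
        theta -= lr * (sign(c) + lam * theta)    # the sign fixes the step size at lr
        m = beta2 * m + (1 - beta2) * g          # no second moment is stored
        path.append(theta)
    return path


# Toy objective J(theta) = (theta - 2)^2
path = lion(lambda t: 2.0 * (t - 2.0), theta0=5.0)
print(path[-1])
```

With `lam=0`, every step has magnitude exactly `lr`, so the learning rate alone sets how far the parameters move per update.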
The following is reproduced from https://kexue.fm/archives/9473:

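Coding exercise: computing sqrt(a) with Adagrad

At first sight this problem seems crazy: Adagrad's formula itself uses sqrt, and now it is supposed to compute sqrt. Why not use Newton's method or something similar? But the real point of the problem is to test your understanding of Adagrad, which is just an iteration after all. The problem breaks into two steps:

Write the right loss function: the loss must attain its extremum exactly at the parameter value we are looking for. So the loss function for this problem is $L(x) = (x^2-a)^2$, not $L(x) = (x^2-a)$, because the latter attains its extremum at x = 0 rather than at $\sqrt{a}$
The derivative is $g(x)=4x^3-4ax$. When iterating, note that L(x) is not convex: it has two minima, at $\pm\sqrt{a}$; take the positive one (for example by starting from x > 0). Also, do not set the learning rate too large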
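As a quick sanity check of the derivative, it can be compared against a central finite difference. The helper name `numeric_grad`, the step `h`, and the test points are assumptions for illustration.

```python
def L(x, a):
    return (x**2 - a)**2


def g(x, a):
    return 4*x**3 - 4*a*x


def numeric_grad(f, x, h=1e-6):
    # central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2*h)


a = 9.0
for x in (0.5, 3.0, 10.0):
    print(x, g(x, a), numeric_grad(lambda z: L(z, a), x))
```

Note that the gradient is zero at x = 0 as well as at x = 3. The point x = 0 is a local maximum of L, so an iteration started exactly there would never move.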

Here is the code:

import math


class Solution:
    def sqrt_adagrad(self, a, eps, lr=0.5, x0=10.0):
        """Approximate sqrt(a) by minimizing L(x) = (x^2 - a)^2 with Adagrad."""
        def L(x):
            return (x**2 - a)**2

        def g(x):
            # dL/dx = 4x^3 - 4ax
            return (x**3 - a*x)*4

        x, l = x0, L(x0)
        v = 0.0  # running sum of squared gradients
        while True:
            g_cur = g(x)
            v += g_cur**2
            # Adagrad step: the accumulated squared gradients shrink the step size
            nx = x - lr*g_cur/math.sqrt(v + 1e-12)
            nl = L(nx)
            if abs(nl - l) <= eps:  # stop once the loss barely changes
                return nx
            x, l = nx, nl

    def sqrt_newton(self, a, eps):
        """Newton's method on x^2 - a = 0, for comparison (assumes a > 0)."""
        x = a
        while True:
            nx = (x + a/x)/2
            if abs(nx - x) < eps:
                return nx
            x = nx


s = Solution()
print(s.sqrt_adagrad(9, 1e-10))
print(s.sqrt_newton(9, 1e-10))

A PyTorch implementation of the same problem looks like this:

import torch
import torch.nn as nn
# Minimize f(x) = (x^2-a)^2
'''
x = torch.tensor([0.1], requires_grad=True) # x needs gradients
a = torch.tensor([3.0])
optimizer = torch.optim.Adagrad(params=[x],lr = 1e-5)
def f(x):
    result = torch.pow((torch.pow(x,2)-a),2)
    return result
for i in range(30000):
    optimizer.zero_grad() # without this, x.grad accumulates across iterations
    y = f(x)
    y.backward()
    print('x={} y={} dydx={}'.format(x.data, f(x).data, x.grad))
    optimizer.step()
print('x={} y={} dydx={}'.format(x.data, f(x).data, x.grad))
'''


# Do the same thing with nn.Module
class SqrtLoss(nn.Module):
    def __init__(self):
        super(SqrtLoss, self).__init__()
        self.x = nn.Parameter(torch.tensor([5.0]), requires_grad=True)
    def forward(self, a):
        return torch.pow((torch.pow(self.x,2)-a), 2)
loss = SqrtLoss()
optimizer = torch.optim.Adagrad(params=loss.parameters(), lr = 1e-2)
for i in range(30000):
    optimizer.zero_grad() # without this, x.grad accumulates across iterations
    y = loss(2)
    y.backward()
    print('y={} x={}'.format(y, loss.x))
    optimizer.step()

'''
# Linear regression with nn.Module
class LinearModel(nn.Module):
    def __init__(self):
        super(LinearModel, self).__init__()
        self.fc = nn.Linear(in_features=3, out_features=1, bias=True)
    def forward(self, input):
        return self.fc(input)
linearModel = LinearModel()
loss = nn.MSELoss()
optimizer = torch.optim.Adagrad(params=linearModel.parameters(), lr = 1e-2)
for i in range(30000):
    optimizer.zero_grad() # without this, the gradients accumulate across iterations
    y = loss(input=linearModel(torch.Tensor([1,2,3])),target=torch.Tensor([12]))
    y.backward()
    optimizer.step()
print(linearModel(torch.Tensor([1,2,3])))
'''