【优化器】(五) Adam原理 & pytorch代码解析

Lcm_Tech

已于 2024-01-15 18:40:31 修改

阅读量8.1k

点赞数 5

分类专栏：优化器深度学习文章标签： pytorch 人工智能 python

于 2023-07-18 14:47:39 首次发布

本文链接：https://blog.csdn.net/Lizhi_Tech/article/details/131783056

版权

深度学习同时被 2 个专栏收录

22 篇文章 22 订阅

订阅专栏

优化器

7 篇文章 8 订阅

订阅专栏

1.简介

在之前的文章了，我们学习了SGD，以及在其基础上加了一阶动量的SGD Momentum，还有在其基础上加了二阶动量的AdaGrad、AdaDelta、RMSProp。那么自然而然就会想到把一阶动量和二阶动量结合起来，这样就形成了我们常用的优化器Adam：Adaptive+Momentum。

2.Adam

类似于Momentum，Adam使用了一阶动量的窗口衰减累加，公式如下：

$m_{t}=\beta_{1}\cdot m_{t-1}+(1-\beta _{1}))\cdot g_{t}$

其中， $m_{t}$ 为当前step的一阶动量， $m_{t-1}$ 为上一个step的一阶动量， $\beta _{1}$ 为历史一阶动量保留率

类似于RMSProp，Adam使用了二阶动量的窗口衰减累加，公式如下：

$V_{t}=\beta_{2}\cdot V_{t-1}+(1-\beta _{2}))\cdot g_{t}^{2}$

其中， $V_{t}$ 为当前step的二阶动量， $V_{t-1}$ 为上一个step的二阶动量， $\beta _{2}$ 为历史二阶动量的衰减率。

最终的更新公式为：

$\Delta w _{t}=\alpha \cdot \frac{m_{t}}{\sqrt{V_{t}+\epsilon }}$

其中， $\epsilon$ 为增加分母稳定性的系数， $\alpha$ 为学习率。

3.思考

其实我们回过头看，Adam其实存在可能不收敛的问题。因为对SGD而言，学习率是固定的，但是一般会使用scheduler使得学习率不断衰减，所以最终是会收敛的。而对AdaGrad而言，二阶动量的不断累加会导致学习率不断趋近于0最终收敛（虽然不一定能到最优解）。但是对Adam来说，由于二阶动量的突变性，可能在后期时引起学习率的震荡。

4.pytorch代码

Adam的伪代码流程如下，可以看到两个输入的参数 $\beta _{1}$ 和 $\beta _{2}$ 就是我们常设置的两个超参数，一班设为0.9和0.999。同时我们可以看到伪代码里分别对 $m_{t}$ 和 $V_{t}$ 除以 $1-\beta$ ，这一步其实是在做偏差纠正，因为当 $t=0$ 的时候，并没有历史向量，所以 $m_{t}$ 和 $V_{t}$ 的系数直接为 $1-\beta$ ，这就意味着当 $\beta$ 较大的时候，更新速度较慢，所以除以一个系数来进行纠正。

以下代码为pytorch官方Adam的代码。

def _single_tensor_adam(params: List[Tensor],
                        grads: List[Tensor],
                        exp_avgs: List[Tensor],
                        exp_avg_sqs: List[Tensor],
                        max_exp_avg_sqs: List[Tensor],
                        state_steps: List[Tensor],
                        grad_scale: Optional[Tensor],
                        found_inf: Optional[Tensor],
                        *,
                        amsgrad: bool,
                        beta1: float,
                        beta2: float,
                        lr: float,
                        weight_decay: float,
                        eps: float,
                        maximize: bool,
                        capturable: bool,
                        differentiable: bool):

    assert grad_scale is None and found_inf is None

    for i, param in enumerate(params):

        grad = grads[i] if not maximize else -grads[i]
        exp_avg = exp_avgs[i]
        exp_avg_sq = exp_avg_sqs[i]
        step_t = state_steps[i]

        if capturable:
            assert param.is_cuda and step_t.is_cuda, "If capturable=True, params and state_steps must be CUDA tensors."

        # update step
        step_t += 1

        if weight_decay != 0:
            grad = grad.add(param, alpha=weight_decay)

        if torch.is_complex(param):
            grad = torch.view_as_real(grad)
            exp_avg = torch.view_as_real(exp_avg)
            exp_avg_sq = torch.view_as_real(exp_avg_sq)
            param = torch.view_as_real(param)

        # Decay the first and second moment running average coefficient
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)

        if capturable or differentiable:
            step = step_t

            # 1 - beta1 ** step can't be captured in a CUDA graph, even if step is a CUDA tensor
            # (incurs "RuntimeError: CUDA error: operation not permitted when stream is capturing")
            bias_correction1 = 1 - torch.pow(beta1, step)
            bias_correction2 = 1 - torch.pow(beta2, step)

            step_size = lr / bias_correction1
            step_size_neg = step_size.neg()

            bias_correction2_sqrt = bias_correction2.sqrt()

            if amsgrad:
                # Maintains the maximum of all 2nd moment running avg. till now
                if differentiable:
                    max_exp_avg_sqs_i = max_exp_avg_sqs[i].clone()
                else:
                    max_exp_avg_sqs_i = max_exp_avg_sqs[i]
                max_exp_avg_sqs[i].copy_(torch.maximum(max_exp_avg_sqs_i, exp_avg_sq))
                # Uses the max. for normalizing running avg. of gradient
                # Folds in (admittedly ugly) 1-elem step_size math here to avoid extra param-set-sized read+write
                # (can't fold it into addcdiv_ below because addcdiv_ requires value is a Number, not a Tensor)
                denom = (max_exp_avg_sqs[i].sqrt() / (bias_correction2_sqrt * step_size_neg)).add_(eps / step_size_neg)
            else:
                denom = (exp_avg_sq.sqrt() / (bias_correction2_sqrt * step_size_neg)).add_(eps / step_size_neg)

            param.addcdiv_(exp_avg, denom)
        else:
            step = _get_value(step_t)

            bias_correction1 = 1 - beta1 ** step
            bias_correction2 = 1 - beta2 ** step

            step_size = lr / bias_correction1

            bias_correction2_sqrt = _dispatch_sqrt(bias_correction2)

            if amsgrad:
                # Maintains the maximum of all 2nd moment running avg. till now
                torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])
                # Use the max. for normalizing running avg. of gradient
                denom = (max_exp_avg_sqs[i].sqrt() / bias_correction2_sqrt).add_(eps)
            else:
                denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)

            param.addcdiv_(exp_avg, denom, value=-step_size)

业务合作/学习交流+v：lizhiTechnology

如果想要了解更多优化器相关知识，可以参考我的专栏和其他相关文章：

优化器_Lcm_Tech的博客-CSDN博客

【优化器】(一) SGD原理 & pytorch代码解析_sgd优化器-CSDN博客

【优化器】(二) AdaGrad原理 & pytorch代码解析_adagrad优化器-CSDN博客

【优化器】(三) RMSProp原理 & pytorch代码解析_rmsprop优化器-CSDN博客

【优化器】(四) AdaDelta原理 & pytorch代码解析_adadelta里rho越大越敏感-CSDN博客

【优化器】(五) Adam原理 & pytorch代码解析_adam优化器-CSDN博客

【优化器】(六) AdamW原理 & pytorch代码解析-CSDN博客

【优化器】(七) 优化器统一框架 & 总结分析_mosec优化器优点-CSDN博客

如果想要了解更多深度学习相关知识，可以参考我的其他文章：

【损失函数】(一) L1Loss原理 & pytorch代码解析_l1 loss-CSDN博客

【图像生成】(一) DNN 原理 & pytorch代码实例_pytorch dnn代码-CSDN博客

Lcm_Tech

关注

5
点赞
踩
37

收藏

觉得还不错? 一键收藏
4
评论
【优化器】(五) Adam原理 & pytorch代码解析

在之前的文章了，我们学习了SGD，以及在其基础上加了一阶动量的SGD Momentum，还有在其基础上加了二阶动量的AdaGrad、AdaDelta、RMSProp。那么自然而然就会想到把一阶动量和二阶动量结合起来，这样就形成了我们常用的优化器Adam
复制链接

扫一扫