[Deep Learning] Optimizers

Types of Optimization Algorithms

Optimization algorithms fall into two types, first-order methods and second-order methods:

| | First-order methods | Second-order methods |
| --- | --- | --- |
| Representative algorithms | SGD, momentum-based SGD, Nesterov momentum SGD, Adagrad, Adadelta, RMSProp, Adam | Newton's method |
| Computational difficulty | relatively low | relatively high |
| Adoption | mainstream | rarely used |

Comparison of First-Order Methods

| | SGD / momentum-based SGD / Nesterov momentum SGD | Adagrad / Adadelta / RMSProp / Adam |
| --- | --- | --- |
| Adoption | the most widespread | — |
| Training speed | — | — |
| Model results | reliable | reliable |

SGD, momentum-based SGD, and Nesterov momentum SGD perform similarly to one another.
Adagrad, Adadelta, RMSProp, and Adam perform similarly to one another.
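All seven first-order methods above are available in PyTorch's `torch.optim` module, so they can be swapped without changing the training loop. The snippet below is only a sketch: the toy linear model, the random data, and the hyperparameter values are assumptions for illustration, not tuned recommendations.

```python
import torch
import torch.nn as nn

# A toy model, just to have parameters to optimize (illustrative only).
model = nn.Linear(10, 1)

# The seven first-order methods as exposed by torch.optim; hyperparameters
# shown here are common defaults, not tuned values.
optimizers = {
    "SGD":          torch.optim.SGD(model.parameters(), lr=0.01),
    "Momentum SGD": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    "Nesterov":     torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True),
    "Adagrad":      torch.optim.Adagrad(model.parameters(), lr=0.01),
    "Adadelta":     torch.optim.Adadelta(model.parameters(), rho=0.95, eps=1e-6),
    "RMSProp":      torch.optim.RMSprop(model.parameters(), lr=0.001),
    "Adam":         torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999)),
}

# The training step is identical no matter which optimizer is picked.
x, y = torch.randn(32, 10), torch.randn(32, 1)
opt = optimizers["Adam"]
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```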

First-Order Methods

$\omega$: the parameters to be learned;
$\eta$: the learning rate;
$g$: the first-order gradient;
$t$: the index of the $t$-th training iteration.

Stochastic Gradient Descent

Stochastic Gradient Descent, commonly abbreviated as SGD:

$$\omega_{t} \leftarrow \omega_{t-1} - \eta \cdot g$$
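As a minimal sketch of this update (not from the original post), the rule can be written directly in NumPy; the function name `sgd_step` and the toy objective $f(\omega) = \lVert\omega\rVert^{2}$ are illustrative assumptions.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """One SGD update: w_t <- w_{t-1} - eta * g."""
    return w - lr * grad

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, 2 * w, lr=0.1)
print(w)  # approaches [0, 0]
```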

Momentum-Based SGD

Because plain SGD updates can oscillate, momentum information accumulated over the previous iterations is used to assist the parameter update:

$$v_{t} \leftarrow \mu \cdot v_{t-1} - \eta \cdot g$$

$$\omega_{t} \leftarrow \omega_{t-1} + v_{t}$$

$\mu$: the momentum factor, which controls how strongly the accumulated momentum influences the overall update. It can be set statically (fixed at 0.9) or dynamically (starting at 0.5 and gradually increasing to 0.9 or 0.99).
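A minimal NumPy sketch of the two update rules above, under the same illustrative assumptions (toy quadratic objective, hypothetical function name):

```python
import numpy as np

def momentum_sgd_step(w, v, grad, lr=0.01, mu=0.9):
    """Momentum SGD: v_t <- mu * v_{t-1} - eta * g;  w_t <- w_{t-1} + v_t."""
    v = mu * v - lr * grad
    return w + v, v

# Toy usage on f(w) = ||w||^2 (gradient 2w); the velocity v starts at zero.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = momentum_sgd_step(w, v, 2 * w, lr=0.1, mu=0.9)
```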

Nesterov Momentum SGD

Relatively rarely used, so it is omitted here.

Adagrad

Adagrad dynamically adjusts the learning rate as the number of training iterations grows:

$$\eta_{t} \leftarrow \frac{\eta_{global}}{\sqrt{\sum_{t'=1}^{t} g_{t'}^{2} + \epsilon}} \cdot g_{t}$$

$\eta_{global}$: the global learning rate (must be specified in advance);
$\epsilon$: a small constant that keeps the denominator from being zero.

Initially, $\eta_{t}$ is close to $\eta_{global}$; as $\sum_{t'=1}^{t} g_{t'}^{2}$ keeps growing, $\eta_{t}$ gradually decays toward 0.
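A minimal NumPy sketch of Adagrad under the same toy setup. Note that $\eta_{t}$ as written above already contains the gradient factor, so the parameter update is simply $\omega_{t} \leftarrow \omega_{t-1} - \eta_{t}$; that update step is an assumption consistent with the formula, not stated explicitly in the original.

```python
import numpy as np

def adagrad_step(w, r, grad, lr_global=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients in r, then scale the step per dimension."""
    r = r + grad ** 2                           # running sum of g^2 over all iterations
    step = lr_global / np.sqrt(r + eps) * grad  # this is eta_t in the text
    return w - step, r

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w, r = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, r = adagrad_step(w, r, 2 * w)
```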

Adadelta

Adadelta builds on Adagrad by introducing a decay factor $\rho$, so that the recent history of $g$, rather than a hand-picked $\eta_{global}$ alone, determines $\eta_{t}$; the global learning rate no longer dominates and in fact drops out of the update entirely:

$$r_{t} \leftarrow \rho \cdot r_{t-1} + (1-\rho) \cdot g^{2}$$

$$\eta_{t} \leftarrow \frac{\sqrt{s_{t-1} + \epsilon}}{\sqrt{r_{t} + \epsilon}}$$

$$s_{t} \leftarrow \rho \cdot s_{t-1} + (1-\rho) \cdot (\eta_{t} \cdot g)^{2}$$

$\rho$: the decay factor, in the range [0, 1]; larger values promote faster updates, and 0.95 is recommended;
$\epsilon$: keeps the denominators from being zero; $10^{-6}$ is recommended.
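A minimal NumPy sketch of Adadelta. The weight update $\omega_{t} \leftarrow \omega_{t-1} - \eta_{t} \cdot g$ is implied rather than written out above, and the toy objective and function name are again illustrative assumptions.

```python
import numpy as np

def adadelta_step(w, r, s, grad, rho=0.95, eps=1e-6):
    """Adadelta: r tracks decayed g^2, s tracks decayed squared updates; no global lr."""
    r = rho * r + (1 - rho) * grad ** 2
    eta = np.sqrt(s + eps) / np.sqrt(r + eps)   # per-dimension step size, uses s_{t-1}
    update = eta * grad
    s = rho * s + (1 - rho) * update ** 2
    return w - update, r, s

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w, r, s = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for _ in range(200):
    w, r, s = adadelta_step(w, r, s, 2 * w)
```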

RMSProp

Relatively rarely used, so it is omitted here.

Adam

Adam adds a momentum term on top of RMSProp, using first-moment and second-moment estimates of the gradient to dynamically adjust a per-parameter learning rate.

Adam and Nadam pull the preceding ideas together: SGD-M adds first-order momentum on top of SGD, while AdaGrad and AdaDelta add second-order momentum on top of SGD. Using both the first-order momentum (controlled by $\beta_{1}$) and the second-order momentum (controlled by $\beta_{2}$) yields Adam (Adaptive + Momentum).

$$m_{t} \leftarrow \beta_{1} \cdot m_{t-1} + (1 - \beta_{1}) \cdot g$$

$$v_{t} \leftarrow \beta_{2} \cdot v_{t-1} + (1 - \beta_{2}) \cdot g^{2}$$

$$\omega_{t} \leftarrow \omega_{t-1} - \eta \cdot \frac{m_{t}}{\sqrt{v_{t}}}$$

Advantage: after bias correction, the learning rate at each iteration stays within a definite range, which makes the parameter updates relatively stable.
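A minimal NumPy sketch of Adam, including the bias correction mentioned above; the correction terms $\hat{m}_{t} = m_{t}/(1-\beta_{1}^{t})$ and $\hat{v}_{t} = v_{t}/(1-\beta_{2}^{t})$ are standard Adam but are not written out in the formulas above. The toy objective and function name are illustrative assumptions.

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: EMAs of the gradient (m) and squared gradient (v), with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)            # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)            # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage on f(w) = ||w||^2 (gradient 2w); t starts at 1 for the correction terms.
w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 201):
    w, m, v = adam_step(w, m, v, 2 * w, t)
```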


