When training deep learning models, a number of hyperparameters have to be configured, and they are usually set by experience, which is not friendly to beginners. Learning rate schedulers (collectively called LRScheduler) were introduced to adjust the learning rate dynamically during training. Here we recommend a very popular scheduler: CosineLRScheduler.
- This scheduler comes from the paper SGDR: Stochastic Gradient Descent with Warm Restarts; the implementation can be found in cosine.py.
⚠️ Note: in the paper this schedule is called SGDR, but in practice it is usually referred to as the cosine scheduler. The two are essentially the same, with only minor implementation differences.
(The paper includes a figure of the SGDR training schedule with warm restarts; the image is omitted here.)
1、CosineLRScheduler
from timm.scheduler.cosine_lr import CosineLRScheduler

CosineLRScheduler(optimizer: Optimizer, t_initial: int, t_mul: float = 1.0, lr_min: float,
                  decay_rate: float = 1.0, warmup_t, warmup_lr_init, warmup_prefix=False,
                  cycle_limit, t_in_epochs=True, noise_range_t=None, noise_pct,
                  noise_std=1.0, noise_seed=42, initialize=True)  # subclasses timm's Scheduler base class
CosineLRScheduler accepts an optimizer and a few hyperparameters. We will first see how to train a model with the cosine scheduler using the timm training script (as in the timm training docs), and then how to use this scheduler as a standalone scheduler in a custom training script (a minimal sketch follows the parameter list below).
Using the cosine scheduler with the timm training script
To train a model with the cosine scheduler, we simply update the training script args by passing --sched cosine along with the necessary hyperparameters. In this section we will also see how each hyperparameter changes the cosine schedule.
- t_initial: The initial number of epochs, e.g. 50, 100.
- t_mul: Defaults to 1.0. Updates the SGDR schedule annealing; it is the factor by which the cycle length is multiplied after each restart (analogous to T_mult in CosineAnnealingWarmRestarts below).
- lr_min: Defaults to 1e-5. The minimum learning rate to use during training; the learning rate never drops below this value.
- decay_rate: When decay_rate is between 0 and 1, the learning rate at every restart is decayed to lr * decay_rate. So with decay_rate=0.5, the new learning rate becomes half the initial lr.
- warmup_t: The number of warmup epochs.
- warmup_lr_init: The initial learning rate during warmup.
- warmup_prefix: Defaults to False. If set to True, every new epoch number is computed as epoch = epoch - warmup_t.
- cycle_limit: The maximum number of restarts in SGDR.
- t_in_epochs: If set to False, the learning rates returned for epoch t are None.
- initialize: Defaults to True. If set to True, an initial_lr attribute is set on each param group.
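Below is a minimal sketch of using CosineLRScheduler as a standalone scheduler in a custom training loop, as mentioned above. It assumes the constructor signature shown earlier; the linear model, the SGD optimizer, and the concrete hyperparameter values are placeholders chosen only for illustration.

```python
import torch
from timm.scheduler.cosine_lr import CosineLRScheduler

# Placeholder model and optimizer, only to have parameters to schedule.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

scheduler = CosineLRScheduler(
    optimizer,
    t_initial=50,         # length of the first cosine cycle, in epochs
    lr_min=1e-5,          # the learning rate never drops below this value
    warmup_t=5,           # 5 warmup epochs ...
    warmup_lr_init=1e-4,  # ... starting from this learning rate
)

num_epochs = 50
for epoch in range(num_epochs):
    # ... run one epoch of training here ...
    # timm schedulers are stepped with an explicit epoch index,
    # unlike torch.optim schedulers, which keep an internal counter.
    scheduler.step(epoch + 1)
```

For per-batch stepping, timm schedulers also provide step_update(num_updates); it only changes the learning rate when the scheduler is configured to run on updates rather than epochs (t_in_epochs=False).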
2、CosineAnnealingLR
⚠️ Note: CosineAnnealingLR only implements the cosine annealing part of SGDR, not the restarts. For the full version with restarts, see CosineAnnealingWarmRestarts below.
Set the learning rate of each parameter group using a cosine annealing schedule, where $\eta_{max}$ is set to the initial lr and $T_{cur}$ is the number of epochs since the last restart in SGDR:
$$
\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max}-\eta_{min})\Big(1+\cos\Big(\frac{T_{cur}}{T_{max}}\pi\Big)\Big), \qquad T_{cur} \neq (2k+1)T_{max};
$$

$$
\eta_{t+1} = \eta_t + \frac{1}{2}(\eta_{max}-\eta_{min})\Big(1-\cos\Big(\frac{1}{T_{max}}\pi\Big)\Big), \qquad T_{cur} = (2k+1)T_{max}.
$$
When last_epoch=-1, sets initial lr as lr. Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:
$$
\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max}-\eta_{min})\Big(1+\cos\Big(\frac{T_{cur}}{T_{max}}\pi\Big)\Big)
$$
from torch.optim import lr_scheduler

lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1, verbose=False)
Parameters:
optimizer (Optimizer) – Wrapped optimizer.
T_max (int) – Maximum number of iterations.
eta_min (float) – Minimum learning rate. Default: 0.
last_epoch (int) – The index of last epoch. Default: -1.
verbose (bool) – If True, prints a message to stdout for each update. Default: False.
- get_last_lr()
  Return last computed learning rate by current scheduler.
- load_state_dict(state_dict)
  Loads the scheduler's state.
  Parameters: state_dict (dict) – scheduler state. Should be an object returned from a call to state_dict().
- print_lr(is_verbose, group, lr, epoch=None)
  Display the current learning rate.
- state_dict()
  Returns the state of the scheduler as a dict. It contains an entry for every variable in self.__dict__ which is not the optimizer.
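As a complement to the parameter and method lists above, here is a minimal usage sketch for CosineAnnealingLR; the model, optimizer, and the specific choices T_max=100 and eta_min=1e-5 are illustrative placeholders.

```python
import torch
from torch.optim import SGD, lr_scheduler

# Placeholder model and optimizer for illustration.
model = torch.nn.Linear(10, 2)
optimizer = SGD(model.parameters(), lr=0.1)

# Anneal the LR from 0.1 toward eta_min over T_max epochs, following the cosine curve above.
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

for epoch in range(100):
    # ... one epoch of training ...
    optimizer.step()    # optimizer.step() should be called before scheduler.step()
    scheduler.step()    # advances T_cur by one epoch
print(scheduler.get_last_lr())  # reaches eta_min after T_max epochs
```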
3、CosineAnnealingWarmRestarts
from torch.optim import lr_scheduler

lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0, T_mult=1, eta_min=0, last_epoch=-1, verbose=False)
Parameters:
optimizer (Optimizer) – Wrapped optimizer.
T_0 (int) – Number of iterations for the first restart.
T_mult (int, optional) – A factor by which T_i increases after a restart. Default: 1.
eta_min (float, optional) – Minimum learning rate. Default: 0.
last_epoch (int, optional) – The index of last epoch. Default: -1.
verbose (bool) – If True, prints a message to stdout for each update. Default: False.
Set the learning rate of each parameter group using a cosine annealing schedule, where $\eta_{max}$ is set to the initial lr, $T_{cur}$ is the number of epochs since the last restart, and $T_i$ is the number of epochs between two warm restarts in SGDR:
$$
\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max}-\eta_{min})\Big(1+\cos\Big(\frac{T_{cur}}{T_i}\pi\Big)\Big)
$$
When $T_{cur}=T_i$, set $\eta_t=\eta_{min}$. When $T_{cur}=0$ after a restart, set $\eta_t=\eta_{max}$.
- get_last_lr()
  Return last computed learning rate by current scheduler.
- load_state_dict(state_dict)
  Loads the scheduler's state.
  Parameters: state_dict (dict) – scheduler state. Should be an object returned from a call to state_dict().
- print_lr(is_verbose, group, lr, epoch=None)
  Display the current learning rate.
- state_dict()
  Returns the state of the scheduler as a dict. It contains an entry for every variable in self.__dict__ which is not the optimizer.
- step(epoch=None)
  Step could be called after every batch update.
Step Example
-----------------------------------------------------------------
""" called after every batch update """
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0, T_mult)
iters = len(dataloader)
for epoch in range(20):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample['inputs'], sample['labels']
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # A fractional epoch value (epoch + i / iters) advances the cosine
        # schedule smoothly after every batch instead of once per epoch.
        scheduler.step(epoch + i / iters)
-----------------------------------------------------------------
""" called in an interleaved way. """
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0, T_mult)
for epoch in range(20):
scheduler.step()
scheduler.step(26)
scheduler.step()
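To make the effect of T_0 and T_mult more concrete, the short sketch below (placeholder model and illustrative values) prints the learning rate once per epoch so the restarts become visible:

```python
import torch
from torch.optim import SGD, lr_scheduler

model = torch.nn.Linear(10, 2)       # placeholder parameters to schedule
optimizer = SGD(model.parameters(), lr=0.1)
scheduler = lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-5)  # illustrative values

for epoch in range(30):
    optimizer.step()
    scheduler.step()
    print(epoch, scheduler.get_last_lr()[0])
# The LR follows a cosine curve from 0.1 toward eta_min over the first cycle of
# T_0 = 10 epochs, jumps back to 0.1 at the restart, and the second cycle then
# lasts T_0 * T_mult = 20 epochs.
```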