1. How to set up learning rate decay in PyTorch
We often want to decay the learning rate during training. The code below decays the learning rate by a factor of 10 (i.e. multiplies it by 0.1) every 30 epochs:
def adjust_learning_rate(optimizer, epoch):
    """Sets the learning rate to the initial LR decayed by 10 every 30 epochs"""
    lr = args.lr * (0.1 ** (epoch // 30))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
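For reference, PyTorch's built-in torch.optim.lr_scheduler.StepLR implements the same step decay without a hand-written function. A minimal sketch (model, num_epochs and train_one_epoch are placeholders, not from the original code):

import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # 0.1 is the initial LR
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)           # multiply the LR by 0.1 every 30 epochs

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # placeholder training function
    scheduler.step()                    # advance the schedule once per epoch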
What are param_groups?
The optimizer manages its parameters through param_groups. Each param_group holds a group of parameters together with its own learning rate, momentum and other hyperparameters, so we can change the learning rate of a particular group by modifying that group's param_group['lr'].
# Two param_groups here, i.e. len(optimizer.param_groups) == 2
optimizer = optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
# A single parameter group
optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
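Continuing the first (two-group) example above, a quick sketch of inspecting and changing the groups at runtime:

print(len(optimizer.param_groups))                 # -> 2
print([g['lr'] for g in optimizer.param_groups])   # -> [0.01, 0.001]

# Lower the learning rate of the second (classifier) group only
optimizer.param_groups[1]['lr'] = 1e-4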
Here is an example from my own code:
# Do not apply weight decay to biases and LayerNorm parameters;
# encoder and decoder get separate learning rates.
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
grouped_parameters = [
    {'params': [p for n, p in model.encoder.named_parameters() if not any(nd in n for nd in no_decay)],
     'weight_decay': args.weight_decay, 'lr': args.encoder_lr},
    {'params': [p for n, p in model.encoder.named_parameters() if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0, 'lr': args.encoder_lr},
    {'params': [p for n, p in model.decoder.named_parameters() if not any(nd in n for nd in no_decay)],
     'weight_decay': args.weight_decay, 'lr': args.decoder_lr},
    {'params': [p for n, p in model.decoder.named_parameters() if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0, 'lr': args.decoder_lr},
]
optimizer = OPTIMIZER_CLASSES[args.optim](grouped_parameters)
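OPTIMIZER_CLASSES here is presumably a dictionary mapping the value of args.optim to an optimizer class; a minimal sketch of what it could look like (the concrete entries are my assumption, not part of the original code):

import torch.optim as optim

OPTIMIZER_CLASSES = {
    'sgd': optim.SGD,
    'adam': optim.Adam,
    'adamw': optim.AdamW,
}

# Each dict in grouped_parameters carries its own 'lr' and 'weight_decay',
# so no extra defaults need to be passed here.
optimizer = OPTIMIZER_CLASSES[args.optim](grouped_parameters)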
2. The warmup strategy
The initial value of warmup_lr is inversely proportional to the size of the training corpus: the larger the corpus, the smaller the initial warmup_lr. The learning rate then grows until it reaches the same order of magnitude as the preset hyperparameter initial_learning_rate, and afterwards it gradually decreases according to decay_rates (a minimal sketch of this schedule shape follows the list below).
What are the benefits?
1) It lets the learning rate adapt to different training-set sizes. In practice we often train and validate the model on a small dataset first, and then switch to a large dataset to train the model for the production environment.
2) Even if the learning rate happens to be set too large, the warmup mechanism still lets us observe a suitable learning-rate range (around the critical point where the training error first drops and then rises), which is useful for later verification.
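To make the shape concrete, here is a minimal sketch of linear warmup followed by linear decay built on torch.optim.lr_scheduler.LambdaLR (this is essentially what get_linear_schedule_with_warmup below does internally; warmup_steps=1000 and total_steps=10000 are placeholder values, and optimizer is the one built above):

from torch.optim.lr_scheduler import LambdaLR

def warmup_linear(step, warmup_steps, total_steps):
    # Ramp the LR multiplier linearly from 0 to 1 during warmup,
    # then decay it linearly back to 0 over the remaining steps.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer,
                     lr_lambda=lambda step: warmup_linear(step, warmup_steps=1000, total_steps=10000))
# scheduler.step() is then called once per optimizer step (per batch), not per epoch.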
# Imports: newer versions of transformers replaced the schedule classes with get_* functions
try:
    from transformers import (ConstantLRSchedule, WarmupLinearSchedule, WarmupConstantSchedule)
except ImportError:
    from transformers import get_constant_schedule, get_constant_schedule_with_warmup, get_linear_schedule_with_warmup
......
......
# Written in main:
if args.lr_schedule == 'fixed':
    try:
        scheduler = ConstantLRSchedule(optimizer)
    except NameError:
        scheduler = get_constant_schedule(optimizer)
elif args.lr_schedule == 'warmup_constant':
    try:
        scheduler = WarmupConstantSchedule(optimizer, warmup_steps=args.warmup_steps)
    except NameError:
        scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps)
elif args.lr_schedule == 'warmup_linear':
    max_steps = int(args.n_epochs * (dataset.train_size() / args.batch_size))
    try:
        scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=max_steps)
    except NameError:
        scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=max_steps)
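Note that these transformers schedulers are meant to be stepped once per optimizer update (per batch), not once per epoch. A rough sketch of the training loop under that assumption (train_loader and compute_loss are placeholders):

for epoch in range(args.n_epochs):
    for batch in train_loader:               # placeholder DataLoader
        loss = compute_loss(model, batch)    # placeholder forward pass returning a scalar loss
        loss.backward()
        optimizer.step()
        scheduler.step()                     # advance the warmup/decay schedule every step
        optimizer.zero_grad()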
Continuously updated…
References:
- https://www.pytorchtutorial.com/pytorch-learning-rate-decay/
- https://www.zhihu.com/question/338066667/answer/973639422