Resuming training from a checkpoint is easy to implement!
When the dataset is large or the model is complex, training can take days or even months. If it is then interrupted by something hard to foresee (a shutdown, a crash, running out of GPU memory, and so on), all of that work can be lost, which is genuinely painful!
We therefore need a mechanism to guard against this.
The implementation boils down to saving and loading the same things: the epoch, the model, and the optimizer.
start_epoch = -1
resume = True
path_checkpoint = "/home/wanglu/GridMask/GridMixup/runs/ResNet18_cifar10_gridmix/checkpoint.pth.tar"

# restore the training state from the checkpoint before entering the loop
if resume:
    checkpoint = torch.load(path_checkpoint)
    start_epoch = checkpoint['epoch']
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])

# best_err1 / best_err5 are assumed to be initialized earlier in the full script
for epoch in range(start_epoch + 1, args.epochs):
    adjust_learning_rate(optimizer, epoch)

    # train for one epoch
    train_loss = train(train_loader, model, criterion, optimizer, epoch)

    # evaluate on validation set
    criterionv2 = nn.CrossEntropyLoss()
    err1, err5, val_loss = validate(val_loader, model, criterionv2, epoch)

    # remember best prec@1 and save checkpoint
    is_best = err1 <= best_err1
    best_err1 = min(err1, best_err1)
    if is_best:
        best_err5 = err5

    logger.info('Current best accuracy (top-1 and 5 error): {} and {}'.format(best_err1, best_err5))
    save_checkpoint({
        'epoch': epoch,
        'arch': args.net_type,
        'state_dict': model.state_dict(),
        'best_err1': best_err1,
        'best_err5': best_err5,
        'optimizer': optimizer.state_dict(),
    }, is_best)
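The loop above relies on a save_checkpoint helper that is not shown in the snippet. A minimal sketch of what such a helper could look like (the directory and file names here are assumptions, not taken from the original script):

import os
import shutil
import torch

def save_checkpoint(state, is_best, save_dir='runs/ResNet18_cifar10_gridmix'):
    # always write the latest state, and copy it when it is the best seen so far
    os.makedirs(save_dir, exist_ok=True)
    filename = os.path.join(save_dir, 'checkpoint.pth.tar')
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, os.path.join(save_dir, 'model_best.pth.tar'))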
1. Setting up checkpoints
The idea behind a checkpoint is to keep saving the training state as training progresses (including, but not limited to, the epoch, the model weights, the optimizer state, and the scheduler state). Even if training is interrupted, we can then continue from the checkpoint.
for epoch in range(num_epochs):
    train(net, train_loader)
    test(net, test_loader)

    # set up the checkpoint: epoch index, model weights, optimizer state
    checkpoint = {
        'epoch': epoch,
        'net': net.state_dict(),
        'optimizer': optimizer.state_dict()}
    if not os.path.isdir(r'tf-logs/' + save_model):
        os.mkdir(r'tf-logs/' + save_model)
    torch.save(checkpoint, r'tf-logs/' + save_model + '/best_%s.pth' % (str(epoch + 1)))
The code above saves a checkpoint after every single epoch of training.
Saving a checkpoint after every epoch is not ideal; we should only keep the checkpoint of the best-performing model seen so far, which can be done with a simple conditional.
min_loss_val = 1  # lowest validation loss seen so far
for epoch in range(num_epochs):
    train(net, train_loader)
    test_loss = test(net, test_loader)

    # set up the checkpoint: save only in the second half of training,
    # and only when the validation loss improves on the best so far
    if epoch > int(num_epochs / 2) and test_loss <= min_loss_val:
        min_loss_val = test_loss
        checkpoint = {
            'epoch': epoch,
            'net': net.state_dict(),
            'optimizer': optimizer.state_dict(),
            'lr_schedule': scheduler.state_dict()}
        if not os.path.isdir(r'tf-logs/' + save_model):
            os.mkdir(r'tf-logs/' + save_model)
        torch.save(checkpoint, r'tf-logs/' + save_model + '/ckpt_best_%s.pth' % (str(epoch + 1)))
The code above starts saving checkpoints only once training has progressed far enough, and it only keeps the best model seen so far.
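Note that keeping only the best checkpoint means an interruption late in training may force you to resume from a much older epoch. A common complement (this is a sketch, not part of the original code; the ckpt_last.pth name is an assumption) is to also overwrite a "latest" checkpoint at the end of every epoch, inside the same training loop:

    # overwrite the most recent training state every epoch,
    # so that resuming after a crash loses at most one epoch of work
    torch.save({'epoch': epoch,
                'net': net.state_dict(),
                'optimizer': optimizer.state_dict(),
                'lr_schedule': scheduler.state_dict()},
               r'tf-logs/' + save_model + '/ckpt_last.pth')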
2. Resuming training
To pick up training from where it was interrupted, we only need to load the checkpoint and initialize the training state from it.
start_epoch = -1
min_loss_val = 1  # lowest validation loss seen so far

# if resuming, load the checkpoint and restore the training state from it
if RESUME:
    path_checkpoint = checkpoint_path
    checkpoint = torch.load(path_checkpoint)
    start_epoch = checkpoint['epoch']
    net.load_state_dict(checkpoint['net'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    scheduler.load_state_dict(checkpoint['lr_schedule'])
    # restore the best loss as well, if the checkpoint recorded it
    min_loss_val = checkpoint.get('loss', min_loss_val)

for epoch in range(start_epoch + 1, num_epochs):
    train(net, train_loader)
    test_loss = test(net, test_loader)

    if epoch > int(num_epochs / 2) and test_loss <= min_loss_val:
        min_loss_val = test_loss
        checkpoint = {
            'loss': test_loss,
            'epoch': epoch,
            'net': net.state_dict(),
            'optimizer': optimizer.state_dict(),
            'lr_schedule': scheduler.state_dict()}
        if not os.path.isdir(r'tf-logs/' + save_model):
            os.mkdir(r'tf-logs/' + save_model)
        torch.save(checkpoint, r'tf-logs/' + save_model + '/ckpt_best_%s.pth' % (str(epoch + 1)))
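One extra detail worth mentioning: if the checkpoint was saved on a GPU and training resumes on a different device (CPU only, or another GPU), pass map_location to torch.load so the tensors are remapped correctly. A minimal sketch, assuming the same checkpoint keys as above:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
checkpoint = torch.load(path_checkpoint, map_location=device)
net.load_state_dict(checkpoint['net'])
net.to(device)  # make sure the model lives on the device used for training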