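The overall approach is as follows:
1. Save the model parameters at every epoch.
2. If training is interrupted unexpectedly, reload the checkpoint that was saved at the point of interruption.
3. Carry on from there, for example by continuing training and testing. This is where the problem shows up: RuntimeError: expected device cpu but got device cuda:0. In other words, a tensor on the CPU was expected, but a tensor on CUDA (the GPU) was received.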
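Step 1 can be sketched roughly as below. This is a minimal illustration, not the original training script: the helper name save_checkpoint, the tiny nn.Linear model, the 'tiny-linear' arch string, and the file naming are all assumptions made for the example. Only the dictionary keys ('arch', 'epoch', 'state_dict', 'optimizer') are taken from the loading code in this post. The sketch also assumes that 'epoch' stores the next epoch to run, so that training can resume directly from that value.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, arch, next_epoch, model, optimizer):
    """Write everything needed to resume training to `path` (hypothetical helper)."""
    torch.save(
        {
            'arch': arch,
            'epoch': next_epoch,  # assumption: the epoch to resume from
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        },
        path,
    )


# Toy model and optimizer, used only to make the sketch runnable.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
ckpt_dir = tempfile.mkdtemp()

for epoch in range(2):
    loss = model(torch.randn(8, 4)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Save one checkpoint per epoch so an interrupted run can be resumed.
    ckpt_path = os.path.join(ckpt_dir, 'save_{}.pth'.format(epoch))
    save_checkpoint(ckpt_path, 'tiny-linear', epoch + 1, model, optimizer)
```

Saving the optimizer state alongside the model weights matters because optimizers such as SGD with momentum keep per-parameter buffers, and those buffers are exactly the tensors that later end up on the wrong device.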
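Cause: when the optimizer loads its saved state, the tensors in that state (for example momentum buffers) are on the CPU by default, while the model parameters are on the GPU. All of these tensors therefore have to be moved to the GPU. Otherwise optimizer.step() fails with: RuntimeError: expected device cpu but got device cuda:0.
The full fix looks like this: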
# Assumes torch is imported and that opt, model and optimizer already exist.
if opt.resume_path:
    print('loading checkpoint {}'.format(opt.resume_path))
    checkpoint = torch.load(opt.resume_path)
    assert opt.arch == checkpoint['arch']
    # Resume from the epoch where training was interrupted, up to n_epochs.
    opt.begin_epoch = checkpoint['epoch']
    model.load_state_dict(checkpoint['state_dict'])
    if not opt.no_train:
        optimizer.load_state_dict(checkpoint['optimizer'])
        # The tensors in the loaded optimizer state are on the CPU by default,
        # so move all of them to CUDA. Otherwise optimizer.step() raises:
        # RuntimeError: expected device cpu but got device cuda:0
        for state in optimizer.state.values():
            for k, v in state.items():
                if torch.is_tensor(v):
                    state[k] = v.cuda()
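The same loop can be wrapped in a small helper that takes the target device as an argument, so the code also runs on a machine without a GPU. This is only a sketch of a possible variant: the function name optimizer_to and the toy model are assumptions for illustration and are not part of the original code.

```python
import torch
import torch.nn as nn


def optimizer_to(optimizer, device):
    """Move every tensor in the optimizer state to `device`, in place (hypothetical helper)."""
    for state in optimizer.state.values():
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.to(device)


# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Toy model and optimizer, used only to make the sketch runnable.
model = nn.Linear(4, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One step so that the optimizer actually holds state (momentum buffers).
model(torch.randn(8, 4, device=device)).sum().backward()
optimizer.step()

# After optimizer.load_state_dict(...), this call would replace the nested loop.
optimizer_to(optimizer, device)
```

Using v.to(device) instead of v.cuda() keeps the behaviour identical on a GPU machine. It may also help to pass map_location to torch.load, for example torch.load(path, map_location=device), so that the checkpoint tensors are placed on the intended device when the file is read.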