2. Official Apex tutorial
# Initialization
opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

# Train your model
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...

# Save checkpoint
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'amp': amp.state_dict()
}
torch.save(checkpoint, 'amp_checkpoint.pt')
...

# Restore
model = ...
optimizer = ...
checkpoint = torch.load('amp_checkpoint.pt')
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
amp.load_state_dict(checkpoint['amp'])

# Continue training
...
Note that we recommend restoring the model using the same opt_level. Also note that we recommend calling the load_state_dict methods after amp.initialize.
3. Adding the Apex code
try:
    import sys
    sys.path.insert(0, "/home/apex")  # path to the cloned Apex GitHub repo
    from apex.parallel import DistributedDataParallel as DDP
    from apex.fp16_utils import *
    from apex import amp, optimizers
    from apex.multi_tensor_apply import multi_tensor_applier
except ImportError:
    print("======================================================>")
    print("Apex fp16 training is unavailable; first confirm that your GPU supports fp16 (e.g. 2080 Ti, Titan, Tesla...)")
    raise ImportError("Please install apex from https://www.github.com/nvidia/apex to run this example.")
if cfg.SOLVER.FP16:  # added: mixed-precision training, opt_level='O1'
    print("Training with Apex mixed precision")
    model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
else:  # original author code: DDP
    # For training, wrap with DDP. But don't need this for inference.
    if comm.get_world_size() > 1:
        # ref to https://github.com/pytorch/pytorch/issues/22049 to set `find_unused_parameters=True`
        # when some of the parameters are not updated.
        model = DistributedDataParallel(
            model, device_ids=[comm.get_local_rank()], broadcast_buffers=False
        )
if self.FP16:  # added: mixed-precision backward pass
    with amp.scale_loss(losses, self.optimizer) as scaled_loss:
        scaled_loss.backward()
else:  # original author code
    losses.backward()
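The fp16/fp32 branch above can be factored into a small helper so the training step stays readable. A minimal sketch, assuming a `use_fp16` flag (the name is illustrative; in the code above the flag lives on `self.FP16`), with the Apex import kept inside the fp16 path so the fp32 path works without Apex installed:

```python
def backward_pass(losses, optimizer, use_fp16):
    """Run the backward pass, scaling the loss via Apex amp when fp16 is on."""
    if use_fp16:
        # Requires NVIDIA Apex; amp.scale_loss multiplies the loss by the
        # current loss scale so small fp16 gradients do not underflow.
        from apex import amp
        with amp.scale_loss(losses, optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        # Plain fp32 backward, identical to the original author code.
        losses.backward()
```

With this helper the trainer calls `backward_pass(losses, self.optimizer, self.FP16)` in place of the if/else block.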
Saving still goes through the original torch.save checkpointing interface; the Apex state (amp.state_dict()) is not saved. After adding the Apex calls, gradient-overflow messages appear. I have not found a way to eliminate them, but training recovers on its own:
“https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate”, UserWarning)
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
[08/17 11:20:26 fastreid.utils.events]: eta: 7:17:17 iter: 199 total_loss: 8.289 loss_cls: 6.490 loss_triplet: 1.803 time: 0.8417 data_time: 0.0017 lr: 3.02e-05 max_mem: 0M
[08/17 11:23:17 fastreid.utils.events]: eta: 7:14:16 iter: 399 total_loss: 7.069 loss_cls: 6.336 loss_triplet: 0.724 time: 0.8477 data_time: 0.0018 lr: 5.71e-05 max_mem: 0M
[08/17 11:26:08 fastreid.utils.events]: eta: 7:11:49 iter: 599 total_loss: 6.362 loss_cls: 5.828 loss_triplet: 0.541 time: 0.8501 data_time: 0.0018 lr: 8.39e-05 max_mem: 0M
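The "Gradient overflow" lines are expected behavior, not a bug: under O1, Apex uses dynamic loss scaling, which halves the loss scale and skips the optimizer step whenever fp16 gradients overflow, then grows the scale back after a run of clean steps — which is why training "comes back" on its own. A minimal sketch of that logic (class name, constants, and the scale-growth window are illustrative defaults, not Apex's actual implementation):

```python
class DynamicLossScaler:
    """Sketch of Apex-style dynamic loss scaling for fp16 training."""

    def __init__(self, init_scale=2.0 ** 15, scale_factor=2.0, scale_window=2000):
        self.scale = init_scale          # current loss-scale multiplier
        self.scale_factor = scale_factor # halve/double by this factor
        self.scale_window = scale_window # clean steps before growing the scale
        self.good_steps = 0              # consecutive overflow-free steps

    def update(self, overflow):
        """Call once per iteration; returns False if the step should be skipped."""
        if overflow:
            # Gradients hit inf/nan: shrink the scale and skip optimizer.step(),
            # producing log lines like "reducing loss scale to 16384.0".
            self.scale /= self.scale_factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.scale_window == 0:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.scale_factor
        return True
```

Starting from 32768.0, two consecutive overflows reproduce the 16384.0 → 8192.0 sequence in the log above.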