Background
Mixed-precision training with AMP speeds up training and reduces GPU memory usage. Below is a code demo of training with AMP.
AMP mixed-precision training
use_amp = True

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
# GradScaler scales the loss so FP16 gradients do not underflow; it is a no-op when enabled=False
scaler = torch.amp.GradScaler('cuda', enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        # Run the forward pass and the loss in FP16 where it is numerically safe
        with torch.amp.autocast('cuda', dtype=torch.float16, enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
        scaler.scale(loss).backward()
        # Optional gradient clipping: unscale first so the threshold applies to the true gradients
        if clip_grad_l2norm > 0.0:
            if use_amp:
                scaler.unscale_(opt)
            torch.nn.utils.clip_grad_norm_(net.parameters(), clip_grad_l2norm)
        scaler.step(opt)    # skips the optimizer step if the gradients contain infs/NaNs
        scaler.update()     # adjusts the loss scale for the next iteration
        scheduler.step()    # assumes a per-iteration LR scheduler has been created
end_timer_and_print("Mixed precision:")
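The loop above relies on a few helpers that are not shown (make_model, loss_fn, data, targets, the timers, epochs, and clip_grad_l2norm). The following is a minimal sketch of one possible set of definitions, loosely in the style of the PyTorch AMP recipe; the sizes and the commented StepLR scheduler are illustrative assumptions, not part of the original demo.

import time
import torch

# Illustrative sizes (assumptions); large matmuls are where FP16 pays off on tensor cores.
in_size, out_size, num_layers = 4096, 4096, 3
batch_size, num_batches, epochs = 512, 50, 3
clip_grad_l2norm = 1.0

def make_model(in_size, out_size, num_layers):
    layers = []
    for _ in range(num_layers - 1):
        layers += [torch.nn.Linear(in_size, in_size), torch.nn.ReLU()]
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*layers).cuda()

loss_fn = torch.nn.MSELoss().cuda()
data = [torch.randn(batch_size, in_size, device='cuda') for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device='cuda') for _ in range(num_batches)]

# The scheduler stepped in the loop would be created right after `opt`, e.g.:
# scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=1000, gamma=0.5)

_start_time = None

def start_timer():
    global _start_time
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    _start_time = time.time()

def end_timer_and_print(msg):
    torch.cuda.synchronize()
    print(f"{msg} {time.time() - _start_time:.3f} s, "
          f"peak memory {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")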
FP16 communication compression
Register a communication hook on the DDP model to compress gradient communication to FP16, reducing communication volume and speeding up distributed training.
# model is a DistributedDataParallel instance
from torch.distributed.algorithms.ddp_comm_hooks import default as comm_hooks

use_fp16_compress = getattr(cfg.solver, "fp16_compress", False)
if use_fp16_compress:
    logger.info("Using FP16 compression ...")
    # Compress gradients to FP16 for the all-reduce, then restore the original dtype afterwards
    model.register_comm_hook(state=None, hook=comm_hooks.fp16_compress_hook)
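For context, here is a minimal sketch of where this registration sits in a DDP script launched with torchrun; it reuses the hypothetical make_model helper from the AMP demo, and the cfg/logger objects in the snippet above are assumed to exist. The hook only changes how gradients are communicated, so parameters and optimizer state stay in full precision.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default as comm_hooks

# Launched with torchrun, which sets RANK / LOCAL_RANK / WORLD_SIZE for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

net = make_model(in_size, out_size, num_layers)   # hypothetical helper from the AMP demo
model = DDP(net, device_ids=[local_rank])

# Gradients are cast to FP16 just for the all-reduce and cast back afterwards,
# roughly halving the gradient communication volume.
model.register_comm_hook(state=None, hook=comm_hooks.fp16_compress_hook)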