References:
- https://zhuanlan.zhihu.com/p/165152789
- https://zhuanlan.zhihu.com/p/176998729
- https://pytorch.org/docs/stable/amp.html
- https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples
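My PyTorch version was a bit old, so I upgraded it; I went straight to 1.7 (torch.cuda.amp requires 1.6 or later). A quick check that the AMP APIs are available: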
import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.amp)
print(torch.cuda.amp.autocast)
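AMP stands for automatic mixed precision. It mixes two floating-point types: torch.float32 (float) and torch.float16 (half). Ops such as linear layers and convolutions are much faster in torch.float16 (half), while reductions often need the dynamic range of float32. Mixed precision automatically picks a suitable dtype for each op. torch.cuda.amp.autocast and torch.cuda.amp.GradScaler are usually used together.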
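To make the float16 versus float32 trade-off concrete, here is a small CPU-only sketch. It is my own illustration rather than something taken from the referenced docs. It prints the limits reported by torch.finfo and shows two ways float16 loses information: overflow and coarse rounding.

```python
import torch

# Compare the representable range and precision of the two dtypes.
for dtype in (torch.float16, torch.float32):
    info = torch.finfo(dtype)
    print(dtype, "max =", info.max, "eps =", info.eps)

# float16 cannot hold values much beyond 65504: this one overflows to inf.
overflowed = torch.tensor(70000.0).half()

# float16 has an 11-bit significand, so between 4096 and 8192 it can only
# represent multiples of 4: 4097 is rounded to 4096.
rounded = torch.tensor(4097.0).half()

print(overflowed, rounded)
```

The rounding case hints at why sums over many elements (reductions) are better kept in float32.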
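torch.cuda.amp.autocast enables mixed precision for the code inside its context. Do not call .half() on your model(s) or inputs when using autocast. Wrap only the forward pass and the loss computation; do not wrap the backward pass, because backward ops run in the same dtype that autocast chose for the corresponding forward ops.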
from torch.cuda.amp import autocast

# Creates model and optimizer in default precision
model = Net().cuda()  # the model
optimizer = optim.SGD(model.parameters(), ...)  # the optimizer

for input, target in data:
    optimizer.zero_grad()  # reset the gradients
    # Enables autocasting for the forward pass (model + loss).
    # Gradient penalty, multiple models/losses and custom autograd functions
    # need extra care; they are covered in later sections.
    with autocast():
        output = model(input)  # the forward pass runs in mixed precision
        loss = loss_fn(output, target)  # the loss computation runs in mixed precision
    # Exits the context manager before backward();
    # running the backward pass under autocast is not recommended.
    loss.backward()
    optimizer.step()
autocast can also be applied directly to the forward method as a decorator:
class AutocastModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...
Inside an autocast-enabled region, eligible ops pick their own dtype regardless of the input dtypes:
# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

# Tensors produced inside the region are float16; float32 inputs are cast automatically.
with autocast():
    # torch.mm is on autocast's list of ops that should run in float16.
    # Inputs are float32, but the op runs in float16 and produces float16 output.
    # No manual casts are required.
    e_float16 = torch.mm(a_float32, b_float32)
    # Also handles mixed input types
    f_float16 = torch.mm(d_float32, e_float16)

# After exiting autocast, calls f_float16.float() to convert back to float32 for use with d_float32
g_float32 = torch.mm(d_float32, f_float16.float())
autocast can also be disabled locally inside an autocast-enabled region:
# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with autocast():
    e_float16 = torch.mm(a_float32, b_float32)

    # Locally disables autocast inside the autocast context, so ops here run in float32.
    with autocast(enabled=False):
        # Calls e_float16.float() to ensure float32 execution
        # (necessary because e_float16 was created in an autocasted region)
        f_float32 = torch.mm(c_float32, e_float16.float())

    # No manual casts are required when re-entering the autocast-enabled region.
    # torch.mm again runs in float16 and produces float16 output, regardless of input types.
    g_float16 = torch.mm(d_float32, f_float32)
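torch.cuda.amp.GradScaler performs gradient scaling. If the forward pass of an op runs in float16, its backward pass also runs in float16. Gradient values with small magnitudes may not be representable in float16; they underflow to zero, and the corresponding parameters stop being updated.
To prevent this, gradient scaling multiplies the network's loss(es) by a scale factor and calls backward on the scaled loss(es). Gradients flowing backward through the network are then scaled by the same factor, so they have larger magnitudes and do not flush to zero.
Each parameter's gradient (.grad) should be unscaled before the optimizer updates the parameters, so that the scale factor does not interfere with the learning rate.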
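The underflow problem and the effect of scaling can be seen with a single number. The following is a minimal CPU-only sketch of the arithmetic, not of GradScaler itself. The value 1e-8 is an arbitrary example I picked, and the scale 2**16 matches GradScaler's documented default init_scale.

```python
import torch

scale = 2.0 ** 16  # same value as GradScaler's default init_scale

# A small float32 "gradient" that float16 cannot represent.
tiny_grad = torch.tensor(1e-8)

# Casting directly to float16 flushes it to zero (underflow).
underflowed = tiny_grad.half()

# Scaling first keeps the value inside float16's representable range ...
scaled = (tiny_grad * scale).half()

# ... and unscaling in float32 recovers (approximately) the original value.
recovered = scaled.float() / scale

print(underflowed.item(), scaled.item(), recovered.item())
```

Note that the unscaling step happens in float32, which is why the small value survives it.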
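This recipe measures the performance of a simple network in default precision, then adds autocast and GradScaler to run the same network in mixed precision with improved performance. Mixed precision primarily benefits Tensor Core-enabled architectures (Volta, Turing, Ampere). On those architectures the recipe should show a significant (2-3x) speedup.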
import torch, time, gc

# Timing utilities
start_time = None

def start_timer():
    global start_time
    gc.collect()  # run a full garbage collection
    torch.cuda.empty_cache()  # release cached GPU memory
    torch.cuda.reset_max_memory_allocated()  # reset the starting point for tracking peak allocated memory
    torch.cuda.synchronize()  # wait for all kernels in all streams on the current device to finish
    start_time = time.time()

def end_timer_and_print(local_msg):
    torch.cuda.synchronize()  # wait for all kernels in all streams on the current device to finish
    end_time = time.time()
    print("\n" + local_msg)
    print("Total execution time = {:.3f} sec".format(end_time - start_time))
    print("Max memory used by tensors = {} bytes".format(torch.cuda.max_memory_allocated()))
A simple network
def make_model(in_size, out_size, num_layers):
    layers = []
    for _ in range(num_layers - 1):
        layers.append(torch.nn.Linear(in_size, in_size))
        layers.append(torch.nn.ReLU())
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*tuple(layers)).cuda()
batch_size, in_size, out_size and num_layers are chosen to be large enough to saturate the GPU with work. Try varying these sizes and see how the mixed precision speedup changes.
batch_size = 512 # Try, for example, 128, 256, 513.
in_size = 4096
out_size = 4096
num_layers = 3
num_batches = 50
epochs = 3
# Creates data in default precision.
# The same data is used for both default and mixed precision trials below.
# You don't need to manually change inputs' dtype when enabling mixed precision.
data = [torch.randn(batch_size, in_size, device="cuda") for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device="cuda") for _ in range(num_batches)]
loss_fn = torch.nn.MSELoss().cuda()
Default Precision
Without autocast:
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
end_timer_and_print("Default precision:")
Using autocast
for epoch in range(0):  # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        # Runs the forward pass under autocast.
        with torch.cuda.amp.autocast():
            output = net(input)
            # output is float16 because linear layers autocast to float16.
            assert output.dtype is torch.float16

            loss = loss_fn(output, target)
            # loss is float32 because mse_loss layers autocast to float32.
            assert loss.dtype is torch.float32

        # Exits autocast before backward().
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        loss.backward()
        opt.step()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
Adding GradScaler
# Constructs scaler once, at the beginning of the convergence run, using default args.
# If your network fails to converge with default GradScaler args, please file an issue.
# The same GradScaler instance should be used for the entire convergence run.
# If you perform multiple convergence runs in the same script, each run should use
# a dedicated fresh GradScaler instance. GradScaler instances are lightweight.
scaler = torch.cuda.amp.GradScaler()

for epoch in range(0):  # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)

        # Updates the scale for next iteration.
        scaler.update()

        opt.zero_grad()  # set_to_none=True here can modestly improve performance
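What "updates the scale for next iteration" means can be sketched in plain Python. The class below, ToyScaler, is a hypothetical and heavily simplified model of dynamic loss scaling, not the real GradScaler implementation. Its default arguments mirror GradScaler's documented defaults (init_scale=2**16, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000). The sequence of found_inf flags is made up for the demo.

```python
class ToyScaler:
    """Simplified, illustrative model of dynamic loss scaling (not the real GradScaler)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0  # consecutive iterations without inf/NaN gradients

    def update(self, found_inf):
        if found_inf:
            # The optimizer step was skipped: shrink the scale and restart the count.
            self.scale *= self.backoff_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                # Enough clean iterations in a row: try a larger scale.
                self.scale *= self.growth_factor
                self.good_steps = 0


toy = ToyScaler(growth_interval=3)  # short interval, only to keep the demo small
history = []
for found_inf in [False, False, False, True, False]:
    toy.update(found_inf)
    history.append(toy.scale)
print(history)
```

The scale grows after three clean iterations and is halved as soon as an inf/NaN is reported, which is the behavior the comments in the loop above describe at a high level.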
All together: “Automatic Mixed Precision”
use_amp = True

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
end_timer_and_print("Mixed precision:")
Inspecting/modifying gradients (e.g., clipping)
for epoch in range(0):  # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(opt)

        # Since the gradients of optimizer's assigned params are now unscaled, clips as usual.
        # You may use the same value for max_norm here as you would without gradient scaling.
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.1)

        scaler.step(opt)
        scaler.update()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
Saving/Resuming
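When saving, save the scaler state dict alongside the usual model and optimizer state dicts. Do this either at the beginning of an iteration before any forward passes, or at the end of an iteration after scaler.update().
checkpoint = {"model": net.state_dict(),
              "optimizer": opt.state_dict(),
              "scaler": scaler.state_dict()}
# Write checkpoint as desired, e.g.,
# torch.save(checkpoint, "filename")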
When resuming, load the scaler state dict alongside the model and optimizer state dicts.
# Read checkpoint as desired, e.g.,
# dev = torch.cuda.current_device()
# checkpoint = torch.load("filename",
# map_location = lambda storage, loc: storage.cuda(dev))
net.load_state_dict(checkpoint["model"])
opt.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["scaler"])
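For a version of the save/resume steps that runs end to end, here is a minimal sketch. It uses an in-memory buffer in place of a file, and a tiny Linear layer in place of the real network. The demo_ names and the make_training_objects helper are hypothetical. As an assumption for portability, AMP is disabled when no GPU is available, in which case the scaler's state dict is simply empty.

```python
import io
import torch

# Assumption: fall back to CPU with the scaler disabled when no GPU is present.
use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

def make_training_objects():
    # Hypothetical helper: builds a tiny model, its optimizer and a scaler.
    net = torch.nn.Linear(4, 2).to(device)
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    return net, opt, scaler

demo_net, demo_opt, demo_scaler = make_training_objects()
demo_checkpoint = {"model": demo_net.state_dict(),
                   "optimizer": demo_opt.state_dict(),
                   "scaler": demo_scaler.state_dict()}

# Save to an in-memory buffer; a file path works the same way.
buffer = io.BytesIO()
torch.save(demo_checkpoint, buffer)
buffer.seek(0)

# Restore into freshly constructed objects, as a resumed run would.
restored = torch.load(buffer, map_location=device)
new_net, new_opt, new_scaler = make_training_objects()
new_net.load_state_dict(restored["model"])
new_opt.load_state_dict(restored["optimizer"])
new_scaler.load_state_dict(restored["scaler"])

print(new_scaler.get_scale() == demo_scaler.get_scale())
```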
AUTOMATIC MIXED PRECISION EXAMPLES
from torch.cuda.amp import autocast, GradScaler

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
Working with Unscaled Gradients
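All gradients produced by scaler.scale(loss).backward() are scaled. If you want to modify or inspect the parameters' .grad attributes between backward() and scaler.step(optimizer), unscale them first. For example, gradient clipping manipulates a set of gradients so that their global norm (see torch.nn.utils.clip_grad_norm_()) or maximum magnitude (see torch.nn.utils.clip_grad_value_()) is <= some user-imposed threshold. If you clip without unscaling, the gradients' norm/maximum magnitude is also scaled, so your requested threshold (which was meant for unscaled gradients) is invalid.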
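A quick way to see why the order matters is to do the scaling by hand on the CPU. The sketch below does not use GradScaler: it multiplies the loss by a fixed factor, which has the same effect on gradients as scaler.scale(loss).backward(). The tiny model, the random data and the total_norm helper are made up for the illustration.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
params = list(model.parameters())
x, target = torch.randn(8, 4), torch.randn(8, 1)

scale = 2.0 ** 16   # stand-in for the scaler's current scale factor
max_norm = 0.1      # threshold intended for unscaled gradients

def total_norm(parameters):
    # Global L2 norm over all gradients, the quantity clip_grad_norm_ limits.
    return torch.sqrt(sum(p.grad.pow(2).sum() for p in parameters)).item()

# Backward on the scaled loss: every gradient comes out multiplied by `scale`.
loss = torch.nn.functional.mse_loss(model(x), target)
(loss * scale).backward()
scaled_grads = [p.grad.clone() for p in params]

# Wrong order: clip the scaled gradients, then unscale.
torch.nn.utils.clip_grad_norm_(params, max_norm)
wrong_norm = total_norm(params) / scale

# Right order: unscale first, then clip.
for p, g in zip(params, scaled_grads):
    p.grad = g / scale
true_norm = total_norm(params)
torch.nn.utils.clip_grad_norm_(params, max_norm)
right_norm = total_norm(params)

print(true_norm, wrong_norm, right_norm)
```

Clipping in the wrong order leaves an effective gradient norm of roughly max_norm / scale, which is far smaller than intended, while the right order yields min(true_norm, max_norm).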
Gradient clipping
Calling scaler.unscale_(optimizer) before clipping lets you clip unscaled gradients as usual:
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
scaler records that scaler.unscale_(optimizer) was already called for this optimizer in this iteration, so scaler.step(optimizer) knows not to redundantly unscale the gradients before (internally) calling optimizer.step().
Working with Scaled Gradients
Gradient accumulation
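Gradient accumulation adds gradients over an effective batch of size batch_per_iter * iters_to_accumulate (* num_procs if distributed). The scale should be calibrated for the effective batch: inf/NaN checking, step skipping when inf/NaN gradients are found, and scale updates should all happen at effective-batch granularity. Also, gradients should remain scaled, and the scale factor should remain constant, while gradients for a given effective batch are accumulated. If gradients are unscaled (or the scale factor changes) before accumulation is complete, the next backward pass adds scaled gradients to unscaled gradients (or to gradients scaled by a different factor), after which it is impossible to recover the accumulated unscaled gradients that step must apply.
Therefore, if you want to unscale_ gradients (e.g., to allow clipping unscaled gradients), call unscale_ just before step, after all (scaled) gradients for the upcoming step have been accumulated. Likewise, call update only at the end of iterations where you called step for a full effective batch:
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()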
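The division by iters_to_accumulate in the loop above is what makes the accumulated gradient match the gradient of one large batch. The following CPU-only sketch checks this without any AMP machinery, under the assumptions that the loss is a mean over the batch (here MSELoss) and that all micro-batches have the same size. The model and data are made up for the check.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()  # averages over the batch
x, y = torch.randn(8, 4), torch.randn(8, 1)
iters_to_accumulate = 4

# Reference: gradient from a single backward pass over the full batch of 8.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2, each loss divided by iters_to_accumulate.
model.zero_grad()
for xb, yb in zip(x.chunk(iters_to_accumulate), y.chunk(iters_to_accumulate)):
    loss = loss_fn(model(xb), yb) / iters_to_accumulate
    loss.backward()  # .grad accumulates across the backward calls
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

With gradient scaling, the same identity holds for the scaled gradients as long as the scale factor stays constant during the accumulation, which is why update is only called after step.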
Gradient penalty
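A gradient penalty implementation commonly creates gradients using torch.autograd.grad(), combines them to create the penalty value, and adds the penalty value to the loss.
Here is an ordinary example of an L2 penalty without gradient scaling or autocasting: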
for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)

        # Creates gradients
        grad_params = torch.autograd.grad(outputs=loss,
                                          inputs=model.parameters(),
                                          create_graph=True)

        # Computes the penalty term and adds it to the loss
        grad_norm = 0
        for grad in grad_params:
            grad_norm += grad.pow(2).sum()
        grad_norm = grad_norm.sqrt()
        loss = loss + grad_norm

        loss.backward()

        # clip gradients here, if desired

        optimizer.step()
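To implement a gradient penalty with gradient scaling, the outputs tensor(s) passed to torch.autograd.grad() should be scaled. The resulting gradients are therefore scaled, and should be unscaled before being combined to create the penalty value. Also, the penalty term computation is part of the forward pass, and therefore should be inside an autocast context.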
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales the loss for autograd.grad's backward pass, producing scaled_grad_params
        scaled_grad_params = torch.autograd.grad(outputs=scaler.scale(loss),
                                                 inputs=model.parameters(),
                                                 create_graph=True)

        # Creates unscaled grad_params before computing the penalty. scaled_grad_params are
        # not owned by any optimizer, so ordinary division is used instead of scaler.unscale_:
        inv_scale = 1./scaler.get_scale()
        grad_params = [p * inv_scale for p in scaled_grad_params]

        # Computes the penalty term and adds it to the loss
        with autocast():
            grad_norm = 0
            for grad in grad_params:
                grad_norm += grad.pow(2).sum()
            grad_norm = grad_norm.sqrt()
            loss = loss + grad_norm

        # Applies scaling to the backward call as usual.
        # Accumulates leaf gradients that are correctly scaled.
        scaler.scale(loss).backward()

        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

        # step() and update() proceed as usual.
        scaler.step(optimizer)
        scaler.update()
Working with Multiple Models, Losses, and Optimizers
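If your network has multiple losses, you must call scaler.scale on each of them individually. If your network has multiple optimizers, you may call scaler.unscale_ on any of them individually, and you must call scaler.step on each of them individually.
However, scaler.update should be called only once, after all optimizers used in this iteration have been stepped: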
scaler = torch.cuda.amp.GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer0.zero_grad()
        optimizer1.zero_grad()
        with autocast():
            output0 = model0(input)
            output1 = model1(input)
            loss0 = loss_fn(2 * output0 + 3 * output1, target)
            loss1 = loss_fn(3 * output0 - 5 * output1, target)

        # (retain_graph here is unrelated to amp, it's present because in this
        # example, both backward() calls share some sections of graph.)
        scaler.scale(loss0).backward(retain_graph=True)
        scaler.scale(loss1).backward()

        # You can choose which optimizers receive explicit unscaling, if you
        # want to inspect or modify the gradients of the params they own.
        scaler.unscale_(optimizer0)

        scaler.step(optimizer0)
        scaler.step(optimizer1)

        scaler.update()
Working with Multiple GPUs
DataParallel in a single process
torch.nn.DataParallel spawns threads to run the forward pass on each device. The autocast state is thread-local, so the following will not work:
model = MyModel()
dp_model = nn.DataParallel(model)

# Sets autocast in the main thread
with autocast():
    # dp_model's internal threads won't autocast. The main thread's autocast state has no effect.
    output = dp_model(input)
    # loss_fn still autocasts, but it's too late...
    loss = loss_fn(output)
The fix is simple. Enable autocast as part of MyModel.forward:
class MyModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...

# Alternatively
class MyModel(nn.Module):
    ...
    def forward(self, input):
        with autocast():
            ...
The following now autocasts in dp_model's threads (which execute forward) and in the main thread (which executes loss_fn):
model = MyModel()
dp_model = nn.DataParallel(model)

with autocast():
    output = dp_model(input)
    loss = loss_fn(output)
DistributedDataParallel, one GPU per process
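The documentation for torch.nn.parallel.DistributedDataParallel recommends one GPU per process for best performance. In this case, DistributedDataParallel does not spawn threads internally, so usage of autocast and GradScaler is not affected.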
DistributedDataParallel, multiple GPUs per process
Here torch.nn.parallel.DistributedDataParallel may spawn a side thread to run the forward pass on each device, like torch.nn.DataParallel. The fix is the same: apply autocast as part of your model's forward method to ensure it is enabled in side threads.
Autocast and Custom Autograd Functions
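If your network uses custom autograd functions (subclasses of torch.autograd.Function), changes are required for autocast compatibility if any function
- takes multiple floating-point Tensor inputs,
- wraps any autocastable op (see the Autocast Op Reference), or
- requires a particular dtype (for example, if it wraps CUDA extensions that were only compiled for dtype).
In all cases, if you are importing the function and cannot alter its definition, a safe fallback is to disable autocast and force execution in float32 (or dtype) at any points of use where errors occur: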
with autocast():
    ...
    with autocast(enabled=False):
        output = imported_function(input1.float(), input2.float())
If you are the function's author (or can alter its definition), a better solution is to use the torch.cuda.amp.custom_fwd() and torch.cuda.amp.custom_bwd() decorators, as shown in the relevant case below.
Functions with multiple inputs or autocastable ops
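Apply custom_fwd and custom_bwd (with no arguments) to forward and backward respectively. These ensure forward executes with the current autocast state and backward executes with the same autocast state as forward (which can prevent type mismatch errors):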
class MyMM(torch.autograd.Function):
    @staticmethod
    @custom_fwd
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)

    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)
Now MyMM can be invoked anywhere, without disabling autocast or manually casting inputs:
mymm = MyMM.apply

with autocast():
    output = mymm(input1, input2)
Functions that need a particular dtype
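Consider a custom function that requires torch.float32 inputs. Apply custom_fwd(cast_inputs=torch.float32) to forward and custom_bwd (with no arguments) to backward. If forward runs in an autocast-enabled region, the decorators cast floating-point CUDA Tensor inputs to float32, and locally disable autocast during forward and backward: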
class MyFloat32Func(torch.autograd.Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)
    def forward(ctx, input):
        ctx.save_for_backward(input)
        ...
        return fwd_output

    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        ...
Now MyFloat32Func can be invoked anywhere, without manually disabling autocast or casting inputs:
func = MyFloat32Func.apply

with autocast():
    # func will run in float32, regardless of the surrounding autocast state
    output = func(input)