AttributeError: module 'torch.cuda.amp' has no attribute 'autocast'

shchojj

Original link: https://pytorch.org/docs/stable/amp.html

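The installed PyTorch version is too old (torch.cuda.amp.autocast was added in PyTorch 1.6), so upgrading fixes the error. I upgraded straight to 1.7. After upgrading, check the installation: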

import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.amp)
print(torch.cuda.amp.autocast)
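
The version requirement can also be checked from the version string alone. The following is a minimal sketch that does not import torch; `supports_cuda_amp` is a hypothetical helper written for this post, not a PyTorch API, and it assumes the version string starts with `major.minor`.

```python
import re

# Hypothetical helper (not part of PyTorch): decide from a version string
# whether torch.cuda.amp.autocast should be present (it was added in 1.6).
def supports_cuda_amp(version: str) -> bool:
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 6)

print(supports_cuda_amp("1.5.1+cu101"))  # False
print(supports_cuda_amp("1.7.0"))        # True
```

In a real script you would pass `torch.__version__` to the helper.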

AMP stands for automatic mixed precision.

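Mixed precision uses two floating-point types: torch.float32 (float) and torch.float16 (half). Some ops, such as linear layers and convolutions, are much faster in torch.float16. Other ops, such as reductions, need the range of float32. Mixed precision automatically picks a suitable dtype for each op. torch.cuda.amp.autocast and torch.cuda.amp.GradScaler are usually used together.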
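
Why reductions want a wider type can be seen without a GPU. The sketch below emulates float16 rounding with the standard library's `struct` module (format code `"e"`); `to_half` and `half_sum` are hypothetical helpers for illustration, and Python's float64 stands in for the wider accumulator.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest float16 (IEEE 754 binary16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def half_sum(values):
    """Sum values, rounding the running total to float16 after every addition."""
    total = 0.0
    for v in values:
        total = to_half(total + to_half(v))
    return total

ones = [1.0] * 4096
print(sum(ones))       # 4096.0 with a wide accumulator
print(half_sum(ones))  # 2048.0: 2049 is not representable in float16, so the sum stalls
```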

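torch.cuda.amp.autocast enables mixed precision for the region it wraps. Inside an autocast context, do not call .half() on the model(s) or inputs. Wrap only the forward pass and the loss computation. Running the backward pass under autocast is not recommended, because backward ops already run in the same dtype that autocast chose for the corresponding forward ops.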

import torch.optim as optim
from torch.cuda.amp import autocast

# Creates model and optimizer in default precision
model = Net().cuda()  # the model
optimizer = optim.SGD(model.parameters(), ...)  # the optimizer

for input, target in data:
    optimizer.zero_grad()  # reset gradients

    # Enables autocasting for the forward pass (model + loss).
    # More complex scenarios: gradient penalty, multiple models/losses, custom autograd functions.
    with autocast():
        output = model(input)  # forward pass runs in mixed precision
        loss = loss_fn(output, target)  # loss computation runs in mixed precision

    # Exits the context manager before backward();
    # running backward passes under autocast is not recommended.
    loss.backward()
    optimizer.step()

autocast can also be applied as a decorator directly on the forward method:

class AutocastModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...

# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with autocast():  # tensors produced here are float16; mismatched float32 inputs are cast automatically
    # torch.mm is on autocast's list of ops that should run in float16.
    # Inputs are float32, but the op runs in float16 and produces float16 output.
    # No manual casts are required.
    e_float16 = torch.mm(a_float32, b_float32)
    # Also handles mixed input types
    f_float16 = torch.mm(d_float32, e_float16)

# After exiting autocast, calls f_float16.float() to cast back to float32 for use with d_float32
g_float32 = torch.mm(d_float32, f_float16.float())

# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with autocast():
    e_float16 = torch.mm(a_float32, b_float32)

    with autocast(enabled=False):  # locally disables autocast inside the autocast region, so ops run in float32
        # Calls e_float16.float() to ensure float32 execution
        # (necessary because e_float16 was created in an autocasted region)
        f_float32 = torch.mm(c_float32, e_float16.float())

    # No manual casts are required when re-entering the autocast-enabled region.
    # torch.mm again runs in float16 and produces float16 output, regardless of input types.
    g_float16 = torch.mm(d_float32, f_float32)

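torch.cuda.amp.GradScaler performs gradient scaling. If the forward pass of an op runs in float16, its backward pass also runs in float16. Gradient values with very small magnitudes may not be representable in float16, so they underflow to zero and the corresponding parameters are never updated. Gradient scaling multiplies the network's loss(es) by a scale factor and calls backward on the scaled loss(es). The gradients flowing backward through the network are then scaled by the same factor, so they have larger magnitudes and no longer flush to zero.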
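
The underflow and its fix can be reproduced with plain Python. This is a minimal sketch: `to_half` is a hypothetical helper that emulates float16 rounding with `struct`, the gradient value 1e-8 is made up for illustration, and 65536 (2**16) is GradScaler's default initial scale.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8       # hypothetical tiny gradient value
scale = 65536.0   # 2**16, GradScaler's default initial scale

print(to_half(grad))            # 0.0: the gradient underflows in float16

scaled = to_half(grad * scale)  # the scaled gradient is representable
unscaled = scaled / scale       # unscaling happens in higher precision
print(scaled != 0.0)            # True
print(unscaled)                 # close to 1e-8 again
```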

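Each parameter's gradient (.grad) should be unscaled before the optimizer updates the parameters, so that the scale factor does not interfere with the learning rate.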

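This recipe measures the performance of a simple network in default precision, then adds autocast and GradScaler to run the same network in mixed precision with improved performance. Mixed precision mainly benefits architectures with Tensor Cores (Volta, Turing, Ampere). On those architectures this recipe should show a significant (2-3x) speedup.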

import torch, time, gc

# Timing utilities
start_time = None

def start_timer():
    global start_time
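    gc.collect()  # run a full garbage collection
    torch.cuda.empty_cache()  # release cached GPU memory
    torch.cuda.reset_max_memory_allocated()  # reset the starting point for tracking peak allocated memory
    torch.cuda.synchronize()  # wait for all kernels in all streams on the current device to finish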
    start_time = time.time()

def end_timer_and_print(local_msg):
    torch.cuda.synchronize()  # wait for all kernels in all streams on the current device to finish
    end_time = time.time()
    print("\n" + local_msg)
    print("Total execution time = {:.3f} sec".format(end_time - start_time))
    print("Max memory used by tensors = {} bytes".format(torch.cuda.max_memory_allocated()))

A simple network

def make_model(in_size, out_size, num_layers):
    layers = []
    for _ in range(num_layers - 1):
        layers.append(torch.nn.Linear(in_size, in_size))
        layers.append(torch.nn.ReLU())
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*tuple(layers)).cuda()

batch_size, in_size, out_size, and num_layers are chosen to be large enough to saturate the GPU with work. Try varying the sizes and see how the mixed precision speedup changes.

batch_size = 512 # Try, for example, 128, 256, 513.
in_size = 4096
out_size = 4096
num_layers = 3
num_batches = 50
epochs = 3

# Creates data in default precision.
# The same data is used for both default and mixed precision trials below.
# You don't need to manually change inputs' dtype when enabling mixed precision.
data = [torch.randn(batch_size, in_size, device="cuda") for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device="cuda") for _ in range(num_batches)]

loss_fn = torch.nn.MSELoss().cuda()

Default Precision

Without autocast:

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Default precision:")

Using autocast

for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        # Runs the forward pass under autocast.
        with torch.cuda.amp.autocast():
            output = net(input)
            # output is float16 because linear layers autocast to float16.
            assert output.dtype is torch.float16

            loss = loss_fn(output, target)
            # loss is float32 because mse_loss layers autocast to float32.
            assert loss.dtype is torch.float32

        # Exits autocast before backward().
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        loss.backward()
        opt.step()
        opt.zero_grad() # set_to_none=True here can modestly improve performance

Adding GradScaler

# Constructs scaler once, at the beginning of the convergence run, using default args.
# If your network fails to converge with default GradScaler args, please file an issue.
# The same GradScaler instance should be used for the entire convergence run.
# If you perform multiple convergence runs in the same script, each run should use
# a dedicated fresh GradScaler instance.  GradScaler instances are lightweight.
scaler = torch.cuda.amp.GradScaler()

for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)

        # Updates the scale for next iteration.
        scaler.update()

        opt.zero_grad() # set_to_none=True here can modestly improve performance
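
The skip-and-rescale behaviour described in the comments above can be sketched in pure Python. `ToyScaler` is a hypothetical, heavily simplified stand-in written for this post and is not GradScaler's real implementation. Its default arguments mirror GradScaler's documented defaults (init_scale=65536, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000), and gradients are plain floats rather than tensors.

```python
import math

class ToyScaler:
    """Simplified, hypothetical stand-in for GradScaler's step/update logic."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0
        self._found_inf = False

    def step(self, scaled_grads, apply_update):
        """Unscale grads and call apply_update(grads) only if all are finite."""
        grads = [g / self.scale for g in scaled_grads]
        self._found_inf = not all(math.isfinite(g) for g in grads)
        if self._found_inf:
            return False  # the optimizer step is skipped
        apply_update(grads)
        return True

    def update(self):
        """Shrink the scale after an overflow, grow it after enough clean steps."""
        if self._found_inf:
            self.scale *= self.backoff_factor
            self._clean_steps = 0
        else:
            self._clean_steps += 1
            if self._clean_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._clean_steps = 0
        self._found_inf = False

toy = ToyScaler(init_scale=8.0, growth_interval=2)
applied = []

toy.step([float("inf")], applied.append)  # overflow: update is skipped
toy.update()
print(toy.scale, applied)                 # 4.0 []

for scaled in ([4.0], [8.0]):             # two clean iterations
    toy.step(scaled, applied.append)
    toy.update()
print(toy.scale, applied)                 # 8.0 [[1.0], [2.0]]
```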

All together: “Automatic Mixed Precision”

use_amp = True

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Mixed precision:")

Inspecting/modifying gradients (e.g., clipping)

for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(opt)

        # Since the gradients of optimizer's assigned params are now unscaled, clips as usual.
        # You may use the same value for max_norm here as you would without gradient scaling.
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.1)

        scaler.step(opt)
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance

Saving/Resuming

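To save, store the scaler state dict alongside the usual model and optimizer state dicts. Do this either at the beginning of an iteration before any forward passes, or at the end of an iteration after scaler.update().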

checkpoint = {"model": net.state_dict(),
              "optimizer": opt.state_dict(),
              "scaler": scaler.state_dict()}
# Write checkpoint as desired, e.g.,
# torch.save(checkpoint, "filename")

To resume, load the scaler state dict alongside the model and optimizer state dicts.

# Read checkpoint as desired, e.g.,
# dev = torch.cuda.current_device()
# checkpoint = torch.load("filename",
#                         map_location = lambda storage, loc: storage.cuda(dev))
net.load_state_dict(checkpoint["model"])
opt.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["scaler"])

AUTOMATIC MIXED PRECISION EXAMPLES

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

Working with Unscaled Gradients

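All gradients produced by scaler.scale(loss).backward() are scaled. If you want to modify or inspect the parameters' .grad attributes between backward() and scaler.step(optimizer), unscale them first. For example, gradient clipping manipulates a set of gradients so that their global norm (see torch.nn.utils.clip_grad_norm_()) or maximum magnitude (see torch.nn.utils.clip_grad_value_()) is <= some user-chosen threshold. If you clip without unscaling, the gradients' norm/maximum magnitude is also scaled, so the threshold you requested (which was meant for unscaled gradients) is invalid.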
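
The effect of clipping in the wrong order can be shown with plain numbers. This is a minimal sketch under stated assumptions: `clip_by_global_norm` is a hypothetical pure-Python helper that follows the usual clip-by-norm formula, and the gradient values, scale, and threshold are made up.

```python
import math

def norm(grads):
    """L2 norm of a list of floats."""
    return math.sqrt(sum(g * g for g in grads))

def clip_by_global_norm(grads, max_norm):
    """Hypothetical helper: rescale grads so that their L2 norm is <= max_norm."""
    coef = min(1.0, max_norm / (norm(grads) + 1e-6))
    return [g * coef for g in grads]

scale = 1024.0
true_grads = [0.3, 0.4]                         # unscaled norm is 0.5
scaled_grads = [g * scale for g in true_grads]  # what backward() would produce
max_norm = 1.0                                  # threshold meant for unscaled grads

# Wrong order: clip while still scaled, then unscale.
wrong = [g / scale for g in clip_by_global_norm(scaled_grads, max_norm)]
# Right order: unscale first, then clip.
right = clip_by_global_norm([g / scale for g in scaled_grads], max_norm)

print(norm(wrong))  # about 1/1024, far below the intended threshold
print(norm(right))  # about 0.5, untouched because it is already under max_norm
```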

Gradient clipping

Calling scaler.unscale_(optimizer) before clipping lets you clip the unscaled gradients as usual:

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

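scaler records that scaler.unscale_(optimizer) was already called for this optimizer in this iteration, so scaler.step(optimizer) knows not to unscale the gradients a second time before it (internally) calls optimizer.step().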

Working with Scaled Gradients

Gradient accumulation

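Gradient accumulation adds gradients over an effective batch of size batch_per_iter * iters_to_accumulate (* num_procs if distributed). The scale should be calibrated for the effective batch, which means that inf/NaN checking, step skipping when inf/NaN gradients are found, and scale updates should all happen at effective-batch granularity. In addition, gradients should stay scaled, and the scale factor should stay constant, while the gradients for a given effective batch are being accumulated. If gradients are unscaled (or the scale factor changes) before accumulation is complete, the next backward pass adds scaled gradients to unscaled gradients (or to gradients scaled by a different factor), after which the accumulated unscaled gradients that step needs can no longer be recovered.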

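Therefore, if you want to unscale_ gradients (for example, to clip unscaled gradients), call unscale_ just before step, after all (scaled) gradients for the upcoming step have been accumulated. Likewise, call update only at the end of iterations in which you called step for a full effective batch: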

scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
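
The requirement that the scale factor stay constant during accumulation comes down to simple arithmetic. The sketch below uses made-up numbers for two micro-batch gradients of a single parameter. The values are chosen to be exactly representable in binary floating point, so the equality check is exact.

```python
# Two hypothetical micro-batch gradients for one parameter.
g1, g2 = 0.25, 0.75
true_sum = g1 + g2  # 1.0

# Constant scale during accumulation: one division recovers the true sum.
scale = 1024.0
accumulated = g1 * scale + g2 * scale
print(accumulated / scale == true_sum)   # True

# Scale changes between the two backward passes (e.g. update() called too early).
scale_a, scale_b = 1024.0, 512.0
mixed = g1 * scale_a + g2 * scale_b
print(mixed / scale_a, mixed / scale_b)  # 0.625 1.25, neither equals 1.0
```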

Gradient penalty

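A gradient penalty implementation usually creates gradients with torch.autograd.grad(), combines them into a penalty value, and adds the penalty value to the loss.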

Here is an ordinary example of an L2 penalty without gradient scaling or autocasting:

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)

        # Creates gradients
        grad_params = torch.autograd.grad(outputs=loss,
                                          inputs=model.parameters(),
                                          create_graph=True)

        # Computes the penalty term and adds it to the loss
        grad_norm = 0
        for grad in grad_params:
            grad_norm += grad.pow(2).sum()
        grad_norm = grad_norm.sqrt()
        loss = loss + grad_norm

        loss.backward()

        # clip gradients here, if desired

        optimizer.step()

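To implement a gradient penalty with gradient scaling, the outputs tensor(s) passed to torch.autograd.grad() should be scaled. The resulting gradients are therefore scaled and should be unscaled before they are combined into the penalty value. Also, the penalty term computation is part of the forward pass, so it should run inside an autocast context.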

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales the loss for autograd.grad's backward pass, producing scaled_grad_params
        scaled_grad_params = torch.autograd.grad(outputs=scaler.scale(loss),
                                                 inputs=model.parameters(),
                                                 create_graph=True)

        # Creates unscaled grad_params before computing the penalty. scaled_grad_params are
        # not owned by any optimizer, so ordinary division is used instead of scaler.unscale_:
        inv_scale = 1./scaler.get_scale()
        grad_params = [p * inv_scale for p in scaled_grad_params]

        # Computes the penalty term and adds it to the loss
        with autocast():
            grad_norm = 0
            for grad in grad_params:
                grad_norm += grad.pow(2).sum()
            grad_norm = grad_norm.sqrt()
            loss = loss + grad_norm

        # Applies scaling to the backward call as usual.
        # Accumulates leaf gradients that are correctly scaled.
        scaler.scale(loss).backward()

        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

        # step() and update() proceed as usual.
        scaler.step(optimizer)
        scaler.update()

Working with Multiple Models, Losses, and Optimizers

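If your network has multiple losses, you must call scaler.scale on each of them individually. If your network has multiple optimizers, you may call scaler.unscale_ on any of them individually, and you must call scaler.step on each of them individually.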

However, scaler.update should be called only once, after all optimizers used in this iteration have been stepped:

scaler = torch.cuda.amp.GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer0.zero_grad()
        optimizer1.zero_grad()
        with autocast():
            output0 = model0(input)
            output1 = model1(input)
            loss0 = loss_fn(2 * output0 + 3 * output1, target)
            loss1 = loss_fn(3 * output0 - 5 * output1, target)

        # (retain_graph here is unrelated to amp, it's present because in this
        # example, both backward() calls share some sections of graph.)
        scaler.scale(loss0).backward(retain_graph=True)
        scaler.scale(loss1).backward()

        # You can choose which optimizers receive explicit unscaling, if you
        # want to inspect or modify the gradients of the params they own.
        scaler.unscale_(optimizer0)

        scaler.step(optimizer0)
        scaler.step(optimizer1)

        scaler.update()

Working with Multiple GPUs

DataParallel in a single process

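torch.nn.DataParallel spawns threads to run the forward pass on each device. The autocast state is thread local (as of the PyTorch 1.7 behavior described here), so the following will not work: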

model = MyModel()
dp_model = nn.DataParallel(model)

# Sets autocast in the main thread
with autocast():
    # dp_model's internal threads won't autocast.  The main thread's autocast state has no effect.
    output = dp_model(input)
    # loss_fn still autocasts, but it's too late...
    loss = loss_fn(output)
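
What "thread local" means here can be illustrated with the standard library. The sketch below is only an analogy built on `threading.local`, not PyTorch's actual implementation; `_state`, `is_enabled`, and `worker` are hypothetical names.

```python
import threading

# Hypothetical stand-in for a thread-local autocast flag.
_state = threading.local()

def is_enabled() -> bool:
    return getattr(_state, "enabled", False)

def worker(results):
    # Runs in a new thread, which starts with its own empty thread-local storage.
    results.append(is_enabled())

_state.enabled = True  # "enter autocast" in the main thread
results = []
t = threading.Thread(target=worker, args=(results,))
t.start()
t.join()

print(is_enabled())  # True in the main thread
print(results)       # [False]: the worker thread never saw the flag
```

Setting the flag inside `worker` itself would make it visible there, which mirrors enabling autocast inside the model's forward method.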

The fix is simple. Enable autocast inside MyModel.forward:

class MyModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...

# Alternatively
class MyModel(nn.Module):
    ...
    def forward(self, input):
        with autocast():
            ...

Now autocasting is active both in dp_model's threads (which execute forward) and in the main thread (which executes loss_fn):

model = MyModel()
dp_model = nn.DataParallel(model)

with autocast():
    output = dp_model(input)
    loss = loss_fn(output)

DistributedDataParallel, one GPU per process

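The documentation for torch.nn.parallel.DistributedDataParallel recommends one GPU per process for best performance. In that case DistributedDataParallel does not spawn threads internally, so the use of autocast and GradScaler is not affected.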

DistributedDataParallel, multiple GPUs per process

Here torch.nn.parallel.DistributedDataParallel may spawn a side thread to run the forward pass on each device, like torch.nn.DataParallel. The fix is the same: apply autocast as part of your model's forward method so that it is enabled in the side threads.

Autocast and Custom Autograd Functions

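If your network uses custom autograd functions (subclasses of torch.autograd.Function), changes are needed for autocast compatibility if any function:

takes multiple floating-point tensor inputs,
wraps any autocastable op (see the Autocast Op Reference), or
requires a particular dtype (for example, if it wraps CUDA extensions that were compiled only for dtype).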

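In all cases, if you are importing the function and cannot change its definition, a safe fallback is to disable autocast and force execution in float32 (or dtype) at every point of use where errors occur: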

with autocast():
    ...
    with autocast(enabled=False):
        output = imported_function(input1.float(), input2.float())

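If you are the function's author (or can change its definition), a better solution is to use the torch.cuda.amp.custom_fwd() and torch.cuda.amp.custom_bwd() decorators, as shown in the relevant case below.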

Functions with multiple inputs or autocastable ops

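Apply custom_fwd and custom_bwd (with no arguments) to forward and backward respectively. This ensures that forward executes with the current autocast state and that backward executes with the same autocast state as forward (which can prevent type mismatch errors):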

class MyMM(torch.autograd.Function):
    @staticmethod
    @custom_fwd
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)
    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)

Now MyMM can be invoked anywhere, without disabling autocast or manually casting inputs:

mymm = MyMM.apply

with autocast():
    output = mymm(input1, input2)

Functions that need a particular `dtype`

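Consider a custom function that requires torch.float32 inputs. Apply custom_fwd(cast_inputs=torch.float32) to forward and custom_bwd (with no arguments) to backward. If forward runs in an autocast-enabled region, the decorators cast floating-point CUDA tensor inputs to float32 and locally disable autocast during forward and backward: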

class MyFloat32Func(torch.autograd.Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)
    def forward(ctx, input):
        ctx.save_for_backward(input)
        ...
        return fwd_output
    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        ...

Now MyFloat32Func can be invoked anywhere, without manually disabling autocast or casting inputs:

func = MyFloat32Func.apply

with autocast():
    # func will run in float32, regardless of the surrounding autocast state
    output = func(input)