PyTorch mixed-precision programming: automatic mixed precision (AMP) + ways to save GPU memory in PyTorch

FakeOccupational

Last modified 2022-07-01 10:00:07

First published 2022-06-01 10:22:43

Article link: https://blog.csdn.net/ResumeProject/article/details/125024806

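Summary (generated automatically by CSDN): This article describes how to train models in PyTorch with automatic mixed precision (AMP) to increase speed and reduce memory consumption. The autocast context manager enables half-precision floating-point operations, gradient scaling prevents underflow and adjusts the scale factor dynamically, and gradient clipping helps keep training numerically stable. According to the summary, the experiments show that AMP can significantly reduce training time.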

Note: for a more detailed explanation of AMP, see the original PyTorch documentation.

Autocasting (automatically changing a tensor's dtype during propagation, for example between float16 and float32)

import torch.optim as optim
from torch.cuda.amp import autocast  # required import for autocasting

# Net, data and loss_fn are placeholders assumed to be defined elsewhere.
# Create the model and optimizer in default precision (float32)
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

for input, target in data:
    optimizer.zero_grad()

    # Enable autocasting for the forward pass (model + loss)
    with autocast():
        output = model(input)
        loss = loss_fn(output, target)

    # Exit the context manager before backward()
    loss.backward()
    optimizer.step()

Or:

import torch.nn as nn
from torch.cuda.amp import autocast  # required import for autocasting

class AutocastModel(nn.Module):
    ...
    @autocast()  # autocast can also be applied as a decorator on forward()
    def forward(self, input):
        ...
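
To make the effect of autocasting visible, here is a minimal, self-contained sketch that also runs without a GPU. It is an illustration under my own assumptions, not the author's code: `TinyNet` and the tensor shapes are hypothetical, and it uses `torch.autocast` (the device-agnostic entry point available in PyTorch 1.10 and later) with bfloat16 on CPU. On a CUDA device it uses float16 instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model; any nn.Module is handled the same way.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)

    def forward(self, x):
        return self.fc(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 is the usual choice on CUDA; CPU autocast is shown with bfloat16.
low_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = TinyNet().to(device)
x = torch.randn(2, 8, device=device)

with torch.autocast(device_type=device, dtype=low_dtype):
    out = model(x)

weight_dtype = model.fc.weight.dtype  # parameters stay in float32
output_dtype = out.dtype              # the linear layer ran in low precision
print(weight_dtype, output_dtype)
```

The point of the sketch is that the model is never converted by hand: the parameters remain float32, and only the operations executed inside the context run in the lower-precision dtype.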

Automatic gradient scaling: dealing with inf or NaN gradients

Underflow

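Values that are very small become 0 when represented in float16. To prevent this underflow, gradient scaling multiplies the network's loss by a scale factor and runs the backward pass on the scaled loss. The gradients flowing backward through the network are then scaled by the same factor. In other words, the gradient values have a larger magnitude, so they do not underflow to zero.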
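
The underflow described above can be reproduced with the standard library alone. The sketch below rounds values to IEEE 754 half precision through `struct`; the helper name `to_float16`, the sample gradient `1e-8`, and the scale `65536.0` (2**16, which is also the default initial scale of `GradScaler`) are illustrative choices of mine.

```python
import struct

def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8       # a gradient value too small for float16
scale = 65536.0   # example scale factor (2**16), chosen for illustration

unscaled = to_float16(grad)          # underflows to zero
scaled = to_float16(grad * scale)    # about 6.55e-4, representable in float16
recovered = scaled / scale           # unscale in higher precision

print(unscaled, scaled, recovered)
```

Without scaling the value is lost entirely; with scaling it survives the float16 round trip and is recovered to within float16's relative precision once it is divided by the scale again.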

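The scale factor is estimated dynamically at every iteration. To minimize gradient underflow, the scale factor should be as large as possible; but if it is too large, half-precision tensors easily overflow (becoming inf or NaN). Every call to scaler.step(optimizer) therefore checks the gradients for inf or NaN values.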

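scaler.step(optimizer) internally calls unscale_(optimizer) (unless unscale_() was called explicitly for that optimizer earlier in the iteration). As part of unscale_(), the gradients are checked for infs/NaNs.
If no inf/NaN gradients are found, optimizer.step() is called with the unscaled gradients. Otherwise, optimizer.step() is skipped to avoid corrupting the parameters.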
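
The skip-and-adjust policy can be modeled in a few lines of plain Python. This is a simplified sketch, not PyTorch's implementation: `ToyScaler` is a hypothetical class that merges the roles of `step()` and `update()`. The parameter names mirror `GradScaler`'s `growth_factor`, `backoff_factor` and `growth_interval`, and `growth_interval` is set to 3 here (the library default is 2000) so that the growth is visible in a short run.

```python
import math

class ToyScaler:
    """Hypothetical, simplified model of a dynamic loss-scale policy."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=3):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, scaled_grads):
        """Return unscaled grads, or None if the step must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            # Overflow: skip the optimizer step and shrink the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return None
        self._good_steps += 1
        unscaled = [g / self.scale for g in scaled_grads]
        if self._good_steps == self.growth_interval:
            # Several clean steps in a row: try a larger scale.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return unscaled

scaler = ToyScaler()
history = []
for grads in ([1.0, 2.0], [float("inf"), 1.0], [1.0], [1.0], [1.0]):
    result = scaler.step(grads)
    history.append((result is None, scaler.scale))
print(history)
```

In this run the second step is skipped and the scale is halved; after three clean steps in a row the scale is doubled again.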

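...
scaler.scale(loss).backward()
# Simple example: call unscale_() first so that clipping is applied to the unscaled gradients
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
scaler.step(optimizer)  # skips optimizer.step() if inf/NaN gradients were found
scaler.update()  # adjusts the scale factor for the next iteration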
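
For reference, the fragment above can be embedded in a complete loop roughly as follows. This is a sketch under stated assumptions: the linear model, random data, learning rate and `max_norm = 1.0` are hypothetical, and AMP is enabled only when a CUDA device is available, so that on a CPU-only machine the same code falls back to ordinary float32 training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
use_amp = torch.cuda.is_available()   # AMP is only enabled on a GPU here
device = "cuda" if use_amp else "cpu"

model = nn.Linear(16, 1).to(device)   # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
max_norm = 1.0

inputs = torch.randn(32, 16, device=device)
targets = torch.randn(32, 1, device=device)

for _ in range(3):
    optimizer.zero_grad()
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.unscale_(optimizer)        # gradients are unscaled in place
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)            # skipped if inf/NaN gradients were found
    scaler.update()                   # adjust the scale for the next iteration

# Joint L2 norm of the gradients left over from the last iteration.
grad_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
).item()
print(grad_norm)
```

With `enabled=False`, `scaler.scale()`, `unscale_()` and `update()` become no-ops and `scaler.step()` simply calls `optimizer.step()`, which is what lets one script serve both cases.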

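Gradient clipping: clip_grad_norm_

# clip_grad_norm (without the trailing underscore) is deprecated; use the in-place clip_grad_norm_
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.)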
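
The arithmetic behind norm-based clipping is small enough to show in plain Python. The helper below is hypothetical and reflects my understanding of the default L2 case: all gradients are treated as one vector, and if its norm exceeds `max_norm`, every component is multiplied by the same coefficient. The `eps` term is a small constant that guards against division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so that their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    return [g * clip_coef for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Because every component is multiplied by the same coefficient, the direction of the gradient is preserved and only its length changes.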

References and further reading

How to speed up PyTorch training? One developer's summary of 17 methods

Native AMP (torch.cuda.amp) requires PyTorch 1.6 or later.

Training large models with DTR and mixed precision

I ran a test with this code (the screenshot of the code is not preserved here).

I then made the network somewhat more complex and ran it several times:


autocast 17.183414459228516
autocast 15.994871854782104
autocast 15.820693016052246
autocast 16.13741683959961
autocast 15.887990713119507
autocast 16.030110120773315
autocast 16.310083389282227
autocast 15.994830131530762
autocast 15.934360265731812
autocast 16.200788259506226
Process finished with exit code 0
tensor(0.1449, grad_fn=<SumBackward0>)
noautocast 16.728676080703735
tensor(0.1794, grad_fn=<SumBackward0>)
noautocast 15.766069889068604
tensor(0.0185, grad_fn=<SumBackward0>)
noautocast 16.220239877700806
tensor(0.1052, grad_fn=<SumBackward0>)
noautocast 15.826192855834961
tensor(0.0923, grad_fn=<SumBackward0>)
noautocast 15.72545838356018
tensor(0.0265, grad_fn=<SumBackward0>)
noautocast 15.755510807037354
tensor(0.0863, grad_fn=<SumBackward0>)
noautocast 15.407546281814575
tensor(0.2301, grad_fn=<SumBackward0>)
noautocast 16.111358404159546
tensor(0.1834, grad_fn=<SumBackward0>)
noautocast 15.880879163742065
tensor(0.1598, grad_fn=<SumBackward0>)
noautocast 15.930904150009155
Process finished with exit code 0



noautocast 14.903399467468262
autocast 14.896795988082886
noautocast 17.021508932113647
autocast 15.884583711624146
noautocast 16.640204429626465
autocast 15.859556198120117

autocast 16.10871434211731
noautocast 14.936076402664185
autocast 16.06855869293213
noautocast 15.59671926498413
autocast 15.723806619644165
noautocast 14.815815687179565
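
The raw timings are hard to compare by eye, so a short summary script helps. The sketch below covers the two ten-run groups from the first log, rounded to four decimals, and assumes the numbers are wall-clock seconds per run, which the log itself does not state. For these particular runs the two means differ by only about 0.2, with autocast slightly slower.

```python
from statistics import mean, stdev

# Timings from the first log above, rounded to 4 decimals.
autocast_runs = [17.1834, 15.9949, 15.8207, 16.1374, 15.8880,
                 16.0301, 16.3101, 15.9948, 15.9344, 16.2008]
noautocast_runs = [16.7287, 15.7661, 16.2202, 15.8262, 15.7255,
                   15.7555, 15.4075, 16.1114, 15.8809, 15.9309]

mean_autocast = mean(autocast_runs)
mean_noautocast = mean(noautocast_runs)
print(f"autocast:   {mean_autocast:.3f} +/- {stdev(autocast_runs):.3f}")
print(f"noautocast: {mean_noautocast:.3f} +/- {stdev(noautocast_runs):.3f}")
```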

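import numpy as np
import torch

# Gradient accumulation gives an effect similar to training with a larger batch size.
# (The learning rate should be increased accordingly: more samples per update make the gradient more stable.)
# Gradients are accumulated first; after a fixed number of accumulation steps the parameters are updated and the gradients are zeroed.
# model, criterion, optimizer, train_loader and accumulation_steps are assumed to be defined elsewhere.
for i, (images, target) in enumerate(train_loader):
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, target)
    loss = loss / accumulation_steps  # normalize so the sum matches the large-batch average

    loss.backward()  # gradients add up in .grad across iterations

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()  # update parameters with the accumulated gradients
        optimizer.zero_grad()  # reset gradients for the next accumulation window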
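
Dividing the loss by `accumulation_steps` is what makes the accumulated gradient match a large-batch gradient. The following CPU-only sketch checks this on a toy problem; the linear model, the random data, and the split into four micro-batches of two samples are hypothetical choices for illustration. The equality assumes a mean-reduced loss and micro-batches of equal size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)        # hypothetical toy model
criterion = nn.MSELoss()       # mean-reduced loss
data = torch.randn(8, 4)
target = torch.randn(8, 1)
accumulation_steps = 4

# Reference: one backward pass over the full batch of 8 samples.
model.zero_grad()
criterion(model(data), target).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2 samples, each loss divided by 4.
model.zero_grad()
for x, y in zip(data.chunk(accumulation_steps), target.chunk(accumulation_steps)):
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()            # gradients add up in .grad
accumulated_grad = model.weight.grad.clone()

max_diff = (full_batch_grad - accumulated_grad).abs().max().item()
print(max_diff)
```

The two gradients agree up to floating-point rounding, while only one micro-batch of activations has to be held in memory at a time.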

What are some tricks for saving GPU memory in PyTorch?