Automatic type casting (autocast): automatically changes a tensor's concrete dtype as it propagates, e.g. between Float16 and Float32.
from torch.cuda.amp import autocast  # Note!!!

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

for input, target in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with autocast():  # Note!!!
        output = model(input)
        loss = loss_fn(output, target)

    # Exits the context manager before backward()
    loss.backward()
    optimizer.step()
Or, as a decorator on forward():
from torch.cuda.amp import autocast  # Note!!!

class AutocastModel(nn.Module):
    ...
    @autocast()  # Note!!!
    def forward(self, input):
        ...
Automatic gradient scaling: handling gradients that become inf or NaN

Underflow: values that are too small round to zero when represented in float16. To prevent underflow, "gradient scaling" multiplies the network's loss by a scale factor and runs the backward pass on the scaled loss. The gradients flowing backward through the network are then scaled by the same factor; in other words, the gradient values have a larger magnitude, so they do not underflow to zero.

The scale factor is estimated dynamically at each iteration. To minimize gradient underflow the scale should be as large as possible, but if it is too large, half-precision tensors easily overflow (become inf or NaN). Every call to scaler.step(optimizer) therefore checks the gradients for infs/NaNs (a full-loop sketch follows the list below):
- Internally calls unscale_(optimizer) (unless unscale_() was explicitly called for that optimizer earlier in the iteration). As part of unscale_(), the gradients are checked for infs/NaNs.
- If no inf/NaN gradients are found, optimizer.step() is invoked with the unscaled gradients. Otherwise, optimizer.step() is skipped to avoid corrupting the parameters.
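Putting autocast and GradScaler together, a minimal sketch of the full loop (reusing the Net / optimizer / loss_fn names from the example above, following the standard torch.cuda.amp pattern):

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # maintains the dynamic scale factor across iterations

for input, target in data:
    optimizer.zero_grad()

    # Forward pass (model + loss) in mixed precision
    with autocast():
        output = model(input)
        loss = loss_fn(output, target)

    # Backward on the scaled loss so small gradients do not underflow
    scaler.scale(loss).backward()

    # step() unscales the gradients, checks them for infs/NaNs, and only then
    # calls optimizer.step(); update() adjusts the scale for the next iteration
    scaler.step(optimizer)
    scaler.update()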
...
scaler.scale(loss).backward()

# Simple example, using unscale_() to enable clipping of unscaled gradients:
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

scaler.step(optimizer)
scaler.update()
Gradient clipping: clip_grad_norm_

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.)  # the variant without the trailing underscore is deprecated
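In a plain full-precision loop (no GradScaler), clipping goes between backward() and step(). A minimal sketch, assuming the same model / optimizer names as above:

loss.backward()
# Clip the total gradient norm to 1.0 before the parameter update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()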
References and more

How to speed up PyTorch "alchemy" (training)? 17 tips summarized by one developer
Native AMP (torch.cuda.amp) requires PyTorch 1.6 or later:
Training large models with DTR and mixed precision
I ran a quick test with this code, then made the network a bit more complex and ran it several times:
autocast 17.183414459228516
autocast 15.994871854782104
autocast 15.820693016052246
autocast 16.13741683959961
autocast 15.887990713119507
autocast 16.030110120773315
autocast 16.310083389282227
autocast 15.994830131530762
autocast 15.934360265731812
autocast 16.200788259506226
Process finished with exit code 0
tensor(0.1449, grad_fn=<SumBackward0>)
noautocast 16.728676080703735
tensor(0.1794, grad_fn=<SumBackward0>)
noautocast 15.766069889068604
tensor(0.0185, grad_fn=<SumBackward0>)
noautocast 16.220239877700806
tensor(0.1052, grad_fn=<SumBackward0>)
noautocast 15.826192855834961
tensor(0.0923, grad_fn=<SumBackward0>)
noautocast 15.72545838356018
tensor(0.0265, grad_fn=<SumBackward0>)
noautocast 15.755510807037354
tensor(0.0863, grad_fn=<SumBackward0>)
noautocast 15.407546281814575
tensor(0.2301, grad_fn=<SumBackward0>)
noautocast 16.111358404159546
tensor(0.1834, grad_fn=<SumBackward0>)
noautocast 15.880879163742065
tensor(0.1598, grad_fn=<SumBackward0>)
noautocast 15.930904150009155
Process finished with exit code 0
noautocast 14.903399467468262
autocast 14.896795988082886
noautocast 17.021508932113647
autocast 15.884583711624146
noautocast 16.640204429626465
autocast 15.859556198120117
autocast 16.10871434211731
noautocast 14.936076402664185
autocast 16.06855869293213
noautocast 15.59671926498413
autocast 15.723806619644165
noautocast 14.815815687179565
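The exact test script is not reproduced in this post; the sketch below only illustrates how autocast / noautocast timings of this kind could be collected. The network, tensor sizes, and iteration count are placeholders rather than the ones used for the numbers above, and absolute times will differ by GPU.

import time
import torch
from torch import nn, optim
from torch.cuda.amp import autocast, GradScaler

# Placeholder model and data; the real test used a different (more complex) network
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()
data = torch.randn(256, 1024, device="cuda")
target = torch.randn(256, 1024, device="cuda")

def run(use_amp, iters=1000):
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        optimizer.zero_grad()
        if use_amp:
            with autocast():
                loss = (model(data) - target).pow(2).mean()
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            loss = (model(data) - target).pow(2).mean()
            loss.backward()
            optimizer.step()
    torch.cuda.synchronize()
    return time.time() - start

print("autocast", run(True))
print("noautocast", run(False))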
# Approximates the effect of a larger batch size (the learning rate should also be
# increased accordingly: with more samples per effective batch, the gradient is more stable)
# Accumulate the gradients first; once a fixed number of accumulation steps is reached,
# update the network parameters and zero the gradients
for i, (images, target) in enumerate(train_loader):
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)

    outputs = model(images)
    loss = criterion(outputs, target)
    loss = loss / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
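Gradient accumulation can also be combined with the mixed-precision loop above. A sketch following the torch.cuda.amp gradient-accumulation pattern, reusing model, criterion, optimizer, train_loader, and accumulation_steps from the snippet above:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for i, (images, target) in enumerate(train_loader):
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)

    with autocast():
        outputs = model(images)
        # Divide inside the accumulation window so the summed gradient
        # matches what a single large batch would produce
        loss = criterion(outputs, target) / accumulation_steps

    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        # Unscale/check/step and update the scale only at the accumulation boundary
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()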