PyTorch Guide: 17 Tricks to Make Your Deep Learning Model Train Much Faster!

English original: Faster Deep Learning Training with PyTorch – a 2021 Guide

Tags: deep learning

Say, you're training a deep learning model in PyTorch. What can you do to make your training finish faster?

In this post, I'll provide an overview of some of the lowest-effort, highest-impact ways of accelerating the training of deep learning models in PyTorch. For each of the methods, I'll briefly summarize the idea, try to estimate the expected speed-up and discuss some limitations. I will focus on conveying the most important parts and point to further resources for each of them. Mostly, I'll focus on changes that can be made directly within PyTorch without introducing additional libraries and I'll assume that you are training your model on GPU(s).

The suggestions – roughly sorted from largest to smallest expected speed-up – are:

1. Consider using a different learning rate schedule.
2. Use multiple workers and pinned memory in DataLoader.
3. Max out the batch size.
4. Use Automatic Mixed Precision (AMP).
5. Consider using a different optimizer.
6. Turn on cuDNN benchmarking.
7. Beware of frequently transferring data between CPUs and GPUs.
8. Use gradient/activation checkpointing.
9. Use gradient accumulation.
10. Use DistributedDataParallel for multi-GPU training.
11. Set gradients to None rather than 0.
12. Use .as_tensor() rather than .tensor().
13. Turn off debugging APIs if not needed.
14. Use gradient clipping.
15. Turn off bias before BatchNorm.
16. Turn off gradient computation during validation.
17. Use input and batch normalization.

1. Consider using another learning rate schedule

The learning rate (schedule) you choose has a large impact on the speed of convergence as well as the generalization performance of your model.

Cyclical Learning Rates and the 1Cycle learning rate schedule are both methods introduced by Leslie N. Smith (here and here), and then popularised by fast.ai's Jeremy Howard and Sylvain Gugger (here and here). Essentially, the 1Cycle learning rate schedule looks something like this:

[Figure: the 1Cycle learning rate schedule]

Sylvain writes:

[1cycle consists of] two steps of equal lengths, one going from a lower learning rate to a higher one, then back to the minimum. The maximum should be the value picked with the Learning Rate Finder, and the lower one can be ten times lower. Then, the length of this cycle should be slightly less than the total number of epochs, and, in the last part of training, we should allow the learning rate to decrease more than the minimum, by several orders of magnitude.

In the best case this schedule achieves a massive speed-up – what Smith calls Superconvergence – as compared to conventional learning rate schedules (using the 1Cycle policy, for instance, he needs ~10x fewer training iterations of a ResNet-56 on ImageNet to match the performance of the original paper). The schedule seems to perform robustly well across common architectures and optimizers.

PyTorch implements both of these methods as torch.optim.lr_scheduler.CyclicLR and torch.optim.lr_scheduler.OneCycleLR; see the documentation.

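As a rough sketch of how OneCycleLR fits into a training loop (model, train_loader, loss_fn and num_epochs are placeholders, and the max_lr value is something you would pick with an LR finder):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,                          # peak learning rate, e.g. from an LR-finder run
    epochs=num_epochs,
    steps_per_epoch=len(train_loader),
)

for epoch in range(num_epochs):
    for data, label in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(data), label)
        loss.backward()
        optimizer.step()
        scheduler.step()                 # OneCycleLR is stepped after every batch, not every epoch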

One drawback of these schedulers is that they introduce a number of additional hyperparameters. This post and this repo, offer a nice overview and implementation of how good hyper-parameters can be found including the Learning Rate Finder mentioned above.

Why does this work? It doesn't seem entirely clear but one possible explanation might be that regularly increasing the learning rate helps to traverse saddle points in the loss landscape more quickly.

2. Use multiple workers and pinned memory in DataLoader

When using torch.utils.data.DataLoader, set num_workers > 0, rather than the default value of 0, and pin_memory=True, rather than the default value of False. Details of this are explained here.

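A minimal sketch of what that looks like (train_dataset and the batch size are placeholders):

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,       # > 0 loads batches asynchronously in worker processes
    pin_memory=True,     # page-locked host memory makes host-to-GPU copies faster
)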

Szymon Migacz achieves a 2x speed-up for a single training epoch by using four workers and pinned memory.

A rule of thumb that people are using to choose the number of workers is to set it to four times the number of available GPUs, with both a larger and a smaller number of workers leading to a slowdown. Note that increasing num_workers will increase your CPU memory consumption.

3. Max out the batch size

This is a somewhat contentious point. Generally, however, it seems like using the largest batch size your GPU memory permits will accelerate your training (see NVIDIA's Szymon Migacz, for instance). Note that you will also have to adjust other hyperparameters, such as the learning rate, if you modify the batch size. A rule of thumb here is to double the learning rate as you double the batch size.

OpenAI has a nice empirical paper on the number of convergence steps needed for different batch sizes. Daniel Huynh runs some experiments with different batch sizes (also using the 1Cycle policy discussed above) where he achieves a 4x speed-up by going from batch size 64 to 512.

One of the downsides of using large batch sizes, however, is that they might lead to solutions that generalize worse than those trained with smaller batches.

4. Use Automatic Mixed Precision (AMP)

The release of PyTorch 1.6 included a native implementation of Automatic Mixed Precision training. The main idea here is that certain operations can be run faster, and without a loss of accuracy, at half precision (FP16) rather than in the single precision (FP32) used elsewhere. AMP then automatically decides which operation should be executed in which format. This allows both for faster training and a smaller memory footprint.

In the best case, the usage of AMP would look something like this:

import torch

# Creates once at the beginning of training
scaler = torch.cuda.amp.GradScaler()

for data, label in data_iter:
   optimizer.zero_grad()
   # Casts operations to mixed precision
   with torch.cuda.amp.autocast():
      loss = model(data)

   # Scales the loss, and calls backward()
   # to create scaled gradients
   scaler.scale(loss).backward()

   # Unscales gradients and calls
   # or skips optimizer.step()
   scaler.step(optimizer)

   # Updates the scale for next iteration
   scaler.update()

Benchmarking a number of common language and vision models on NVIDIA V100 GPUs, Huang and colleagues find that using AMP over regular FP32 training yields roughly 2x – and up to 5.5x – training speed-ups.

Currently, only CUDA ops can be autocast in this way. See the documentation here for more details on this and other limitations.

5. Consider using another optimizer

AdamW is Adam with weight decay (rather than L2-regularization) which was popularized by fast.ai and is now available natively in PyTorch as torch.optim.AdamW. AdamW seems to consistently outperform Adam in terms of both the error achieved and the training time. See this excellent blog post on why using weight decay instead of L2-regularization makes a difference for Adam.

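Swapping it in is a one-line change; a minimal sketch (the learning rate here is a placeholder):

import torch

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,               # placeholder learning rate
    weight_decay=1e-2,     # decoupled weight decay; 1e-2 is the PyTorch default
)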

Both Adam and AdamW work well with the 1Cycle policy described above.

There are also a few not-yet-native optimizers that have received a lot of attention recently, most notably LARS (pip installable implementation) and LAMB.

NVIDIA's APEX implements fused versions of a number of common optimizers such as Adam. This implementation avoids a number of passes to and from GPU memory as compared to the PyTorch implementation of Adam, yielding speed-ups in the range of 5%.

6. Turn on cuDNN benchmarking

If your model architecture remains fixed and your input size stays constant, setting torch.backends.cudnn.benchmark = True might be beneficial (docs). This enables the cuDNN autotuner, which will benchmark a number of different ways of computing convolutions in cuDNN and then use the fastest method from then on.

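In code this is a single flag, set once before the training loop starts:

import torch

# Only worthwhile if the model architecture and the input sizes stay fixed
torch.backends.cudnn.benchmark = True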

For a rough reference on the type of speed-up you can expect from this, Szymon Migacz achieves a speed-up of 70% on a forward pass for a convolution and a 27% speed-up for a forward + backward pass of the same convolution.

One caveat here is that this autotuning might become very slow if you max out the batch size as mentioned above.

7. Beware of frequently transferring data between CPUs and GPUs

Beware of frequently transferring tensors from a GPU to a CPU using tensor.cpu(), and vice versa using tensor.cuda(), as these are relatively expensive. The same applies for .item() and .numpy() – use .detach() instead.

If you are creating a new tensor, you can also directly assign it to your GPU using the keyword argument device=torch.device('cuda:0').

If you do need to transfer data, using .to(non_blocking=True) might be useful as long as you don't have any synchronization points after the transfer.

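A small sketch of these points together, assuming a CUDA device is available (shapes are placeholders):

import torch

device = torch.device('cuda:0')

# Create tensors directly on the GPU rather than creating them on the CPU and moving them
weights = torch.zeros(128, 256, device=device)

# When you do have to copy from the CPU, a pinned source tensor plus
# non_blocking=True lets the copy overlap with computation
batch = torch.randn(64, 3, 224, 224).pin_memory()
batch = batch.to(device, non_blocking=True)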

If you really have to, you might want to give Santosh Gupta's SpeedTorch a try, although it doesn't seem entirely clear when this actually does/doesn't provide speed-ups.

8. Use gradient/activation checkpointing

Quoting directly from the documentation:

Checkpointing works by trading compute for memory. Rather than storing all intermediate activations of the entire computation graph for computing backward, the checkpointed part does not save intermediate activations, and instead recomputes them in backward pass. It can be applied on any part of a model.

Specifically, in the forward pass, function will run in torch.no_grad() manner, i.e., not storing the intermediate activations. Instead, the forward pass saves the inputs tuple and the function parameter. In the backwards pass, the saved inputs and function is retrieved, and the forward pass is computed on function again, now tracking the intermediate activations, and then the gradients are calculated using these activation values.

So while this might slightly increase your run time for a given batch size, you'll significantly reduce your memory footprint. This in turn will allow you to further increase the batch size you're using, allowing for better GPU utilization.

While checkpointing is implemented natively as torch.utils.checkpoint (docs), it does seem to take some thought and effort to implement properly. Priya Goyal has a good tutorial demonstrating some of the key aspects of checkpointing.

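As a minimal sketch, checkpoint_sequential splits an nn.Sequential model into segments and only stores activations at the segment boundaries (the layer sizes and the number of segments are placeholders to tune against your memory budget):

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).cuda()
inputs = torch.randn(32, 1024, device='cuda', requires_grad=True)

# Activations inside each of the 2 segments are recomputed during the backward pass
out = checkpoint_sequential(model, 2, inputs)
out.sum().backward()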

9. Use gradient accumulation

Another approach to increasing the batch size is to accumulate gradients across multiple .backward() passes before calling optimizer.step().

Following a post by Hugging Face's Thomas Wolf, gradient accumulation can be implemented as follows:

model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
        if (i+1) % evaluation_steps == 0:           # Evaluate the model when we...
            evaluate_model()                        # ...have no gradients accumulated

This method was developed mainly to circumvent GPU memory limitations, and I'm not entirely clear on the trade-offs of the additional .backward() passes. This discussion on the fastai forum seems to suggest that it can in fact accelerate training, so it's probably worth a try.

10. Use Distributed Data Parallel for multi-GPU training

Methods to accelerate distributed training probably warrant their own post but one simple one is to use torch.nn.DistributedDataParallel rather than torch.nn.DataParallel. By doing so, each GPU will be driven by a dedicated CPU core avoiding the GIL issues of DataParallel.

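A rough sketch of the pattern, assuming the script is launched with torchrun (which sets LOCAL_RANK for each process); MyModel, train_dataset and num_epochs are placeholders:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend='nccl')            # one process per GPU
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = DistributedDataParallel(MyModel().cuda(local_rank), device_ids=[local_rank])

sampler = DistributedSampler(train_dataset)        # shards the dataset across processes
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler,
                    num_workers=4, pin_memory=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                       # so each epoch gets a different shuffle
    for data, label in loader:
        ...                                        # the usual forward/backward/step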

11. Set gradients to None rather than 0

Use .zero_grad(set_to_none=True) rather than .zero_grad().

Doing so will let the memory allocator handle the gradients rather than actively setting them to 0. This will yield only a modest speed-up, as the documentation says, so don't expect any miracles.

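A minimal sketch of the change (optimizer and model are placeholders); the main thing to keep in mind is that any code reading p.grad afterwards now has to expect None:

optimizer.zero_grad(set_to_none=True)    # .grad fields become None instead of zeroed tensors

# e.g. gradient-inspection code now needs a None check
grad_norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]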

Watch out, doing this is not side-effect free! Check the docs for the details on this.

12. Use .as_tensor() rather than .tensor()

torch.tensor() always copies data. If you have a numpy array that you want to convert, use torch.as_tensor() or torch.from_numpy() to avoid copying the data.

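A quick sketch of the difference:

import numpy as np
import torch

arr = np.zeros(1000)

t_copy  = torch.tensor(arr)       # always copies the data
t_share = torch.as_tensor(arr)    # reuses the existing numpy buffer where possible
t_view  = torch.from_numpy(arr)   # always shares memory with the numpy array

arr[0] = 1.0
print(t_copy[0].item(), t_share[0].item(), t_view[0].item())   # 0.0 1.0 1.0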

13. Turn on debugging tools only when actually needed

PyTorch offers a number of useful debugging tools such as autograd.profiler, autograd.gradcheck, and autograd.detect_anomaly. Make use of them when you need to better understand what is going on, but turn them off when you don't need them, as they will slow down your training.

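Anomaly detection, for instance, can be toggled globally, so it is easy to keep it off for normal runs (a minimal sketch):

import torch

DEBUG = False

# Anomaly detection adds noticeable overhead to every backward pass,
# so only enable it while actually hunting down NaNs/Infs
torch.autograd.set_detect_anomaly(DEBUG)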

14. Use gradient clipping

Originally used to avoid exploding gradients in RNNs, gradient clipping (roughly speaking: gradient = min(gradient, threshold)) has both some empirical evidence and some theoretical support suggesting that it accelerates convergence.

Hugging Face's Transformer implementation is a really clean example of how to use gradient clipping as well as some of the other methods such as AMP mentioned in this post.

In PyTorch this can be done using torch.nn.utils.clip_grad_norm_ (documentation).

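The call goes between the backward pass and the optimizer step; a minimal sketch (model, optimizer, train_loader and loss_fn are placeholders, and max_norm is a value to tune):

import torch

for data, label in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(data), label)
    loss.backward()
    # Rescales gradients so that their global norm is at most max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()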

It's not entirely clear to me which models benefit how much from gradient clipping, but it seems to be robustly useful for RNNs, Transformer-based architectures, and ResNets, across a range of different optimizers.

15. Turn off bias before BatchNorm

This is a very simple one: turn off the bias of layers before BatchNormalization layers. For a 2-D convolutional layer, this can be done by setting the bias keyword to False: torch.nn.Conv2d(..., bias=False, ...). (Here's a reminder why this makes sense.)

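A small sketch of a typical conv block with the bias dropped (channel counts are placeholders):

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # bias would be redundant here...
    nn.BatchNorm2d(128),                                        # ...since BatchNorm adds its own shift (beta)
    nn.ReLU(inplace=True),
)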

You will save some parameters; I would, however, expect the speed-up from this to be relatively small compared to some of the other methods mentioned here.

16. Turn off gradient computation during validation

This one is straightforward: wrap your validation code in a torch.no_grad() context.

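A minimal sketch of a validation loop (model, val_loader and loss_fn are placeholders):

import torch

model.eval()                      # also switches dropout/BatchNorm to eval behaviour
with torch.no_grad():             # no autograd graph is built, saving memory and time
    val_loss = 0.0
    for data, label in val_loader:
        val_loss += loss_fn(model(data), label).item()
val_loss /= len(val_loader)
model.train()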

17. Use input and batch normalization

You're probably already doing this but you might want to double-check:

Are you normalizing your input?

Are you using batch-normalization?

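For image inputs, for example, input normalization typically lives in the data pipeline, e.g. with torchvision transforms (the mean/std values below are the usual ImageNet statistics, used here as placeholders), while batch normalization is a layer inside the model, as in the Conv2d/BatchNorm2d block sketched in the previous tip:

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),                            # converts to float tensors in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # per-channel standardization
                         std=[0.229, 0.224, 0.225]),
])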

Bonus tip from the comments: Use JIT to fuse point-wise operations.

If you have adjacent point-wise operations you can use PyTorch JIT to combine them into one FusionGroup which can then be launched on a single kernel rather than multiple kernels as would have been done per default. You'll also save some memory reads and writes.

Szymon Migacz shows how you can use the @torch.jit.script decorator to fuse the operations in a GELU, for instance:

@torch.jit.script
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

In this case, fusing the operations leads to a 5x speed-up for the execution of fused_gelu as compared to the unfused version.

See also this post for an example of how Torchscript can be used to accelerate an RNN.

Hat tip to u/Patient_Atmosphere45 on Reddit for the suggestion.

Sources and additional resources

Many of the tips listed above come from Szymon Migacz's talk and post in the PyTorch docs.

PyTorch Lightning's William Falcon has two interesting posts with tips to speed up training. PyTorch Lightning does already take care of some of the points above per default.

Thomas Wolf at Hugging Face has a number of interesting articles on accelerating deep learning – with a particular focus on language models.

The same goes for Sylvain Gugger and Jeremy Howard: they have many interesting posts in particular on learning rates and AdamW.

Thanks to Ben Hahn, Kevin Klein and Robin Vaaler for their feedback on a draft of this post!
