While training with the PyTorch Lightning framework, I ran into the following warning:
UserWarning: Dataloader(num_workers>0) and ddp_spawn do not mix well!
Your performance might suffer dramatically.
Please consider setting distributed_backend=ddp to use num_workers > 0 (this is a bottleneck of Python .spawn() and PyTorch
The warning seems related to multi-GPU training, so I looked into how to train on multiple GPUs. There are two places where parallelism can be configured: the dataloader and the trainer.
Checking the number of GPUs
First, check how many GPUs you have and allocate them accordingly.
To list display adapters on Linux:
lspci | grep -i vga
Since NVIDIA GPUs are the common case, you can filter for them directly:
lspci | grep -i nvidia
To check GPU utilization:
nvidia-smi
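If PyTorch is already installed, you can also query the GPU count from Python instead of shelling out. A small sketch (assumes a working `torch` install; `gpu_summary` is an illustrative helper, not a library function):

```python
import torch

def gpu_summary():
    """Return (cuda_available, device_count); device_count is 0 on CPU-only machines."""
    available = torch.cuda.is_available()
    return available, torch.cuda.device_count()

available, count = gpu_summary()
print(f"CUDA available: {available}, GPUs: {count}")
```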
Dataloader parallelism
When creating the dataloader, you can set the number of workers as follows to speed up data loading:
DataLoader(dataset, num_workers=8, pin_memory=True)
Note that pin_memory=True only makes sense when a GPU is present: it stages batches in page-locked (pinned) host memory, which speeds up host-to-device transfers.
Reference: https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
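The pinned-memory speedup comes from pairing pin_memory=True with non_blocking=True on the copy to the device. A minimal sketch with a toy dataset (the dataset here is illustrative; on a CPU-only machine the flags are harmless no-ops):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for real training data
dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))

# pin_memory=True stages batches in page-locked host memory;
# raise num_workers (e.g. to 8, as above) on a real training machine
loader = DataLoader(dataset, batch_size=16, num_workers=0, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for x, y in loader:
    # non_blocking=True lets the host-to-device copy overlap with computation
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
```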
Trainer parallelism
trainer = pl.Trainer(
    max_epochs=args.epochs,
    logger=logger,
    checkpoint_callback=checkpoint_callback,
    callbacks=[DecayLearningRate()],
    gpus=args.gpus,  # your GPU count
)
PyTorch Lightning uses a spawn-based backend here, and it has a catch: if you only specify the number of GPUs, this backend conflicts with setting multiple workers in the dataloader, which produces the warning at the top of this post. The fix is to add one more argument:
trainer = pl.Trainer(
    max_epochs=args.epochs,
    logger=logger,
    checkpoint_callback=checkpoint_callback,
    callbacks=[DecayLearningRate()],
    gpus=args.gpus,
    distributed_backend="ddp",
)
Once the ddp mode is in use, you are allowed to set num_workers > 0 in the dataloader.
Finally, here are the data-distribution modes in PyTorch Lightning, from the docs:
Lightning supports two backends. DataParallel and DistributedDataParallel. Both can be used for single-node multi-GPU training. For multi-node training you must use DistributedDataParallel.
- DATAPARALLEL (DP)
  Splits a batch across multiple GPUs on the same node. Cannot be used for multi-node training.
- DISTRIBUTEDDATAPARALLEL (DDP)
  Trains a copy of the model on each GPU and only syncs gradients. If used with DistributedSampler, each GPU trains on a subset of the full dataset.
- DISTRIBUTEDDATAPARALLEL-2 (DDP2)
  Works like DDP, except each node trains a single copy of the model using ALL GPUs on that node. Very useful when dealing with negative samples, etc…
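To make the DDP sentence about DistributedSampler concrete: each rank sees only an interleaved slice of the index space. A pure-Python sketch of the partitioning idea (`partition_indices` is a hypothetical helper; the real torch.utils.data.DistributedSampler also shuffles per epoch and pads so every rank gets equally many samples):

```python
def partition_indices(dataset_len: int, num_replicas: int, rank: int) -> list:
    """Indices this rank trains on: every num_replicas-th index, offset by rank."""
    return list(range(rank, dataset_len, num_replicas))

# With 10 samples split across 2 GPUs:
# rank 0 gets [0, 2, 4, 6, 8], rank 1 gets [1, 3, 5, 7, 9]
```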
You can toggle between each mode by setting this flag.
# DEFAULT (when using single GPU or no GPUs)
trainer = Trainer(distributed_backend=None)
# Change to DataParallel (gpus > 1)
trainer = Trainer(distributed_backend='dp')
# Change to distributed data parallel (gpus > 1)
trainer = Trainer(distributed_backend='ddp')
# Change to the second flavor of distributed data parallel (gpus > 1)
trainer = Trainer(distributed_backend='ddp2')
Reference: https://pytorch-lightning.readthedocs.io/en/0.5.3.2/Trainer/Distributed%20training/