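This is the short version; for details, see my other article: https://blog.csdn.net/qq_36276587/article/details/123913384
A brief summary of single-machine, multi-GPU distributed training with PyTorch, covering the key APIs and the overall training workflow. Works with PyTorch 1.2.0.
Initialize the GPU communication backend (NCCL)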
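import torch
import torch.distributed as dist
# FLAGS holds the parsed command-line arguments; the launcher passes --local_rank to each process
torch.cuda.set_device(FLAGS.local_rank)
dist.init_process_group(backend='nccl')
device = torch.device("cuda", FLAGS.local_rank)  # set this yourself as needed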
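The snippet above reads `FLAGS.local_rank` without showing where it comes from. A minimal sketch of one way to obtain it, assuming the script uses `argparse` (the function name `parse_flags` is hypothetical and not from the original post):

```python
import argparse

def parse_flags(argv=None):
    """Parse the arguments a per-GPU worker process receives (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="distributed training flags (sketch)")
    # torch.distributed.launch appends --local_rank=<n> to each process it starts
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores any other arguments the real script may define
    flags, _unknown = parser.parse_known_args(argv)
    return flags

# Simulate what the process for GPU 1 would see on its command line
FLAGS = parse_flags(["--local_rank=1"])
print(FLAGS.local_rank)
```

In a real script you would call `parse_flags()` with no arguments so that it reads `sys.argv`.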
Distributed data loading
# Each process gets its own shard of the dataset through the sampler
train_sampler = torch.utils.data.distributed.DistributedSampler(traindataset)
train_loader = torch.utils.data.DataLoader(
    traindataset, batch_size=batchSize,
    sampler=train_sampler,
    num_workers=4, pin_memory=True,  # drop_last=False,
    collate_fn=alignCollate(imgH=imgH, imgW=imgW, keep_ratio=FLAGS.keep_ratio))
# Training labels are processed in the PyTorch DataLoader format
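To make it clearer what the sampler does, here is a simplified pure-Python sketch of how a dataset's indices can be split across processes. It is an illustration of the idea, not the library's code: the function name `shard_indices` is hypothetical, and Python's `random` stands in for the PyTorch random permutation.

```python
import math
import random

def shard_indices(dataset_len, num_replicas, rank, epoch=0):
    """Return the sample indices one rank would see (simplified sketch)."""
    indices = list(range(dataset_len))
    # Every rank seeds with the same epoch, so all ranks build the same permutation
    random.Random(epoch).shuffle(indices)
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    # Pad by repeating samples from the front so every rank gets the same count
    indices += indices[: total_size - len(indices)]
    # Each rank takes every num_replicas-th index, starting at its own rank
    return indices[rank:total_size:num_replicas]

shards = [shard_indices(10, 4, r) for r in range(4)]
for r, shard in enumerate(shards):
    print("rank", r, "->", shard)
```

Because the permutation depends on the epoch, real training code normally calls `train_sampler.set_epoch(epoch)` at the start of every epoch; otherwise each epoch reuses the same ordering.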
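Distributed model training
# After the model is initialized (and moved to `device`), wrap it for distributed training
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # synchronized BatchNorm
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[FLAGS.local_rank],
                                                  output_device=FLAGS.local_rank)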
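Conceptually, DistributedDataParallel keeps the model replicas identical by averaging gradients across processes during the backward pass. The toy sketch below shows only that averaging step on plain Python lists. The helper `all_reduce_mean` is hypothetical; the real communication goes through NCCL and is not shown here.

```python
def all_reduce_mean(per_rank_grads):
    """Average gradient vectors elementwise across ranks (conceptual stand-in for all-reduce)."""
    world_size = len(per_rank_grads)
    return [sum(values) / world_size for values in zip(*per_rank_grads)]

# One gradient vector per process, e.g. four GPUs with two parameters each
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(all_reduce_mean(grads))  # [4.0, 5.0]
```

Since every process applies the same averaged gradient, the optimizer step leaves all replicas with the same weights.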
Launch training
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train_distributed.py
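`dist.init_process_group(backend='nccl')` with no explicit `init_method` reads its rendezvous settings from environment variables, which the launcher sets for each process it starts. For debugging, a small sketch like the following can print them. The helper name `read_dist_env` and the fallback values are assumptions for illustration only.

```python
import os

def read_dist_env(env=None):
    """Collect the rendezvous settings from the environment (illustrative sketch)."""
    env = os.environ if env is None else env
    return {
        # Fallbacks are only for running this sketch outside the launcher
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
    }

# Example: what the process with global rank 2 of 4 might see
fake_env = {"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500",
            "WORLD_SIZE": "4", "RANK": "2"}
print(read_dist_env(fake_env))
```

With `--nproc_per_node=4` on a single machine, `world_size` would be 4 and each process would get a distinct `rank` from 0 to 3.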