1. DataParallel and DistributedDataParallel
- DataParallel is single-process, multi-threaded, and only works on a single machine. DistributedDataParallel is multi-process and works on one machine or across several machines.
- DataParallel is usually slower than DistributedDataParallel (the single process is throttled by Python's GIL, and the model is replicated to every GPU each iteration), so DistributedDataParallel is the mainstream choice today.
2. DP (DataParallel): single-machine multi-GPU
To run on a machine with multiple GPUs, add the following code right after the model definition in the original single-GPU code; by default, training then uses all visible GPUs. Note how the batch is split: with batch_size=30, a single GPU trains on all 30 samples per step, while two GPUs in parallel each get 15 samples, and so on. It is therefore worth scaling batch_size up, e.g. to 30 × the number of GPUs.
- Imports:
import torch
import torch.nn as nn
from torch.nn import DataParallel
- Wrap the model with DataParallel:
device = [torch.device('cuda:0'), torch.device('cuda:1')]  # use GPU 0 and GPU 1
MODEL = UNet(1, 2).to(device[0])
MODEL = torch.nn.DataParallel(MODEL, device_ids=device)  # devices to use; omit device_ids to use all GPUs
- Place the loss function on a device too:
CRITERION = nn.CrossEntropyLoss().to(device[0])
- Move the data to the device as well; adapt this to your own code:
trainer = trainer(MODEL, CRITERION, dataloaders, device=device[0])
or
X_train, y_train = X_train.to(device[0]), y_train.to(device[0])
Only device[0] is needed here: DataParallel keeps the model's parameters on the first device and gathers outputs back to it, scattering the batch to the other GPUs itself, so there is no need to assign tensors to each GPU manually; but skipping this step raises a device-mismatch error.
- Custom samplers or dataloaders may also need matching changes; handle these case by case.
Reference: pytorch多卡训练(含demo). Because the data is split evenly across the GPUs, batch_size must be at least the number of GPUs.
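Putting the pieces together, here is a minimal end-to-end DP sketch, assuming two visible GPUs; the toy model, random data, and hyperparameters are stand-ins for your own UNet, dataset, and settings:

import torch
import torch.nn as nn
from torch.nn import DataParallel

device = [torch.device('cuda:0'), torch.device('cuda:1')]
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2)).to(device[0])
model = DataParallel(model, device_ids=device)
criterion = nn.CrossEntropyLoss().to(device[0])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(30, 16).to(device[0])         # a batch of 30 is split 15/15 across the two GPUs
y = torch.randint(0, 2, (30,)).to(device[0])
for _ in range(5):
    optimizer.zero_grad()
    loss = criterion(model(X), y)             # the forward pass is scattered/gathered by DataParallel
    loss.backward()
    optimizer.step()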
3. DDP (DistributedDataParallel): single-machine multi-GPU and multi-machine multi-GPU
3.1 Single-machine mode
3.1.1 Converting BatchNorm2d layers to SyncBatchNorm layers
- PyTorch — SyncBatchNorm
- Converting all BatchNorm2d layers in the model to SyncBatchNorm layers (single machine, multiple GPUs):
import torch.distributed as dist
# SyncBatchNorm needs an initialized process group at run time; this gloo/file
# init is the single-process example from the PyTorch docs
dist.init_process_group(backend='gloo', init_method='file:///tmp/somefile', rank=0, world_size=1)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
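After the conversion you can quickly confirm that no plain BatchNorm2d layers remain in the model from above; a small sanity-check sketch:

remaining = [name for name, m in model.named_modules() if isinstance(m, torch.nn.BatchNorm2d)]
print(remaining)  # expect an empty list: every BatchNorm2d is now a SyncBatchNorm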
- SyncBN及其pyTorch实现
- Pytorch | torch.nn.SyncBatchNorm
- Pytorch多GPU和Sync BatchNorm代码
- Pytorch多GPU的计算和Sync BatchNorm
3.1.2 Other settings
- YouTube: the PyTorch video series, Part 3: Multi-GPU training with DDP (code walkthrough); verified to work. Follow the video and make the changes step by step.
- Reference: Pytorch实现多GPU并行训练(DDP)
- DDP(DistributedDataParallel) 分布式训练1——入门上手
vimdiff can show the differences between two files, which is handy for comparing the single-GPU and DDP versions:
vimdiff single_gpu.py mp_train.py
3.1.2.1 Trainer
- Model initialization
model = model.to(gpu_id)                      # move the model to this process's GPU first
self.model = DDP(model, device_ids=[gpu_id])  # change 1
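For context, here is the shape of the Trainer constructor around change 1; the parameter list is an assumption for illustration, not the tutorial's exact signature:

from torch.nn.parallel import DistributedDataParallel as DDP

class Trainer:
    def __init__(self, model, criterion, optimizer, gpu_id):
        self.gpu_id = gpu_id
        self.criterion = criterion
        self.optimizer = optimizer
        model = model.to(gpu_id)                      # each process drives exactly one GPU
        self.model = DDP(model, device_ids=[gpu_id])  # change 1: gradients sync across processes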
- Checkpoints
# Before (single-GPU version)
if self.iter % self.ckpt_frequency == 0:
    checkpoint_name = os.path.join(self.checkpoint_dir, self.comments + 'iter_' + str(self.iter) + '.pth')
    torch.save(self.model.state_dict(), checkpoint_name)
# After (DDP version: save only from rank 0, and unwrap the DDP module)
if self.gpu_id == 0 and self.iter % self.ckpt_frequency == 0:  # change 3
    checkpoint_name = os.path.join(self.checkpoint_dir, self.comments + 'iter_' + str(self.iter) + '.pth')
    torch.save(self.model.module.state_dict(), checkpoint_name)  # change 2
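Because the checkpoint stores self.model.module.state_dict() (the bare model, without the DDP wrapper), it stays loadable from plain single-GPU code. Loading it back inside the Trainer might look like this sketch:

state = torch.load(checkpoint_name, map_location=f'cuda:{self.gpu_id}')  # map tensors to this process's GPU
self.model.module.load_state_dict(state)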
3.1.2.2 Dataset
- DataLoader
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Before: pin_memory=True is optional; enable it if the machine has plenty of RAM
loader_tr = torch.utils.data.DataLoader(dataset=dataset_tr, batch_size=batch, shuffle=True, num_workers=n_workers, pin_memory=True)
loader_ts = torch.utils.data.DataLoader(dataset=dataset_ts, batch_size=batch, shuffle=True, num_workers=n_workers, pin_memory=True)
# After: the sampler shards the data across processes; shuffle must be False when a sampler is given
loader_tr = torch.utils.data.DataLoader(dataset=dataset_tr, batch_size=batch, shuffle=False, sampler=DistributedSampler(dataset_tr), num_workers=n_workers)  # change 5
loader_ts = torch.utils.data.DataLoader(dataset=dataset_ts, batch_size=batch, shuffle=False, sampler=DistributedSampler(dataset_ts), num_workers=n_workers)  # change 6
# Equivalently, keep the samplers in named variables (useful for set_epoch, see below):
train_sampler = DistributedSampler(dataset_tr)
val_sampler = DistributedSampler(dataset_ts)
loader_tr = DataLoader(dataset_tr, batch_size=batch, shuffle=False, num_workers=n_workers, sampler=train_sampler)  # change 5
loader_ts = DataLoader(dataset_ts, batch_size=batch, shuffle=False, num_workers=n_workers, sampler=val_sampler)  # change 6
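One easily missed detail: DistributedSampler derives its shuffle seed from the epoch number, so call set_epoch at the start of every epoch, otherwise each epoch replays the same sample order. A sketch of the training loop around it (max_epochs stands in for your epoch count):

for epoch in range(max_epochs):
    train_sampler.set_epoch(epoch)  # re-seed the shuffle for this epoch
    for X, y in loader_tr:
        ...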
3.1.2.3 main
- Define ddp_setup:
import os
import torch
from torch.distributed import init_process_group

def ddp_setup(rank, world_size):
    """
    Args:
        rank: Unique identifier of each process
        world_size: Total number of processes
    """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
- The main function
ddp_setup(rank, world_size)  # change 7
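For orientation, the overall shape of main is setup first, teardown last, with the existing training code in between; a sketch (the elided body is whatever your script already does):

from torch.distributed import destroy_process_group

def main(rank, world_size):
    ddp_setup(rank, world_size)  # change 7
    ...                          # build dataloaders, model, Trainer; run training
    destroy_process_group()      # tear down the process group before the process exits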
- if __name__ == '__main__':
if __name__ == '__main__':
    import argparse
    import torch.multiprocessing as mp
    parser = argparse.ArgumentParser(description='simple distributed training job')
    args = parser.parse_args()
    world_size = torch.cuda.device_count()                 # change 9: one process per GPU
    mp.spawn(main, args=(world_size,), nprocs=world_size)  # change 8: rank is assigned automatically and passed as main's first argument
- Model changes
# Before
model = HRANet(1, 2).to(device)
# After: no .to(device) here; the Trainer moves the model to this process's GPU and wraps it in DDP
model = HRANet(1, 2)  # change 10
MODEL = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # change 11: must happen before the DDP wrap
- Loss
# Before
CRITERION = nn.CrossEntropyLoss().to(device)
# After: CrossEntropyLoss holds no parameters, so it needs no explicit device
CRITERION = nn.CrossEntropyLoss()  # change 12
- Trainer
# Before
trainer = trainer(MODEL, OPTIMIZER, LR_SCHEDULER, CRITERION, dataloaders, comment, verbose_train, verbose_val, checkpoint_frequency, MAX_EPOCHS, checkpoint_dir=checkpoints_dir, device=device)
# After
trainer = Trainer(MODEL, OPTIMIZER, LR_SCHEDULER, CRITERION, dataloaders, comment, verbose_train, verbose_val, rank, checkpoint_frequency, MAX_EPOCHS, checkpoint_dir=checkpoints_dir)  # change 13: pass rank instead of device
3.2 Multi-machine mode
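The single-machine code above extends to several machines mostly through the process-group setup: MASTER_ADDR must point at node 0, world_size becomes the total GPU count across all nodes, and each process's global rank is offset by its node index. A hedged sketch (node_rank, the address, and the port are assumptions for illustration):

import os
import torch
from torch.distributed import init_process_group

def ddp_setup_multinode(local_rank, node_rank, gpus_per_node, world_size):
    os.environ["MASTER_ADDR"] = "10.0.0.1"  # IP of node 0, reachable from every node
    os.environ["MASTER_PORT"] = "12355"
    rank = node_rank * gpus_per_node + local_rank  # globally unique rank
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

Alternatively, the torchrun launcher can handle rank assignment and rendezvous across nodes for you.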
3.3 Gradient-computation error during DDP training
- Problem:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
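Two general PyTorch debugging steps apply here. First, follow the hint in the message and enable anomaly detection, which reports the forward-pass operation whose result was later modified in place:

torch.autograd.set_detect_anomaly(True)  # put near the start of training; slows things down, debug only

Second, look for in-place operations in the model, e.g. nn.ReLU(inplace=True) or x += y on tensors the backward pass still needs, and rewrite them out-of-place (nn.ReLU(), x = x + y). If the reported tensor turns out to be a BatchNorm buffer, passing broadcast_buffers=False to the DDP constructor is a commonly suggested workaround.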
3.4 Log-file conflicts when training with multiple GPUs
- Problem:
# A "file already exists" error: every process tries to create the same run directory
FileExistsError: [Errno 17] File exists: 'runs/May22_10-57-41_autodl-container-927b4c95cc-8f4abb00liver_segmentation_HRADEA_ce_on_LITS_dataset_'
- Solution:
# Before
self.writer = SummaryWriter(comment=self.comments)
# After: give each process its own run directory by appending its rank
self.writer = SummaryWriter(comment=f'{self.comments}_rank{rank}')
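An alternative that avoids N duplicate event files is to log only from rank 0; a sketch (every writer call then needs a guard):

self.writer = SummaryWriter(comment=self.comments) if rank == 0 else None
if self.writer is not None:
    self.writer.add_scalar('loss/train', loss.item(), self.iter)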
References
[1] Official PyTorch tutorial: Optional: Data Parallelism
[2] pytorch单GPU代码改成多GPU并行训练
[3] pytorch多卡训练(含demo)
[4] Official PyTorch tutorial: Getting Started with Distributed Data Parallel
[5] Pytorch实现多GPU并行训练(DDP)
[6] 深度学习-GPU多卡并行训练总结
[7] 超详细逐步骤演示Pytorch深度学习多GPU并行训练全过程
[8] pytorch多GPU并行训练教程 (Bilibili)