When training with DistributedDataParallel (DDP), suppose that in single-GPU training one card fits a batch of 4 images with its memory nearly full. In multi-GPU training, cuda0 also has to handle communication on top of the forward and backward passes, so its memory is no longer enough. In that case we can reduce the batch size on cuda0, for example to 1.
Suppose we have 8 GPUs and set the total batch_size = 32, so each card originally gets batch_size = 32 / 8 = 4.
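Before changing anything, it is worth confirming that cuda0 really is the card running out of room. A minimal sketch, callable from inside each DDP worker (report_memory is a hypothetical helper; torch.cuda only counts tensors allocated by the calling process, so cross-check with nvidia-smi for the full per-GPU picture):

import torch

def report_memory(rank):
    # Current and peak tensor memory this process holds on its GPU, in GiB.
    cur = torch.cuda.memory_allocated(rank) / 2**30
    peak = torch.cuda.max_memory_allocated(rank) / 2**30
    print(f"rank {rank}: {cur:.2f} GiB allocated (peak {peak:.2f} GiB)")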
Take this file on GitHub as an example (the snippets below match PyTorch's official ImageNet example, examples/imagenet/main.py). Here is the part that needs changing:
parser.add_argument('-b', '--batch-size', default=32, type=int,
                    metavar='N',
                    help='mini-batch size (default: 32), this is the total '
                         'batch size of all GPUs on the current node when '
                         'using Data Parallel or Distributed Data Parallel')
...
elif args.distributed:
    # For multiprocessing distributed, DistributedDataParallel constructor
    # should always set the single device scope, otherwise,
    # DistributedDataParallel will use all available devices.
    if args.gpu is not None:
        torch.cuda.set_device(args.gpu)
        model.cuda(args.gpu)
        # When using a single GPU per process and per
        # DistributedDataParallel, we need to divide the batch size
        # ourselves based on the total number of GPUs we have
        args.batch_size = int(args.batch_size / ngpus_per_node)
        args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
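For context, ngpus_per_node and args.rank in this snippet are set up by the example's multiprocessing launch. Condensed from the same main.py (so treat the exact lines as approximate), the structure is:

import argparse
import torch
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, args):
    args.gpu = gpu
    # Global rank = node rank * GPUs-per-node + local GPU index, so the
    # process driving cuda0 on the first node ends up with args.rank == 0.
    args.rank = args.rank * ngpus_per_node + gpu

if __name__ == '__main__':
    args = argparse.Namespace(rank=0)  # node rank; 0 for single-node training
    ngpus_per_node = torch.cuda.device_count()
    # Spawn one worker process per GPU; each receives its local index as gpu.
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))

This is why the rank check introduced below can target cuda0 specifically.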
Left as-is, cuda0 will run out of memory. Change it to:
parser.add_argument('-b', '--batch-size', default=32, type=int,
                    metavar='N',
                    help='mini-batch size (default: 32), this is the total '
                         'batch size of all GPUs on the current node when '
                         'using Data Parallel or Distributed Data Parallel')
...
elif args.distributed:
    # For multiprocessing distributed, DistributedDataParallel constructor
    # should always set the single device scope, otherwise,
    # DistributedDataParallel will use all available devices.
    if args.gpu is not None:
        torch.cuda.set_device(args.gpu)
        model.cuda(args.gpu)
        # When using a single GPU per process and per
        # DistributedDataParallel, we need to divide the batch size
        # ourselves based on the total number of GPUs we have
        if args.rank == 0:
            # Give cuda0 a minimal batch to leave room for its extra load.
            args.batch_size = 1
        else:
            args.batch_size = int(args.batch_size / ngpus_per_node)
        args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
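To sanity-check the branch above without launching any processes, here is a plain-Python simulation of what each of the 8 workers would compute. The numbers match this post's scenario; samples_per_shard is a made-up figure, used only to show the per-epoch iteration counts discussed in the notes below:

import math

ngpus_per_node = 8
total_batch_size = 32     # the unchanged --batch-size flag
samples_per_shard = 1000  # hypothetical: DistributedSampler hands each rank
                          # an equal 1/8 slice of the dataset

per_rank = []
for rank in range(ngpus_per_node):
    if rank == 0:
        per_rank.append(1)  # the shrunken cuda0 batch
    else:
        per_rank.append(int(total_batch_size / ngpus_per_node))

print(per_rank)       # [1, 4, 4, 4, 4, 4, 4, 4]
print(sum(per_rank))  # 29 -> the effective global batch size
print([math.ceil(samples_per_shard / b) for b in per_rank])
# [1000, 250, 250, ...] -> cuda0 needs 4x the iterations per epoch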
Notes:
- We did not change the total batch_size, because the other GPUs still derive their per-GPU batch size from it. If we shrank the flag to account for cuda0, say to 32 - 3 = 29, the other GPUs would each compute int(29/8) = 3 instead of 4. Keeping the flag at 32, the effective total batch size becomes 7 × 4 + 1 = 29, as the simulation above confirms.
- If you previously logged and saved checkpoints once per epoch, stop driving that from cuda0. With batch_size = 1, cuda0's dataloader has 4 times as many batches per epoch, and since DDP synchronizes gradients on every iteration, all ranks advance in lockstep: by the time the other GPUs have finished 100 epochs, cuda0 is only on its 25th. Saving on cuda0's epoch counter therefore amounts to saving only every few real epochs; drive it from another rank instead, as in the sketch below.
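A minimal sketch of that last point, assuming the usual epoch-based checkpoint dict from the ImageNet example (the file path is made up): after each synchronized step every rank holds identical parameters, so rank 1, whose epoch counter still matches the other workers', can do the saving.

import torch

def maybe_save(model, optimizer, epoch, args):
    # Rank 1 still runs the original batch_size of 4, so its epoch counter
    # advances at the same pace as the rest of the job; save from there.
    if args.rank == 1:
        torch.save({
            'epoch': epoch,
            # Unwrap the DDP container so the checkpoint also loads
            # into a plain, non-distributed model.
            'state_dict': model.module.state_dict(),
            'optimizer': optimizer.state_dict(),
        }, 'checkpoint.pth.tar')  # hypothetical path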