When training with DistributedDataParallel (DDP), suppose that in single-GPU training one card fits a batch of 4 images with its memory nearly full. In multi-GPU training, cuda0 also has to handle communication on top of the forward and backward passes, so its memory is no longer enough. In that case we can reduce the batch size on cuda0, for example to 1.
Suppose we have 8 GPUs and set the total batch_size = 32, so each card originally gets batch_size = 32 / 8 = 4.
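Before changing anything, it is worth confirming that cuda0 really is the card running out of room. A minimal sketch, callable from inside each DDP worker (report_memory is a hypothetical helper; torch.cuda only counts tensors allocated by the calling process, so cross-check with nvidia-smi for the full per-GPU picture):

import torch

def report_memory(rank):
    # Current and peak tensor memory this process holds on its GPU, in GiB.
    cur = torch.cuda.memory_allocated(rank) / 2**30
    peak = torch.cuda.max_memory_allocated(rank) / 2**30
    print(f"rank {rank}: {cur:.2f} GiB allocated (peak {peak:.2f} GiB)")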
Take this file on GitHub as an example (the snippets below match PyTorch's official ImageNet example, examples/imagenet/main.py). Here is the part that needs changing:
parser.add_argument('-b', '--batch-size', default=32, type=int,
                    metavar='N',
                    help='mini-batch size (default: 32), this is the total '
                         'batch size of all GPUs on the current node when '
                         'using Data Parallel or Distributed Data Parallel')
...
elif args.distributed:
    # For multiprocessing distributed, DistributedDataParallel constructor
    # should always set the single device scope, otherwise,
    # DistributedDataParallel will use all available devices.
    if args.gpu is not None:
        torch.cuda.set_device(args.gpu)
        model.cuda(args.gpu)
        # When using a single GPU per process and per
        # DistributedDataParallel, we need to divide the batch size
        # ourselves based on the total number of GPUs we have
        args.batch_size = int(args.batch_size / ngpus_per_node)
        args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
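For context, ngpus_per_node and args.rank in this snippet are set up by the example's multiprocessing launch. Condensed from the same main.py (so treat the exact lines as approximate), the structure is:

import argparse
import torch
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, args):
    args.gpu = gpu
    # Global rank = node rank * GPUs-per-node + local GPU index, so the
    # process driving cuda0 on the first node ends up with args.rank == 0.
    args.rank = args.rank * ngpus_per_node + gpu

if __name__ == '__main__':
    args = argparse.Namespace(rank=0)  # node rank; 0 for single-node training
    ngpus_per_node = torch.cuda.device_count()
    # Spawn one worker process per GPU; each receives its local index as gpu.
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))

This is why the rank check introduced below can target cuda0 specifically.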
Left as-is, cuda0 will run out of memory. Change it to:
parser.add_argument('-b', '--batch-size', default=32, type=int,
                    metavar='N',
                    help='mini-batch size (default: 32), this is the total '
                         'batch size of all GPUs on the current node when '
                         'using Data Parallel or Distributed Data Parallel')
...
elif args.distributed:
    # For multiprocessing distributed, DistributedDataParallel constructor
    # should always set the single device scope, otherwise,
    # DistributedDataParallel will use all available devices.
    if args.gpu is not None:
        torch.cuda.set_device(args.gpu)
        model.cuda(args.gpu)
        # When using a single GPU per process and per
        # DistributedDataParallel, we need to divide the batch size
        # ourselves based on the total number of GPUs we have
        if args.rank == 0:
            # Give cuda0 a minimal batch to leave room for its extra load.
            args.batch_size = 1
        else:
            args.batch_size = int(args.batch_size / ngpus_per_node)
        args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
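To sanity-check the branch above without launching any processes, here is a plain-Python simulation of what each of the 8 workers would compute. The numbers match this post's scenario; samples_per_shard is a made-up figure, used only to show the per-epoch iteration counts discussed in the notes below:

import math

ngpus_per_node = 8
total_batch_size = 32     # the unchanged --batch-size flag
samples_per_shard = 1000  # hypothetical: DistributedSampler hands each rank
                          # an equal 1/8 slice of the dataset

per_rank = []
for rank in range(ngpus_per_node):
    if rank == 0:
        per_rank.append(1)  # the shrunken cuda0 batch
    else:
        per_rank.append(int(total_batch_size / ngpus_per_node))

print(per_rank)       # [1, 4, 4, 4, 4, 4, 4, 4]
print(sum(per_rank))  # 29 -> the effective global batch size
print([math.ceil(samples_per_shard / b) for b in per_rank])
# [1000, 250, 250, ...] -> cuda0 needs 4x the iterations per epoch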
Notes:
- We did not change the total batch_size, because the other GPUs still derive their per-GPU batch size from it. If we shrank the flag to account for cuda0, say to 32 - 3 = 29, the other GPUs would each compute int(29/8) = 3 instead of 4. Keeping the flag at 32, the effective total batch size becomes 7 × 4 + 1 = 29, as the simulation above confirms.
- If you previously logged and saved checkpoints once per epoch, stop driving that from cuda0. With batch_size = 1, cuda0's dataloader has 4 times as many batches per epoch, and since DDP synchronizes gradients on every iteration, all ranks advance in lockstep: by the time the other GPUs have finished 100 epochs, cuda0 is only on its 25th. Saving on cuda0's epoch counter therefore amounts to saving only every few real epochs; drive it from another rank instead, as in the sketch below.
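A minimal sketch of that last point, assuming the usual epoch-based checkpoint dict from the ImageNet example (the file path is made up): after each synchronized step every rank holds identical parameters, so rank 1, whose epoch counter still matches the other workers', can do the saving.

import torch

def maybe_save(model, optimizer, epoch, args):
    # Rank 1 still runs the original batch_size of 4, so its epoch counter
    # advances at the same pace as the rest of the job; save from there.
    if args.rank == 1:
        torch.save({
            'epoch': epoch,
            # Unwrap the DDP container so the checkpoint also loads
            # into a plain, non-distributed model.
            'state_dict': model.module.state_dict(),
            'optimizer': optimizer.state_dict(),
        }, 'checkpoint.pth.tar')  # hypothetical path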