A Detailed Step-by-Step Walkthrough of Multi-GPU Parallel Training in PyTorch

撑住我爱代码

Last modified: 2023-10-04 12:46:34

Tags: deep learning, pytorch, artificial intelligence

First published: 2023-10-02 23:56:49

Article link: https://blog.csdn.net/qq_52868077/article/details/133500862

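Preface

While reproducing deep learning papers recently, I ran into work that was trained on multiple GPUs in parallel. To compare my reproduced results with the accuracy reported in the paper, I learned the whole process of multi-GPU parallel training in PyTorch. Working through it myself, I found the material online rather scattered, so in this post I collect what I have learned and explain the multi-GPU training process step by step.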

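I. Basics

1. Existing reference material

Detailed tutorials on the fundamentals of multi-GPU parallel training already exist online, so I will not repeat them here. Readers who are not yet familiar with the fundamentals can refer to the following video and blog post:

pytorch多GPU并行训练教程_哔哩哔哩_bilibili

深度学习-GPU多卡并行训练总结_多gpu并行训练_MrRoose的博客-CSDN博客

2. Summary

Compared with single-GPU code, the multi-GPU code mainly makes the following changes:

(1) Add a function, init_distributed_mode, that initializes the environment of each process, and call it at the start of the main function: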

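import os

import torch
import torch.distributed as dist


def init_distributed_mode(args):
    # Multi-node: WORLD_SIZE is the total number of processes across all machines (one process per GPU);
    # RANK is the global index of this process, counted from 0.
    # LOCAL_RANK is the index of this process (and of its GPU) on the current machine.
    # Single node, multiple GPUs: WORLD_SIZE is the number of GPUs and RANK == LOCAL_RANK.
    # Launching with the "--use_env" flag makes the launcher store RANK, WORLD_SIZE and LOCAL_RANK in os.environ.
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        args.rank = int(os.environ["RANK"])   # environment values are strings, so convert to int
        args.world_size = int(os.environ['WORLD_SIZE'])
        args.gpu = int(os.environ['LOCAL_RANK'])
    elif 'SLURM_PROCID' in os.environ:
        # Note: this branch does not set args.world_size; it must already be present on args
        args.rank = int(os.environ['SLURM_PROCID'])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print('Not using distributed mode')
        args.distributed = False
        return

    args.distributed = True

    torch.cuda.set_device(args.gpu)   # select the GPU used by this process; with multiple GPUs, one process is started per GPU
    args.dist_backend = 'nccl'  # PyTorch supports several communication backends; NCCL is recommended for NVIDIA GPUs
    print('| distributed init (rank {}): {}'.format(
        args.rank, args.dist_url), flush=True)

    # Create the process group
    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                            world_size=args.world_size, rank=args.rank)
    # world_size is the same for every process; rank differs
    dist.barrier()   # wait until every process has reached this point before continuing

    # In the main function: initialize the environment of each process
    init_distributed_mode(args=args)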
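
To make the role of the environment variables concrete, here is a minimal pure-Python sketch of the first branch of the function above. The helper name read_dist_env and the fake environment are my own illustrations, not part of the original script; it only parses the variables and does not touch torch.

```python
import os


def read_dist_env(environ=os.environ):
    """Return (rank, world_size, local_rank), or None when not launched in distributed mode."""
    if "RANK" not in environ or "WORLD_SIZE" not in environ:
        return None
    rank = int(environ["RANK"])              # global index of this process
    world_size = int(environ["WORLD_SIZE"])  # total number of processes
    local_rank = int(environ.get("LOCAL_RANK", 0))  # index on the current machine
    return rank, world_size, local_rank


# Simulate what the launcher would export for the second of three processes on one machine.
fake_env = {"RANK": "1", "WORLD_SIZE": "3", "LOCAL_RANK": "1"}
print(read_dist_env(fake_env))   # (1, 3, 1)
print(read_dist_env({}))         # None
```

Running the script directly with python, without the launcher, corresponds to the second call: none of the variables are set, so the function falls through to the non-distributed case.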

(2) Use DistributedSampler and BatchSampler to process the training and validation data:

    # Assign training sample indices to the process of each rank: shuffle -> pad -> assign
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_data_set)
    val_sampler = torch.utils.data.distributed.DistributedSampler(val_data_set)

    # Group the sample indices into lists of batch_size elements (further processing of the train sampler)
    train_batch_sampler = torch.utils.data.BatchSampler(train_sampler, batch_size, drop_last=True)
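
The "pad, then assign" part of that comment can be sketched in plain Python. This is a simplified model of how the indices are split across ranks, not PyTorch's actual code: the shuffle is left out, and the helper name shard_indices is hypothetical.

```python
import math


def shard_indices(num_samples, world_size, rank):
    """Pad the index list to a multiple of world_size, then take every world_size-th index."""
    indices = list(range(num_samples))           # shuffling omitted for clarity
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices += indices[: total - num_samples]    # pad by repeating the first indices
    return indices[rank:total:world_size]        # strided slice for this rank


for rank in range(3):
    print(rank, shard_indices(10, world_size=3, rank=rank))
# 0 [0, 3, 6, 9]
# 1 [1, 4, 7, 0]
# 2 [2, 5, 8, 1]
```

With 10 samples and 3 processes, two samples are repeated so that every process receives the same number of indices, which keeps the processes in step during training.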

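(3) Load the pretrained weights, or load the main process's weights on every device:

    # Load the pretrained weights if they exist
    if os.path.exists(weights_path):
        weights_dict = torch.load(weights_path, map_location=device)
        model_dict = model.state_dict()
        # Keep only entries that exist in the model and have the same number of elements
        load_weights_dict = {k: v for k, v in weights_dict.items()
                             if k in model_dict and model_dict[k].numel() == v.numel()}
        model.load_state_dict(load_weights_dict, strict=False)
    else:
        # No pretrained weights are available
        # With multiple GPUs, every device must start from exactly the same initial weights
        # Otherwise the gradients computed during training would not refer to the same set of parameters
        checkpoint_path = os.path.join(tempfile.gettempdir(), "initial_weights.pt")
        # Save the weights of the first process and load them in the other processes to keep the initialization identical
        if rank == 0:
            torch.save(model.state_dict(), checkpoint_path)   # save the initial weights in the main process
        dist.barrier()   # synchronize the processes so the file exists before anyone reads it
        # Always pass map_location here; otherwise the first GPU ends up using more memory
        model.load_state_dict(torch.load(checkpoint_path, map_location=device))   # load the main process's weights on each device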
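
The dictionary comprehension in the pretrained branch is easier to follow with a toy example. The sketch below uses plain shape tuples instead of tensors; the layer names and shapes are hypothetical, and filter_by_numel is an illustrative helper rather than anything from the original script.

```python
import math


def filter_by_numel(checkpoint_shapes, model_shapes):
    """Keep checkpoint entries whose name exists in the model with the same element count."""
    return {
        name: shape
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and math.prod(shape) == math.prod(model_shapes[name])
    }


# A 1000-class checkpoint loaded into a 5-class model: only the backbone layer survives.
ckpt_shapes = {"conv1.weight": (64, 3, 7, 7), "fc.weight": (1000, 512), "fc.bias": (1000,)}
net_shapes = {"conv1.weight": (64, 3, 7, 7), "fc.weight": (5, 512), "fc.bias": (5,)}
print(list(filter_by_numel(ckpt_shapes, net_shapes)))   # ['conv1.weight']
```

The dropped classifier entries are the reason load_state_dict is called with strict=False in the snippet above.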

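(4) For networks that use Batch Normalization and whose weights are not frozen, consider using SyncBatchNorm:

    # Whether to freeze weights
    if args.freeze_layers:
        for name, para in model.named_parameters():
            # Freeze every weight except the final fully connected layer
            if "fc" not in name:
                para.requires_grad_(False)
    else:
        # SyncBatchNorm only makes sense when training a network that contains BN layers
        if args.syncBN:
            # Training takes longer with SyncBatchNorm (BN with cross-device synchronization)
            model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)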
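
Why synchronizing BN statistics can matter is easiest to see with numbers. The following sketch uses made-up activations for one channel on two GPUs and compares the statistics each GPU would compute alone with the statistics over the combined batch, which is what synchronized BN aims for. It is a worked illustration only, not how SyncBatchNorm is implemented.

```python
from statistics import fmean, pvariance

# Hypothetical activations of one channel on two GPUs (one small batch each).
gpu0 = [1.0, 2.0, 3.0]
gpu1 = [7.0, 8.0, 9.0]

# Plain BN: each GPU normalizes with its own mean and variance.
local_stats = [(fmean(x), pvariance(x)) for x in (gpu0, gpu1)]

# Synchronized BN: mean and variance over the samples of all GPUs together.
global_stats = (fmean(gpu0 + gpu1), pvariance(gpu0 + gpu1))

print("per-GPU (mean, var):", local_stats)
print("global  (mean, var):", global_stats)
```

Here each GPU sees a variance of about 0.67 around its own mean, while the combined batch has mean 5 and a variance of about 9.67, so the two schemes normalize the same activations quite differently when per-GPU batches are small.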

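(5) Wrap the model and convert it to a DDP model:

    # Convert to a DDP model
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])   # wrap the model and specify its device id
    # From here on, the processes on the different devices can communicate

(6) Call set_epoch during training so that each device sees more varied data:

    train_sampler.set_epoch(epoch)   # the epoch changes the generator's random seed, so each device gets different data in every epoch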
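
The effect of set_epoch can be sketched without torch. The idea, as I understand it, is that the shuffle is seeded with seed + epoch, so every process computes the same permutation within an epoch and a new one in the next epoch. The sketch below substitutes Python's random module for torch's generator, and epoch_permutation is a hypothetical helper.

```python
import random


def epoch_permutation(num_samples, seed, epoch):
    """Shuffle indices with a generator seeded by seed + epoch (identical on every process)."""
    indices = list(range(num_samples))
    random.Random(seed + epoch).shuffle(indices)
    return indices


epoch0 = epoch_permutation(20, seed=0, epoch=0)
epoch1 = epoch_permutation(20, seed=0, epoch=1)

# Same seed and epoch give the same order, so all processes agree on the permutation.
print(epoch0 == epoch_permutation(20, seed=0, epoch=0))
# A different epoch gives a different order, so each rank's slice changes between epochs.
print(epoch0 == epoch1)
```

If set_epoch is never called, the epoch stays at its initial value and every epoch reuses the same order.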

(7) During training, average the loss across devices; during validation, sum the number of correct predictions across devices:

    # Training
    loss = reduce_value(loss, average=True)   # mean loss across devices (only has an effect with multiple GPUs, not with a single GPU)
    mean_loss = (mean_loss * step + loss.detach()) / (step + 1)  # update mean losses

    # Print the mean loss in process 0
    if is_main_process():
        data_loader.desc = "[epoch {}] mean loss {}".format(epoch, round(mean_loss.item(), 3))
        # set the progress bar description to show the mean loss

    if not torch.isfinite(loss):   # if the loss is not finite (inf or NaN), report it and stop training
        print('WARNING: non-finite loss, ending training ', loss)
        sys.exit(1)

    # Validation
    sum_num = reduce_value(sum_num, average=False)
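
The two arithmetic steps in the training part can be checked with a small simulation. The post does not show reduce_value, so the sketch assumes it behaves like an all-reduce sum followed by division by the number of processes; all_reduce_mean and the loss values are hypothetical.

```python
def all_reduce_mean(per_rank_values):
    """Simulate an all-reduce sum followed by division by the number of processes."""
    return sum(per_rank_values) / len(per_rank_values)


# Hypothetical per-GPU losses for three consecutive steps on 3 GPUs.
steps = [[0.9, 1.1, 1.0], [0.7, 0.8, 0.9], [0.5, 0.6, 0.7]]

mean_loss = 0.0
for step, losses in enumerate(steps):
    loss = all_reduce_mean(losses)                        # every process ends up with this value
    mean_loss = (mean_loss * step + loss) / (step + 1)    # same running-mean update as in the snippet above
    print(step, round(loss, 3), round(mean_loss, 3))
```

The running mean after three steps equals the plain average of the three reduced losses, which is what the progress bar description displays.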

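(8) During training and validation, also wait for each device to finish its work:

    # Wait for the computation to finish
    if device != torch.device("cpu"):   # on GPU, block until all queued work on this device has completed
        torch.cuda.synchronize(device)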

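3. Multi-GPU launch command

python -m torch.distributed.launch --nproc_per_node=3 --use_env train_multi_gpu.py

The nproc_per_node argument sets the number of processes, one per GPU; here I use 3 GPUs. train_multi_gpu.py is the name of my own multi-GPU training script and should be changed to match yours. Newer PyTorch releases deprecate torch.distributed.launch in favor of torchrun, which exports these environment variables without needing --use_env.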

II. My environment

(1) Operating system: Windows 10

(2) Anaconda version: Anaconda3

(3) torch version: 1.12.0

(4) CUDA version: 11.3

(5) GPU model: NVIDIA GeForce RTX 3090

III. Command-line steps

1. Open Anaconda Powershell Prompt or Anaconda Prompt.

2. Activate your virtual environment and change to the project folder:

(base) C:\Users\user5>conda activate torch_zfy

(torch_zfy) C:\Users\user5>D:

(torch_zfy) D:\>cd ZFY\Postgraduate_0\Semantic segmentation\code\DWin-HRT

Here torch_zfy is the virtual environment I created, and DWin-HRT is the name of my project.

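3. Enter the multi-GPU launch command

The launch command is the one given in section I.3 of this post. The point to watch is how to restrict training to specific GPUs on Windows. In the video listed at the top, the uploader mentions that you can prefix the command with "CUDA_VISIBLE_DEVICES=4,5,6" to train in parallel on the GPUs numbered 4, 5 and 6. In practice this raised an error for me. After some searching, the cause appears to be the operating system: the VAR=value command prefix is Unix shell syntax, which the Windows command prompt does not accept (see 解决报错：‘CUDA_VISIBLE_DEVICES‘ 不是内部或外部命令，也不是可运行的程序或批处理文件。_cuda_visible_devices' 不是内部或外部命令,也不是可运行的程序-CSDN博客). My workaround is to select the devices at the very top of the script, before torch is imported:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5,6'   # must be set before torch initializes CUDA

With this in place, the launch command above can be used unchanged. In the accompanying screenshot, my code can be seen running simultaneously on the three GPUs numbered 4, 5 and 6.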
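
One consequence worth spelling out: once CUDA_VISIBLE_DEVICES is set, the process sees only the listed GPUs and renumbers them from 0, so local rank 0 lands on physical GPU 4. The sketch below illustrates that mapping; physical_gpu is a hypothetical helper for illustration, not a torch or CUDA API.

```python
import os

# Must run before the first torch / CUDA call in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6"


def physical_gpu(local_rank, visible=None):
    """Map a logical device index (cuda:N inside the process) back to the physical GPU id."""
    visible = os.environ["CUDA_VISIBLE_DEVICES"] if visible is None else visible
    ids = [int(part) for part in visible.split(",")]
    return ids[local_rank]


print([physical_gpu(rank) for rank in range(3)])   # [4, 5, 6]
```

This is why the launch command can keep --nproc_per_node=3 and the script can keep using LOCAL_RANK values 0, 1 and 2 unchanged.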

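IV. Some errors and how I handled them

1. RuntimeError: Distributed package doesn't have NCCL built in

The reference I found for this is: Windows RuntimeError: Distributed package doesn‘t have NCCL built in问题_nccl的替代_StarCap的博客-CSDN博客

Following that post, change the communication backend in the init_distributed_mode function from 'nccl' to 'gloo':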

def init_distributed_mode(args):
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ['WORLD_SIZE'])
        args.gpu = int(os.environ['LOCAL_RANK'])
    elif 'SLURM_PROCID' in os.environ:
        args.rank = int(os.environ['SLURM_PROCID'])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print('Not using distributed mode')
        args.distributed = False
        return

    args.distributed = True

    torch.cuda.set_device(args.gpu)
    args.dist_backend = 'gloo'   # the Windows build of PyTorch has no NCCL, so use Gloo instead
    print('| distributed init (rank {}): {}'.format(
        args.rank, args.dist_url), flush=True)

    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                            world_size=args.world_size, rank=args.rank)
    dist.barrier()
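
If the same script has to run on both Windows and Linux, the backend could be chosen at run time instead of being hard-coded. The sketch below is one simple way to do that, assuming every non-Windows machine has an NCCL-enabled build; pick_backend is a hypothetical helper, and a more robust check would be torch.distributed.is_nccl_available().

```python
import sys


def pick_backend(platform=sys.platform):
    """Return 'gloo' on Windows and 'nccl' elsewhere (assumes NCCL is available off Windows)."""
    return "gloo" if platform.startswith("win") else "nccl"


print(pick_backend("win32"))   # gloo
print(pick_backend("linux"))   # nccl
```

The result would then be assigned to args.dist_backend in place of the literal string.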

2. ValueError: batch_sampler option is mutually exclusive with batch_size, shuffle, sampler, and drop_last

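This error is raised because the batch_sampler option is mutually exclusive with batch_size, shuffle, sampler and drop_last. According to Pytorch：Pytorch ValueError: sampler选项与shuffle选项互斥|极客教程 (geek-docs.com), the sampler option already fixes how samples are drawn, while the shuffle option reorders the data, so the two overlap in function. To avoid conflicts, PyTorch does not allow them to be used together.

My fix was to remove the shuffle argument from the DataLoader: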

    # Assign training sample indices to the process of each rank
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset)

    # Group the sample indices into lists of batch_size elements
    train_batch_sampler = torch.utils.data.BatchSampler(train_sampler, train_batch_size, drop_last=True)

    train_loader = torch.utils.data.DataLoader(train_dataset, batch_sampler=train_batch_sampler, num_workers=2, pin_memory=True)
    val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=val_batch_size, sampler=val_sampler, num_workers=2, pin_memory=True)
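
The reason the options exclude each other becomes clearer once you see what a batch sampler hands to the DataLoader: complete lists of indices, one list per batch. Order, batch size and the handling of the last incomplete batch are already decided at that point. The sketch below is a plain-Python model of that grouping, with batch_indices as a hypothetical helper.

```python
def batch_indices(indices, batch_size, drop_last):
    """Group sample indices into lists of batch_size, as a batch sampler would yield them."""
    batches = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()   # discard the final incomplete batch
    return batches


rank_indices = [0, 3, 6, 9, 2, 5, 8]   # indices one rank might receive from its sampler
print(batch_indices(rank_indices, batch_size=3, drop_last=True))
# [[0, 3, 6], [9, 2, 5]]
print(batch_indices(rank_indices, batch_size=3, drop_last=False))
# [[0, 3, 6], [9, 2, 5], [8]]
```

Passing batch_size, shuffle, sampler or drop_last alongside batch_sampler would give the DataLoader two conflicting descriptions of the same thing, hence the ValueError.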

3. A very long error right after training starts

The error message is: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

For a discussion of this error, see 解决pytorch报错——RuntimeError: Expected to have finished reduction in the prior iteration..._Polaris_T的博客-CSDN博客. In short, the message means that DDP started a new iteration while the gradient reduction of the previous one was still incomplete, which happens when some parameters of the module did not take part in producing the loss.

After checking, I could not find any part of the loss in my code that failed to take part in backpropagation, so I followed the hint and added the argument "find_unused_parameters=True" to torch.nn.parallel.DistributedDataParallel. With this change the code ran successfully.

    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], find_unused_parameters=True)