A basic usage framework for DistributedDataParallel in PyTorch

水蓝城城主

Tags: pytorch, deep learning, python

Article link: https://blog.csdn.net/mrliuzhao/article/details/128807119

Overview

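Compared with torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel (DDP) uses multiple processes for parallelism, so it is not constrained by Python's GIL, although it is somewhat more complicated to use. The model is replicated into each process once, when DDP is constructed, rather than on every forward pass as in DataParallel, which also helps with speed.

Some background on how it works: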

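DDP first needs a process group, created with init_process_group, so that the processes can communicate.
When DDP is constructed, the model's state_dict is broadcast from the rank 0 process to every other process in the group, so all processes start from the same model state. DDP then creates a Reducer to manage gradient synchronization. Because all replicas start identical and the Reducer gives every process the same synchronized gradients, each process applies the same update, so parameters never need to be copied between processes during training, which saves time. Because DDP relies on this assumption, you must not modify the model's structure or parameters dynamically during training!
By the same reasoning, gradients are synchronized across processes during each loss.backward(), so after every optimizer.step() the parameters are identical on all processes, and a checkpoint only needs to be saved from one process.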
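
The reasoning above can be illustrated with a small plain-Python simulation. This is only a sketch of the idea, not DDP's actual implementation: the function names, the two-element parameter vector, and the made-up gradients are all assumptions for illustration. Two replicas start from the same parameters, compute different local gradients, average them (the role the Reducer plays), and apply the same SGD step.

```python
def all_reduce_mean(per_rank_grads):
    """Average gradients element-wise across ranks (stand-in for the Reducer)."""
    world_size = len(per_rank_grads)
    num_params = len(per_rank_grads[0])
    return [sum(g[i] for g in per_rank_grads) / world_size for i in range(num_params)]


def sgd_step(params, grads, lr):
    """Plain SGD update applied independently inside each process."""
    return [p - lr * g for p, g in zip(params, grads)]


def simulate(steps=3, lr=0.1, world_size=2):
    # Every replica starts from rank 0's parameters (the broadcast at DDP construction).
    rank0_params = [1.0, -2.0]
    replicas = [list(rank0_params) for _ in range(world_size)]
    for step in range(steps):
        # Each rank sees different data, so its local gradients differ (made-up values).
        local_grads = [[0.5 * (rank + 1) + step, -1.0 * (rank + 1)] for rank in range(world_size)]
        synced = all_reduce_mean(local_grads)
        # All ranks apply the same averaged gradient, so no parameter copy is needed.
        replicas = [sgd_step(params, synced, lr) for params in replicas]
    return replicas


replicas = simulate()
print(replicas[0] == replicas[1])
```

If one replica changed its parameters outside this loop, the replicas would silently diverge, which is why dynamic modifications are ruled out.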

Let's start with a minimal example:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def example(rank, world_size):
    # create default process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # create local model
    model = nn.Linear(10, 10).to(rank)
    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # forward pass
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)
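    # backward pass, gradients are synchronized (all-reduced) across processes during this step
    loss_fn(outputs, labels).backward()
    # update parameters; every process applies the same update
    optimizer.step()
    # release the process group's resources
    dist.destroy_process_group()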

def main():
    world_size = 2
    mp.spawn(example, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    # Environment variables which need to be
    # set when using c10d's default "env"
    # initialization mode.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    main()

Let's walk through the specific APIs used in this function.

torch.multiprocessing.spawn

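Much like Python's own multiprocessing, torch.multiprocessing.spawn dispatches parallel work to a given function. The function's first argument is the process index; args supplies the function's remaining arguments; nprocs sets the number of processes, which is usually the number of available GPUs; join=True means torch.multiprocessing.spawn blocks until all processes have finished and then returns None.

Processes can also be started with torch.distributed.launch, which launches the training script from the command line to run multiple processes in parallel, for example: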

# Single node, multiple GPUs
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
            YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 \
            and all other arguments of your training script)

# Multiple nodes, multiple GPUs
# Node 0 (master) - 192.168.1.1:1234
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
           --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
           --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 \
           and all other arguments of your training script)

# Node 1 (worker)
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
           --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" \
           --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 \
           and all other arguments of your training script)

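Tools such as torchrun work in a similar way. They are mostly used for multi-machine, multi-GPU training, which also needs other tooling, such as slurm, to schedule jobs across the cluster. I will leave these for later study.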
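
When a script is started by a launcher instead of mp.spawn, it has to learn its rank from the environment. torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT to each process it starts. The helper below is a minimal sketch of reading them; the function name read_launcher_env and the fallback defaults (meant for running the script as a single process without a launcher) are assumptions made for illustration.

```python
import os


def read_launcher_env(env=None):
    """Read the distributed settings that a launcher such as torchrun exports.

    The defaults are an assumption that lets the script also run as a plain
    single process when no launcher is involved.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global process index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this machine, usually the GPU id
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", 29500)),
    }


# Values as they might look on the second node of the two-node example above.
cfg = read_launcher_env({
    "RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "4",
    "MASTER_ADDR": "192.168.1.1", "MASTER_PORT": "1234",
})
print(cfg["rank"], cfg["local_rank"], cfg["world_size"])
```

With these values available, init_process_group can be called with the default env:// initialization, and local_rank can be used to pick the device for the model.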

torch.distributed.init_process_group

A process group must be created with torch.distributed.init_process_group so that the processes can communicate. The signature is torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(seconds=1800), world_size=-1, rank=-1, store=None, group_name='', pg_options=None), with the following parameters:

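backend: the distributed backend, given as a string. Valid values are gloo, nccl and mpi. On Linux, gloo and nccl ship with the PyTorch installation; gloo is intended for distributed CPU training and nccl for distributed GPU training. mpi is only available when PyTorch is built from source, so it is not considered here. See the backend comparison table in the official documentation for details.
world_size: the number of processes in the process group, an integer.
rank: the index of the current process, an integer.
init_method: a URL string that specifies how the process group is initialized. It is mutually exclusive with store; when neither is given, the default env:// is used.
store: a key-value store, shared by all processes, used to exchange connection information. It is mutually exclusive with init_method. It is a torch.distributed.Store, one of TCPStore, FileStore or HashStore.
timeout: how long operations on the process group may wait. For the nccl backend it only takes effect when one of two environment variables is set. With NCCL_BLOCKING_WAIT=1, a failed operation waits for timeout and then raises an exception that the user can catch. With NCCL_ASYNC_ERROR_HANDLING=1, it waits for timeout and then the process crashes. Only one of the two may be set; if neither is set, the timeout has no effect for nccl. Keep the workload balanced across processes, or else increase this value so that slow processes do not trigger a timeout exit.
group_name: the name of the process group, a string.
pg_options: extra process group options, not covered here.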
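
To see these parameters together without any GPUs or multiple processes, a one-process group can be created on the CPU. The snippet below is a minimal sketch under stated assumptions: the function name init_single_process_group is hypothetical, world_size=1 is chosen only so it runs in a single process, and a file:// rendezvous in a temporary directory replaces the env:// default so that no environment variables are needed.

```python
import datetime
import os
import tempfile

import torch
import torch.distributed as dist


def init_single_process_group():
    """Create a one-process gloo group, run one collective, then clean up."""
    # file:// initialization needs a path that does not exist yet; a fresh temp dir guarantees that.
    init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group(
        backend="gloo",                          # CPU backend
        init_method=f"file://{init_file}",       # mutually exclusive with store
        world_size=1,                            # assumption: single process for illustration
        rank=0,
        timeout=datetime.timedelta(seconds=60),
    )
    try:
        tensor = torch.ones(3)
        # With one process, the sum over all ranks is just the tensor itself.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return dist.get_rank(), dist.get_world_size(), tensor.tolist()
    finally:
        dist.destroy_process_group()


print(init_single_process_group())
```

In a real job, world_size and rank would come from mp.spawn or from the launcher's environment variables, and the backend would usually be nccl for GPU training.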

torch.nn.parallel.DistributedDataParallel

torch.nn.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, dim=0, broadcast_buffers=True, process_group=None, bucket_cap_mb=25, find_unused_parameters=False, gradient_as_bucket_view=False, static_graph=False)

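module: the model to be parallelized.
device_ids: a list of CUDA devices. It usually contains a single integer, the device used by this process. It must be None for CPU modules and for multi-device modules (one model whose layers are placed on different GPUs).
output_device: the device on which the output is placed, device_ids[0] by default. It must also be None for CPU modules and multi-device modules.
bucket_cap_mb: parameter gradients are grouped into buckets to make communication more efficient; this parameter controls the bucket size, in MB.
find_unused_parameters: with find_unused_parameters=True, DDP marks every parameter that was not used in the forward pass as ready, so the Reducer only waits for and reduces the gradients of the parameters actually used. This requires traversing the autograd graph, which adds overhead, so the default False is usually kept.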
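
The CPU case described for device_ids can be tried in a single process. The following is a small sketch, assuming a one-process gloo group with a file:// rendezvous; the function name ddp_on_cpu and the layer sizes are hypothetical. It wraps a CPU model with device_ids=None, runs one forward and backward pass, and returns the keys of the wrapped model's state_dict.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_on_cpu():
    """Wrap a CPU model in DDP inside a one-process group and run one backward pass."""
    init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group("gloo", init_method=f"file://{init_file}", world_size=1, rank=0)
    try:
        model = nn.Linear(4, 2)  # stays on the CPU
        ddp_model = DDP(
            model,
            device_ids=None,               # required for CPU modules
            bucket_cap_mb=25,              # the default bucket size
            find_unused_parameters=False,  # the default; every parameter is used here
        )
        loss = ddp_model(torch.randn(8, 4)).sum()
        loss.backward()  # gradients are reduced across the (single) process here
        has_grads = all(p.grad is not None for p in ddp_model.parameters())
        return sorted(ddp_model.state_dict().keys()), has_grads
    finally:
        dist.destroy_process_group()


print(ddp_on_cpu())
```

For single-GPU-per-process training, the only changes would be moving the model to the process's device and passing device_ids=[rank], as in the first example of this article.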

Some practical notes

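After wrapping a model with DDP, every key in its state_dict gains a module. prefix. For example, if the original model has the key net1.weight, the wrapped model has module.net1.weight. Testing shows that the tensor contents of the two are identical, so it is enough to checkpoint the state_dict of the unwrapped model, which lets later prediction run without DDP or a process group. Alternatively, as with DataParallel, call ddp_model.module.state_dict() to get the parameters and save them.
The machines and devices may differ from one training run to the next, so when restoring from a checkpoint with torch.load, always pass map_location to name the device used by the current process, so that the data is loaded onto the right device.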
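
Both notes can be sketched in a few lines without starting a process group. This is an illustration under assumptions: the helper strip_module_prefix and the function roundtrip_checkpoint are hypothetical names, the DDP-style keys are simulated by adding the module. prefix by hand, and an in-memory buffer stands in for the checkpoint file.

```python
import io

import torch
import torch.nn as nn


def strip_module_prefix(state_dict, prefix="module."):
    """Remove the prefix that DDP adds to every state_dict key."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def roundtrip_checkpoint():
    model = nn.Linear(4, 2)
    # Simulate ddp_model.state_dict(): same tensors, keys prefixed with "module.".
    ddp_style_state = {f"module.{key}": value for key, value in model.state_dict().items()}

    buffer = io.BytesIO()  # stands in for a checkpoint file on disk
    torch.save(ddp_style_state, buffer)
    buffer.seek(0)

    # map_location decides where tensors land; in a GPU process this would be
    # the device of the current process, for example f"cuda:{rank}".
    loaded = torch.load(buffer, map_location="cpu")

    restored = nn.Linear(4, 2)  # a plain model, no DDP wrapper needed
    restored.load_state_dict(strip_module_prefix(loaded))
    same = all(
        torch.equal(a, b)
        for a, b in zip(model.state_dict().values(), restored.state_dict().values())
    )
    return sorted(loaded.keys()), same


print(roundtrip_checkpoint())
```

Saving ddp_model.module.state_dict() from a single process (for example only when rank is 0) avoids the prefix in the first place, so the stripping step is only needed for checkpoints that were saved from the wrapped model.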