Using PyTorch DDP Mode

PyTorch supports distributed training, so when you have multiple GPUs you will naturally want your code to train with multi-GPU.

You will then find two options in front of you:
DP (torch.nn.DataParallel), official tutorial

  • Pros: requires the fewest code changes; you just wrap your model like model = nn.DataParallel(model). If you want to use it, take a look at the official tutorial above, it is very simple (a minimal sketch follows this list)
  • Cons: only supports single-machine multi-GPU (which already covers most people), not multi-machine multi-GPU; performance is worse than DDP; DP runs in a single process, so because of the Python GIL it can only use one CPU core
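
A minimal DP sketch (the toy model and tensor shapes here are placeholders, not from the original post):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model = nn.DataParallel(model).cuda()   # replicates the model onto all visible GPUs

x = torch.randn(32, 128).cuda()         # the batch is split along dim 0 across the GPUs
out = model(x)                          # per-GPU outputs are gathered back on the default GPU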

DDP (torch.nn.parallel.DistributedDataParallel), official tutorial

  • Pros: uses multiple processes, so there is no GIL contention; better performance; the model is broadcast only at initialization rather than on every forward pass, which speeds up training
  • Cons: more code changes, more pitfalls, and you need trial and error to build up experience

Below I describe how to use DDP (only the single-device GPU module case, i.e. data parallelism only, no model parallelism); I will update this post as I run into more pitfalls.

DDP usage template

The DDP usage template below is adapted from here

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from argparse import ArgumentParser

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
from transformers import BertForMaskedLM

SEED = 42
BATCH_SIZE = 8
NUM_EPOCHS = 3

class YourDataset(Dataset):
    def __init__(self):
        self.samples = []  # placeholder: fill with your own data

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


def main():
    parser = ArgumentParser('DDP usage example')
    parser.add_argument('--local_rank', type=int, default=-1, metavar='N', help='Local process rank.')  # you need this argument in your scripts for DDP to work
    args = parser.parse_args()

    # keep track of whether the current process is the `master` process (totally optional, but I find it useful for data loading, logging, etc.)
    args.is_master = args.local_rank == 0

    # set the device
    args.device = torch.device('cuda', args.local_rank)

    # initialize PyTorch distributed using environment variables (you could also do this more explicitly by specifying `rank` and `world_size`, but I find using environment variables makes it so that you can easily use the same script on different machines)
    dist.init_process_group(backend='nccl', init_method='env://')
    torch.cuda.set_device(args.local_rank)

    # set the seed for all GPUs (also make sure to set the seed for random, numpy, etc.)
    torch.cuda.manual_seed_all(SEED)

    # initialize your model (BERT in this example)
    model = BertForMaskedLM.from_pretrained('bert-base-uncased')

    # send your model to GPU
    model = model.to(args.device)

    # initialize distributed data parallel (DDP)
    model = DDP(
        model,
        device_ids=[args.local_rank],
        output_device=args.local_rank
    )

    # initialize your dataset
    dataset = YourDataset()

    # initialize the DistributedSampler
    sampler = DistributedSampler(dataset)

    # initialize the dataloader
    dataloader = DataLoader(
        dataset=dataset,
        sampler=sampler,
        batch_size=BATCH_SIZE
    )

    # start your training!
    for epoch in range(NUM_EPOCHS):
        # put model in train mode
        model.train()

        # let all processes sync up before starting with a new epoch of training
        dist.barrier()

        for step, batch in enumerate(dataloader):
            # send batch to device
            batch = tuple(t.to(args.device) for t in batch)
            
            # forward pass
            outputs = model(*batch)
            
            # compute loss
            loss = outputs[0]

            # etc.


if __name__ == '__main__':
    main()
Launch the script above (saved as ddp_example.py) with torch.distributed.launch:

#!/bin/bash

# this example uses a single node (`NUM_NODES=1`) w/ 4 GPUs (`NUM_GPUS_PER_NODE=4`)
export NUM_NODES=1
export NUM_GPUS_PER_NODE=4
export NODE_RANK=0
export WORLD_SIZE=$(($NUM_NODES * $NUM_GPUS_PER_NODE))

# launch your script w/ `torch.distributed.launch`
python -m torch.distributed.launch \
    --nproc_per_node=$NUM_GPUS_PER_NODE \
    --nnodes=$NUM_NODES \
    --node_rank $NODE_RANK \
    ddp_example.py
    # append any arguments to your script after ddp_example.py, e.g.:
    #    --seed 42
    #    etc.

Summary and things to watch out for

  • A few terms. world size: the total number of processes, usually the number of GPUs being used. rank: the process index, used for inter-process communication. local_rank: the process index on the local machine; your script needs this command-line argument because when you launch from the terminal with torch.distributed.launch, DDP fills in args.local_rank automatically, e.g. with 2 GPUs the local_ranks are 0 and 1. The way I understand it, multi-process training means every process runs train.py once, and local_rank is what tells them apart (a small sketch of reading these values follows).
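
    A small sketch of reading these values inside the script (assuming dist.init_process_group has already been called, as in the template above):

    import torch.distributed as dist

    world_size = dist.get_world_size()  # total number of processes (usually the number of GPUs used)
    rank = dist.get_rank()              # global process index, 0 .. world_size - 1
    # local_rank is the per-machine index; torch.distributed.launch passes it in as --local_rank,
    # which is why the template parses args.local_rank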

  • You need to move the model to the device (i.e. onto the GPU) before wrapping it with DistributedDataParallel, otherwise you get an error: AssertionError: DistributedDataParallel device_ids and output_device arguments only work with single-device GPU modules, but got device_ids [1], output_device 1, and module parameters {device(type='cpu')}. The required order is sketched below.
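
    The required order, roughly as in the template above (assuming args.local_rank and the imports from the template):

    device = torch.device('cuda', args.local_rank)
    model = model.to(device)   # move the parameters to this process's GPU first...
    model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)  # ...then wrap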

  • Before using DDP, you need to initialize the process group with dist.init_process_group(backend='nccl', init_method='env://'). The nccl backend is generally recommended, and if you leave out init_method it defaults to 'env://'. This call has to come first and must not be run twice, otherwise you get: RuntimeError: trying to initialize the default process group twice!. After initializing the process group, these two lines usually follow:

    torch.cuda.set_device(args.local_rank)
    device = torch.device('cuda', args.local_rank)
    
  • One difference from single-GPU mode is that you need to wrap your dataset with DistributedSampler to get a sampler and pass it to the dataloader, and in that case you can no longer pass the shuffle argument to the DataLoader (a minimal sketch follows).
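
    A minimal sketch (note: sampler.set_epoch is not mentioned in the original post, but it is the standard way to get a different shuffle each epoch with DistributedSampler):

    sampler = DistributedSampler(dataset)   # shuffles by default and gives each process its own shard
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=BATCH_SIZE)  # do NOT also pass shuffle=True

    for epoch in range(NUM_EPOCHS):
        sampler.set_epoch(epoch)  # so that the shuffling order differs from epoch to epoch
        for batch in dataloader:
            ...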

  • The meaning of batch size differs. For DP, the batch_size passed to the dataloader is the total batch size: with batch_size=30 and two GPUs, each GPU gets 15 samples. For DDP, batch_size is the per-GPU batch size: with batch_size=30 and two GPUs, each GPU gets 30 samples, so one step consumes 60 samples in total (see the small sketch below).
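
    In code form, a small sketch of the arithmetic with 2 GPUs:

    per_gpu_batch_size = 30   # the batch_size you pass to the DataLoader under DDP
    world_size = 2            # number of GPUs / processes
    global_batch_size = per_gpu_batch_size * world_size   # 60 samples per step under DDP
    # under DP, the same batch_size=30 would instead be the global batch size (15 samples per GPU)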

  • Be careful when loading/saving the model. Under multi-GPU you should save with net.module.state_dict(), otherwise at load time you will get Missing key(s) in state_dict: "conv1.weight" ... Unexpected key(s) in state_dict: "module.conv1.weight", because a model saved directly with net.state_dict() has keys prefixed with module., unless you loop over the parameters yourself, strip the prefix, and load that new state_dict (a sketch follows).
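
    A sketch matching this note (checkpoint.pt and net are just example names; net is assumed to be a plain, unwrapped model):

    # save only on rank 0, stripping DDP's `module.` wrapper so a plain model can load it later
    if args.local_rank == 0:
        torch.save(model.module.state_dict(), 'checkpoint.pt')

    # if a checkpoint was instead saved via model.state_dict() (keys prefixed with `module.`),
    # strip the prefix before loading it into an unwrapped model:
    state_dict = torch.load('checkpoint.pt')
    state_dict = {(k[len('module.'):] if k.startswith('module.') else k): v for k, v in state_dict.items()}
    net.load_state_dict(state_dict)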

  • Printing, logging, and saving the model only need to be done in the process with local_rank 0, since the model stays synchronized across processes anyway; otherwise the terminal fills up with many out-of-order messages, which is messy. This is why so much code contains an args.local_rank == 0 check (a typical guard is shown below).
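
    A typical guard, as a minimal sketch inside the training loop of the template (checkpoint.pt is a placeholder name; args.is_master is the flag set in the template above):

    if args.is_master:   # equivalently: if args.local_rank == 0
        print(f'epoch {epoch}: loss {loss.item():.4f}')   # log only once, not once per process
        torch.save(model.module.state_dict(), 'checkpoint.pt')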

Useful references

  1. [原创][深度][PyTorch] DDP系列第一篇:入门教程
  2. Pytorch 分布式训练
  3. Launching and configuring distributed data parallel applications: the official document on the DDP launching script
  4. Shard Optimizer States With ZeroRedundancyOptimizer: a recipe showing how ZeroRedundancyOptimizer reduces the optimizer memory footprint in distributed data-parallel training