Using PyTorch DDP Mode

PyTorch supports distributed training, so when you have multiple GPUs you will naturally want your code to train with multi-GPU.

You will then find two options in front of you:
DP (torch.nn.DataParallel), official tutorial

  • Pros: requires the fewest code changes; you just wrap your model like model = nn.DataParallel(model). If you want to use it, take a look at the official tutorial above, it is very simple (a minimal sketch follows this list)
  • Cons: only supports single-machine multi-GPU (which already covers most people), not multi-machine multi-GPU; performance is worse than DDP; DP runs in a single process, so because of the Python GIL it can only use one CPU core
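
A minimal DP sketch (the toy model and tensor shapes here are placeholders, not from the original post):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model = nn.DataParallel(model).cuda()   # replicates the model onto all visible GPUs

x = torch.randn(32, 128).cuda()         # the batch is split along dim 0 across the GPUs
out = model(x)                          # per-GPU outputs are gathered back on the default GPU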

DDP (torch.nn.parallel.DistributedDataParallel), official tutorial

  • Pros: uses multiple processes, so there is no GIL contention; better performance; the model is broadcast only at initialization rather than on every forward pass, which speeds up training
  • Cons: more code changes, more pitfalls, and you need trial and error to build up experience

Below I describe how to use DDP (only the single-device GPU module case, i.e. data parallelism only, no model parallelism); I will update this post as I run into more pitfalls.

DDP usage template

The DDP usage template below is adapted from here

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from argparse import ArgumentParser

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
from transformers import BertForMaskedLM

SEED = 42
BATCH_SIZE = 8
NUM_EPOCHS = 3

class YourDataset(Dataset):
    def __init__(self):
        self.samples = []  # placeholder: fill with your own data

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


def main():
    parser = ArgumentParser('DDP usage example')
    parser.add_argument('--local_rank', type=int, default=-1, metavar='N', help='Local process rank.')  # you need this argument in your scripts for DDP to work
    args = parser.parse_args()

    # keep track of whether the current process is the `master` process (totally optional, but I find it useful for data loading, logging, etc.)
    args.is_master = args.local_rank == 0

    # set the device
    args.device = torch.device('cuda', args.local_rank)

    # initialize PyTorch distributed using environment variables (you could also do this more explicitly by specifying `rank` and `world_size`, but I find using environment variables makes it so that you can easily use the same script on different machines)
    dist.init_process_group(backend='nccl', init_method='env://')
    torch.cuda.set_device(args.local_rank)

    # set the seed for all GPUs (also make sure to set the seed for random, numpy, etc.)
    torch.cuda.manual_seed_all(SEED)

    # initialize your model (BERT in this example)
    model = BertForMaskedLM.from_pretrained('bert-base-uncased')

    # send your model to GPU
    model = model.to(args.device)

    # initialize distributed data parallel (DDP)
    model = DDP(
        model,
        device_ids=[args.local_rank],
        output_device=args.local_rank
    )

    # initialize your dataset
    dataset = YourDataset()

    # initialize the DistributedSampler
    sampler = DistributedSampler(dataset)

    # initialize the dataloader
    dataloader = DataLoader(
        dataset=dataset,
        sampler=sampler,
        batch_size=BATCH_SIZE
    )

    # start your training!
    for epoch in range(NUM_EPOCHS):
        # put model in train mode
        model.train()

        # let all processes sync up before starting with a new epoch of training
        dist.barrier()

        for step, batch in enumerate(dataloader):
            # send batch to device
            batch = tuple(t.to(args.device) for t in batch)
            
            # forward pass
            outputs = model(*batch)
            
            # compute loss
            loss = outputs[0]

            # etc.


if __name__ == '__main__':
    main()
Launch the script above (saved as ddp_example.py) with torch.distributed.launch:

#!/bin/bash

# this example uses a single node (`NUM_NODES=1`) w/ 4 GPUs (`NUM_GPUS_PER_NODE=4`)
export NUM_NODES=1
export NUM_GPUS_PER_NODE=4
export NODE_RANK=0
export WORLD_SIZE=$(($NUM_NODES * $NUM_GPUS_PER_NODE))

# launch your script w/ `torch.distributed.launch`
python -m torch.distributed.launch \
    --nproc_per_node=$NUM_GPUS_PER_NODE \
    --nnodes=$NUM_NODES \
    --node_rank $NODE_RANK \
    ddp_example.py
    # append any arguments to your script after ddp_example.py, e.g.:
    #    --seed 42
    #    etc.

Summary and things to watch out for

  • A few terms. world size: the total number of processes, usually the number of GPUs being used. rank: the process index, used for inter-process communication. local_rank: the process index on the local machine; your script needs this command-line argument because when you launch from the terminal with torch.distributed.launch, DDP fills in args.local_rank automatically, e.g. with 2 GPUs the local_ranks are 0 and 1. The way I understand it, multi-process training means every process runs train.py once, and local_rank is what tells them apart (a small sketch of reading these values follows).
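
    A small sketch of reading these values inside the script (assuming dist.init_process_group has already been called, as in the template above):

    import torch.distributed as dist

    world_size = dist.get_world_size()  # total number of processes (usually the number of GPUs used)
    rank = dist.get_rank()              # global process index, 0 .. world_size - 1
    # local_rank is the per-machine index; torch.distributed.launch passes it in as --local_rank,
    # which is why the template parses args.local_rank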

  • You need to move the model to the device (i.e. onto the GPU) before wrapping it with DistributedDataParallel, otherwise you get an error: AssertionError: DistributedDataParallel device_ids and output_device arguments only work with single-device GPU modules, but got device_ids [1], output_device 1, and module parameters {device(type='cpu')}. The required order is sketched below.
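
    The required order, roughly as in the template above (assuming args.local_rank and the imports from the template):

    device = torch.device('cuda', args.local_rank)
    model = model.to(device)   # move the parameters to this process's GPU first...
    model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)  # ...then wrap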

  • Before using DDP, you need to initialize the process group with dist.init_process_group(backend='nccl', init_method='env://'). The nccl backend is generally recommended, and if you leave out init_method it defaults to 'env://'. This call has to come first and must not be run twice, otherwise you get: RuntimeError: trying to initialize the default process group twice!. After initializing the process group, these two lines usually follow:

    torch.cuda.set_device(args.local_rank)
    device = torch.device('cuda', args.local_rank)
    
  • One difference from single-GPU mode is that you need to wrap your dataset with DistributedSampler to get a sampler and pass it to the dataloader, and in that case you can no longer pass the shuffle argument to the DataLoader (a minimal sketch follows).
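
    A minimal sketch (note: sampler.set_epoch is not mentioned in the original post, but it is the standard way to get a different shuffle each epoch with DistributedSampler):

    sampler = DistributedSampler(dataset)   # shuffles by default and gives each process its own shard
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=BATCH_SIZE)  # do NOT also pass shuffle=True

    for epoch in range(NUM_EPOCHS):
        sampler.set_epoch(epoch)  # so that the shuffling order differs from epoch to epoch
        for batch in dataloader:
            ...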

  • The meaning of batch size differs. For DP, the batch_size passed to the dataloader is the total batch size: with batch_size=30 and two GPUs, each GPU gets 15 samples. For DDP, batch_size is the per-GPU batch size: with batch_size=30 and two GPUs, each GPU gets 30 samples, so one step consumes 60 samples in total (see the small sketch below).
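
    In code form, a small sketch of the arithmetic with 2 GPUs:

    per_gpu_batch_size = 30   # the batch_size you pass to the DataLoader under DDP
    world_size = 2            # number of GPUs / processes
    global_batch_size = per_gpu_batch_size * world_size   # 60 samples per step under DDP
    # under DP, the same batch_size=30 would instead be the global batch size (15 samples per GPU)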

  • Be careful when loading/saving the model. Under multi-GPU you should save with net.module.state_dict(), otherwise at load time you will get Missing key(s) in state_dict: "conv1.weight" ... Unexpected key(s) in state_dict: "module.conv1.weight", because a model saved directly with net.state_dict() has keys prefixed with module., unless you loop over the parameters yourself, strip the prefix, and load that new state_dict (a sketch follows).
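
    A sketch matching this note (checkpoint.pt and net are just example names; net is assumed to be a plain, unwrapped model):

    # save only on rank 0, stripping DDP's `module.` wrapper so a plain model can load it later
    if args.local_rank == 0:
        torch.save(model.module.state_dict(), 'checkpoint.pt')

    # if a checkpoint was instead saved via model.state_dict() (keys prefixed with `module.`),
    # strip the prefix before loading it into an unwrapped model:
    state_dict = torch.load('checkpoint.pt')
    state_dict = {(k[len('module.'):] if k.startswith('module.') else k): v for k, v in state_dict.items()}
    net.load_state_dict(state_dict)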

  • Printing, logging, and saving the model only need to be done in the process with local_rank 0, since the model stays synchronized across processes anyway; otherwise the terminal fills up with many out-of-order messages, which is messy. This is why so much code contains an args.local_rank == 0 check (a typical guard is shown below).
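
    A typical guard, as a minimal sketch inside the training loop of the template (checkpoint.pt is a placeholder name; args.is_master is the flag set in the template above):

    if args.is_master:   # equivalently: if args.local_rank == 0
        print(f'epoch {epoch}: loss {loss.item():.4f}')   # log only once, not once per process
        torch.save(model.module.state_dict(), 'checkpoint.pt')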

Useful references

  1. [原创][深度][PyTorch] DDP系列第一篇:入门教程
  2. Pytorch 分布式训练
  3. Launching and configuring distributed data parallel applications: the official document on the DDP launching script
  4. Shard Optimizer States With ZeroRedundancyOptimizer: a recipe showing how ZeroRedundancyOptimizer reduces the optimizer memory footprint in distributed data-parallel training