PyTorch Multi-GPU Training in Practice (5) - DDP: Code Changes for torch.distributed.launch

Preface

Tutorials (3) and (4) covered the underlying logic of DistributedDataParallel, so by now you should have a reasonable understanding of distributed data parallelism. PyTorch provides a convenient interface, torch.nn.parallel.DistributedDataParallel, that makes it fairly easy to convert code to distributed data-parallel mode. In this tutorial, I will modify the code step by step into a DDP version launched with torch.distributed.launch.

Prerequisites

To better follow this tutorial, we need to understand what torch.distributed.launch does. Let's first look at the arguments torch.distributed.launch accepts, obtained with python -m torch.distributed.launch --help. The source code can be found in torch.distributed.launch.

> python -m torch.distributed.launch --help
usage: launch.py [-h] [--nnodes NNODES] [--node_rank NODE_RANK]
                 [--nproc_per_node NPROC_PER_NODE] [--master_addr MASTER_ADDR]
                 [--master_port MASTER_PORT] [--use_env] [-m] [--no_python]
                 training_script ...

Here is a detailed description of the arguments of torch.distributed.launch:

  • nnodes: the number of nodes; one node corresponds to one machine;
  • node_rank: the index of the node, starting from 0;
  • nproc_per_node: the number of processes on a node. Since one process usually drives one GPU, this is often described as the number of GPUs on a node;
  • master_addr: the IP address of the master node, i.e., the address of the host where rank=0 runs. Setting it lets the other nodes know where node 0 is, so they can send the parameters they train there for processing;
  • master_port: the port of the master node, used for communication;
  • use_env: with --use_env, PyTorch puts the local_rank of the current process into the LOCAL_RANK environment variable instead of into args.local_rank. Note that the official documentation now recommends replacing torch.distributed.launch with torchrun. In torchrun, the --use_env flag has been removed and its behavior is the default, so users are required to read the process's rank on the local machine from the LOCAL_RANK environment variable (see the sketch below).
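
To make the difference concrete, here is a minimal sketch (my illustration, not from the original post) of the two ways a training script can obtain its local rank:

import argparse
import os

# Without --use_env: torch.distributed.launch passes --local_rank=<n> on the command line.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()
local_rank = args.local_rank

# With --use_env (and always with torchrun): read it from the environment instead.
local_rank = int(os.environ.get("LOCAL_RANK", 0))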

After running the code with torch.distributed.launch, each process gets five variables set in its environment (MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, and WORLD_SIZE). RANK, LOCAL_RANK, and WORLD_SIZE are as follows:

  • RANK: read with os.environ["RANK"]; the global index of the process, where generally one process corresponds to one GPU. It starts at 0, and its maximum is the total number of GPUs minus 1;
  • LOCAL_RANK: read with os.environ["LOCAL_RANK"]; the index of the process within its own host. It starts at 0, and its maximum is the number of GPUs on that host minus 1;
  • WORLD_SIZE: read with os.environ["WORLD_SIZE"]; the total number of processes launched (summed over all machines).

To make this concrete, consider an example: suppose we use 2 machines with 4 GPUs each. Then RANK takes values in [0, 7], LOCAL_RANK on each machine takes values in [0, 3], and WORLD_SIZE is 8.
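
With this launcher, the global rank is determined by the node rank and the local rank. A small sketch (my addition) that reproduces the numbers above:

nnodes, nproc_per_node = 2, 4
world_size = nnodes * nproc_per_node  # 8
for node_rank in range(nnodes):
    for local_rank in range(nproc_per_node):
        # RANK = node_rank * nproc_per_node + LOCAL_RANK
        rank = node_rank * nproc_per_node + local_rank
        print(f"node {node_rank}, LOCAL_RANK {local_rank} -> RANK {rank}, WORLD_SIZE {world_size}")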

Next, I run a quick test on our servers (2 machines, 4 GPUs each) to print these five variables. The code is as follows:

import os
import time
import torch.distributed as dist

print("before running dist.init_process_group()")
MASTER_ADDR = os.environ["MASTER_ADDR"]
MASTER_PORT = os.environ["MASTER_PORT"]
LOCAL_RANK = os.environ["LOCAL_RANK"]
RANK = os.environ["RANK"]
WORLD_SIZE = os.environ["WORLD_SIZE"]

print("MASTER_ADDR: {}\tMASTER_PORT: {}".format(MASTER_ADDR, MASTER_PORT))
print("LOCAL_RANK: {}\tRANK: {}\tWORLD_SIZE: {}".format(LOCAL_RANK, RANK, WORLD_SIZE))

dist.init_process_group('nccl')
print("after running dist.init_process_group()")
time.sleep(60)  # Sleep for a while to avoid exceptions that occur when some processes end too quickly.
dist.destroy_process_group()

Single machine, multiple GPUs

We first test the single-machine, multi-GPU case. The syntax is:

> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
           YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
           arguments of your training script)

Next, we run this code. We can see that torch.distributed.launch has automatically added MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, and WORLD_SIZE to the environment variables.

> CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 0	RANK: 0	WORLD_SIZE: 2
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 1	RANK: 1	WORLD_SIZE: 2
after running dist.init_process_group()
after running dist.init_process_group()

> CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 0	RANK: 0	WORLD_SIZE: 4
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 2	RANK: 2	WORLD_SIZE: 4
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 3	RANK: 3	WORLD_SIZE: 4
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 1	RANK: 1	WORLD_SIZE: 4
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()

Multiple machines, multiple GPUs

Take 2 machines as an example, with the master node at IP address 192.168.1.1 and port 1234.

The syntax on machine 1 is:

> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
         --nnodes=2 --node_rank=0 --master_addr="192.168.1.1"
         --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
         and all other arguments of your training script)

The syntax on machine 2 is:

> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
         --nnodes=2 --node_rank=1 --master_addr="192.168.1.1"
         --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
         and all other arguments of your training script)

The results of running on our servers are as follows:

Machine 1 (master, IP: 192.168.1.105):

> python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr='192.168.1.105' --master_port='12345' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 0	RANK: 0	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 1	RANK: 1	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 3	RANK: 3	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 2	RANK: 2	WORLD_SIZE: 8
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()

Machine 2 (IP: 192.168.1.106):

> python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 1 --master_addr='192.168.1.105' --master_port='12345' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 0	RANK: 4	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 1	RANK: 5	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 3	RANK: 7	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 2	RANK: 6	WORLD_SIZE: 8
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()

More details are available directly in my code:

https://github.com/HongxinXiang/pytorch-multi-GPU-training-tutorial/tree/master/test/torch_distributed_launch

Modifying the code to a DDP version

Now let's start converting the basic training code into DDP training code. There are 4 main places to modify.

Change 1: Initialize the distributed process group and the distributed device

At the very beginning of the code, initialize the environment required by DDP.

import os

import torch
import torch.distributed as dist


def setup_DDP(backend="nccl", verbose=False):
    """
    We do not set MASTER_ADDR and MASTER_PORT here, e.g.:
        # os.environ['MASTER_ADDR'] = 'localhost'
        # os.environ['MASTER_PORT'] = '12355'
    because they can be supplied automatically at launch time, e.g.:
        python -m torch.distributed.launch --master_addr="192.168.1.201" --master_port=23456 ...

    Likewise, rank and world_size do not need to be passed to dist.init_process_group()
    explicitly; they are read from the environment variables set by the launcher.

    :param backend: communication backend; on Windows or macOS use "gloo" instead of "nccl"
    :param verbose: if True, print rank information after initialization
    :return: (rank, local_rank, world_size, device)
    """
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # If the OS is Windows or macOS, use gloo instead of nccl
    dist.init_process_group(backend=backend)
    # set the distributed device for this process
    device = torch.device("cuda:{}".format(local_rank))
    if verbose:
        print(f"local rank: {local_rank}, global rank: {rank}, world size: {world_size}")
    return rank, local_rank, world_size, device

rank, local_rank, world_size, device = setup_DDP(verbose=True)
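
A common addition here (my suggestion, not part of the original function) is to also bind the current process to its GPU right after setup, so that CUDA operations without an explicit device, as well as NCCL, default to the correct card:

# Optional: pin this process to its GPU (assumes one GPU per process and CUDA available)
torch.cuda.set_device(local_rank)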

Change 2: Initialize the DataLoader with a DistributedSampler

  1. Modify batch_size. I divide the original batch_size=64 by world_size, so each GPU processes only its share of the data. Since the batch_size passed here is the per-GPU batch size, it should be increased appropriately as the number of GPUs grows (see the sanity check after the code below).
  2. Initialize a DistributedSampler.
  3. Initialize the DataLoader, passing the sampler argument.
batch_size = 64 // world_size  # [*] // world_size
train_sampler = DistributedSampler(training_data, shuffle=True)  # [*]
test_sampler = DistributedSampler(test_data, shuffle=False)  # [*]
train_dataloader = DataLoader(training_data, batch_size=batch_size, sampler=train_sampler)  # [*] sampler=...
test_dataloader = DataLoader(test_data, batch_size=batch_size, sampler=test_sampler)  # [*] sampler=...
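
As a quick sanity check (my addition, relying on the training_data, world_size, and batch_size defined above), DistributedSampler gives each rank roughly 1/world_size of the dataset, and the effective global batch size stays at batch_size * world_size:

import math

samples_per_rank = math.ceil(len(training_data) / world_size)  # samples this process sees per epoch
global_batch_size = batch_size * world_size                    # back to 64 when world_size divides 64
print(f"{samples_per_rank} samples per rank, global batch size {global_batch_size}")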

Change 3: Wrap the model with DistributedDataParallel

Wrap the defined model with torch.nn.parallel.DistributedDataParallel, and explicitly specify the device the model runs on (device_ids) and the device where the output is placed (output_device).

from torch.nn.parallel import DistributedDataParallel as DDP

# initialize model
model = NeuralNetwork().to(device)  # copy model from cpu to gpu
# [*] using DistributedDataParallel
model = DDP(model, device_ids=[local_rank], output_device=local_rank)  # [*] DDP(...)
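
The training loop itself does not need to change: DDP hooks into backward() and all-reduces the gradients across processes, so every rank applies the same averaged gradients. A minimal sketch of one training step (X and y are a batch from the dataloader; loss_fn and optimizer are the ones from the base tutorial, assumed here):

pred = model(X.to(device))            # forward pass on this process's GPU
loss = loss_fn(pred, y.to(device))
optimizer.zero_grad()
loss.backward()                       # DDP all-reduces gradients across all ranks here
optimizer.step()                      # each rank applies the same synchronized gradients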

Change 4: Save the model

Saving the model works the same way as in the single-machine, single-GPU case. To avoid saving the model repeatedly, we save it only on the master host.

# [*] save model on rank 0
if dist.get_rank() == 0:
    model_state_dict = model.state_dict()
    torch.save(model_state_dict, "model.pth")
    print("Saved PyTorch Model State to model.pth")
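
One detail to keep in mind (my note, not from the original post): because model is wrapped in DDP, model.state_dict() stores its keys with a "module." prefix. If you plan to load the checkpoint into a plain, unwrapped NeuralNetwork later, saving the inner module's state dict avoids stripping the prefix by hand:

# [*] alternative: save the unwrapped weights so they load directly into NeuralNetwork()
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "model.pth")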

Besides these, there are two optional changes:

  1. Set the sampler's epoch, so the sampler knows which epoch training is currently in and shuffles the data differently in each epoch (see the sketch after the snippet below).
# [*] set sampler
train_dataloader.sampler.set_epoch(t)
test_dataloader.sampler.set_epoch(t)
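
In context, these calls go at the top of every epoch. A minimal sketch of the loop (epochs, train(), test(), loss_fn, and optimizer are assumed from the base tutorial):

epochs = 5
for t in range(epochs):
    # [*] tell both samplers which epoch this is, so the shuffling differs between epochs
    train_dataloader.sampler.set_epoch(t)
    test_dataloader.sampler.set_epoch(t)
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)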

  2. Print training and test logs only on the rank=0 host. Some other print() calls can also be changed to print_only_rank0():

def print_only_rank0(log):
    if dist.get_rank() == 0:
        print(log)

def train(...):
    ...
    # [*] only print log on rank 0
    if dist.get_rank() == 0 and batch % 100 == 0:
        loss, current = loss.item(), batch * len(X)
        print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
    ...

def test(...):
    ...
    # [*] only print log on rank 0
    print_only_rank0(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
    ...

Finally, the complete code is available at the link below:

https://github.com/HongxinXiang/pytorch-multi-GPU-training-tutorial/blob/master/single-machine-and-multi-GPU-DistributedDataParallel-launch.py

Running the code

We run on 2 servers, with IPs 192.168.1.105 (master) and 192.168.1.106. Each machine has 4 GPUs.

Before running multi-machine, multi-GPU training, we need to pay attention to two points:

  1. The servers need to communicate with each other, so first confirm that the machines can ping one another;
  2. NCCL also needs to communicate between GPUs. The NCCL error you are most likely to hit at runtime is: RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8. This can happen because NCCL is not installed properly, because NCCL failed to establish communication, because of a firewall, and so on. We can switch the run into DEBUG mode via environment variables, which provides more information to locate the error, as shown in the following commands (see torch.distributed for more details).
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

Next, let's run the program.

Machine 1 (master, IP: 192.168.1.105):

> CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr='192.168.1.105' --master_port='12345' single-machine-and-multi-GPU-DistributedDataParallel-launch.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using device: cuda:3
local rank: 3, global rank: 3, world size: 8
Using device: cuda:2
Using device: cuda:1
local rank: 1, global rank: 1, world size: 8
local rank: 2, global rank: 2, world size: 8
Using device: cuda:0
local rank: 0, global rank: 0, world size: 8
tesla-105:1475:1475 [0] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1475:1475 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1475:1475 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1475:1475 [0] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1475:1475 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1
tesla-105:1477:1477 [1] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1477:1477 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1477:1477 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1477:1477 [1] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1477:1477 [1] NCCL INFO Using network Socket
tesla-105:1481:1481 [3] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1481:1481 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1481:1481 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1481:1481 [3] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1481:1481 [3] NCCL INFO Using network Socket
tesla-105:1480:1480 [2] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1480:1480 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1480:1480 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1480:1480 [2] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1480:1480 [2] NCCL INFO Using network Socket
tesla-105:1481:2165 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1481:2165 [3] NCCL INFO Trees [0] -1/-1/-1->3->2|2->3->-1/-1/-1 [1] -1/-1/-1->3->2|2->3->-1/-1/-1
tesla-105:1481:2165 [3] NCCL INFO Setting affinity for GPU 3 to ff,c00ffc00
tesla-105:1475:2146 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
tesla-105:1477:2150 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1477:2150 [1] NCCL INFO Trees [0] 2/4/-1->1->0|0->1->2/4/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
tesla-105:1477:2150 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
tesla-105:1475:2146 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
tesla-105:1475:2146 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1480:2169 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1480:2169 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
tesla-105:1480:2169 [2] NCCL INFO Setting affinity for GPU 2 to ff,c00ffc00
tesla-105:1475:2146 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->5|5->0->1/-1/-1
tesla-105:1475:2146 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
tesla-105:1481:2165 [3] NCCL INFO Channel 00 : 3[b1000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 00 : 2[af000] -> 3[b1000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 2[af000] via direct shared memory
tesla-105:1475:2146 [0] NCCL INFO Channel 00 : 7[b1000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 00 : 2[af000] -> 1[1a000] via direct shared memory
tesla-105:1475:2146 [0] NCCL INFO Channel 00 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 4[18000] -> 1[1a000] [receive] via NET/Socket/0
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 7[b1000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-105:1481:2165 [3] NCCL INFO Channel 00 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1481:2165 [3] NCCL INFO Channel 01 : 3[b1000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 01 : 2[af000] -> 3[b1000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 01 : 1[1a000] -> 2[af000] via direct shared memory
tesla-105:1481:2165 [3] NCCL INFO Channel 01 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-105:1480:2169 [2] NCCL INFO Channel 01 : 2[af000] -> 1[1a000] via direct shared memory
tesla-105:1481:2165 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1481:2165 [3] NCCL INFO comm 0x7fd634001060 rank 3 nranks 8 cudaDev 3 busId b1000 - Init COMPLETE
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 0[18000] -> 5[1a000] [send] via NET/Socket/0
tesla-105:1477:2150 [1] NCCL INFO Channel 01 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1477:2150 [1] NCCL INFO comm 0x7f30b4001060 rank 1 nranks 8 cudaDev 1 busId 1a000 - Init COMPLETE
tesla-105:1480:2169 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1480:2169 [2] NCCL INFO comm 0x7f37d4001060 rank 2 nranks 8 cudaDev 2 busId af000 - Init COMPLETE
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 5[1a000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1475:2146 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1475:2146 [0] NCCL INFO comm 0x7f9e54001060 rank 0 nranks 8 cudaDev 0 busId 18000 - Init COMPLETE
tesla-105:1475:1475 [0] NCCL INFO Launch mode Parallel
DistributedDataParallel(
  (module): NeuralNetwork(
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (linear_relu_stack): Sequential(
      (0): Linear(in_features=784, out_features=512, bias=True)
      (1): ReLU()
      (2): Linear(in_features=512, out_features=512, bias=True)
      (3): ReLU()
      (4): Linear(in_features=512, out_features=10, bias=True)
    )
  )
)
Epoch 1
-------------------------------
loss: 2.294374  [    0/60000]
loss: 2.301075  [  800/60000]
loss: 2.315739  [ 1600/60000]
loss: 2.299692  [ 2400/60000]
loss: 2.258646  [ 3200/60000]
loss: 2.252302  [ 4000/60000]
loss: 2.218223  [ 4800/60000]
loss: 2.126724  [ 5600/60000]
loss: 2.174220  [ 6400/60000]
loss: 2.177455  [ 7200/60000]
Test Error: 
 Accuracy: 4.1%, Avg loss: 2.166388 

Epoch 2
-------------------------------
loss: 2.136480  [    0/60000]
loss: 2.127040  [  800/60000]
loss: 2.118551  [ 1600/60000]
loss: 2.051364  [ 2400/60000]
loss: 2.076279  [ 3200/60000]
loss: 2.002108  [ 4000/60000]
loss: 2.075573  [ 4800/60000]
loss: 1.959522  [ 5600/60000]
loss: 1.861534  [ 6400/60000]
loss: 1.872814  [ 7200/60000]
Test Error: 
 Accuracy: 7.2%, Avg loss: 1.908959 

Epoch 3
-------------------------------
loss: 2.081742  [    0/60000]
loss: 1.841850  [  800/60000]
loss: 1.939971  [ 1600/60000]
loss: 1.684577  [ 2400/60000]
loss: 1.648371  [ 3200/60000]
loss: 1.774270  [ 4000/60000]
loss: 1.552769  [ 4800/60000]
loss: 1.508346  [ 5600/60000]
loss: 1.516589  [ 6400/60000]
loss: 1.481997  [ 7200/60000]
Test Error: 
 Accuracy: 7.8%, Avg loss: 1.533547 

Epoch 4
-------------------------------
loss: 1.625404  [    0/60000]
loss: 1.543570  [  800/60000]
loss: 1.428792  [ 1600/60000]
loss: 1.446484  [ 2400/60000]
loss: 1.841029  [ 3200/60000]
loss: 1.320562  [ 4000/60000]
loss: 1.511142  [ 4800/60000]
loss: 1.444456  [ 5600/60000]
loss: 1.570060  [ 6400/60000]
loss: 1.482602  [ 7200/60000]
Test Error: 
 Accuracy: 8.0%, Avg loss: 1.256674 

Epoch 5
-------------------------------
loss: 1.064455  [    0/60000]
loss: 1.233810  [  800/60000]
loss: 1.168940  [ 1600/60000]
loss: 1.227281  [ 2400/60000]
loss: 1.437644  [ 3200/60000]
loss: 1.195065  [ 4000/60000]
loss: 1.305991  [ 4800/60000]
loss: 1.258441  [ 5600/60000]
loss: 0.970569  [ 6400/60000]
loss: 1.698888  [ 7200/60000]
Test Error: 
 Accuracy: 8.2%, Avg loss: 1.083617 

Done!
Saved PyTorch Model State to model.pth

Machine 2 (IP: 192.168.1.106):

> CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 1 --master_addr='192.168.1.105' --master_port='12345' single-machine-and-multi-GPU-DistributedDataParallel-launch.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using device: cuda:0
Using device: cuda:1

local rank: 1, global rank: 5, world size: 8
local rank: 0, global rank: 4, world size: 8
Using device: cuda:2
local rank: 2, global rank: 6, world size: 8
Using device: cuda:3
local rank: 3, global rank: 7, world size: 8
tesla-106:1942:1942 [1] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1942:1942 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1942:1942 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1942:1942 [1] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1942:1942 [1] NCCL INFO Using network Socket
tesla-106:1988:1988 [3] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1988:1988 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1988:1988 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1988:1988 [3] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1988:1988 [3] NCCL INFO Using network Socket
tesla-106:1943:1943 [2] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1943:1943 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1943:1943 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1943:1943 [2] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1943:1943 [2] NCCL INFO Using network Socket
tesla-106:1940:1940 [0] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1940:1940 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1940:1940 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1940:1940 [0] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1940:1940 [0] NCCL INFO Using network Socket
tesla-106:1988:2787 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1988:2787 [3] NCCL INFO Trees [0] -1/-1/-1->7->6|6->7->-1/-1/-1 [1] -1/-1/-1->7->6|6->7->-1/-1/-1
tesla-106:1988:2787 [3] NCCL INFO Setting affinity for GPU 3 to ff,c00ffc00
tesla-106:1943:2821 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1943:2821 [2] NCCL INFO Trees [0] 7/-1/-1->6->5|5->6->7/-1/-1 [1] 7/-1/-1->6->5|5->6->7/-1/-1
tesla-106:1943:2821 [2] NCCL INFO Setting affinity for GPU 2 to ff,c00ffc00
tesla-106:1942:2786 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1942:2786 [1] NCCL INFO Trees [0] 6/-1/-1->5->4|4->5->6/-1/-1 [1] 6/0/-1->5->4|4->5->6/0/-1
tesla-106:1942:2786 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
tesla-106:1940:2831 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1940:2831 [0] NCCL INFO Trees [0] 5/-1/-1->4->1|1->4->5/-1/-1 [1] 5/-1/-1->4->-1|-1->4->5/-1/-1
tesla-106:1942:2786 [1] NCCL INFO Channel 00 : 5[1a000] -> 6[af000] via direct shared memory
tesla-106:1940:2831 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
tesla-106:1943:2821 [2] NCCL INFO Channel 00 : 6[af000] -> 7[b1000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 00 : 7[b1000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 3[b1000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 4[18000] -> 5[1a000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 00 : 7[b1000] -> 6[af000] via P2P/IPC
tesla-106:1942:2786 [1] NCCL INFO Channel 00 : 5[1a000] -> 4[18000] via P2P/IPC
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 4[18000] -> 1[1a000] [send] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 1[1a000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1988:2787 [3] NCCL INFO Channel 01 : 7[b1000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1943:2821 [2] NCCL INFO Channel 00 : 6[af000] -> 5[1a000] via direct shared memory
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 6[af000] via direct shared memory
tesla-106:1943:2821 [2] NCCL INFO Channel 01 : 6[af000] -> 7[b1000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 01 : 7[b1000] -> 6[af000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1988:2787 [3] NCCL INFO comm 0x7fbb14001060 rank 7 nranks 8 cudaDev 3 busId b1000 - Init COMPLETE
tesla-106:1943:2821 [2] NCCL INFO Channel 01 : 6[af000] -> 5[1a000] via direct shared memory
tesla-106:1940:2831 [0] NCCL INFO Channel 01 : 3[b1000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 01 : 4[18000] -> 5[1a000] via P2P/IPC
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 0[18000] -> 5[1a000] [receive] via NET/Socket/0
tesla-106:1943:2821 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1943:2821 [2] NCCL INFO comm 0x7f6fec001060 rank 6 nranks 8 cudaDev 2 busId af000 - Init COMPLETE
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 4[18000] via P2P/IPC
tesla-106:1940:2831 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1940:2831 [0] NCCL INFO comm 0x7f5550001060 rank 4 nranks 8 cudaDev 0 busId 18000 - Init COMPLETE
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1942:2786 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1942:2786 [1] NCCL INFO comm 0x7f75d4001060 rank 5 nranks 8 cudaDev 1 busId 1a000 - Init COMPLETE

Troubleshooting

While running the code, we ran into an error:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
tesla-105:29334:30213 [1] NCCL INFO Channel 00 : 6[af000] -> 1[1a000] [receive] via NET/Socket/1
tesla-105:29334:30213 [1] NCCL INFO Channel 00 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-105:29331:30215 [0] NCCL INFO Channel 01 : 4[18000] -> 0[18000] [receive] via NET/Socket/1
tesla-105:29331:30215 [0] NCCL INFO Channel 01 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying

This problem is related to network communication. To resolve it, try the following:

  1. The port may already be in use; change --master_port to fix this;
  2. Change the network interface used for communication (on Linux, available interfaces can be listed with ifconfig or ip addr), for example:
> export NCCL_SOCKET_IFNAME=eth0