PyTorch Multi-GPU Training in Practice (5) - DDP: Code Changes for torch.distributed.launch

Preface

Tutorials (3) and (4) covered the underlying logic of DistributedDataParallel, so by now you should have a reasonable understanding of distributed data parallelism. PyTorch provides a convenient interface, torch.nn.parallel.DistributedDataParallel, that makes it fairly easy to convert code to distributed data-parallel mode. In this tutorial, I will modify the code step by step into a DDP version launched with torch.distributed.launch.

Prerequisites

To better follow this tutorial, we need to understand what torch.distributed.launch does. Let's first look at the arguments torch.distributed.launch accepts, obtained with python -m torch.distributed.launch --help. The source code can be found in torch.distributed.launch.

> python -m torch.distributed.launch --help
usage: launch.py [-h] [--nnodes NNODES] [--node_rank NODE_RANK]
                 [--nproc_per_node NPROC_PER_NODE] [--master_addr MASTER_ADDR]
                 [--master_port MASTER_PORT] [--use_env] [-m] [--no_python]
                 training_script ...

Here is a detailed description of the arguments of torch.distributed.launch:

  • nnodes: the number of nodes; one node corresponds to one machine;
  • node_rank: the index of the node, starting from 0;
  • nproc_per_node: the number of processes on a node. Since one process usually drives one GPU, this is often described as the number of GPUs on a node;
  • master_addr: the IP address of the master node, i.e., the address of the host where rank=0 runs. Setting it lets the other nodes know where node 0 is, so they can send the parameters they train there for processing;
  • master_port: the port of the master node, used for communication;
  • use_env: with --use_env, PyTorch puts the local_rank of the current process into the LOCAL_RANK environment variable instead of into args.local_rank. Note that the official documentation now recommends replacing torch.distributed.launch with torchrun. In torchrun, the --use_env flag has been removed and its behavior is the default, so users are required to read the process's rank on the local machine from the LOCAL_RANK environment variable (see the sketch below).
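
To make the difference concrete, here is a minimal sketch (my illustration, not from the original post) of the two ways a training script can obtain its local rank:

import argparse
import os

# Without --use_env: torch.distributed.launch passes --local_rank=<n> on the command line.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()
local_rank = args.local_rank

# With --use_env (and always with torchrun): read it from the environment instead.
local_rank = int(os.environ.get("LOCAL_RANK", 0))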

After running the code with torch.distributed.launch, each process gets five variables set in its environment (MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, and WORLD_SIZE). RANK, LOCAL_RANK, and WORLD_SIZE are as follows:

  • RANK: read with os.environ["RANK"]; the global index of the process, where generally one process corresponds to one GPU. It starts at 0, and its maximum is the total number of GPUs minus 1;
  • LOCAL_RANK: read with os.environ["LOCAL_RANK"]; the index of the process within its own host. It starts at 0, and its maximum is the number of GPUs on that host minus 1;
  • WORLD_SIZE: read with os.environ["WORLD_SIZE"]; the total number of processes launched (summed over all machines).

To make this concrete, consider an example: suppose we use 2 machines with 4 GPUs each. Then RANK takes values in [0, 7], LOCAL_RANK on each machine takes values in [0, 3], and WORLD_SIZE is 8.
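
With this launcher, the global rank is determined by the node rank and the local rank. A small sketch (my addition) that reproduces the numbers above:

nnodes, nproc_per_node = 2, 4
world_size = nnodes * nproc_per_node  # 8
for node_rank in range(nnodes):
    for local_rank in range(nproc_per_node):
        # RANK = node_rank * nproc_per_node + LOCAL_RANK
        rank = node_rank * nproc_per_node + local_rank
        print(f"node {node_rank}, LOCAL_RANK {local_rank} -> RANK {rank}, WORLD_SIZE {world_size}")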

Next, I run a quick test on our servers (2 machines, 4 GPUs each) to print these five variables. The code is as follows:

import os
import time
import torch.distributed as dist

print("before running dist.init_process_group()")
MASTER_ADDR = os.environ["MASTER_ADDR"]
MASTER_PORT = os.environ["MASTER_PORT"]
LOCAL_RANK = os.environ["LOCAL_RANK"]
RANK = os.environ["RANK"]
WORLD_SIZE = os.environ["WORLD_SIZE"]

print("MASTER_ADDR: {}\tMASTER_PORT: {}".format(MASTER_ADDR, MASTER_PORT))
print("LOCAL_RANK: {}\tRANK: {}\tWORLD_SIZE: {}".format(LOCAL_RANK, RANK, WORLD_SIZE))

dist.init_process_group('nccl')
print("after running dist.init_process_group()")
time.sleep(60)  # Sleep for a while to avoid exceptions that occur when some processes end too quickly.
dist.destroy_process_group()

Single machine, multiple GPUs

We first test the single-machine, multi-GPU case. The syntax is:

> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
           YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
           arguments of your training script)

Next, we run this code. We can see that torch.distributed.launch has automatically added MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, and WORLD_SIZE to the environment variables.

> CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 0	RANK: 0	WORLD_SIZE: 2
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 1	RANK: 1	WORLD_SIZE: 2
after running dist.init_process_group()
after running dist.init_process_group()

> CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 0	RANK: 0	WORLD_SIZE: 4
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 2	RANK: 2	WORLD_SIZE: 4
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 3	RANK: 3	WORLD_SIZE: 4
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1	MASTER_PORT: 29500
LOCAL_RANK: 1	RANK: 1	WORLD_SIZE: 4
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()

Multiple machines, multiple GPUs

Take 2 machines as an example, with the master node at IP address 192.168.1.1 and port 1234.

The syntax on machine 1 is:

> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
         --nnodes=2 --node_rank=0 --master_addr="192.168.1.1"
         --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
         and all other arguments of your training script)

The syntax on machine 2 is:

> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
         --nnodes=2 --node_rank=1 --master_addr="192.168.1.1"
         --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
         and all other arguments of your training script)

The results of running on our servers are as follows:

Machine 1 (master, IP: 192.168.1.105):

> python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr='192.168.1.105' --master_port='12345' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 0	RANK: 0	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 1	RANK: 1	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 3	RANK: 3	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 2	RANK: 2	WORLD_SIZE: 8
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()

Machine 2 (IP: 192.168.1.106):

> python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 1 --master_addr='192.168.1.105' --master_port='12345' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 0	RANK: 4	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 1	RANK: 5	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 3	RANK: 7	WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105	MASTER_PORT: 12345
LOCAL_RANK: 2	RANK: 6	WORLD_SIZE: 8
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()

More details are available directly in my code:

https://github.com/HongxinXiang/pytorch-multi-GPU-training-tutorial/tree/master/test/torch_distributed_launch

Modifying the code to a DDP version

Now let's start converting the basic training code into DDP training code. There are 4 main places to modify.

Change 1: Initialize the distributed process group and the distributed device

At the very beginning of the code, initialize the environment required by DDP.

import os

import torch
import torch.distributed as dist


def setup_DDP(backend="nccl", verbose=False):
    """
    We do not set MASTER_ADDR and MASTER_PORT here, e.g.:
        # os.environ['MASTER_ADDR'] = 'localhost'
        # os.environ['MASTER_PORT'] = '12355'
    because they can be supplied automatically at launch time, e.g.:
        python -m torch.distributed.launch --master_addr="192.168.1.201" --master_port=23456 ...

    Likewise, rank and world_size do not need to be passed to dist.init_process_group()
    explicitly; they are read from the environment variables set by the launcher.

    :param backend: communication backend; on Windows or macOS use "gloo" instead of "nccl"
    :param verbose: if True, print rank information after initialization
    :return: (rank, local_rank, world_size, device)
    """
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # If the OS is Windows or macOS, use gloo instead of nccl
    dist.init_process_group(backend=backend)
    # set the distributed device for this process
    device = torch.device("cuda:{}".format(local_rank))
    if verbose:
        print(f"local rank: {local_rank}, global rank: {rank}, world size: {world_size}")
    return rank, local_rank, world_size, device

rank, local_rank, world_size, device = setup_DDP(verbose=True)
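
A common addition here (my suggestion, not part of the original function) is to also bind the current process to its GPU right after setup, so that CUDA operations without an explicit device, as well as NCCL, default to the correct card:

# Optional: pin this process to its GPU (assumes one GPU per process and CUDA available)
torch.cuda.set_device(local_rank)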

Change 2: Initialize the DataLoader with a DistributedSampler

  1. Modify batch_size. I divide the original batch_size=64 by world_size, so each GPU processes only its share of the data. Since the batch_size passed here is the per-GPU batch size, it should be increased appropriately as the number of GPUs grows (see the sanity check after the code below).
  2. Initialize a DistributedSampler.
  3. Initialize the DataLoader, passing the sampler argument.
batch_size = 64 // world_size  # [*] // world_size
train_sampler = DistributedSampler(training_data, shuffle=True)  # [*]
test_sampler = DistributedSampler(test_data, shuffle=False)  # [*]
train_dataloader = DataLoader(training_data, batch_size=batch_size, sampler=train_sampler)  # [*] sampler=...
test_dataloader = DataLoader(test_data, batch_size=batch_size, sampler=test_sampler)  # [*] sampler=...
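
As a quick sanity check (my addition, relying on the training_data, world_size, and batch_size defined above), DistributedSampler gives each rank roughly 1/world_size of the dataset, and the effective global batch size stays at batch_size * world_size:

import math

samples_per_rank = math.ceil(len(training_data) / world_size)  # samples this process sees per epoch
global_batch_size = batch_size * world_size                    # back to 64 when world_size divides 64
print(f"{samples_per_rank} samples per rank, global batch size {global_batch_size}")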

Change 3: Wrap the model with DistributedDataParallel

Wrap the defined model with torch.nn.parallel.DistributedDataParallel, and explicitly specify the device the model runs on (device_ids) and the device where the output is placed (output_device).

from torch.nn.parallel import DistributedDataParallel as DDP

# initialize model
model = NeuralNetwork().to(device)  # copy model from cpu to gpu
# [*] using DistributedDataParallel
model = DDP(model, device_ids=[local_rank], output_device=local_rank)  # [*] DDP(...)
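
The training loop itself does not need to change: DDP hooks into backward() and all-reduces the gradients across processes, so every rank applies the same averaged gradients. A minimal sketch of one training step (X and y are a batch from the dataloader; loss_fn and optimizer are the ones from the base tutorial, assumed here):

pred = model(X.to(device))            # forward pass on this process's GPU
loss = loss_fn(pred, y.to(device))
optimizer.zero_grad()
loss.backward()                       # DDP all-reduces gradients across all ranks here
optimizer.step()                      # each rank applies the same synchronized gradients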

Change 4: Save the model

Saving the model works the same way as in the single-machine, single-GPU case. To avoid saving the model repeatedly, we save it only on the master host.

# [*] save model on rank 0
if dist.get_rank() == 0:
    model_state_dict = model.state_dict()
    torch.save(model_state_dict, "model.pth")
    print("Saved PyTorch Model State to model.pth")
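
One detail to keep in mind (my note, not from the original post): because model is wrapped in DDP, model.state_dict() stores its keys with a "module." prefix. If you plan to load the checkpoint into a plain, unwrapped NeuralNetwork later, saving the inner module's state dict avoids stripping the prefix by hand:

# [*] alternative: save the unwrapped weights so they load directly into NeuralNetwork()
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "model.pth")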

Besides these, there are two optional changes:

  1. Set the sampler's epoch, so the sampler knows which epoch training is currently in and shuffles the data differently in each epoch (see the sketch after the snippet below).
# [*] set sampler
train_dataloader.sampler.set_epoch(t)
test_dataloader.sampler.set_epoch(t)
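
In context, these calls go at the top of every epoch. A minimal sketch of the loop (epochs, train(), test(), loss_fn, and optimizer are assumed from the base tutorial):

epochs = 5
for t in range(epochs):
    # [*] tell both samplers which epoch this is, so the shuffling differs between epochs
    train_dataloader.sampler.set_epoch(t)
    test_dataloader.sampler.set_epoch(t)
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)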

  2. Print training and test logs only on the rank=0 host. Some other print() calls can also be changed to print_only_rank0():

def print_only_rank0(log):
    if dist.get_rank() == 0:
        print(log)

def train(...):
    ...
    # [*] only print log on rank 0
    if dist.get_rank() == 0 and batch % 100 == 0:
        loss, current = loss.item(), batch * len(X)
        print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
    ...

def test(...):
    ...
    # [*] only print log on rank 0
    print_only_rank0(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
    ...

Finally, the complete code is available at the link below:

https://github.com/HongxinXiang/pytorch-multi-GPU-training-tutorial/blob/master/single-machine-and-multi-GPU-DistributedDataParallel-launch.py

Running the code

We run on 2 servers, with IPs 192.168.1.105 (master) and 192.168.1.106. Each machine has 4 GPUs.

Before running multi-machine, multi-GPU training, we need to pay attention to two points:

  1. The servers need to communicate with each other, so first confirm that the machines can ping one another;
  2. NCCL also needs to communicate between GPUs. The NCCL error you are most likely to hit at runtime is: RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8. This can happen because NCCL is not installed properly, because NCCL failed to establish communication, because of a firewall, and so on. We can switch the run into DEBUG mode via environment variables, which provides more information to locate the error, as shown in the following commands (see torch.distributed for more details).
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

Next, let's run the program.

Machine 1 (master, IP: 192.168.1.105):

> CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr='192.168.1.105' --master_port='12345' single-machine-and-multi-GPU-DistributedDataParallel-launch.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using device: cuda:3
local rank: 3, global rank: 3, world size: 8
Using device: cuda:2
Using device: cuda:1
local rank: 1, global rank: 1, world size: 8
local rank: 2, global rank: 2, world size: 8
Using device: cuda:0
local rank: 0, global rank: 0, world size: 8
tesla-105:1475:1475 [0] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1475:1475 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1475:1475 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1475:1475 [0] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1475:1475 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1
tesla-105:1477:1477 [1] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1477:1477 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1477:1477 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1477:1477 [1] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1477:1477 [1] NCCL INFO Using network Socket
tesla-105:1481:1481 [3] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1481:1481 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1481:1481 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1481:1481 [3] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1481:1481 [3] NCCL INFO Using network Socket
tesla-105:1480:1480 [2] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1480:1480 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1480:1480 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1480:1480 [2] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1480:1480 [2] NCCL INFO Using network Socket
tesla-105:1481:2165 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1481:2165 [3] NCCL INFO Trees [0] -1/-1/-1->3->2|2->3->-1/-1/-1 [1] -1/-1/-1->3->2|2->3->-1/-1/-1
tesla-105:1481:2165 [3] NCCL INFO Setting affinity for GPU 3 to ff,c00ffc00
tesla-105:1475:2146 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
tesla-105:1477:2150 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1477:2150 [1] NCCL INFO Trees [0] 2/4/-1->1->0|0->1->2/4/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
tesla-105:1477:2150 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
tesla-105:1475:2146 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
tesla-105:1475:2146 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1480:2169 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1480:2169 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
tesla-105:1480:2169 [2] NCCL INFO Setting affinity for GPU 2 to ff,c00ffc00
tesla-105:1475:2146 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->5|5->0->1/-1/-1
tesla-105:1475:2146 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
tesla-105:1481:2165 [3] NCCL INFO Channel 00 : 3[b1000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 00 : 2[af000] -> 3[b1000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 2[af000] via direct shared memory
tesla-105:1475:2146 [0] NCCL INFO Channel 00 : 7[b1000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 00 : 2[af000] -> 1[1a000] via direct shared memory
tesla-105:1475:2146 [0] NCCL INFO Channel 00 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 4[18000] -> 1[1a000] [receive] via NET/Socket/0
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 7[b1000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-105:1481:2165 [3] NCCL INFO Channel 00 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1481:2165 [3] NCCL INFO Channel 01 : 3[b1000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 01 : 2[af000] -> 3[b1000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 01 : 1[1a000] -> 2[af000] via direct shared memory
tesla-105:1481:2165 [3] NCCL INFO Channel 01 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-105:1480:2169 [2] NCCL INFO Channel 01 : 2[af000] -> 1[1a000] via direct shared memory
tesla-105:1481:2165 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1481:2165 [3] NCCL INFO comm 0x7fd634001060 rank 3 nranks 8 cudaDev 3 busId b1000 - Init COMPLETE
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 0[18000] -> 5[1a000] [send] via NET/Socket/0
tesla-105:1477:2150 [1] NCCL INFO Channel 01 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1477:2150 [1] NCCL INFO comm 0x7f30b4001060 rank 1 nranks 8 cudaDev 1 busId 1a000 - Init COMPLETE
tesla-105:1480:2169 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1480:2169 [2] NCCL INFO comm 0x7f37d4001060 rank 2 nranks 8 cudaDev 2 busId af000 - Init COMPLETE
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 5[1a000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1475:2146 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1475:2146 [0] NCCL INFO comm 0x7f9e54001060 rank 0 nranks 8 cudaDev 0 busId 18000 - Init COMPLETE
tesla-105:1475:1475 [0] NCCL INFO Launch mode Parallel
DistributedDataParallel(
  (module): NeuralNetwork(
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (linear_relu_stack): Sequential(
      (0): Linear(in_features=784, out_features=512, bias=True)
      (1): ReLU()
      (2): Linear(in_features=512, out_features=512, bias=True)
      (3): ReLU()
      (4): Linear(in_features=512, out_features=10, bias=True)
    )
  )
)
Epoch 1
-------------------------------
loss: 2.294374  [    0/60000]
loss: 2.301075  [  800/60000]
loss: 2.315739  [ 1600/60000]
loss: 2.299692  [ 2400/60000]
loss: 2.258646  [ 3200/60000]
loss: 2.252302  [ 4000/60000]
loss: 2.218223  [ 4800/60000]
loss: 2.126724  [ 5600/60000]
loss: 2.174220  [ 6400/60000]
loss: 2.177455  [ 7200/60000]
Test Error: 
 Accuracy: 4.1%, Avg loss: 2.166388 

Epoch 2
-------------------------------
loss: 2.136480  [    0/60000]
loss: 2.127040  [  800/60000]
loss: 2.118551  [ 1600/60000]
loss: 2.051364  [ 2400/60000]
loss: 2.076279  [ 3200/60000]
loss: 2.002108  [ 4000/60000]
loss: 2.075573  [ 4800/60000]
loss: 1.959522  [ 5600/60000]
loss: 1.861534  [ 6400/60000]
loss: 1.872814  [ 7200/60000]
Test Error: 
 Accuracy: 7.2%, Avg loss: 1.908959 

Epoch 3
-------------------------------
loss: 2.081742  [    0/60000]
loss: 1.841850  [  800/60000]
loss: 1.939971  [ 1600/60000]
loss: 1.684577  [ 2400/60000]
loss: 1.648371  [ 3200/60000]
loss: 1.774270  [ 4000/60000]
loss: 1.552769  [ 4800/60000]
loss: 1.508346  [ 5600/60000]
loss: 1.516589  [ 6400/60000]
loss: 1.481997  [ 7200/60000]
Test Error: 
 Accuracy: 7.8%, Avg loss: 1.533547 

Epoch 4
-------------------------------
loss: 1.625404  [    0/60000]
loss: 1.543570  [  800/60000]
loss: 1.428792  [ 1600/60000]
loss: 1.446484  [ 2400/60000]
loss: 1.841029  [ 3200/60000]
loss: 1.320562  [ 4000/60000]
loss: 1.511142  [ 4800/60000]
loss: 1.444456  [ 5600/60000]
loss: 1.570060  [ 6400/60000]
loss: 1.482602  [ 7200/60000]
Test Error: 
 Accuracy: 8.0%, Avg loss: 1.256674 

Epoch 5
-------------------------------
loss: 1.064455  [    0/60000]
loss: 1.233810  [  800/60000]
loss: 1.168940  [ 1600/60000]
loss: 1.227281  [ 2400/60000]
loss: 1.437644  [ 3200/60000]
loss: 1.195065  [ 4000/60000]
loss: 1.305991  [ 4800/60000]
loss: 1.258441  [ 5600/60000]
loss: 0.970569  [ 6400/60000]
loss: 1.698888  [ 7200/60000]
Test Error: 
 Accuracy: 8.2%, Avg loss: 1.083617 

Done!
Saved PyTorch Model State to model.pth

Machine 2 (IP: 192.168.1.106):

> CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 1 --master_addr='192.168.1.105' --master_port='12345' single-machine-and-multi-GPU-DistributedDataParallel-launch.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using device: cuda:0
Using device: cuda:1

local rank: 1, global rank: 5, world size: 8
local rank: 0, global rank: 4, world size: 8
Using device: cuda:2
local rank: 2, global rank: 6, world size: 8
Using device: cuda:3
local rank: 3, global rank: 7, world size: 8
tesla-106:1942:1942 [1] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1942:1942 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1942:1942 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1942:1942 [1] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1942:1942 [1] NCCL INFO Using network Socket
tesla-106:1988:1988 [3] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1988:1988 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1988:1988 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1988:1988 [3] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1988:1988 [3] NCCL INFO Using network Socket
tesla-106:1943:1943 [2] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1943:1943 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1943:1943 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1943:1943 [2] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1943:1943 [2] NCCL INFO Using network Socket
tesla-106:1940:1940 [0] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1940:1940 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1940:1940 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1940:1940 [0] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1940:1940 [0] NCCL INFO Using network Socket
tesla-106:1988:2787 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1988:2787 [3] NCCL INFO Trees [0] -1/-1/-1->7->6|6->7->-1/-1/-1 [1] -1/-1/-1->7->6|6->7->-1/-1/-1
tesla-106:1988:2787 [3] NCCL INFO Setting affinity for GPU 3 to ff,c00ffc00
tesla-106:1943:2821 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1943:2821 [2] NCCL INFO Trees [0] 7/-1/-1->6->5|5->6->7/-1/-1 [1] 7/-1/-1->6->5|5->6->7/-1/-1
tesla-106:1943:2821 [2] NCCL INFO Setting affinity for GPU 2 to ff,c00ffc00
tesla-106:1942:2786 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1942:2786 [1] NCCL INFO Trees [0] 6/-1/-1->5->4|4->5->6/-1/-1 [1] 6/0/-1->5->4|4->5->6/0/-1
tesla-106:1942:2786 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
tesla-106:1940:2831 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1940:2831 [0] NCCL INFO Trees [0] 5/-1/-1->4->1|1->4->5/-1/-1 [1] 5/-1/-1->4->-1|-1->4->5/-1/-1
tesla-106:1942:2786 [1] NCCL INFO Channel 00 : 5[1a000] -> 6[af000] via direct shared memory
tesla-106:1940:2831 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
tesla-106:1943:2821 [2] NCCL INFO Channel 00 : 6[af000] -> 7[b1000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 00 : 7[b1000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 3[b1000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 4[18000] -> 5[1a000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 00 : 7[b1000] -> 6[af000] via P2P/IPC
tesla-106:1942:2786 [1] NCCL INFO Channel 00 : 5[1a000] -> 4[18000] via P2P/IPC
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 4[18000] -> 1[1a000] [send] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 1[1a000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1988:2787 [3] NCCL INFO Channel 01 : 7[b1000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1943:2821 [2] NCCL INFO Channel 00 : 6[af000] -> 5[1a000] via direct shared memory
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 6[af000] via direct shared memory
tesla-106:1943:2821 [2] NCCL INFO Channel 01 : 6[af000] -> 7[b1000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 01 : 7[b1000] -> 6[af000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1988:2787 [3] NCCL INFO comm 0x7fbb14001060 rank 7 nranks 8 cudaDev 3 busId b1000 - Init COMPLETE
tesla-106:1943:2821 [2] NCCL INFO Channel 01 : 6[af000] -> 5[1a000] via direct shared memory
tesla-106:1940:2831 [0] NCCL INFO Channel 01 : 3[b1000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 01 : 4[18000] -> 5[1a000] via P2P/IPC
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 0[18000] -> 5[1a000] [receive] via NET/Socket/0
tesla-106:1943:2821 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1943:2821 [2] NCCL INFO comm 0x7f6fec001060 rank 6 nranks 8 cudaDev 2 busId af000 - Init COMPLETE
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 4[18000] via P2P/IPC
tesla-106:1940:2831 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1940:2831 [0] NCCL INFO comm 0x7f5550001060 rank 4 nranks 8 cudaDev 0 busId 18000 - Init COMPLETE
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1942:2786 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1942:2786 [1] NCCL INFO comm 0x7f75d4001060 rank 5 nranks 8 cudaDev 1 busId 1a000 - Init COMPLETE

Troubleshooting

While running the code, we ran into an error:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
tesla-105:29334:30213 [1] NCCL INFO Channel 00 : 6[af000] -> 1[1a000] [receive] via NET/Socket/1
tesla-105:29334:30213 [1] NCCL INFO Channel 00 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-105:29331:30215 [0] NCCL INFO Channel 01 : 4[18000] -> 0[18000] [receive] via NET/Socket/1
tesla-105:29331:30215 [0] NCCL INFO Channel 01 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying

This problem is related to network communication. To resolve it, try the following:

  1. The port may already be in use; change --master_port to fix this;
  2. Change the network interface used for communication (on Linux, available interfaces can be listed with ifconfig or ip addr), for example:
> export NCCL_SOCKET_IFNAME=eth0