Multi-node, multi-GPU testing with nccl-tests

bmseven

Last modified: 2024-07-03 16:34:26

Category: NCCL. Tags: deep learning

First published: 2024-07-03 16:33:02

Original link: https://blog.csdn.net/bmseven/article/details/140155650

Passwordless SSH login

Ubuntu ships with an SSH client by default; the SSH server still has to be installed:

sudo apt install openssh-server

Generate a key pair on the local machine

mkdir -p ~/.ssh && cd ~/.ssh
ssh-keygen -t rsa

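This creates two files in ~/.ssh/: id_rsa (the private key) and id_rsa.pub (the public key).

Note: the .ssh directory and the files in it must have the correct permissions, otherwise sshd may refuse to use the keys. These files belong to your own user, so sudo is not needed. The authorized_keys file exists only once a public key has been added to this machine.

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys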

Copy the public key to the target machine

ssh-copy-id star@192.168.0.100

Note: the part before @ is the user name; the part after it is the IP address.

Test the passwordless login

ssh star@192.168.0.100

Repeat this on every machine, and make sure that all machines can log in to each other without a password!
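
One way to confirm that every login really works without a password is to attempt a non-interactive login to each host. The sketch below is a hypothetical helper (the name check_ssh_hosts is made up for illustration). It relies on the standard OpenSSH options BatchMode=yes, which makes ssh fail instead of prompting for a password, and ConnectTimeout.

```shell
# check_ssh_hosts: hypothetical helper, not part of OpenSSH.
# Tries a non-interactive login to every host given as an argument and
# prints OK or FAIL per host. Returns non-zero if any host failed.
check_ssh_hosts() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes: never prompt for a password, fail instead.
        # -n: do not read from stdin.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}
```

For the two machines used in this post, something like check_ssh_hosts star@192.168.0.100 star@192.168.0.111 would be run on each machine in turn; any FAIL line points at a pair that still needs ssh-copy-id.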

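Installing NCCL (Ubuntu)

To install NCCL on Ubuntu, first add a repository containing the NCCL packages to APT, then install the packages through APT. Two repositories are available: a local repository and a network repository. The network repository is recommended, because it makes upgrading easy when new versions are released.

Install the repository.

For the local NCCL repository: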

sudo dpkg -i nccl-repo-<version>.deb

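Note: installing the local repository prompts you to install the local key it embeds, which is used to sign the packages. Follow the instructions to install that key, otherwise the installation phase will fail.

For the network repository: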

wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb

Update the APT database

sudo apt update

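Install the libnccl2 package with APT. If you also need to compile applications against NCCL, install the libnccl-dev package as well.

If you are using the network repository, the following command will upgrade CUDA to the latest version: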

sudo apt install libnccl2 libnccl-dev

To keep an older version of CUDA, specify an exact package version:

sudo apt install libnccl2=2.8.4-1+cuda11.1 libnccl-dev=2.8.4-1+cuda11.1
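
The two packages must be pinned to the same version string. The versions a repository offers can be listed with apt-cache madison libnccl2. As a small sketch, the helper below (its name is made up for illustration) assembles both pins from an NCCL package version and a CUDA version, so the two cannot drift apart:

```shell
# nccl_pkg_spec: hypothetical helper that prints matching version pins
# for libnccl2 and libnccl-dev.
# $1 = NCCL package version (e.g. 2.8.4-1), $2 = CUDA version (e.g. 11.1)
nccl_pkg_spec() {
    ver="$1+cuda$2"
    echo "libnccl2=$ver libnccl-dev=$ver"
}

# Print the install command; remove the leading echo to actually run it.
echo sudo apt install $(nccl_pkg_spec 2.8.4-1 11.1)
```

After installing a pinned version, sudo apt-mark hold libnccl2 libnccl-dev can be used to stop a later apt upgrade from replacing it.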

Installing MPI (Ubuntu)

OpenMPI is built and installed from source.

Download the OpenMPI source

Download it from the official OpenMPI website, or use wget:

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz

Extract the archive

tar -zxvf openmpi-4.1.4.tar.gz

Build and install

cd openmpi-4.1.4
./configure --prefix=/usr/local/openmpi
make -j"$(nproc)"
sudo make install

Configure environment variables

Add the following to /etc/profile:

export PATH=/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH

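Note: the change takes effect at the next login (or after a reboot). To apply it to the current shell right away, run source /etc/profile.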

Verify

Run the following commands to check that OpenMPI is installed correctly:

mpicc --version
mpirun --version

If both commands print version information, OpenMPI has been installed and configured successfully.
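
It can also help to confirm which binaries the shell actually resolves, since a distribution-packaged MPI elsewhere on PATH would also print a version. A minimal sketch, using a hypothetical helper named require_cmds:

```shell
# require_cmds: hypothetical helper; prints where each command resolves
# on PATH and returns non-zero if any of them is missing.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if where=$(command -v "$cmd"); then
            echo "$cmd -> $where"
        else
            echo "$cmd -> NOT FOUND"
            missing=1
        fi
    done
    return "$missing"
}
```

With the install prefix used above, require_cmds mpicc mpirun orted would be expected to report paths under /usr/local/openmpi/bin.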

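NCCL tests

These tests check both the performance and the correctness of NCCL operations.

Build

To build the tests, run make in the nccl-tests source directory.

If CUDA is not installed in /usr/local/cuda, specify CUDA_HOME. Similarly, if NCCL is not installed in /usr, specify NCCL_HOME.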

make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

NCCL tests rely on MPI to work on multiple processes, and hence on multiple nodes. To compile the tests with MPI support, set MPI=1 and set MPI_HOME to the path where MPI is installed.

make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

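Usage

NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of processes is managed by MPI and is therefore not passed to the tests as an argument.

Examples

Run on 8 GPUs (-g 8), scanning from 8 bytes to 128 MB: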

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

Run with MPI on 4 processes (2 nodes, 2 processes per node) with 1 GPU per process, for a total of 4 GPUs:

mpirun -np 4 -H 192.168.0.111:2,192.168.0.100:2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
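
In this command, -np must equal the sum of the slot counts given after each address in -H. As a sketch of that relationship, the hypothetical helper below (mpirun_cmd is an illustrative name) derives both from a per-node process count and a list of nodes, then prints the resulting command line instead of running it:

```shell
# mpirun_cmd: hypothetical helper that prints an mpirun command line.
# $1 = processes (slots) per node; remaining arguments = node addresses.
mpirun_cmd() {
    slots="$1"
    shift
    # Total process count = slots per node * number of nodes.
    np=$(( $slots * $# ))
    hosts=""
    for node in "$@"; do
        # Append "node:slots", comma-separated.
        hosts="${hosts:+$hosts,}$node:$slots"
    done
    echo "mpirun -np $np -H $hosts ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1"
}

mpirun_cmd 2 192.168.0.111 192.168.0.100
```

Called with the two addresses from this post and 2 processes per node, it reproduces the command shown above.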

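Arguments

All tests support the same set of arguments.

Number of GPUs
- -t, --nthreads <num threads> Number of threads per process. Default: 1.
- -g, --ngpus <GPUs per thread> Number of GPUs per thread. Default: 1.
Sizes to scan
- -b, --minbytes <min size in bytes> Minimum size to start with. Default: 32M.
- -e, --maxbytes <max size in bytes> Maximum size to end at. Default: 32M.
- The increment can be either a fixed step or a multiplication factor. Only one of the two should be used.
- -i, --stepbytes <increment size> Fixed increment between sizes. Default: 1M.
- -f, --stepfactor <increment factor> Multiplication factor between sizes. Default: disabled.
NCCL operation arguments
- -o, --op <sum/prod/min/max/avg/all> Reduction operation to perform. Only relevant for reduction operations such as Allreduce, Reduce or ReduceScatter. Default: Sum.
- -d, --datatype <nccltype/all> Datatype to use. Default: Float.
- -r, --root <root/all> Root to use. Only relevant for operations with a root, such as Broadcast or Reduce. Default: 0.
Performance
- -n, --iters <iteration count> Number of iterations. Default: 20.
- -w, --warmup_iters <warmup iteration count> Number of warmup iterations (not timed). Default: 5.
- -m, --agg_iters <aggregation count> Number of operations to aggregate together in each iteration. Default: 1.
- -a, --average <0/1/2/3> Report performance as an average across all ranks (MPI=1 only). <0=Rank0,1=Avg,2=Min,3=Max>. Default: 1.
Test operation
- -p, --parallel_init <0/1> Use threads to initialize NCCL in parallel. Default: 0.
- -c, --check <check iteration count> Perform the given number of iterations, checking the correctness of the results on each one. This can be quite slow on large numbers of GPUs. Default: 1.
- -z, --blocking <0/1> Make NCCL collectives blocking, i.e. have the CPU wait and synchronize after each collective. Default: 0.
- -G, --cudagraph <num graph launches> Capture iterations as a CUDA graph and replay it the specified number of times. Default: 0.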
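
To make the size arguments concrete, the sketch below lists the message sizes a multiplicative sweep visits. It assumes that the test starts at the minimum size and multiplies by the step factor until the maximum is exceeded, and that the M suffix means 1024 * 1024 bytes; sweep_sizes is a hypothetical helper, not part of nccl-tests.

```shell
# sweep_sizes: hypothetical helper; prints the sizes (in bytes) of a
# multiplicative sweep. $1 = min bytes, $2 = max bytes, $3 = factor.
sweep_sizes() {
    # A factor of 1 or less would never reach the maximum.
    [ "$3" -gt 1 ] || return 1
    size="$1"
    while [ "$size" -le "$2" ]; do
        echo "$size"
        size=$(( $size * $3 ))
    done
}

# Sizes for -b 8 -e 128M -f 2, as used in the examples above.
sweep_sizes 8 $(( 128 * 1024 * 1024 )) 2
```

Under those assumptions, -b 8 -e 128M -f 2 covers 25 sizes, from 8 bytes up to 134217728 bytes, each double the previous one.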

Common problems in multi-node runs

Problem 1:

bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

Fix: add the --prefix argument, pointing at the OpenMPI installation directory

mpirun -np 4 -H 192.168.0.111:2,192.168.0.100:2 --prefix /usr/local/openmpi ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
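
A likely cause of this error is that the shell ssh starts for a remote command is non-interactive and typically does not read /etc/profile, so the PATH export added there is not visible to it. The sketch below checks this directly by looking up orted through such a session; remote_orted_path is a hypothetical helper name.

```shell
# remote_orted_path: hypothetical helper, not part of Open MPI.
# Prints where orted resolves in a non-interactive ssh session on host $1,
# or reports that it is missing from that session's PATH.
remote_orted_path() {
    if found=$(ssh -n -o BatchMode=yes "$1" 'command -v orted' 2>/dev/null); then
        echo "$1: $found"
    else
        echo "$1: orted not on non-interactive PATH"
        return 1
    fi
}
```

If a node reports that orted is missing, either keep passing --prefix as above or, as the error message itself suggests, configure OpenMPI with --enable-orterun-prefix-by-default when building it.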

Problem 2:

--------------------------------------------------------------------------
A compressed message was received by the Open MPI run time system
(PMIx) that could not be decompressed.  This means that Open MPI has
compression support enabled on one node and not enabled on another.
This is an unsupported configuration.

Compression support is enabled when both of the following conditions
are met:

1. The Open MPI run time system (PMIx) is built with compression
   support.
2. The necessary compression libraries (e.g., libz) can be found at
   run time.

You should check that both of these conditions are true on both the
node where mpirun is invoked and all the nodes where MPI processes
will be launched.  The node listed below does not have both conditions
met:

  node without compression support:  wenji-Ubuntu

NOTE: There may also be other nodes without compression support.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

Fix: install zlib on the node reported as lacking compression support (wenji-Ubuntu in the output above)

sudo apt install zlib1g
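
The error message requires the compression library to be findable at run time on every node. A quick check, sketched here with a hypothetical helper named has_libz, is to ask the dynamic linker cache whether it lists libz:

```shell
# has_libz: hypothetical helper; succeeds if the dynamic linker cache
# (ldconfig -p) lists a libz shared library on this machine.
has_libz() {
    ldconfig -p | grep -q 'libz\.so'
}

if has_libz; then
    echo "libz found"
else
    echo "libz missing"
fi
```

Run it on each node; the result should be the same everywhere. If libz is present on all nodes and the error persists, the first condition in the message (whether each node's OpenMPI was built with compression support) is the next thing to compare.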