Multi-Node Multi-GPU Training: Network Setup and NCCL/GDR/IB Configuration

This article walks through installing the NVIDIA driver, CUDA, and Mellanox OFED on Linux, and configuring NCCL environment variables to improve GPU and HCA performance, including installing nv_peer_memory and running GDR tests. It focuses on how NCCL parameters such as P2P communication, socket thread count, and buffer size affect communication efficiency.


GPU + Mellanox HCA GDR test environment deployment

1. Install the NVIDIA driver (the latest driver is recommended); pick the package that matches your OS and GPU model:

https://www.nvidia.cn/Download/index.aspx?lang=cn
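If you install from the downloaded .run package, the flow is roughly the following (the file name is a placeholder for whatever version you downloaded; stop any running X server / display manager first):

# chmod +x NVIDIA-Linux-x86_64-<version>.run
# ./NVIDIA-Linux-x86_64-<version>.run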

2. Install CUDA:

https://developer.nvidia.com/cuda-downloads
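For the runfile installer, a typical invocation is sketched below (the file name is a placeholder; if the driver from step 1 is already installed, deselect the driver component when the installer prompts you):

# sh cuda_<version>_linux.run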

3. Install the Mellanox OFED release that matches your OS version:

https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed
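A rough sketch of installing from the downloaded MLNX_OFED tarball (the version and OS strings are placeholders):

# tar -xzf MLNX_OFED_LINUX-<version>-<os>-x86_64.tgz
# cd MLNX_OFED_LINUX-<version>-<os>-x86_64
# ./mlnxofedinstall
# /etc/init.d/openibd restart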

Once the above are installed, use nvidia-smi and ibstat to check the status of the GPUs and HCA cards.
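For example, nvidia-smi should list every GPU in the node, and ibstat should report each HCA port as State: Active with Physical state: LinkUp:

# nvidia-smi
# ibstat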

4. Disable SELinux and configure the CUDA environment variables (setenforce 0 only lasts until reboot; edit /etc/selinux/config for a permanent change):

# setenforce 0

# export PATH=$PATH:/usr/local/cuda/bin

# export LD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
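To keep these settings across sessions, append the two export lines to /etc/profile or ~/.bashrc; nvcc --version then confirms the CUDA toolchain is on the PATH:

# nvcc --version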

5. Install nv_peer_memory:

https://github.com/Mellanox/nv_peer_memory

# ./build_module.sh

# rpmbuild --rebuild /tmp/nvidia_peer_memory-1.0-8.src.rpm

# rpm -ivh <path to generated binary rpm file>
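After installation, check that the nv_peer_mem kernel module is actually loaded (the service name may differ slightly between distributions):

# lsmod | grep nv_peer_mem
# service nv_peer_mem status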

6. Install the GDR test tooling (perftest) and run a test:

https://github.com/linux-rdma/perftest
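A sketch of a GDR bandwidth test with perftest, assuming perftest is built with CUDA support via the CUDA_H_PATH configure option, the HCA is mlx5_0, GPU index 0 is used, and <server_ip> is the server's address (older perftest builds take --use_cuda without an index). Run the ib_write_bw command without an address on the server, then the one with <server_ip> on the client:

# ./autogen.sh
# ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
# make
# ib_write_bw -d mlx5_0 -a --use_cuda=0
# ib_write_bw -d mlx5_0 -a --use_cuda=0 <server_ip>

With GDR working (nv_peer_memory loaded), the bandwidth reported with --use_cuda should be close to the plain host-memory result; a large gap usually means the peer-memory path is not being used.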

7. Run nccl-tests:

https://www.jianshu.com/p/f34c08ccc2ff

https://github.com/NVIDIA/nccl-tests
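A minimal sketch of building and running nccl-tests on a single node with 4 GPUs (paths are placeholders; for multi-node runs build with MPI=1 and launch via mpirun, as in the example at the end of this article):

# make CUDA_HOME=/usr/local/cuda NCCL_HOME=<path to nccl>
# ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4

Here -b/-e are the start and end message sizes, -f is the size multiplication factor between steps, and -g is the number of GPUs per process.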

Understanding the NCCL environment variables (they affect communication efficiency)

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

These options can be changed either as environment variables or in nccl.conf. They alter NCCL's communication behavior and therefore its communication performance.
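For example, the variables can be exported in the environment of the launching process, or written one per line (without the export keyword) into /etc/nccl.conf or ~/.nccl.conf; the values below are only illustrative, not recommendations:

# export NCCL_DEBUG=INFO
# export NCCL_SOCKET_NTHREADS=4
# export NCCL_NET_GDR_LEVEL=2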

NCCL_P2P_DISABLE: P2P (peer-to-peer) communication is enabled by default, which is generally more efficient; with P2P, point-to-point transfers see better latency and better bandwidth. Setting this variable to 1 disables it.

NCCL_P2P_LEVEL: with P2P enabled, this sets the topology level up to which P2P may be used, i.e. under which conditions point-to-point communication is allowed; see the documentation for the level values (0-5).

NCCL_SOCKET_NTHREADS: increasing this improves socket transfer efficiency, at the cost of higher CPU load.

NCCL_BUFFSIZE: the size of the communication buffer. A larger buffer means each ring step carries more data, which puts more pressure on bandwidth but reduces the total number of transfer rounds and thus the accumulated latency. The default is 4 MB (4194304); note that the value is specified in bytes.

NCCL_MIN_NCHANNELS / NCCL_MAX_NCHANNELS: the minimum and maximum number of channels (rings). More rings put more pressure on GPU memory and bandwidth, and can also eat into compute performance.

NCCL_CHECKS_DISABLE: by default NCCL validates the arguments before every collective, which adds latency; in production this can be set to 1 to skip the checks. The default is 0.

NCCL_CHECK_POINTERS: setting this to 1 makes NCCL validate the CUDA memory pointers before every collective, which adds latency; it is useful during development and debugging, while in production it can stay at the default of 0.

NCCL_NET_GDR_LEVEL: the condition under which GDR (GPUDirect RDMA) is used; by default GDR is used when the GPU and the NIC are attached to the same PCIe switch.

NCCL_IGNORE_CPU_AFFINITY: ignore the CPU affinity set by the application and favor the GPU-to-NIC affinity instead.

NCCL_ALGO: the algorithm used for the collectives: Ring, Tree, or CollNet.
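Putting several of these together, a hypothetical two-node nccl-tests launch over InfiniBand with GDR could look like the following (host names, paths, and values are placeholders, and the -x flag for forwarding environment variables is an Open MPI option; whether a given setting actually helps depends entirely on your topology):

# mpirun -np 16 -H node1:8,node2:8 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_NET_GDR_LEVEL=2 \
    -x NCCL_SOCKET_NTHREADS=4 \
    -x NCCL_ALGO=Ring \
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1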
