H800 Multi-Node, Multi-GPU Communication Bandwidth Test
Background
The NVLink bandwidth between NVIDIA H800 GPUs can be estimated from the NVLink generation and the number of links. The latest generation, NVLink 4, provides roughly 50 GB/s of bidirectional bandwidth per link (25 GB/s in each direction).
To determine the NVLink bandwidth between two H800 GPUs, the key is to know how many NVLink links connect them.
Assuming each pair of GPUs is connected by several NVLink links, the total bandwidth follows from a simple multiplication. For example:
Example calculation
Assume 8 NVLink links between each pair of H800 GPUs:
Bidirectional bandwidth per NVLink link: 50 GB/s
Total bandwidth = bidirectional bandwidth per link × number of links
Total bandwidth = 50 GB/s × 8 links = 400 GB/s bidirectional, i.e. 200 GB/s per direction
When NCCL (NVIDIA Collective Communication Library) communicates between GPUs over NVLink, it uses multiple NVLink links at the same time to maximize bandwidth and minimize latency, relying on multi-path communication and load balancing.
The key mechanisms NCCL uses for multi-GPU communication over NVLink are:
- Multi-path communication: NCCL automatically detects and uses multiple NVLink links, transferring data between GPUs in parallel to raise aggregate bandwidth.
- Topology awareness: NCCL discovers the system topology, i.e. how the GPUs are interconnected, and uses it to choose the best communication routes; when needed it spreads traffic across several NVLink links.
- Load balancing: NCCL balances the transfer load so that every NVLink link is well utilized, avoiding the case where some links are overloaded while others sit idle; this further improves communication efficiency.
- Global optimization: on multi-GPU systems NCCL plans and schedules the whole communication pattern globally, considering not just the bandwidth of individual links but the overall network topology.
NCCL automatically picks the higher-bandwidth transport: IB has priority over sockets. NCCL first searches for IB devices and falls back to TCP sockets only if none are found or the user explicitly disables IB.
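To confirm which transport and ring layout NCCL actually selects, raise the debug level before running any NCCL workload (a minimal sketch; the exact log lines vary between NCCL versions):
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,NET   # restrict the log to transport and topology decisions
# then launch any NCCL job and look for the selected NET device and ring/channel layout in the log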
NCCL_MAX_NRINGS:
If you configure two rings, each ring carries part of the data. For example, when sending 128 MB, each of the two rings transfers 64 MB.
The purpose is to use multiple rings to raise aggregate communication bandwidth and shorten communication time.
In practice this data-splitting scheme rests on the following ideas (a small configuration sketch follows the steps below):
Parallel transfer: multiple rings transfer data in parallel, each carrying a portion of it, which shortens the total communication time; in theory, multi-ring transfer improves bandwidth utilization.
Load balancing: each ring carries an equal (or proportionally equal) share of the data, reducing the load on any single ring and improving transfer efficiency.
Concrete steps
Data split: divide the 128 MB payload into two 64 MB chunks.
Parallel send: the two rings send the two 64 MB chunks in parallel.
Reassembly: the receiver reassembles the chunks into the complete 128 MB payload.
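A minimal configuration sketch for the two-ring case (assumption: the legacy NCCL_MIN_NRINGS/NCCL_MAX_NRINGS names are honored by the NCCL build in use; newer releases expose the same knob as NCCL_MIN_NCHANNELS/NCCL_MAX_NCHANNELS):
export NCCL_MIN_NRINGS=2
export NCCL_MAX_NRINGS=2   # pin NCCL to exactly two rings, so each ring carries about half of every collective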
Environment variable notes
- NCCL_IB_GID_INDEX: defaults to 0. Indexes 0 and 1 are the IPv6-based GIDs, while 2 and 3 are the IPv4-based ones; index 2 selects RoCE v1 and index 3 selects RoCE v2. (The exact GID table layout is system-dependent and can be verified as shown below.)
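A quick way to inspect the GID table (a sketch; show_gids ships with MLNX_OFED, and the sysfs paths assume device mlx5_0, port 1):
show_gids mlx5_0
# or read sysfs directly:
cat /sys/class/infiniband/mlx5_0/ports/1/gids/3                 # the GID itself
cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/3      # the RoCE version behind GID index 3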
Check NVLink link status
nvidia-smi nvlink -s
GPU 0: NVIDIA H800 (UUID: )
Link 0: 26.562 GB/s
Link 1: 26.562 GB/s
Link 2: 26.562 GB/s
Link 3: 26.562 GB/s
Link 4: 26.562 GB/s
Link 5: 26.562 GB/s
Link 6: 26.562 GB/s
Link 7: 26.562 GB/s
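A quick cross-check of the output above: 8 links × 26.562 GB/s ≈ 212.5 GB/s, close to the 200 GB/s per direction estimated earlier.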
Check the GPU and NIC topology
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV8 NV8 NV8 NV8 NV8 NV8 NV8 PIX NODE SYS SYS SYS SYS NODE 0-55,112-167 0 N/A
GPU1 NV8 X NV8 NV8 NV8 NV8 NV8 NV8 NODE NODE SYS SYS SYS SYS NODE 0-55,112-167 0 N/A
GPU2 NV8 NV8 X NV8 NV8 NV8 NV8 NV8 NODE PIX SYS SYS SYS SYS NODE 0-55,112-167 0 N/A
GPU3 NV8 NV8 NV8 X NV8 NV8 NV8 NV8 NODE NODE SYS SYS SYS SYS NODE 0-55,112-167 0 N/A
GPU4 NV8 NV8 NV8 NV8 X NV8 NV8 NV8 SYS SYS PIX NODE NODE NODE SYS 56-111,168-223 1 N/A
GPU5 NV8 NV8 NV8 NV8 NV8 X NV8 NV8 SYS SYS NODE PIX NODE NODE SYS 56-111,168-223 1 N/A
GPU6 NV8 NV8 NV8 NV8 NV8 NV8 X NV8 SYS SYS NODE NODE PIX NODE SYS 56-111,168-223 1 N/A
GPU7 NV8 NV8 NV8 NV8 NV8 NV8 NV8 X SYS SYS NODE NODE NODE PIX SYS 56-111,168-223 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE SYS SYS SYS SYS NODE
NIC1 NODE NODE PIX NODE SYS SYS SYS SYS NODE X SYS SYS SYS SYS NODE
NIC2 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS X NODE NODE NODE SYS
NIC3 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS NODE X NODE NODE SYS
NIC4 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS NODE NODE X NODE SYS
NIC5 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS NODE NODE NODE X SYS
NIC6 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_3
NIC2: mlx5_4
NIC3: mlx5_5
NIC4: mlx5_6
NIC5: mlx5_7
NIC6: mlx5_bond_0
Check the Mellanox NICs
lspci | grep Mellanox
0e:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
1f:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
1f:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
47:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
86:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
af:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
c3:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
d6:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
Check IB device information
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.40.1000
node_guid: a088:c203:001a:3c02
sys_image_guid: a088:c203:001a:3c02
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
ibv_devices
device node GUID
------ ----------------
mlx5_0 a088c203005aa5ac
mlx5_3 a088c203005a87fc
mlx5_4 88e9a4ffff1e414c
mlx5_5 a088c203005a9edc
mlx5_6 88e9a4ffff1c0e4e
mlx5_7 a088c203005a9d9c
mlx5_bond_0 e8ebd30300dff98a
Check the mapping between IB devices and Ethernet interfaces
ibdev2netdev
mlx5_0 port 1 ==> ens108np0 (Up)
mlx5_3 port 1 ==> ens110np0 (Up)
mlx5_4 port 1 ==> ens112np0 (Up)
mlx5_5 port 1 ==> ens113np0 (Up)
mlx5_6 port 1 ==> ens114np0 (Up)
mlx5_7 port 1 ==> ens115np0 (Up)
mlx5_bond_0 port 1 ==> bond0 (Up)
Check the NIC IP address
ifconfig bond0
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
inet 10.24.9.31 netmask 255.255.254.0 broadcast 10.24.9.255
inet6 fe80::41f:e9ff:fe0a:d004 prefixlen 64 scopeid 0x20<link>
ether 06:1f:e9:0a:d0:04 txqueuelen 1000 (Ethernet)
RX packets 691769 bytes 114308829 (114.3 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 361602 bytes 66554875 (66.5 MB)
TX errors 0 dropped 31 overruns 0 carrier 0 collisions 0
Install the bandwidth test tool
apt install perftest
Test mlx5_bond_0 bandwidth
# server side (no peer address given; waits for the client)
ib_write_bw -s 536870912 -F --run_infinitely -x 3 -q 5 --ib-dev mlx5_bond_0
# client side (pass the server's address)
ib_write_bw -s 536870912 -F --run_infinitely -x 3 10.24.9.31 -q 5 --ib-dev mlx5_bond_0
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_bond_0
Number of qps : 5 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
536870912 200 0.00 20480.11 0.000040
Test mlx5_0 bandwidth
# server side
ib_write_bw -s 536870912 -F --run_infinitely -x 3 -q 5 --ib-dev mlx5_0
# client side (pass the server's address)
ib_write_bw -s 536870912 -F --run_infinitely -x 3 10.24.224.31 -q 5 --ib-dev mlx5_0
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 5 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
536870912 500 0.00 51200.34 0.000100
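For reference, 51200 MB/s is roughly 51 GB/s, on the order of the 400 Gbps (speed="400000") that the NCCL topology dump below reports for mlx5_0, and about 2.5× the mlx5_bond_0 result above.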
Generate the NCCL topology dump
export NCCL_DEBUG=INFO
export NCCL_TOPO_DUMP_FILE=./dump-topo.xml
# then run an NCCL test program (the topology file is written during NCCL initialization)
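# for example, an all-reduce from nccl-tests (assumption: nccl-tests has been cloned and built locally)
./nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 8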
cat dump-topo.xml
Output
<system version="1">
<cpu numaid="1" affinity="ffffffff,ffffff00,00000000,0000ffff,ffffffff,ff000000,00000000" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
<pci busid="0000:84:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
<pci busid="0000:86:00.0" class="0x020000" vendor="0x15b3" device="0x101d" subsystem_vendor="0x15b3" subsystem_device="0x0040" link_speed="16.0 GT/s PCIe" link_width="16">
<nic>
<net name="mlx5_4" dev="2" speed="200000" port="1" latency="0.000000" guid="0x4c411effffa4e988" maxconn="131072" gdr="1"/>
</nic>
</pci>
</pci>
<pci busid="0000:ac:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
<pci busid="0000:af:00.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
<nic>
<net name="mlx5_5" dev="3" speed="400000" port="1" latency="0.000000" guid="0xdc9e5a0003c288a0" maxconn="131072" gdr="1"/>
</nic>
</pci>
</pci>
<pci busid="0000:c0:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
<pci busid="0000:c3:00.0" class="0x020000" vendor="0x15b3" device="0x101d" subsystem_vendor="0x15b3" subsystem_device="0x0040" link_speed="16.0 GT/s PCIe" link_width="16">
<nic>
<net name="mlx5_6" dev="4" speed="200000" port="1" latency="0.000000" guid="0x4e0e1cffffa4e988" maxconn="131072" gdr="1"/>
</nic>
</pci>
</pci>
<pci busid="0000:d4:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
<pci busid="0000:d6:00.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
<nic>
<net name="mlx5_7" dev="5" speed="400000" port="1" latency="0.000000" guid="0x9c9d5a0003c288a0" maxconn="131072" gdr="1"/>
</nic>
</pci>
<pci busid="0000:d7:00.0" class="0x030200" vendor="0x10de" device="0x2324" subsystem_vendor="0x10de" subsystem_device="0x17a6" link_speed="32.0 GT/s PCIe" link_width="16">
<gpu dev="7" sm="90" rank="0" gdr="1">
<nvlink target="0000:05:00.0" count="2" tclass="0x068000"/>
<nvlink target="0000:04:00.0" count="2" tclass="0x068000"/>
<nvlink target="0000:03:00.0" count="2" tclass="0x068000"/>
<nvlink target="0000:06:00.0" count="2" tclass="0x068000"/>
</gpu>
</pci>
</pci>
</cpu>
<cpu numaid="0" affinity="00000000,000000ff,ffffffff,ffff0000,00000000,00ffffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
<pci busid="0000:0c:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
<pci busid="0000:0e:00.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
<nic>
<net name="mlx5_0" dev="0" speed="400000" port="1" latency="0.000000" guid="0xaca55a0003c288a0" maxconn="131072" gdr="1"/>
</nic>
</pci>
</pci>
<pci busid="0000:1f:00.0" class="0x020000" vendor="0x15b3" device="0x1015" subsystem_vendor="0x15b3" subsystem_device="0x0069" link_speed="8.0 GT/s PCIe" link_width="8">
<nic>
<net name="mlx5_bond_0" dev="6" speed="25000" port="1" latency="0.000000" guid="0x8af9df0003d3ebe8" maxconn="131072" gdr="1"/>
</nic>
</pci>
<pci busid="0000:45:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
<pci busid="0000:47:00.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
<nic>
<net name="mlx5_3" dev="1" speed="400000" port="1" latency="0.000000" guid="0xfc875a0003c288a0" maxconn="131072" gdr="1"/>
</nic>
</pci>
</pci>
</cpu>
</system>
Field notes:
In networking, speed fields are usually expressed in Mbps, particularly in device configuration and performance reporting. Specifically:
1 Mbps = 1,000,000 bits per second (megabits per second)
1 Gbps = 1,000 Mbps (gigabits per second)
So speed="400000" means the device speed is 400,000 Mbps, i.e. 400 Gbps.
- name="mlx5_7": the network device name or identifier.
- dev="5": the device ID.
- speed="400000": the device speed, 400,000 Mbps (400 Gbps).
- port="1": the port number.
- latency="0.000000": the device latency, typically in milliseconds or microseconds.
- guid="0x9c9d5a0003c288a0": the device's globally unique identifier (GUID).
- maxconn="131072": the maximum number of connections.
- gdr="1": whether GPUDirect RDMA is enabled.
Confirm that the speed field matches the actual device specification; Mellanox NICs commonly support several rates (40 Gbps, 100 Gbps, 400 Gbps, and so on).
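To pull just the NIC names and speeds out of the dump for a quick check, a simple grep over the XML above works (a sketch):
grep -o 'name="mlx5[^"]*"[^/]*speed="[0-9]*"' dump-topo.xml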
Point-to-point bandwidth scan script
tee scan_topo.py <<-'EOF'
import deepspeed
import torch
import torch.distributed as dist
import time
import networkx as nx
import matplotlib.pyplot as plt
import os
import matplotlib
matplotlib.use('Agg')
def initialize_process():
"""
    Initialize the PyTorch distributed process group
"""
deepspeed.init_distributed()
local_rank=int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
def bandwidth_test(src, dst, size=1024*1024*512):
"""
    Run a point-to-point bandwidth test between the src and dst ranks
"""
tensor = torch.ones(size).to(torch.float32).cuda()
rank=dist.get_rank()
    # warm-up transfer
if rank == src:
dist.send(tensor, dst)
elif rank == dst:
dist.recv(tensor, src)
torch.cuda.synchronize()
    # timed test
start = None
if rank in [src,dst]:
start = time.time()
count=6
for _ in range(count):
if rank == src:
dist.send(tensor, dst)
elif rank == dst:
dist.recv(tensor, src)
if rank in [src,dst]:
torch.cuda.synchronize()
end = None
if rank in [src,dst]:
end = time.time()
elapsed_time = None
if rank in [src,dst]:
elapsed_time = end - start
elapsed_time_list = [None for _ in range(dist.get_world_size())]
dist.all_gather_object(elapsed_time_list, elapsed_time)
elapsed_time = max(list(filter(None, elapsed_time_list)))
    bandwidth = count*size*4 / elapsed_time / 1e9  # GB/s (float32 tensors, 4 bytes per element)
return bandwidth
def measure_bandwidth_topology(size):
"""
    Measure the full point-to-point bandwidth matrix
"""
bandwidth_matrix = [[0 for _ in range(size)] for _ in range(size)]
for src in range(size):
for dst in range(size):
if src != dst:
bandwidth = bandwidth_test(src, dst)
bandwidth_matrix[src][dst] = bandwidth
return bandwidth_matrix
def plot_bandwidth_topology(bandwidth_matrix, size):
"""
    Plot the bandwidth topology
"""
G = nx.DiGraph()
for i in range(size):
for j in range(size):
if i != j:
G.add_edge(i, j, weight=bandwidth_matrix[i][j])
    # build a colour list so every node gets a unique colour
colors = plt.cm.tab20(range(size))
    # place the nodes manually and offset the labels slightly so nodes and labels do not overlap
pos = {}
for i in range(size):
        row = i // 8  # row index (0 or 1)
        col = i % 8   # column index (0 to 7)
pos[i] = (col , -row)
    plt.figure(figsize=(20, 10))  # set the figure size
nx.draw(G, pos, with_labels=False, node_size=400, node_color=[colors[i] for i in range(size)], font_weight='bold')
    # add a text block of outgoing-edge labels for each node
for node in range(size):
edges = [(i, j) for i, j in G.edges() if i == node]
label_text = "\n".join([f'{i:02d}->{j:02d}: {bandwidth_matrix[i][j]:5.2f}GB/s' for i, j in edges])
if label_text:
x, y = pos[node]
if node<8:
y+=0.25
else:
y-=0.03
            # show the label just below the node
plt.text(x, y, label_text, fontsize=8, ha='center', va='top',
bbox=dict(facecolor='white', alpha=0, edgecolor='none'),
color=colors[node % len(colors)])
    # add the node labels
for node, (x, y) in pos.items():
plt.text(x, y, str(node), fontsize=10, ha='center',
bbox=dict(facecolor='white', alpha=0, edgecolor='none'),
color='black')
plt.savefig('top.png', dpi=600, transparent=False, bbox_inches='tight')
def main():
"""
    Main entry point: run the distributed bandwidth tests and generate the topology plot
"""
initialize_process()
size = dist.get_world_size()
    # every rank takes part in the bandwidth tests
bandwidth_matrix = measure_bandwidth_topology(size)
    # only the main rank (rank 0) renders the topology plot
if dist.get_rank() == 0:
plot_bandwidth_topology(bandwidth_matrix, size)
    dist.barrier()  # make sure all ranks finish before tearing down
dist.destroy_process_group()
if __name__ == "__main__":
main()
EOF
tee hostfile <<-'EOF'
worker_1 slots=8
worker_2 slots=8
EOF
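Note: the DeepSpeed multi-node launcher dispatches the job over SSH (pdsh by default), so this setup assumes passwordless SSH from the launch node to worker_1 and worker_2.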
export NCCL_SOCKET_IFNAME=bond0 # set the NIC used for communication
export MAX_JOBS=32
export NCCL_DEBUG=ERROR
export NCCL_TOPO_DUMP_FILE=./dump-topo.xml
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_TC=160
export NCCL_IB_TIMEOUT=22
export NCCL_PXN_DISABLE=0
export TCCL_TOPO_AFFINITY=1
export NCCL_P2P_DISABLE=0
export NCCL_SHM_DISABLE=1
export CUDA_LAUNCH_BLOCKING=0
export NCCL_ALGO=Ring
export NCCL_MIN_NRINGS=4
export NCCL_MAX_NRINGS=8
cat > ds_config.json <<-'EOF'
{
"train_batch_size": 1,
"steps_per_print": 1,
"fp16": {
"enabled": true,
"initial_scale_power": 16
}
}
EOF
deepspeed --hostfile ./hostfile --no_local_rank \
--no_python /usr/bin/bash -c 'cd ./ && python3 -u ./scan_topo.py --deepspeed \
--deepspeed_config=ds_config.json --distributed-backend=nccl'
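When the run completes, rank 0 writes the point-to-point bandwidth topology plot to top.png in the working directory (see plt.savefig in scan_topo.py above).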