According to public disclosures, as of September 2023 ByteDance had built a cluster of more than 10,000 NVIDIA Ampere-architecture GPUs and was building a Hopper-architecture cluster. The Ampere generation mainly covers the A100 and A800 chips, while the newer Hopper generation mainly covers the H100 and H800.
In the paper published by ByteDance and Peking University, the network topology is described mainly in one section:
Network topology. Our datacenter network is built with high-performance switches based on Broadcom Tomahawk 4 chips. The total bandwidth of each Tomahawk chip is 25.6Tbps with 64×400Gbps ports. Three layers of switches are connected in a CLOS-like topology to connect more than 10,000 GPUs. For switches at each layer, the bandwidth percentage between downlink and uplink is 1:1. That is, 32 ports are used as downlink and 32 ports are used as uplink. The network provides high bandwidth with a small diameter. Every node can communicate with other nodes within a limited number of hops.
Reducing ECMP hashing conflicts. We carefully design the network topology and schedule network traffic to reduce ECMP hashing conflicts. First, at the top-of-rack (ToR) switch level, one 400G downlink port is split into two 200G downlink ports with specific AOC cables. The conflict probability is reduced as the bandwidth of each uplink is double of that of a downlink. Second, eight 200G NICs on the server is connected to eight different switches in a multi-rail way. The number of GPU servers connected by the same sets of ToR switches can reach 64. And we strategically schedule the data-intensive nodes from our training tasks to operate under the same Top of Rack (ToR) switch. This approach significantly reduces the number of switch hops required for communication and further reduce ECMP hashing conflicts probability.
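To make the multi-rail wiring concrete, here is a toy sketch of how the eight NICs of each server might map onto the eight ToR switches of its group, so that same-rail traffic inside a group never leaves the ToR layer. The indices and helper names are illustrative assumptions, not taken from the paper.

```python
# Illustrative multi-rail mapping: NIC j of every server in a group connects
# to ToR switch j of that group.  Values are assumptions for this sketch.

NICS_PER_SERVER = 8
SERVERS_PER_GROUP = 64

def tor_for(server_id: int, nic_id: int) -> tuple[int, int]:
    """Return (group_id, tor_id_within_group) for a given server NIC."""
    group_id = server_id // SERVERS_PER_GROUP
    return group_id, nic_id          # rail j -> ToR j of the server's group

# Two servers in the same group talking over the same rail share one ToR,
# so the flow takes a single switch hop and cannot hit an ECMP hash conflict.
assert tor_for(3, 5) == tor_for(60, 5) == (0, 5)
# The same rail on a server in another group lands on a different ToR,
# so that flow must cross the spine (and possibly core) layer.
assert tor_for(70, 5) == (1, 5)
```

This is why scheduling data-intensive nodes under the same ToR switch cuts both hop count and hashing-conflict probability.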
Congestion control. In distributed training, all-to-all communication may lead to congestion and elevated levels of Priority Flow Control (PFC) [18] when employing the default DCQCN [19] protocol at scale. Excessive use of PFC can result in head-of-line (HoL) blocking [19], thereby diminishing network throughput. To mitigate these issues, we have developed an algorithm incorporating principles from both Swift [20] and DCQCN, which integrates the precise measurement of Round-Trip Time (RTT) with the rapid congestion response capabilities of Explicit Congestion Notification (ECN). This approach significantly enhances throughput and minimizes congestion related to PFC.
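The paper does not publish the algorithm itself; the following is only a toy illustration of the general idea of combining a Swift-style RTT target with a DCQCN-style ECN reaction in a sender's rate update. All constants and the update rule are assumptions made for the sketch.

```python
# Toy rate controller combining an RTT signal (as in Swift) with an ECN
# signal (as in DCQCN).  Illustration only; the actual ByteDance algorithm
# is not described in detail in the paper.

TARGET_RTT_US = 25.0      # assumed target fabric RTT
ADD_STEP = 0.05           # additive-increase fraction when uncongested
MD_FACTOR = 0.8           # base multiplicative decrease on congestion

def update_rate(rate_gbps: float, rtt_us: float, ecn_marked: bool) -> float:
    if ecn_marked or rtt_us > TARGET_RTT_US:
        # Fast response: back off multiplicatively, scaled by how far the
        # measured RTT is above the target.
        overshoot = max(rtt_us / TARGET_RTT_US, 1.0)
        return rate_gbps * max(MD_FACTOR / overshoot, 0.1)
    # No congestion signal: probe for more bandwidth additively.
    return rate_gbps * (1.0 + ADD_STEP)

rate = 200.0                                  # start at line rate (Gbps)
for rtt, ecn in [(20, False), (40, True), (30, False), (22, False)]:
    rate = update_rate(rate, rtt, ecn)
    print(f"rtt={rtt}us ecn={ecn} -> rate={rate:.1f} Gbps")
```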
Retransmit timeout setting. Parameters in NCCL can be set to control retransmit timer and retry count. We tune these parameters for fast recovery under link flapping. To further reduce the recover time, we enable the adap_retrans feature on the NIC. This feature enables retransmission in a shorter interval and help recover the transmission more quickly when the link flapping period is short.
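The paper does not list the exact knobs. As an illustration, NCCL exposes environment variables such as NCCL_IB_TIMEOUT and NCCL_IB_RETRY_CNT that control the RDMA transport's timeout and retry count; the values below are placeholders, not the settings ByteDance reports (the NIC-side adap_retrans feature is enabled separately in NIC firmware, not through NCCL).

```python
# Example of tuning NCCL's RDMA retransmit behaviour before launching a job.
# NCCL_IB_TIMEOUT / NCCL_IB_RETRY_CNT are standard NCCL environment
# variables; the concrete values here are placeholders for illustration.
import os

os.environ["NCCL_IB_TIMEOUT"] = "14"    # IB/RoCE transport timeout exponent
os.environ["NCCL_IB_RETRY_CNT"] = "7"   # number of transport retries

# ...then initialize the communicator, e.g. torch.distributed.init_process_group()
```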
Based on this description, we attempt to reconstruct the topology of the whole cluster:
- First, it is a three-layer CLOS-like network topology;
- The switches are built on Broadcom Tomahawk 4 chips (presumably in-house switches based on this chip);
- Across the three switching layers, the downlink-to-uplink bandwidth ratio at each layer is 1:1 (no oversubscription);
- The access, aggregation, and core layers all use the same type of Ethernet switch;
- Each ToR switch has 32 downlink ports; with AOC (active optical cable) 1-to-2 breakout, it can attach 64 servers, so each group at this third layer contains 64 × 8 = 512 GPUs;
- Each spine switch has 32 downlink ports connecting to all the ToR switches below it (32 downlinks / 8 ToRs per group = 4 groups), so each block at the second layer contains 4 × 512 = 2048 GPUs;
- A core switch has 64 downlink ports, so one pod contains 64 spine switches, i.e. two blocks of 32 spine switches each; a pod therefore contains 2 × 2048 = 4096 GPUs;
- Multiple pods are fully interconnected through the core switches, forming a cluster of well over 10,000 GPUs.
With 4096 GPUs per pod and full interconnection of pods through the core switches, the design can scale out to very large clusters.
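A minimal sketch of this arithmetic in Python, using the port counts and breakout scheme assumed in the reconstruction above (the pod count of 3 is purely illustrative; the paper does not spell out the group/block/pod sizes):

```python
# Back-of-the-envelope reconstruction of the cluster size, using the port
# counts and AOC breakout inferred above (not stated explicitly in the paper).

TOR_DOWNLINK_PORTS = 32      # 400G ports used as downlinks per ToR
AOC_BREAKOUT = 2             # each 400G downlink split into two 200G links
GPUS_PER_SERVER = 8          # one 200G NIC per GPU, multi-railed to 8 ToRs

# 8 ToR switches form one rail group; each ToR attaches 64 server links.
servers_per_group = TOR_DOWNLINK_PORTS * AOC_BREAKOUT     # 64 servers
gpus_per_group = servers_per_group * GPUS_PER_SERVER      # 512 GPUs

# A spine switch has 32 downlinks, reaching 4 such groups -> one block.
groups_per_block = 4
gpus_per_block = groups_per_block * gpus_per_group        # 2048 GPUs

# A core switch has 64 downlinks -> 64 spine switches -> 2 blocks per pod.
blocks_per_pod = 2
gpus_per_pod = blocks_per_pod * gpus_per_block            # 4096 GPUs

pods = 3                                                  # illustrative
print(gpus_per_group, gpus_per_block, gpus_per_pod, pods * gpus_per_pod)
# -> 512 2048 4096 12288  (more than 10,000 GPUs)
```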
Switches
As for the choice of switch, it is based on a Broadcom chip; Broadcom's website lists its main applications.
From that description, this chip can serve both as a ToR switch and as a spine switch, which unifies the switch model across all layers and simplifies the CLOS topology. It is also an Ethernet switch chip with no InfiniBand support, so the switch silicon alone tells us that the cluster's scale-out interconnect is built entirely on Ethernet.
The chip's detailed specifications are listed here:
- Enables the next generation of high-throughput, low-latency hyperscale networks with up to 64 ports of 400GbE switching and routing
- Support for up to 256 ports of 100GbE, enabling low-latency, single-hop networks for massive alternative compute clusters
- Robust connectivity with the industry's highest-performance and longest-reach 50G-PAM4 or 100G-PAM4 SerDes cores, enabling long-reach (LR) East-West optical links and Direct-Attached-Copper (DAC) in-rack cabling in the data center
- The industry's most advanced shared-buffer architecture, offering up to 10X higher incast absorption and providing the highest performance and lowest end-to-end latency for RoCEv2 workloads
- New advanced load balancing mechanisms, virtually eliminating hash polarization and providing extremely efficient, controllable link utilization
- Advanced congestion management, enabling new traffic management paradigms
- Industry-leading instrumentation including IFA 2.0 for in-band telemetry, postcards for out-of-band telemetry, SerDes link quality meters, and visibility into all on-chip packet drops and congestion events
- Four 1 GHz ARM processors for high-bandwidth, fully-programmable streaming telemetry and sophisticated embedded applications such as on-chip statistics summarization
- Implemented with unparalleled power efficiency in a monolithic 7nm die
Node
For reference, consider the internal interconnect diagram of an NVIDIA 4U server node, which hosts 8 GPUs paired 1:1 with 8 IB NICs. ByteDance's paper likewise describes servers with 8 NICs (Ethernet NICs in this case), so at the same 1:1 ratio each server can be assumed to host 8 GPUs (A100/A800); the ratio can of course be adjusted to the actual application. This assumption underpins the calculations above.
Network topology performance metrics
Metrics for evaluating a network topology:
- Bisection Bandwidth
  - Definition: the maximum data rate that can flow between the two halves when the network is split in half.
  - Application: reflects the network's worst-case traffic-handling capacity and is a key measure of network capacity (see the sketch after this list).
- Fabric Latency
  - Definition: the average time for a packet to travel from a source node to a destination node.
  - Application: critical for AI workloads that need fast data access and low-latency responses.
- Network Diameter
  - Definition: the maximum number of hops between any two nodes in the network.
  - Application: affects transfer latency; the smaller the diameter, the more efficient the transfer.
- Path Diversity
  - Definition: the number of available paths from a source node to a destination node.
  - Application: high path diversity improves the network's fault tolerance and load-balancing ability.
- Fault Tolerance
  - Definition: the network's ability to keep operating in the face of node or link failures.
  - Application: high fault tolerance keeps the network running efficiently despite hardware failures or other anomalies.
- Scalability
  - Definition: the ability to grow in nodes and bandwidth without significant performance degradation.
  - Application: key to data-center network design, ensuring the network can expand as AI workloads and data volumes grow.
- Load Balancing Efficiency
  - Definition: the ability to spread traffic evenly across paths and nodes.
  - Application: efficient load balancing avoids overloading any single point and improves overall network performance.
- Energy Efficiency
  - Definition: the energy consumed per unit of data transferred.
  - Application: an energy-efficient network lowers total data-center power consumption, operating cost, and environmental impact.
- Topology Flexibility
  - Definition: the topology's ability to adapt to different application requirements and workloads.
  - Application: a flexible topology can optimize performance for different AI tasks, such as training and inference.
- Congestion Control
  - Definition: the network's ability to prevent and handle congestion under heavy load.
  - Application: effective congestion control keeps the network efficient under high traffic.
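To make two of these metrics concrete, here is a rough Python estimate for the pod reconstructed above; all inputs follow from the earlier assumptions, not from published measurements.

```python
# Rough estimate of bisection bandwidth and diameter for the reconstructed
# three-layer, 1:1 (non-oversubscribed) CLOS pod.  All inputs are the
# assumptions made above, not measured or published figures.

LINK_GBPS = 200            # per-NIC link rate after the 2x AOC breakout
GPUS_PER_POD = 4096        # one pod, as reconstructed above

# With 1:1 bandwidth at every layer, each GPU's full NIC bandwidth is
# available across any bisection of the pod.
bisection_tbps = GPUS_PER_POD // 2 * LINK_GBPS / 1000
print(f"pod bisection bandwidth ~ {bisection_tbps:.0f} Tbps")   # ~410 Tbps

# Network diameter in switch hops: 1 within a ToR, 3 within a block
# (ToR-spine-ToR), and 5 in the worst case across blocks/pods
# (ToR-spine-core-spine-ToR).
print("diameter: 1 / 3 / 5 switch hops (same ToR / same block / cross block)")
```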
References
- Cluster paper: https://arxiv.org/abs/2402.15627
- Huawei switch documentation: https://support.huawei.com/enterprise/zh/doc/EDOC1100023543/96bfc311
- https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56990-series
- https://engineering.fb.com/2014/11/14/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/