According to public disclosures, as of September 2023 ByteDance had built a cluster of more than 10,000 NVIDIA Ampere-architecture GPUs and was building a Hopper-architecture cluster. The Ampere generation mainly covers the A100 and A800 chips, while the newer Hopper generation mainly covers the H100 and H800.
In the paper published by ByteDance and Peking University, the network topology is described mainly in one section:
Network topology. Our datacenter network is built with high-performance switches based on Broadcom Tomahawk 4 chips. The total bandwidth of each Tomahawk chip is 25.6Tbps with 64×400Gbps ports. Three layers of switches are connected in a CLOS-like topology to connect more than 10,000 GPUs. For switches at each layer, the bandwidth percentage between downlink and uplink is 1:1. That is, 32 ports are used as downlink and 32 ports are used as uplink. The network provides high bandwidth with a small diameter. Every node can communicate with other nodes within a limited number of hops.
Reducing ECMP hashing conflicts. We carefully design the network topology and schedule network traffic to reduce ECMP hashing conflicts. First, at the top-of-rack (ToR) switch level, one 400G downlink port is split into two 200G downlink ports with specific AOC cables. The conflict probability is reduced as the bandwidth of each uplink is double of that of a downlink. Second, eight 200G NICs on the server is connected to eight different switches in a multi-rail way. The number of GPU servers connected by the same sets of ToR switches can reach 64. And we strategically schedule the data-intensive nodes from our training tasks to operate under the same Top of Rack (ToR) switch. This approach significantly reduces the number of switch hops required for communication and further reduce ECMP hashing conflicts probability.
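To make the multi-rail wiring concrete, here is a toy sketch of how the eight NICs of each server might map onto the eight ToR switches of its group, so that same-rail traffic inside a group never leaves the ToR layer. The indices and helper names are illustrative assumptions, not taken from the paper.

```python
# Illustrative multi-rail mapping: NIC j of every server in a group connects
# to ToR switch j of that group.  Values are assumptions for this sketch.

NICS_PER_SERVER = 8
SERVERS_PER_GROUP = 64

def tor_for(server_id: int, nic_id: int) -> tuple[int, int]:
    """Return (group_id, tor_id_within_group) for a given server NIC."""
    group_id = server_id // SERVERS_PER_GROUP
    return group_id, nic_id          # rail j -> ToR j of the server's group

# Two servers in the same group talking over the same rail share one ToR,
# so the flow takes a single switch hop and cannot hit an ECMP hash conflict.
assert tor_for(3, 5) == tor_for(60, 5) == (0, 5)
# The same rail on a server in another group lands on a different ToR,
# so that flow must cross the spine (and possibly core) layer.
assert tor_for(70, 5) == (1, 5)
```

This is why scheduling data-intensive nodes under the same ToR switch cuts both hop count and hashing-conflict probability.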
Congestion control. In distributed training, all-to-all communication may lead to congestion and elevated levels of Priority Flow Control (PFC) [18] when employing the default DCQCN [19] protocol at scale. Excessive use of PFC can result in head-of-line (HoL) blocking [19], thereby diminishing network throughput. To mitigate these issues, we have developed an algorithm incorporating principles from both Swift [20] and DCQCN, which integrates the precise measurement of Round-Trip Time (RTT) with the rapid congestion response capabilities of Explicit Congestion Notification (ECN). This approach significantly enhances throughput and minimizes congestion related to PFC.
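The paper does not publish the algorithm itself; the following is only a toy illustration of the general idea of combining a Swift-style RTT target with a DCQCN-style ECN reaction in a sender's rate update. All constants and the update rule are assumptions made for the sketch.

```python
# Toy rate controller combining an RTT signal (as in Swift) with an ECN
# signal (as in DCQCN).  Illustration only; the actual ByteDance algorithm
# is not described in detail in the paper.

TARGET_RTT_US = 25.0      # assumed target fabric RTT
ADD_STEP = 0.05           # additive-increase fraction when uncongested
MD_FACTOR = 0.8           # base multiplicative decrease on congestion

def update_rate(rate_gbps: float, rtt_us: float, ecn_marked: bool) -> float:
    if ecn_marked or rtt_us > TARGET_RTT_US:
        # Fast response: back off multiplicatively, scaled by how far the
        # measured RTT is above the target.
        overshoot = max(rtt_us / TARGET_RTT_US, 1.0)
        return rate_gbps * max(MD_FACTOR / overshoot, 0.1)
    # No congestion signal: probe for more bandwidth additively.
    return rate_gbps * (1.0 + ADD_STEP)

rate = 200.0                                  # start at line rate (Gbps)
for rtt, ecn in [(20, False), (40, True), (30, False), (22, False)]:
    rate = update_rate(rate, rtt, ecn)
    print(f"rtt={rtt}us ecn={ecn} -> rate={rate:.1f} Gbps")
```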
Retransmit timeout setting. Parameters in NCCL can be set to control retransmit timer and retry count. We tune these parameters for fast recovery under link flapping. To further reduce the recover time, we enable the adap_retrans feature on the NIC. This feature enables retransmission in a shorter interval and help recover the transmission more quickly when the link flapping period is short.
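The paper does not list the exact knobs. As an illustration, NCCL exposes environment variables such as NCCL_IB_TIMEOUT and NCCL_IB_RETRY_CNT that control the RDMA transport's timeout and retry count; the values below are placeholders, not the settings ByteDance reports (the NIC-side adap_retrans feature is enabled separately in NIC firmware, not through NCCL).

```python
# Example of tuning NCCL's RDMA retransmit behaviour before launching a job.
# NCCL_IB_TIMEOUT / NCCL_IB_RETRY_CNT are standard NCCL environment
# variables; the concrete values here are placeholders for illustration.
import os

os.environ["NCCL_IB_TIMEOUT"] = "14"    # IB/RoCE transport timeout exponent
os.environ["NCCL_IB_RETRY_CNT"] = "7"   # number of transport retries

# ...then initialize the communicator, e.g. torch.distributed.init_process_group()
```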
Based on this description, we attempt to reconstruct the topology of the whole cluster:
- First, it is a three-layer CLOS-like network topology;
- The switches are built on Broadcom Tomahawk 4 chips (presumably in-house switches based on this chip);
- Across the three switching layers, the downlink-to-uplink bandwidth ratio at each layer is 1:1 (no oversubscription);
- The access, aggregation, and core layers all use the same type of Ethernet switch;
- Each ToR switch has 32 downlink ports; with AOC (active optical cable) 1-to-2 breakout, it can attach 64 servers, so each group at this third layer contains 64 × 8 = 512 GPUs;
- Each spine switch has 32 downlink ports connecting to all the ToR switches below it (32 downlinks / 8 ToRs per group = 4 groups), so each block at the second layer contains 4 × 512 = 2048 GPUs;
- A core switch has 64 downlink ports, so one pod contains 64 spine switches, i.e. two blocks of 32 spine switches each; a pod therefore contains 2 × 2048 = 4096 GPUs;
- Multiple pods are fully interconnected through the core switches, forming a cluster of well over 10,000 GPUs.
With 4096 GPUs per pod and full interconnection of pods through the core switches, the design can scale out to very large clusters.
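A minimal sketch of this arithmetic in Python, using the port counts and breakout scheme assumed in the reconstruction above (the pod count of 3 is purely illustrative; the paper does not spell out the group/block/pod sizes):

```python
# Back-of-the-envelope reconstruction of the cluster size, using the port
# counts and AOC breakout inferred above (not stated explicitly in the paper).

TOR_DOWNLINK_PORTS = 32      # 400G ports used as downlinks per ToR
AOC_BREAKOUT = 2             # each 400G downlink split into two 200G links
GPUS_PER_SERVER = 8          # one 200G NIC per GPU, multi-railed to 8 ToRs

# 8 ToR switches form one rail group; each ToR attaches 64 server links.
servers_per_group = TOR_DOWNLINK_PORTS * AOC_BREAKOUT     # 64 servers
gpus_per_group = servers_per_group * GPUS_PER_SERVER      # 512 GPUs

# A spine switch has 32 downlinks, reaching 4 such groups -> one block.
groups_per_block = 4
gpus_per_block = groups_per_block * gpus_per_group        # 2048 GPUs

# A core switch has 64 downlinks -> 64 spine switches -> 2 blocks per pod.
blocks_per_pod = 2
gpus_per_pod = blocks_per_pod * gpus_per_block            # 4096 GPUs

pods = 3                                                  # illustrative
print(gpus_per_group, gpus_per_block, gpus_per_pod, pods * gpus_per_pod)
# -> 512 2048 4096 12288  (more than 10,000 GPUs)
```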
Switches
As for the choice of switch, it is based on a Broadcom chip; Broadcom's website lists its main applications.
From that description, this chip can serve both as a ToR switch and as a spine switch, which unifies the switch model across all layers and simplifies the CLOS topology. It is also an Ethernet switch chip with no InfiniBand support, so the switch silicon alone tells us that the cluster's scale-out interconnect is built entirely on Ethernet.
The chip's detailed specifications are listed here:
- Enables the next generation of high-throughput, low-latency hyperscale networks with up to 64 ports of 400GbE switching and routing
- Support for up to 256 ports of 100GbE, enabling low-latency, single-hop networks for massive alternative compute clusters
- Robust connectivity with the industry's highest-performance and longest-reach 50G-PAM4 or 100G-PAM4 SerDes cores, enabling long-reach (LR) East-West optical links and Direct-Attached-Copper (DAC) in-rack cabling in the data center
- The industry's most advanced shared-buffer architecture, offering up to 10X higher incast absorption and providing the highest performance and lowest end-to-end latency for RoCEv2 workloads
- New advanced load balancing mechanisms, virtually eliminating hash polarization and providing extremely efficient, controllable link utilization
- Advanced congestion management, enabling new traffic management paradigms
- Industry-leading instrumentation including IFA 2.0 for in-band telemetry, postcards for out-of-band telemetry, SerDes link quality meters, and visibility into all on-chip packet drops and congestion events
- Four 1 GHz ARM processors for high-bandwidth, fully-programmable streaming telemetry and sophisticated embedded applications such as on-chip statistics summarization
- Implemented with unparalleled power efficiency in a monolithic 7nm die
Node
For reference, consider the internal interconnect diagram of an NVIDIA 4U server node, which hosts 8 GPUs paired 1:1 with 8 IB NICs. ByteDance's paper likewise describes servers with 8 NICs (Ethernet NICs in this case), so at the same 1:1 ratio each server can be assumed to host 8 GPUs (A100/A800); the ratio can of course be adjusted to the actual application. This assumption underpins the calculations above.
Network topology performance metrics
Metrics for evaluating a network topology:
- Bisection Bandwidth
  - Definition: the maximum data rate that can flow between the two halves when the network is split in half.
  - Application: reflects the network's worst-case traffic-handling capacity and is a key measure of network capacity (see the sketch after this list).
- Fabric Latency
  - Definition: the average time for a packet to travel from a source node to a destination node.
  - Application: critical for AI workloads that need fast data access and low-latency responses.
- Network Diameter
  - Definition: the maximum number of hops between any two nodes in the network.
  - Application: affects transfer latency; the smaller the diameter, the more efficient the transfer.
- Path Diversity
  - Definition: the number of available paths from a source node to a destination node.
  - Application: high path diversity improves the network's fault tolerance and load-balancing ability.
- Fault Tolerance
  - Definition: the network's ability to keep operating in the face of node or link failures.
  - Application: high fault tolerance keeps the network running efficiently despite hardware failures or other anomalies.
- Scalability
  - Definition: the ability to grow in nodes and bandwidth without significant performance degradation.
  - Application: key to data-center network design, ensuring the network can expand as AI workloads and data volumes grow.
- Load Balancing Efficiency
  - Definition: the ability to spread traffic evenly across paths and nodes.
  - Application: efficient load balancing avoids overloading any single point and improves overall network performance.
- Energy Efficiency
  - Definition: the energy consumed per unit of data transferred.
  - Application: an energy-efficient network lowers total data-center power consumption, operating cost, and environmental impact.
- Topology Flexibility
  - Definition: the topology's ability to adapt to different application requirements and workloads.
  - Application: a flexible topology can optimize performance for different AI tasks, such as training and inference.
- Congestion Control
  - Definition: the network's ability to prevent and handle congestion under heavy load.
  - Application: effective congestion control keeps the network efficient under high traffic.
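To make two of these metrics concrete, here is a rough Python estimate for the pod reconstructed above; all inputs follow from the earlier assumptions, not from published measurements.

```python
# Rough estimate of bisection bandwidth and diameter for the reconstructed
# three-layer, 1:1 (non-oversubscribed) CLOS pod.  All inputs are the
# assumptions made above, not measured or published figures.

LINK_GBPS = 200            # per-NIC link rate after the 2x AOC breakout
GPUS_PER_POD = 4096        # one pod, as reconstructed above

# With 1:1 bandwidth at every layer, each GPU's full NIC bandwidth is
# available across any bisection of the pod.
bisection_tbps = GPUS_PER_POD // 2 * LINK_GBPS / 1000
print(f"pod bisection bandwidth ~ {bisection_tbps:.0f} Tbps")   # ~410 Tbps

# Network diameter in switch hops: 1 within a ToR, 3 within a block
# (ToR-spine-ToR), and 5 in the worst case across blocks/pods
# (ToR-spine-core-spine-ToR).
print("diameter: 1 / 3 / 5 switch hops (same ToR / same block / cross block)")
```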
References
- Cluster paper: https://arxiv.org/abs/2402.15627
- Huawei switch documentation: https://support.huawei.com/enterprise/zh/doc/EDOC1100023543/96bfc311
- https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56990-series
- https://engineering.fb.com/2014/11/14/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/