Preamble
Networking is hard, but "AI" demands it
Networking is hard. I did rather poorly in this subject in college, and in my defense, even at the time of writing there are very few network health providers capable of delivering thorough and accurate checkups, which says a lot about the complexity of these systems when even the experts don't fully understand what they are working with.
Still, as someone interested in deep learning and its acceleration, it's impossible not to look deeper into data communication systems when we are working with >100B-parameter models. More specifically, for large computation workloads we pull the oldest trick in the CS book and cut them into parallel or pipelined pieces, and for large DNNs we usually have EP (expert), DP (data), PP (pipeline), and TP (tensor) parallelism; see
Paradigms of Parallelism | Colossal-AI
and
Expert parallelism - Amazon SageMaker
for briefings. These DNN parallelization schemes generate different synchronization patterns and memory access workloads. For example, EP and DP run largely independent parallel workloads and have perhaps the highest compute-to-communication ratio, so they require the least bandwidth and the least stringent latency; PP and TP, on the other hand, cut the model itself and require frequent, high-speed synchronization to avoid stalling the tensor/vector processors.
(an even better summary of common LLM-deployment parallelisms by NVIDIA: Parallelisms — NVIDIA NeMo Framework User Guide latest documentation)
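To make the difference in synchronization pressure concrete, here is a minimal MPI sketch in C. It is my own illustration rather than anything from the links above; the buffer sizes, layer count, and one-all-reduce-per-layer pattern are simplifying placeholders. The point is only that DP synchronizes gradients once per optimizer step, while TP puts a collective inside every layer, right on the compute units' critical path.

```c
/* Minimal MPI sketch (illustrative only): contrast the communication
 * frequency of data parallelism (one gradient all-reduce per step) with
 * tensor parallelism (a collective inside every layer's forward pass).
 * Sizes and layer counts are made-up placeholders.                      */
#include <mpi.h>
#include <stdlib.h>

#define GRAD_ELEMS (1 << 20)   /* pretend: the whole gradient              */
#define ACT_ELEMS  (1 << 14)   /* pretend: one layer's partial activations */
#define NUM_LAYERS 32

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    float *grads = calloc(GRAD_ELEMS, sizeof(float));
    float *acts  = calloc(ACT_ELEMS,  sizeof(float));

    /* Data parallelism: each rank runs a full forward/backward on its own
     * mini-batch, then synchronizes ONCE per optimizer step.              */
    /* ... local forward/backward would go here ...                        */
    MPI_Allreduce(MPI_IN_PLACE, grads, GRAD_ELEMS, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    /* Tensor parallelism: each rank holds a shard of every weight matrix,
     * so partial results must be combined inside EVERY layer, blocking
     * the compute units until the collective finishes.                    */
    for (int layer = 0; layer < NUM_LAYERS; layer++) {
        /* ... local partial matmul for this layer would go here ...       */
        MPI_Allreduce(MPI_IN_PLACE, acts, ACT_ELEMS, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);
    }

    free(grads);
    free(acts);
    MPI_Finalize();
    return 0;
}
```

This is also why, as noted below, TP tends to stay inside a node on the fastest fabric, while DP and EP can tolerate the slower, wider network.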
Layered Communication and Main Schools
The different parallelism schemes, with their distinct memory access workloads and synchronization requirements, are hard to serve with a single network design, so in practice we use a mixture of systems at different levels of distribution.
Starting from a single ALU (for our purposes it does not matter whether it is an ALU, an accelerator, or a fully functional processor; we can simply lump congestion into network latency, since the key is that we only need to see computation endpoints, memory, and communication pathways), say a GPU SM core, we form layered groups, such as:
SM core (ALU + L0 cache) => GPU (ALU swarm + L1 cache) => GPU SoC ("chip" or further packaged "card"), say Grace + 2×Hopper in the H100 generation (GPU×N + L2 cache + memory) => node and rack (GPU SoCs + NVLink/PCIe, e.g. NVIDIA HGX H100) => data center (LAN) => www
(more about GPU networking: 华尔街见闻 [Wallstreetcn])
Their communication efficiencies are dictated by physical scales and densities; currently, as shown, we have a mish-mash of local networks handling communication at the different levels, reflected in the coexistence of ld/st (load/store) and rd/wr (read/write) semantics.
TP roughly resides at the node level, PP at the rack level, and DP and EP at higher-level endpoints.
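To ground the ld/st side at the node level, here is a small host-side CUDA-runtime sketch in C; it is my own illustration, assuming a box with at least two peer-capable GPUs (device IDs 0 and 1 are placeholders). Once peer access is enabled, one GPU's memory is mapped into the other's address space, and a kernel can reach it with ordinary load/store instructions over NVLink/PCIe.

```c
/* Sketch of node-level "ld/st" (scale-up) communication. Assumes two
 * peer-capable GPUs in the same node; error handling is minimal.        */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
    if (!can_access) {
        fprintf(stderr, "GPU 0 cannot directly address GPU 1's memory\n");
        return 1;
    }

    /* Map GPU 1's memory into GPU 0's address space. From now on, a kernel
     * running on GPU 0 can dereference pointers into GPU 1's memory with
     * plain load/store instructions.                                      */
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(/*peerDevice=*/1, 0);

    float *buf0 = NULL, *buf1 = NULL;
    cudaMalloc((void **)&buf0, 1 << 20);
    cudaSetDevice(1);
    cudaMalloc((void **)&buf1, 1 << 20);

    /* For brevity, issue a peer-to-peer copy from the host instead of a
     * kernel; the data still moves directly between the two GPUs.         */
    cudaMemcpyPeer(buf0, /*dstDevice=*/0, buf1, /*srcDevice=*/1, 1 << 20);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```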
There are attempts to unify communication under a single banner, a single philosophy:
either the "computation" school with the ld/st (system bus) approach, called "scaling up" for its central property of preserving communication speed and low latency by localizing data exchange (though not necessarily UMA);
or the "network" school with the rd/wr (RDMA) approach, called "scaling out" for its central property of device unanimity, or device blindness, which is the core principle of network design; RDMA over an ordinary network can be lossy, and handling out-of-order (OoO) delivery is quite expensive for it.
(more about RDMA, heavy read, be warned: https://mp.weixin.qq.com/mp/appmsgalbum?__biz=MzUxNzQ5MTExNw==&action=getalbum&album_id=3398249338911260673#wechat_redirect)
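And here is what the rd/wr school looks like at the API level: a one-sided RDMA WRITE through libibverbs. This is a minimal sketch assuming an already-connected RC queue pair `qp`, a locally registered memory region `mr`, and a remote address/rkey exchanged out of band; all the setup (device open, protection domain, completion queue, QP state transitions) is omitted, and `rdma_write_once` is my own helper name.

```c
/* Sketch of the rd/wr school: a one-sided RDMA WRITE with libibverbs.
 * Assumes `qp` is an already-connected RC queue pair, `mr` is a registered
 * local buffer, and remote_addr/rkey were exchanged out of band.          */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

int rdma_write_once(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,       /* local protection key from ibv_reg_mr */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE; /* one-sided: remote CPU is not involved */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED; /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;       /* target virtual address     */
    wr.wr.rdma.rkey        = rkey;              /* remote protection key      */

    /* The NIC DMAs data straight between application buffers;
     * no kernel involvement on the data path on either side.              */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

Note the contrast with the ld/st sketch above: nothing here is an address-space mapping; the remote side is just an opaque (address, rkey) pair that any RDMA-capable NIC on the fabric can serve, which is exactly the device blindness that makes scaling out possible.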
The central question here is how to scale communication while preserving DMA rates and synchronization latency.
Here we focus on the features of classic network designs.
Sources and Resources
[1] entry article: https://support.huawei.com/enterprise/zh/doc/EDOC1100203347
[2] IB vs Ethernet: "Infiniband 和 以太网Ethernet 对比" (InfiniBand vs Ethernet comparison; CSDN blog)
[3] network protocol evolution and RDMA: "RDMA这十年的反思1:从协议演进的视角" (A decade of RDMA in retrospect, part 1: a protocol-evolution perspective)
[4] InfiniBand whitepaper: https://network.nvidia.com/pdf/whitepapers/IB_Intro_WP_190.pdf
Protocol Evolution [3]
Basics of RoCE, IB, and TCP Networks and How They Differ [1]
In distributed storage networks, the protocols in use include RDMA over Converged Ethernet (RoCE), InfiniBand (IB), and TCP/IP. RoCE and IB belong to the family of RDMA (Remote Direct Memory Access) technologies. How do they differ from traditional TCP/IP? A detailed comparison follows.
RDMA vs TCP/IP
For I/O-heavy, low-latency applications such as high-performance computing and big-data analytics, the existing TCP/IP software and hardware stack falls short, mainly because traditional TCP/IP communication passes messages through the kernel, which incurs high data-movement and data-copying overhead. RDMA (Remote Direct Memory Access) was created to eliminate this server-side processing latency in network transfers. As shown in Figure 1-1 of [1], RDMA accesses memory data directly through the network interface without involving the operating system kernel, enabling high-throughput, low-latency communication that is especially well suited to large-scale parallel compute clusters.
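To complement the write sketch earlier: the step that actually buys kernel bypass is memory registration. The application pins a buffer and hands the NIC the translation and protection state up front, so later transfers are pure DMA with no per-message syscalls or copies, unlike TCP, where every send crosses into the kernel and is copied into socket buffers. A hedged libibverbs sketch follows, assuming a protection domain `pd` already exists; `register_buffer` is my own helper name.

```c
/* Sketch of RDMA memory registration (illustrative): pin a buffer and give
 * the NIC translation + protection state so it can DMA the memory directly.
 * This one-time, fairly expensive call is what removes the kernel and the
 * copies from the per-message data path.                                   */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len, void **buf_out)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }

    *buf_out = buf;
    return mr;   /* mr->lkey / mr->rkey are then used in work requests */
}
```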
Types of RDMA
There are currently three kinds of RDMA networks: InfiniBand, RoCE (RDMA over Converged Ethernet), and iWARP.
Among them, InfiniBand is a network designed specifically for RDMA: it guarantees reliable transport at the hardware level and is technically advanced, but expensive.
==> this is largely because IB sidesteps the lossy-network problem by introducing credit-based flow control, which makes it less scalable; it has basically gone rogue as far as network protocols go.
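As a toy illustration of that point (my own sketch, not real IB code): with credit-based, link-level flow control the sender may only transmit while it holds buffer credits advertised by the receiver, so the link never drops a packet for lack of buffer space and back-pressure propagates hop by hop; the price is per-hop, per-lane state, which is part of why it scales poorly.

```c
/* Toy model of credit-based, link-level flow control (illustrative only).
 * The sender only transmits while it holds credits advertised by the
 * receiver, so the link stalls instead of dropping packets.               */
#include <stdio.h>

#define RX_BUFFERS 8            /* receiver buffer slots on this lane */

static int credits = RX_BUFFERS;   /* credits currently held by the sender */

static int try_send_packet(int pkt_id)
{
    if (credits == 0)           /* no credit: stall, never drop */
        return 0;
    credits--;                  /* consume one receiver buffer slot */
    printf("sent packet %d (credits left: %d)\n", pkt_id, credits);
    return 1;
}

static void receiver_frees_buffer(void)
{
    credits++;                  /* receiver drained a buffer, returns a credit */
}

int main(void)
{
    int pkt = 0;
    for (int step = 0; step < 20; step++) {
        if (try_send_packet(pkt))
            pkt++;
        else
            receiver_frees_buffer();   /* back-pressure until a credit returns */
    }
    return 0;
}
```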
RoCE and iWARP, by contrast, are Ethernet-based RDMA technologies, which lets high-speed, ultra-low-latency, very-low-CPU-usage RDMA be deployed on Ethernet, the most widely used network today.
As shown in Figure 1-2 of [1], the RoCE protocol comes in two versions, RoCEv1 and RoCEv2. RoCEv1 is an RDMA protocol implemented over the Ethernet link layer (switches must support flow-control techniques such as PFC to guarantee reliable transport at the physical layer), while RoCEv2 is implemented over the UDP layer of the Ethernet TCP/IP stack; introducing IP solves the scalability problem.
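A rough sketch of that encapsulation difference (my own summary of the well-known identifiers, not taken from [1]): RoCEv1 rides directly in an Ethernet frame under its own EtherType and therefore cannot cross an IP router, while RoCEv2 tunnels the same InfiniBand transport headers over UDP/IP and therefore can. The structs below are conceptual, not packed wire formats.

```c
/* Conceptual view of RoCEv1 vs RoCEv2 encapsulation (not wire structs). */
#include <stdint.h>

#define ROCEV1_ETHERTYPE 0x8915   /* RoCEv1: L2-only, identified by EtherType */
#define ROCEV2_UDP_DPORT 4791     /* RoCEv2: identified by UDP destination port */

struct rocev1_frame {             /* Ethernet | IB GRH | IB BTH | payload */
    uint16_t ethertype;           /* = ROCEV1_ETHERTYPE; no IP header, not routable */
    /* ... InfiniBand Global Route Header + Base Transport Header ... */
};

struct rocev2_packet {            /* Ethernet | IP | UDP | IB BTH | payload */
    uint8_t  ip_version;          /* IPv4 or IPv6: routable across subnets */
    uint16_t udp_dport;           /* = ROCEV2_UDP_DPORT                    */
    /* ... InfiniBand Base Transport Header rides on top of UDP ... */
};
```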
| | InfiniBand | iWARP | RoCE |
|---|---|---|---|
| Performance | Best | Somewhat worse (limited by TCP) | Comparable to InfiniBand |
| Cost | High | Medium | Low |
| Stability | Good | Poor | Fairly good |
| Switch | IB switch | Ethernet switch | Ethernet switch |
As Table 1-1 shows, the characteristics of the three RDMA networks can be summarized as follows:
- InfiniBand: designed for RDMA from the start; guarantees reliable transport at the hardware level and offers higher bandwidth and lower latency, but it is costly and requires IB-capable NICs and switches.
- RoCE: RDMA over Ethernet; consumes fewer resources than iWARP and supports more features. Ordinary Ethernet switches can be used, but RoCE-capable NICs are required.
- iWARP: a TCP-based RDMA network that relies on TCP for reliable transport. Compared with RoCE, in large deployments iWARP's many TCP connections consume a large amount of memory and place higher demands on system specifications. Ordinary Ethernet switches can be used, but iWARP-capable NICs are required.
Network protocols commonly used in distributed storage
- IB: commonly used for the storage front-end network in DPC scenarios.
- RoCE: commonly used for the storage back-end network.
- TCP/IP: commonly used for the service (business) network.