The GPUs for the Zion, ZionEX, and now Grand Teton servers all make use of the OCP Accelerator Module (OAM) form factor, created by Facebook and Microsoft in 2019. The prior GPU-accelerated AI machines – which include Big Sur from 2016, Big Basin from 2017, and Big Basin 2 from 2018 – all used PCI-Express GPU accelerators and did not make use of the Nvidia custom SXM sockets with their NVLink networking that Nvidia reserves for its highest performing systems.
ZionEX
A more detailed breakdown of the ZionEX interconnect:
- OAMs (GPU/ASIC) are interconnected through the backplane;
- CPUs: the Cooper Lake Xeon SPs in Zion are gluelessly linked by Intel's UltraPath Interconnect (UPI) in a twisted hypercube topology.
- In a multi-socket system, each CPU is connected to the others either directly over UPI links or indirectly through an intermediate Scalable Memory Interconnect (SMI).
- UPI supports multiple links per socket (typically 2–3), each running at 10.4 GT/s or higher, depending on the CPU generation.
- Each UPI link can dynamically route packets, keeping inter-socket communication efficient.
- The scale-out network relies on the Clear Creek layer: each OAM can communicate with any other OAM in the cluster (including those within the same rack) through its matched RDMA NIC behind the switch.
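The UPI figures above translate to a simple per-link bandwidth estimate. A minimal sketch, assuming the commonly cited Xeon SP parameters (10.4 GT/s transfer rate, 16 data bits per transfer per direction; these numbers are assumptions based on public Intel documentation, not taken from the Zion spec):

```python
# Back-of-the-envelope UPI link bandwidth.
# Assumed parameters (not from the Zion spec): 10.4 GT/s transfer rate
# and 16 data bits per transfer per direction, as commonly cited for
# Skylake/Cascade Lake/Cooper Lake Xeon SPs.

def upi_link_bandwidth_gbs(transfer_rate_gt_s: float = 10.4,
                           data_bits_per_transfer: int = 16) -> float:
    """Per-direction bandwidth of one UPI link in GB/s."""
    return transfer_rate_gt_s * data_bits_per_transfer / 8

per_link = upi_link_bandwidth_gbs()
print(f"{per_link:.1f} GB/s per UPI link per direction")  # 20.8 GB/s
```

With 2–3 such links per socket, the glueless twisted-hypercube fabric gives each CPU roughly 40–60 GB/s of aggregate inter-socket bandwidth per direction.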
Photo of the 8-OAM + Clear Creek layer interconnect:
From bottom to top there are three layers: the AL, CC, and EP layer boxes; the CC and EP layers are interconnected via Whisper cables.
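The traffic paths through this three-layer stack can be sketched as a toy model. This is purely illustrative (the hop names, 1:1 OAM-to-NIC pairing, and `scaleout_path` helper are my assumptions, not from the Zion spec): intra-chassis traffic stays on the OAM backplane, while cross-chassis traffic goes out through the OAM's matched RDMA NIC in the Clear Creek tray.

```python
# Illustrative model of a ZionEX-style scale-out path (hypothetical naming).
# Assumption: each OAM in the EP chassis is paired 1:1 with an RDMA NIC
# in the CC tray, so any OAM can reach any OAM in the cluster.
from dataclasses import dataclass

@dataclass(frozen=True)
class OAM:
    host: str   # chassis identifier
    index: int  # 0..7 within the 8-OAM backplane

def scaleout_path(src: OAM, dst: OAM) -> list:
    """List the hops a message traverses between two OAMs."""
    if src.host == dst.host:
        # Intra-chassis traffic stays on the OAM backplane.
        return [f"{src.host}/oam{src.index}",
                f"{src.host}/backplane",
                f"{dst.host}/oam{dst.index}"]
    # Inter-chassis traffic exits via the matched RDMA NIC in the CC tray.
    return [f"{src.host}/oam{src.index}",
            f"{src.host}/nic{src.index}",
            "fabric",
            f"{dst.host}/nic{dst.index}",
            f"{dst.host}/oam{dst.index}"]

print(scaleout_path(OAM("hostA", 0), OAM("hostB", 5)))
```

The point of the 1:1 OAM/NIC pairing is that GPU traffic never has to cross the CPU sockets to reach the fabric, which is why the Clear Creek tray sits directly between the Angels Landing and Emerald Pools boxes.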
Angels Landing Layer
Clear Creek Layer
Grand Teton
References
- https://www.nextplatform.com/2022/10/20/the-iron-that-will-drive-ai-at-meta-platforms/
- https://www.opencompute.org/documents/facebook-zion-system-spec-1-0-pdf