NVIDIA GPU A100 Ampere Architecture Deep Dive
1. NVIDIA A100 Highlights
NVIDIA A100 SPECS TABLE
1.1 NVIDIA A100 Delivers up to 20x the Performance of Volta
1.2 Five New Features of the NVIDIA A100
- World's largest 7nm chip: 54 billion transistors, HBM2 memory
- 3rd-generation Tensor Cores: faster, more flexible, easier to use, with 20x AI performance using TF32 (a cuBLAS sketch enabling TF32 follows this list)
- New sparsity acceleration: harnesses sparsity in AI models for 2x AI performance
- New Multi-Instance GPU (MIG): right-sized GPU instances for optimal utilization, up to 7 simultaneous instances per GPU
- 3rd-generation NVLink and NVSwitch: efficient scaling to build a "super GPU", with 2x more bandwidth
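As a minimal sketch of how the TF32 path on the 3rd-generation Tensor Cores is typically enabled, the example below opts a standard FP32 GEMM into TF32 via cuBLAS. The matrix size, all-ones data, and omitted error checking are illustrative assumptions, not part of the original article.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Sketch: run an FP32 GEMM through the TF32 Tensor Core path on Ampere.
// Requires CUDA 11+ / cuBLAS 11+; link with -lcublas.
int main() {
    const int N = 1024;  // illustrative size
    std::vector<float> hA(N * N, 1.0f), hB(N * N, 1.0f), hC(N * N, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dB, N * N * sizeof(float));
    cudaMalloc(&dC, N * N * sizeof(float));
    cudaMemcpy(dA, hA.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Opt in to TF32: FP32 inputs/outputs, Tensor Core math with reduced mantissa.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);

    cudaMemcpy(hC.data(), dC, N * N * sizeof(float), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```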
1.3 AI Acceleration: BERT-Large Training and Inference
1.4 A100 HPC Acceleration
HPC application speedups of the A100 GPU relative to the NVIDIA Tesla V100.
HPC application details:
- AMBER based on PME-Cellulose
- GROMACS with STMV (h-bond)
- LAMMPS with Atomic Fluid LJ-2.5
- NAMD with v3.0a1 STMV_NVE
- Chroma with szscl21_24_128
- FUN3D with dpw
- RTM with Isotropic Radius 4 1024^3
- SPECFEM3D with Cartesian four material model
- BerkeleyGW based on Chi Sum
1.5 GA100 Architecture Diagram
The NVIDIA GA100 is composed of multiple GPU Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and HBM2 memory controllers.
The A100 GPU is built on the GA100 architecture. A full GA100 implementation includes the following units:
- 8 GPCs, 8 TPCs/GPC, 2 SMs/TPC, 16 SMs/GPC, 128 SMs per full GPU
- 64 FP32 CUDA Cores/SM, 8192 FP32 CUDA Cores per full GPU
- 4 third-generation Tensor Cores/SM, 512 third-generation Tensor Cores per full GPU
- 6 HBM2 stacks, 12 512-bit memory controllers
The A100 GPU implementation of the GA100 architecture includes the following units (a device-query sketch follows this list):
- 7 GPCs, 7 or 8 TPCs/GPC, 2 SMs/TPC, up to 16 SMs/GPC, 108 SMs
- 64 FP32 CUDA Cores/SM, 6912 FP32 CUDA Cores per GPU
- 4 third-generation Tensor Cores/SM, 432 third-generation Tensor Cores per GPU
- 5 HBM2 stacks, 10 512-bit memory controllers
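As a minimal sketch (not from the original article), the following CUDA runtime query reports the SM count and memory size the installed GPU actually exposes; on an A100-40GB it should show 108 multiprocessors. The fields used are from the standard cudaDeviceProp structure.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: query how many SMs and how much HBM2 the installed GPU exposes.
// On an A100-40GB this should report 108 multiprocessors and roughly 40 GB.
int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No CUDA device found\n");
        return 1;
    }
    std::printf("Device            : %s\n", prop.name);
    std::printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    std::printf("SM count          : %d\n", prop.multiProcessorCount);
    std::printf("Global memory     : %.1f GB\n", prop.totalGlobalMem / 1e9);
    std::printf("Memory bus width  : %d bits\n", prop.memoryBusWidth);
    return 0;
}
```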
Specific design details of the A100 GPU are as follows:
- The GA100 architecture has six HBM2 stacks in total, with two memory controller modules per stack. The shipping A100, however, is configured with 40 GB of memory using only five HBM2 stacks and the corresponding 10 memory controllers (a rough bandwidth estimate follows this list).
- Compared with the V100, the A100's L2 cache is split into two partitions, delivering more than 2x the L2 cache bandwidth of the V100.
- GA100 has 8 GPCs, each GPC contains 8 TPCs (GPC: GPU Processing Cluster; TPC: Texture Processing Cluster), and each TPC contains 2 SMs, so a full GA100 chip has 8 × 8 × 2 = 128 SMs. The released A100 spec lists only 108 SMs, so the shipping A100 is not a full GA100 chip.
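As a rough worked check (an assumption-laden sketch, not a figure from the original article): the 10 active 512-bit memory controllers give a 5120-bit HBM2 interface, and assuming an effective data rate of about 2.43 Gbps per pin, the aggregate bandwidth lands near the commonly quoted A100-40GB figure of roughly 1.6 TB/s:

$$
10 \times 512 = 5120\ \text{bits}, \qquad
\frac{5120\ \text{bits} \times 2.43\ \text{Gbps}}{8\ \text{bits/byte}} \approx 1555\ \text{GB/s}
$$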
1.6 GA100 SM Architecture
The new A100 SM delivers a significant performance boost, building on the features introduced in the Volta and Turing SM architectures and adding many new capabilities and enhancements.
The A100 SM architecture is shown in the figure below. The Volta and Turing SMs each contain eight Tensor Cores