A Domain Specific Supercomputer for Training Deep Neural Networks
Inference vs. Training
Both share some computational elements including matrix multiplications, convolutions, and activation functions, so inference and training DSAs might have similar functional units. Key architectural aspects where the requirements differ include:
- Harder Parallelization:
- Training must produce a single consistent set of weights (updates are shared across the system), while inference tasks are largely independent of one another
- More Computation (both in types and quantity)
- backpropagation
- More Memory
- weight updates require intermediate results from both forward and backward propagation, which can cost up to 10x the space of storing the weights alone
- quantization from float to int8 is no longer a guaranteed shortcut, since training is more sensitive to numeric precision
- More Adaptability
- while inference methods are relatively stable, new learning/training algorithms are rapidly evolving, so the hardware design must remain adaptable.
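The "More Memory" point above can be made concrete with a back-of-the-envelope count: backprop must keep every layer's forward activations alive until the backward pass, while inference only needs the weights plus one transient activation buffer. A minimal sketch, with illustrative layer sizes and batch size (not figures from the paper):

```python
# Sketch: why training needs more memory than inference.
# Backprop keeps all forward activations for the backward pass;
# inference keeps only weights. Sizes below are assumptions.

def training_vs_inference_memory(layer_sizes, batch_size):
    # Weights: one matrix per pair of adjacent layers.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # Activations saved for backprop: one vector per layer, per example.
    activations = batch_size * sum(layer_sizes)
    return weights, activations

weights, acts = training_vs_inference_memory(
    layer_sizes=[784, 1024, 1024, 10], batch_size=1024)
print(weights)          # parameters stored for inference
print(acts)             # extra values held for the backward pass
print(acts / weights)   # training overhead relative to weights alone
```

With larger batches or deeper networks the activation term grows while the weight term stays fixed, which is how the overhead can approach the 10x cited above.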
Perfect Scaling
Fortunately for TPUs, recent results show that batch sizes of 256–8,192 scale nearly perfectly without losing accuracy, which makes large MXUs an attractive option for high performance
TPUv2 Design
2 tensor cores per chip
==> a balance between wire latency and parallel-programming difficulty
====> Global wires on a chip don’t scale with shrinking feature size, so their relative delay increases ==> hence we pack multiple smaller cores on a single chip (but not too many, which would make parallel programming harder)
see The Incredible Shrinking CPU
scaling of interconnection https://web.stanford.edu/class/ee311/NOTES/Interconnect%20Scaling.pdf
Figure 3. TPUv2 chip floor plan
Six Major Components
ICI
HBM
Core Sequencer
- use VLIW
- work together with XLA compiler
- manages instructions via Imem and Smem (scalar memory); there is no instruction cache
Vector Processing Unit
1. The VPU streams data to and from the MXU through decoupling FIFOs.
2. The VPU collects and distributes data to Vmem via data-level parallelism (2D matrix and vector functional units) and instruction-level parallelism (8 operations per instruction).
==> the VPU is the major component that incorporates adaptability into the design, due to the relatively "generalized" nature of vector ops in AI algorithms
====> still, even though this module can adapt to new demands, the resulting mapping is not guaranteed to be optimal ==> but we may only need to change this one module to reach the desired results
Matrix Multiplication Unit
[Key Design Reasoning]
Utilization of four 128x128 MXUs is 37%–48%, about 1.6x that of a single 256x256 MXU (22%–30%), yet the four take about the same die area. The reason is that some convolutions are naturally smaller than 256x256, so sections of a large MXU would sit idle. Sixteen 64x64 MXUs would have a little higher utilization (38%–52%) but would need more area. The reason: MXU area is determined either by the logic for the multipliers or by the wires on its perimeter for the inputs, outputs, and control. In this technology, for 128x128 and larger the MXU’s area is limited by the multipliers, but for 64x64 and smaller it is limited by the I/O and control wires.
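The area argument above reduces to two scaling laws: an NxN systolic array has N*N multipliers (area ~ N^2) but only ~4N perimeter wires for I/O and control (area ~ N). A minimal sketch; the cost constants are made-up values chosen so the crossover lands between 64x64 and 128x128, matching the paper's qualitative claim, not measured figures:

```python
# Sketch of the MXU sizing argument: multiplier area grows as N^2,
# perimeter-wire area as N, so small arrays are wire-limited and
# large arrays are multiplier-limited. Constants are illustrative
# assumptions (crossover at N = 100), not real process data.

C_MULT = 1.0   # assumed area per multiplier cell
C_WIRE = 25.0  # assumed area per perimeter wire track

def mxu_area(n):
    multiplier_area = C_MULT * n * n
    wire_area = C_WIRE * 4 * n
    return max(multiplier_area, wire_area)  # whichever dominates

# Three ways to provide 65,536 multipliers in total:
for name, (count, n) in {"1 x 256x256": (1, 256),
                         "4 x 128x128": (4, 128),
                         "16 x 64x64": (16, 64)}.items():
    print(name, count * mxu_area(n))
```

Under these assumptions, four 128x128 arrays cost the same area as one 256x256 (both multiplier-limited), while sixteen 64x64 arrays cost noticeably more because each is wire-limited, mirroring the trade-off described above.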
Transpose Reduction Permute Unit
performs 128x128 matrix transposes, reductions, and permutations of the VPU lanes
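The three operation classes this unit accelerates can be shown with numpy stand-ins on a small 4x4 tile (the hardware works on 128x128; this is not the TPU ISA):

```python
import numpy as np

# Illustrative stand-ins for the transpose/reduction/permute unit,
# on a 4x4 tile instead of the hardware's 128x128.
x = np.arange(16, dtype=np.float32).reshape(4, 4)

transposed = x.T               # matrix transpose
row_sums = x.sum(axis=1)       # reduction across a lane
permuted = x[:, [2, 0, 3, 1]]  # permutation of the lanes
```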
TPUv3
TPUv3 is in the same technology generation as v2, but is a strict, direct upgrade based on experience developing v2.
TPUv3 has ≈1.35x the clock rate, ICI bandwidth, and memory bandwidth, plus twice the number of MXUs, so peak performance rises 2.7x. Liquid cooling lets the chip run at 1.6x more power. The TPUv3 supercomputer also expanded to 1024 chips (see Figure 4). Table 3 lists key features of the three TPU generations along with a contemporary GPU (NVIDIA Volta).
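The 2.7x figure above is just the product of the two independent scaling factors:

```python
# Arithmetic behind the TPUv3 peak-performance claim:
# ~1.35x clock rate times 2x MXUs per core.
clock_scale = 1.35
mxu_scale = 2
peak_scale = clock_scale * mxu_scale
print(peak_scale)  # 2.7
```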
DSA Arithmetic: brain floating format bf16
Key Observations:
1. per Table 3, the peak FLOPS of fp16 is about 8x that of fp32
2. out of the 32 bits of fp32, the full 8-bit exponent is needed to avoid overflow, compared to fp16's 5 exponent bits
===> the 23-bit mantissa can be reduced to 7 bits without accuracy loss, giving the 16-bit bf16 format
hence
==> recall how the float type is laid out in memory (see the CSDN post "float类型在内存中的存储方式"); bf16 is no less sensitive to small updates, since the exponent range is the same as fp32's;
loss of precision happens only when the mantissa is truncated, and that does not affect model accuracy
see more at: A Study of BFLOAT16 for Deep Learning Training
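Because bf16 keeps fp32's sign and exponent bits and just drops the low mantissa bits, converting fp32 to bf16 is simply taking the top 16 bits. A minimal sketch using plain truncation (real hardware typically rounds to nearest):

```python
import struct

# Sketch of the fp32 <-> bf16 relationship: bf16 is the top 16 bits
# of an IEEE fp32 value (1 sign + 8 exponent + 7 mantissa bits).
# Truncation (not round-to-nearest) keeps the sketch minimal.

def fp32_to_bf16_bits(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 16  # keep sign, full 8-bit exponent, 7 mantissa bits

def bf16_bits_to_fp32(b):
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

x = 3.14159
y = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
print(y)  # 3.140625: only mantissa precision is lost
```

The exponent range is untouched, so tiny gradient updates that would underflow in fp16's 5-bit exponent still survive in bf16, which is the point made above.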
DSA Dev. Env.: XLA Compiler
a typical example of software-hardware co-development
==> XLA manages all memory transfers directly; there are no caches in the architecture
====> the compiler is then responsible for operator fusion/composition as well, bypassing reads/writes of intermediate results
==> XLA exploits the huge parallelism that an input TF dataflow graph represents. Beyond the parallelism of operations (“ops”) in a graph, each op can comprise millions of multiplications and additions on data tensors of millions of elements. XLA maps this abundant parallelism across hundreds of chips in a supercomputer, a few cores per chip, multiple units per core, and thousands of multipliers and adders inside each functional unit.
the compiler is crucial to the applicability of the supercomputer architecture
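The fusion idea above can be sketched in pure Python: unfused, each op writes its full intermediate result to memory and the next op re-reads it; fused, one pass keeps intermediates in registers. This is a stand-in for what XLA emits, not actual compiler output:

```python
# Sketch of operator fusion: multiply, add-bias, and ReLU done as
# three separate passes (with intermediate buffers) vs. one fused
# pass with no intermediates written to memory.

def unfused(xs, w, b):
    t1 = [x * w for x in xs]          # intermediate buffer written out
    t2 = [t + b for t in t1]          # second pass re-reads it
    return [max(t, 0.0) for t in t2]  # third pass (ReLU)

def fused(xs, w, b):
    # multiply-add-ReLU in one loop: no intermediate buffers
    return [max(x * w + b, 0.0) for x in xs]

xs = [-1.0, 0.5, 2.0]
print(unfused(xs, 2.0, 1.0))  # [0.0, 2.0, 5.0]
print(fused(xs, 2.0, 1.0))    # same result, one memory pass
```

In a cache-less design like the TPU's, this fusion is not just an optimization but the mechanism that keeps Vmem traffic manageable.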
TPU vs. GPU
a GPU is, after all, a DSA aimed at general array processing, not at weight-update iterations
TPUv2/3 vs. v1 on Inference
so long as the model uses a large batch size during inference, v2/3 will perform better than v1 ==> lower latency and higher throughput
==> note that v2/3 do not support int8 ops, which costs more power but saves the quantization step
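The quantization step that v2/3 skip looks roughly like this: mapping fp32 weights to int8 codes with a per-tensor scale. A minimal symmetric scheme for illustration, not Google's exact recipe:

```python
# Sketch of post-training int8 quantization (the step TPUv1
# required and TPUv2/3 avoid by computing in float directly).
# Symmetric per-tensor scaling; illustrative, not the real flow.

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0  # map max magnitude to 127
    q = [round(v / scale) for v in values]       # int8 codes
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02]
q, s = quantize_int8(w)
print(q)                 # [50, -127, 2]
print(dequantize(q, s))  # approximate reconstruction of w
```

Skipping this step removes a source of accuracy tuning and deployment friction, which is the trade-off the note above is pointing at.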