A Domain Specific Supercomputer for Training Deep Neural Networks
Inference vs. Training
Both share some computational elements including matrix multiplications, convolutions, and activation functions, so inference and training DSAs might have similar functional units. Key architectural aspects where the requirements differ include:
- Harder Parallelization:
- Training must produce a single consistent set of weights (updates are shared across the system), while inference tasks are largely independent of one another
- More Computation (both in types and quantity)
- backpropagation
- More Memory
- weight updates require intermediate results from both forward and backward propagation, which can cost up to 10x the space of storing the weights alone
- quantization from float to int8 is no longer a guaranteed shortcut, since training is more sensitive to numeric precision
- More Adaptability
- while inference methods are relatively stable, new learning/training algorithms are rapidly evolving, so the hardware design must remain adaptable.
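The "More Memory" point above can be made concrete with a back-of-the-envelope count: backprop must keep every layer's forward activations alive until the backward pass, while inference only needs the weights plus one transient activation buffer. A minimal sketch, with illustrative layer sizes and batch size (not figures from the paper):

```python
# Sketch: why training needs more memory than inference.
# Backprop keeps all forward activations for the backward pass;
# inference keeps only weights. Sizes below are assumptions.

def training_vs_inference_memory(layer_sizes, batch_size):
    # Weights: one matrix per pair of adjacent layers.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # Activations saved for backprop: one vector per layer, per example.
    activations = batch_size * sum(layer_sizes)
    return weights, activations

weights, acts = training_vs_inference_memory(
    layer_sizes=[784, 1024, 1024, 10], batch_size=1024)
print(weights)          # parameters stored for inference
print(acts)             # extra values held for the backward pass
print(acts / weights)   # training overhead relative to weights alone
```

With larger batches or deeper networks the activation term grows while the weight term stays fixed, which is how the overhead can approach the 10x cited above.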
Perfect Scaling
Fortunately for TPUs, recent results show that batch sizes of 256–8,192 scale nearly perfectly without losing accuracy, which makes large MXUs an attractive option for high performance
TPUv2 Design
2 tensor cores per chip
==> a balance between wire latency and parallel-programming difficulty
====> Global wires on a chip don’t scale with shrinking feature size, so their relative delay increases ==> hence we pack multiple smaller cores on a single chip (but not too many, which would make parallel programming harder)
see The Incredible Shrinking CPU
scaling of interconnection https://web.stanford.edu/class/ee311/NOTES/Interconnect%20Scaling.pdf
Figure 3. TPUv2 chip floor plan
Six Major Components
ICI
HBM
Core Sequencer
- use VLIW
- work together with XLA compiler
- manages instructions via Imem and Smem (scalar memory); there is no instruction cache
Vector Processing Unit
1. The VPU streams data to and from the MXU through decoupling FIFOs.
2. The VPU collects and distributes data to Vmem via data-level parallelism (2D matrix and vector functional units) and instruction-level parallelism (8 operations per instruction).
==> the VPU is the major component that incorporates adaptability into the design, due to the relatively "generalized" nature of vector ops in AI algorithms
====> still, even though this module can adapt to new demands, the resulting mapping is not guaranteed to be optimal ==> but we may only need to change this one module to reach the desired results
Matrix Multiplication Unit
[Key Design Reasoning]
Utilization of four 128x128 MXUs is 37%–48%, about 1.6x that of a single 256x256 MXU (22%–30%), yet the four take about the same die area. The reason is that some convolutions are naturally smaller than 256x256, so sections of a large MXU would sit idle. Sixteen 64x64 MXUs would have a little higher utilization (38%–52%) but would need more area. The reason: MXU area is determined either by the logic for the multipliers or by the wires on its perimeter for the inputs, outputs, and control. In this technology, for 128x128 and larger the MXU’s area is limited by the multipliers, but for 64x64 and smaller it is limited by the I/O and control wires.
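The area argument above reduces to two scaling laws: an NxN systolic array has N*N multipliers (area ~ N^2) but only ~4N perimeter wires for I/O and control (area ~ N). A minimal sketch; the cost constants are made-up values chosen so the crossover lands between 64x64 and 128x128, matching the paper's qualitative claim, not measured figures:

```python
# Sketch of the MXU sizing argument: multiplier area grows as N^2,
# perimeter-wire area as N, so small arrays are wire-limited and
# large arrays are multiplier-limited. Constants are illustrative
# assumptions (crossover at N = 100), not real process data.

C_MULT = 1.0   # assumed area per multiplier cell
C_WIRE = 25.0  # assumed area per perimeter wire track

def mxu_area(n):
    multiplier_area = C_MULT * n * n
    wire_area = C_WIRE * 4 * n
    return max(multiplier_area, wire_area)  # whichever dominates

# Three ways to provide 65,536 multipliers in total:
for name, (count, n) in {"1 x 256x256": (1, 256),
                         "4 x 128x128": (4, 128),
                         "16 x 64x64": (16, 64)}.items():
    print(name, count * mxu_area(n))
```

Under these assumptions, four 128x128 arrays cost the same area as one 256x256 (both multiplier-limited), while sixteen 64x64 arrays cost noticeably more because each is wire-limited, mirroring the trade-off described above.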
Transpose Reduction Permute Unit
performs 128x128 matrix transposes, reductions, and permutations of the VPU lanes
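The three operation classes this unit accelerates can be shown with numpy stand-ins on a small 4x4 tile (the hardware works on 128x128; this is not the TPU ISA):

```python
import numpy as np

# Illustrative stand-ins for the transpose/reduction/permute unit,
# on a 4x4 tile instead of the hardware's 128x128.
x = np.arange(16, dtype=np.float32).reshape(4, 4)

transposed = x.T               # matrix transpose
row_sums = x.sum(axis=1)       # reduction across a lane
permuted = x[:, [2, 0, 3, 1]]  # permutation of the lanes
```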
TPUv3
TPUv3 is in the same technology generation as v2, but is a strict, direct upgrade based on experience developing v2.
TPUv3 has ≈1.35x the clock rate, ICI bandwidth, and memory bandwidth, plus twice the number of MXUs, so peak performance rises 2.7x. Liquid cooling lets the chip run at 1.6x more power. The TPUv3 supercomputer also expanded to 1024 chips (see Figure 4). Table 3 lists key features of the three TPU generations along with a contemporary GPU (NVIDIA Volta).
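The 2.7x figure above is just the product of the two independent scaling factors:

```python
# Arithmetic behind the TPUv3 peak-performance claim:
# ~1.35x clock rate times 2x MXUs per core.
clock_scale = 1.35
mxu_scale = 2
peak_scale = clock_scale * mxu_scale
print(peak_scale)  # 2.7
```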
DSA Arithmetic: brain floating format bf16
Key Observations:
1. per Table 3, the peak FLOPS of fp16 is about 8x that of fp32
2. out of the 32 bits of fp32, the full 8-bit exponent is needed to avoid overflow, compared to fp16's 5 exponent bits
===> the 23-bit mantissa can be reduced to 7 bits without accuracy loss, giving the 16-bit bf16 format
hence
==> recall how the float type is laid out in memory (see the CSDN post "float类型在内存中的存储方式"); bf16 is no less sensitive to small updates, since the exponent range is the same as fp32's;
loss of precision happens only when the mantissa is truncated, and that does not affect model accuracy
see more at: A Study of BFLOAT16 for Deep Learning Training
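Because bf16 keeps fp32's sign and exponent bits and just drops the low mantissa bits, converting fp32 to bf16 is simply taking the top 16 bits. A minimal sketch using plain truncation (real hardware typically rounds to nearest):

```python
import struct

# Sketch of the fp32 <-> bf16 relationship: bf16 is the top 16 bits
# of an IEEE fp32 value (1 sign + 8 exponent + 7 mantissa bits).
# Truncation (not round-to-nearest) keeps the sketch minimal.

def fp32_to_bf16_bits(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 16  # keep sign, full 8-bit exponent, 7 mantissa bits

def bf16_bits_to_fp32(b):
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

x = 3.14159
y = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
print(y)  # 3.140625: only mantissa precision is lost
```

The exponent range is untouched, so tiny gradient updates that would underflow in fp16's 5-bit exponent still survive in bf16, which is the point made above.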
DSA Dev. Env.: XLA Compiler
a typical example of software-hardware co-development
==> XLA manages all memory transfers directly; there are no caches in the architecture
====> the compiler is then responsible for operator fusion/composition as well, bypassing reads/writes of intermediate results
==> XLA exploits the huge parallelism that an input TF dataflow graph represents. Beyond the parallelism of operations (“ops”) in a graph, each op can comprise millions of multiplications and additions on data tensors of millions of elements. XLA maps this abundant parallelism across hundreds of chips in a supercomputer, a few cores per chip, multiple units per core, and thousands of multipliers and adders inside each functional unit.
the compiler is crucial to the applicability of the supercomputer architecture
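The fusion idea above can be sketched in pure Python: unfused, each op writes its full intermediate result to memory and the next op re-reads it; fused, one pass keeps intermediates in registers. This is a stand-in for what XLA emits, not actual compiler output:

```python
# Sketch of operator fusion: multiply, add-bias, and ReLU done as
# three separate passes (with intermediate buffers) vs. one fused
# pass with no intermediates written to memory.

def unfused(xs, w, b):
    t1 = [x * w for x in xs]          # intermediate buffer written out
    t2 = [t + b for t in t1]          # second pass re-reads it
    return [max(t, 0.0) for t in t2]  # third pass (ReLU)

def fused(xs, w, b):
    # multiply-add-ReLU in one loop: no intermediate buffers
    return [max(x * w + b, 0.0) for x in xs]

xs = [-1.0, 0.5, 2.0]
print(unfused(xs, 2.0, 1.0))  # [0.0, 2.0, 5.0]
print(fused(xs, 2.0, 1.0))    # same result, one memory pass
```

In a cache-less design like the TPU's, this fusion is not just an optimization but the mechanism that keeps Vmem traffic manageable.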
TPU vs. GPU
a GPU is, after all, a DSA aimed at general array processing, not at weight-update iterations
TPUv2/3 vs. v1 on Inference
so long as the model uses a large batch size during inference, v2/3 will perform better than v1 ==> lower latency and higher throughput
==> note that v2/3 do not support int8 ops, which costs more power but saves the quantization step
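The quantization step that v2/3 skip looks roughly like this: mapping fp32 weights to int8 codes with a per-tensor scale. A minimal symmetric scheme for illustration, not Google's exact recipe:

```python
# Sketch of post-training int8 quantization (the step TPUv1
# required and TPUv2/3 avoid by computing in float directly).
# Symmetric per-tensor scaling; illustrative, not the real flow.

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0  # map max magnitude to 127
    q = [round(v / scale) for v in values]       # int8 codes
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02]
q, s = quantize_int8(w)
print(q)                 # [50, -127, 2]
print(dequantize(q, s))  # approximate reconstruction of w
```

Skipping this step removes a source of accuracy tuning and deployment friction, which is the trade-off the note above is pointing at.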