TPUv2/3 Multi-Chip Parallelized DL DSA

A Domain Specific Supercomputer for Training Deep Neural Networks

Inference vs. Training

Both share some computational elements including matrix multiplications, convolutions, and activation functions, so inference and training DSAs might have similar functional units. Key architectural aspects where the requirements differ include:

  • Harder Parallelization:
    • Training must converge to a single consistent set of weights across all examples, while inference requests are largely independent of one another
  • More Computation (both in kinds and quantity)
    • backpropagation adds a whole extra pass, with a different op mix, on top of the forward pass
  • More Memory
    • weight updates need intermediate results from the forward pass during back-propagation; keeping these activations alive can cost up to 10x the space of the weights themselves (see the sketch after this list)
    • quantization of float to int8 is no longer a guaranteed shortcut
  • More Adaptability
    • while inference methods are relatively stable, training algorithms are evolving rapidly, so the hardware design must stay adaptable.
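To make the extra computation and memory concrete, here is a minimal framework-level sketch (plain jax.numpy, not TPU code; the tiny two-layer model, shapes, and squared-error loss are made up for illustration). Inference needs only the forward pass, while a training step must keep the forward intermediates alive for back-propagation:

```python
# Minimal sketch: inference vs. one training step for a made-up 2-layer net.
import jax.numpy as jnp
from jax import random

k1, k2, k3 = random.split(random.PRNGKey(0), 3)
x  = random.normal(k1, (32, 256))           # a batch of 32 examples
W1 = random.normal(k2, (256, 128)) * 0.05
W2 = random.normal(k3, (128, 10)) * 0.05

# Inference: forward pass only; intermediates can be discarded immediately.
def predict(x):
    h = jnp.maximum(x @ W1, 0.0)            # ReLU
    return h @ W2

# Training: the backward pass reuses z1 and h1 saved from the forward pass,
# which is the main source of training's extra memory footprint.
def train_step(x, y_onehot):
    z1 = x @ W1                             # kept for backward
    h1 = jnp.maximum(z1, 0.0)               # kept for backward
    logits = h1 @ W2
    dlogits = 2.0 * (logits - y_onehot) / x.shape[0]   # d(squared error)/dlogits
    dW2 = h1.T @ dlogits                    # needs h1
    dh1 = dlogits @ W2.T
    dz1 = dh1 * (z1 > 0)                    # needs z1
    dW1 = x.T @ dz1                         # needs the input activations
    return dW1, dW2

y = jnp.zeros((32, 10)).at[:, 0].set(1.0)   # dummy one-hot targets
dW1, dW2 = train_step(x, y)
```

With one saved activation tensor per layer, a deep network's activation footprint can easily grow to several times the size of the weights, which is where the ~10x figure above comes from.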

Perfect Scaling

Fortunately for TPUs, recent results show that batch sizes of 256–8,192 scale perfectly without losing accuracy, which makes large MXUs an attractive option for high performance.
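As a rough sketch of how such large batches map onto many chips, the following illustrates synchronous data parallelism (GLOBAL_BATCH, N_REPLICAS, and the linear model are made-up names for the example): the global batch is sharded across replicas, each replica computes a local gradient, and averaging the gradients, standing in for the all-reduce over ICI, keeps every replica's weights identical.

```python
# Minimal sketch of synchronous data parallelism with a gradient all-reduce.
import jax.numpy as jnp
from jax import random

GLOBAL_BATCH, N_REPLICAS, DIM = 1024, 8, 128
kx, kw = random.split(random.PRNGKey(0))
x = random.normal(kx, (GLOBAL_BATCH, DIM))
w = random.normal(kw, (DIM,)) * 0.01
y = x @ jnp.ones((DIM,))                       # synthetic regression targets

def local_grad(w, x_shard, y_shard):
    # gradient of mean squared error for a linear model on one shard
    err = x_shard @ w - y_shard
    return 2.0 * x_shard.T @ err / x_shard.shape[0]

# Shard the global batch: each replica sees GLOBAL_BATCH / N_REPLICAS examples.
x_shards = x.reshape(N_REPLICAS, -1, DIM)
y_shards = y.reshape(N_REPLICAS, -1)

grads = jnp.stack([local_grad(w, xs, ys) for xs, ys in zip(x_shards, y_shards)])
g = grads.mean(axis=0)                         # stands in for the ICI all-reduce
w = w - 0.1 * g                                # identical update on every replica
```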

TPUv2 Design

2 tensor cores per chip

==> a balance between latency (a single huge core suffers from slow global wires) and parallel-programming difficulty (many small cores are harder to program)

====> Global wires on a chip don’t scale with shrinking feature size, so their relative delay increases ==> hence pack multiple smaller cores on a single chip rather than one huge core

see The Incredible Shrinking CPU 

scaling of interconnection https://web.stanford.edu/class/ee311/NOTES/Interconnect%20Scaling.pdf

Figure 3. TPUv2 chip floor plan

Six Major Components

ICI (inter-chip interconnect)

HBM (high-bandwidth memory)

Core Sequencer

  • uses VLIW instructions
  • works together with the XLA compiler
    • instructions come from a software-managed instruction memory (Imem) and scalar data lives in a scalar memory (Smem); there is no instruction cache

Vector Processing Unit

1. The VPU streams data to and from the MXU through decoupling FIFOs.

2. The VPU collects and distributes data to Vmem via data-level parallelism (2D matrix and vector functional units) and instruction-level parallelism (8 operations per instruction).

==> the VPU is the main component that gives the design adaptability, thanks to the relatively "generalized" nature of vector ops in AI algorithms

====> still, even if this module can adapt to new demands, the resulting solution is not guaranteed to be optimal ==> but we may only need to change this one module to reach the desired results (see the sketch below)
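A rough sketch of what this split of work implies (the shapes and op choices below are illustrative, not the actual ISA): the VPU handles elementwise, data-parallel work such as activation functions, while the MXU it feeds handles dense matrix multiplies.

```python
# Illustrative split between VPU-class (elementwise) and MXU-class (matmul) work.
import jax.numpy as jnp
from jax import random

vreg = random.normal(random.PRNGKey(0), (8, 128))    # one tile of vector data

# VPU-class work: elementwise arithmetic and activation functions.
scaled   = vreg * 0.5 + 1.0
relu     = jnp.maximum(vreg, 0.0)
tanh_act = jnp.tanh(vreg)

# MXU-class work: dense matrix multiplication that the VPU streams operands
# to and results from (through the decoupling FIFOs).
weights = random.normal(random.PRNGKey(1), (128, 128))
partial = relu @ weights
```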

Matrix Multiplication Unit

[Key Design Reasoning]

The utilization of four 128x128 MXUs is 37%–48%, which is 1.6x that of a single 256x256 MXU (22%–30%), yet they take about the same die area. The reason is that some convolutions are naturally smaller than 256x256, so sections of the big MXU would sit idle. Sixteen 64x64 MXUs would have slightly higher utilization (38%–52%) but would need more area. The reason is that MXU area is determined either by the logic for the multipliers or by the wires on its perimeter for inputs, outputs, and control. In this technology, for 128x128 and larger the MXU’s area is limited by the multipliers, but the area of 64x64 and smaller MXUs is limited by the I/O and control wires.
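A back-of-the-envelope sketch of the utilization argument (my own toy arithmetic, not the paper's methodology): when an operand dimension is smaller than the systolic array, whole rows or columns of multipliers sit idle, and padding each tile up to the array size wastes the remainder.

```python
# Toy model: fraction of multipliers doing useful work when an (m x n) operand
# is mapped onto mxu_dim x mxu_dim systolic arrays with padding.
def mxu_utilization(m, n, mxu_dim):
    tiles_m = -(-m // mxu_dim)            # ceiling division
    tiles_n = -(-n // mxu_dim)
    useful  = m * n
    issued  = tiles_m * tiles_n * mxu_dim * mxu_dim
    return useful / issued

# e.g. a convolution whose matrix form is 128 x 192:
for dim in (256, 128, 64):
    print(dim, round(mxu_utilization(128, 192, dim), 2))
# 256 -> 0.38, 128 -> 0.75, 64 -> 1.0: smaller arrays waste less on padding,
# but (per the paper) below 128x128 the area becomes wire-limited rather than
# multiplier-limited, so 128x128 is the sweet spot.
```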

Transpose Reduction Permute Unit

128x128 matrix transposes, reductions, and permutations of the VPU lanes.
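At the framework level these operations look roughly like the following (illustrative jax.numpy equivalents, not the unit's actual interface):

```python
# Framework-level stand-ins for what the transpose/reduction/permute unit does.
import jax.numpy as jnp
from jax import random

tile = random.normal(random.PRNGKey(0), (128, 128))

transposed = tile.T                            # 128x128 matrix transpose
row_sums   = tile.sum(axis=1)                  # reduction across a tile dimension
rotated    = jnp.roll(tile, shift=1, axis=1)   # one possible permutation of the 128 lanes
```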

TPUv3

TPUv3 is in the same technology generation as v2, but is a strict, direct upgrade based on the experience of developing v2.

TPUv3 has ≈1.35x the clock rate, ICI bandwidth, and memory bandwidth plus twice the number of MXUs, so peak performance rises 2.7x. The chip is liquid-cooled to allow 1.6x more power. The TPUv3 supercomputer also expands to 1024 chips (see Figure 4). Table 3 lists key features of the three TPU generations along with a contemporary GPU (NVIDIA Volta).
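The 2.7x peak figure follows directly from the numbers above:

```python
# Peak-performance scaling from TPUv2 to TPUv3 (arithmetic only).
clock_ratio = 1.35    # clock rate ratio, v3 vs. v2
mxu_ratio   = 2       # MXUs per core doubled
print(clock_ratio * mxu_ratio)   # 2.7x peak FLOPS per chip
```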

 

DSA Arithmetic: brain floating-point format (bf16)

Key Observations:

1. per Table 3, peak FLOPS with fp16 operands is about 8x that of fp32

2. of fp32's 32 bits, the 8 exponent bits are what matter for avoiding overflow; fp16's 5 exponent bits are not enough

===> so the 23-bit mantissa can be cut to 7 bits without losing accuracy

hence

==> recall how a float is laid out in memory (sign / exponent / mantissa): bf16 is no less sensitive to small updates, since its exponent range is the same as fp32's;

loss of precision only happens when the mantissa is truncated, and that does not hurt accuracy
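A minimal sketch of the bf16 idea (real hardware rounds rather than truncates; plain truncation is used here only to show that the representable range is unchanged while precision drops):

```python
# bf16 = fp32 with the low 16 mantissa bits dropped (sign 1, exponent 8, mantissa 7).
import numpy as np

def to_bf16(x):
    """Truncate an fp32 value to bf16 precision by zeroing the low 16 bits."""
    bits = np.float32(x).view(np.uint32)
    return np.uint32(bits & np.uint32(0xFFFF0000)).view(np.float32)

print(to_bf16(3.14159265))   # ~3.140625: only ~7 mantissa bits of precision remain
print(to_bf16(1e38))         # still finite: same 8-bit exponent range as fp32
print(np.float16(1e38))      # inf: fp16's 5-bit exponent overflows
```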

 see more at: A Study of BFLOAT16 for Deep Learning Training

 

DSA Dev. Env.: XLA Compiler

a typical example of software-hardware co-design

==> XLA manages all memory transfers directly; there are no caches in the architecture

====> the compiler is therefore responsible for operator fusion/composition as well, bypassing reads/writes of intermediate results

==> XLA exploits the huge parallelism that an input TF dataflow graph represents. Beyond the parallelism of operations (“ops”) in a graph, each op can comprise millions of multiplications and additions on data tensors of millions of elements. XLA maps this abundant parallelism across hundreds of chips in a supercomputer, a few cores per chip, multiple units per core, and thousands of multipliers and adders inside each functional unit.

the compiler is crucial to the applicability of the supercomputer architecture
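A small JAX illustration of the fusion point above (JAX compiles through XLA; the toy layer and shapes are made up for the example): under jit, XLA can fuse the matmul's elementwise epilogue so intermediate tensors need not round-trip through memory.

```python
# Toy layer compiled with XLA via jax.jit; the add and ReLU can be fused.
import jax
import jax.numpy as jnp
from jax import random

k1, k2 = random.split(random.PRNGKey(0))
x = random.normal(k1, (1024, 512))
w = random.normal(k2, (512, 512))

def layer(x, w):
    y = x @ w                    # matrix multiply (MXU-class work)
    y = y + 1.0                  # bias-like add (VPU-class work)
    return jnp.maximum(y, 0.0)   # ReLU, fusable with the add instead of materialized

layer_jit = jax.jit(layer)       # XLA compiles (and fuses) the whole graph
out = layer_jit(x, w)
# jax.jit(layer).lower(x, w).compile().as_text()  # inspect the compiled HLO
```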

TPU vs. GPU 

the GPU is, after all, itself a DSA aimed at array processing, not at iterating over weights

 

TPUv2/3 vs. v1 on Inference

as long as the model uses a large batch size during inference, v2/3 will outperform v1 ==> lower latency and higher throughput

==> note that v2/3 do not support int8 ops, which costs more power but saves the quantization step

Supercomputer Scaling
