TPUv2/v3 Design Process

The Design Process for Google’s Training Chips: TPUv2 and TPUv3

Breakdown of the accompanying paper: https://blog.csdn.net/maxzcl/article/details/121399583

Challenges of ML Training DSA

Inference to Training

More computation

More means both new types of computation, such as backpropagation, and a greater amount of computation overall

==> the number of floating-point operations involved can be 10 orders of magnitude larger

More memory

During training, temporary data are kept longer for use during backpropagation. With inference, there is no backpropagation, so data are more ephemeral.

Wider operands

Inference can tolerate int8 numerics relatively easily, but training is more sensitive to dynamic range due to the accumulation of small gradients during weight update.
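A minimal sketch of the dynamic-range problem (the quantization scheme, scale, and gradient magnitude below are illustrative assumptions, not TPU numerics): with a fixed int8 scale sized to cover the weights, a typical small gradient falls below one quantization step and rounds to zero, while a floating-point format keeps it because the exponent adapts to the magnitude.

```python
import numpy as np

# Illustrative int8 quantizer with a scale chosen to cover weights in [-1, 1].
# The scale and gradient magnitude are assumptions for this demo only.
SCALE = 1.0 / 127.0                         # one int8 step is roughly 0.008

def quantize_int8(x):
    return np.clip(np.round(np.asarray(x) / SCALE), -128, 127).astype(np.int8)

gradient = 1e-4                             # a typical small per-step update

# int8: the update is far below one quantization step and rounds to zero,
# so no amount of repetition ever changes the stored weight.
print(quantize_int8(gradient))              # 0

# floating point: the exponent adapts, so the same magnitude is representable,
# which is why training leans on wide-range formats rather than int8.
print(np.float32(gradient))                 # 0.0001
```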

More programmability

Much of training is experimentation, meaning unstable workload targets such as new model architectures or optimizers. The operating mode for handling a long tail of training workloads can be quite different from a heavily optimized inference system.

Harder parallelization

For inference, one chip can hit most latency targets. Beyond that, chips can be scaled out for greater throughput. In contrast, exaflop-scale training runs need to produce a single, consistent set of parameters across the full system, which is easily bottlenecked by off-chip communication.
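A back-of-the-envelope sketch of that bottleneck (every number below is an assumed, illustrative value, not a TPUv2/v3 specification): if each chip must all-reduce its full gradient every step to keep one consistent set of parameters, a commodity off-chip link can easily cost more time than the compute itself.

```python
# All numbers are illustrative assumptions, not TPUv2/v3 specifications.
params          = 300e6              # model parameters
bytes_per_param = 4                  # fp32 gradients
flops_per_step  = 6 * params * 1024  # ~6 FLOPs/param/example, assumed batch of 1024 per chip
chip_flops      = 45e12              # assumed peak FLOP/s of one accelerator
link_bw         = 12.5e9             # assumed off-chip bandwidth, bytes/s (~100 Gb/s)

# A ring all-reduce moves roughly 2x the gradient bytes per chip, independent
# of the number of chips, so every chip ends the step with identical weights.
compute_s = flops_per_step / chip_flops
comm_s    = 2 * params * bytes_per_param / link_bw

print(f"compute per step   : {compute_s * 1e3:6.1f} ms")   # ~41 ms
print(f"all-reduce per step: {comm_s * 1e3:6.1f} ms")      # ~192 ms, i.e. the bottleneck
```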

Buckets of Priority

Key: TPUv1 to TPUv2

Figure 1 shows five piecewise edits that turn TPUv1 into a training chip.

First, splitting on-chip SRAM makes sense when buffering data between sequential fixed-function units, but that is bad for training, where on-chip memory needs more flexibility. The first edit merges these buffers into a single vector memory [see Figure 1(b) and (c)].

For the activation pipeline, we moved away from the fixed-function datapath (containing pooling units and hard-coded activation functions) and built a more programmable vector unit [see Figure 1(d)] (priority 4). The matrix multiply unit (MXU) attaches to the vector unit as a coprocessor [see Figure 1(e)].

Loading read-only parameters into the MXU works for inference but not for training. Training writes those parameters, and it needs significant buffer space for temporary per-step variables. Hence, DDR3 moves behind Vector Memory so that the pair forms a memory hierarchy [also in Figure 1(e)]. Adopting in-package HBM DRAM instead of DDR3 upgrades bandwidth twentyfold, critical to utilizing the core (priority 2).
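A roofline-style sketch of why that bandwidth jump matters (the peak compute, bandwidths, and reuse factor are assumed for illustration, not chip specs): at a realistic operand-reuse rate, DDR3-class bandwidth starves the matrix unit while HBM-class bandwidth keeps it compute-bound.

```python
# Illustrative roofline: all peak and bandwidth figures are assumptions.
peak_flops = 45e12      # assumed matrix-unit peak, FLOP/s
bw_ddr3    = 30e9       # assumed DDR3-class bandwidth, bytes/s
bw_hbm     = 600e9      # assumed in-package HBM bandwidth, bytes/s (~20x DDR3)

def achievable(peak, bandwidth, flops_per_byte):
    """Sustainable FLOP/s at a given arithmetic intensity (roofline model)."""
    return min(peak, bandwidth * flops_per_byte)

ai = 100.0              # assume each byte fetched is reused ~100 times
print(f"DDR3: {achievable(peak_flops, bw_ddr3, ai) / 1e12:5.1f} TFLOP/s")  #  3.0 -> memory-bound
print(f"HBM : {achievable(peak_flops, bw_hbm,  ai) / 1e12:5.1f} TFLOP/s")  # 45.0 -> compute-bound
```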

Last is scale. These humongous computations are much bigger than any one chip can handle. We connect the memory system to a custom interconnect fabric (ICI, for InterChip Interconnect) for multichip training [see Figure 1(f)] (priority 3). And with that final edit, we have a training chip!

Figure 2 provides a cleaner diagram, showing the two-core configuration. The TPUv2 core datapath is blue, HBM is green, host connectivity is purple, and the interconnect router and links are yellow. The TPUv2 Core contains the building blocks of linear algebra: scalar, vector, and matrix computation.

==> 3 considerations for choosing the number of cores on a chip

====> 1. global wiring latency ==> a large single core could be slow

====> 2. wiring vs. ALU ==> at a certain point, smaller cores carry too much wiring overhead in area

====> 3. programmability ==> too many weak, small cores that the compiler must manage can increase engineering risk significantly

TPUv2 Core

Software Codesign

The TPUv2 Core was codesigned closely with the compiler team to ensure programmability.

==> 2 key decisions

====> 1. a VLIW architecture was the simplest way for the hardware to express instruction-level parallelism

====> 2. ensure generality by architecting within the principled language of linear algebra (priority 4). That meant focusing on the computational primitives for scalar, vector, and matrix data types, as the sketch below illustrates.
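A hedged illustration of that programming model in present-day JAX (a generic public example, not Google's internal toolchain; the model, shapes, and learning rate are made up): an entire training step decomposes into scalar, vector, and matrix primitives that an XLA-style compiler can schedule onto such a core.

```python
import jax
import jax.numpy as jnp

# Toy training step written purely in scalar/vector/matrix primitives.
def step(w, x, y, lr):
    def loss(w):
        pred = jnp.dot(x, w)            # matrix * vector  -> matrix unit
        err  = pred - y                 # elementwise op   -> vector unit
        return 0.5 * jnp.mean(err ** 2) # reduction        -> scalar unit
    g = jax.grad(loss)(w)               # backprop is again matrix/vector work
    return w - lr * g                   # vector update scaled by a scalar

step = jax.jit(step)                    # the XLA compiler schedules the whole step

w = step(jnp.zeros(4), jnp.ones((8, 4)), jnp.ones(8), 0.1)
print(w)
```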

Scalar and Vector Unit

 Notably:

1. Figure 3(a) shows a diagram of the scalar unit. At the top left is the instruction bundle memory. While an instruction cache backed by HBM would have been nice, a DMA target for software-managed instruction overlays was easier (priority 1). It is not the flashiest solution, but remember that this part just needs to be “good enough.” The greatness in this system lies elsewhere.

2. The (scalar) memory hierarchy is under the control of the compiler, which simplifies the hardware design while delivering high performance (priorities 1, 2, and 5). A sketch of this software-managed style follows below.
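A minimal sketch of the overlay idea in plain Python (the class name, slot count, and "DMA" print are hypothetical illustrations, not a real API): because the compiler knows the execution order ahead of time, it can place explicit copies of instruction blocks into a tiny on-chip memory instead of relying on a hardware cache.

```python
BUNDLE_MEM_SLOTS = 4                      # tiny on-chip instruction bundle memory

class OverlayManager:                     # hypothetical name, for illustration only
    def __init__(self, program):
        self.program = program            # list of overlay blocks living in HBM
        self.bundle_mem = {}              # slot -> overlay id currently resident

    def ensure_resident(self, overlay_id):
        """Issue an explicit 'DMA' if the overlay is not already on chip."""
        if overlay_id in self.bundle_mem.values():
            return                        # already resident: nothing to copy
        slot = overlay_id % BUNDLE_MEM_SLOTS   # placement decided ahead of time
        print(f"DMA overlay {overlay_id} from HBM into slot {slot}")
        self.bundle_mem[slot] = overlay_id

    def run(self, schedule):
        for overlay_id in schedule:       # the compiler knows this order, so it
            self.ensure_resident(overlay_id)   # can insert the loads explicitly
            for _ in self.program[overlay_id]:
                pass                      # "execute" the bundles of the overlay

mgr = OverlayManager(program=[["bundle"] * 8 for _ in range(6)])
mgr.run(schedule=[0, 1, 0, 2, 5, 1])      # only the misses trigger a copy
```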

Matrix Computation Units

==> apparently it is the general-purpose vector processing unit (VPU) that takes up the most floor area ==> the price we pay for generality

Memory System 

On-Chip Memory

HBM

For how DMA works and performs, see DMA and DMA_Performance.

Partition of Concern

==> an adaptation of the decoupled, modular design principle to hardware architecture

The core (blue in Figure 2) provides predictable, tightly scheduled execution (priority 2). The memory system (in green) handles asynchronous prefetch DMA execution from the larger HBM memory space (priority 5). The hand-off between the two regimes is managed with sync flags.
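A minimal sketch of that hand-off using Python threads (the buffer sizes, names, and Event-based flags are illustrative stand-ins, not the hardware mechanism): an asynchronous "DMA engine" prefetches tiles from a large "HBM" array into a double buffer, and the "core" consumes them on its own schedule, with sync flags marking when each buffer is full or free.

```python
import threading
import numpy as np

TILE, N_TILES = 1024, 8
hbm  = np.random.rand(N_TILES, TILE)            # large off-core memory
vmem = [np.empty(TILE), np.empty(TILE)]         # small on-core double buffer
full = [threading.Event(), threading.Event()]   # DMA -> core: data is ready
free = [threading.Event(), threading.Event()]   # core -> DMA: buffer reusable
for f in free:
    f.set()                                     # both buffers start out free

def dma_engine():                               # asynchronous prefetch
    for i in range(N_TILES):
        b = i % 2
        free[b].wait(); free[b].clear()         # wait until the core released it
        vmem[b][:] = hbm[i]                     # copy the next tile from "HBM"
        full[b].set()                           # raise the sync flag

def core():                                     # tightly scheduled compute
    acc = 0.0
    for i in range(N_TILES):
        b = i % 2
        full[b].wait(); full[b].clear()         # stall only if the DMA is late
        acc += float(vmem[b].sum())
        free[b].set()                           # hand the buffer back
    print("result:", acc)

t = threading.Thread(target=dma_engine)
t.start(); core(); t.join()
```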

Interconnect

==> the PCIe host bus is orders of magnitude slower than ICI or HBM

====> obviously we need to minimize interconnect communication across devices

====> each device/local chip cluster connected by ICI should be functionally independent

======> this functional completeness enables the scalability and performance of the interconnected system (see the sketch below)
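A hedged sketch of the programming model this enables, in generic JAX (not Google's internal stack; it assumes a host with more than one accelerator, and the model is made up): each device runs the same functionally complete step on its own data shard, and the only cross-device traffic is a single gradient all-reduce over the interconnect.

```python
import jax
import jax.numpy as jnp

def local_step(w, x, y, lr=0.1):
    def loss(w):
        return 0.5 * jnp.mean((jnp.dot(x, w) - y) ** 2)
    g = jax.grad(loss)(w)                    # computed entirely on-device
    g = jax.lax.pmean(g, axis_name="ici")    # the one collective per step
    return w - lr * g                        # every device applies the same update

train_step = jax.pmap(local_step, axis_name="ici")

n_dev = jax.local_device_count()
w = jnp.zeros((n_dev, 4))                    # replicated parameters
x = jnp.ones((n_dev, 8, 4))                  # per-device data shard
y = jnp.ones((n_dev, 8))
w = train_step(w, x, y)
print(w.shape)                               # (n_dev, 4): one consistent copy per device
```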

TPUv3

==> a strict upgrade based on TPUv2

====> beware of the second-system effect 

 Conclusion

1 Build quickly.

Our cross-team co-design philosophy found simpler-to-design hardware solutions that also gave more predictable software control, such as DMA to main memory (HBM) and compiler-controlled on-chip memory instead of caches. Along the way, we made difficult tradeoffs to preserve the development schedule, such as splitting the HBM between the two cores, tolerating an inefficient chip layout, and using FIFOs to simplify XLA compiler scheduling.

2 Achieve high performance.

Matrix computation density, high HBM bandwidth, and XLA compiler optimizations deliver excellent performance.

3 Scale up.

Our system-first approach (each chip is a functionally independent system, so communication and interdependency across the supercomputer are minimized) and simple-to-use interconnect (a unified interface) let TPUv3 scale natively to 1024 chips and deliver nearly linear speedup for production applications.

4 Work for new workloads out of the box.

To support the deluge of training workloads, we built a core grounded in linear algebra (with minimal module granularity) that works well with the XLA compiler, and HBM provides enough capacity and bandwidth to keep pace with growing models.

5 Be cost effective.

The matrix units are efficient, the design was simple without gratuitous bells and whistles, and we got our money's worth in performance.
