TPUv2/v3 Design Process

The Design Process for Google’s Training Chips: TPUv2 and TPUv3

Breakdown of the accompanying paper: https://blog.csdn.net/maxzcl/article/details/121399583

Challenges of ML Training DSA

Inference to Training

More computation

More means both new types of computation, such as backpropagation, and a greater amount of computation overall

==> the number of floating-point operations involved can be 10 orders of magnitude larger

More memory

During training, temporary data are kept longer for use during backpropagation. With inference, there is no backpropagation, so data are more ephemeral.

Wider operands

Inference can tolerate int8 numerics relatively easily, but training is more sensitive to dynamic range due to the accumulation of small gradients during weight update.
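A minimal sketch of the dynamic-range problem (the quantization scheme, scale, and gradient magnitude below are illustrative assumptions, not TPU numerics): with a fixed int8 scale sized to cover the weights, a typical small gradient falls below one quantization step and rounds to zero, while a floating-point format keeps it because the exponent adapts to the magnitude.

```python
import numpy as np

# Illustrative int8 quantizer with a scale chosen to cover weights in [-1, 1].
# The scale and gradient magnitude are assumptions for this demo only.
SCALE = 1.0 / 127.0                         # one int8 step is roughly 0.008

def quantize_int8(x):
    return np.clip(np.round(np.asarray(x) / SCALE), -128, 127).astype(np.int8)

gradient = 1e-4                             # a typical small per-step update

# int8: the update is far below one quantization step and rounds to zero,
# so no amount of repetition ever changes the stored weight.
print(quantize_int8(gradient))              # 0

# floating point: the exponent adapts, so the same magnitude is representable,
# which is why training leans on wide-range formats rather than int8.
print(np.float32(gradient))                 # 0.0001
```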

More programmability

Much of training is experimentation, meaning unstable workload targets such as new model architectures or optimizers. The operating mode for handling a long tail of training workloads can be quite different from a heavily optimized inference system.

Harder parallelization

For inference, one chip can hit most latency targets. Beyond that, chips can be scaled out for greater throughput. In contrast, exaflop-scale training runs need to produce a single, consistent set of parameters across the full system, which is easily bottlenecked by off-chip communication.
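A back-of-the-envelope sketch of that bottleneck (every number below is an assumed, illustrative value, not a TPUv2/v3 specification): if each chip must all-reduce its full gradient every step to keep one consistent set of parameters, a commodity off-chip link can easily cost more time than the compute itself.

```python
# All numbers are illustrative assumptions, not TPUv2/v3 specifications.
params          = 300e6              # model parameters
bytes_per_param = 4                  # fp32 gradients
flops_per_step  = 6 * params * 1024  # ~6 FLOPs/param/example, assumed batch of 1024 per chip
chip_flops      = 45e12              # assumed peak FLOP/s of one accelerator
link_bw         = 12.5e9             # assumed off-chip bandwidth, bytes/s (~100 Gb/s)

# A ring all-reduce moves roughly 2x the gradient bytes per chip, independent
# of the number of chips, so every chip ends the step with identical weights.
compute_s = flops_per_step / chip_flops
comm_s    = 2 * params * bytes_per_param / link_bw

print(f"compute per step   : {compute_s * 1e3:6.1f} ms")   # ~41 ms
print(f"all-reduce per step: {comm_s * 1e3:6.1f} ms")      # ~192 ms, i.e. the bottleneck
```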

Buckets of Priority

Key: TPUv1 to TPUv2

Figure 1 shows five piecewise edits that turn TPUv1 into a training chip.

First, splitting on-chip SRAM makes sense when buffering data between sequential fixed-function units, but that is bad for training, where on-chip memory needs more flexibility. The first edit merges these buffers into a single vector memory [see Figure 1(b) and (c)].

For the activation pipeline, we moved away from the fixed-function datapath (containing pooling units and hard-coded activation functions) and built a more programmable vector unit [see Figure 1(d)] (priority 4). The matrix multiply unit (MXU) attaches to the vector unit as a coprocessor [see Figure 1(e)].

Loading read-only parameters into the MXU works for inference but not for training. Training writes those parameters, and it needs significant buffer space for temporary per-step variables. Hence, DDR3 moves behind Vector Memory so that the pair forms a memory hierarchy [also in Figure 1(e)]. Adopting in-package HBM DRAM instead of DDR3 upgrades bandwidth twentyfold, critical to utilizing the core (priority 2).
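A roofline-style sketch of why that bandwidth jump matters (the peak compute, bandwidths, and reuse factor are assumed for illustration, not chip specs): at a realistic operand-reuse rate, DDR3-class bandwidth starves the matrix unit while HBM-class bandwidth keeps it compute-bound.

```python
# Illustrative roofline: all peak and bandwidth figures are assumptions.
peak_flops = 45e12      # assumed matrix-unit peak, FLOP/s
bw_ddr3    = 30e9       # assumed DDR3-class bandwidth, bytes/s
bw_hbm     = 600e9      # assumed in-package HBM bandwidth, bytes/s (~20x DDR3)

def achievable(peak, bandwidth, flops_per_byte):
    """Sustainable FLOP/s at a given arithmetic intensity (roofline model)."""
    return min(peak, bandwidth * flops_per_byte)

ai = 100.0              # assume each byte fetched is reused ~100 times
print(f"DDR3: {achievable(peak_flops, bw_ddr3, ai) / 1e12:5.1f} TFLOP/s")  #  3.0 -> memory-bound
print(f"HBM : {achievable(peak_flops, bw_hbm,  ai) / 1e12:5.1f} TFLOP/s")  # 45.0 -> compute-bound
```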

Last is scale. These humongous computations are much bigger than any one chip can handle. We connect the memory system to a custom interconnect fabric (ICI, for InterChip Interconnect) for multichip training [see Figure 1(f)] (priority 3). And with that final edit, we have a training chip!

Figure 2 provides a cleaner diagram, showing the two-core configuration. The TPUv2 core datapath is blue, HBM is green, host connectivity is purple, and the interconnect router and links are yellow. The TPUv2 Core contains the building blocks of linear algebra: scalar, vector, and matrix computation.

==> 3 considerations for choosing the number of cores on a chip

====> 1. global wiring latency ==> a large single core could be slow

====> 2. wiring vs. ALU ==> at a certain point, smaller cores carry too much wiring overhead in area

====> 3. programmability ==> too many weak, small cores that the compiler must manage can increase engineering risk significantly

TPUv2 Core

Software Codesign

The TPUv2 Core was codesigned closely with the compiler team to ensure programmability.

==> 2 key decisions

====> 1. a VLIW architecture was the simplest way for the hardware to express instruction-level parallelism

====> 2. ensure generality by architecting within the principled language of linear algebra (priority 4). That meant focusing on the computational primitives for scalar, vector, and matrix data types, as the sketch below illustrates.
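A hedged illustration of that programming model in present-day JAX (a generic public example, not Google's internal toolchain; the model, shapes, and learning rate are made up): an entire training step decomposes into scalar, vector, and matrix primitives that an XLA-style compiler can schedule onto such a core.

```python
import jax
import jax.numpy as jnp

# Toy training step written purely in scalar/vector/matrix primitives.
def step(w, x, y, lr):
    def loss(w):
        pred = jnp.dot(x, w)            # matrix * vector  -> matrix unit
        err  = pred - y                 # elementwise op   -> vector unit
        return 0.5 * jnp.mean(err ** 2) # reduction        -> scalar unit
    g = jax.grad(loss)(w)               # backprop is again matrix/vector work
    return w - lr * g                   # vector update scaled by a scalar

step = jax.jit(step)                    # the XLA compiler schedules the whole step

w = step(jnp.zeros(4), jnp.ones((8, 4)), jnp.ones(8), 0.1)
print(w)
```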

Scalar and Vector Unit

 Notably:

1. Figure 3(a) shows a diagram of the scalar unit. At the top left is the instruction bundle memory. While an instruction cache backed by HBM would have been nice, a DMA target for software-managed instruction overlays was easier (priority 1). It is not the flashiest solution, but remember that this part just needs to be “good enough.” The greatness in this system lies elsewhere.

2. The (scalar) memory hierarchy is under the control of the compiler, which simplifies the hardware design while delivering high performance (priorities 1, 2, and 5). A sketch of this software-managed style follows below.
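A minimal sketch of the overlay idea in plain Python (the class name, slot count, and "DMA" print are hypothetical illustrations, not a real API): because the compiler knows the execution order ahead of time, it can place explicit copies of instruction blocks into a tiny on-chip memory instead of relying on a hardware cache.

```python
BUNDLE_MEM_SLOTS = 4                      # tiny on-chip instruction bundle memory

class OverlayManager:                     # hypothetical name, for illustration only
    def __init__(self, program):
        self.program = program            # list of overlay blocks living in HBM
        self.bundle_mem = {}              # slot -> overlay id currently resident

    def ensure_resident(self, overlay_id):
        """Issue an explicit 'DMA' if the overlay is not already on chip."""
        if overlay_id in self.bundle_mem.values():
            return                        # already resident: nothing to copy
        slot = overlay_id % BUNDLE_MEM_SLOTS   # placement decided ahead of time
        print(f"DMA overlay {overlay_id} from HBM into slot {slot}")
        self.bundle_mem[slot] = overlay_id

    def run(self, schedule):
        for overlay_id in schedule:       # the compiler knows this order, so it
            self.ensure_resident(overlay_id)   # can insert the loads explicitly
            for _ in self.program[overlay_id]:
                pass                      # "execute" the bundles of the overlay

mgr = OverlayManager(program=[["bundle"] * 8 for _ in range(6)])
mgr.run(schedule=[0, 1, 0, 2, 5, 1])      # only the misses trigger a copy
```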

Matrix Computation Units

==> apparently it is the general-purpose vector processing unit (VPU) that takes up the most floor area ==> the price we pay for generality

Memory System 

On-Chip Memory

HBM

For how DMA works and performs, see DMA and DMA_Performance.

Partition of Concern

==> an adaptation of the decoupled, modular design principle to hardware architecture

The core (blue in Figure 2) provides predictable, tightly scheduled execution (priority 2). The memory system (in green) handles asynchronous prefetch DMA execution from the larger HBM memory space (priority 5). The hand-off between the two regimes is managed with sync flags.
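A minimal sketch of that hand-off using Python threads (the buffer sizes, names, and Event-based flags are illustrative stand-ins, not the hardware mechanism): an asynchronous "DMA engine" prefetches tiles from a large "HBM" array into a double buffer, and the "core" consumes them on its own schedule, with sync flags marking when each buffer is full or free.

```python
import threading
import numpy as np

TILE, N_TILES = 1024, 8
hbm  = np.random.rand(N_TILES, TILE)            # large off-core memory
vmem = [np.empty(TILE), np.empty(TILE)]         # small on-core double buffer
full = [threading.Event(), threading.Event()]   # DMA -> core: data is ready
free = [threading.Event(), threading.Event()]   # core -> DMA: buffer reusable
for f in free:
    f.set()                                     # both buffers start out free

def dma_engine():                               # asynchronous prefetch
    for i in range(N_TILES):
        b = i % 2
        free[b].wait(); free[b].clear()         # wait until the core released it
        vmem[b][:] = hbm[i]                     # copy the next tile from "HBM"
        full[b].set()                           # raise the sync flag

def core():                                     # tightly scheduled compute
    acc = 0.0
    for i in range(N_TILES):
        b = i % 2
        full[b].wait(); full[b].clear()         # stall only if the DMA is late
        acc += float(vmem[b].sum())
        free[b].set()                           # hand the buffer back
    print("result:", acc)

t = threading.Thread(target=dma_engine)
t.start(); core(); t.join()
```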

Interconnect

==> the PCIe host bus is orders of magnitude slower than ICI or HBM

====> obviously we need to minimize interconnect communication across devices

====> each device/local chip cluster connected by ICI should be functionally independent

======> this functional completeness enables the scalability and performance of the interconnected system (see the sketch below)
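A hedged sketch of the programming model this enables, in generic JAX (not Google's internal stack; it assumes a host with more than one accelerator, and the model is made up): each device runs the same functionally complete step on its own data shard, and the only cross-device traffic is a single gradient all-reduce over the interconnect.

```python
import jax
import jax.numpy as jnp

def local_step(w, x, y, lr=0.1):
    def loss(w):
        return 0.5 * jnp.mean((jnp.dot(x, w) - y) ** 2)
    g = jax.grad(loss)(w)                    # computed entirely on-device
    g = jax.lax.pmean(g, axis_name="ici")    # the one collective per step
    return w - lr * g                        # every device applies the same update

train_step = jax.pmap(local_step, axis_name="ici")

n_dev = jax.local_device_count()
w = jnp.zeros((n_dev, 4))                    # replicated parameters
x = jnp.ones((n_dev, 8, 4))                  # per-device data shard
y = jnp.ones((n_dev, 8))
w = train_step(w, x, y)
print(w.shape)                               # (n_dev, 4): one consistent copy per device
```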

TPUv3

==> a strict upgrade based on TPUv2

====> beware of the second-system effect 

 Conclusion

1 Build quickly.

Our cross-team co-design philosophy found simpler-to-design hardware solutions that also gave more predictable software control, such as DMA to main memory (HBM) and compiler-controlled on-chip memory instead of caches. Along the way, we made difficult tradeoffs to preserve the development schedule, such as splitting the HBM between the two cores, tolerating an inefficient chip layout, and using FIFOs to simplify XLA compiler scheduling.

2 Achieve high performance.

Matrix computation density, high HBM bandwidth, and XLA compiler optimizations deliver excellent performance.

3 Scale up.

Our system-first approach (each chip is a functionally independent system, so communication and interdependency across the supercomputer are minimized) and simple-to-use interconnect (a unified interface) let TPUv3 scale natively to 1024 chips and deliver nearly linear speedup for production applications.

4 Work for new workloads out of the box.

To support the deluge of training workloads, we built a core grounded in linear algebra (with minimal module granularity) that works well with the XLA compiler, and HBM provides enough capacity and bandwidth to keep pace with growing models.

5 Be cost effective.

The matrix units are efficient, the design was simple without gratuitous bells and whistles, and we got our money's worth in performance.
