TPUv4/4i: 4th Generation DL DSA

Based on the paper "Ten Lessons From Three Generations Shaped Google's TPUv4i"

Evolution of ML DSA

for TPUv1 see TPUv1: Single Chipped Inference DL DSA (maxzcl's blog, CSDN)

for TPUv2/3 see https://blog.csdn.net/maxzcl/article/details/121399583

for the TPUv1-to-TPUv2 transition see TPUv2/v3 Design Process (maxzcl's blog, CSDN)

The Ten Lessons

General

① Logic, wires, SRAM, & DRAM improve unequally

② Leverage prior compiler optimizations

the fortunes of a new architecture have been bound to the quality of its compilers.

Indeed, compiler problems likely sank the Itanium’s VLIW architecture [25], yet many DSAs rely on VLIW (see §6), including TPUs. Architects wish for great compilers to be developed on simulators, yet much of the progress occurs after hardware is available, since compiler writers can measure the actual time taken by code. Thus, reaching an architecture’s full potential quickly is much easier if it can leverage prior compiler optimizations (==> which requires the new hardware design to stay compatible with the existing software and compiler designs) rather than starting from scratch.

③ Design for performance per TCO vs per CapEx

Capital Expense (CapEx) is the price for an item

Operation Expense (OpEx) is the cost of operation, including electricity consumed and power provisioning

TCO: Total Cost of Ownership

TCO = CapEx + 3 ✕ OpEx //accounting amortizes computer CapEx over 3-5 years

most companies care more about performance/TCO of production apps (perf/TCO) 

A DSA should aim for good Perf/TCO over its full lifetime, and not only at its birth.
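
A minimal sketch, with entirely hypothetical numbers, of how ranking accelerators by Perf/TCO can flip the ranking given by Perf/CapEx:

```python
# Hypothetical comparison of two accelerators; only the formula
# TCO = CapEx + 3 x OpEx is from the text, every number here is made up.

def tco(capex, yearly_opex, years=3):
    """Total Cost of Ownership: purchase price plus amortized operating cost."""
    return capex + years * yearly_opex

# Chip A is cheaper to buy but power-hungry; chip B is pricier but efficient.
chips = {
    "A": {"capex": 2000, "yearly_opex": 1500, "perf": 100},  # arbitrary perf units
    "B": {"capex": 3000, "yearly_opex": 600, "perf": 100},
}

for name, c in chips.items():
    total = tco(c["capex"], c["yearly_opex"])
    print(f"{name}: Perf/CapEx = {c['perf'] / c['capex']:.3f}, "
          f"Perf/TCO = {c['perf'] / total:.3f}")
# A wins on Perf/CapEx (0.050 vs 0.033) but loses on Perf/TCO (0.015 vs 0.021).
```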

ML DSA

④ Support Backwards ML Compatibility

This is handled directly by the compiler, but it must be underpinned by consistency of the TPU hardware structures from one generation to the next

⑤ Inference DSAs need air cooling for global scale

==> optimality is not just "the best possible" but the most suitable. 

⑥ Some inference apps need floating point arithmetic

DSAs may offer quantization, but unlike TPUv1, they should not require it.

Quantized arithmetic grants area and power savings, but it can trade those savings for reduced quality and delayed deployment, and some apps don’t work well when quantized (see Figure 4 and NMT from MLPerf Inference 0.5 in §4).

Early in TPUv1 development, application developers said a 1% drop in quality was acceptable, but they changed their minds by the time the hardware arrived, perhaps because DNN overall quality improved so that 1% added to a 40% error was relatively small but 1% added to a 12% error was relatively large.
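
A quick back-of-the-envelope illustration (error rates chosen to mirror the example above) of why the same 1% absolute quality drop became unacceptable as baseline error rates fell:

```python
# The same 1% absolute drop in quality is a much larger *relative* regression
# once the baseline error is small.
for baseline_error in (40.0, 12.0):
    degraded = baseline_error + 1.0
    relative_increase = (degraded - baseline_error) / baseline_error
    print(f"baseline {baseline_error:4.1f}% -> {degraded:4.1f}% error "
          f"(+{relative_increase:.1%} relative)")
# baseline 40.0% -> 41.0% error (+2.5% relative)
# baseline 12.0% -> 13.0% error (+8.3% relative)
```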

DNN Applications

⑦ Production inference normally needs multi-tenancy

Definition of Multitenancy - IT Glossary | Gartner

Multitenancy is a reference to the mode of operation of software where multiple independent instances of one or multiple applications operate in a shared environment. The instances (tenants) are logically isolated, but physically integrated.

for more on multitenancy and SaaS, see What is multitenancy?

Multitenancy here refers to the fact that most applications require the execution of multiple models/agents, hence:

 ⑧ DNNs grow ~1.5x/year in memory and compute

architects should provide headroom so that DSAs can remain useful over their full lifetimes. 
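
A tiny projection of what the ~1.5x/year trend implies over a typical deployment (the growth rate is from the lesson; the 3-5 year window is the amortization period from ③):

```python
# If DNN memory and compute demands grow ~1.5x per year, a DSA sized "just
# right" at launch loses headroom quickly over a 3-5 year deployment.
growth_per_year = 1.5
for years in range(1, 6):
    print(f"after {years} year(s): {growth_per_year ** years:.1f}x launch-day demand")
```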

⑨ DNN workloads evolve with DNN breakthroughs

programmability and flexibility are crucial for inference DSAs to track DNN progress.

⑩ Inference SLO limit is P99 latency, not batch size

SLO: Service Level Objectives

==> P99 (99th percentile) latency is what the user-end application cares about

====> the DSA should exploit its specialization advantage to provide low latency even with large input batch sizes,

====> and perform no worse than general-purpose devices at small batch sizes.
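
A minimal sketch of picking the largest batch size whose P99 latency still meets the SLO; the latency model, the p99_latency_ms helper, and the 8 ms target are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def p99_latency_ms(batch_size, n_requests=10_000):
    """Toy latency model: fixed overhead plus per-example cost plus a noisy tail."""
    base = 2.0 + 0.05 * batch_size                            # made-up service time
    tail = rng.gamma(shape=2.0, scale=0.5, size=n_requests)   # queueing/jitter
    return float(np.percentile(base + tail, 99))

SLO_MS = 8.0  # hypothetical P99 latency objective
best = max((b for b in (1, 2, 4, 8, 16, 32, 64) if p99_latency_ms(b) <= SLO_MS),
           default=1)
print(f"largest batch size meeting the {SLO_MS} ms P99 SLO: {best}")
```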

The 4th-Gen TPU

What The Lessons Keep

Given the importance of leveraging prior compiler optimizations ② and backwards ML compatibility ④—plus the benefits of reusing earlier hardware designs—TPUv4i was going to follow TPUv3:

1 or 2 brawny cores per chip,

a large systolic MXU array and vector unit per core,

compiler-controlled vector memory, and compiler-controlled DMA access to HBM.

TPUv4 and TPUv4i

The concurrent development of the two chips was enabled by a key realization:

 ==> a single TensorCore design can be shared, deployed in a one-core arrangement (TPUv4i, inference) and a two-core arrangement (TPUv4, training)

 ==> The core development guideline for the 4th generation is truly inspired ==> do not let past mistakes or regrets go to waste

Schematics

Figure 5. TPUv4i chip block diagram. Architectural memories are HBM, Common Memory (CMEM), Vector Memory (VMEM), Scalar Memory (SMEM), and Instruction Memory (IMEM). The data path is the Matrix Multiply Unit (MXU), Vector Processing Unit (VPU), Cross-Lane Unit (XLU), and TensorCore Sequencer (TCS). The uncore (everything not in blue) includes the On-Chip Interconnect (OCI), ICI Router (ICR), ICI Link Stack (LST), HBM Controller (HBMC), Unified Host Interface (UHI), and Chip Manager (MGR).

Figure 6. TPUv4i chip floorplan. The die is <400 mm2 (see Table 1). CMEM is 28% of the area. OCI blocks are stretched to fill space in the abutted floorplan because the die dimensions and overall layout are dominated by the TensorCore, CMEM, and SerDes locations. The TensorCore and CMEM block arrangements are derived from the TPUv4 floorplan.

Compiler compatibility, not binary compatibility

Increased on-chip SRAM storage with common memory (CMEM)

TPUv4i adds 128 MB of Common Memory (CMEM). This expanded memory hierarchy reduces the number of accesses to the slowest and least energy-efficient memory (HBM).

We picked 128MB as the knee of the curve between good performance and a reasonable chip size, as the amortized chip cost is a significant fraction of TCO ③.
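
A sketch of the knee-of-the-curve selection with made-up performance and cost numbers; it only illustrates the Perf/TCO reasoning ③, not the actual TPUv4i design-space data:

```python
# Hypothetical performance-vs-CMEM-capacity curve with diminishing returns,
# and a cost model where more SRAM means a larger, more expensive die.
candidates_mb = [32, 64, 128, 256]
relative_perf = {32: 1.00, 64: 1.25, 128: 1.45, 256: 1.50}   # made-up speedups
relative_tco = {32: 1.00, 64: 1.05, 128: 1.12, 256: 1.35}    # made-up cost growth

best = max(candidates_mb, key=lambda mb: relative_perf[mb] / relative_tco[mb])
for mb in candidates_mb:
    print(f"{mb:3d} MB: Perf/TCO = {relative_perf[mb] / relative_tco[mb]:.2f}")
print(f"knee of the curve under these assumptions: {best} MB")
```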

Four-dimensional tensor DMA

1. TPUv4i contains tensor DMA engines that are distributed throughout the chip’s uncore to mitigate the impact of interconnect latency and wire scaling challenges ①. The tensor DMA engines function as coprocessors that fully decode and execute TensorCore DMA instructions.

2. To maximize predictable performance and simplify hardware and software, TPUv4i unifies the DMA architecture across local (on-chip), remote (chip-to-chip), and host (host-to-chip and chip-to-host) transfers to simplify scaling of applications from a single chip to a complete system.
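
A minimal sketch of a four-dimensional strided DMA descriptor and the order in which an engine might walk it; the field names and layout are assumptions for illustration, not the actual TPUv4i descriptor format:

```python
from dataclasses import dataclass

@dataclass
class TensorDmaDescriptor:
    """Hypothetical 4D strided-copy descriptor: per-dimension counts and strides."""
    src_base: int
    dst_base: int
    counts: tuple    # (n0, n1, n2, n3) elements per dimension
    strides: tuple   # (s0, s1, s2, s3) element strides in source memory

def element_offsets(desc):
    """Enumerate the source offsets a DMA engine would visit (innermost dim last)."""
    n0, n1, n2, n3 = desc.counts
    s0, s1, s2, s3 = desc.strides
    for i in range(n0):
        for j in range(n1):
            for k in range(n2):
                for l in range(n3):
                    yield desc.src_base + i * s0 + j * s1 + k * s2 + l * s3

# Example: describe a 2x2x2x4 tile cut out of a larger tensor in one descriptor,
# so the copy needs no host involvement and no per-row software loop.
desc = TensorDmaDescriptor(src_base=0, dst_base=0,
                           counts=(2, 2, 2, 4), strides=(1024, 256, 16, 1))
print(list(element_offsets(desc))[:8])   # first 8 of the 32 source offsets
```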

Custom on-chip interconnect (OCI)

Rapidly growing and evolving DNN workloads ⑧, ⑨ have driven the TPU uncore towards greater flexibility each generation. Each component of past TPU designs was connected point-to-point (Figure 1). As memory bandwidth increases and the number of components grows, a point-to-point approach becomes too expensive, requiring significant routing resources and die area. It also requires up-front choices about which communication patterns to support. For example, in TPUv3, a TensorCore can only access half of HBM as a local memory [30]: it must go through the ICI to access the other half of HBM. This split imposes limits on how software can use the chip in the future ⑧.

NUMA https://en.wikipedia.org/wiki/Non-uniform_memory_access

Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor.

see more at: NUMA Collections (maxzcl's blog, CSDN)

Arithmetic Improvements

1. Deciding the supported data types

Another big decision is the arithmetic unit. The danger of requiring quantization ⑥ and the importance of backwards ML compatibility ④ meant retaining bfloat16 and fp32 from TPUv3 despite aiming at inference. As we also wanted applications quantized for TPUv1 to port easily to TPUv4i, TPUv4i also supports int8.
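
A minimal sketch of the symmetric int8 quantize/dequantize round trip that int8 support serves (standard textbook quantization, not TPU-specific code), while bfloat16 and fp32 remain available for apps that do not quantize well:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: scale so max |x| maps to 127."""
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.default_rng(0).normal(size=1024).astype(np.float32)
q, scale = quantize_int8(x)
print("max abs quantization error:", float(np.max(np.abs(dequantize(q, scale) - x))))
```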

2. Improvements from XLA, the introduction of CMEM, and the choice of compiler compatibility

Our XLA colleagues suggested that they could handle twice as many MXUs in TPUv4i as they did for TPUv3 ②.

Logic improved the most in the more advanced technology node ①, so we could afford more MXUs. Equally important, the new CMEM could feed them (§5 and §7.A).

3. Reducing the critical path through the MXU

We also wanted to reduce the latency through the systolic array of the MXU while minimizing area and power. Rather than sequentially adding each floating-point multiplication result to the previous partial sum with a series of 128 two-input adders, TPUv4i first sums groups of four multiplication results together, and then adds them to the previous partial sum with a series of 32 two-input adders. This optimized addition cuts the critical path through the systolic array to ¼ the latency of the baseline approach.

Once we decided to adopt a four-input sum, we recognized the opportunity to optimize that component by building a custom four-input floating point adder that eliminates the rounding and normalization logic for the intermediate results. Although the new results are not numerically equivalent, eliminating rounding steps increases accuracy over the old summation logic. Fortunately, the differences from a four- versus two-input adder are small enough to not affect ML results meaningfully ④. Moreover, the four-input adder saved 40% area and 25% power relative to a series of 128 two-input adders. It also reduced overall MXU peak power by 12%, which directly impacts the TDP and cooling system design ⑤ because the MXUs are the most power-dense components of the chip.
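
A small sketch, counting latency in partial-sum adder stages, that contrasts the chain of 128 two-input adds with the groups-of-four scheme; the floating-point comparison at the end is only a software analogue of the reordering, since Python cannot reproduce the hardware's unrounded four-input adder:

```python
import numpy as np

# Latency through one MXU column, counted in sequential partial-sum adder stages.
N = 128                                    # multiplication results per column
baseline_stages = N                        # 128 chained two-input adds
grouped_stages = N // 4                    # groups of 4 pre-summed, then 32 chained adds
print(f"baseline: {baseline_stages} stages, grouped: {grouped_stages} stages "
      f"({baseline_stages // grouped_stages}x shorter critical path)")

# Reordering the additions can change the float result slightly, because float
# addition is not associative; here both orders round every intermediate step.
vals = np.random.default_rng(0).normal(size=N).astype(np.float32)
chained = np.float32(0.0)
for v in vals:                             # strictly sequential accumulation
    chained = np.float32(chained + v)
grouped = np.float32(sum(np.float32(vals[i] + vals[i + 1] + vals[i + 2] + vals[i + 3])
                         for i in range(0, N, 4)))
print("chained vs grouped sum differ by:", float(abs(chained - grouped)))
```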

Scaling

Workload Analysis

Extensive tracing and performance-counter hardware features, particularly in the uncore, are used by the software stack to measure and analyze system-level bottlenecks in user workloads and to guide continuous compiler-level and application-level optimizations (Figure 2). These features increase design time and area, but are worthwhile because we aim for Perf/TCO, not Perf/CapEx ③. The features enable significant system-level performance improvements and boost developer productivity over the lifetime of the product as DNN workloads grow and evolve (see Table 4) ⑦, ⑧, ⑨.

TPUv4/4i Performance

See the article for the various datasets and the parameter-tweaking explorations that show how to find an optimal configuration for the TPU.
