TPUv1: A Single-Chip DSA for DL Inference

from A Domain-Specific Architecture for Deep Neural Networks | September 2018 | Communications of the ACM

Limits of General Purpose Architectures

1. End of Moore's Law

2. End of Dennard Scaling

3. Lack of further ILP and the diminishing returns of multicore scaling, as quantified by Amdahl's Law

==> at the heart of everything:

Since transistors are not getting much better (reflecting the end of Moore's Law), the peak power per mm² of chip area is increasing (due to the end of Dennard scaling), but the power budget per chip is not increasing (due to electromigration and mechanical and thermal limits), and chip designers have already played the multi-core card (which is limited by Amdahl's Law), architects now widely believe the only path left for major improvements in performance-cost-energy is domain-specific architectures. They do only a few tasks but do them extremely well.

TPU V1

 To reduce the risk of delaying deployment, Google engineers designed the TPU to be a coprocessor on the I/O bus rather than be tightly integrated with a CPU, allowing it to plug into existing servers just as a GPU does. Moreover, to simplify hardware design and debugging, the host server sends TPU instructions for it to execute rather than fetch them itself. The TPU is thus closer in spirit to a floating-point unit (FPU) coprocessor than it is to a GPU.

Key component: the matrix multiply unit (MXU) is the heart of the TPU, with 256×256 MACs that can perform eight-bit multiply-and-adds on signed or unsigned integers.
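To make the arithmetic concrete, here is a minimal NumPy sketch (an assumption of the scheme, not Google's hardware) of what one MXU pass computes; the 32-bit accumulation matches the paper's 32-bit accumulators, and signed int8 is one of the two operand types the unit supports:

```python
import numpy as np

def mxu_matmul(activations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Emulate the MXU arithmetic: 8-bit multiplies whose products are
    summed into 32-bit accumulators (no floating point involved)."""
    assert activations.dtype == np.int8 and weights.dtype == np.int8
    # Widen before multiplying so products and sums cannot overflow int8.
    return activations.astype(np.int32) @ weights.astype(np.int32)

# One 256x256 tile, mirroring the MXU's dimensions.
rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
w = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
acc = mxu_matmul(a, w)   # shape (256, 256), dtype int32
```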

Eight-bit integer multiplies can require 6× less energy and take 6× less area than IEEE 754 16-bit floating-point multiplies, and the edge for integer addition is 13× in energy and 38× in area; see: High-Performance Hardware for Machine Learning

==> optimal performance requires keeping the matmul unit busy

====> to do so, instructions that access the Unified Buffer (UB) are decoupled-access/execute: they can complete once their address has been sent, before the data are actually fetched
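A toy software analogue of that decoupling, assuming a bounded FIFO between a prefetching "access" thread and a computing "execute" loop (the helper names and tile contents here are made up for illustration; the TPU does this in hardware):

```python
import threading, queue
import numpy as np

tile_q: "queue.Queue[np.ndarray]" = queue.Queue(maxsize=4)  # weight FIFO

def access_unit(n_tiles: int) -> None:
    """Access side: issues weight fetches ahead of the compute unit."""
    for i in range(n_tiles):
        # Stand-in for a DRAM read issued long before the data are needed.
        tile_q.put(np.full((256, 256), i % 127, dtype=np.int8))

def execute_unit(n_tiles: int, act: np.ndarray) -> None:
    """Execute side: consumes tiles as they arrive, overlapping compute
    with the fetches still in flight."""
    for _ in range(n_tiles):
        w = tile_q.get()   # blocks only if the prefetcher fell behind
        _ = act.astype(np.int32) @ w.astype(np.int32)

act = np.ones((256, 256), dtype=np.int8)
t = threading.Thread(target=access_unit, args=(8,))
t.start()
execute_unit(8, act)
t.join()
```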

====> matmul adopts "systolic execution" to reduce reads and writes from/to SRAMs to save energy

see Systolic Array
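A minimal emulation of a weight-stationary systolic matmul (a sketch of the general scheme, not the TPU's exact pipeline). The comments map each loop to its systolic role; in silicon the loops run concurrently as a pipeline, and each operand is read from SRAM once rather than once per MAC:

```python
import numpy as np

def systolic_matmul(A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Weight-stationary emulation: PE (k, n) holds W[k, n] permanently;
    activations enter from the left, partial sums trickle down a column."""
    M, K = A.shape
    K2, N = W.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.int32)
    for m in range(M):          # one activation row per wavefront
        for n in range(N):      # one column of PEs
            partial = 0         # partial sum entering the top of the column
            for k in range(K):  # PE (k, n) fires as its operands arrive
                partial += int(A[m, k]) * int(W[k, n])
            C[m, n] = partial   # drained into the 32-bit accumulators
    return C

A = np.arange(6, dtype=np.int8).reshape(2, 3)
W = np.arange(12, dtype=np.int8).reshape(3, 4)
assert np.array_equal(systolic_matmul(A, W),
                      A.astype(np.int32) @ W.astype(np.int32))
```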

Performance

==> 1. notice that the GPU (red) has a very high baseline ops/sec, due to its highly parallel design

==> 2. the TPU has a much higher ops/sec ceiling than CPUs/GPUs of the same era

====> reaching the ceiling becomes harder, which leaves more room for operational-intensity optimization

==> 3. under the slanted part of the roof, performance is memory-bound

====> the memory-transfer unit(s) are kept busy ==> not enough ops are performed per byte of data moved ==> optimizing the ALU, i.e., doing ops faster (more ops/sec), is pointless under the slope ==> add more ops per byte moved until we get under the "roof" ==> only then consider how to keep the main ALU busy all the time (at or near the peak-performance limit)

======> or, when operational intensity cannot be increased, change the architecture to be data-centered ==> raise the "slope" itself by reducing latency to increase memory-access throughput
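The reasoning above is the roofline model: attainable throughput is the minimum of the compute ceiling and the memory slope times operational intensity. A minimal calculator, plugged with TPUv1's peak (92 T ops/s) and DDR3 bandwidth (~34 GB/s) as reported in the paper:

```python
def roofline(peak_ops_per_s: float, mem_bw_bytes_per_s: float,
             ops_per_byte: float) -> float:
    """Attainable throughput = min(compute ceiling, slope * intensity)."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * ops_per_byte)

tpu_peak = 92e12   # TPUv1 peak: 92 T 8-bit ops/s (from the paper)
tpu_bw   = 34e9    # TPUv1 DDR3 bandwidth: ~34 GB/s (from the paper)
for oi in (10, 100, 1000, 3000):
    print(oi, roofline(tpu_peak, tpu_bw, oi) / 1e12, "T ops/s")
# Ridge point: 92e12 / 34e9 ≈ 2700 ops/byte. Below it the workload is
# memory-bound, and only raising ops/byte moves it toward the ceiling.
```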

==> 4. throughput vs. latency

====> optimizing throughput to reach the ceiling usually costs individual-request latency

====> TPUv1 is a single-threaded, non-parallel architecture ==> the matmul unit itself exploits DLP, but there are no parallel matmul units ==> general-purpose architectures rely heavily on sophisticated parallel structures to speed up the common case, so they are not ideal for the fringe requirements of certain tasks (here the mismatch manifests as wasted potential), while the TPU, as a DSA, is minimal, designed specifically for the task in question, and performs better.

======> the TPU has only one processor, while the K80 has 13, and it is much easier to meet a rigid latency target with a single thread.

======> "even" in the sense that we expect the TPU to specialize in handling large quantities of certain data, where the GPU and CPU fail to compete

Design

Ideas that did not fly for general-purpose computing may be ideal for domain-specific architectures.
For the TPU, three important architectural features date back to the early 1980s:

systolic arrays,

decoupled-access/execute, [33]

complex instruction set computer (CISC) instructions. [29]

The first reduced the area and power of the large matrix multiply unit; the second fetches weights concurrently during operation of the matrix multiply unit; and the third better utilizes the limited bandwidth of the PCIe bus for delivering instructions. History-aware, domain-specific architects could thus have a competitive edge.
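As an illustration of the CISC point: the paper names about a dozen instructions, of which these five do most of the work. The enum and dispatch loop below are my sketch; only the instruction names come from the paper:

```python
from enum import Enum, auto

class TpuOp(Enum):
    """Five key TPU instructions; each does a lot of work per fetch,
    which suits the narrow PCIe command path."""
    READ_HOST_MEMORY  = auto()  # host DRAM -> Unified Buffer
    READ_WEIGHTS      = auto()  # weight DRAM -> weight FIFO
    MATRIX_MULTIPLY   = auto()  # UB activations x weights -> accumulators
    ACTIVATE          = auto()  # nonlinearity: accumulators -> UB
    WRITE_HOST_MEMORY = auto()  # Unified Buffer -> host DRAM

# The host drives the device: it streams instructions over PCIe and the
# TPU executes them in order (no instruction fetch or branching on chip).
program = [TpuOp.READ_HOST_MEMORY, TpuOp.READ_WEIGHTS,
           TpuOp.MATRIX_MULTIPLY, TpuOp.ACTIVATE, TpuOp.WRITE_HOST_MEMORY]
for op in program:
    print("issue", op.name)   # stand-in for a PCIe write to the TPU
```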

Key Features for Energy-Performance Gap

  • One processor ==> easier to meet latency limits
  • Large, two-dimensional multiply unit ==> matrix multiplies benefit from two-dimensional hardware
  • Systolic arrays: the two-dimensional organization enables systolic arrays, reducing register accesses and energy
  • Quantization: eight-bit integers instead of floating point (a minimal sketch follows this list)
  • Minimalism: the TPU drops features required by CPUs and GPUs that DNNs do not use, making the TPU cheaper while saving energy and allowing transistors to be repurposed for domain-specific on-chip memory (better peak performance)
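A minimal sketch of one common symmetric-quantization recipe (the paper does not prescribe an exact scheme; the helper below is illustrative): float32 values are mapped onto int8 so the MXU can use cheap integer MACs, and the scale is kept to dequantize results.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple:
    """Symmetric linear quantization of a float32 tensor to int8."""
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0  # avoid scale == 0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(256, 256).astype(np.float32)
qw, s = quantize_int8(w)
recovered = qw.astype(np.float32) * s   # dequantized approximation
print("max abs error:", float(np.abs(recovered - w).max()))
```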