TPUv1: A Single-Chip DSA for DL Inference

from A Domain-Specific Architecture for Deep Neural Networks | September 2018 | Communications of the ACM

Limits of General Purpose Architectures

1. End of Moore's Law

2. End of Dennard Scaling

3. Lack of further ILP and the diminishing returns of multicore scaling, as quantified by Amdahl's Law

==> at the heart of everything:

Since transistors are not getting much better (reflecting the end of Moore's Law), the peak power per mm² of chip area is increasing (due to the end of Dennard scaling), but the power budget per chip is not increasing (due to electromigration and mechanical and thermal limits), and chip designers have already played the multi-core card (which is limited by Amdahl's Law), architects now widely believe the only path left for major improvements in performance-cost-energy is domain-specific architectures. They do only a few tasks but do them extremely well.

TPU V1

 To reduce the risk of delaying deployment, Google engineers designed the TPU to be a coprocessor on the I/O bus rather than be tightly integrated with a CPU, allowing it to plug into existing servers just as a GPU does. Moreover, to simplify hardware design and debugging, the host server sends TPU instructions for it to execute rather than fetch them itself. The TPU is thus closer in spirit to a floating-point unit (FPU) coprocessor than it is to a GPU.

Key component: the matrix multiply unit (MXU) is the heart of the TPU, with 256×256 MACs that can perform eight-bit multiply-and-adds on signed or unsigned integers.
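To make the arithmetic concrete, here is a minimal NumPy sketch (an assumption of the scheme, not Google's hardware) of what one MXU pass computes; the 32-bit accumulation matches the paper's 32-bit accumulators, and signed int8 is one of the two operand types the unit supports:

```python
import numpy as np

def mxu_matmul(activations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Emulate the MXU arithmetic: 8-bit multiplies whose products are
    summed into 32-bit accumulators (no floating point involved)."""
    assert activations.dtype == np.int8 and weights.dtype == np.int8
    # Widen before multiplying so products and sums cannot overflow int8.
    return activations.astype(np.int32) @ weights.astype(np.int32)

# One 256x256 tile, mirroring the MXU's dimensions.
rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
w = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
acc = mxu_matmul(a, w)   # shape (256, 256), dtype int32
```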

Eight-bit integer multiplies can require 6× less energy and take 6× less area than IEEE 754 16-bit floating-point multiplies, and the edge for integer addition is 13× in energy and 38× in area; see: High-Performance Hardware for Machine Learning

==> optimal performance requires keeping the matmul unit busy

====> to do so, instructions that access the Unified Buffer (UB) are decoupled-access/execute: they can complete once their address has been sent, before the data are actually fetched
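A toy software analogue of that decoupling, assuming a bounded FIFO between a prefetching "access" thread and a computing "execute" loop (the helper names and tile contents here are made up for illustration; the TPU does this in hardware):

```python
import threading, queue
import numpy as np

tile_q: "queue.Queue[np.ndarray]" = queue.Queue(maxsize=4)  # weight FIFO

def access_unit(n_tiles: int) -> None:
    """Access side: issues weight fetches ahead of the compute unit."""
    for i in range(n_tiles):
        # Stand-in for a DRAM read issued long before the data are needed.
        tile_q.put(np.full((256, 256), i % 127, dtype=np.int8))

def execute_unit(n_tiles: int, act: np.ndarray) -> None:
    """Execute side: consumes tiles as they arrive, overlapping compute
    with the fetches still in flight."""
    for _ in range(n_tiles):
        w = tile_q.get()   # blocks only if the prefetcher fell behind
        _ = act.astype(np.int32) @ w.astype(np.int32)

act = np.ones((256, 256), dtype=np.int8)
t = threading.Thread(target=access_unit, args=(8,))
t.start()
execute_unit(8, act)
t.join()
```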

====> matmul adopts "systolic execution" to reduce reads and writes from/to SRAMs to save energy

see Systolic Array
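A minimal emulation of a weight-stationary systolic matmul (a sketch of the general scheme, not the TPU's exact pipeline). The comments map each loop to its systolic role; in silicon the loops run concurrently as a pipeline, and each operand is read from SRAM once rather than once per MAC:

```python
import numpy as np

def systolic_matmul(A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Weight-stationary emulation: PE (k, n) holds W[k, n] permanently;
    activations enter from the left, partial sums trickle down a column."""
    M, K = A.shape
    K2, N = W.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.int32)
    for m in range(M):          # one activation row per wavefront
        for n in range(N):      # one column of PEs
            partial = 0         # partial sum entering the top of the column
            for k in range(K):  # PE (k, n) fires as its operands arrive
                partial += int(A[m, k]) * int(W[k, n])
            C[m, n] = partial   # drained into the 32-bit accumulators
    return C

A = np.arange(6, dtype=np.int8).reshape(2, 3)
W = np.arange(12, dtype=np.int8).reshape(3, 4)
assert np.array_equal(systolic_matmul(A, W),
                      A.astype(np.int32) @ W.astype(np.int32))
```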

Performance

==> 1. notice that the GPU (red) has a very high baseline ops/sec, due to its highly parallel design

==> 2. the TPU has a much higher ops/sec ceiling than CPUs/GPUs of the same era

====> reaching the ceiling becomes harder, which leaves more room for operational-intensity optimization

==> 3. under the slanted part of the roof, performance is memory-bound

====> the memory-transfer unit(s) are kept busy ==> not enough ops are performed per byte of data moved ==> optimizing the ALU, i.e., doing ops faster (more ops/sec), is pointless under the slope ==> add more ops per byte moved until we get under the "roof" ==> only then consider how to keep the main ALU busy all the time (at or near the peak-performance limit)

======> or, when operational intensity cannot be increased, change the architecture to be data-centered ==> raise the "slope" itself by reducing latency to increase memory-access throughput
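The reasoning above is the roofline model: attainable throughput is the minimum of the compute ceiling and the memory slope times operational intensity. A minimal calculator, plugged with TPUv1's peak (92 T ops/s) and DDR3 bandwidth (~34 GB/s) as reported in the paper:

```python
def roofline(peak_ops_per_s: float, mem_bw_bytes_per_s: float,
             ops_per_byte: float) -> float:
    """Attainable throughput = min(compute ceiling, slope * intensity)."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * ops_per_byte)

tpu_peak = 92e12   # TPUv1 peak: 92 T 8-bit ops/s (from the paper)
tpu_bw   = 34e9    # TPUv1 DDR3 bandwidth: ~34 GB/s (from the paper)
for oi in (10, 100, 1000, 3000):
    print(oi, roofline(tpu_peak, tpu_bw, oi) / 1e12, "T ops/s")
# Ridge point: 92e12 / 34e9 ≈ 2700 ops/byte. Below it the workload is
# memory-bound, and only raising ops/byte moves it toward the ceiling.
```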

==> 4. throughput vs. latency

====> optimizing throughput to reach the ceiling usually costs individual-request latency

====> TPUv1 is a single-threaded, non-parallel architecture ==> the matmul unit itself exploits DLP, but there are no parallel matmul units ==> general-purpose architectures rely heavily on sophisticated parallel structures to speed up the common case, so they are not ideal for the fringe requirements of certain tasks (here the mismatch manifests as wasted potential), while the TPU, as a DSA, is minimal, designed specifically for the task in question, and performs better.

======> the TPU has only one processor, while the K80 has 13, and it is much easier to meet a rigid latency target with a single thread.

======> "even" in the sense that we expect the TPU to specialize in handling large quantities of certain data, where the GPU and CPU fail to compete

Design

Ideas that did not fly for general-purpose computing may be ideal for domain-specific architectures.
For the TPU, three important architectural features date back to the early 1980s:

systolic arrays,

decoupled-access/execute, [33]

complex instruction set computer (CISC) instructions. [29]

The first reduced the area and power of the large matrix multiply unit; the second fetches weights concurrently during operation of the matrix multiply unit; and the third better utilizes the limited bandwidth of the PCIe bus for delivering instructions. History-aware, domain-specific architects could thus have a competitive edge.
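As an illustration of the CISC point: the paper names about a dozen instructions, of which these five do most of the work. The enum and dispatch loop below are my sketch; only the instruction names come from the paper:

```python
from enum import Enum, auto

class TpuOp(Enum):
    """Five key TPU instructions; each does a lot of work per fetch,
    which suits the narrow PCIe command path."""
    READ_HOST_MEMORY  = auto()  # host DRAM -> Unified Buffer
    READ_WEIGHTS      = auto()  # weight DRAM -> weight FIFO
    MATRIX_MULTIPLY   = auto()  # UB activations x weights -> accumulators
    ACTIVATE          = auto()  # nonlinearity: accumulators -> UB
    WRITE_HOST_MEMORY = auto()  # Unified Buffer -> host DRAM

# The host drives the device: it streams instructions over PCIe and the
# TPU executes them in order (no instruction fetch or branching on chip).
program = [TpuOp.READ_HOST_MEMORY, TpuOp.READ_WEIGHTS,
           TpuOp.MATRIX_MULTIPLY, TpuOp.ACTIVATE, TpuOp.WRITE_HOST_MEMORY]
for op in program:
    print("issue", op.name)   # stand-in for a PCIe write to the TPU
```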

Key Features for Energy-Performance Gap

  • One processor ==> easier to meet latency limits
  • Large, two-dimensional multiply unit ==> matrix multiplies benefit from two-dimensional hardware
  • Systolic arrays: the two-dimensional organization enables systolic arrays, reducing register accesses and energy
  • Quantization: eight-bit integers instead of floating point (a minimal sketch follows this list)
  • Minimalism: the TPU drops features required by CPUs and GPUs that DNNs do not use, making the TPU cheaper while saving energy and allowing transistors to be repurposed for domain-specific on-chip memory (better peak performance)
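A minimal sketch of one common symmetric-quantization recipe (the paper does not prescribe an exact scheme; the helper below is illustrative): float32 values are mapped onto int8 so the MXU can use cheap integer MACs, and the scale is kept to dequantize results.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple:
    """Symmetric linear quantization of a float32 tensor to int8."""
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0  # avoid scale == 0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(256, 256).astype(np.float32)
qw, s = quantize_int8(w)
recovered = qw.astype(np.float32) * s   # dequantized approximation
print("max abs error:", float(np.abs(recovered - w).max()))
```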