Paper Reading (Part 3): Full Stack Optimization of Transformer Inference: a Survey

PEAKKIZZA

Article link: https://blog.csdn.net/peakkizza/article/details/135911434

Original paper: https://arxiv.org/pdf/2302.14017.pdf

Hardware Design

2.1 Classic DNN Accelerators

Components of a typical deep learning accelerator:

Off-chip DRAM used for holding the weights and activations of the full network, which needs to be large enough to hold all model weights and activations;
Smaller on-chip memory, referred to here as the global buffer, which needs to be large enough to hold a subset of the weights and inputs in order to feed the processing elements (PEs);
An array of PEs, each of which has the capability to perform MAC operations, and which often contains one or more small local memories called register files (RFs) that can store data with lower per-access energy than the global buffer; and
An internal network-on-chip (NoC) that transfers data between PEs.
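Figure: the structure of a typical DNN accelerator

The global buffer is sized to hold enough weights and activations to enable data reuse and to limit transfers from off-chip DRAM.

Local storage in the PEs provides further data reuse and reduces accesses to the global buffer.

Without reuse, each MAC operation has to load three operands:
- the two values being multiplied and the current partial sum (the partially accumulated value for a given location in the output matrix)
Reads from a local buffer are roughly 6 times as expensive as a single MAC operation.
Reads from external DRAM are roughly 200 times as expensive.
Key point: exploit every opportunity for data reuse.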
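
To make the cost ratios above concrete, here is a minimal sketch of the relative energy of one MAC depending on where its three operands come from. The `reuse_factor` parameter and the decision to ignore the write-back of the result are simplifying assumptions for illustration, not part of the paper's model.

```python
# Relative energy costs taken from the notes above (unit: one MAC operation).
MAC_COST = 1.0
LOCAL_BUFFER_READ_COST = 6.0   # "roughly 6 times" a MAC
DRAM_READ_COST = 200.0         # "roughly 200 times" a MAC

OPERANDS_PER_MAC = 3  # two multiplicands plus the current partial sum


def energy_per_mac(read_cost: float, reuse_factor: float = 1.0) -> float:
    """Relative energy of one MAC when every operand is fetched at `read_cost`.

    `reuse_factor` is a hypothetical knob: each fetched operand is assumed to
    serve that many MACs, so the fetch cost is amortized across them.
    The write-back of the result is ignored.
    """
    if reuse_factor < 1:
        raise ValueError("reuse_factor must be >= 1")
    return MAC_COST + OPERANDS_PER_MAC * read_cost / reuse_factor


if __name__ == "__main__":
    print("all operands from DRAM:        ", energy_per_mac(DRAM_READ_COST))
    print("all operands from local buffer:", energy_per_mac(LOCAL_BUFFER_READ_COST))
    print("DRAM with 16x operand reuse:   ", energy_per_mac(DRAM_READ_COST, 16))
```

Under these assumptions the memory reads, not the arithmetic, dominate the energy, which is why the dataflows below are organized around reuse.
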
Two classes of dataflow maximize data reuse:
- **Temporal dataflows**
  - Temporal dataflows contain an array of centrally-controlled PEs that load data from the global buffer and perform the requested ALU (Arithmetic Logic Unit) operations before writing the data back to the global buffer.
  - These PEs do not contain local memories, and there is no communication or data movement between PEs.
    - e.g., Single-Instruction Multiple Data (SIMD) execution units
    - e.g., Single-Instruction Multiple Thread (SIMT) execution units
  - Both fully connected and convolutional layers are mapped to matrix-matrix multiplications.
- **Spatial dataflows**
  - the PEs can communicate and data can be moved between PEs to leverage data reuse, without repeated reads from the global buffer
  - The PEs themselves often contain RFs to hold weights or partial sums locally to improve data reuse, and additional reuse can be attained through passing data between adjacent PEs.
  - Spatial dataflows are commonly used in FPGA and ASIC-based accelerators, especially for convolutional networks
    
    **ASIC (Application-Specific Integrated Circuit)**: an integrated circuit built for one specific application domain. ASICs are commonly used to accelerate particular computing tasks such as deep learning or video encoding.
  - Weight stationary dataflows minimize the number of reads required for weight matrices by keeping weights in the local RFs in PEs and streaming through inputs
  - Output stationary dataflows minimize energy from reading and writing partial sums by accumulating the outputs in the local RFs in the PEs
  - No local reuse dataflows have no RF in each PE, and use the area savings from having no RFs to allocate a larger global buffer
  - Row stationary dataflows maximize reuse for both partial sums and weights by keeping a row of weights stationary in a PE, streaming in the inputs, and streaming out the partial sums
  Note that since DNNs consist of sequences of layers, it is also possible to fuse operations in order to further leverage data reuse across multiple layers
  
  Typical DNN accelerators consist of on-chip memory for holding a subset of model weights and inputs and an array of processing elements (PEs) which can perform MAC operations. Off-chip DRAM is used for holding weights and activations for the full network, and an internal network-on-chip (NoC) can be used for transferring data between PEs. DNN accelerators typically aim to leverage either temporal dataflows (by performing the same operation in parallel on several datapoints) or spatial dataflows (where data can be transferred between PEs to leverage additional reuse opportunities). Spatial dataflow reuse schemes include weight stationary dataflows, which hold weights in local memories in the PEs to improve reuse.
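
The difference between weight stationary and output stationary dataflows can be sketched as two loop orders over the same matmul. This is a deliberately simplified model with a single abstract PE, and the counters `weight_reads` and `psum_accesses` are illustrative names; a real accelerator spreads these loops across a PE array.

```python
def weight_stationary(A, B):
    """Compute C = A @ B, fetching each weight once and holding it while inputs stream by."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    weight_reads = 0    # weight fetches from the global buffer
    psum_accesses = 0   # partial sums moved in or out of the PE
    for k in range(K):
        for n in range(N):
            w = B[k][n]              # held in the PE's register file
            weight_reads += 1
            for m in range(M):       # stream inputs past the stationary weight
                C[m][n] += A[m][k] * w
                psum_accesses += 1   # the partial sum leaves the PE after each MAC
    return C, weight_reads, psum_accesses


def output_stationary(A, B):
    """Compute C = A @ B, accumulating each output locally and writing it out once."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    weight_reads = 0
    psum_accesses = 0
    for m in range(M):
        for n in range(N):
            acc = 0                  # partial sum stays in the register file
            for k in range(K):
                acc += A[m][k] * B[k][n]
                weight_reads += 1    # the weight is re-fetched for every MAC
            C[m][n] = acc
            psum_accesses += 1       # single write of the finished output
    return C, weight_reads, psum_accesses


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]   # inputs, 4 x 3
    B = [[1, 2], [3, 4], [5, 6]]                       # weights, 3 x 2
    print(weight_stationary(A, B)[1:])   # (weight reads, partial-sum accesses)
    print(output_stationary(A, B)[1:])
```

Both orders produce the same result; they only trade weight traffic against partial-sum traffic, which is the choice the list above describes.
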

2.2 Adapting DNN Accelerators for Transformers

One difference between accelerators for CNNs and for Transformers is that due to differences in terms of arithmetic intensity and matrix dimensions, these models have different optimal sizes for each level of the memory hierarchy as well as different memory bandwidth requirements.
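Another consideration is how nonlinear functions are computed during inference. Either:
- the accelerator provides additional on-chip compute support for these operators, or
- the operators are offloaded to the CPU.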
  
  Several accelerators for Transformer inference contain specialized post-processing units for nonlinear functions
- Adding such a unit increases the accelerator's area.
  
  Accelerators specialized for the MHA module are designed to match the dataflow of the MHA module, where all the operations are “fused” together
- This approach is less flexible, but it improves performance by reducing data accesses.
  
  Recall that operation fusion refers to using the output values from one operation (e.g., a matmul) directly as input to the following operation (e.g., a Softmax layer) without writing the intermediate values to off-chip memory.
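- The placement of nonlinear functions within the dataflow must also be considered, because nonlinear operators incur many MOPs (memory operations) despite their small FLOP counts.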
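
As a rough illustration of why fusion matters for the matmul-then-Softmax pattern, the sketch below counts off-chip bytes for the l × l score matrix of one attention head, with and without fusion. The byte widths (4-byte scores, 1-byte probabilities) are assumptions chosen for the example, and the reads of the query and key inputs are left out because they are the same in both cases.

```python
def attention_score_traffic(seq_len: int, fused: bool,
                            score_bytes: int = 4, prob_bytes: int = 1) -> int:
    """Bytes moved to or from off-chip memory for one head's l x l score matrix."""
    elements = seq_len * seq_len
    if fused:
        # Scores stay on-chip; only the Softmax output is written out.
        return elements * prob_bytes
    matmul_write = elements * score_bytes   # matmul flushes raw scores
    softmax_read = elements * score_bytes   # Softmax reads them back
    softmax_write = elements * prob_bytes   # Softmax writes probabilities
    return matmul_write + softmax_read + softmax_write


if __name__ == "__main__":
    for fused in (False, True):
        print("fused" if fused else "unfused", attention_score_traffic(512, fused))
```
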
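Summary:

Accelerators for Transformers and for CNNs have different optimal sizes for each level of the memory hierarchy and different memory bandwidth requirements. Accelerators for the MHA module tend to use hardened datapaths that exploit operator fusion. End-to-end Transformer accelerators tend not to design their datapaths around the graph-level dataflow of the MHA module. Transformer accelerators tend to integrate a post-processing unit so that nonlinear functions can be computed efficiently on-chip.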

2.3 Analytical Model

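- The modeled accelerator has local memory and an array of processing elements for computing tiled matrix-matrix multiplications, and it relies on external memory to store all model parameters.
- The performance estimate assumes that compute time and memory-operation time overlap perfectly.
- Note that double buffering was assumed in the scratchpad to ensure that compute could be overlapped with memory reads/writes wherever possible.
- Special function units compute the required nonlinear operations, so none of these computations has to be done off-chip.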
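
The perfect-overlap assumption means an operation's latency is the larger of its compute time and its memory time. A minimal sketch follows; the `Accelerator` fields and the example numbers are hypothetical, not the paper's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    macs_per_cycle: int      # hypothetical: MACs the PE array completes per cycle
    bytes_per_cycle: float   # hypothetical: external-memory bandwidth per cycle


def op_latency_cycles(macs: int, bytes_moved: int, hw: Accelerator) -> float:
    """Latency of one operation when compute and memory traffic overlap perfectly."""
    compute_cycles = macs / hw.macs_per_cycle
    memory_cycles = bytes_moved / hw.bytes_per_cycle
    return max(compute_cycles, memory_cycles)


def bottleneck(macs: int, bytes_moved: int, hw: Accelerator) -> str:
    """Report which side determines the latency."""
    compute_cycles = macs / hw.macs_per_cycle
    memory_cycles = bytes_moved / hw.bytes_per_cycle
    return "compute" if compute_cycles >= memory_cycles else "memory"


if __name__ == "__main__":
    hw = Accelerator(macs_per_cycle=256, bytes_per_cycle=16)
    print(op_latency_cycles(256_000, 1_600, hw), bottleneck(256_000, 1_600, hw))
    print(op_latency_cycles(2_560, 16_000, hw), bottleneck(2_560, 16_000, hw))
```

Summing this per-operation latency over all operations of a layer gives an end-to-end estimate and a breakdown by operation type.
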

Latency Breakdown and End-to-end Latency：

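Assuming square tiling for all matrix operations and no operation fusion:
- Every operation reads its inputs from external memory and flushes its outputs back out.
- Compared with profiling results on a CPU, the analytical model shows similar trends in how runtime latency scales and how it breaks down.
- Note that the analytical model assumes a hardware architecture that differs from a CPU, so runtime behavior is not necessarily identical across hardware platforms.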

Non-ideal Arithmetic Intensity

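Arithmetic intensity gives a rough estimate of how much data reuse is possible for each operation under ideal conditions.
- When a matrix is larger than the local memory, it has to be tiled, which lowers arithmetic intensity further.
- On a GPU, tiling reduces accesses to global memory by staging data in shared memory, which makes the kernel more efficient.
- After tiling (the numbers here correspond to a 32 × 32 matmul with 16 × 16 tiles and 4-byte elements), the total work is unchanged at 2 x 32 x 32 x 32 flop. However, every element in shared memory is used 16 times, so the total global memory traffic drops by 16× to 2 x 32 x 32 x 32 / 16 accesses. The computation-to-memory ratio becomes 4 flop/byte, 16× higher than before. In practice, the tile size can be tuned to raise this ratio further.
- Tiling loads elements from global memory into shared memory so that they can be used many times. In general, with tiles of K × K elements, the number of global memory accesses drops by a factor of K.
- In general, for input matrices of dimension N and tile size K, each inner product is computed over N/K iterations. These iterations are what cut global memory traffic: each one works on a small tile of the inputs, so the threads can cooperatively load the tile into shared memory and serve all repeated reads in that iteration from there. This concentrated access pattern is called locality. An algorithm with locality can use a small, fast memory to serve most accesses and remove them from global memory.
- Note also that the tile buffers (A_tile and B_tile in a typical tiled kernel) are reused over and over: every iteration stores its tiles in the same A_tile and B_tile, which is why a small shared memory can absorb most of the global memory accesses.
- By taking hardware details into account, analytical modeling gives a more accurate estimate: the non-ideal arithmetic intensity.
- The non-ideal arithmetic intensity is lower than the ideal one, because of tiling effects and because the large 32-bit output activations must be stored and loaded before the nonlinear operations.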
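
The tiling arithmetic above can be checked with a small pure-Python sketch that mimics a tiled kernel and counts loads from "global memory". It runs sequentially on the CPU, so `A_tile` and `B_tile` only stand in for shared-memory buffers; the 4-byte element size is an assumption matching the 4 flop/byte figure.

```python
def tiled_matmul(A, B, tile):
    """Square matmul C = A @ B computed tile by tile, counting global-memory loads."""
    n = len(A)
    if n % tile != 0:
        raise ValueError("tile size must divide the matrix dimension")
    C = [[0.0] * n for _ in range(n)]
    global_loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):          # N / K iterations per output tile
                # Stage one tile of each input (stand-ins for shared memory).
                A_tile = [A[i0 + i][k0:k0 + tile] for i in range(tile)]
                B_tile = [B[k0 + k][j0:j0 + tile] for k in range(tile)]
                global_loads += 2 * tile * tile
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):     # every staged element is reused `tile` times
                            acc += A_tile[i][k] * B_tile[k][j]
                        C[i0 + i][j0 + j] += acc
    return C, global_loads


def flop_per_byte(n, global_loads, bytes_per_element=4):
    """Computation-to-memory ratio of an n x n matmul for a given number of loads."""
    return (2 * n ** 3) / (global_loads * bytes_per_element)


if __name__ == "__main__":
    n = 32
    A = [[(i + j) % 5 for j in range(n)] for i in range(n)]
    B = [[(i * j) % 3 for j in range(n)] for i in range(n)]
    for tile in (1, 16):
        _, loads = tiled_matmul(A, B, tile)
        print(f"tile={tile:2d} loads={loads} ratio={flop_per_byte(n, loads)} flop/byte")
```

A tile size of 1 corresponds to the untiled case in which every MAC loads both operands from global memory.
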

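Summary:

Analytical modeling is a useful tool for identifying the bottlenecks and runtime characteristics of DNN inference on a target hardware platform. It is especially useful at design time, when profiling real hardware may be difficult but analysis is still needed to make design decisions. The paper gives examples of using analytical modeling to obtain the latency breakdown and the non-ideal arithmetic intensity. In particular, it shows that once hardware constraints and implementation details are considered, the non-ideal arithmetic intensity of Transformers can be lower than the ideal value by up to 2.5×.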
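
One of the two effects named above, the 32-bit output activations, can be illustrated with a short calculation of matmul arithmetic intensity (FLOPs divided by bytes moved). The function and the example shape are assumptions for illustration; the sketch does not model tiling, so it is not meant to reproduce the 2.5× figure.

```python
def matmul_arithmetic_intensity(m: int, k: int, n: int,
                                in_bytes: int = 1, out_bytes: int = 1) -> float:
    """FLOPs per byte for (m x k) @ (k x n), reading both inputs and writing the output once."""
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n) * in_bytes + m * n * out_bytes
    return flops / bytes_moved


if __name__ == "__main__":
    # Hypothetical query x key shape: sequence length 512, head dimension 64.
    ideal = matmul_arithmetic_intensity(512, 64, 512, in_bytes=1, out_bytes=1)
    wide_out = matmul_arithmetic_intensity(512, 64, 512, in_bytes=1, out_bytes=4)
    print(ideal, wide_out, ideal / wide_out)
```
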

2.4 Case Study: Building a Transformer Accelerator

Baseline Accelerator:
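The **Scratchpad** in the figure:

A scratchpad memory (SPM) is built from three parts: an SRAM array, address decoding logic, and data output circuitry. It is connected to the processor by a high-speed on-chip bus.

A typical cache has these three parts plus a tag RAM and comparison logic, so its access latency and energy are slightly higher than those of an SPM.

Characteristics:
The SPM is addressed together with main memory: it occupies its own non-overlapping range of the same address space, so data in the SPM can be accessed directly without a tag RAM.

Compared with a cache, an SPM has no tag storage and no address comparators, so the hardware is simpler. In the same process technology, the area of an SPM is typically about 65% of that of a cache.

Advantages:
Smaller area, lower power, and faster access than a cache. (Where does the speed come from? The unified addressing?) The biggest advantage of an SPM is that the programmer can control it flexibly.

Because the SPM shares an address space with main memory, the processor accesses it directly. Unlike a cache it never misses, and such accesses never go to main memory, which is why it is both low-power and fast.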
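
The structural difference can be sketched with two toy classes: a direct-mapped cache that compares a tag on every access and may miss, and a scratchpad that is just an address range indexed directly. Both classes are hypothetical teaching models (one word per line, no timing), not a description of any specific hardware.

```python
class DirectMappedCache:
    """Toy direct-mapped cache with one word per line."""

    def __init__(self, num_lines, main_memory):
        self.num_lines = num_lines
        self.tags = [None] * num_lines   # plays the role of the tag RAM
        self.data = [None] * num_lines
        self.main_memory = main_memory
        self.misses = 0

    def read(self, addr):
        index = addr % self.num_lines
        tag = addr // self.num_lines
        if self.tags[index] != tag:      # tag comparison on every access
            self.misses += 1             # miss: fetch the word from main memory
            self.tags[index] = tag
            self.data[index] = self.main_memory[addr]
        return self.data[index]


class Scratchpad:
    """Toy scratchpad: a fixed address range, indexed directly, software-managed."""

    def __init__(self, base_addr, size):
        self.base_addr = base_addr
        self.cells = [0] * size

    def _offset(self, addr):
        offset = addr - self.base_addr
        if not 0 <= offset < len(self.cells):
            raise IndexError("address is outside the scratchpad range")
        return offset

    def write(self, addr, value):
        self.cells[self._offset(addr)] = value   # the program decides what lives here

    def read(self, addr):
        return self.cells[self._offset(addr)]    # no tag check, no miss


if __name__ == "__main__":
    memory = list(range(100, 116))
    cache = DirectMappedCache(num_lines=4, main_memory=memory)
    for addr in (0, 4, 0):               # addresses 0 and 4 conflict on line 0
        cache.read(addr)
    print("cache misses:", cache.misses)

    spm = Scratchpad(base_addr=0x1000, size=4)
    spm.write(0x1001, 7)
    print("scratchpad read:", spm.read(0x1001))
```

The cache decides by itself what to keep and can evict useful data on a conflict, whereas the scratchpad holds exactly what the program placed in it.
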

The accelerator performs matrix operations on a 16 × 16 systolic array that implements a weight-stationary dataflow.
When performing convolutions, the dimensions of the output-channels and input-channels are spatially unrolled.
The 8-bit integer weights and inputs are stored in a 256 kB local scratchpad memory.
32-bit partial sums are stored in a dual-ported 64 kB accumulator SRAM which performs matrix additions.
SoC stands for system-on-chip.
Native support for these operations on the accelerator is important because it removes the need for costly round trips between DRAM or outer caches (where the CPU can perform these operations) and the local scratchpad (where Gemmini stores its matrix operands).
The performance on Transformer workloads such as BERT is severely limited due to the need to perform operations such as GELU, LayerNorm, and Softmax on the CPU.
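
To get a feel for the baseline's on-chip capacities, the short calculation below converts the buffer sizes into element counts and 16 × 16 tiles. It assumes "kB" means 1024 bytes, which the notes do not state.

```python
KIB = 1024  # assumption: "kB" in the notes is read as 1024 bytes

SCRATCHPAD_BYTES = 256 * KIB    # holds 8-bit weights and inputs
ACCUMULATOR_BYTES = 64 * KIB    # holds 32-bit partial sums
ARRAY_DIM = 16                  # 16 x 16 systolic array


def capacity_in_elements(capacity_bytes: int, bytes_per_element: int) -> int:
    """Number of elements of the given width that fit in a buffer."""
    return capacity_bytes // bytes_per_element


def capacity_in_tiles(capacity_bytes: int, bytes_per_element: int) -> int:
    """Number of ARRAY_DIM x ARRAY_DIM tiles that fit in a buffer."""
    return capacity_in_elements(capacity_bytes, bytes_per_element) // (ARRAY_DIM * ARRAY_DIM)


if __name__ == "__main__":
    print("scratchpad int8 elements: ", capacity_in_elements(SCRATCHPAD_BYTES, 1))
    print("scratchpad 16x16 tiles:   ", capacity_in_tiles(SCRATCHPAD_BYTES, 1))
    print("accumulator int32 elements:", capacity_in_elements(ACCUMULATOR_BYTES, 4))
    print("accumulator 16x16 tiles:   ", capacity_in_tiles(ACCUMULATOR_BYTES, 4))
```
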

Performance Bottlenecks：

MACs:

MACs stands for multiply-accumulate operations. One MAC consists of one multiplication and one addition, so it counts as roughly 2 FLOPs. FLOPs are therefore usually about 2× MACs.
Note that over 99% of FLOPs in our Transformer inference are MACs for matmuls.
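
As a quick worked example of the MAC/FLOP relation, the helpers below count both for a matmul. The shape used (a 128 × 768 input times a 768 × 768 projection) is only an illustrative choice.

```python
def matmul_macs(m: int, k: int, n: int) -> int:
    """MACs for (m x k) @ (k x n): one multiply-accumulate per (row, column, inner index)."""
    return m * k * n


def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for the same matmul, counting each MAC as one multiply plus one add."""
    return 2 * matmul_macs(m, k, n)


if __name__ == "__main__":
    print(matmul_macs(128, 768, 768), matmul_flops(128, 768, 768))
```
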
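Floating-point adders and multipliers consume orders of magnitude more energy than their integer counterparts.
In the baseline CNN accelerator, matmuls are performed with INT8 inputs, but the values **must be dequantized and requantized between matmul operations** so that the floating-point activations can run on the CPU, which adds further energy and latency overhead.
Transformers, unlike CNNs, primarily perform matmuls, often with small and/or rectangular matrices, which have significantly lower arithmetic intensity and different optimal tiling strategies.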
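
The dequantize, activate, requantize round trip can be sketched as follows. The symmetric per-tensor scales and all function names are assumptions made for this illustration; the notes do not specify the quantization scheme.

```python
import math


def quantize(x: float, scale: float) -> int:
    """Float -> INT8 with a symmetric scale, clamped to the int8 range."""
    return max(-128, min(127, round(x / scale)))


def gelu(x: float) -> float:
    """Exact floating-point GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def int8_matvec(W_q, x_q):
    """INT8 matrix-vector product; Python ints stand in for 32-bit accumulators."""
    return [sum(w * x for w, x in zip(row, x_q)) for row in W_q]


def layer_with_float_activation(W_q, x_q, w_scale, x_scale, out_scale):
    acc = int8_matvec(W_q, x_q)                          # integer matmul on the accelerator
    real = [a * w_scale * x_scale for a in acc]          # dequantize: int32 -> float
    activated = [gelu(v) for v in real]                  # float nonlinearity (the CPU step)
    return [quantize(v, out_scale) for v in activated]   # requantize: float -> int8


if __name__ == "__main__":
    W_q = [[10, -20], [30, 40]]
    x_q = [5, 6]
    print(layer_with_float_activation(W_q, x_q, w_scale=0.1, x_scale=0.1, out_scale=0.05))
```

Every layer repeats the two conversion steps, which is the overhead that an integer-only activation removes.
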

Memory Hierarchy：

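Simply resizing the input/weight scratchpad and the 32-bit partial-sum accumulator significantly improves BERT's matmul performance.
The query × key matmul produces an l × l output activation matrix, which for long sequence lengths is much larger than the l × d / h input query and key matrices. Increasing the size of the accumulation buffer therefore improves output reuse for these operations.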
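
A small calculation shows how lopsided the footprints are. The dimensions below (l = 512, d = 768, h = 12) and the byte widths are example values chosen here, not numbers from the notes.

```python
def qk_footprint(seq_len: int, d_model: int, num_heads: int,
                 in_bytes: int = 1, acc_bytes: int = 4):
    """Bytes for one head's query, key, and l x l partial-sum output of query x key."""
    d_head = d_model // num_heads
    query_bytes = seq_len * d_head * in_bytes
    key_bytes = seq_len * d_head * in_bytes
    output_bytes = seq_len * seq_len * acc_bytes   # held as 32-bit partial sums
    return query_bytes, key_bytes, output_bytes


if __name__ == "__main__":
    q, k, out = qk_footprint(seq_len=512, d_model=768, num_heads=12)
    print(f"query={q} B, key={k} B, output={out} B, output/query={out // q}x")
```

With these example values the output needs far more accumulator space than either input needs scratchpad space, which matches the direction of the resizing described above.
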

Hardware-Software Co-Design:

Only 1% of time is actually spent on matmuls.
The rest goes to floating-point nonlinear activations, normalization, and quantization/dequantization operations, because these are offloaded to the CPU.
To alleviate the overhead of runtime quantization and dequantization, the workload is switched from a naive BERT implementation, where only matmuls are quantized, to I-BERT, an integer-only variant.
*The idea is to replace floating-point nonlinear operations such as GELU and Softmax with integer polynomial approximations such that they are both faster and cheaper to implement in specialized hardware accelerators.*
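
The polynomial idea can be illustrated with a second-order approximation of erf inside GELU. The coefficients below are the ones I recall from the I-BERT paper and should be treated as an assumption to verify against it. The sketch also evaluates the polynomial in floating point for readability, whereas an integer-only kernel would carry out the same polynomial on quantized values with folded scale factors.

```python
import math

# Assumed coefficients of the second-order erf approximation (verify against I-BERT).
A_COEF = -0.2888
B_COEF = -1.769


def poly_erf(x: float) -> float:
    """erf(x) ~ sign(x) * (a * (min(|x|, -b) + b)^2 + 1)."""
    sign = (x > 0) - (x < 0)
    clipped = min(abs(x), -B_COEF)       # beyond -b the approximation saturates at +/-1
    return sign * (A_COEF * (clipped + B_COEF) ** 2 + 1.0)


def poly_gelu(x: float) -> float:
    """GELU with erf replaced by the polynomial approximation."""
    return x * 0.5 * (1.0 + poly_erf(x / math.sqrt(2.0)))


def exact_gelu(x: float) -> float:
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    grid = [i / 100.0 for i in range(-400, 401)]
    worst = max(abs(poly_gelu(x) - exact_gelu(x)) for x in grid)
    print("max abs error on [-4, 4]:", worst)
```

Because the approximation only needs additions, multiplications, a clip, and a sign, it maps onto integer arithmetic units much more easily than an exact erf.
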

Summary:

The bottleneck non-matmul operations running on the CPU take 96% of total runtime
Activation functions performed in floating point require repeated dequantization and requantization
The lower arithmetic intensity nature of transformer inference is more sensitive to non-optimized memory hierarchy.

Solutions:
Reduced scratchpad capacity in favor of an increase in accumulator size, which enabled higher output reuse and improved memory efficiency
Switched to I-BERT, an integer version of BERT that approximates floating point activations, eliminating quantization overhead
Added special normalizer units and activation units that offload GELU, LayerNorm and Softmax computations from the CPU