Paper Reading (Part 1): Full Stack Optimization of Transformer Inference: a Survey

PEAKKIZZA

Article link: https://blog.csdn.net/peakkizza/article/details/135874391

Paper link: https://arxiv.org/pdf/2302.14017.pdf

Notes

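Term: speech recognition (语音识别)
The paper surveys different approaches for efficient Transformer inference:
- analyzing and profiling the bottlenecks in existing Transformer architectures, and their similarities and differences with earlier convolutional models
- the implications of the Transformer architecture for hardware, including the impact of nonlinear operations (such as Layer Normalization, Softmax, and GELU) as well as linear operations on hardware design
- approaches for optimizing a fixed Transformer architecture
- challenges in finding the right mapping and scheduling of operations for Transformer models
- approaches for optimizing Transformer models by adapting the architecture with neural architecture search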

Key takeaways

CPUs and GPUs are both commonly used in general-performance computing platforms
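Deep learning models are composed of a small number of distinct operations that are repeated millions or billions of times, and they do not need much flexibility.
Although modern CPUs and GPUs can execute multiple operations in parallel, they cannot exploit the massive data-reuse opportunities in deep learning models.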

Quotes

software frameworks:

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

compilers:

NVIDIA. TensorRT: https://developer.nvidia.com/tensorrt, 2018.

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.

Amit Sabne. XLA: Compiling machine learning for peak performance. 2020.

academia:

Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert Ou, Colin Schmidt, Samuel Steffl, John Wright, Ion Stoica, Jonathan Ragan-Kelley, Krste Asanovic, Borivoje Nikolic, and Yakun Sophia Shao. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In Proceedings of the 58th Annual Design Automation Conference (DAC), 2021.

Transformers and large language models:

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

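On Transformer inference:

Serving Transformers efficiently is challenging, especially due to their growing size and run-time complexity.
Transformers are mostly composed of matrix multiplications (matmuls) together with memory-intensive nonlinear operations.
The computation graph of a Transformer is considerably more complex than that of a CNN, with more types of operator nodes and more dataflow splits and concatenations.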

contribution

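The paper analyzes the run-time characteristics of Transformers and surveys methods for improving Transformer inference performance:
- analysis and profiling of the run-time characteristics and bottlenecks of the Transformer architecture
- hardware architectures for Transformer inference, including the impact of the Transformer's nonlinear operations on their design
- optimization strategies, such as pruning and quantization, that further improve the performance of a fixed Transformer architecture
- designing and adapting Transformer architectures for hardware efficiency through an automated neural architecture search process
The surveyed methods are then applied in a case study on Gemmini:
- When running Transformers on a CNN domain-specific accelerator, the main bottleneck is not necessarily the linear operations, but the time spent on floating-point nonlinear operations and on quantization and dequantization operations.
- **For Transformer accelerators, it is often better to have a larger accumulator size and smaller scratchpad size.** The accumulator holds intermediate results (partial sums of the outputs), while the scratchpad holds the operands and temporary data being worked on.
- Scheduling matmuls in Transformers involves only 3 loops, compared with 6 for convolutions in CNNs, yet choosing appropriate scheduling decisions still matters: the best and worst schedules can differ in performance by up to four orders of magnitude.
- Fusing Batch Normalization with the neighboring convolution is straightforward for CNNs. In Transformer architectures, fusing Layer Normalization with the preceding matmul imposes constraints on the mapping, particularly on tile sizes, and in some cases the run-time cost of these mapping constraints can outweigh the gains from operator fusion.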
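To make the "3 loops versus 6 loops" remark concrete, here is a minimal sketch of the two naive loop nests. It assumes the six convolution loops are output channel, output row, output column, input channel, kernel row, and kernel column, with the batch dimension left out; the function names are illustrative and do not come from the paper or from Gemmini.

```python
import numpy as np


def matmul_3_loops(a, b):
    """Naive matmul with 3 nested loops: output row, output column, reduction."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                out[i, j] += a[i, p] * b[p, j]
    return out


def conv2d_6_loops(x, w):
    """Naive stride-1, unpadded convolution with 6 nested loops.

    x has shape (C_in, H, W); w has shape (C_out, C_in, KH, KW).
    """
    c_in, height, width = x.shape
    c_out, c_in2, kh, kw = w.shape
    assert c_in == c_in2, "input channel counts must match"
    out_h, out_w = height - kh + 1, width - kw + 1
    out = np.zeros((c_out, out_h, out_w))
    for co in range(c_out):
        for oy in range(out_h):
            for ox in range(out_w):
                for ci in range(c_in):
                    for dy in range(kh):
                        for dx in range(kw):
                            out[co, oy, ox] += x[ci, oy + dy, ox + dx] * w[co, ci, dy, dx]
    return out
```

A scheduler chooses how to tile, reorder, and parallelize these loops. Even with only three loops, the number of possible tilings and orderings is large, which is consistent with the wide performance spread the notes mention.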

1 Performance Bottlenecks

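1.1 MHA stage

The input is first multiplied by three different weight matrices, which produces three different activations: the query, key, and value activations. These are then split into multiple ($h$) chunks, each with hidden dimension $\frac{d}{h}$, and the chunks are forwarded to different attention heads. In each head, the query and key chunks are multiplied along the hidden dimension, generating an activation matrix of size $l \times l$. This matrix is passed through the Softmax operation (the result is often called the attention score) and multiplied with the value chunk, yielding an activation of hidden dimension $\frac{d}{h}$. The activations from all attention heads are concatenated along the hidden dimension to produce a single activation of hidden dimension $d$, which the last linear layer then projects to the same dimension with the weight matrix $W_{out}$. Finally, the output of this last linear layer is passed through the LayerNorm operator before being added to the residual connection to obtain the MHA module output.
An input sequence to the Transformer block is composed of $l$ tokens, each represented by a vector of $d$ dimension, forming a $d \times l$ matrix.
In other words, the input is made up of tokens and forms a matrix of hidden dimension $d$ by sequence length $l$.
A token is a segment of an input sequence.
For example, when the input is a sentence, a token may be a word or a sentence fragment.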
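An MHA module consists of six linear operations. Four of them are identical weight-to-activation matmuls (the $W_{Q}$, $W_{K}$, $W_{V}$ and $W_{out}$ projections); the remaining two are activation-to-activation matmuls (query × key and attention score × value).
- The paper calls the first type projections and the second type activation-to-activation matmuls (act-to-act matmuls for short), because the two have different run-time behavior.

  Figure: all types of linear layers in a Transformer block in both MHA and FFN modules.
  $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$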
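The six linear operations described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the input is stored as an $(l, d)$ array (the transpose of the $d \times l$ layout in the notes), biases, LayerNorm, and the residual connection are omitted, and the name `mha_linear_ops` is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def mha_linear_ops(x, w_q, w_k, w_v, w_out, h):
    """Multi-head attention on x of shape (l, d) with h heads.

    Returns the projected output (l, d) and the attention scores (h, l, l).
    """
    l, d = x.shape
    d_head = d // h

    # Projections 1-3: weight-to-activation matmuls.
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    def split_heads(t):
        # (l, d) -> (h, l, d/h): one chunk of the hidden dimension per head.
        return t.reshape(l, h, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    # Act-to-act matmul 1: query x key, giving an l x l matrix per head.
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head), axis=-1)

    # Act-to-act matmul 2: attention score x value.
    context = scores @ v  # (h, l, d/h)

    # Concatenate the heads along the hidden dimension, then projection 4.
    concat = context.transpose(1, 0, 2).reshape(l, d)
    return concat @ w_out, scores
```

The four projections multiply an activation by a fixed weight matrix, whereas both operands of the two act-to-act matmuls are computed at run time, which is why the notes treat them separately.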

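1.2 FFN stage

- The FFN module is a relatively simple block consisting of two linear layers.

The input sequence is first projected from the hidden dimension $d$ to a higher FFN dimension $d_{FFN}$ by the first linear layer with weight matrix $W_{1}$. The projected sequence is then projected back to the original dimension $d$ by the second linear layer with weight matrix $W_{2}$. Typically, $d_{FFN}$ is chosen to be 4 times larger than $d$, which gives $W_{1}$ and $W_{2}$ a 4:1 aspect ratio (e.g., in BERT-base). In between these two linear layers is a non-linear layer.

GELU: Gaussian Error Linear Unit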
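A minimal sketch of the FFN module, assuming GELU as the nonlinearity between the two linear layers (as in BERT), the widely used tanh approximation of GELU, and no bias terms. The names `gelu` and `ffn` are illustrative.

```python
import numpy as np


def gelu(x):
    """tanh approximation of the Gaussian Error Linear Unit."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def ffn(x, w1, w2):
    """Feed-forward module: project d -> d_ffn, apply GELU, project d_ffn -> d.

    x has shape (l, d), w1 has shape (d, d_ffn), w2 has shape (d_ffn, d).
    """
    return gelu(x @ w1) @ w2
```

With $d = 768$ and $d_{FFN} = 4d = 3072$, `w1` would have shape (768, 3072) and `w2` shape (3072, 768), which is the 4:1 aspect ratio mentioned above.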

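1.2.1 Nonlinear operations

Several nonlinear operations, such as Softmax, LayerNorm, and GELU, require specialized support or off-chip computation. They account for a small share of all operations, but they are harder to optimize and can incur significant overhead if not handled appropriately. This is because they require multiple passes over all input values, which means those values have to be held in temporary memory.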
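The Softmax operator involves:
- exponential operations
- summing the results across the sequence length dimension
- normalizing the input by dividing it by the summation result
The exponential function is prone to numerical overflow, so Transformer implementations use the maximum subtraction trick, which rewrites
  $\exp(x_{i}) / \sum_{j} \exp(x_{j})$ into $\exp(x_{i} - x_{max}) / \sum_{j} \exp(x_{j} - x_{max})$
  where $x_{max}$ is the maximum of the $x_{j}$'s.
However, this requires an additional pass over the inputs, resulting in a three-pass numerically stable implementation. Computing LayerNorm also requires multiple passes over the entire input across the hidden dimension. In the first pass, the mean must be computed. In the second pass, the mean is used to compute the **standard deviation**. In the third pass, the normalization is actually applied, which requires one division per input value.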
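The three passes for each operator can be written out explicitly. The sketch below uses plain loops over a 1-D input so that each pass is visible; it assumes a small `eps` term in LayerNorm for numerical safety and omits the learned scale and shift, neither of which is discussed in the notes. The function names are illustrative.

```python
import numpy as np


def softmax_three_pass(x):
    """Numerically stable softmax over a 1-D input, one loop per pass."""
    x = np.asarray(x, dtype=np.float64)

    # Pass 1: find the maximum (the extra pass the subtraction trick adds).
    x_max = -np.inf
    for v in x:
        x_max = max(x_max, v)

    # Pass 2: sum the shifted exponentials.
    total = 0.0
    for v in x:
        total += np.exp(v - x_max)

    # Pass 3: normalize each element by the sum.
    return np.array([np.exp(v - x_max) / total for v in x])


def layernorm_three_pass(x, eps=1e-5):
    """LayerNorm over a 1-D input without learned scale or shift."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x)

    # Pass 1: mean.
    mean = 0.0
    for v in x:
        mean += v
    mean /= n

    # Pass 2: standard deviation, which needs the mean from pass 1.
    var = 0.0
    for v in x:
        var += (v - mean) ** 2
    std = np.sqrt(var / n + eps)

    # Pass 3: normalize, one division per input value.
    return np.array([(v - mean) / std for v in x])
```

Every pass reads the full input again, so the input has to stay in some buffer until the last pass finishes. Without the maximum subtraction, `np.exp(1000.0)` would overflow to infinity in double precision.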
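Nonlinear operations also pose challenges for operator fusion, a technique that reduces interlayer communication by combining multiple operators into a single one. Unlike batch normalization (BatchNorm) in CNNs, which can be absorbed into the preceding linear operation, LayerNorm requires computing the mean and variance of its inputs at run time. To fuse it with the preceding matmul, the entire output matrix must be accumulated across the reduction dimension (the dimension over which the mean and variance are computed) before any result is written out, which leads to irregular tiling dimensions and lower data reuse.
Tradeoff:
- fusing these operations with the previous layer versus using better tiling dimensions to maximize reuse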
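The contrast between BatchNorm and LayerNorm can be sketched as follows. The example assumes a plain linear layer `y = x @ w + b` as a stand-in for the convolution, inference-time BatchNorm with fixed running statistics, and an `eps` term; the function names are illustrative.

```python
import numpy as np


def fold_batchnorm(w, b, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold inference-time BatchNorm into the preceding layer y = x @ w + b.

    w has shape (d_in, d_out); all other arguments have shape (d_out,).
    The statistics are constants, so the fold happens once, offline.
    """
    scale = gamma / np.sqrt(running_var + eps)
    return w * scale, (b - running_mean) * scale + beta


def layernorm_rows(y, eps=1e-5):
    """LayerNorm over the hidden dimension of each row of y.

    The mean and variance depend on y itself, so they cannot be folded into
    the weights ahead of time; a whole row must be available first.
    """
    mean = y.mean(axis=-1, keepdims=True)
    var = y.var(axis=-1, keepdims=True)
    return (y - mean) / np.sqrt(var + eps)
```

After folding, `x @ w_folded + b_folded` reproduces the BatchNorm output with no extra run-time work. `layernorm_rows` needs every element of an output row before it can normalize any of them, which is the accumulation requirement that constrains tiling.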

1.2.2 Encoder and Decoder Architectures

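In the machine translation setting, the encoder takes the entire source language sentence as input and passes it through multiple Transformer encoder blocks, extracting high-level features of the input sequence. These extracted features are then fed into the decoder, which generates the tokens in the target language based on the source language features from the encoder and on the tokens it has generated so far. There are also encoder-only and decoder-only architectures.
Encoder Block
- BERT, RoBERTa, and XLNet are encoder-only architectures.
- The encoder-only structure is suitable for natural language understanding tasks,
- for example sentiment analysis and sentence similarity analysis, where the entire input sequence is fed into the model.
The inference is composed of matrix-matrix multiplications as well as element-wise additions and nonlinear operations.
- The cost of the projection layers in the MHA and FFN modules scales linearly with the input sequence length $l$.
- In the MHA module, the act-to-act matmuls scale quadratically with the sequence length.
- For short sequence lengths, the projection layers dominate, making the overall complexity of the encoder block $O(l)$.
- For long sequence lengths, the act-to-act matmuls dominate, making the overall complexity $O(l^{2})$.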
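A rough operation count shows where the two regimes meet. The sketch below counts only the matmuls of one encoder block, treats one multiply-add as 2 FLOPs, assumes $d_{FFN} = 4d$ by default, and ignores the nonlinear operations; the function name is illustrative.

```python
def encoder_block_matmul_flops(l, d, d_ffn=None):
    """Return (projection FLOPs, act-to-act FLOPs) for one encoder block.

    Projections: 4 matmuls of (l x d)(d x d) in MHA and 2 matmuls between
    d and d_ffn in FFN. Act-to-act: query x key and score x value, which
    cost 2 * l * l * d each when summed over all heads.
    """
    if d_ffn is None:
        d_ffn = 4 * d
    proj_mha = 4 * 2 * l * d * d
    proj_ffn = 2 * 2 * l * d * d_ffn
    act_to_act = 2 * 2 * l * l * d
    return proj_mha + proj_ffn, act_to_act
```

Under these assumptions the projections cost $24ld^{2}$ and the act-to-act matmuls cost $4l^{2}d$, so the two are equal at $l = 6d$. For $d = 768$ that is $l = 4608$: below it the linear term dominates, above it the quadratic term does.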
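Decoder Block
- It is autoregressive by nature,
- which means the output at a given time step depends on the output of the previous time step.
- The model predicts the next token from the tokens generated so far, so inference has to proceed sequentially and iteratively, producing one token at a time.
- It is suitable for natural language generation tasks.
- The input prompt tokens can be consumed in parallel before the model starts generating subsequent tokens.
- Unlike the encoder block, which operates on the entire input sequence, the decoder block runs inference one token at a time.
- The projection operators are applied only to the input token, which results in a matrix-vector multiplication with constant cost.
- In the act-to-act matmuls, the input token attends to all previous tokens, so these operations scale linearly with sequence length.
- Processing a token at a later time step therefore takes more computation than processing one at an earlier time step.
A key detail to note is that the full key and value activations must be present for the input token to attend to all previously generated tokens.
- A common optimization for token generation is to **cache the intermediate keys and values of previously generated tokens and reuse them in subsequent iterations**, which avoids recomputing them at every step.
- Overall, the end-to-end complexity of generating the full sequence scales linearly for the projection layers and quadratically for the two act-to-act matmuls.

  Figure: end-to-end computation graph of one iteration of a single Transformer decoder layer.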
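The key/value caching idea can be sketched for a single attention head. This is a minimal illustration under stated assumptions: one head, no output projection, no LayerNorm or residual connection, and a plain dict of lists as the cache. The name `decode_step` and the cache layout are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def decode_step(x_t, w_q, w_k, w_v, cache):
    """Attention output for the newest token x_t of shape (d,).

    Only x_t is projected (matrix-vector products with constant cost). The
    keys and values of earlier tokens are read from the cache, not recomputed.
    """
    q = x_t @ w_q
    cache["k"].append(x_t @ w_k)
    cache["v"].append(x_t @ w_v)

    keys = np.stack(cache["k"])    # (t, d): grows by one row per step
    values = np.stack(cache["v"])  # (t, d)

    # Act-to-act work is linear in the number of tokens seen so far.
    scores = softmax(keys @ q / np.sqrt(q.shape[0]))
    return scores @ values
```

Starting from `cache = {"k": [], "v": []}` and calling `decode_step` once per token, step $t$ performs attention work proportional to $t$. Summed over a whole generated sequence this gives the quadratic end-to-end cost for the act-to-act matmuls, while the projections stay linear.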