Paper Reading: EFFICIENTLY SCALING TRANSFORMER INFERENCE

Original paper: https://arxiv.org/abs/2211.05102

Notes

  • Challenging setting: large deep models, with tight latency targets and long sequence lengths
  • select the best multi-dimensional partitioning techniques optimized for TPU v4 slices
  • the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models
  • multiple query heads share a single key/value head
  • generative inference proceeds one token at a time

What makes generative inference of LLMs challenging

  • large models have a large memory footprint both due to the trained model parameters as well as the transient state needed during decoding.
  • tight latency targets become especially challenging for generative inference given the much lower parallelizability of Transformer generation relative to training
  • inference cost from the attention mechanism scales quadratically with input sequence length

The attention key and value tensors of each layer are referred to as the KV cache.

The large memory footprint gives rise to a large amount of memory traffic to load the parameters and KV cache from high-bandwidth memory (HBM) into the compute cores at each step.

How is the performance of different partitioning strategies affected by changes in model size, sequence length, and number of hardware chips?

  • The latency is the total time for an inference and can be broken down into the time to process the input tokens present at the start of the inference (which we call “prefill”) and the time to autoregressively generate output tokens (which we term “decode”).
  • The throughput of prefill or decode is the number of tokens processed or generated per second.
  • The model FLOPS utilization (MFU) is the ratio of the observed throughput to the theoretical maximum throughput
  • At small batch sizes and sequence lengths, the time to load weights dominates
  • At larger batch sizes and sequence lengths (e.g. 2048+ tokens with batch size 512+), the time to load the KV cache dominates.
  • each matmul performs one multiplication and one addition per pair of input token and parameter values in the forward pass
  • the KV cache (unlike the weights) is unique for each sequence in the batch, so its loading time grows with batch size and sequence length.
  • Both the weight loading part of the memory time and the non-attention compute time are proportional to the model size and inversely proportional to the number of chips.
    • the time needed for chip-to-chip communication decreases less quickly (or not at all) with the number of chips used
    • Lower latency can often be achieved with smaller batch sizes
  • It is most efficient to increase the batch size because larger batches typically result in better MFU
  • prefill can run in parallel over $L_{input}$, but decode must run sequentially over $L_{gen}$
  • In LLM inference, "prefill" and "decode" are the two main phases. Prefill processes all of the input (prompt) tokens in a single pass, so it can run in parallel over the input length; decode then generates output tokens autoregressively, one at a time, so it must run sequentially.
  • all-reduce can be decomposed into two phases: a reduction phase and a broadcast phase (sketched in code after this list)
    • The reduction phase is called reduce-scatter(x)
    • The broadcast phase is called all-gather(x)
    • The all-to-all collective shifts sharding from one tensor dimension to another
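
A minimal NumPy sketch of the decomposition above (the sizes and variable names are illustrative, not the paper's TPU implementation): an all-reduce across chips is simulated as a reduce-scatter followed by an all-gather.

```python
import numpy as np

n_chips, dim = 4, 8          # illustrative; dim must be divisible by n_chips
chunk = dim // n_chips

# Each "chip" holds its own partial sum of the same logical tensor.
partials = [np.random.rand(dim) for _ in range(n_chips)]

# Reduction phase (reduce-scatter): chip i ends up with the fully reduced
# values for its 1/n_chips slice of the dimension.
reduced_shards = [
    sum(p[i * chunk:(i + 1) * chunk] for p in partials)
    for i in range(n_chips)
]

# Broadcast phase (all-gather): every chip collects all reduced shards,
# reconstructing the full all-reduced tensor.
all_reduced = np.concatenate(reduced_shards)

# Sanity check against a direct all-reduce (element-wise sum of partials).
assert np.allclose(all_reduced, sum(partials))
```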

Partitioning the feedforward layer

  • Each weight shard is multiplied by the appropriate activation shard on each chip, and the results are aggregated between the chips with an all-gather and/or reduce-scatter
  • if the first matmul is partitioned by the output axis, the resulting activation shard on each chip will be the exact one needed to compute the second matmul partitioned by the input axis.
  • the communication latency remains roughly constant, independent of the number of chips used
  • 1D weight-stationary partitioning (sketched in code after this list):
    • The weights are kept stationary in each chip, and the activations are transferred between chips to match the weight layout, requiring one all-gather and one reduce-scatter.
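
A NumPy sketch of the 1D weight-stationary idea (illustrative sizes, with ReLU standing in for the real activation): the first matmul is partitioned over its output axis and the second over its input axis, so each chip's hidden shard feeds its own second matmul directly, and only the partial outputs need to be combined across chips.

```python
import numpy as np

n_chips, B, E, F = 4, 2, 8, 32        # illustrative; F divisible by n_chips
shard = F // n_chips

x = np.random.rand(B, E)              # activations, replicated on every chip
W_in = np.random.rand(E, F)           # first FFN matmul, sharded over the output axis
W_out = np.random.rand(F, E)          # second FFN matmul, sharded over the input axis

partial_outputs = []
for c in range(n_chips):              # one iteration per "chip"
    W_in_c = W_in[:, c * shard:(c + 1) * shard]
    W_out_c = W_out[c * shard:(c + 1) * shard, :]
    h_c = np.maximum(x @ W_in_c, 0)   # local hidden shard, no resharding needed
    partial_outputs.append(h_c @ W_out_c)

# Combining the partials plays the role of the reduce-scatter/all-gather.
y = sum(partial_outputs)
assert np.allclose(y, np.maximum(x @ W_in, 0) @ W_out)
```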

Partitioning the attention layer

  • An alternative approach, called multiquery attention
    • still emits $n_{heads}$ heads for the query tensor, but only a single head for the key and value tensors.
    • reduces the size of the KV cache tensors by a factor of $n_{heads}$ (see the sizing sketch after this list)
  • Since the KV cache is orders of magnitude larger than the Q, K, and V tensors, it is very profitable to spend the all-to-all communication time on the small tensors to save the memory time on the large tensors.
  • During prefill
    • we use the sharded-over-heads layout.
    • multiquery attention enables using larger batch sizes and sequence lengths
  • the savings are an order of magnitude compared to multihead attention.
  • MQA reduces memory traffic, but adds parallel communication overhead.
  • larger matrix-multiplications run more efficiently on accelerators.
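
A rough sizing sketch of the multiquery saving (the model shape and 16-bit storage are illustrative assumptions, not the paper's exact configuration):

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    # K and V tensors per layer, one entry per token of every sequence in the batch.
    return 2 * n_layers * batch * seq_len * n_kv_heads * d_head * bytes_per_elem

n_heads, d_head, n_layers = 48, 128, 64   # illustrative model shape
B, L = 512, 2048

mha = kv_cache_bytes(B, L, n_layers, n_kv_heads=n_heads, d_head=d_head)
mqa = kv_cache_bytes(B, L, n_layers, n_kv_heads=1, d_head=d_head)
print(f"multihead: {mha / 2**30:.0f} GiB, multiquery: {mqa / 2**30:.0f} GiB "
      f"(smaller by a factor of {n_heads})")
```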

Low-level optimizations

  • Looped CollectiveEinsum technique

Quantization:

  • reduces communication volume in weight-gathered layouts
  • enables memory time savings from weight loading

These results give us our basic strategy for selecting partitioning layout: during the prefill phase, we select from weight-stationary and weight-gathered layouts based on the current number of tokens in the batch. During the generate phase, we select the 2D weight-stationary layout because the batch size in tokens is always small.

We measure inference cost in chip-seconds per token, which is directly proportional to operational cost and inversely proportional to MFU:

$$cost(\text{chip-seconds per token}) = \frac{n_{chips} \cdot time}{BL}$$
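
A small Python sketch of the two metrics; the timings, model size, and per-chip peak FLOPS below are made-up round numbers for illustration, and the MFU estimate assumes roughly 2 matmul FLOPs per parameter per token.

```python
def chip_seconds_per_token(n_chips, step_time_s, batch_size, seq_len):
    # cost = n_chips * time / (B * L), where B*L is the batch size in tokens.
    return n_chips * step_time_s / (batch_size * seq_len)

def mfu(observed_tokens_per_s, n_params, n_chips, peak_flops_per_chip):
    # MFU = observed throughput / theoretical maximum throughput,
    # assuming ~2 FLOPs per parameter per token.
    max_tokens_per_s = n_chips * peak_flops_per_chip / (2 * n_params)
    return observed_tokens_per_s / max_tokens_per_s

# Example: 64 chips take ~22 s to prefill a batch of 256 sequences of 1024 tokens.
print(chip_seconds_per_token(64, 22.0, 256, 1024))            # ~5.4e-3 chip-seconds/token
print(mfu(observed_tokens_per_s=256 * 1024 / 22.0,            # ~11,900 tokens/s
          n_params=540e9, n_chips=64, peak_flops_per_chip=275e12))  # ~0.73
```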

Key takeaways

  • **PaLM** is a family of large language models developed by Google. Its full name is Pathways Language Model; it is a Google Research effort to build a strong model that can serve as a foundation for many use cases. PaLM has several variants, including **Med-PaLM 2** for life-science and medical information and Sec-PaLM for cybersecurity deployments. PaLM 2 is a version of the PaLM family with improved multilingual, reasoning, and coding capabilities.
  • We estimate an approximately square-root relationship between model size and latency based on Figure 1.

The Pareto frontier (also called the Pareto-optimal set) is the set of solutions to a multi-objective optimization problem for which no objective can be improved without worsening another. These solutions are mutually non-dominated: no other solution is better on all objectives at once. The Pareto frontier is commonly used to understand and visualize the solution space of multi-objective problems and to make trade-offs between competing goals.
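
A tiny sketch of how such a frontier is extracted from a set of (latency, cost) points, lower being better on both axes (the points are made up):

```python
def pareto_frontier(points):
    """Keep the points not dominated by any other point: a point dominates
    another if it is no worse on both objectives and strictly better on one."""
    return [p for p in points
            if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)]

configs = [(10.0, 3.0), (12.0, 2.0), (15.0, 2.5), (9.0, 5.0)]  # (latency, cost)
print(pareto_frontier(configs))   # [(10.0, 3.0), (12.0, 2.0), (9.0, 5.0)]
```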

Batch size is a central concept in neural-network training. It is the number of training samples used in one forward and backward pass. Larger batches need more memory but can speed up training. An epoch means every training sample has gone through one forward and backward pass, and **the batch size determines how many iterations one epoch takes.** For example, with 1000 training samples and a batch size of 500, one epoch takes 2 iterations.

Smaller batch sizes save memory but make the gradient estimate noisier; they also mean more parameter updates, which can help the network converge faster. Larger batch sizes need more memory but reduce compute cost, because fewer updates are needed.
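
The worked example above, written out in Python:

```python
import math

n_samples, batch_size = 1000, 500
print(math.ceil(n_samples / batch_size))   # 2 iterations per epoch
```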

**Feedforward** networks are the basic neural-network structure, also known as **multilayer perceptrons (MLPs)**. Information flows only forward, from the input nodes through one or more hidden layers to the output nodes, with no recurrent connections (in contrast to recurrent neural networks). By stacking layers of neurons with nonlinear activations, an MLP can model complex nonlinear decision boundaries for classification and regression tasks such as sentiment analysis and image recognition. Training typically uses gradient descent to adjust the weights and biases so as to minimize a loss function; in Python, libraries such as TensorFlow and Keras can be used to build and train MLPs.
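
A minimal NumPy sketch of one feedforward (MLP) block with a single hidden layer; the sizes and the ReLU activation are illustrative choices:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    # affine -> nonlinearity -> affine, with data flowing strictly forward
    h = np.maximum(x @ W1 + b1, 0)
    return h @ W2 + b2

E, F = 8, 32                              # embed and hidden (feedforward) widths
x = np.random.rand(4, E)                  # a batch of 4 inputs
W1, b1 = np.random.rand(E, F), np.zeros(F)
W2, b2 = np.random.rand(F, E), np.zeros(E)
print(mlp_forward(x, W1, b1, W2, b2).shape)   # (4, 8)
```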

MLP residual refers to residual connections in a multilayer perceptron. In a plain MLP the data passes through every layer in sequence, whereas with residual connections the input of a block can skip some layers and be added directly to the output of later layers, giving the signal an additional, more direct path through the network.

The main purpose of residual connections is to mitigate the problems that arise when training deep networks, such as vanishing and exploding gradients. With residual connections, training converges more easily even for networks with hundreds of layers, because the skip paths give the gradient a direct, well-structured route back through the network.
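
The skip connection in code form, as a generic sketch rather than any specific model's implementation:

```python
import numpy as np

def residual_block(x, sublayer):
    # The sublayer output is added back to its input, giving the signal
    # (and the gradient) a direct path around the sublayer.
    return x + sublayer(x)

W = np.random.rand(8, 8)
x = np.random.rand(4, 8)
print(residual_block(x, lambda v: np.tanh(v @ W)).shape)   # (4, 8)
```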

**Layer Normalization** is a common normalization technique that improves training speed and generalization. Unlike batch normalization, it estimates the normalization statistics directly from the inputs to the neurons within a hidden layer, so the normalization introduces no new dependencies between training samples. Layer normalization works well for recurrent neural networks (RNNs), improving training time and generalization for many existing RNN models, and more recently it has been used in Transformer models.

The layer-normalization statistics are computed over all hidden units in the same layer: the mean $\mu^l$ is the average of the hidden-unit inputs in that layer, and the standard deviation $\sigma^l$ is the square root of the average squared deviation of those inputs from the mean. All hidden units in a layer share the same $\mu$ and $\sigma$, but different training samples have different normalization statistics. Unlike batch normalization, layer normalization places no constraint on the mini-batch size and can be used in a purely online setting with batch size 1.
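
The statistics described above, written as a small NumPy function (a plain sketch with the usual learnable gain and bias):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each sample over its own hidden units (last axis), so the
    # statistics do not depend on the other samples in the batch.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.random.rand(2, 8)                                         # (batch, hidden)
print(layer_norm(x, gamma=np.ones(8), beta=np.zeros(8)).shape)   # (2, 8)
```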

**FasterTransformer** uses 16–32 NVIDIA A100s with 80GiB HBM

FLOP count is the number of floating-point operations executed by an algorithm or program; it is commonly used to assess performance and computational complexity.
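
For the matmuls that dominate Transformer inference, the count is easy to estimate by hand (one multiply and one add per input/weight pair, as noted earlier); the sizes below are illustrative and use the B/L/E/F notation defined further down in these notes.

```python
def matmul_flops(m, k, n):
    # (m x k) @ (k x n): one multiply and one add per pair -> 2*m*k*n FLOPs
    return 2 * m * k * n

B, L, E, F = 512, 2048, 8192, 32768       # illustrative sizes
ffn_flops = matmul_flops(B * L, E, F) + matmul_flops(B * L, F, E)
print(f"one feedforward layer over the batch: {ffn_flops:.2e} FLOPs")
```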

**Communication volume** is the total amount of data transferred between processing units during a computation. In parallel computing it is used to assess how often, and how much, data an algorithm or program moves between processing units.

we observe that FLOP count and communication volume can fundamentally limit inference performance of dense Transformer models.

AIGC stands for "Artificial Intelligence Generated Content". AIGC applications use AI models (natural-language models, image-generation models, audio-synthesis models, and so on) to automatically generate text, images, audio, and other content, opening new possibilities for the creative industries, media, and advertising.

Mixture of Experts (MoE) improves function approximation by replacing a single global model with a weighted sum of local models (experts). The problem domain is divided into sub-domains (for example by clustering), a local expert is trained on each sub-domain, and a gating mechanism recombines the experts' outputs; classical formulations rely on Gaussian mixture models trained with expectation-maximization (EM). For regression, the steps are clustering, local expert training, and recombination.
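A dense-gating NumPy sketch of the "weighted sum of experts" idea (production MoE layers route sparsely; the expert and gate shapes here are made up for illustration):

```python
import numpy as np

def moe_forward(x, experts, gate_W):
    # A softmax gate produces per-example weights; the output is the
    # weighted sum of every expert's output (dense gating for clarity).
    logits = x @ gate_W
    gates = np.exp(logits - logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    expert_outs = np.stack([e(x) for e in experts], axis=-1)   # (batch, dim, n_experts)
    return (expert_outs * gates[:, None, :]).sum(axis=-1)

dim, n_experts = 8, 4
experts = [(lambda W: (lambda v: np.tanh(v @ W)))(np.random.rand(dim, dim))
           for _ in range(n_experts)]
x = np.random.rand(2, dim)
print(moe_forward(x, experts, gate_W=np.random.rand(dim, n_experts)).shape)   # (2, 8)
```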
B: batch size (number of sequences)

L: sequence length

E: model embed dimension

F: MLP feedforward dimension

1D weight-stationary partitioning communication time:
$$T_{comm}=\frac{2BLE}{\text{network bandwidth}}$$
2D weight-stationary partitioning: The total compute cost is the same as 1D weight-stationary, but communication is much more efficient.

the communication time scales as $O\!\left(\frac{1}{\sqrt{n_{chips}}}\right)$

  • we can continue to reduce latency by adding more chips, because communication time continues to reduce
  • 2D weight-stationary becomes more communication-efficient when $\sqrt{n_{chips}} > \frac{d_{ff}}{d_{model}}$; since typically $d_{ff} = 4\,d_{model}$, this occurs when $n_{chips} > 16$
  • because of their 2D layout, the activation communication includes two all-gathers and reduce-scatters.
  • 2D weight-stationary communication time (compared numerically with 1D in the sketch after this list):
    $$T_{comm}=\frac{8BLE}{\sqrt{n_{chips}}\cdot\text{network bandwidth}}$$
  • The output of each per-chip matrix multiplication must then be aggregated between chips to be used as input to the subsequent operations.
  • Different inference configurations are chosen depending on application goals.
  • BL corresponds to the total batch size in tokens.
  • Model compression techniques:
    • quantization (to speed up the model)
    • pruning
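
Plugging the two communication-time formulas into Python to see the crossover around 16 chips (the bandwidth and tensor sizes are illustrative placeholders, with BLE treated directly as bytes):

```python
def comm_time_1d(B, L, E, bandwidth):
    # 1D weight-stationary: T_comm = 2BLE / bandwidth (independent of n_chips)
    return 2 * B * L * E / bandwidth

def comm_time_2d(B, L, E, n_chips, bandwidth):
    # 2D weight-stationary: T_comm = 8BLE / (sqrt(n_chips) * bandwidth)
    return 8 * B * L * E / (n_chips ** 0.5 * bandwidth)

B, L, E = 512, 1, 8192            # one decode step: a single token per sequence
bandwidth = 100e9                 # bytes/s per chip, a made-up round number
for n_chips in (4, 16, 64, 256):
    t1 = comm_time_1d(B, L, E, bandwidth)
    t2 = comm_time_2d(B, L, E, n_chips, bandwidth)
    print(f"{n_chips:4d} chips: 1D {t1 * 1e6:.0f} us vs 2D {t2 * 1e6:.0f} us")
```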

Summary

  • The lowest cost is achieved at batch sizes larger than about 512.
  • More details on the relationship between model size and MFU are given in the paper's figures.
  • As early as 2021, a joint paper from Google Research and OpenAI, "Sparse is Enough in Scaling Transformers", showed that sparse computation can deliver speedups of tens of times for large models.
  • The Pathways architecture applies the same sparse-computation principle: when executing a task it activates only the relevant parts of the model and computes only the elements that are actually useful, which is the essence of sparse computation.