Paper Reading: EFFICIENTLY SCALING TRANSFORMER INFERENCE

Original paper: https://arxiv.org/abs/2211.05102

Notes

  • Challenging setting: large deep models, with tight latency targets and long sequence lengths
  • select the best multi-dimensional partitioning techniques optimized for TPU v4 slices
  • the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models
  • multiple query heads share a single key/value head
  • generative inference proceeds one token at a time

What makes generative inference of LLMs challenging

  • large models have a large memory footprint both due to the trained model parameters as well as the transient state needed during decoding.
  • tight latency targets become especially challenging for generative inference given the much lower parallelizability of Transformer generation relative to training
  • inference cost from the attention mechanism scales quadratically with input sequence length

The attention key and value tensors of each layer are referred to as the KV cache.

The large memory footprint gives rise to a large amount of memory traffic to load the parameters and KV cache from high-bandwidth memory (HBM) into the compute cores at each step.

How is the performance of different partitioning strategies affected by changes in model size, sequence length, and number of hardware chips?

  • The latency is the total time for an inference and can be broken down into the time to process the input tokens present at the start of the inference (which we call “prefill”) and the time to autoregressively generate output tokens (which we term “decode”).
  • The throughput of prefill or decode is the number of tokens processed or generated per second.
  • The model FLOPS utilization (MFU) is the ratio of the observed throughput to the theoretical maximum throughput
  • At small batch sizes and sequence lengths, the time to load weights dominates
  • At larger batch sizes and sequence lengths (e.g. 2048+ tokens with batch size 512+), the time to load the KV cache dominates.
  • each matmul performs one multiplication and one addition per pair of input token and parameter values in the forward pass
  • the KV cache (unlike the weights) is unique for each sequence in the batch, so its loading time grows with batch size and sequence length.
  • Both the weight loading part of the memory time and the non-attention compute time are proportional to the model size and inversely proportional to the number of chips.
    • the time needed for chip-to-chip communication decreases less quickly (or not at all) with the number of chips used
    • Lower latency can often be achieved with smaller batch sizes
  • It is most efficient to increase the batch size because larger batches typically result in better MFU
  • prefill can run in parallel over $L_{input}$, but decode must run sequentially over $L_{gen}$
  • In LLM inference, "prefill" and "decode" are the two main phases. Prefill processes all of the input (prompt) tokens in a single pass, so it can run in parallel over the input length; decode then generates output tokens autoregressively, one at a time, so it must run sequentially.
  • all-reduce can be decomposed into two phases: a reduction phase and a broadcast phase (sketched in code after this list)
    • The reduction phase is called reduce-scatter(x)
    • The broadcast phase is called all-gather(x)
    • The all-to-all collective shifts sharding from one tensor dimension to another
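
A minimal NumPy sketch of the decomposition above (the sizes and variable names are illustrative, not the paper's TPU implementation): an all-reduce across chips is simulated as a reduce-scatter followed by an all-gather.

```python
import numpy as np

n_chips, dim = 4, 8          # illustrative; dim must be divisible by n_chips
chunk = dim // n_chips

# Each "chip" holds its own partial sum of the same logical tensor.
partials = [np.random.rand(dim) for _ in range(n_chips)]

# Reduction phase (reduce-scatter): chip i ends up with the fully reduced
# values for its 1/n_chips slice of the dimension.
reduced_shards = [
    sum(p[i * chunk:(i + 1) * chunk] for p in partials)
    for i in range(n_chips)
]

# Broadcast phase (all-gather): every chip collects all reduced shards,
# reconstructing the full all-reduced tensor.
all_reduced = np.concatenate(reduced_shards)

# Sanity check against a direct all-reduce (element-wise sum of partials).
assert np.allclose(all_reduced, sum(partials))
```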

Partitioning the feedforward layer

  • Each weight shard is multiplied by the appropriate activation shard on each chip, and the results are aggregated between the chips with an all-gather and/or reduce-scatter
  • if the first matmul is partitioned by the output axis, the resulting activation shard on each chip will be the exact one needed to compute the second matmul partitioned by the input axis.
  • the communication latency remains roughly constant, independent of the number of chips used
  • 1D weight-stationary partitioning (sketched in code after this list):
    • The weights are kept stationary in each chip, and the activations are transferred between chips to match the weight layout, requiring one all-gather and one reduce-scatter.
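
A NumPy sketch of the 1D weight-stationary idea (illustrative sizes, with ReLU standing in for the real activation): the first matmul is partitioned over its output axis and the second over its input axis, so each chip's hidden shard feeds its own second matmul directly, and only the partial outputs need to be combined across chips.

```python
import numpy as np

n_chips, B, E, F = 4, 2, 8, 32        # illustrative; F divisible by n_chips
shard = F // n_chips

x = np.random.rand(B, E)              # activations, replicated on every chip
W_in = np.random.rand(E, F)           # first FFN matmul, sharded over the output axis
W_out = np.random.rand(F, E)          # second FFN matmul, sharded over the input axis

partial_outputs = []
for c in range(n_chips):              # one iteration per "chip"
    W_in_c = W_in[:, c * shard:(c + 1) * shard]
    W_out_c = W_out[c * shard:(c + 1) * shard, :]
    h_c = np.maximum(x @ W_in_c, 0)   # local hidden shard, no resharding needed
    partial_outputs.append(h_c @ W_out_c)

# Combining the partials plays the role of the reduce-scatter/all-gather.
y = sum(partial_outputs)
assert np.allclose(y, np.maximum(x @ W_in, 0) @ W_out)
```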

Partitioning the attention layer

  • An alternative approach, called multiquery attention
    • still emits $n_{heads}$ heads for the query tensor, but only a single head for the key and value tensors.
    • reduces the size of the KV cache tensors by a factor of $n_{heads}$ (see the sizing sketch after this list)
  • Since the KV cache is orders of magnitude larger than the Q, K, and V tensors, it is very profitable to spend the all-to-all communication time on the small tensors to save the memory time on the large tensors.
  • During prefill
    • we use the sharded-over-heads layout.
    • multiquery attention enables using larger batch sizes and sequence lengths
  • the savings are an order of magnitude compared to multihead attention.
  • MQA reduces memory traffic, but adds parallel communication overhead.
  • larger matrix-multiplications run more efficiently on accelerators.
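
A rough sizing sketch of the multiquery saving (the model shape and 16-bit storage are illustrative assumptions, not the paper's exact configuration):

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    # K and V tensors per layer, one entry per token of every sequence in the batch.
    return 2 * n_layers * batch * seq_len * n_kv_heads * d_head * bytes_per_elem

n_heads, d_head, n_layers = 48, 128, 64   # illustrative model shape
B, L = 512, 2048

mha = kv_cache_bytes(B, L, n_layers, n_kv_heads=n_heads, d_head=d_head)
mqa = kv_cache_bytes(B, L, n_layers, n_kv_heads=1, d_head=d_head)
print(f"multihead: {mha / 2**30:.0f} GiB, multiquery: {mqa / 2**30:.0f} GiB "
      f"(smaller by a factor of {n_heads})")
```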

Low-level optimizations

  • Looped CollectiveEinsum technique

Quantization:

  • reduces communication volume in weight-gathered layouts
  • enables memory time savings from weight loading

These results give us our basic strategy for selecting partitioning layout: during the prefill phase, we select from weight-stationary and weight-gathered layouts based on the current number of tokens in the batch. During the generate phase, we select the 2D weight-stationary layout because the batch size in tokens is always small.

We measure inference cost in chip-seconds per token, which is directly proportional to operational cost and inversely proportional to MFU:

$$cost(\text{chip-seconds per token}) = \frac{n_{chips} \cdot time}{BL}$$
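
A small Python sketch of the two metrics; the timings, model size, and per-chip peak FLOPS below are made-up round numbers for illustration, and the MFU estimate assumes roughly 2 matmul FLOPs per parameter per token.

```python
def chip_seconds_per_token(n_chips, step_time_s, batch_size, seq_len):
    # cost = n_chips * time / (B * L), where B*L is the batch size in tokens.
    return n_chips * step_time_s / (batch_size * seq_len)

def mfu(observed_tokens_per_s, n_params, n_chips, peak_flops_per_chip):
    # MFU = observed throughput / theoretical maximum throughput,
    # assuming ~2 FLOPs per parameter per token.
    max_tokens_per_s = n_chips * peak_flops_per_chip / (2 * n_params)
    return observed_tokens_per_s / max_tokens_per_s

# Example: 64 chips take ~22 s to prefill a batch of 256 sequences of 1024 tokens.
print(chip_seconds_per_token(64, 22.0, 256, 1024))            # ~5.4e-3 chip-seconds/token
print(mfu(observed_tokens_per_s=256 * 1024 / 22.0,            # ~11,900 tokens/s
          n_params=540e9, n_chips=64, peak_flops_per_chip=275e12))  # ~0.73
```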

Key takeaways

  • **PaLM** is a family of large language models developed by Google. Its full name is Pathways Language Model; it is a Google Research effort to build a strong model that can serve as a foundation for many use cases. PaLM has several variants, including **Med-PaLM 2** for life-science and medical information and Sec-PaLM for cybersecurity deployments. PaLM 2 is a version of the PaLM family with improved multilingual, reasoning, and coding capabilities.
  • We estimate an approximately square-root relationship between model size and latency based on Figure 1.

The Pareto frontier (also called the Pareto-optimal set) is the set of solutions to a multi-objective optimization problem for which no objective can be improved without worsening another. These solutions are mutually non-dominated: no other solution is better on all objectives at once. The Pareto frontier is commonly used to understand and visualize the solution space of multi-objective problems and to make trade-offs between competing goals.
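
A tiny sketch of how such a frontier is extracted from a set of (latency, cost) points, lower being better on both axes (the points are made up):

```python
def pareto_frontier(points):
    """Keep the points not dominated by any other point: a point dominates
    another if it is no worse on both objectives and strictly better on one."""
    return [p for p in points
            if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)]

configs = [(10.0, 3.0), (12.0, 2.0), (15.0, 2.5), (9.0, 5.0)]  # (latency, cost)
print(pareto_frontier(configs))   # [(10.0, 3.0), (12.0, 2.0), (9.0, 5.0)]
```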

Batch size is a central concept in neural-network training. It is the number of training samples used in one forward and backward pass. Larger batches need more memory but can speed up training. An epoch means every training sample has gone through one forward and backward pass, and **the batch size determines how many iterations one epoch takes.** For example, with 1000 training samples and a batch size of 500, one epoch takes 2 iterations.

Smaller batch sizes save memory but make the gradient estimate noisier; they also mean more parameter updates, which can help the network converge faster. Larger batch sizes need more memory but reduce compute cost, because fewer updates are needed.
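
The worked example above, written out in Python:

```python
import math

n_samples, batch_size = 1000, 500
print(math.ceil(n_samples / batch_size))   # 2 iterations per epoch
```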

**Feedforward** networks are the basic neural-network structure, also known as **multilayer perceptrons (MLPs)**. Information flows only forward, from the input nodes through one or more hidden layers to the output nodes, with no recurrent connections (in contrast to recurrent neural networks). By stacking layers of neurons with nonlinear activations, an MLP can model complex nonlinear decision boundaries for classification and regression tasks such as sentiment analysis and image recognition. Training typically uses gradient descent to adjust the weights and biases so as to minimize a loss function; in Python, libraries such as TensorFlow and Keras can be used to build and train MLPs.
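
A minimal NumPy sketch of one feedforward (MLP) block with a single hidden layer; the sizes and the ReLU activation are illustrative choices:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    # affine -> nonlinearity -> affine, with data flowing strictly forward
    h = np.maximum(x @ W1 + b1, 0)
    return h @ W2 + b2

E, F = 8, 32                              # embed and hidden (feedforward) widths
x = np.random.rand(4, E)                  # a batch of 4 inputs
W1, b1 = np.random.rand(E, F), np.zeros(F)
W2, b2 = np.random.rand(F, E), np.zeros(E)
print(mlp_forward(x, W1, b1, W2, b2).shape)   # (4, 8)
```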

MLP residual refers to residual connections in a multilayer perceptron. In a plain MLP the data passes through every layer in sequence, whereas with residual connections the input of a block can skip some layers and be added directly to the output of later layers, giving the signal an additional, more direct path through the network.

The main purpose of residual connections is to mitigate the problems that arise when training deep networks, such as vanishing and exploding gradients. With residual connections, training converges more easily even for networks with hundreds of layers, because the skip paths give the gradient a direct, well-structured route back through the network.
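
The skip connection in code form, as a generic sketch rather than any specific model's implementation:

```python
import numpy as np

def residual_block(x, sublayer):
    # The sublayer output is added back to its input, giving the signal
    # (and the gradient) a direct path around the sublayer.
    return x + sublayer(x)

W = np.random.rand(8, 8)
x = np.random.rand(4, 8)
print(residual_block(x, lambda v: np.tanh(v @ W)).shape)   # (4, 8)
```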

**Layer Normalization** is a common normalization technique that improves training speed and generalization. Unlike batch normalization, it estimates the normalization statistics directly from the inputs to the neurons within a hidden layer, so the normalization introduces no new dependencies between training samples. Layer normalization works well for recurrent neural networks (RNNs), improving training time and generalization for many existing RNN models, and more recently it has been used in Transformer models.

The layer-normalization statistics are computed over all hidden units in the same layer: the mean $\mu^l$ is the average of the hidden-unit inputs in that layer, and the standard deviation $\sigma^l$ is the square root of the average squared deviation of those inputs from the mean. All hidden units in a layer share the same $\mu$ and $\sigma$, but different training samples have different normalization statistics. Unlike batch normalization, layer normalization places no constraint on the mini-batch size and can be used in a purely online setting with batch size 1.
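
The statistics described above, written as a small NumPy function (a plain sketch with the usual learnable gain and bias):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each sample over its own hidden units (last axis), so the
    # statistics do not depend on the other samples in the batch.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.random.rand(2, 8)                                         # (batch, hidden)
print(layer_norm(x, gamma=np.ones(8), beta=np.zeros(8)).shape)   # (2, 8)
```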

**FasterTransformer** uses 16–32 NVIDIA A100s with 80GiB HBM

FLOP count is the number of floating-point operations executed by an algorithm or program; it is commonly used to assess performance and computational complexity.
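
For the matmuls that dominate Transformer inference, the count is easy to estimate by hand (one multiply and one add per input/weight pair, as noted earlier); the sizes below are illustrative and use the B/L/E/F notation defined further down in these notes.

```python
def matmul_flops(m, k, n):
    # (m x k) @ (k x n): one multiply and one add per pair -> 2*m*k*n FLOPs
    return 2 * m * k * n

B, L, E, F = 512, 2048, 8192, 32768       # illustrative sizes
ffn_flops = matmul_flops(B * L, E, F) + matmul_flops(B * L, F, E)
print(f"one feedforward layer over the batch: {ffn_flops:.2e} FLOPs")
```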

**Communication volume** is the total amount of data transferred between processing units during a computation. In parallel computing it is used to assess how often, and how much, data an algorithm or program moves between processing units.

we observe that FLOP count and communication volume can fundamentally limit inference performance of dense Transformer models.

AIGC stands for "Artificial Intelligence Generated Content". AIGC applications use AI models (natural-language models, image-generation models, audio-synthesis models, and so on) to automatically generate text, images, audio, and other content, opening new possibilities for the creative industries, media, and advertising.

Mixture of Experts (MoE) improves function approximation by replacing a single global model with a weighted sum of local models (experts). The problem domain is divided into sub-domains (for example by clustering), a local expert is trained on each sub-domain, and a gating mechanism recombines the experts' outputs; classical formulations rely on Gaussian mixture models trained with expectation-maximization (EM). For regression, the steps are clustering, local expert training, and recombination.
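A dense-gating NumPy sketch of the "weighted sum of experts" idea (production MoE layers route sparsely; the expert and gate shapes here are made up for illustration):

```python
import numpy as np

def moe_forward(x, experts, gate_W):
    # A softmax gate produces per-example weights; the output is the
    # weighted sum of every expert's output (dense gating for clarity).
    logits = x @ gate_W
    gates = np.exp(logits - logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    expert_outs = np.stack([e(x) for e in experts], axis=-1)   # (batch, dim, n_experts)
    return (expert_outs * gates[:, None, :]).sum(axis=-1)

dim, n_experts = 8, 4
experts = [(lambda W: (lambda v: np.tanh(v @ W)))(np.random.rand(dim, dim))
           for _ in range(n_experts)]
x = np.random.rand(2, dim)
print(moe_forward(x, experts, gate_W=np.random.rand(dim, n_experts)).shape)   # (2, 8)
```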
B: batch size (number of sequences)

L: sequence length

E: model embed dimension

F: MLP feedforward dimension

1D weight-stationary partitioning communication time:
$$T_{comm}=\frac{2BLE}{\text{network bandwidth}}$$
2D weight-stationary partitioning: The total compute cost is the same as 1D weight-stationary, but communication is much more efficient.

the communication time scales as $O\!\left(\frac{1}{\sqrt{n_{chips}}}\right)$

  • we can continue to reduce latency by adding more chips, because communication time continues to reduce
  • 2D weight-stationary becomes more communication-efficient when $\sqrt{n_{chips}} > \frac{d_{ff}}{d_{model}}$; since typically $d_{ff} = 4\,d_{model}$, this occurs when $n_{chips} > 16$
  • because of their 2D layout, the activation communication includes two all-gathers and reduce-scatters.
  • 2D weight-stationary communication time (compared numerically with 1D in the sketch after this list):
    $$T_{comm}=\frac{8BLE}{\sqrt{n_{chips}}\cdot\text{network bandwidth}}$$
  • The output of each per-chip matrix multiplication must then be aggregated between chips to be used as input to the subsequent operations.
  • Different inference configurations are chosen depending on application goals.
  • BL corresponds to the total batch size in tokens.
  • Model compression techniques:
    • quantization (to speed up the model)
    • pruning
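
Plugging the two communication-time formulas into Python to see the crossover around 16 chips (the bandwidth and tensor sizes are illustrative placeholders, with BLE treated directly as bytes):

```python
def comm_time_1d(B, L, E, bandwidth):
    # 1D weight-stationary: T_comm = 2BLE / bandwidth (independent of n_chips)
    return 2 * B * L * E / bandwidth

def comm_time_2d(B, L, E, n_chips, bandwidth):
    # 2D weight-stationary: T_comm = 8BLE / (sqrt(n_chips) * bandwidth)
    return 8 * B * L * E / (n_chips ** 0.5 * bandwidth)

B, L, E = 512, 1, 8192            # one decode step: a single token per sequence
bandwidth = 100e9                 # bytes/s per chip, a made-up round number
for n_chips in (4, 16, 64, 256):
    t1 = comm_time_1d(B, L, E, bandwidth)
    t2 = comm_time_2d(B, L, E, n_chips, bandwidth)
    print(f"{n_chips:4d} chips: 1D {t1 * 1e6:.0f} us vs 2D {t2 * 1e6:.0f} us")
```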

Summary

  • The lowest cost is achieved at batch sizes larger than about 512.
  • More details on the relationship between model size and MFU are given in the paper's figures.
  • As early as 2021, a joint paper from Google Research and OpenAI, "Sparse is Enough in Scaling Transformers", showed that sparse computation can deliver speedups of tens of times for large models.
  • The Pathways architecture applies the same sparse-computation principle: when executing a task it activates only the relevant parts of the model and computes only the elements that are actually useful, which is the essence of sparse computation.