Paper reading: EFFICIENTLY SCALING TRANSFORMER INFERENCE

This paper examines the challenges of efficient Transformer inference for large deep models: memory footprint, tight latency targets, and long sequence lengths. It studies how different partitioning strategies are affected by model size, sequence length, and the number of hardware chips, with a focus on multiquery attention and memory/communication optimizations. It also touches on Google's PaLM model and model compression techniques such as quantization for improving performance.

Original paper: https://arxiv.org/abs/2211.05102

Notes

  • Challenging setting: large deep models, with tight latency targets and long sequence lengths
  • select the best multi-dimensional partitioning techniques optimized for TPU v4 slices
  • the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models
  • multiple query heads share a single key/value head
  • generative inference proceeds one token at a time
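
A minimal NumPy sketch of this prefill/decode structure, just to make the last note concrete: a toy single-head attention in which prefill fills the KV cache from all prompt tokens at once, and decode appends one K/V entry per generated token. All shapes, weights, and the `attend` helper are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

d_model, n_prompt, n_gen = 16, 8, 4
rng = np.random.default_rng(0)

# Toy projection weights (illustrative only; a real model has many layers/heads).
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def attend(q, K, V):
    # Single-head scaled dot-product attention over all cached positions.
    scores = q @ K.T / np.sqrt(d_model)      # [1, t]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V                         # [1, d_model]

# Prefill: process all prompt tokens at once (parallel over L_input).
prompt = rng.standard_normal((n_prompt, d_model))
K_cache = prompt @ W_k                       # the KV cache holds one K row
V_cache = prompt @ W_v                       # and one V row per token seen

# Decode: generate one token at a time (sequential over L_gen).
x = rng.standard_normal((1, d_model))        # stand-in for the latest token
for _ in range(n_gen):
    q = x @ W_q
    K_cache = np.vstack([K_cache, x @ W_k])  # the KV cache grows every step
    V_cache = np.vstack([V_cache, x @ W_v])
    x = attend(q, K_cache, V_cache)          # toy "next token" representation

print(K_cache.shape)                         # (n_prompt + n_gen, d_model)
```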

What makes generative inference of LLMs challenging

  • large models have a large memory footprint both due to the trained model parameters as well as the transient state needed during decoding.
  • tight latency targets become especially challenging for generative inference given the much lower parallelizability of Transformer generation relative to training
  • inference cost from the attention mechanism scales quadratically with input sequence length

The attention key and value tensors of each layer, which we refer to as the KV cache, must be stored in memory for the duration of decoding.

The large memory footprint gives rise to a large amount of memory traffic to load the parameters and KV cache from high-bandwidth memory (HBM) into the compute cores for each step.
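
A rough back-of-the-envelope calculation of why this footprint matters; every number below (layer count, head count, batch size, bf16 storage, and so on) is an assumption chosen for illustration, not a figure from the paper.

```python
# All values below are illustrative assumptions for a "500B+ parameter" model.
n_params      = 500e9
bytes_per_val = 2                          # assume bf16 storage

n_layers, n_heads, d_head = 96, 64, 128    # assumed architecture
batch, seq_len = 512, 2048                 # assumed serving workload

weight_bytes = n_params * bytes_per_val
# Multihead attention: one K and one V vector per head, per layer, per token,
# per sequence in the batch.
kv_cache_bytes = 2 * n_layers * n_heads * d_head * batch * seq_len * bytes_per_val

print(f"weights : {weight_bytes / 1e12:.1f} TB")
print(f"KV cache: {kv_cache_bytes / 1e12:.1f} TB")
# At large batch and sequence length the KV cache can dwarf the weights,
# which is why its memory traffic dominates in that regime.
```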

How is the performance of different partitioning strategies affected by changes in model size, sequence length, and number of hardware chips?

  • The latency is the total time for an inference and can be broken down into the time to process the input tokens present at the start of the inference (which we call “prefill”) and the time to autoregressively generate output tokens (which we term “decode”).
  • The throughput of prefill or decode is the number of tokens processed or generated per second.
  • The model FLOPS utilization (MFU) is the ratio of the observed throughput (tokens per second) to the theoretical maximum throughput if the hardware were running at peak FLOPS
  • At small batch sizes and sequence lengths, the time to load weights dominates
  • At larger batch sizes and sequence lengths (e.g. 2048+ tokens with batch size 512+), the time to load the KV cache dominates.
  • each matmul performs one multiplication and one addition per pair of input token and parameter values in the forward pass (a worked MFU example follows this list)
  • the KV cache memory traffic grows with batch size and sequence length, since (unlike the weights) the KV cache is unique for each sequence in the batch.
  • Both the weight loading part of the memory time and the non-attention compute time are proportional to the model size and inversely proportional to the number of chips.
    • the time needed for chip-to-chip communication decreases less quickly (or not at all) with the number of chips used
    • Lower latency can often be achieved with smaller batch sizes
  • It is most efficient to increase the batch size because larger batches typically result in better MFU
  • prefill can run in parallel over $L_{input}$, but decode must run sequentially over $L_{gen}$
  • In large-model inference, "prefill" and "decode" are the two key phases: prefill processes all of the prompt (input) tokens in a single pass, so it can be parallelized over the input length, while decode generates output tokens autoregressively, one at a time, and therefore runs sequentially.
  • an all-reduce can be decomposed into two phases: a reduction phase and a broadcast phase (see the sketch after this list)
    • The reduction phase is called reduce-scatter(x)
    • The broadcast phase is called all-gather(x)
    • The all-to-all collective shifts sharding from one tensor dimension to another
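
As a worked example of the MFU definition combined with the two-FLOPs-per-token-per-parameter rule of thumb noted above, the snippet below estimates the theoretical peak decode throughput and the resulting MFU. The per-chip peak FLOPS, chip count, and observed throughput are assumptions for illustration only.

```python
# Worked MFU example; all numbers are assumed for illustration.
n_params        = 500e9      # model parameters
peak_flops_chip = 275e12     # assumed per-chip peak (e.g. bf16) FLOP/s
n_chips         = 64
observed_tps    = 5_000      # assumed measured decode throughput, tokens/s

# One multiply and one add per (token, parameter) pair => ~2*N FLOPs per token
# (extra attention FLOPs on the KV cache are ignored in this rough estimate).
flops_per_token = 2 * n_params

theoretical_max_tps = peak_flops_chip * n_chips / flops_per_token
mfu = observed_tps / theoretical_max_tps

print(f"theoretical max: {theoretical_max_tps:,.0f} tokens/s")
print(f"MFU            : {mfu:.1%}")
```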
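
The reduce-scatter / all-gather decomposition can be simulated in a few lines of NumPy, modeling each "chip" as an entry in a Python list. This is only a sketch of the data movement, not real collective-communication code.

```python
import numpy as np

n_chips, dim = 4, 8
rng = np.random.default_rng(0)

# Each chip starts with its own partial-sum tensor of shape [dim].
partials = [rng.standard_normal(dim) for _ in range(n_chips)]

# reduce-scatter(x): sum the partials, but leave each chip holding only its
# own 1/n_chips shard of the summed result.
total  = np.sum(partials, axis=0)
shards = np.split(total, n_chips)                 # shard i lives on chip i

# all-gather(x): every chip collects all shards, recovering the full sum.
gathered = [np.concatenate(shards) for _ in range(n_chips)]

# all-reduce(x) == all-gather(reduce-scatter(x)): each chip ends with the sum.
assert all(np.allclose(g, total) for g in gathered)
```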

Partitioning the feedforward layer

  • Each weight shard is multiplied by the appropriate activation shard on each chip, and the results are aggregated between the chips with an all-gather and/or reduce-scatter
  • if the first matmul is partitioned by the output axis, the resulting activation shard on each chip will be the exact one needed to compute the second matmul partitioned by the input axis.
  • the communication latency remains roughly constant, independent of the number of chips used
  • 1D partitioning:
    • The weights are kept stationary in each chip, and the activations are transferred between chips to match the weight layout, requiring one all-gather and one reduce-scatter.
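
A toy NumPy sketch of this feedforward partitioning, with chips simulated as list entries: the first matmul is sharded over its output (hidden) axis and the second over its input axis, so the hidden activations produced on each chip are exactly the shard the second matmul there needs, and a single summation at the end (standing in for the reduce-scatter/all-gather pair) recovers the full result. Shapes and the ReLU nonlinearity are illustrative assumptions.

```python
import numpy as np

n_chips, n_tokens, d_model, d_ff = 4, 5, 8, 32
rng = np.random.default_rng(0)

x  = rng.standard_normal((n_tokens, d_model))   # activations, replicated
W1 = rng.standard_normal((d_model, d_ff))       # in -> hidden
W2 = rng.standard_normal((d_ff, d_model))       # hidden -> out

# Shard W1 by its output (hidden) axis and W2 by its input (hidden) axis.
W1_shards = np.split(W1, n_chips, axis=1)
W2_shards = np.split(W2, n_chips, axis=0)

# Each chip multiplies with only its weight shards; no communication is
# needed between the two matmuls.
partial_outputs = [
    np.maximum(x @ W1_s, 0.0) @ W2_s            # ReLU is just for illustration
    for W1_s, W2_s in zip(W1_shards, W2_shards)
]

# One reduction across chips (the all-reduce, i.e. reduce-scatter + all-gather)
# combines the per-chip partial outputs into the full result.
y_partitioned = np.sum(partial_outputs, axis=0)
y_reference   = np.maximum(x @ W1, 0.0) @ W2
assert np.allclose(y_partitioned, y_reference)
```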

Partitioning the attention layer

  • An alternative approach, called multiquery attention:
    • still emits n_heads heads for the query tensor, but only a single head for the key and value tensors.
    • reduces the size of the KV cache tensors by a factor of n_heads (see the sketch after this list).
  • Since the KV cache is orders of magnitude larger than the Q, K, and V tensors, it is very profitable to spend the all-to-all communication time on the small tensors to save the memory time on the large tensors.
  • During prefill
    • we use the sharded-over-heads layout.
    • multiquery attention enables using larger batch sizes and sequence lengths
  • the savings are an order of magnitude compared to multihead attention.
  • MQA reduces memory traffic, but increases the communication overhead of parallel execution.
  • larger matrix-multiplications run more efficiently on accelerators.
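
To make the factor-of-n_heads saving concrete, here is a small comparison of per-layer KV-cache sizes for multihead vs. multiquery attention, under assumed toy dimensions.

```python
import numpy as np

batch, seq_len, n_heads, d_head = 4, 128, 16, 64   # assumed toy dimensions

# Multihead attention: one K and one V head per query head.
mha_kv = np.zeros((2, batch, seq_len, n_heads, d_head), dtype=np.float16)

# Multiquery attention: all n_heads query heads share a single K/V head.
mqa_kv = np.zeros((2, batch, seq_len, 1, d_head), dtype=np.float16)

print(mha_kv.nbytes // mqa_kv.nbytes)   # == n_heads, i.e. a 16x smaller cache
```

The freed cache memory is what allows the larger batch sizes and sequence lengths mentioned above.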

Low-level optimizations

  • Looped CollectiveEinsum technique: large matmuls and their collectives are decomposed into a loop over smaller chunks so that inter-chip communication can run concurrently with computation (a conceptual sketch follows)
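
A purely conceptual NumPy sketch of the chunking idea behind a looped collective einsum: the contraction is split into chunks so that, on real hardware, the collective for one chunk's partial result could be issued while the next chunk is still being computed. Here everything runs sequentially, and a plain accumulation stands in for the per-chunk collective; all shapes are assumptions.

```python
import numpy as np

n_chunks, n_tokens, d_in, d_out = 4, 16, 32, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((n_tokens, d_in))
W = rng.standard_normal((d_in, d_out))

# Monolithic version: one big matmul, then one big collective afterwards.
reference = x @ W

# Looped version: split the contraction dimension into chunks. On a real
# accelerator, the communication of chunk i's partial result would overlap
# with the matmul of chunk i+1; here the accumulation merely stands in for it.
accum = np.zeros((n_tokens, d_out))
for x_chunk, W_chunk in zip(np.split(x, n_chunks, axis=1),
                            np.split(W, n_chunks, axis=0)):
    accum += x_chunk @ W_chunk               # compute + (overlapped) collective

assert np.allclose(accum, reference)
```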