[Paper Reading] S3: Increasing GPU Utilization during Generative Inference for Higher Throughput

Article link: https://blog.csdn.net/peakkizza/article/details/136995145

Name: S3 stands for Scheduling Sequences with Speculation.

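Besides the already large model parameters, the key/value (KV) cache, which holds information about the previous tokens in a sequence, can grow even larger than the model itself.
Existing systems reserve KV cache memory for the maximum sequence length so that a complete sequence can always be generated even though the output length is unknown in advance. This forces a smaller batch size, which leads to lower GPU utilization and, most importantly, lower throughput.
The idea is to design a system that predicts the output sequence length and schedules generation queries according to that predicted length.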

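Memory capacity and bandwidth. These bottlenecks highlight memory limits and the need for efficient memory use in order to raise utilization of the GPU's compute resources.
A common way to raise GPU utilization and throughput is to increase the batch size, because inputs within a batch share the model weights.
The GPU therefore only needs to load the model weights from its high-bandwidth memory (HBM) into on-chip SRAM once and can reuse them for every input in the batch. This approach is commonly used when serving convolutional and fully connected neural networks.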
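To make the weight-sharing argument concrete, here is a small back-of-the-envelope sketch. The function name and the 12 GB weight size are illustrative assumptions (roughly a 6B-parameter model in fp16), not figures taken from these notes.

```python
def weight_bytes_per_input(weight_bytes, batch_size):
    """HBM weight traffic attributed to each input for one forward pass,
    assuming the weights are loaded once and reused across the whole batch."""
    return weight_bytes / batch_size

# Assumed example: 12 GB of weights.
ASSUMED_WEIGHT_BYTES = 12_000_000_000

for batch_size in (1, 8, 64):
    per_input = weight_bytes_per_input(ASSUMED_WEIGHT_BYTES, batch_size)
    print(batch_size, per_input)
```

The weight traffic per input shrinks in proportion to the batch size, which is why a larger batch keeps the compute units busier.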
KV cache:
- the self-attention layer in Transformer-based text generation LLMs presents a challenge to this simple optimization due to its autoregressive nature. Specifically, when generating a new token in a sequence, the model needs to attend to all previous tokens in the sequence, requiring the model to retain all information from previous tokens and store them in HBM. We call this region in the HBM holding the information key/value cache (KV cache).
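The size of the KV cache grows with both the batch size and the sequence length. This caps the maximum batch size, which lowers GPU utilization and ultimately throughput.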
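A rough way to see this growth is to count the bytes directly. The sketch below assumes standard multi-head attention, where every layer stores one key vector and one value vector of width `hidden_size` per token. The model shape (28 layers, hidden size 4096, fp16) is an assumed example, not a value from these notes.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, hidden_size, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The factor 2 covers keys and values; each layer stores one
    hidden_size-wide vector of each kind per token.
    """
    return 2 * n_layers * hidden_size * bytes_per_elem * seq_len * batch_size

# Assumed model shape: 28 layers, hidden size 4096, fp16 (2 bytes per element).
one_sequence = kv_cache_bytes(batch_size=1, seq_len=2048, n_layers=28, hidden_size=4096)
print(one_sequence / 2**30)        # 0.875 GiB for a single 2048-token sequence

batch_of_eight = kv_cache_bytes(batch_size=8, seq_len=2048, n_layers=28, hidden_size=4096)
print(batch_of_eight / 2**30)      # 7.0 GiB for a batch of 8
```

Under these assumptions, reserving the full 2048 tokens for each of 8 sequences already costs 7 GiB, regardless of how many tokens each sequence ends up generating.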

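The authors propose S3 (scheduling sequences with speculation), a framework that maximizes throughput by predicting the output sequence length and reducing wasted memory.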
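The effect of allocating by predicted length instead of by maximum length can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the paper's scheduler: the function names, the KV budget measured in token slots, and the predicted lengths are all hypothetical, and it ignores what happens when a sequence outruns its prediction.

```python
def batch_with_max_reservation(budget_tokens, max_seq_len):
    """Number of sequences that fit when each one reserves max_seq_len KV slots."""
    return budget_tokens // max_seq_len

def batch_with_predicted_lengths(budget_tokens, predicted_lens):
    """Admit sequences in arrival order while their predicted KV slots still fit."""
    admitted, used = [], 0
    for idx, length in enumerate(predicted_lens):
        if used + length > budget_tokens:
            break
        admitted.append(idx)
        used += length
    return admitted

# Hypothetical numbers: a KV budget of 8192 token slots and made-up predictions.
budget = 8192
predicted = [300, 500, 200, 1000, 400, 700, 900, 600, 1500, 2000, 800]

print(batch_with_max_reservation(budget, max_seq_len=2048))   # 4
print(len(batch_with_predicted_lengths(budget, predicted)))   # 10
```

With the same memory budget, the toy example fits 10 sequences instead of 4, because short sequences no longer hold slots they will never use.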

Contributions of the paper:

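Transformer-based generative models are autoregressive.
Because the model generates one token at a time, it has to iterate n times to produce a sequence that is n tokens long.
One iteration consists of one input token passing through the model, which is a stack of transformer layers:
- each transformer layer contains one attention layer, two layernorm layers, and two feedforward layers.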
- the self-attention layer uses information on the past tokens to generate the next token.
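The iteration described above can be sketched as a toy decode loop. This is a deliberately simplified, single-head illustration in plain Python: the identity projections and the function names are assumptions, and a real model would apply learned weight matrices, multiple heads, layernorm, and feedforward layers.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention over the cached keys and values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)                                  # subtract max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, value in zip(weights, values):
        for j, vj in enumerate(value):
            out[j] += (w / total) * vj
    return out

def generate(first_input, n_tokens):
    """Toy decode loop: one iteration per token, with the KV cache growing by one entry each time."""
    k_cache, v_cache = [], []
    x = first_input
    cache_sizes = []
    for _ in range(n_tokens):
        # Toy projections (identity); a real layer would apply learned W_q, W_k, W_v.
        q, k, v = x, x, x
        k_cache.append(k)          # keep the current token's key and value ...
        v_cache.append(v)
        x = attend(q, k_cache, v_cache)   # ... and attend over all tokens so far
        cache_sizes.append(len(k_cache))
    return x, cache_sizes

_, sizes = generate([0.5, -1.0, 2.0], n_tokens=4)
print(sizes)   # [1, 2, 3, 4]
```

The cache holds one more key/value pair after every iteration, which is the memory growth that the KV cache discussion refers to.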
In the $i^{th}$ iteration, the model's self-attention layer uses the current token ( $t_{i}$ ) together with every token generated so far ( $t_{0}, \dots, t_{i-1}$