InternLM: Advanced Practice in Quantized Deployment with LMDeploy

dilvx

Last modified: 2024-10-11 19:42:08

Tags: machine learning

First published: 2024-10-11 19:39:16

Article link: https://blog.csdn.net/dilvx/article/details/142861060

Deploying Models with LMDeploy

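Model deployment means running a trained deep learning model in a specific target environment. LMDeploy is a toolkit for this, and it supports the mainstream model formats and algorithms.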

Cached Inference for Large Models

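The first half of this chapter covers quantization: KV-cache quantization, weight quantization, and activation quantization. The main goal of quantization is to save memory: values stored as fp16 are re-represented as int4 or int8, which keeps the model's GPU memory footprint below an acceptable limit of 200 GB. It is worth noting that under the transformer architecture the bottleneck is mainly GPU memory bandwidth rather than arithmetic speed. Since quantized values are smaller, less data has to be moved per generated token, so quantization also improves inference efficiency.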
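To make the idea of re-representing fp16 values in fewer bits concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, plus a back-of-the-envelope estimate of weight memory at different bit widths. This is an illustration only, not LMDeploy's actual algorithm: the helper names (`quantize_int8`, `dequantize`, `weight_memory_gb`), the sample values, and the 7-billion-parameter model size are all assumptions made for the example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> (int8 codes, scale).

    Hypothetical helper for illustration; real toolkits usually quantize
    per channel or per group and store the scales alongside the codes.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def weight_memory_gb(num_params, bits_per_param):
    """Weight storage only (no KV cache, activations, or scale metadata)."""
    return num_params * bits_per_param / 8 / 1e9


weights = [0.12, -0.5, 0.33, 1.27, -1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Assumed 7B-parameter model: fp16 vs int8 vs int4 weight storage.
footprints = {bits: weight_memory_gb(7e9, bits) for bits in (16, 8, 4)}
```

Each halving of the bit width halves the weight storage, at the cost of a rounding error of at most half a quantization step per value in this simple scheme.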

Before covering KV-cache quantization, let's first look at what the KV cache is.

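The KV cache stores the Key and Value tensors (but not the Query) computed during multi-head attention. When a token sequence $x_1, \ldots, x_t, x_{t+1}$ comes in and the first $t$ tokens have already been processed, then out of $k_1, \ldots, k_t, k_{t+1}$ and $v_1, \ldots, v_t, v_{t+1}$ only $k_{t+1}$ and $v_{t+1}$ need to be computed. The first $t$ keys and values are read from the cache and merged (concatenated) with the new ones.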

When you’re doing auto-regressive text generation, you predict one token at a time. When predicting a given token, in the attention layer, you need to compute the attention between the most recent token and all tokens generated so far – you use the query from the last token, but the key and the value from all tokens generated so far. This means you have no benefit in caching the query, but you save a few computations if you cache the key and the value.
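The caching scheme described above can be sketched with a single attention head in plain Python. This is a toy illustration under stated assumptions, not how LMDeploy implements its cache: the names `KVCache`, `decode_step`, and `decode_step_no_cache`, the dimension `d = 4`, and the random weights are all hypothetical. The point is only that a decode step with the cache computes one new key and one new value, yet produces the same output as recomputing every key and value from scratch.

```python
import math
import random


def project(matrix, x):
    """Multiply a (d_out x d_in) matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * value[i] for e, value in zip(exps, values))
        for i in range(len(values[0]))
    ]


class KVCache:
    """Hypothetical cache holding k_1..k_t and v_1..v_t for one head."""

    def __init__(self):
        self.keys = []
        self.values = []


def decode_step(x_new, w_q, w_k, w_v, cache):
    """Compute q, k, v for the new token only; reuse cached keys and values."""
    cache.keys.append(project(w_k, x_new))
    cache.values.append(project(w_v, x_new))
    return attend(project(w_q, x_new), cache.keys, cache.values)


def decode_step_no_cache(xs, w_q, w_k, w_v):
    """Reference path: recompute k and v for every token seen so far."""
    keys = [project(w_k, x) for x in xs]
    values = [project(w_v, x) for x in xs]
    return attend(project(w_q, xs[-1]), keys, values)


def random_matrix(d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]


random.seed(0)
d = 4
w_q, w_k, w_v = random_matrix(d), random_matrix(d), random_matrix(d)
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]

cache = KVCache()
outputs_match = True
for t, x in enumerate(tokens):
    cached_out = decode_step(x, w_q, w_k, w_v, cache)
    full_out = decode_step_no_cache(tokens[: t + 1], w_q, w_k, w_v)
    if any(abs(a - b) > 1e-9 for a, b in zip(cached_out, full_out)):
        outputs_match = False
```

The query is never stored because each step uses only the newest token's query. The cache itself grows by one key and one value per generated token, which is why its memory footprint becomes a target for quantization.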