Introduction
- The authors propose PrefixQuant, which builds on QuaRot: by keeping outlier tokens lossless during weight-activation quantization and adding EfficientQAT fine-tuning, it achieves fairly strong results under W4A4 static quantization. However, like CushionCache, although PrefixQuant keeps all outlier tokens lossless, it never discusses how prepending the prefix affects model accuracy.
Method
- The authors observe that under static quantization, the activation distribution of outlier tokens differs sharply from that of other tokens, so without special handling the calibrated quantization parameters hurt the precision of non-outlier tokens: for example, outlier tokens carry massive outliers in the down_proj input, while their KV cache is exceptionally flat. Keeping the outlier tokens lossless and calibrating only on the remaining tokens yields a much smaller quantization range and thus higher quantization precision.
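A minimal numeric illustration of this effect (the numbers are synthetic, not from the paper): a single massive-outlier token inflates the per-tensor static scale, so ordinary activations get mapped onto only a handful of integer levels.

```python
import torch

torch.manual_seed(0)
normal = torch.randn(127, 4096)          # ordinary token activations, roughly N(0, 1)
outlier = torch.full((1, 4096), 1000.0)  # one massive-outlier token
acts = torch.cat([normal, outlier], dim=0)

def int8_scale(x: torch.Tensor) -> float:
    """Symmetric per-tensor INT8 scale: absmax / 127."""
    return x.abs().max().item() / 127

print(f"scale with the outlier token:    {int8_scale(acts):.4f}")    # ~7.9
print(f"scale without the outlier token: {int8_scale(normal):.4f}")  # ~0.04
```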
- Definition of Outlier Token. Outlier tokens are located via the input activations of down_proj, with threshold $\eta = 64$.
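A sketch of the detection rule, assuming (per the paper's description) that a token is flagged as an outlier when its token-wise maximum magnitude exceeds $\eta$ times the median of the token-wise maxima:

```python
import torch

def find_outlier_tokens(down_proj_input: torch.Tensor, eta: float = 64.0) -> torch.Tensor:
    """down_proj_input: [num_tokens, hidden_dim]; returns a boolean mask over tokens."""
    token_max = down_proj_input.abs().max(dim=-1).values  # per-token maximum magnitude
    return token_max > eta * token_max.median()
```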
- Number of Outlier Tokens. The number of outlier tokens is measured per model on a calibration set as $o = \lceil \max(\mathbf{O}) \rceil$, where $\mathbf{O} \in \mathbb{R}^b$ collects the outlier-token counts of all $b$ transformer blocks.
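Continuing the sketch above, $o$ can be computed by counting flagged tokens per block and taking the ceiling of the maximum:

```python
import math
# reuses find_outlier_tokens from the sketch above

def num_prefix_tokens(block_inputs, eta: float = 64.0) -> int:
    """block_inputs: the down_proj input activations collected from each transformer block."""
    counts = [find_outlier_tokens(x, eta).sum().item() for x in block_inputs]
    return math.ceil(max(counts))
```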
- Which Tokens to Prefix? The top-$o$ high-frequency outlier tokens + [BOS].
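A hedged sketch of this selection: tally how often each token id is flagged as an outlier across the calibration set, then prefix the $o$ most frequent ids after [BOS]. Function and argument names are illustrative, not from the released code.

```python
from collections import Counter
# reuses find_outlier_tokens from the earlier sketch

def build_prefix(calib_token_ids, calib_block_inputs, o: int, bos_id: int,
                 eta: float = 64.0) -> list[int]:
    """calib_token_ids[i]: LongTensor of token ids for calibration sample i;
    calib_block_inputs[i]: the matching down_proj input activations."""
    freq = Counter()
    for token_ids, x in zip(calib_token_ids, calib_block_inputs):
        freq.update(token_ids[find_outlier_tokens(x, eta)].tolist())
    top_o = [tok for tok, _ in freq.most_common(o)]
    return [bos_id] + top_o  # these ids are prepended (and cached) as the prefix
```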
- Block-wise Fine-tuning. EfficientQAT is adopted to fine-tune the quantization scales & weights.
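A minimal sketch of this step, assuming a simplified EfficientQAT-style setup: each quantized block is trained to match the output of its full-precision counterpart, with gradients flowing to both weights and scales through a straight-through estimator. Hyperparameters and class names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ste_round(x: torch.Tensor) -> torch.Tensor:
    # straight-through estimator: round in the forward pass, identity gradient backward
    return x + (torch.round(x) - x).detach()

class QuantLinear(nn.Module):
    """Linear layer with per-channel symmetric fake quantization and a learnable scale."""
    def __init__(self, linear: nn.Linear, bits: int = 4):
        super().__init__()
        qmax = 2 ** (bits - 1) - 1
        self.qmin, self.qmax = -qmax - 1, qmax
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.scale = nn.Parameter(self.weight.abs().amax(dim=1, keepdim=True) / qmax)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = torch.clamp(ste_round(self.weight / self.scale), self.qmin, self.qmax)
        return F.linear(x, self.scale * q)

def finetune_block(quant_block: nn.Module, fp_block: nn.Module,
                   calib_inputs: list, steps: int = 128, lr: float = 1e-4) -> None:
    """Train one quantized block to reproduce its full-precision counterpart."""
    opt = torch.optim.AdamW(quant_block.parameters(), lr=lr)
    for step in range(steps):
        x = calib_inputs[step % len(calib_inputs)]
        with torch.no_grad():
            target = fp_block(x)
        loss = F.mse_loss(quant_block(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```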
Experiments
- Settings. Weights: per-channel symmetric quantization. KV cache: per-head symmetric static quantization for 4-bit, per-tensor symmetric static quantization for 8-bit. Activations: per-tensor static quantization. Calibration uses 8 Pile samples with a 1024 sequence length, with initial scales found via grid search; fine-tuning uses 512 samples from Pile with a 1024 context length.
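A hedged sketch of the grid-search scale initialization: sweep clipping ratios of the absmax and keep the scale minimizing quantization MSE on calibration activations. The grid range and resolution are assumptions, not the paper's exact values.

```python
import torch

def grid_search_scale(x: torch.Tensor, bits: int = 4, num_grid: int = 50) -> float:
    """Per-tensor symmetric scale minimizing quantization MSE over clipping ratios."""
    qmax = 2 ** (bits - 1) - 1
    absmax = x.abs().max()
    best_scale, best_err = (absmax / qmax).item(), float("inf")
    for ratio in torch.linspace(0.5, 1.0, num_grid):  # assumed candidate grid
        scale = ratio * absmax / qmax
        x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        err = (x - x_q).pow(2).mean().item()
        if err < best_err:
            best_scale, best_err = scale.item(), err
    return best_scale
```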
- Comparison Results.
- Results on weight-only quantization.
- Inference Speed. (1) Static Quantization Speedup.
(2) Linear Layer Speedup. For low-bit matrix multiplication, we use the 4-bit GEMM kernel from CUTLASS and design a custom kernel for W4A4 GEMV. We also integrate the de-quantization process into the GEMM and GEMV kernels.
(3) End-to-end speedup. KV cache quantization is not used in this speed test (it saves memory footprint through extra computation overhead and only achieves speedup at large batch sizes).
- Ablation Studies. (1) Main Components.
(2) Number of Prefixed Tokens.
(3) Content of Prefixed Tokens.
- Quantization Time.