Introduction
- The authors propose PrefixQuant, which builds on QuaRot: by keeping outlier tokens lossless during weight-activation quantization and adding EfficientQAT fine-tuning, it achieves fairly strong results under W4A4 static quantization. However, like CushionCache, although PrefixQuant keeps all outlier tokens lossless, it never discusses how prepending the prefix affects model accuracy.
Method
- The authors observe that under static quantization, the activation distribution of outlier tokens differs sharply from that of other tokens, so without special handling the calibrated quantization parameters hurt the precision of non-outlier tokens: for example, outlier tokens carry massive outliers in the down_proj input, while their KV cache is exceptionally flat. Keeping the outlier tokens lossless and calibrating only on the remaining tokens yields a much smaller quantization range and thus higher quantization precision.
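A minimal numeric illustration of this effect (the numbers are synthetic, not from the paper): a single massive-outlier token inflates the per-tensor static scale, so ordinary activations get mapped onto only a handful of integer levels.

```python
import torch

torch.manual_seed(0)
normal = torch.randn(127, 4096)          # ordinary token activations, roughly N(0, 1)
outlier = torch.full((1, 4096), 1000.0)  # one massive-outlier token
acts = torch.cat([normal, outlier], dim=0)

def int8_scale(x: torch.Tensor) -> float:
    """Symmetric per-tensor INT8 scale: absmax / 127."""
    return x.abs().max().item() / 127

print(f"scale with the outlier token:    {int8_scale(acts):.4f}")    # ~7.9
print(f"scale without the outlier token: {int8_scale(normal):.4f}")  # ~0.04
```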
- Definition of Outlier Token. Outlier tokens are located via the input activations of down_proj, with threshold $\eta = 64$.
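A sketch of the detection rule, assuming (per the paper's description) that a token is flagged as an outlier when its token-wise maximum magnitude exceeds $\eta$ times the median of the token-wise maxima:

```python
import torch

def find_outlier_tokens(down_proj_input: torch.Tensor, eta: float = 64.0) -> torch.Tensor:
    """down_proj_input: [num_tokens, hidden_dim]; returns a boolean mask over tokens."""
    token_max = down_proj_input.abs().max(dim=-1).values  # per-token maximum magnitude
    return token_max > eta * token_max.median()
```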
- Number of Outlier Tokens. The number of outlier tokens is measured per model on a calibration set as $o = \lceil \max(\mathbf{O}) \rceil$, where $\mathbf{O} \in \mathbb{R}^b$ collects the outlier-token counts of all $b$ transformer blocks.
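Continuing the sketch above, $o$ can be computed by counting flagged tokens per block and taking the ceiling of the maximum:

```python
import math
# reuses find_outlier_tokens from the sketch above

def num_prefix_tokens(block_inputs, eta: float = 64.0) -> int:
    """block_inputs: the down_proj input activations collected from each transformer block."""
    counts = [find_outlier_tokens(x, eta).sum().item() for x in block_inputs]
    return math.ceil(max(counts))
```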
- Which Tokens to Prefix? The top-$o$ high-frequency outlier tokens + [BOS].
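A hedged sketch of this selection: tally how often each token id is flagged as an outlier across the calibration set, then prefix the $o$ most frequent ids after [BOS]. Function and argument names are illustrative, not from the released code.

```python
from collections import Counter
# reuses find_outlier_tokens from the earlier sketch

def build_prefix(calib_token_ids, calib_block_inputs, o: int, bos_id: int,
                 eta: float = 64.0) -> list[int]:
    """calib_token_ids[i]: LongTensor of token ids for calibration sample i;
    calib_block_inputs[i]: the matching down_proj input activations."""
    freq = Counter()
    for token_ids, x in zip(calib_token_ids, calib_block_inputs):
        freq.update(token_ids[find_outlier_tokens(x, eta)].tolist())
    top_o = [tok for tok, _ in freq.most_common(o)]
    return [bos_id] + top_o  # these ids are prepended (and cached) as the prefix
```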
- Block-wise Fine-tuning. EfficientQAT is adopted to fine-tune the quantization scales & weights.
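A minimal sketch of this step, assuming a simplified EfficientQAT-style setup: each quantized block is trained to match the output of its full-precision counterpart, with gradients flowing to both weights and scales through a straight-through estimator. Hyperparameters and class names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ste_round(x: torch.Tensor) -> torch.Tensor:
    # straight-through estimator: round in the forward pass, identity gradient backward
    return x + (torch.round(x) - x).detach()

class QuantLinear(nn.Module):
    """Linear layer with per-channel symmetric fake quantization and a learnable scale."""
    def __init__(self, linear: nn.Linear, bits: int = 4):
        super().__init__()
        qmax = 2 ** (bits - 1) - 1
        self.qmin, self.qmax = -qmax - 1, qmax
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.scale = nn.Parameter(self.weight.abs().amax(dim=1, keepdim=True) / qmax)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = torch.clamp(ste_round(self.weight / self.scale), self.qmin, self.qmax)
        return F.linear(x, self.scale * q)

def finetune_block(quant_block: nn.Module, fp_block: nn.Module,
                   calib_inputs: list, steps: int = 128, lr: float = 1e-4) -> None:
    """Train one quantized block to reproduce its full-precision counterpart."""
    opt = torch.optim.AdamW(quant_block.parameters(), lr=lr)
    for step in range(steps):
        x = calib_inputs[step % len(calib_inputs)]
        with torch.no_grad():
            target = fp_block(x)
        loss = F.mse_loss(quant_block(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```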
Experiments
- Settings. Weights: per-channel symmetric quantization. KV cache: per-head symmetric static quantization for 4-bit, per-tensor symmetric static quantization for 8-bit. Activations: per-tensor static quantization. Calibration uses 8 Pile samples with a 1024 sequence length, with initial scales found via grid search; fine-tuning uses 512 samples from Pile with a 1024 context length.
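A hedged sketch of the grid-search scale initialization: sweep clipping ratios of the absmax and keep the scale minimizing quantization MSE on calibration activations. The grid range and resolution are assumptions, not the paper's exact values.

```python
import torch

def grid_search_scale(x: torch.Tensor, bits: int = 4, num_grid: int = 50) -> float:
    """Per-tensor symmetric scale minimizing quantization MSE over clipping ratios."""
    qmax = 2 ** (bits - 1) - 1
    absmax = x.abs().max()
    best_scale, best_err = (absmax / qmax).item(), float("inf")
    for ratio in torch.linspace(0.5, 1.0, num_grid):  # assumed candidate grid
        scale = ratio * absmax / qmax
        x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        err = (x - x_q).pow(2).mean().item()
        if err < best_err:
            best_scale, best_err = scale.item(), err
    return best_scale
```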
- Comparison Results.
- Results on weight-only quantization.
- Inference Speed. (1) Static Quantization Speedup.
(2) Linear Layer Speedup. For low-bit matrix multiplication, we use the 4-bit GEMM kernel from CUTLASS and design a custom kernel for W4A4 GEMV. We also integrate the de-quantization process into the GEMM and GEMV kernels.
(3) End-to-end speedup. KV cache quantization is not used in this speed test (it saves memory footprint through extra computation overhead and only achieves speedup at large batch sizes).
- Ablation Studies. (1) Main Components.
(2) Number of Prefixed Tokens.
(3) Content of Prefixed Tokens.
- Quantization Time.