[Arxiv 2024] VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

连理o

Last modified: 2024-10-08 23:33:01

Tags: Arxiv 2024

First published: 2024-10-08 22:36:51

Article link: https://blog.csdn.net/weixin_42437114/article/details/142767798

Introduction
Method
Experiments
References

Introduction

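The authors propose VPTQ (Vector Post-Training Quantization) for weight-only quantization of LLMs. It naturally extends GPTQ from uniform quantization to vector quantization and improves on AQLM in accuracy, quantization time, and inference speed. In practice, though, choosing VQ hyperparameters can be painful: parameters such as the vector length affect not only the model's effective parameter count but also its accuracy and inference speed, and the effect on inference speed may even be hardware-dependent, so finding an optimal configuration is not easy.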

Method

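Vector Quantization (VQ). VQ maintains a finite set of vectors (i.e., a codebook or lookup table) $\mathcal C$ and maps each original vector in $\mathbf W'$ to one of the codebook vectors (i.e., a centroid vector), thereby compressing the data.
For example, a weight matrix $\mathbf W\in\R^{M\times N}$ can first be reshaped into $\mathbf W'\in\R^{(MN/v)\times v}$. A clustering algorithm such as k-means then produces the $k$ centroids $\in\R^v$ of the codebook, and each row vector of $\mathbf W'$ is assigned to its nearest centroid, with the assignment stored as a codebook index. After VQ, the codebook takes $16 k v$ bits (16 bits per centroid element) and the codebook indices take $MN\log_2k/v$ bits, so the actual compression ratio is $\frac{16MN}{16kv+MN\log_2k/v}$ . With $M = N = 4096, k = 256, v = 8$ , the ratio is about 15.97, which is roughly equivalent to compressing each weight to 1 bit. Plain VQ usually gives poor accuracy, however, so PTQ methods generally go on to update the codebook and the codebook indices to reduce the quantization error.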
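The reshape, clustering, and lookup steps above, together with the compression-ratio formula, can be sketched in NumPy as follows. This is a minimal illustration with plain Lloyd-style k-means on a small random matrix, not the paper's implementation; the function names `compression_ratio` and `kmeans_vq` and the toy sizes are assumptions made for the example.

```python
import numpy as np


def compression_ratio(M, N, k, v, bits=16):
    """Original size divided by (codebook + index) size, all in bits."""
    codebook_bits = bits * k * v
    index_bits = M * N * np.log2(k) / v
    return bits * M * N / (codebook_bits + index_bits)


def kmeans_vq(W, k, v, iters=20, seed=0):
    """Plain (unweighted) k-means VQ of W; returns (codebook, indices)."""
    rng = np.random.default_rng(seed)
    vectors = W.reshape(-1, v)                        # (M*N/v, v)
    # Initialise the centroids from k distinct input vectors.
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every vector to every centroid.
        dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(axis=1)
        for c in range(k):
            members = vectors[idx == c]
            if len(members):                          # empty clusters stay as they are
                codebook[c] = members.mean(axis=0)
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, dist.argmin(axis=1)


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)  # toy weight matrix
codebook, idx = kmeans_vq(W, k=16, v=4)
W_hat = codebook[idx].reshape(W.shape)                # dequantize by table lookup
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
ratio = compression_ratio(4096, 4096, 256, 8)
print(f"relative error on the toy matrix: {rel_err:.3f}")
print(f"compression ratio for M=N=4096, k=256, v=8: {ratio:.2f}")
```

The relatively large error on random data is consistent with the remark that plain VQ alone is usually not accurate enough.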

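VPTQ Algorithm. The authors follow the GPTQ recipe: quantize one column at a time, then update the weights of the remaining columns to compensate for the quantization loss. GPTQ quantizes each column with RTN, whereas VPTQ applies VQ directly to the column of weights $\mathbf W_{:,q}\in\R^{M}$ , i.e., every length-$v$ segment $\mathbf W_{iv:(i+1)v,q}$ of the column is replaced by a centroid $\mathcal C_i$ from the codebook.
Here the codebook is assumed to be fixed, so VQ amounts to assigning each weight vector its best centroid by Euclidean distance. VPTQ also uses a strategy similar to group-wise quantization: columns in different groups use different codebooks. Note that when column $q$ is quantized, the increase in the loss $L$ after GPTQ's weight update is $\Delta L=\frac{\|\mathbf W_{:,q}-\hat {\mathbf W}_{:,q}\|_2^2}{[\mathbf H^{-1}]_{qq}}=\frac{\sum_{i=0}^{M/v-1}\|\mathbf W_{iv:(i+1)v,q}-\mathcal C_i\|_2^2}{[\mathbf H^{-1}]_{qq}}$ , where $[\mathbf H^{-1}]_{qq}$ is the $q$-th diagonal entry of the inverse Hessian and $\mathcal C_i$ is the centroid assigned to the $i$-th segment. Within the GPTQ framework, keeping $\Delta L$ as small as possible after the update therefore requires minimizing $\sum_{i=0}^{M/v-1}\|\mathbf W_{iv:(i+1)v,q}-\mathcal C_i\|_2^2$ , which is exactly what nearest-centroid assignment under Euclidean distance does.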
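The column-by-column procedure can be sketched as below. This is a simplified NumPy illustration that combines the standard GPTQ error-propagation step (via the upper Cholesky factor of $\mathbf H^{-1}$) with nearest-centroid VQ on each column. It assumes a single fixed codebook with no grouping, and the names `nearest_centroids` and `columnwise_vq`, the damping value, and the random calibration data are hypothetical choices for the example rather than details from the paper.

```python
import numpy as np


def nearest_centroids(column, codebook):
    """Replace each length-v segment of a column by its nearest centroid."""
    v = codebook.shape[1]
    segments = column.reshape(-1, v)                           # (M/v, v)
    dist = ((segments[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[dist.argmin(axis=1)].reshape(-1)


def columnwise_vq(W, H, codebook, damp=0.01):
    """GPTQ-style loop: quantize one column, push its error onto later columns."""
    W = W.astype(np.float64).copy()
    N = W.shape[1]
    # Damping keeps H invertible; 0.01 is an illustrative value.
    H = H + damp * np.mean(np.diag(H)) * np.eye(N)
    # Upper Cholesky factor U of H^{-1}, i.e. H^{-1} = U^T U, as used by GPTQ.
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    Q = np.zeros_like(W)
    for q in range(N):
        Q[:, q] = nearest_centroids(W[:, q], codebook)
        err = (W[:, q] - Q[:, q]) / U[q, q]
        # Compensate the not-yet-quantized columns for this column's error.
        W[:, q + 1:] -= np.outer(err, U[q, q + 1:])
    return Q


rng = np.random.default_rng(0)
M, N, v, k = 16, 8, 4, 8
W = rng.standard_normal((M, N))
X = rng.standard_normal((32, N))           # hypothetical calibration inputs
H = 2.0 * X.T @ X                          # GPTQ-style proxy Hessian
codebook = rng.standard_normal((k, v))     # fixed codebook, assumed given
Q = columnwise_vq(W, H, codebook)
```

When $\mathbf H$ is a multiple of the identity, the off-diagonal entries of `U` vanish and the loop reduces to independent nearest-centroid quantization of every column.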
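Optimization in VPTQ. (1) Hessian-Weighted Centroid Initialization. Similar to the Sensitivity-Based K-means Clustering in SqueezeLLM, the authors do not run plain k-means; instead they treat $\text{diag}(\mathbf H)$ as the importance of each weight and run weighted k-means to initialize the centroids of the codebook. As shown below, the importance of each column is measured by $h_{i,i}$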
$\begin{aligned} L&=\Delta \mathbf {W}^T\mathbf H\Delta\mathbf W=\sum_{i=0}^{n-1} h_{i, i}\left\|\Delta \mathbf{W}_{:, i}\right\|^2 \\ &\quad+\sum_{i=0}^{n-1} \sum_{j=0, j \neq i}^{n-1} h_{i, j}\left(\Delta \mathbf{W}_{:, i}^T \Delta \mathbf{W}_{:, j}\right) \\&\approx \sum_{i=0}^{n-1} h_{i, i}\left\|\Delta \mathbf{W}_{:, i}\right\|^2 \end{aligned}$
(2) Residual Vector Quantization (RVQ). RVQ improves vector quantization (VQ) by breaking down the compression of a weight matrix into two (or more) stages. Each stage further compresses the residual error from the previous quantization stage. In other words, additional codebooks are used to keep approximating the quantization error, and at inference time multiple codebooks must be read together for dequantization.
(3) Outlier Elimination. Again similar to SqueezeLLM, the authors quantize the matrix tiles most affected by outliers with an additional codebook.
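Optimizations (1) and (2) can be sketched together in NumPy. The code below is a rough illustration under stated assumptions: every length-$v$ vector taken from column $i$ carries the weight $h_{i,i}$, centroids are updated as weighted means, and each residual stage fits a new codebook on what the previous stage left over. The names `weighted_kmeans`, `residual_vq`, and `dequantize`, and the random stand-in for $\text{diag}(\mathbf H)$, are hypothetical and not taken from the paper's code.

```python
import numpy as np


def weighted_kmeans(vectors, weights, k, iters=20, seed=0):
    """k-means whose centroid update is a weighted mean of the cluster members."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(axis=1)
        for c in range(k):
            mask = idx == c
            if weights[mask].sum() > 0:               # skip empty clusters
                codebook[c] = np.average(vectors[mask], axis=0, weights=weights[mask])
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, dist.argmin(axis=1)


def residual_vq(vectors, weights, ks):
    """Fit one codebook per stage, each on the residual left by the previous stage."""
    residual = vectors.copy()
    stages = []
    for k in ks:
        codebook, idx = weighted_kmeans(residual, weights, k)
        stages.append((codebook, idx))
        residual = residual - codebook[idx]
    return stages


def dequantize(stages):
    """Sum the looked-up centroids of all stages."""
    return sum(codebook[idx] for codebook, idx in stages)


rng = np.random.default_rng(0)
M, N, v = 32, 16, 4
W = rng.standard_normal((M, N))
h_diag = rng.uniform(0.1, 1.0, size=N)        # stand-in for diag(H)
# Split each column into M/v vectors; all vectors of column i share weight h_ii.
vectors = W.T.reshape(-1, v)
weights = np.repeat(h_diag, M // v)
one_stage = residual_vq(vectors, weights, ks=[16])
two_stage = residual_vq(vectors, weights, ks=[16, 16])
err1 = np.sum(weights[:, None] * (vectors - dequantize(one_stage)) ** 2)
err2 = np.sum(weights[:, None] * (vectors - dequantize(two_stage)) ** 2)
print(f"weighted error, 1 stage: {err1:.3f}, 2 stages: {err2:.3f}")
```

The second stage lowers the weighted error at the cost of a second set of indices and a second codebook lookup during dequantization.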
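End to end Quantization Algorithm. Unlike GPTQ, VPTQ computes $\mathbf H$ for all linear layers in advance instead of computing the next layer's $\mathbf H$ only after the preceding layers have been quantized. (In effect, the $\mathbf H$ used by VPTQ comes from the full-precision model rather than the quantized one; it is unclear how this affects accuracy.) As a result, different linear layers can be quantized in parallel on GPUs. In addition, VPTQ can run layer-wise fine-tuning after quantization, tuning only the LayerNorm (LN) parameters and the centroids. Finally, VPTQ also performs end-to-end (e2e) fine-tuning.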
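The idea of collecting every $\mathbf H$ up front can be sketched on a toy model. The snippet below assumes the GPTQ-style proxy $\mathbf H = 2\mathbf X^T\mathbf X$ built from each layer's calibration inputs (the post does not spell out how $\mathbf H$ is formed) and a hypothetical ReLU MLP. The function name `collect_hessians` and all sizes are illustrative only.

```python
import numpy as np


def collect_hessians(layer_weights, calib_batches):
    """Accumulate H = 2 * X^T X for every linear layer of a toy ReLU MLP.

    Each layer's inputs come from the unquantized model, so all H matrices
    exist before any layer is quantized.
    """
    hessians = [np.zeros((W.shape[1], W.shape[1])) for W in layer_weights]
    for x in calib_batches:                        # x: (tokens, in_features)
        for layer, W in enumerate(layer_weights):
            hessians[layer] += 2.0 * x.T @ x       # proxy Hessian from this layer's input
            x = np.maximum(x @ W.T, 0.0)           # full-precision forward pass
    return hessians


rng = np.random.default_rng(0)
# Weights are stored as (out_features, in_features).
layer_weights = [rng.standard_normal((8, 4)), rng.standard_normal((6, 8))]
calib = [rng.standard_normal((16, 4)) for _ in range(3)]
hessians = collect_hessians(layer_weights, calib)
print([H.shape for H in hessians])
```

Because no $\mathbf H$ depends on a quantized layer, each `(W, H)` pair could then be passed to an independent quantization job.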

Experiments

Settings. The calibration set is C4.
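Main Results. toks/s is the system throughput in the decoding stage (prefill length 1, decoding length 256); cost/h is the training (quantization) time in hours (4×80GB A100 GPUs). VPTQ uses smaller codebooks than AQLM and has no online Hadamard transform as in QuIP#, so it decodes faster; its quantization time is also markedly lower than AQLM's. (Q: Why are some quantized models slower at inference than the full-precision model here, e.g. the 7B GPTQ model?)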

Impact of vector length on LLM inference speed. As the vector length increases (from 2 to 6), the granularity of memory access for reading the lookup table in dequantization increases, which allows memory access to match the GPU’s cache line (128 bytes @ L1). This reduces memory access transactions and decreases cache misses. As the vector length further increases (from 8 to 12) along with the size and levels of the codebook, the codebook size further increases, which results in the codebook not fitting in the L1 cache, thereby reducing the model’s inference speed. Additionally, we find that a reasonable setting (e.g., v = 6, k = 4096) can achieve throughput similar to the original model for the quantized model, demonstrating the efficiency of the VPTQ design.
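To get a feel for how quickly the codebook grows with the vector length, the arithmetic can be worked through for a fixed index budget. The sketch below assumes a single codebook with fp16 centroids, 2 index bits per weight (so $k = 2^{2v}$), and no residual or outlier codebooks. The configurations used in the paper may differ, and `codebook_stats` is a hypothetical helper name.

```python
BYTES_PER_VALUE = 2   # assumption: centroids stored in fp16


def codebook_stats(v, index_bits=2):
    """Return (k, bytes per centroid, codebook bytes) for a given vector length."""
    k = 2 ** (index_bits * v)                  # chosen so that log2(k) / v == index_bits
    centroid_bytes = v * BYTES_PER_VALUE
    return k, centroid_bytes, k * centroid_bytes


stats = {v: codebook_stats(v) for v in (2, 4, 6, 8)}
for v, (k, centroid_bytes, total) in stats.items():
    print(f"v={v}: k={k}, {centroid_bytes} B per centroid, codebook {total / 1024:.2f} KiB")
```

Under these assumptions the codebook is 48 KiB at v = 6 (k = 4096) and 1 MiB at v = 8 (k = 65536), which fits the observation that longer vectors eventually push the codebook out of the L1 cache.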

3-bit and 4-bit quantization.
Ablation Study. (1) Outlier Elimination. Rows #4, #8, #9, and #10 represent the results for eliminating 0%, 1%, 2%, and 5% outliers, respectively. (2) Finetuning. Rows #4, #11, and #12 show results for without any finetuning, with layer-wise finetuning, and with end-to-end finetuning, respectively. (3) Group Number. Rows #14, #15, #16, and #17 show the quantization results when 99% of parameters are divided into 1, 2, 4, and 8 groups, respectively. (4) Impact of Vector Length. Rows #2, #3, #4, and #6 show results for $v_1$ = 2, 4, 6, 8, keeping the average index bit at 2. (5) Residual Vector Quantization. Without any finetuning, rows #4 and #7; after layer-wise finetuning, rows #11 and #13.