QLoRA's Double Quantization: An Original Detailed Explanation

wangzhengkui123

Last modified 2023-06-24 20:24:57


Tags: python, neural networks, natural language processing, deep learning, transformer, gpt-3, transfer learning

First published 2023-06-24 15:23:03

Article link: https://blog.csdn.net/wangzhengkui123/article/details/131362235


1. Background

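Following LoRA, QLoRA has become very popular in its own right: it lowers GPU memory requirements even further, which is a real benefit for people without high-end hardware. However, many of the explanations of its Double Quantization step currently available online are unclear; many are just translations of the original paper and are hard to follow. After analyzing and verifying it myself, I am writing this post to share a detailed explanation.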

Code repository: GitHub, artidoro/qlora (QLoRA: Efficient Finetuning of Quantized LLMs)

3. Original text of the Double Quantization section

Double Quantization

We introduce Double Quantization (DQ), the process of quantizing the quantization constants for additional memory savings. While a small blocksize is required for precise 4-bit quantization [13], it also has a considerable memory overhead. For example, using 32-bit constants and a blocksize of 64 for W, quantization constants add 32/64 = 0.5 bits per parameter on average. Double Quantization helps reduce the memory footprint of quantization constants.

More specifically, Double Quantization treats quantization constants $c^{fp32}_2$ of the first quantization as inputs to a second quantization. This second step yields the quantized quantization constants $c^{fp8}_2$ and the second level of quantization constants $c^{fp32}_1$. We use 8-bit Floats with a blocksize of 256 for the second quantization as no performance degradation is observed for 8-bit quantization, in line with results from Dettmers and Zettlemoyer [13]. Since the $c^{fp32}_2$ are positive, we subtract the mean from $c_2$ before quantization to center the values around zero and make use of symmetric quantization. On average, for a blocksize of 64, this quantization reduces the memory footprint per parameter from 32/64 = 0.5 bits, to 8/64 + 32/(64·256) = 0.127 bits, a reduction of 0.373 bits per parameter.

4. Interpretation

First, the key terms:

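Quantization	quantization (量化)
bit	bit; 8 bits make 1 byte
$c^{fp32}_2$	the 32-bit float saved during the first quantization so that the result can be dequantized
$c^{fp8}_2$	the 8-bit float produced by the second quantization
$c^{fp32}_1$	the 32-bit float saved during the second quantization so that the result can be dequantized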

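The quantization method introduced earlier in the paper assumes that the weights approximately follow a zero-mean normal distribution, so the distribution of a group of weights can be summarized by a single scale value. Quantizing a block of weights therefore requires storing not only the quantized values but also one extra 32-bit float, $c^{fp32}_2$, that records this scale so the block can be dequantized (the paper obtains it by absolute-maximum rescaling of the block). That constant takes 32 bits. So if only the first quantization is performed, the extra storage per block, beyond the quantized values themselves, is 32 bits. If the blocksize (the number of values quantized together) is 64, those 32 bits are spread over 64 values, giving 32/64 = 0.5 extra bits per value.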
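The first-level overhead can be checked with a small sketch. It is a simplification, not the QLoRA implementation: it uses a uniform symmetric 4-bit integer grid rather than the NF4 data type, and the function names (`quantize_block`, `quantize_blockwise`, `overhead_bits_per_param`) are hypothetical names chosen for this illustration.

```python
import random

def quantize_block(block, bits=4):
    """Quantize one block; return (integer codes, per-block scale constant)."""
    levels = 2 ** (bits - 1) - 1                 # 7 levels on each side of zero for 4 bits
    scale = max(abs(x) for x in block) or 1.0    # absmax of the block (1.0 if the block is all zeros)
    codes = [round(x / scale * levels) for x in block]
    return codes, scale

def dequantize_block(codes, scale, bits=4):
    """Map integer codes back to approximate float values."""
    levels = 2 ** (bits - 1) - 1
    return [c / levels * scale for c in codes]

def quantize_blockwise(weights, blocksize=64, bits=4):
    """Split weights into blocks and quantize each one independently."""
    blocks = [weights[i:i + blocksize] for i in range(0, len(weights), blocksize)]
    quantized = [quantize_block(b, bits) for b in blocks]
    codes = [q[0] for q in quantized]
    constants = [q[1] for q in quantized]        # one constant per block (stored as FP32 in the paper)
    return codes, constants

def overhead_bits_per_param(num_params, num_constants, constant_bits=32):
    """Extra bits per parameter spent on the quantization constants."""
    return num_constants * constant_bits / num_params

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64 * 256)]  # made-up example weights
codes, constants = quantize_blockwise(weights)
print(len(constants))                                          # 256
print(overhead_bits_per_param(len(weights), len(constants)))   # 0.5
```

With 64*256 weights and a blocksize of 64 there are 256 constants, and storing each as a 32-bit float costs 0.5 extra bits per weight, matching the figure above.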

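To shrink this overhead further, the authors quantize $c^{fp32}_2$ itself. Suppose 64*256 values need to be quantized. They are split into 256 blocks of 64 values each, and quantizing these 256 blocks produces 256 constants $c^{fp32}_2$. To reduce the extra storage, these 256 constants are quantized a second time: they are converted to an 8-bit float format ($c^{fp8}_2$), and one more FP32 value, $c^{fp32}_1$, records the scale of the 256 $c^{fp32}_2$ values. The extra storage for quantizing 64*256 values is therefore (8*256+32)/(64*256) = 8/64+32/(64*256) ≈ 0.127 bits per value. The overhead per value drops from 0.5 to 0.127 bits, a saving of 0.373 bits. Note that these figures are not the total storage per quantized weight, only the extra storage on top of the quantized values.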
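The arithmetic and the second-level step can be sketched as follows. This is a rough illustration under stated assumptions: the constants are quantized onto a signed 8-bit integer grid instead of the FP8 format the paper uses, and all function names are hypothetical. The sketch keeps the subtracted mean so that it can dequantize; the 0.127 figure counts only the 8-bit codes and the one 32-bit constant per 256 blocks.

```python
def overhead_single(blocksize=64, const_bits=32):
    """Extra bits per parameter with one FP32 constant per block."""
    return const_bits / blocksize

def overhead_double(blocksize=64, blocksize2=256, const_bits=8, const2_bits=32):
    """Extra bits per parameter with 8-bit constants plus one FP32 constant per blocksize2 blocks."""
    return const_bits / blocksize + const2_bits / (blocksize * blocksize2)

def quantize_constants(constants, bits=8):
    """Second-level quantization of one group of first-level constants.

    Simplification: signed 8-bit integer grid, not FP8.
    """
    mean = sum(constants) / len(constants)
    centered = [c - mean for c in constants]          # constants are positive, so center them around zero
    levels = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = max(abs(c) for c in centered) or 1.0      # second-level constant
    codes = [round(c / scale * levels) for c in centered]
    return codes, scale, mean

def dequantize_constants(codes, scale, mean, bits=8):
    """Recover approximate first-level constants from their 8-bit codes."""
    levels = 2 ** (bits - 1) - 1
    return [q / levels * scale + mean for q in codes]

single = overhead_single()
double = overhead_double()
print(round(single, 3), round(double, 3), round(single - double, 3))  # 0.5 0.127 0.373
```

The exact value of the doubly quantized overhead is 0.125 + 32/16384 = 0.126953125 bits, which rounds to the 0.127 quoted in the paper.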