AWQ Quantization and an AutoAWQ Code Walkthrough

AWQ quantization comes from a 2023 paper on LLM quantization by Song Han's group at MIT: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.


Before getting into AWQ itself, here is a brief overview of model quantization.

1. Why quantize a model, and what does quantization buy you?

Models are quantized because we normally train and run them in fp16 (floating point 16) or bf16 (Brain Floating Point). fp16 has 1 sign bit, 5 exponent bits and 10 mantissa bits; its representable range is finfo(resolution=0.001, min=-65504, max=65504, eps=0.000976562, smallest_normal=6.10352e-05, tiny=6.10352e-05, dtype=float16).

bf16 has 1 sign bit, 8 exponent bits and 7 mantissa bits; its representable range is finfo(resolution=0.01, min=-3.38953e+38, max=3.38953e+38, eps=0.0078125, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=bfloat16).
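The ranges quoted above are simply the repr of torch.finfo, which you can print yourself:

import torch

# Numeric limits of the two 16-bit float formats
print(torch.finfo(torch.float16))   # resolution=0.001, min=-65504, max=65504, ...
print(torch.finfo(torch.bfloat16))  # resolution=0.01, min/max ≈ ±3.39e+38, ...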

As the finfo output shows, fp16 and bf16 cover a very wide range, but this comes with two problems. (1) Memory (VRAM) overhead is large. Taking the Qwen1.5-32B model the author actually uses as an example, a 32B 16-bit floating-point model needs about 64 GB of VRAM just to load; adding the calibration data and the KV cache, it takes roughly 4×RTX 4090 (96 GB) to run. That is a substantial cost; a 4×4090 server runs around 100,000 RMB. (2) Runtime overhead is also large. Matrix multiplication with 16-bit floating-point weights is slow, and SqueezeLLM even argues that the runtime bottleneck is loading the model weights, so lowering the weight bit-width can significantly speed up inference. (In my own tests with AWQ, however, the AWQ model actually ran slower than the 16-bit floating-point model, because the AWQ weights must be dequantized before the GEMM or GEMV.)

Once these drawbacks of 16-bit floating-point models were recognized, people started converting the 16-bit floats into lower-bit integer types, e.g. int8 (LLM.int8, SmoothQuant), int4 (GPTQ, AWQ), 3-bit (SqueezeLLM), and even 1–2 bit quantization (AQLM). Quantization mitigates both problems above: when AWQ quantizes a model from 16-bit floating point (16 bits) to int4 (4 bits), the model shrinks from 64 GB to 16 GB, a quarter of its original size, saving a large amount of compute resources.

2. How do you quantize a model?

For a general introduction, see the CSDN blog post 模型量化详解.

In short, a 16-bit floating-point weight can be reconstructed as a low-bit integer multiplied by a scale.
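As a toy illustration of that idea (my own sketch, not AutoAWQ code; AWQ itself uses grouped asymmetric quantization via pseudo_quantize_tensor, shown later), symmetric round-to-nearest quantization looks like this:

import torch

def rtn_quantize(w: torch.Tensor, n_bits: int = 4):
    # symmetric scale: map the largest magnitude onto the int4 grid
    delta = w.abs().max() / (2 ** (n_bits - 1) - 1)
    # quantize: low-bit integer representation
    w_int = torch.round(w / delta).clamp(-(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    # dequantize: the low-bit integer times the scale approximates the fp16 weight
    return w_int, delta, w_int * delta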


2.1 The main idea of AWQ

Key insight 1: weights are not equally important; only a small fraction of salient weights has a large impact on the inference result

The authors point out that weights are not equally important: only about 0.1%–1% of them, the salient weights, have a significant impact on the model's output accuracy. If we could keep just this 0.1%–1% of the weights in their original precision (FP16) and quantize the rest to low bit-width, we could keep accuracy almost unchanged while drastically reducing memory footprint and improving inference speed. This raises the question of how to identify the salient weights; three common approaches are:

  • Random selection: pick 0.1%–1% of the weights at random as salient and hope for the best; obviously not very scientific.

  • Selection by weight distribution: sort the elements of a weight matrix (e.g. W_q, W_k, W_v in self-attention) by absolute value, treat larger magnitudes as more salient, and take the top 0.1%–1%.

  • Selection by activation distribution: the activations are the inputs that are multiplied (matmul) with the weight matrix; channels whose activations have large magnitude are treated as salient.

The authors tested all three approaches (Table 1 in the paper): random selection performs about as well as plain RTN (round-to-nearest) quantization, and selection based on weight magnitude is no better than random, while selecting weights by activation magnitude nearly matches FP16 accuracy.

To keep the method simple to implement, the authors select salient weights at the channel level rather than the element level, i.e. one row of the weight matrix is treated as a unit. Concretely, they take the mean absolute value of each column of the activations, treat the channels corresponding to the largest means as salient, keep those in FP16, and quantize the remaining channels to low bit-width (see the sketch below).
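A hypothetical sketch of this selection step (my illustration, not AutoAWQ code): average the absolute activations per input channel and take the top 1% as salient channels.

import torch

def salient_channels(x: torch.Tensor, top_ratio: float = 0.01) -> torch.Tensor:
    # x: calibration activations flattened to [n_tokens, in_features]
    channel_mean = x.abs().mean(dim=0)             # mean |activation| per column (channel)
    k = max(1, int(top_ratio * channel_mean.numel()))
    return torch.topk(channel_mean, k).indices     # indices of the salient channels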

But this raises another problem: if some elements of a weight matrix are stored in FP16 and others in INT4, both storage and the memory accesses during compute become awkward. So the authors came up with a workaround: scaling.

Key insight 2: scaling up the salient weights before quantization reduces their quantization error

The quantization of a weight w can be written as Q(w) = Δ · Round(w / Δ), with Δ = max(|w|) / 2^(N−1), where N is the bit-width after quantization and Δ is the quantization scale. w′ = Round(w / Δ) is the quantization step and Δ · w′ is the dequantization step. The original w, Δ and the input x are all FP16 and introduce no precision loss; the entire loss comes from the Round function, whose error is approximately uniform on [0, 0.5] with expectation 0.25, written RoundErr(·) ≈ 0.25.

Now consider a single element w of the weight matrix and introduce a scaling factor s > 1. Multiply w by s before quantizing, i.e. w′ = Round(w·s / Δ′), and correspondingly dequantize as Δ′ · w′ · x / s. Computationally this is "equivalent" to the original, as in Equation 2: Q(w·s) · (x / s) = Δ′ · Round(w·s / Δ′) · x · (1/s).

Equation 1 and Equation 2 compute the same result, yet their precision losses differ: the error of the original quantization is Err(Q(w)·x) = Δ · RoundErr(w/Δ) · x, while the scaled version gives Err(Q(w·s)·(x/s)) = Δ′ · RoundErr(w·s/Δ′) · x · (1/s). Since scaling a single channel rarely changes the group maximum, Δ′ ≈ Δ, so the error of the scaled weight shrinks roughly by the factor Δ′/(Δ·s) ≈ 1/s.

The authors therefore changed tack: to stay hardware-friendly, all weights are quantized to low bit-width, but during quantization the salient weights are multiplied by a larger s (lowering their quantization error) while the non-salient weights get a smaller s (receiving less attention). This is the scaling trick mentioned in the previous section.

Algorithm: computing the scaling factors automatically

Following the authors' reasoning, the larger the activation, the more salient the corresponding channel, and the larger the scaling factor it should get to reduce its quantization error. They therefore take the per-channel average activation magnitude s_x (the mean absolute value of each column of the input matrix) as the base scaling factor, and introduce a variable α to balance salient and non-salient channels: s = s_x^α. The problem then becomes finding the α that minimizes L(s) = ||Q(W · diag(s))(diag(s)^(-1) · X) − W·X||, which is done by grid search: in the source code, 20 evenly spaced values in [0, 1) are tried as α and the one with the smallest L(s) is kept. SmoothQuant follows the same idea as AWQ, but computes s as s_j = max(|X_j|)^α / max(|W_j|)^(1−α).

2.2 Code

AutoAWQ's quantization process starts from AwqQuantizer.init_quant(); self.awq_model is the Qwen2AWQForCausalLM class (taking Qwen1.5 as the example) and self.model is the loaded Qwen model.

 def init_quant(self, n_samples=128, seqlen=512):
        modules = self.awq_model.get_model_layers(self.model) # return model.model.layers
        samples = get_calib_dataset(
            data=self.calib_data,
            tokenizer=self.tokenizer,
            n_samples=n_samples,
            block_size=seqlen,
            split=self.split,
            text_column=self.text_column,
        )
        samples = torch.cat(samples, dim=0)
-----------------------------------------------------------
        class Catcher(nn.Module):
            def __init__(self, module):
                super().__init__()
                self.module = module

            def forward(self, *args, **kwargs):
                # assume first input to forward is hidden states
                if len(args) > 0:
                    hidden_states = args[0]
                    del args
                else:
                    first_key = list(kwargs.keys())[0]
                    hidden_states = kwargs.pop(first_key)

                inps.append(hidden_states)
                layer_kwargs.update(kwargs)
                raise ValueError  # early exit to break later inference
-----------------------------------------------------------       
        return modules, layer_kwargs, inps

[STEP 1]: Get layer, extract linear modules, extract input features

The linear layers inside the module are collected into a dictionary, and _get_input_feat captures and stores the input of each layer.

# [STEP 1]: Get layer, extract linear modules, extract input features
    named_linears = get_named_linears(self.modules[i])
    # named_linears is the dictionary of named linear layers in the module, e.g. :
    """
    {'self_attn.q_proj': Linear(in_features=1024, out_features=1024, bias=True),
    'self_attn.k_proj': Linear(in_features=1024, out_features=1024, bias=True),
    'self_attn.v_proj': Linear(in_features=1024, out_features=1024, bias=True),
    'self_attn.o_proj': Linear(in_features=1024, out_features=1024, bias=False),
    'mlp.gate_proj': Linear(in_features=1024, out_features=2816, bias=False),
    'mlp.up_proj': Linear(in_features=1024, out_features=2816, bias=False),
    'mlp.down_proj': Linear(in_features=2816, out_features=1024, bias=False)}
    """

    # Filter out the linear layers we don't want to exclude
    named_linears = exclude_layers_to_not_quantize(
        named_linears, self.modules_to_not_convert
    )

    input_feat = self._get_input_feat(self.modules[i], named_linears)
    clear_memory()

[STEP 2]: Compute and apply scale list

module_config: List[Dict] = self.awq_model.get_layers_for_scaling(
    self.modules[i], input_feat, self.module_kwargs
)
# The call above extracts the block's layers into a list of dicts:
# prev_op is the preceding op, layers are the linear layers of the current group,
# inp is the captured input features, and module2inspect is the module containing all of them.

scales_list = [
    self._search_best_scale(self.modules[i], **layer)
    for layer in module_config
]
apply_scale(self.modules[i], scales_list, input_feat_dict=input_feat)

Step 2 computes the scaling for each layer. module_config is a list of dicts containing the current linear layers, the preceding op, and the input features. To find the best scale, the weights are first grouped and normalised, then averaged along the channel dimension to get the weight mean; likewise the input x is averaged per channel. The FP16 output of the module is also computed as the reference for picking the best scale.

@torch.no_grad()
    def _search_best_scale(
        self,
        module,
        prev_op,
        layers: List[nn.Linear],
        inp: torch.Tensor,
        module2inspect=None,
        kwargs={},
    ):
        if module2inspect is None:
            assert len(layers) == 1
            module2inspect = layers[0]

        if "use_cache" in kwargs:
            kwargs.pop("use_cache")

        # Put x on the right device
        inp = inp.to(next(module2inspect.parameters()).device)

        # [STEP 1]: Compute per-channel mean of normalised weights
        # All layer weights are concatted together
        weight = torch.cat([_m.weight for _m in layers], dim=0)
        org_shape = weight.shape
        # The weights are reshaped to be organised by quantization group
        weight = weight.view(-1, self.group_size)
        # Calculates the relative magnitude of the weights within each of the quantization groups, 
        # and rescales each group individually so that each group has weights on a 0-1 scale.
        w_scale = weight.abs() / (weight.abs().amax(dim=1, keepdim=True) + 1e-6)
        # Resizes the rescaled weight matrix back up to its original dimensions
        w_scale = w_scale.view(org_shape)
        # Gets the average rescaled magnitude for each output channel
        w_mean = w_scale.mean(0)
        clear_memory(weight)

        # [STEP 2]: Compute per-channel mean of the input activation
        x_mean = inp.abs().view(-1, inp.shape[-1]).mean(0)

        # [STEP 3]: Compute output of module
        with torch.no_grad():
            module_kwargs = self._sanitize_kwargs(kwargs, module2inspect)

            fp16_output = module2inspect(inp, **module_kwargs)
            if isinstance(fp16_output, tuple):
                fp16_output = fp16_output[0]

        # [STEP 4]: Compute loss
        best_scales = self._compute_best_scale(
            inp, w_mean, x_mean, module2inspect, layers, fp16_output, module_kwargs
        )

        return (
            get_op_name(module, prev_op),
            tuple([get_op_name(module, m) for m in layers]),
            best_scales,
        )

_compute_best_scale performs the grid search: for the scale formula s = s_x^α (Equation (4) in the paper), it directly uses x_mean raised to the power α (called ratio in the code) as s, with α acting as the balancing factor. The grid search looks for the α that minimises the difference (L2 loss) between the quantized module's output and the FP16 module's output.

def _compute_best_scale(
        self,
        x,
        w_mean,
        x_mean,
        module2inspect,
        linears2scale: List[nn.Linear],
        fp16_output,
        kwargs={},
    ):
        """
        Compute loss and select best scales

        L(s) = || Q(W * s) (s^-1 * X) - W * X ||
        Q: weight quantization function | pseudo_quantize_tensor(W * s)
        X: inputs from calib dataset    | X
        W: original weights in FP16     | layer
        s: per channel scaling factor   | s^-1 * X
        """
        n_grid = 20
        history = []
        best_ratio = -1
        best_scales = None
        best_error = float("inf")

        org_sd = {k: v.cpu() for k, v in module2inspect.state_dict().items()}

        device = x.device
        x_mean = x_mean.view(-1).to(device)
        w_mean = w_mean.view(-1).to(device)

        for ratio in range(n_grid):
            # create new scales
            ratio = ratio / n_grid

            # NOTE: s^-1 * x is fused here, according to paper
            if self.duo_scaling:
                scales = (x_mean.pow(ratio) / (w_mean.pow(1 - ratio) + 1e-4)).clamp(min=1e-4)
            else:
                scales = x_mean.pow(ratio).clamp(min=1e-4).view(-1)
            scales = scales / (scales.max() * scales.min()).sqrt()
            scales_view = scales.view(1, -1).to(device)

            # Q(W * s)
            for fc in linears2scale:
                fc.weight.mul_(scales_view)
                fc.weight.data = (
                    self.pseudo_quantize_tensor(fc.weight.data)[0] / scales_view
                )

            # W * X
            int_w_output = module2inspect(x, **kwargs)
            if isinstance(int_w_output, tuple):
                int_w_output = int_w_output[0]

            # compute mean squared error (L2 norm)
            loss = (
                (fp16_output - int_w_output).float().pow(2).mean().item()
            )  # NOTE: float prevents overflow

            history.append(loss)
            if loss < best_error:
                best_error = loss
                best_ratio = ratio
                best_scales = scales.clone()
            module2inspect.load_state_dict(org_sd)

        if best_ratio == -1:
            logging.debug(history)
            raise Exception

        assert torch.isnan(best_scales).sum() == 0, best_scales

        return best_scales.detach().cpu()

apply_scale then applies these scales to the weights of each layer, for example:
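Roughly, based on my reading of AutoAWQ's scale_ln_fcs helper (a sketch, not the verbatim source): the previous op's weights are divided by the scales and the following linear layers' input channels are multiplied by them, so the overall computation stays equivalent while the target weights effectively become W·s.

import torch
import torch.nn as nn

@torch.no_grad()
def scale_ln_fcs_sketch(ln: nn.LayerNorm, fcs: list, scales: torch.Tensor):
    # fold 1/s into the previous op (here a LayerNorm over the same hidden dim) ...
    ln.weight.div_(scales)
    if getattr(ln, "bias", None) is not None:
        ln.bias.div_(scales)
    # ... and fold s into the input channels of the following linear layers,
    # so fc(ln(x)) is unchanged while fc.weight effectively becomes W * s
    for fc in fcs:
        fc.weight.mul_(scales.view(1, -1))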

Test: changing the grid size

The grid size controls how finely the balancing factor is sampled. I ran the search with grid = 10, 20 (the AWQ default), 40 and 100; judging by the results, changing the grid size did not bring any accuracy improvement.

[STEP 3]: Compute and apply clipping list

Clipping also uses a grid search, this time to find the most suitable maximum value, to which the weights are then clipped.

for i_b in range(org_w_shape[0] // oc_batch_size):
    w = w_all[i_b * oc_batch_size : (i_b + 1) * oc_batch_size]

    org_max_val = w.abs().amax(dim=-1, keepdim=True)  # co, 1, n_group, 1

    best_max_val = org_max_val.clone()
    min_errs = torch.ones_like(org_max_val) * 1e9
    input_feat = input_feat.to(w.device)
    org_out = (input_feat * w).sum(dim=-1)  # co, n_token, n_group

    for i_s in range(int(max_shrink * n_grid)):
        max_val = org_max_val * (1 - i_s / n_grid)
        min_val = -max_val
        cur_w = torch.clamp(w, min_val, max_val)
        q_w = self.pseudo_quantize_tensor(cur_w)[0]
        cur_out = (input_feat * q_w).sum(dim=-1)

        # co, 1, n_group, 1
        err = (cur_out - org_out).pow(2).mean(dim=1).view(min_errs.shape)
        del cur_w
        del cur_out
        cur_best_idx = err < min_errs
        min_errs[cur_best_idx] = err[cur_best_idx]
        best_max_val[cur_best_idx] = max_val[cur_best_idx]
    best_max_val_all.append(best_max_val)

best_max_val = torch.cat(best_max_val_all, dim=0)

@torch.no_grad()
def apply_clip(module, clip_list: Tuple[str, torch.Tensor]):
    for name, max_val in clip_list:
        layer: nn.Linear = get_op_by_name(module, name)
        layer.to(get_best_device())
        max_val = max_val.to(layer.weight.device)
        org_shape = layer.weight.shape
        layer.weight.data = layer.weight.data.reshape(*max_val.shape[:2], -1)
        layer.weight.data = torch.clamp(layer.weight.data, -max_val, max_val)
        layer.weight.data = layer.weight.data.reshape(org_shape)
        layer.cpu()

[STEP 4]: Quantize weights

    def _apply_quant(self, module, named_linears: Dict[str, nn.Linear]):
        for name, linear_layer in named_linears.items():
            # NOTE: small regression in perplexity if linear layer uses .cpu().float()
            linear_layer = linear_layer.to(get_best_device()).half()

            linear_layer.weight.data, scales, zeros = self.pseudo_quantize_tensor(
                linear_layer.weight.data
            )

            if self.version == "gemm":
                scales = scales.t().contiguous()
                if zeros is not None:
                    zeros = zeros.t().contiguous()
                q_linear_module = WQLinear_GEMM

            elif self.version == "gemv":
                q_linear_module = WQLinear_GEMV

            elif self.version == "marlin":
                q_linear_module = WQLinear_Marlin
            
            elif self.version == "gemv_fast":
                q_linear_module = WQLinear_GEMVFast

            else:
                raise ValueError(f"Unknown version {self.version}")

            q_linear = q_linear_module.from_linear(
                linear=linear_layer,
                w_bit=self.w_bit,
                group_size=self.group_size,
                init_only=False,
                scales=scales,
                zeros=zeros,
            )

            linear_layer.cpu()
            q_linear.to(next(module.parameters()).device)
            set_op_by_name(module, name, q_linear)
            clear_memory()

In WQLinear_GEMM, the weights are quantized group-wise by the code below; each group shares one scale.

pack_num = 32 // awq_linear.w_bit
intweight = []
# scale_zeros (= zeros * scales) is computed earlier in from_linear
for idx in range(awq_linear.in_features):
    intweight.append(
        torch.round(
            (linear.weight.data[:, idx] + scale_zeros[idx // group_size])
            / awq_linear.scales[idx // group_size]
        ).to(torch.int)[:, None]
    )
intweight = torch.cat(intweight, dim=1)
intweight = intweight.t().contiguous()
intweight = intweight.to(dtype=torch.int32)

After quantization, an int32 tensor qweight is created to store the quantized weights. Since the quantized weights need to be stored as int4, a single int32 element can hold 8 of them, and bit shifts are used to pack 8 int4 values into one int32.

qweight = torch.zeros(
    (intweight.shape[0], intweight.shape[1] // 32 * awq_linear.w_bit),
    dtype=torch.int32,
    device=intweight.device,
)

for col in range(intweight.shape[1] // pack_num):
    if awq_linear.w_bit == 4:
        order_map = [0, 2, 4, 6, 1, 3, 5, 7]
    else:
        raise NotImplementedError("Only 4-bit are supported for now.")
    for i in range(pack_num):
        qweight_col = intweight[:, col * pack_num + order_map[i]]
        qweight[:, col] |= qweight_col << (i * awq_linear.w_bit)
awq_linear.qweight = qweight
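
To make the bit layout concrete, here is a hypothetical helper (my own, not part of AutoAWQ) that inverts the packing above: each int32 is shifted and masked to recover its 8 int4 values, undoing the order_map interleaving.

import torch

def unpack_qweight(qweight: torch.Tensor, w_bit: int = 4) -> torch.Tensor:
    pack_num = 32 // w_bit                    # 8 int4 values per int32
    order_map = [0, 2, 4, 6, 1, 3, 5, 7]      # same interleaving as the packing loop
    out = torch.zeros(
        qweight.shape[0], qweight.shape[1] * pack_num,
        dtype=torch.int32, device=qweight.device,
    )
    for col in range(qweight.shape[1]):
        for i in range(pack_num):
            val = (qweight[:, col] >> (i * w_bit)) & ((1 << w_bit) - 1)
            out[:, col * pack_num + order_map[i]] = val
    return out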

In both the forward and the backward pass, GEMM has to dequantize the weights before doing the actual computation.

class WQLinearMMFunction(Function):
    @staticmethod
    # ctx is the first argument to forward
    def forward(
        ctx,
        x,
        qweight,
        qzeros,
        scales,
        w_bit=4,
        group_size=128,
        bias=None,
        out_features=0,
    ):
        # The forward pass can use ctx.
        ctx.save_for_backward(x, qweight, qzeros, scales, bias)
        ctx.out_features = out_features

        out_shape = x.shape[:-1] + (out_features,)
        x = x.to(torch.float16)

        if AWQ_INSTALLED:
            FP16_MATMUL_HEURISTIC_CONDITION = x.shape[0] * x.shape[1] >= 1024

            if FP16_MATMUL_HEURISTIC_CONDITION:
                out = awq_ext.dequantize_weights_cuda(
                    qweight, scales, qzeros, 0, 0, 0, False
                )
                out = torch.matmul(x, out)
            else:
                out = awq_ext.gemm_forward_cuda(
                    x.reshape(-1, x.shape[-1]), qweight, scales, qzeros, 8
                )
        else:
            out = dequantize_gemm(qweight, qzeros, scales, w_bit, group_size)
            out = torch.matmul(x, out)

        out = out + bias if bias is not None else out
        out = out.reshape(out_shape)

        # always want 3D tensor if tensor is 2D
        if len(out.shape) == 2:
            out = out.unsqueeze(0)

        return out
