AWQ: Activation-aware Weight Quantization for LLM Quantization and Acceleration (1): Background and Principles

1. Background

  • AWQ (Activation-aware Weight Quantization) is a weight quantization technique for large language models (LLMs) that aims to shrink model size and speed up inference while preserving model quality. Its core idea is that not all weights matter equally to model performance: by protecting the most important weights (usually a small fraction, around 0.1%-1%), the quantization error can be reduced significantly. AWQ does not rely on backpropagation or reconstruction, which helps the model retain its generalization across domains and modalities. In addition, AWQ comes with an efficient inference framework that delivers substantial speedups on desktop and mobile GPUs.

2. Introduction

  • Problems with the status quo
    • QAT (quantization-aware training) carries a high training cost.
    • PTQ (post-training quantization) loses considerable accuracy at low bit-widths. GPTQ alleviates this to some degree by using second-order information, but it tends to overfit the calibration data: during reconstruction it can distort the learned features on out-of-distribution domains.
  • Based on the assumption that weights are not equally important to LLM quality, the paper proposes AWQ:
    • Saliency is determined from the activations rather than from the weight values; what actually gets quantized is still the model weights.
    • Consequently, different weight channels use different scaling factors, so that each channel is quantized close to optimally.
  • The paper also implements an efficient inference framework that achieves more than a 3x speedup over the HuggingFace implementation. AWQ is now supported by inference frameworks such as vLLM, HuggingFace, and LMDeploy; a minimal usage sketch follows below.
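As a quick illustration of that framework support, the sketch below loads an AWQ-quantized checkpoint with vLLM and runs a single prompt through it. The checkpoint name is just a placeholder and the exact arguments can differ across vLLM versions, so treat this as a hedged example rather than canonical usage.

```python
from vllm import LLM, SamplingParams

# Placeholder AWQ checkpoint; substitute the quantized model you actually use.
MODEL_ID = "TheBloke/Llama-2-7B-AWQ"

# vLLM can load AWQ weights when quantization="awq" is passed (flags may vary by version).
llm = LLM(model=MODEL_ID, quantization="awq", dtype="float16")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(
    ["Explain activation-aware weight quantization in one sentence."], params
)
print(outputs[0].outputs[0].text)
```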

3. Implementation Principles

3.1 Preliminary Approach, Reasoning, and Experiments

  • Figure (a): RTN (round-to-nearest) quantization simply rounds every weight to the nearest grid point; the result is mediocre, with a perplexity (PPL) of 43.2.
  • Figure (b): x denotes the activations. By finding the important activation channels, we can locate the salient weights (weights that have an outsized influence on the final output), shown in blue on the right. Keeping the salient-weight channels in FP16 and applying RTN to everything else brings the PPL down to 13.0. However, this scheme is hardware-unfriendly because the weights are stored in mixed precision.
  • Figure (c): the third scheme still quantizes all of W, but the quantization is tailored per channel according to the activations x. In the Q(W) matrix on the right, each row has a different color, indicating that each channel uses a different scaling factor. The underlying principle is explained below.
  • Table 1 backs this up experimentally: selecting which weights to protect by activation magnitude gets much closer to the full-FP16 result than selecting by weight magnitude or at random. Moreover, keeping only 0.1%-1% of the channels in FP16 and quantizing the rest already brings the model very close to FP16 quality; a small selection sketch is shown after this list.
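To make the mixed-precision idea of Figure (b) and Table 1 concrete, here is a minimal NumPy sketch. It applies plain RTN quantization and then keeps the top input channels, ranked by average activation magnitude, in FP16. The tensor shapes, the 1% keep ratio, and the helper names are illustrative assumptions, not the paper's reference code.

```python
import numpy as np

def rtn_quantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Plain round-to-nearest (RTN) quantization, per output row, min-max grid."""
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / (2 ** n_bits - 1)
    q = np.round((w - w_min) / scale)
    return q * scale + w_min                      # de-quantized weights

def quantize_keep_salient(w, x_calib, n_bits=4, keep_ratio=0.01):
    """RTN-quantize W, but keep the input channels with the largest average
    activation magnitude (the salient channels) in FP16, as in Figure (b)."""
    importance = np.abs(x_calib).mean(axis=0)     # per-input-channel activation magnitude
    n_keep = max(1, int(keep_ratio * w.shape[1]))
    salient = np.argsort(importance)[-n_keep:]    # indices of the most salient channels
    w_q = rtn_quantize(w, n_bits)
    w_q[:, salient] = w[:, salient]               # restore salient columns to full precision
    return w_q

# Toy usage: a 64x128 linear layer and 32 calibration activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128)).astype(np.float32)
X = rng.normal(size=(32, 128)).astype(np.float32)
X[:, 0] *= 20.0                                   # make channel 0 clearly salient

def output_err(w_hat):
    return np.mean((X @ W.T - X @ w_hat.T) ** 2)

print("RTN only          :", output_err(rtn_quantize(W)))
print("RTN + salient FP16:", output_err(quantize_keep_salient(W, X)))
```

The second line printed should be noticeably smaller, mirroring the Table 1 finding that protecting a tiny, activation-selected set of channels recovers most of the lost quality.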

3.2 Effect of Scaling Up Weights on the Quantization Error

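  • Recall the quantization function used throughout the paper: Q(w) = Δ · Round(w/Δ), with step size Δ = max(|w|) / 2^(N-1) for N-bit quantization. The rounding error of each element is roughly uniform in [-0.5, 0.5] · Δ, regardless of that element's magnitude.
  • The paper's key observation: if a salient weight channel is multiplied by s > 1 while the corresponding activation channel is divided by s, the layer output is mathematically unchanged before rounding, i.e. Q(w · s) · (x / s) ≈ w · x. As long as s is moderate enough that the group maximum, and therefore Δ, barely changes, the rounding error contributed by that channel shrinks roughly by a factor of 1/s.
  • For very large s the group maximum does start to move, Δ grows, and the error on the non-salient channels increases, which is why AWQ searches for a moderate per-channel scale instead of simply making s as large as possible.

Below is a minimal NumPy sketch of this effect for a single quantization group; the group size, bit-width, and scale values are illustrative choices, not the paper's settings.

```python
import numpy as np

def quantize_dequantize(w, n_bits=4):
    """Symmetric RTN over one quantization group: Q(w) = delta * round(w / delta)."""
    delta = np.abs(w).max() / (2 ** (n_bits - 1))
    return delta * np.round(w / delta)

rng = np.random.default_rng(0)
group_size, trials, salient = 128, 2000, 0

for s in [1.0, 2.0, 4.0]:
    errs = []
    for _ in range(trials):
        w = rng.normal(scale=0.02, size=group_size)   # one weight group
        x = rng.normal(size=group_size)               # matching activations
        w_s, x_s = w.copy(), x.copy()
        w_s[salient] *= s        # scale the salient weight up ...
        x_s[salient] /= s        # ... and its activation down: w*x is unchanged before rounding
        q = quantize_dequantize(w_s)
        # Error contributed by the salient channel alone.
        errs.append(abs(q[salient] * x_s[salient] - w[salient] * x[salient]))
    print(f"s = {s:.0f} -> mean error on the salient channel: {np.mean(errs):.6f}")
```

On average the printed error shrinks as s grows, matching the 1/s argument above.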