SOTA LLM INT4 Algorithm AutoRound: Give It a Try

PeaceInMind

First published 2024-02-28 10:05:10; last modified 2024-04-02 19:14:27

Article link: https://blog.csdn.net/PeaceInMind/article/details/136338011

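Introduction

We recently released AutoRound v0.1, a weight quantization algorithm designed for low-bit LLM inference. It achieves near-lossless compression on a range of popular models, including gemma-7B, Mistral-7b-v0.1, Mistral-7B-Instruct-v0.2, Mixtral-8x7B-Instruct-v0.1, Phi2, LLAMA2, and Qwen1.5-7B-chat. With only 200 tuning steps, AutoRound outperforms recent methods (GPTQ[1], AWQ[2], OmniQuant[3], and HQQ[4]) in most scenarios at W4G128, W4G-1, W3G128, and W2G128, and accuracy generally improves further as the number of tuning steps increases. In addition, AutoRound introduces no extra overhead at inference time.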
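The scheme labels above can be unpacked with a small helper. This is a minimal sketch that assumes the usual reading of the notation: W followed by the weight bit width, G followed by the group size, with -1 meaning a single group per weight row (per-channel). The function names and the row length of 4096 are hypothetical and only serve the illustration.

```python
import re

def parse_scheme(name):
    """Split a label such as "W4G128" into (bits, group_size)."""
    match = re.fullmatch(r"W(\d+)G(-?\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized scheme: {name}")
    return int(match.group(1)), int(match.group(2))

def groups_per_row(in_features, group_size):
    """Number of quantization groups along one weight row."""
    if group_size == -1:  # assumed: one group spans the whole row
        return 1
    return -(-in_features // group_size)  # ceiling division

for scheme in ("W4G128", "W4G-1", "W3G128", "W2G128"):
    bits, group_size = parse_scheme(scheme)
    # 4096 is a hypothetical row length, not a value from the article
    print(scheme, bits, group_size, groups_per_row(4096, group_size))
```

Under this reading, the two parsed values correspond to the bits and group_size arguments passed to AutoRound in the usage section.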

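Key features:

Broad model support: AutoRound works across many model families; about 20 model families have been validated.

Deployment flexibility: quantized models can be easily exported to the ITREX[5] and AutoGPTQ[6] formats for seamless deployment on Intel CPU and Nvidia GPU platforms, respectively.

Tuning device compatibility: tuning can run on Intel Gaudi2, Intel CPU, and Nvidia GPU.

Quantized models and custom hyperparameters: a number of quantized models and model-specific hyperparameters have been released.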

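Algorithm Overview

[Figure: overview of the AutoRound method]

Our method uses signed gradient descent (SignSGD) to tune the rounding values[7] and the weight minimum and maximum, and needs only 200 steps. The figure above gives an overview of the method: V is the rounding perturbation, bounded in [-0.5, 0.5], while alpha and beta are tunable scales applied to the weight minimum and maximum, whose range we usually set to [0.5, 1]. We chose SignSGD because all of these ranges are bounded, which gives SignSGD several advantages.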
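To make the idea concrete, here is a toy sketch in plain Python, not the AutoRound implementation. It tunes only the rounding perturbation V for a single weight group and keeps alpha and beta fixed at 1. Several parts are assumptions made for the illustration: which of alpha and beta scales the maximum, the straight-through gradient estimate, the output-reconstruction loss, the step size of 1/steps, and all function names.

```python
import random

def sign(x):
    return (x > 0) - (x < 0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def qdq(weights, v, alpha=1.0, beta=1.0, bits=4):
    """Quantize then dequantize one weight group with rounding offsets v."""
    levels = 2 ** bits - 1
    w_max = max(weights) * alpha  # assumed: alpha scales the maximum
    w_min = min(weights) * beta   # assumed: beta scales the minimum
    scale = (w_max - w_min) / levels
    zero_point = round(-w_min / scale)
    dequantized = []
    for w, vi in zip(weights, v):
        q = round(w / scale + vi) + zero_point
        q = min(max(q, 0), levels)  # clamp to the integer grid
        dequantized.append((q - zero_point) * scale)
    return dequantized, scale

def output_loss(w_q, weights, inputs):
    """Mean squared error between quantized and original outputs."""
    errors = [dot(w_q, x) - dot(weights, x) for x in inputs]
    return sum(e * e for e in errors) / len(errors), errors

def tune_rounding(weights, inputs, steps=200, bits=4):
    v = [0.0] * len(weights)
    lr = 1.0 / steps  # assumed step size
    best_loss, best_v = float("inf"), list(v)
    for _ in range(steps):
        w_q, scale = qdq(weights, v, bits=bits)
        loss, errors = output_loss(w_q, weights, inputs)
        if loss < best_loss:  # remember the best offsets seen so far
            best_loss, best_v = loss, list(v)
        # Straight-through estimate: treat d(w_q[i]) / d(v[i]) as scale.
        grads = [
            scale * sum(2 * e * x[i] for e, x in zip(errors, inputs)) / len(errors)
            for i in range(len(weights))
        ]
        # SignSGD: step by lr against the gradient sign, then clip to [-0.5, 0.5].
        v = [min(max(vi - lr * sign(g), -0.5), 0.5) for vi, g in zip(v, grads)]
    return best_v, best_loss

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(32)]
inputs = [[random.gauss(0.0, 1.0) for _ in range(32)] for _ in range(16)]
rtn_loss, _ = output_loss(qdq(weights, [0.0] * 32)[0], weights, inputs)
tuned_v, tuned_loss = tune_rounding(weights, inputs)
print(f"round-to-nearest loss: {rtn_loss:.6f}, best tuned loss: {tuned_loss:.6f}")
```

Because every sign update moves each offset by exactly lr and the result is clipped, the offsets can never leave [-0.5, 0.5]. The first iteration evaluates V = 0, which is plain round-to-nearest, so the best loss returned is never worse than that baseline.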

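Comparison of Algorithms

We compared AutoRound with GPTQ, AWQ, OmniQuant, and HQQ under experimental conditions that were as fair as we could make them. All the data is available at https://github.com/intel/auto-round/blob/main/docs/acc.md; the table below shows only part of it. All results were obtained with lm-eval[9] version 0.3 on qdq (fake-quantized) models, and the metric is the average accuracy over 11 zero-shot tasks. Overall, AutoRound achieves leading accuracy on LLaMA v1, LLaMA v2, and Mistral-7B in the vast majority of W4G-1, W4G128, W3G128, and W2G128 scenarios: it beats GPTQ in 30 of 32 scenarios, AWQ in 27 of 32, HQQ in 15 of 16, and OmniQuant in 16 of 16. As for tuning time, HQQ is much faster because it is a data-free method. With 512 calibration samples each, GPTQ, AWQ, and AutoRound take about the same time, while OmniQuant is noticeably slower.

[Table: partial accuracy results for the compared algorithms; see the link above for the full data]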
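The head-to-head counts quoted in the text can be turned into win rates with a few lines of arithmetic. The snippet below uses only the numbers stated above; the variable names are arbitrary.

```python
# (wins, scenarios) for AutoRound against each method, as quoted in the text
comparisons = {
    "GPTQ": (30, 32),
    "AWQ": (27, 32),
    "HQQ": (15, 16),
    "OmniQuant": (16, 16),
}

win_rates = {
    name: 100.0 * wins / total for name, (wins, total) in comparisons.items()
}

for name, rate in win_rates.items():
    print(f"vs {name}: AutoRound ahead in {rate:.1f}% of scenarios")
```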

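Quantized Models

We have released a number of quantized models and model-specific hyperparameters. Some models have been uploaded to Hugging Face (HF), some are still under review, and for others only the hyperparameters (recipes) can be published because of permission issues. For most models we still evaluate with the average accuracy over 11 tasks. For Chinese models, we follow Qwen and use the average accuracy over 4 tasks (CEVAL, CMMLU, MMLU, GSM8K). Every model without the 'qdq' tag is evaluated as a real quantized model with lm-eval 0.4; the others are evaluated as qdq models because of issues in lm-eval.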

Intel/neural-chat-7b-v3-3-int4-inc (FP16 0.6778, INT4 0.6748)

Intel/neural-chat-7b-v3-1-int4-inc (FP16 0.6769, INT4 0.6721)

Intel/Mistral-7B-v0.1-int4-inc (BF16 0.6306, INT4 0.6308)

Intel/phi-2-int4-inc (FP16 0.6155, INT4 qdq 0.6163)

Intel/falcon-7b-int4-inc (FP16 0.5521, INT4 qdq 0.5507)

Intel/gemma-2b-int4-inc (FP16 0.5383, INT4 0.5338)

mistralai/Mistral-7B-Instruct-v0.2 recipe (BF16 0.6647, INT4 0.6621)

google/gemma-7b recipe (FP16 0.6239, INT4 0.6307)

google/gemma-7b-it recipe (FP16 0.6022, INT4 0.6017)

mistralai/Mixtral-8x7B-Instruct-v0.1 recipe (BF16 0.7000, INT4 0.6977)

mistralai/Mixtral-8x7B-v0.1 recipe (BF16 0.6698, INT4 0.6633)

meta-llama/Llama-2-7b-chat-hf recipe (FP16 0.5901, INT4 qdq 0.5897)

Qwen/Qwen1.5-7B-Chat recipe (BF16 0.6231, INT4 0.6205)
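One way to read the list above is to express each INT4 score as a fraction of its FP16 or BF16 baseline. The sketch below copies the scores from the list; the retention ratio is a derived quantity computed here, not a figure reported by the authors.

```python
# (baseline accuracy, INT4 accuracy) copied from the list above
results = {
    "Intel/neural-chat-7b-v3-3-int4-inc": (0.6778, 0.6748),
    "Intel/neural-chat-7b-v3-1-int4-inc": (0.6769, 0.6721),
    "Intel/Mistral-7B-v0.1-int4-inc": (0.6306, 0.6308),
    "Intel/phi-2-int4-inc": (0.6155, 0.6163),
    "Intel/falcon-7b-int4-inc": (0.5521, 0.5507),
    "Intel/gemma-2b-int4-inc": (0.5383, 0.5338),
    "mistralai/Mistral-7B-Instruct-v0.2": (0.6647, 0.6621),
    "google/gemma-7b": (0.6239, 0.6307),
    "google/gemma-7b-it": (0.6022, 0.6017),
    "mistralai/Mixtral-8x7B-Instruct-v0.1": (0.7000, 0.6977),
    "mistralai/Mixtral-8x7B-v0.1": (0.6698, 0.6633),
    "meta-llama/Llama-2-7b-chat-hf": (0.5901, 0.5897),
    "Qwen/Qwen1.5-7B-Chat": (0.6231, 0.6205),
}

# Fraction of the baseline accuracy kept after INT4 quantization
retention = {name: int4 / base for name, (base, int4) in results.items()}

for name, ratio in sorted(retention.items(), key=lambda item: item[1]):
    print(f"{name}: {100 * ratio:.2f}% of baseline")

worst = min(retention, key=retention.get)
print("largest relative drop:", worst)
```

Note that the scores mix real quantized models and qdq models, and the Qwen entry uses a different task set, so the ratios are comparable within a model but only loosely across models.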

Usage

Model quantization

# pip install auto-round
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Load the full-precision model and its tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# 4-bit weights, group size 128, asymmetric quantization
bits, group_size, sym = 4, 128, False
# device: "auto", "hpu", "cpu" or "cuda"
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym, device=None)
autoround.quantize()

# Save the quantized model to disk
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir)

Inference on CPU with ITREX[5]

from intel_extension_for_transformers.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
 
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path, use_fast=True)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))

Inference on CUDA devices with AutoGPTQ[6]

from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path, use_fast=True)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))

Other

AutoRound, HQQ, GPTQ, and AWQ have been implemented in Intel Neural Compressor[8] and are supported in Intel Extension for Transformers[5] for a better fit with Intel devices. For more details on AutoRound, see https://github.com/intel/auto-round.

References

[1] Frantar, Elias, et al. “GPTQ: Accurate post-training quantization for generative pre-trained transformers.” arXiv preprint arXiv:2210.17323 (2022).

[2] Lin, Ji, et al. “AWQ: Activation-aware weight quantization for LLM compression and acceleration.” arXiv preprint arXiv:2306.00978 (2023).

[3] Shao, Wenqi, et al. “OmniQuant: Omnidirectionally calibrated quantization for large language models.” arXiv preprint arXiv:2308.13137 (2023).

[4] https://github.com/mobiusml/hqq

[5] https://github.com/intel/intel-extension-for-transformers

[6] https://github.com/AutoGPTQ/AutoGPTQ

[7] Cheng, Wenhua, et al. “Optimize weight rounding via signed gradient descent for the quantization of LLMs.” arXiv preprint arXiv:2309.05516 (2023).

[8] https://github.com/intel/neural-compressor

[9] https://github.com/EleutherAI/lm-evaluation-harness
