Learning LLMs: Accelerating LLM Inference with Model Quantization

Latest recommended article published on 2024-09-10 15:42:03

AGI大模型老王

Article link: https://blog.csdn.net/2401_85390073/article/details/141993837

LLMs are generally quantized in one of the following four ways:

AWQ
AutoGPTQ
bitsandbytes
llama.cpp

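This article uses the Qwen1.5-7B-Chat model and includes the complete quantization code.

Official Hugging Face quantization documentation: huggingface.co/docs/transf…

Code for this article: github.com/night-is-yo…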

AWQ

First, install the autoawq library:

pip install autoawq

Quantization steps

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AwqConfig


def quantize():
    model_path = "/home/chuan/models/qwen/Qwen1___5-7B-Chat"
    quant_path = "/home/chuan/models/qwen/Qwen1___5-7B-Chat-awq"
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Load model
    model = AutoAWQForCausalLM.from_pretrained(model_path,
                                               trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Quantize
    model.quantize(tokenizer, quant_config=quant_config)

    print(quant_config)
    quantization_config = AwqConfig(
        bits=quant_config["w_bit"],
        group_size=quant_config["q_group_size"],
        zero_point=quant_config["zero_point"],
        version=quant_config["version"].lower(),
    ).to_dict()

    # the pretrained transformers model is stored in the model attribute + we need to pass a dict
    model.model.config.quantization_config = quantization_config
    # a second solution would be to use Autoconfig and push to hub (what we do at llm-awq)

    # save model weights
    model.save_quantized(quant_path, safetensors=True)
    tokenizer.save_pretrained(quant_path)
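
To make the quant_config fields above more concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization with an integer zero point (w_bit=4, q_group_size=128, zero_point=True). It is not AutoAWQ's implementation: AWQ additionally searches for activation-aware scaling factors before rounding, which is not shown. The function names and the toy weights are assumptions made for illustration only.

```python
import math


def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group of floats."""
    qmax = 2 ** w_bit - 1                              # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = max((w_max - w_min) / qmax, 1e-5)          # guard against constant groups
    zero = min(qmax, max(0, round(-w_min / scale)))    # integer zero point, kept in [0, qmax]
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map the stored integers back to approximate float weights."""
    return [(v - zero) * scale for v in q]


def quantize_groupwise(weights, group_size=128, w_bit=4):
    """Quantize a flat list of weights, one (q, scale, zero) triple per group."""
    return [
        quantize_group(weights[start:start + group_size], w_bit)
        for start in range(0, len(weights), group_size)
    ]


# Toy data (hypothetical): 256 values, so two groups of 128, each spanning zero.
weights = [0.1 * math.sin(i) for i in range(256)]
groups = quantize_groupwise(weights, group_size=128, w_bit=4)
restored = [w for q, scale, zero in groups for w in dequantize_group(q, scale, zero)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(groups)}, max abs error={max_err:.5f}")
```

Each group stores only 4-bit integers plus one scale and one zero point, which is why a smaller group size tends to improve accuracy at the cost of more metadata.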

AutoGPTQ

Transformers integrates the optimum API to perform GPTQ quantization on language models. You can load and quantize your model in 8, 4, 3, or even 2 bits.
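
As a rough guide to what these bit widths mean for memory, the following sketch estimates the size of the weights alone. The parameter count is an assumed round figure for a "7B" model, and the estimate ignores quantization metadata (scales, zero points), activations, and the KV cache.

```python
def weight_memory_gib(num_params, bits):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits / 8 / 1024 ** 3


NUM_PARAMS = 7_000_000_000  # assumption: round figure, not the exact Qwen1.5-7B count

for bits in (16, 8, 4, 3, 2):
    print(f"{bits:>2}-bit: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```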

First, install the libraries:

git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install -vvv -e .

pip install optimum

Quantization steps

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig


def quantize2():
    model_path = "/home/chuan/models/qwen/Qwen1___5-7B-Chat"
    quant_path = "/home/chuan/models/qwen/Qwen1___5-7B-Chat-gptq"

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    dataset = [
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]

    gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)
  
    model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=gptq_config, device_map="auto", trust_remote_code=True)

    model.save_pretrained(quant_path)
    tokenizer.save_pretrained(quant_path)

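For 4-bit models, you can use the exllama kernels to speed up inference. They are enabled by default. You can change this by passing use_exllama in GPTQConfig, which overrides the quantization config stored in the model config. Note that only kernel-related attributes can be overridden. To use the exllama kernels, the entire model must reside on GPUs. You can also run CPU inference with Auto-GPTQ version > 0.4.2 by passing device_map="cpu"; for CPU inference you must pass use_exllama=False in GPTQConfig.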
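
The rule above can be sketched as a small helper. The names gptq_load_kwargs and load_gptq_model are hypothetical, introduced only for this illustration; GPTQConfig and its bits and use_exllama arguments are the ones named in the text. The helper only handles an explicit "cpu" device map, so a device_map that silently offloads part of the model to CPU is not detected.

```python
def gptq_load_kwargs(device_map):
    """Choose kernel settings for loading a 4-bit GPTQ checkpoint.

    Encodes the rule from the text: the exllama kernels need the whole
    model on GPU, so they are switched off for CPU inference.
    """
    on_cpu = device_map == "cpu"
    return {
        "device_map": device_map,
        "gptq_config_kwargs": {"bits": 4, "use_exllama": not on_cpu},
    }


def load_gptq_model(quant_path, device_map="auto"):
    """Load an already-quantized GPTQ model with the settings chosen above."""
    # Imported lazily so gptq_load_kwargs can be used without transformers installed.
    from transformers import AutoModelForCausalLM, GPTQConfig

    kwargs = gptq_load_kwargs(device_map)
    gptq_config = GPTQConfig(**kwargs["gptq_config_kwargs"])
    return AutoModelForCausalLM.from_pretrained(
        quant_path,
        device_map=kwargs["device_map"],
        quantization_config=gptq_config,
    )


print(gptq_load_kwargs("cpu"))
```

For example, load_gptq_model(quant_path, device_map="cpu") would load the saved checkpoint for CPU inference, assuming quant_path points at a directory produced by the quantization step.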

bitsandbytes

First, install the library:

pip install bitsandbytes

from transformers import AutoModelForCausalLM

model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True)

Advanced usage

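import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "facebook/opt-350m"  # same example model as above

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")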

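bnb_4bit_use_double_quant
Enables nested quantization, which further reduces memory usage.

bnb_4bit_quant_type
Setting this to "nf4" selects the NF4 data type, a 4-bit data type adapted to weights initialized from a normal distribution.

bnb_4bit_compute_dtype
The dtype used for computation. To stay consistent with the model weights, use the same value for bnb_4bit_compute_dtype and torch_dtype.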
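
To see why nested quantization saves memory, the arithmetic below counts the extra bits per parameter spent on quantization constants. The block sizes (64 weights per block, 256 constants per second-level block) follow the figures reported in the QLoRA paper and are assumptions here; the defaults in a given bitsandbytes version may differ, and the 7e9 parameter count is an assumed round figure.

```python
def overhead_bits_per_param(block_size=64, double_quant=False, dq_block_size=256):
    """Extra bits per parameter spent on quantization constants."""
    if not double_quant:
        # One fp32 scaling constant per block of weights.
        return 32 / block_size
    # Constants stored in 8 bits, plus one fp32 constant per block of constants.
    return 8 / block_size + 32 / (block_size * dq_block_size)


plain = overhead_bits_per_param()
nested = overhead_bits_per_param(double_quant=True)
saved_gb = (plain - nested) * 7e9 / 8 / 1e9  # assumed 7e9 parameters

print(f"without double quant: {plain:.3f} bits/param")
print(f"with double quant:    {nested:.3f} bits/param")
print(f"saving for 7e9 params: {saved_gb:.2f} GB")
```

The saving is small relative to the 4-bit weights themselves, but it comes at almost no cost, which is why the option is usually enabled together with NF4.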

llama.cpp

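The goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, both locally and in the cloud.

Plain C/C++ implementation without any dependencies
Apple silicon is a first-class citizen: optimized via the ARM NEON, Accelerate and Metal frameworks
AVX, AVX2 and AVX512 support for x86 architectures
2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization for faster inference and reduced memory use
Custom CUDA kernels for running LLMs on NVIDIA GPUs (AMD GPUs are supported via HIP)
Vulkan, SYCL and (partial) OpenCL backend support
CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity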

The commands for quantizing with llama.cpp are as follows:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j LLAMA_CUBLAS=1

python convert-hf-to-gguf.py /home/chuan/models/qwen/Qwen1___5-7B-Chat
# quantize the model to 4-bits (using Q4_K_M method)
./quantize /home/chuan/models/qwen/Qwen1___5-7B-Chat/ggml-model-f16.gguf /home/chuan/models/qwen/Qwen1___5-7B-Chat/ggml-model-Q4_K_M.gguf Q4_K_M

# start inference on a gguf model
./main -m /home/chuan/models/qwen/Qwen1___5-7B-Chat/ggml-model-Q4_K_M.gguf -n 128

Chatting with llama.cpp

# custom arguments, using the quantized 7B model
./main -m /home/chuan/models/qwen/Qwen1___5-7B-Chat/ggml-model-Q4_K_M.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

Chatting with llama.cpp supports many more optional parameters, which are omitted here.

Output can also be constrained to JSON format, as follows:

./main -m /home/chuan/models/qwen/Qwen1___5-7B-Chat/ggml-model-Q4_K_M.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'

llama.cpp can also be used from Python:

pip install llama-cpp-python

from llama_cpp import Llama
llm = Llama(
      model_path="/home/chuan/models/qwen/Qwen1___5-7B-Chat/ggml-model-Q4_K_M.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
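
The printed output is a dictionary rather than plain text. Assuming it follows the OpenAI-style completion layout that llama-cpp-python documents, the generated answer can be pulled out as sketched below. The helper name completion_text and the sample dictionary are hypothetical; the sample is hand-written for illustration and is not real model output.

```python
def completion_text(output, prompt=None):
    """Return the generated text from a completion dict.

    If the prompt was echoed back (echo=True), strip it from the front.
    """
    text = output["choices"][0]["text"]
    if prompt is not None and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Hand-written sample in the assumed layout, NOT real model output.
prompt = "Q: Name the planets in the solar system? A: "
sample_output = {
    "object": "text_completion",
    "choices": [
        {"text": prompt + "Mercury, Venus, Earth, Mars", "index": 0, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 8, "total_tokens": 22},
}

print(completion_text(sample_output, prompt))
```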

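Note

Although we can quantize models ourselves with GPTQ or AWQ, the result depends on the calibration dataset used during quantization, so quality varies. Quantizing a model yourself is therefore not recommended; use the officially released quantized models directly.

Performance comparison of quantized models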

optimum-benchmark --config-dir configs/ --config-name _base_ --multirun
optimum-benchmark --config-dir configs/ --config-name bnb --multirun
optimum-benchmark --config-dir configs/ --config-name gptq --multirun
optimum-benchmark --config-dir configs/ --config-name awq --multirun

Token generation speed results are shown below. optimum-benchmark does not currently support llama.cpp. In the table below, bs denotes batch_size.