AWQ: Activation-Aware Weight Quantization


I. Contents

  1. Definition
  2. Hugging Face format quantization example: https://zhuanlan.zhihu.com/p/693348118
  3. Common AWQ quantization tools
  4. Quantization evaluation
  5. Loading AWQ-quantized models

II. Implementation

  1. Definition
    Purpose of model quantization: reduce GPU memory usage and speed up inference without a noticeable loss in output quality.
    Quantization categories: weight-only quantization, e.g. W4A16 (AWQ); quantizing both weights and activations, e.g. W8A8 (SmoothQuant).
    The "activation-aware" part of AWQ: activation statistics are used to pick per-channel scales that protect the most salient weights during quantization. A minimal sketch of what W4A16 group-wise quantization means is given below.
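
    As a hedged illustration only (not AWQ's actual kernels, and without its activation-aware scale search), the sketch below shows what 4-bit group-wise weight quantization with a zero point does to one group of weights, matching the quant_config used later (w_bit=4, q_group_size=128, zero_point=True):

import torch

def quantize_group(w, n_bits=4):
    # Map one group of fp weights to asymmetric int codes plus scale and zero point.
    qmax = 2 ** n_bits - 1                       # 15 for 4-bit
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min).clamp(min=1e-5) / qmax
    zero = (-w_min / scale).round()
    q = (w / scale + zero).round().clamp(0, qmax)
    return q.to(torch.uint8), scale, zero

def dequantize_group(q, scale, zero):
    # At inference the int4 codes are expanded back to floating point and
    # multiplied by fp16 activations -- hence "W4A16".
    return (q.float() - zero) * scale

w = torch.randn(128)                             # one group of q_group_size=128 weights
q, s, z = quantize_group(w)
print((w - dequantize_group(q, s, z)).abs().max())  # per-group rounding error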
  2. Hugging Face format quantization example: https://zhuanlan.zhihu.com/p/693348118
    pip3 install autoawq -i https://pypi.tuna.tsinghua.edu.cn/simple
    Reference: https://github.com/casper-hansen/AutoAWQ/tree/main
    # Quantization needs online access to the HF hub for the default calibration set; otherwise the calibration data must be loaded offline (see the sketch after this example).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/home/Qwen1.5_7b'
quant_path = '/home/Qwen1.5_7b_awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# GEMM and GEMV are the two quantized-kernel versions; GEMM performs better for long contexts.
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
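
    The comment above notes that model.quantize() pulls a default calibration set from the HF hub. A minimal sketch of supplying local calibration texts instead, assuming AutoAWQ's calib_data parameter accepts a list of raw text samples (the file path is a placeholder):

# Sketch: offline calibration data (assumed calib_data usage; path is a placeholder)
with open("/home/calib_texts.txt") as f:
    calib_samples = [line.strip() for line in f if line.strip()]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)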
    vLLM quantization: https://docs.vllm.ai/en/stable/quantization/auto_awq.html
    Quantization in vLLM is also done with AWQ; a hedged loading sketch follows.
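
    A minimal sketch of offline inference on the AWQ checkpoint produced above, using vLLM's LLM entry point (the model path reuses the quantized checkpoint from step 2):

from vllm import LLM, SamplingParams

# Load the AWQ checkpoint quantized above for offline inference in vLLM.
llm = LLM(model="/home/Qwen1.5_7b_awq", quantization="awq", max_model_len=2048)
outputs = llm.generate(["Give me a short introduction to AWQ."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)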

  3. Common AWQ quantization tools
    (figure: comparison of common AWQ quantization tools; the original image is not recoverable)

  4. Quantization evaluation
    Metrics: throughput and latency.
    Throughput test with vLLM's benchmark script:

pip install vllm
# Throughput test, quantized model
python benchmark_throughput.py --backend vllm --input-len 128 --output-len 512 --model /home/Qwen1.5_7b_awq -q awq --num-prompts 100 --seed 1100 --trust-remote-code --max-model-len 2048 --tensor-parallel-size 1
# Throughput test, non-quantized baseline
python benchmark_throughput.py --backend vllm --input-len 128 --output-len 512 --model /home/Qwen1.5_7b --num-prompts 100 --seed 1100 --trust-remote-code --max-model-len 2048 --tensor-parallel-size 1

(figure: throughput benchmark results)
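
    As a rough sanity check on the reported numbers, tokens/s follows from the flags used above (elapsed_s is a placeholder for the total run time printed by the script):

# Sketch: relating reported throughput to the benchmark flags above.
num_prompts, input_len, output_len = 100, 128, 512
elapsed_s = 60.0  # placeholder: total run time printed by benchmark_throughput.py
total_tokens = num_prompts * (input_len + output_len)
print(total_tokens / elapsed_s, "tokens/s")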
Latency test (the extra flags are an assumption based on the script's usual options; a bare python benchmark_latency.py also runs with defaults):
python benchmark_latency.py --model /home/Qwen1.5_7b_awq --quantization awq --input-len 128 --output-len 512 --batch-size 1

  5. Loading AWQ-quantized models: https://qwen.readthedocs.io/zh-cn/latest/quantization/awq.html
    Loading with Transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct-AWQ", # the quantized model
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct-AWQ")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Loading with vLLM:

python3 vllm/entrypoints/api_server.py --model /mnt/disk0/models/llama-2-7b-hf-awq/ --quantization awq
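
    A hedged sketch of querying the server started above (assumes the default port 8000 and the /generate route of the demo api_server):

import requests

# POST a prompt to the demo api_server; it responds with JSON containing a "text" field.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0},
)
print(resp.json()["text"])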

Loading with AutoAWQ:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load your tokenizer and the base (un-quantized) model with AutoAWQ
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto", safetensors=True)
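
The snippet above loads the base model (the first step when quantizing). To load an already-saved AWQ checkpoint with AutoAWQ, a minimal sketch using its from_quantized loader:

# Sketch: load a saved AWQ checkpoint for inference.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)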