Quantizing and Accelerating Qwen2-7B Inference with LMDeploy

Before quantization: 53 words/s

(Device: 50% of an A100)

Benchmark with LMDeploy (benchmark_qwen.py; a sketch of the script follows the output below):

```
python benchmark_qwen.py
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4/4 [02:37<00:00, 39.39s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 53.064 words/s
```
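
The benchmark script itself is not listed in the post. Below is a minimal sketch, assuming benchmark_qwen.py is the same LMDeploy pipeline benchmark shown later under "LMDeploy benchmark code", only pointed at the unquantized /root/models/Qwen2-7B-Instruct checkpoint.

```python
import datetime
from lmdeploy import pipeline

# Assumption: the unquantized FP16 checkpoint; the 4-bit model is benchmarked later.
pipe = pipeline('/root/models/Qwen2-7B-Instruct')

# Warm up, then time 10 generations and report characters ("words") per second.
for i in range(5):
    print("Warm up...[{}/5]".format(i + 1))
    pipe(["hello"])

inp = "请介绍一下你自己。"
start_time = datetime.datetime.now()
total_words = sum(len(pipe([inp])[0].text) for _ in range(10))
seconds = (datetime.datetime.now() - start_time).total_seconds()
print("Speed: {:.3f} words/s".format(total_words / seconds))
```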


LMDeploy quantization

```
lmdeploy lite auto_awq /root/models/Qwen2-7B-Instruct --work-dir /root/models/Qwen2-7B-Instruct-4bit
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.45s/it]
Traceback (most recent call last):
  File "/root/.conda/envs/lmdeploy/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 26, in run
    args.run(args)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
    auto_awq(**kwargs)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 54, in auto_awq
    model, tokenizer, work_dir = calibrate(model, calib_dataset, calib_samples,
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/calibrate.py", line 155, in calibrate
    raise RuntimeError(
RuntimeError: Currently, quantification and calibration of Qwen2ForCausalLM are not supported. The supported model types are InternLMForCausalLM, InternLM2ForCausalLM, QWenLMHeadModel, BaiChuanForCausalLM, BaichuanForCausalLM, LlamaForCausalLM.
```

The cause is that the installed lmdeploy version is too old: Qwen2 quantization requires 0.4 or later, so upgrade it:

```bash
pip install lmdeploy[all]==0.4.2
```
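
After upgrading, the auto_awq command goes through. The post does not show the re-run; the sketch below simply spells out the calibration and weight options that the CLI otherwise fills in with defaults (following the LMDeploy W4A16 documentation), so treat the exact values as illustrative.

```bash
lmdeploy lite auto_awq /root/models/Qwen2-7B-Instruct \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/models/Qwen2-7B-Instruct-4bit
```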

After quantization: 312 words/s

With 4-bit weights the throughput is roughly 6x higher (312 vs. 53 words/s).

```
python benchmark_qwen_4bit.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING] gemm_config.in is not found; using default GEMM algo
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 312.444 words/s
```

GPU memory usage is almost unchanged: the memory freed by the smaller weights is allocated to a larger KV cache instead.
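
The KV-cache share is configurable. A minimal sketch, assuming the TurboMind backend of lmdeploy 0.4.x: cache_max_entry_count is the fraction of free GPU memory reserved for the KV cache (0.8 by default), and model_format='awq' tells the engine the weights are AWQ-quantized.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Reserve 50% of the remaining GPU memory for the KV cache instead of the default 80%.
backend_config = TurbomindEngineConfig(model_format='awq', cache_max_entry_count=0.5)
pipe = pipeline('/root/models/Qwen2-7B-Instruct-4bit', backend_config=backend_config)

print(pipe(["hello"])[0].text)
```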

Comparing Transformers and LMDeploy

LMDeploy benchmark code

```python
import datetime
from lmdeploy import pipeline

pipe = pipeline('/root/models/Qwen2-7B-Instruct-4bit')

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = pipe([inp])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe([inp])
    total_words += len(response[0].text)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))
```

Running the Transformers benchmark script first errors out:

```
python benchmark_transformer_qwen2.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards:   0%|                                                                                               | 0/3 [00:00<?, ?it/s]/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  8.68it/s]
Warm up...[1/5]
Traceback (most recent call last):
  File "/root/benchmark_transformer_qwen2.py", line 15, in <module>
    response, history = model.chat(tokenizer, inp, history=[])
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'
```

The cause is that Qwen2ForCausalLM does not ship a chat method, so one has to be added. The full script is as follows:

```python
import torch
import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/models/Qwen2-7B-Instruct", trust_remote_code=True)

# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/models/Qwen2-7B-Instruct", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

def chat(tokenizer, ques, history=[], **kw):
    # Build the chat-formatted prompt and generate up to 512 new tokens.
    iids = tokenizer.apply_chat_template(
        history + [{'role': 'user', 'content': ques}],
        add_generation_prompt=1,
    )
    oids = model.generate(
        inputs=torch.tensor([iids]).to(model.device),
        max_new_tokens=512,
    )
    # Strip the prompt tokens and a trailing EOS before decoding.
    oids = oids[0][len(iids):].tolist()
    if oids[-1] == tokenizer.eos_token_id:
        oids = oids[:-1]
    ans = tokenizer.decode(oids)

    return ans

# Attach the plain function as an instance attribute: no implicit `self` is bound,
# so it is called as model.chat(tokenizer, inp).
model.chat = chat

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = model.chat(tokenizer, inp)

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = model.chat(tokenizer, inp)
    total_words += len(response)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))
```

Before quantization (Transformers):

```
python benchmark_transformer_qwen2.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.01it/s]
cuda:0
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 75.586 words/s
```

After quantization, the Transformers run is actually much slower:

```
python benchmark_transformer_qwen2.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  8.40it/s]
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 4.495 words/s
```

Note the warning "You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model." The loading code needs to change from:

```python
model = AutoModelForCausalLM.from_pretrained("/root/models/Qwen2-7B-Instruct-4bit", torch_dtype=torch.float16, trust_remote_code=True).cuda()
```

to:

```python
# device_map="cuda" places the quantized weights on the GPU; the trailing .cuda() is then redundant but harmless.
model = AutoModelForCausalLM.from_pretrained("/root/models/Qwen2-7B-Instruct-4bit", torch_dtype=torch.float16, trust_remote_code=True, device_map="cuda").cuda()
```

With that change, the speed recovers considerably:

```
python benchmark_transformer_qwen2.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  8.40it/s]
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 60.280 words/s
```

Note that the quantized checkpoint is stored as .bin files instead of safetensors.
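
Even on the GPU, the AWQ checkpoint is still slower under plain Transformers than the FP16 model (about 60 vs. 76 words/s), which is typical when 4-bit weights are dequantized layer by layer without fused kernels. One commonly suggested mitigation (not part of the original post) is to enable AWQ fused modules through AwqConfig, provided the installed autoawq release actually supports fusing for this architecture; the sketch below is an assumption, not a verified result on this setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

# Assumption: the installed autoawq version supports fused modules for Qwen2;
# if it does not, loading will fail or fusing will simply not apply.
quant_config = AwqConfig(bits=4, do_fuse=True, fuse_max_seq_len=512)

tokenizer = AutoTokenizer.from_pretrained("/root/models/Qwen2-7B-Instruct-4bit")
model = AutoModelForCausalLM.from_pretrained(
    "/root/models/Qwen2-7B-Instruct-4bit",
    torch_dtype=torch.float16,
    quantization_config=quant_config,
    device_map="cuda",
)
```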

Downloading Qwen2-7B-Instruct-GPTQ-Int4

```bash
huggingface-cli download --resume-download Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --local-dir /root/models/Qwen2-7B-Instruct-GPTQ-Int4/
```

This is slightly faster than the AWQ checkpoint under Transformers, but still slower than the unquantized Qwen2-7B-Instruct:

```
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 11.385 words/s
```

For reference, the official Qwen2 efficiency (speed) benchmarks:
https://qwen.readthedocs.io/zh-cn/latest/benchmark/speed_benchmark.html

### Using the Qwen2-7B-Instruct-GPTQ-INT4 model

#### Features and application scenarios

Qwen2-7B-Instruct-GPTQ-Int4 is a quantized model released by Alibaba Cloud, with 7 billion parameters and instruction tuning, so it can understand and carry out a wide range of tasks effectively. It is quantized to INT4 with GPTQ, which keeps accuracy largely intact while significantly reducing compute and memory requirements[^1].

This makes the model a good fit for scenarios that want large-language-model capability under constrained hardware, such as online Q&A systems on small servers or text-assist tools on mobile devices.

#### Installing dependencies

To deploy and use Qwen2-7B-Instruct-GPTQ-Int4 locally, install the required Python packages with the following commands:

```bash
pip install opencv-python
pip install uvicorn
pip install fastapi
pip install git+https://github.com/huggingface/transformers.git
pip install qwen-vl-utils
pip install torchvision
pip install python-multipart
pip install 'accelerate>=0.26.0'
pip install optimum
pip install auto-gptq
```

Note: installing `optimum` may override the previously installed `transformers` version, so it is advisable to update `transformers` once more at the end, or make sure it is on the latest stable release[^2].

If you run into the "CUDA extension not installed." problem, pinning specific versions can resolve it:

```bash
pip install torch==2.2.1
pip install torchvision==0.17.1
pip install auto-gptq==0.7.1
```

These steps avoid the compatibility-related error message "CUDA extension not installed."[^3].

#### Inference example

The following short Python script shows how to load the prepared Qwen2-7B-Instruct-GPTQ-INT4 checkpoint and run a basic text-generation task:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("path_to_model")  # replace with the actual path
model = AutoModelForCausalLM.from_pretrained("path_to_model", device_map="auto")

input_text = "你好"
inputs = tokenizer(input_text, return_tensors='pt').to('cuda')

with torch.no_grad():
    outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The code walks through the whole flow, from loading the model and feeding it a prompt to obtaining the output. Note that "path_to_model" must be replaced with the directory where your Qwen2-7B-Instruct-GPTQ-Int4 download or clone lives.

For more detailed guidance and advanced features, consult the official documentation on Hugging Face or the README.md on the GitHub project page, which contain more thorough instructions and technical details.