Quantizing and Accelerating Qwen2-7B Inference with LMDeploy

Before quantization: 53 words/s

(Device: 50% of an A100)

Benchmark with LMDeploy (benchmark_qwen.py; a sketch of the script follows the output below):

```
python benchmark_qwen.py
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4/4 [02:37<00:00, 39.39s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 53.064 words/s
```
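
The benchmark script itself is not listed in the post. Below is a minimal sketch, assuming benchmark_qwen.py is the same LMDeploy pipeline benchmark shown later under "LMDeploy benchmark code", only pointed at the unquantized /root/models/Qwen2-7B-Instruct checkpoint.

```python
import datetime
from lmdeploy import pipeline

# Assumption: the unquantized FP16 checkpoint; the 4-bit model is benchmarked later.
pipe = pipeline('/root/models/Qwen2-7B-Instruct')

# Warm up, then time 10 generations and report characters ("words") per second.
for i in range(5):
    print("Warm up...[{}/5]".format(i + 1))
    pipe(["hello"])

inp = "请介绍一下你自己。"
start_time = datetime.datetime.now()
total_words = sum(len(pipe([inp])[0].text) for _ in range(10))
seconds = (datetime.datetime.now() - start_time).total_seconds()
print("Speed: {:.3f} words/s".format(total_words / seconds))
```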


LMDeploy quantization

```
lmdeploy lite auto_awq /root/models/Qwen2-7B-Instruct --work-dir /root/models/Qwen2-7B-Instruct-4bit
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.45s/it]
Traceback (most recent call last):
  File "/root/.conda/envs/lmdeploy/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 26, in run
    args.run(args)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
    auto_awq(**kwargs)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 54, in auto_awq
    model, tokenizer, work_dir = calibrate(model, calib_dataset, calib_samples,
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/calibrate.py", line 155, in calibrate
    raise RuntimeError(
RuntimeError: Currently, quantification and calibration of Qwen2ForCausalLM are not supported. The supported model types are InternLMForCausalLM, InternLM2ForCausalLM, QWenLMHeadModel, BaiChuanForCausalLM, BaichuanForCausalLM, LlamaForCausalLM.
```

The cause is that the installed lmdeploy version is too old: Qwen2 quantization requires 0.4 or later, so upgrade it:

```bash
pip install lmdeploy[all]==0.4.2
```
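
After upgrading, the auto_awq command goes through. The post does not show the re-run; the sketch below simply spells out the calibration and weight options that the CLI otherwise fills in with defaults (following the LMDeploy W4A16 documentation), so treat the exact values as illustrative.

```bash
lmdeploy lite auto_awq /root/models/Qwen2-7B-Instruct \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/models/Qwen2-7B-Instruct-4bit
```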

After quantization: 312 words/s

With 4-bit weights the throughput is roughly 6x higher (312 vs. 53 words/s).

```
python benchmark_qwen_4bit.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING] gemm_config.in is not found; using default GEMM algo
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 312.444 words/s
```

GPU memory usage is almost unchanged: the memory freed by the smaller weights is allocated to a larger KV cache instead.
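
The KV-cache share is configurable. A minimal sketch, assuming the TurboMind backend of lmdeploy 0.4.x: cache_max_entry_count is the fraction of free GPU memory reserved for the KV cache (0.8 by default), and model_format='awq' tells the engine the weights are AWQ-quantized.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Reserve 50% of the remaining GPU memory for the KV cache instead of the default 80%.
backend_config = TurbomindEngineConfig(model_format='awq', cache_max_entry_count=0.5)
pipe = pipeline('/root/models/Qwen2-7B-Instruct-4bit', backend_config=backend_config)

print(pipe(["hello"])[0].text)
```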

Comparing Transformers and LMDeploy

LMDeploy benchmark code

```python
import datetime
from lmdeploy import pipeline

pipe = pipeline('/root/models/Qwen2-7B-Instruct-4bit')

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = pipe([inp])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe([inp])
    total_words += len(response[0].text)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))
```

Running the Transformers benchmark script first errors out:

```
python benchmark_transformer_qwen2.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards:   0%|                                                                                               | 0/3 [00:00<?, ?it/s]/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  8.68it/s]
Warm up...[1/5]
Traceback (most recent call last):
  File "/root/benchmark_transformer_qwen2.py", line 15, in <module>
    response, history = model.chat(tokenizer, inp, history=[])
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'
```

The cause is that Qwen2ForCausalLM does not ship a chat method, so one has to be added. The full script is as follows:

```python
import torch
import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/models/Qwen2-7B-Instruct", trust_remote_code=True)

# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/models/Qwen2-7B-Instruct", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

def chat(tokenizer, ques, history=[], **kw):
    # Build the chat-formatted prompt and generate up to 512 new tokens.
    iids = tokenizer.apply_chat_template(
        history + [{'role': 'user', 'content': ques}],
        add_generation_prompt=1,
    )
    oids = model.generate(
        inputs=torch.tensor([iids]).to(model.device),
        max_new_tokens=512,
    )
    # Strip the prompt tokens and a trailing EOS before decoding.
    oids = oids[0][len(iids):].tolist()
    if oids[-1] == tokenizer.eos_token_id:
        oids = oids[:-1]
    ans = tokenizer.decode(oids)

    return ans

# Attach the plain function as an instance attribute: no implicit `self` is bound,
# so it is called as model.chat(tokenizer, inp).
model.chat = chat

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = model.chat(tokenizer, inp)

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = model.chat(tokenizer, inp)
    total_words += len(response)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))
```

Before quantization (Transformers):

```
python benchmark_transformer_qwen2.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.01it/s]
cuda:0
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 75.586 words/s
```

After quantization, the Transformers run is actually much slower:

```
python benchmark_transformer_qwen2.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  8.40it/s]
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 4.495 words/s
```

Note the warning "You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model." The loading code needs to change from:

```python
model = AutoModelForCausalLM.from_pretrained("/root/models/Qwen2-7B-Instruct-4bit", torch_dtype=torch.float16, trust_remote_code=True).cuda()
```

to:

```python
# device_map="cuda" places the quantized weights on the GPU; the trailing .cuda() is then redundant but harmless.
model = AutoModelForCausalLM.from_pretrained("/root/models/Qwen2-7B-Instruct-4bit", torch_dtype=torch.float16, trust_remote_code=True, device_map="cuda").cuda()
```

With that change, the speed recovers considerably:

```
python benchmark_transformer_qwen2.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  8.40it/s]
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 60.280 words/s
```

Note that the quantized checkpoint is stored as .bin files instead of safetensors.
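
Even on the GPU, the AWQ checkpoint is still slower under plain Transformers than the FP16 model (about 60 vs. 76 words/s), which is typical when 4-bit weights are dequantized layer by layer without fused kernels. One commonly suggested mitigation (not part of the original post) is to enable AWQ fused modules through AwqConfig, provided the installed autoawq release actually supports fusing for this architecture; the sketch below is an assumption, not a verified result on this setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

# Assumption: the installed autoawq version supports fused modules for Qwen2;
# if it does not, loading will fail or fusing will simply not apply.
quant_config = AwqConfig(bits=4, do_fuse=True, fuse_max_seq_len=512)

tokenizer = AutoTokenizer.from_pretrained("/root/models/Qwen2-7B-Instruct-4bit")
model = AutoModelForCausalLM.from_pretrained(
    "/root/models/Qwen2-7B-Instruct-4bit",
    torch_dtype=torch.float16,
    quantization_config=quant_config,
    device_map="cuda",
)
```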

Downloading Qwen2-7B-Instruct-GPTQ-Int4

```bash
huggingface-cli download --resume-download Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --local-dir /root/models/Qwen2-7B-Instruct-GPTQ-Int4/
```

This is slightly faster than the AWQ checkpoint under Transformers, but still slower than the unquantized Qwen2-7B-Instruct:

```
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 11.385 words/s
```

For reference, the official Qwen2 efficiency (speed) benchmarks:
https://qwen.readthedocs.io/zh-cn/latest/benchmark/speed_benchmark.html

### Using the Qwen2-7B-Instruct-GPTQ-INT4 model

#### Features and application scenarios

Qwen2-7B-Instruct-GPTQ-Int4 is a quantized model released by Alibaba Cloud, with 7 billion parameters and instruction tuning, so it can understand and carry out a wide range of tasks effectively. It is quantized to INT4 with GPTQ, which keeps accuracy largely intact while significantly reducing compute and memory requirements[^1].

This makes the model a good fit for scenarios that want large-language-model capability under constrained hardware, such as online Q&A systems on small servers or text-assist tools on mobile devices.

#### Installing dependencies

To deploy and use Qwen2-7B-Instruct-GPTQ-Int4 locally, install the required Python packages with the following commands:

```bash
pip install opencv-python
pip install uvicorn
pip install fastapi
pip install git+https://github.com/huggingface/transformers.git
pip install qwen-vl-utils
pip install torchvision
pip install python-multipart
pip install 'accelerate>=0.26.0'
pip install optimum
pip install auto-gptq
```

Note: installing `optimum` may override the previously installed `transformers` version, so it is advisable to update `transformers` once more at the end, or make sure it is on the latest stable release[^2].

If you run into the "CUDA extension not installed." problem, pinning specific versions can resolve it:

```bash
pip install torch==2.2.1
pip install torchvision==0.17.1
pip install auto-gptq==0.7.1
```

These steps avoid the compatibility-related error message "CUDA extension not installed."[^3].

#### Inference example

The following short Python script shows how to load the prepared Qwen2-7B-Instruct-GPTQ-INT4 checkpoint and run a basic text-generation task:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("path_to_model")  # replace with the actual path
model = AutoModelForCausalLM.from_pretrained("path_to_model", device_map="auto")

input_text = "你好"
inputs = tokenizer(input_text, return_tensors='pt').to('cuda')

with torch.no_grad():
    outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The code walks through the whole flow, from loading the model and feeding it a prompt to obtaining the output. Note that "path_to_model" must be replaced with the directory where your Qwen2-7B-Instruct-GPTQ-Int4 download or clone lives.

For more detailed guidance and advanced features, consult the official documentation on Hugging Face or the README.md on the GitHub project page, which contain more thorough instructions and technical details.