Deploying the Qwen2-72B LLM on Multiple RTX 3090 GPUs and Speeding It Up to 38 tokens/s: Using the vLLM Library and Troubleshooting Errors

EthanLifeGreat

Last modified 2024-07-09 18:40:41

First published 2024-07-08 19:29:38

Article link: https://blog.csdn.net/weixin_44652758/article/details/140274328

Project: EthanLifeGreat/Qwen2-local-api: Qwen2 vllm api & gradio front-end (github.com)

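I previously worked on accelerating Qwen1, and the Auto-GPTQ installation issues covered there still apply to Qwen2. Qwen2 does load the model much faster than Qwen1, though I don't know why.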

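Below is Qwen2 running on the Hugging Face transformers library, generating about 15 tokens per second. That is not fast enough, so in this post we use vLLM to more than double the speed, reaching about 38 tokens/s.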

transformers library with Auto-GPTQ: 289 characters generated in 19 seconds, about 15 tokens/s

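To speed things up with vLLM, I followed the CSDN blog post 【以Qwen2为例】vLLM流式推理部署，openai接口调用，requests调用 (vLLM streaming inference deployment, OpenAI API calls, and requests calls, with Qwen2 as the example).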

However, after following its steps (with the same versions):

pip install vllm
pip install nvidia-nccl-cu12==2.20.5

and starting the vLLM backend:

python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-7B-Instruct --model Qwen/Qwen2-7B-Instruct

I got the following error:

(VllmWorkerProcess pid=1567610) INFO 07-08 09:29:33 utils.py:613] Found nccl from environment variable VLLM_NCCL_SO_PATH=/mnt/shareEEx/chenyixiang/nccl/usr/lib/x86_64-linux-gnu/
(VllmWorkerProcess pid=1567609) ERROR 07-08 09:29:33 pynccl_wrapper.py:196] Failed to load NCCL library from /mnt/shareEEx/chenyixiang/nccl/usr/lib/x86_64-linux-gnu/ .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-5.4.0-182-generic-x86_64-with-glibc2.31.If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.

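This means the NCCL collective communication library is not set up correctly, so the vLLM backend cannot find it. The message asks me to set VLLM_NCCL_SO_PATH correctly.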

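So I looked through NVIDIA's website for installation instructions. Most of them require installing a deb package with sudo, but I have no sudo privileges on this machine, so I downloaded the OS-agnostic archive directly instead: https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.22.3/agnostic/x64/nccl_2.22.3-1+cuda12.2_x86_64.txz/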

I extracted it to ~/Qwen/nccl_2.22.3-1+cuda12.2_x86_64/ and then set VLLM_NCCL_SO_PATH:

# Wrong
export VLLM_NCCL_SO_PATH=~/Qwen/nccl_2.22.3-1+cuda12.2_x86_64/lib/

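The vLLM backend still could not find NCCL.

After a lot of trial and error, I found that the variable must point directly at the .so file. (Library path variables usually only need to point at the lib/ directory, but this one has to point at the file itself.)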

# Correct
export VLLM_NCCL_SO_PATH=~/Qwen/nccl_2.22.3-1+cuda12.2_x86_64/lib/libnccl.so
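
The difference between the two settings can be checked before launching vLLM. The sketch below is a hypothetical helper (`check_nccl_so_path` is not part of vLLM) that only verifies that the value names a regular file rather than a directory. It does not confirm that the file is a compatible NCCL build.

```python
import os


def check_nccl_so_path(path):
    """Return (ok, message) for a candidate VLLM_NCCL_SO_PATH value.

    Hypothetical helper, not part of vLLM: it only checks that the path
    names a regular file, which is what resolved the error in this post.
    """
    expanded = os.path.expanduser(path)
    if os.path.isdir(expanded):
        return False, f"{expanded} is a directory; point at libnccl.so inside it"
    if not os.path.isfile(expanded):
        return False, f"{expanded} does not exist"
    return True, f"{expanded} is a file"


# Check whatever is currently exported in the environment.
value = os.environ.get("VLLM_NCCL_SO_PATH", "")
ok, message = check_nccl_so_path(value) if value else (False, "VLLM_NCCL_SO_PATH is not set")
print("OK" if ok else "Problem", "-", message)
```

With the "wrong" export above this reports a directory; with the "correct" one it reports a file.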

Run it again:

python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-72B-Instruct-GPTQ-Int4 --model $MODEL_PATH \
--tensor-parallel-size 4 --host 0.0.0.0 --port 8008

Success!
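
Loading the model can take some time, so a client script may want to wait until the server answers before sending requests. A minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/models` endpoint on the port used above; `wait_for_server` is a hypothetical helper written for illustration, not a vLLM API.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url, timeout_s=600.0, interval_s=5.0):
    """Poll base_url + '/models' until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    return False
```

For the launch command above, the call would be `wait_for_server("http://127.0.0.1:8008/v1")`, where 8008 matches the `--port` flag.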

As shown, generation on a short context reaches an impressive 38.7 tokens/s, roughly on par with the officially reported A100 speed.

Compared with the transformers version at the start of this post, this is more than twice as fast.
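
The figures quoted in this post can be checked with simple arithmetic. The sketch below treats the 289 generated characters as roughly 289 tokens, the same approximation made at the start of the post; the helper name is made up for illustration.

```python
def tokens_per_second(num_tokens, seconds):
    """Average generation throughput over one request."""
    return num_tokens / seconds


# transformers + Auto-GPTQ run: 289 characters in 19 s
# (characters are treated as tokens, which is only an approximation)
baseline = tokens_per_second(289, 19)
# vLLM figure reported in this post
vllm_rate = 38.7

print(f"baseline: {baseline:.1f} tokens/s")    # baseline: 15.2 tokens/s
print(f"speedup:  {vllm_rate / baseline:.2f}x")  # speedup:  2.54x
```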

Python code for calling the server:

from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://0.0.0.0:8008/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen2-72B-Instruct-GPTQ-Int4",
    messages=[
        {"role": "system", "content": "你是一个有用的人工智能助手"},
        {"role": "user", "content": "为什么生鱼片其实是死鱼片？对此生成不少于1000字的解释。"},
    ]
)
print("Chat response:", chat_response)
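
Printing the whole response object is noisy. With the `openai` v1 client, the reply text is at `choices[0].message.content` and the token counts are under `usage`; the sketch below reads those fields. `summarize_chat_response` is a hypothetical helper, and it assumes the server fills in `usage`, which is worth verifying against your vLLM version.

```python
def summarize_chat_response(chat_response):
    """Pull the reply text and token counts out of a chat completion object."""
    usage = chat_response.usage  # may be None if the server omits it
    return {
        "text": chat_response.choices[0].message.content,
        "prompt_tokens": usage.prompt_tokens if usage else None,
        "completion_tokens": usage.completion_tokens if usage else None,
    }
```

After the request above, `print(summarize_chat_response(chat_response)["text"])` prints only the generated answer.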

Python code for calling the server with streaming output:

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://0.0.0.0:8008/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)


# send a streaming ChatCompletion request
response = client.chat.completions.create(
    model="Qwen2-72B-Instruct-GPTQ-Int4",
    messages=[
        {"role": "system", "content": "你是一个有用的人工智能助手"},
        {"role": "user", "content": "为什么生鱼片其实是死鱼片？对此生成不少于1000字的解释。"},
    ],
    temperature=0,
    stream=True
)

# create variables to collect the stream of chunks
collected_messages = []
print("LLM: ", end="", flush=True)

# iterate through the stream of events
for chunk in response:
    chunk_message = chunk.choices[0].delta.content  # extract the message
    if not chunk_message:  # if the message is empty, skip it
        continue
    collected_messages.append(chunk_message)  # save the message
    print(chunk_message, end="", flush=True)  # print the response stream
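
To measure generation speed on the client side, the streaming loop can be timed. A rough sketch: `measure_stream` is a hypothetical helper that consumes any iterable of text pieces (for example, the non-empty `chunk_message` values from the loop above) and reports pieces per second. Treating one streamed chunk as one token is an assumption that holds only approximately, and the elapsed time includes the wait for the first chunk.

```python
import time


def measure_stream(pieces, clock=time.perf_counter):
    """Consume an iterable of text pieces; return (full_text, count, pieces_per_second)."""
    start = clock()
    collected = []
    for piece in pieces:
        collected.append(piece)
    elapsed = clock() - start
    rate = len(collected) / elapsed if elapsed > 0 else float("inf")
    return "".join(collected), len(collected), rate
```

It could be fed with a generator such as `(c.choices[0].delta.content for c in response if c.choices[0].delta.content)`, in place of the printing loop.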