vLLM Environment Installation and Usage Examples [Latest Version (0.6.4.post1)]

莽夫搞战术

Last modified: 2024-12-15 00:42:13

Category: Environment Setup. Tags: natural language processing, language models, pytorch

First published: 2024-12-01 17:21:39

Article link: https://blog.csdn.net/yd778473278/article/details/144077743

vLLM Environment Installation and Usage Examples [latest version (0.6.4.post1), which supports beam search]

vLLM
1. vLLM Environment Installation
2. vLLM Usage Examples
- 2.1 Sampling example
- 2.2 Beam search example

vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

1. vLLM Environment Installation

1.1 Working installation

1.1.1 CUDA 11.8

Base environment

Component	Version	Notes
CUDA	11.8	vLLM officially requires CUDA 12.4 or CUDA 11.8
Python	3.10.4	Can be installed via Anaconda or from source
vllm	0.6.4.post1	Beam search requires the latest version

Install Python (optional, via Anaconda)

conda create -n vllm_beam python=3.10 -y
conda activate vllm_beam

vllm_beam is the name of the conda environment.

Install vllm

pip3 install https://github.com/vllm-project/vllm/releases/download/v0.6.4.post1/vllm-0.6.4.post1+cu118-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118 -i https://pypi.tuna.tsinghua.edu.cn/simple
# If the download is too slow, fetch the wheel over a VPN first, then install it locally:
# pip3 install ./vllm-0.6.4.post1+cu118-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118 -i https://pypi.tuna.tsinghua.edu.cn/simple
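
Once the wheel is installed, a quick check from Python can confirm that the expected version landed in the active conda environment. The sketch below uses only the standard library. The helper names `installed_version` and `check_install` are made up for illustration and are not part of vllm.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_install(expected):
    """Compare installed versions against `expected` ({package: version})."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        # A local build tag such as "+cu118" is ignored for the comparison.
        ok = found is not None and found.split("+")[0] == wanted
        report[name] = (found, ok)
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_install({"vllm": "0.6.4.post1"}).items():
        print(f"{name}: installed={found} ok={ok}")
```

Run it inside the vllm_beam environment. A result of `installed=None` usually means pip installed into a different interpreter.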

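1.1.2 CUDA 12.1

Requirements

Component	Version	Notes
gcc	10.5.0	xformers requires C++17, so gcc > 9
CUDA	12.1	vLLM officially requires CUDA 12.1
Python	3.12.8	vLLM officially requires Python 3.12
torch	2.5.1+cu121	The latest build supporting CUDA 12.1 is downloaded automatically
git	1.8.3.1	Needed for the metadata step of the install; this is the version yum installs on CentOS 7
vllm	0.6.4.post1	Beam search requires the latest version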

Installation

For details on installing gcc, CUDA, Python, and torch, see my other article, "fairseq-0.12.2 multi-node training environment setup".

Install Python (optional, via Anaconda)

conda create -n vllm python=3.12 -y
conda activate vllm

vllm is the name of the conda environment.

Install other dependencies

# Install git (requires root)
yum -y install git

# Install PyTorch
pip3 install torch==2.5.1 --extra-index-url https://download.pytorch.org/whl/cu121 -i https://pypi.tuna.tsinghua.edu.cn/simple

Install vllm

pip3 install vllm==0.6.4.post1 --extra-index-url https://download.pytorch.org/whl/cu121 -i https://pypi.tuna.tsinghua.edu.cn/simple

Possible errors

Installing vllm without torch installed first
Installing vllm without git installed
gcc version lower than 9
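
The three failure causes above can be checked before running pip. The following is a minimal pre-flight sketch, assuming gcc 9 is the lowest acceptable version (taken from the "gcc > 9" note in the requirements table). The function names are hypothetical.

```python
import importlib.util
import re
import shutil
import subprocess

MIN_GCC = (9, 0, 0)  # assumed threshold, from the "gcc > 9" note in the table


def parse_version(text):
    """Extract the first X.Y.Z triple from `text` as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(g) for g in match.groups()) if match else None


def gcc_version():
    """Return the version of the gcc found on PATH, or None if gcc is missing."""
    if shutil.which("gcc") is None:
        return None
    out = subprocess.run(["gcc", "--version"], capture_output=True, text=True).stdout
    return parse_version(out)


def preflight():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if importlib.util.find_spec("torch") is None:
        problems.append("torch is not installed")
    if shutil.which("git") is None:
        problems.append("git is not on PATH")
    gcc = gcc_version()
    if gcc is None:
        problems.append("gcc is not on PATH")
    elif gcc < MIN_GCC:
        problems.append(f"gcc {'.'.join(map(str, gcc))} is older than 9")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("problem:", problem)
```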

1.2 Failed attempts

1.2.1 Direct install succeeds, but running fails [complete failure]

Direct install (CUDA 12.1 already installed)

conda create -n vllm_beam python=3.10 -y
conda activate vllm_beam
pip3 install vllm -i https://pypi.tuna.tsinghua.edu.cn/simple

1.2.1.1 Python dependencies installed automatically by vllm conflict with the NVIDIA driver version

Error message:

libcusparse.so.12: symbol __nvJitLinkComplete_12_4, version libnvJitLink.so.12 not defined in file libnvJitLink.so.12 with link time reference

The server's NVIDIA driver reports Driver Version: 535.104.05 and CUDA Version: 12.2, so CUDA 12.4 cannot be used, which leads to this error. I therefore tried downgrading PyTorch:

pip3 install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --extra-index-url https://download.pytorch.org/whl/cu121 -i https://pypi.tuna.tsinghua.edu.cn/simple
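
The reasoning above (driver reports CUDA 12.2, installed packages expect CUDA 12.4) can be expressed as a small check. This is a conservative sketch that treats a CUDA runtime newer than the driver-reported version as unsupported. In practice CUDA's minor-version compatibility rules are more nuanced, so take it as an illustration of the comparison only. The function names are hypothetical.

```python
import re


def parse_smi_header(text):
    """Pull the driver version and CUDA version out of an nvidia-smi header line."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", text)
    if not driver or not cuda:
        raise ValueError("not an nvidia-smi header line")
    return driver.group(1), (int(cuda.group(1)), int(cuda.group(2)))


def runtime_fits_driver(runtime_cuda, driver_cuda):
    """Conservative rule: the runtime must not be newer than what the driver reports."""
    return runtime_cuda <= driver_cuda


# Header values quoted from the server described in this article.
header = "Driver Version: 535.104.05 CUDA Version: 12.2"
driver, driver_cuda = parse_smi_header(header)

print("CUDA 12.4 runtime ok:", runtime_fits_driver((12, 4), driver_cuda))
print("CUDA 12.1 runtime ok:", runtime_fits_driver((12, 1), driver_cuda))
```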

1.2.1.2 PyTorch version too low

Error message:

NotImplementedError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241127-112327.pkl): Could not run 'vllm_flash_attn_c::varlen_fwd' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'vllm_flash_attn_c::varlen_fwd' is only available for these backends: [HIP, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

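1.2.2 Installing the CUDA 11.8 build

The official tutorial shows that a build for CUDA 11.8 is available.
(Screenshot: CUDA 11.8 build)
The Releases page of the vllm project then lists:
(Screenshot: vllm releases)
The Python 3 wheel is named vllm-0.6.4.post1+cu118-cp38-abi3-manylinux1_x86_64.whl.
The name encodes the following information: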

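Tag	Component	Version / description
vllm-0.6.4.post1	vllm	0.6.4.post1
cu118	CUDA	11.8
cp38	Python	3.8 (minimum CPython version)
abi3	ABI	Python's Application Binary Interface (ABI); abi3 is the stable ABI, so together with cp38 the wheel targets CPython 3.8 and later
manylinux1	Operating system	Works on most Linux distributions
x86_64	CPU architecture	x86-64 CPUs only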
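
The breakdown in the table can be reproduced mechanically. Below is a small sketch that splits a wheel file name into its tags, following the standard `name-version[-build]-python-abi-platform.whl` layout. The function name `parse_wheel_name` is made up for illustration.

```python
def parse_wheel_name(filename):
    """Split a wheel file name into its tags (name-version[-build]-python-abi-platform)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
    elif len(parts) == 6:
        # The optional build tag sits between the version and the python tag.
        name, version, _build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected number of fields in wheel name")
    # "+cu118" is a local version label; here it marks the CUDA build.
    public, _, local = version.partition("+")
    return {
        "name": name,
        "version": public,
        "local": local,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = parse_wheel_name("vllm-0.6.4.post1+cu118-cp38-abi3-manylinux1_x86_64.whl")
for key, value in info.items():
    print(f"{key}: {value}")
```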

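Steps:

Install CUDA 11.8; see step 2 (installing CUDA) of my other article, "fairseq-0.12.2 multi-node training environment setup".

Problem encountered during installation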

[INFO]: Driver not installed.
[INFO]: Checking compiler version...
[INFO]: gcc location: /mnt/yangdi/gcc-13.3.0/bin/gcc

[INFO]: gcc version: gcc version 13.3.0 (GCC) 

[ERROR]: unsupported compiler version: 13.3.0. Use --override to override this check.

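The gcc version needs to be lowered.
Option 1: comment out the newly installed gcc 13.3.0.
Option 2: install gcc 8.5.0; see step 1 (installing gcc) of the same article, "fairseq-0.12.2 multi-node training environment setup".

Delete the Python environment in conda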

conda env remove --name vllm_beam

Create a new Python environment (version 3.10)

conda create -n vllm_beam python=3.10 -y
conda activate vllm_beam

Installing with Python 3.8 failed, so I switched to 3.10.

Install vllm

pip3 install https://github.com/vllm-project/vllm/releases/download/v0.6.4.post1/vllm-0.6.4.post1+cu118-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118 -i https://pypi.tuna.tsinghua.edu.cn/simple
# If the download is too slow, fetch the wheel over a VPN first, then install it locally:
# pip3 install ./vllm-0.6.4.post1+cu118-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118 -i https://pypi.tuna.tsinghua.edu.cn/simple

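Note: the errors below were encountered with Python 3.8. The fixes may be useful for reference, but the installation did not succeed in the end:

Install error: metadata

Fix: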

curl https://sh.rustup.rs -sSf | sh

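When prompted during installation, choose option 1. Without a VPN the download takes a long time.
(Screenshot: cargo installation)

After installation, reload the environment variables: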

source ~/.bashrc

Install error: no matching torch version

pip3 install torch==2.4.1 --extra-index-url https://download.pytorch.org/whl/cu118 -i https://pypi.tuna.tsinghua.edu.cn/simple

Installing vllm 0.6.4.post1 on Python 3.8 still failed in the end.

2. vLLM Usage Examples

2.1 Sampling example

Test code

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="./opt-125m")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
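
For readers unfamiliar with the two sampling parameters used above, the toy sketch below shows what temperature scaling and top-p (nucleus) filtering do to a next-token distribution. It is a pure-Python illustration with made-up logits, not vLLM's implementation. The function name `top_p_filter` is hypothetical.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token: prob} after temperature scaling and top-p filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # Numerically stable softmax.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


toy_logits = {"a": 5.0, "b": 1.0, "c": 0.0}  # made-up values
filtered = top_p_filter(toy_logits, temperature=1.0, top_p=0.9)
print(filtered)
```

A sampler would then draw the next token from the filtered distribution, for example with `random.choices`.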

Test run

CUDA_VISIBLE_DEVICES=0 python3 test.py

Notes:

CUDA_VISIBLE_DEVICES selects which GPU(s) the script runs on.
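
The same selection can be made from inside the script, assuming the variable is set before torch or vllm is imported, since CUDA reads it when it initializes. The helper name `select_gpus` is hypothetical.

```python
import os


def select_gpus(ids):
    """Restrict this process to the given GPU indices via CUDA_VISIBLE_DEVICES."""
    value = ",".join(str(i) for i in ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this before importing torch or vllm, otherwise it may have no effect.
select_gpus([0])
print(os.environ["CUDA_VISIBLE_DEVICES"])
```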

Output:
(Screenshot: vllm run example)

2.2 Beam search example

Test code

from vllm import LLM
from vllm.sampling_params import BeamSearchParams

# max_model_len caps the context length (and therefore KV-cache memory).
llm = LLM(
    model="Meta-Llama-3.1-8B-Instruct",
    max_model_len=39456,
)

print("model load success!")

beam_params = BeamSearchParams(beam_width=4, max_tokens=100)

prompts = [
    "你是谁？",
    "你好"
]

outputs = llm.beam_search(prompts, beam_params)

for output in outputs:
    # Number of beams returned for this prompt
    print(len(output.sequences))
    # Text of the first returned beam
    print(output.sequences[0].text)
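
To make the role of `beam_width` concrete, here is a toy beam search over a hand-written next-token table. It is a pure-Python sketch of the general algorithm, not vLLM's implementation. The table `TABLE`, the tokens, and the function names are all invented for illustration.

```python
import math

# Made-up next-token log-probabilities, keyed by the last token of the sequence.
TABLE = {
    "<s>": {"I": math.log(0.6), "You": math.log(0.4)},
    "I": {"am": math.log(0.9), "</s>": math.log(0.1)},
    "You": {"are": math.log(0.9), "</s>": math.log(0.1)},
    "am": {"</s>": 0.0},
    "are": {"</s>": 0.0},
}


def toy_model(seq):
    """Return {token: logprob} for the token following `seq`."""
    return TABLE[seq[-1]]


def beam_search(step_logprobs, beam_width, max_tokens, start="<s>", eos="</s>"):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_tokens):
        # Expand every live beam by every possible next token.
        candidates = []
        for seq, score in beams:
            for tok, logprob in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + logprob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the best candidates; those ending in eos are set aside as finished.
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # include beams cut off by max_tokens
    return sorted(finished, key=lambda c: c[1], reverse=True)[:beam_width]


results = beam_search(toy_model, beam_width=2, max_tokens=4)
for seq, score in results:
    print(" ".join(seq), round(math.exp(score), 2))
```

With a larger `beam_width` more alternatives survive each step, at the cost of more model evaluations per generated token.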