Environment setup
- Download the vLLM source code
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip3 install -e .
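The editable install compiles vLLM's CUDA kernels, which can take a long time and a lot of memory. Below is a small sketch of running it in an isolated environment; the environment name is arbitrary, and MAX_JOBS is the usual PyTorch cpp_extension knob for capping parallel compile jobs, not something specific to these notes:
python3 -m venv vllm-env && source vllm-env/bin/activate   # optional: isolate the build
export MAX_JOBS=4                                          # cap parallel nvcc jobs if the build runs out of RAM
pip3 install -e .                                          # run from inside the cloned vllm directory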
- Modify the configuration. The pins below target torch 2.0.1 on CUDA 11.x (hence xformers 0.0.22, triton 2.0.0, and cupy-cuda11x): first the runtime requirements, then the build requirements that are mirrored in pyproject.toml, then the `requires` list in pyproject.toml itself. A quick version check is sketched after these blocks.
ninja # For faster builds.
psutil
ray >= 2.9
sentencepiece # Required for LLaMA tokenizer.
numpy
torch == 2.0.1
transformers >= 4.38.0 # Required for Gemma.
# xformers == 0.0.23.post1 # Required for CUDA 12.1.
xformers == 0.0.22
fastapi
uvicorn[standard]
pydantic >= 2.0 # Required for OpenAI server.
prometheus_client >= 0.18.0
pynvml == 11.5.0
triton == 2.0.0
outlines
# cupy-cuda12x == 12.1.0 # Required for CUDA graphs. CUDA 11.8 users should install cupy-cuda11x instead.
cupy-cuda11x
# Should be mirrored in pyproject.toml
ninja
packaging
setuptools>=49.4.0
torch==2.0.1
wheel
requires = [
"ninja",
"packaging",
"setuptools >= 49.4.0",
"torch == 2.0.1",
"wheel",
]
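After editing these pins, it is worth checking that the installed wheels and the local CUDA toolchain actually agree with them. The commands below are a generic sanity check, not something vLLM itself requires:
nvcc --version                                                            # CUDA toolkit used to compile the extensions
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"   # torch build and the CUDA version it was built against
pip3 list | grep -Ei "torch|xformers|triton|cupy"                         # confirm the pinned versions are what got installed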
Problems you may run into
- Installing flash-attention fails with a "pip not found" error.
You can comment out the part of setup.py that installs flash attention, run the commands from that part manually yourself, and then continue the installation (see the sketch below).
- torch version conflict between triton and vLLM.
After the steps above complete, reinstall triton.
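A minimal sketch of both fixes. The flash-attn package name and the --no-build-isolation flag are assumptions; the exact command to replay by hand depends on what your copy of setup.py actually runs:
# 1. After commenting out the flash-attention step in setup.py,
#    install flash-attn by hand (adjust to whatever command setup.py was running):
pip3 install flash-attn --no-build-isolation
# 2. Build vLLM itself from the repo root:
pip3 install -e .
# 3. If the build pulled in a triton that conflicts with torch 2.0.1, pin it back:
pip3 install --force-reinstall triton==2.0.0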
References
- https://www.cnblogs.com/marsggbo/p/17966269