Date | Version | Author | Description |
---|---|---|---|
2024-07-04 09:48:09 | V0.1 | 宋全恒 | Created document |
Introduction
Since I recently need to integrate new functionality into vLLM, I need to be able to debug my own repository, LLMs_Inference. This document records the complete process of building it from source.
Reference links:
Normally, simply running the commands below is enough to build and install from source:
git clone https://github.com/vllm-project/vllm.git
cd vllm
# export VLLM_INSTALL_PUNICA_KERNELS=1 # optionally build for multi-LoRA capability
pip install -e . # This may take 5-10 minutes.
In practice, however, it turned out to be fairly involved. Since the LLMs_Inference repository is forked from the vllm repository, the process should in theory be identical.
Repository overview
The repository contains several dependency files. These files record the project's dependencies so that they can be installed and configured in a specific environment.
- requirements.txt: lists all the dependencies the project needs together with their version constraints, so that everything can be installed in one step.
- requirements-cpu.txt, requirements-cuda.txt, requirements-rocm.txt, requirements-neuron.txt: dependency lists for specific hardware or compute environments. For example, requirements-cuda.txt contains the dependencies related to CUDA (Compute Unified Device Architecture, NVIDIA's parallel computing platform and programming model); requirements-rocm.txt covers ROCm (Radeon Open Compute platform, AMD's open compute platform); requirements-neuron.txt targets AWS Neuron devices.
- requirements-dev.txt: extra dependencies for the development environment; they are not required at runtime, but are needed for development, testing, and building.
Whether building vLLM from source requires installing all of these files depends on your specific needs and usage scenario.
If you plan to run vLLM or do development on particular hardware (CUDA, ROCm, etc.), install the dependency file that matches that environment.
To install vLLM, you typically create and activate a conda environment first, check the PyTorch version and other dependencies pinned in requirements.txt, and then install, as sketched below.
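For reference, here is a minimal sketch of that sequence on a CUDA machine. The environment name llms_inference and the Python 3.9 version are taken from the paths that appear in the logs later in this document; adjust them to your setup.
# Create and activate a dedicated conda environment
conda create -n llms_inference python=3.9 -y
conda activate llms_inference
# Install the CUDA-specific dependency list, then build vLLM from source
pip install -r requirements-cuda.txt
pip install -e .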
Preparing the vllm development environment
Installing directly on the host
Fixing the torch dependency problem
(llms_inference) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy/LLMs_Inference$ python -c "import torch; print('device count:',torch.cuda.device_count(), 'available: ', torch.cuda.is_available())"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/yuzailiang/anaconda3/envs/llms_inference/lib/python3.9/site-packages/torch/__init__.py", line 237, in <module>
from torch._C import * # noqa: F403
ImportError: /home/yuzailiang/anaconda3/envs/llms_inference/lib/python3.9/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12
The above shows that a direct install did pull in torch==2.3.0, but the GPU could not be used. The undefined symbol __nvJitLinkAddData_12_1 typically means that the nvidia-cusparse and nvidia-nvjitlink wheels pip installed come from different CUDA 12.x minor releases.
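A quick way to check for such a mismatch (a suggested diagnostic, not part of the original session) is to list the torch and NVIDIA runtime wheels and compare their CUDA minor versions:
# A cusparse wheel built against CUDA 12.1 next to an nvjitlink wheel from
# another 12.x release produces exactly this undefined-symbol error.
pip list | grep -E 'torch|nvidia'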
Solution:
Install the torch dependencies separately:
pip install torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
After installation, verify that torch can correctly drive CUDA and see the GPUs:
(llms_inference) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy/LLMs_Inference$ python -c "import torch; print('device count:',torch.cuda.device_count(), 'available: ', torch.cuda.is_available())"
device count: 8 available: True
Continuing with pip install -e ., the problem still persists:
packages/torch/__init__.py", line 237, in <module>
from torch._C import * # noqa: F403
ImportError: /home/yuzailiang/anaconda3/envs/llms_inference/lib/python3.9/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12
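One workaround commonly reported for this exact symbol error (an assumption here, not something verified in this session) is to upgrade the nvJitLink wheel so that it matches the other CUDA 12.x wheels torch depends on:
# Hypothetical fix: bring nvidia-nvjitlink-cu12 up to the same minor release
# as the cusparse wheel.
pip install -U nvidia-nvjitlink-cu12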
Trying cu121: Couldn't find CUDA library root
pip install torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Building wheels for collected packages: vllm
Building editable for vllm (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building editable for vllm (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [139 lines of output]
running editable_wheel
creating /tmp/pip-wheel-c7m73v0l/.tmp-t6j0dz53/vllm.egg-info
writing /tmp/pip-wheel-c7m73v0l/.tmp-t6j0dz53/vllm.egg-info/PKG-INFO
writing dependency_links to /tmp/pip-wheel-c7m73v0l/.tmp-t6j0dz53/vllm.egg-info/dependency_links.txt
writing requirements to /tmp/pip-wheel-c7m73v0l/.tmp-t6j0dz53/vllm.egg-info/requires.txt
writing top-level names to /tmp/pip-wheel-c7m73v0l/.tmp-t6j0dz53/vllm.egg-info/top_level.txt
writing manifest file '/tmp/pip-wheel-c7m73v0l/.tmp-t6j0dz53/vllm.egg-info/SOURCES.txt'
reading manifest file '/tmp/pip-wheel-c7m73v0l/.tmp-t6j0dz53/vllm.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file '/tmp/pip-wheel-c7m73v0l/.tmp-t6j0dz53/vllm.egg-info/SOURCES.txt'
creating '/tmp/pip-wheel-c7m73v0l/.tmp-t6j0dz53/vllm-0.4.2+cu120.dist-info'
creating /tmp/pip-wheel-c7m73v0l/.tmp-t6j0dz53/vllm-0.4.2+cu120.dist-info/WHEEL
running build_py
running build_ext
-- The CXX compiler identification is GNU 9.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build type: RelWithDebInfo
-- Target device: cuda
-- Found Python: /home/yuzailiang/anaconda3/envs/llms_inference/bin/python3.9 (found version "3.9.19") found components: Interpreter Development.Module
-- Found python matching: /home/yuzailiang/anaconda3/envs/llms_inference/bin/python3.9.
-- Found CUDA: /usr/local/cuda-12.0 (found version "12.0")
CMake Error at /tmp/pip-build-env-xxhrqd7n/overlay/lib/python3.9/site-packages/cmake/data/share/cmake-3.30/Modules/Internal/CMakeCUDAFindToolkit.cmake:148 (message):
Couldn't find CUDA library root.
Call Stack (most recent call first):
subprocess.CalledProcessError: Command '['cmake', '/mnt/self-define/sunning/lmdeploy/LLMs_Inference', '-G', 'Ninja', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/tmp/tmpwkyqp9r6.build-lib/vllm', '-DCMAKE_ARCHIVE_OUTPUT_DIRECTORY=/tmp/tmpuvdjn65m.build-temp', '-DVLLM_TARGET_DEVICE=cuda', '-DCMAKE_CXX_COMPILER_LAUNCHER=ccache', '-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache', '-DVLLM_PYTHON_EXECUTABLE=/home/yuzailiang/anaconda3/envs/llms_inference/bin/python3.9', '-DNVCC_THREADS=1', '-DCMAKE_JOB_POOL_COMPILE:STRING=compile', '-DCMAKE_JOB_POOLS:STRING=compile=256']' returned non-zero exit status 1.
This is a rather tangled environment: the build log reports the version suffix cu120, meaning it is compiling against CUDA 12.0.
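Since CMake locates the toolkit at /usr/local/cuda-12.0 but then cannot find the CUDA library root, one thing worth trying (a sketch only, with the toolkit path assumed from the /usr/local listing at the end of this document) is to point the build at the toolkit explicitly and rebuild:
# Point the build at the CUDA toolkit explicitly; CUDA_HOME is the
# conventional variable that PyTorch and vLLM builds consult.
export CUDA_HOME=/usr/local/cuda-12.0
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
pip install -e .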
Yet nvidia-smi reports a CUDA version of 12.4, which looks odd. (Note that nvidia-smi shows the highest CUDA version the installed driver supports, not the version of the toolkit installed under /usr/local, so the two numbers can legitimately differ.)
(base) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy/LLMs_Inference$ nvidia-smi
Tue Jul 9 09:02:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+---------------------
And under the /usr/local directory, there is no cuda-12.4:
(base) yuzailiang@ubuntu:/usr/local$ ll | grep cuda
lrwxrwxrwx 1 root root 22 Jul 8 05:22 cuda -> /etc/alternatives/cuda/
lrwxrwxrwx 1 root root 25 Jul 8 05:22 cuda-12 -> /etc/alternatives/cuda-12/
drwxr-xr-x 17 root root 4096 Jun 2
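To see which toolkit the /usr/local/cuda symlink ultimately resolves to, and what the compiler itself reports (a suggested follow-up check, not part of the original session):
# Resolve the alternatives chain behind /usr/local/cuda
readlink -f /usr/local/cuda
# Report the toolkit version nvcc claims
nvcc --version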