Deploying vLLM with gemma-2b locally on a PC

Latest recommended article published on 2024-08-11 14:57:27

yi_xiansen

Tags: deep learning, artificial intelligence, llama

Article link: https://blog.csdn.net/yi_xiansen/article/details/139401176

Deploying vLLM locally on a PC

vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast thanks to the following features:

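State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graphs
Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV cache
Optimized CUDA kernels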

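vLLM is also flexible and easy to use with the following:

Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and others
Tensor parallelism support for distributed inference
Streaming output
An OpenAI-compatible API server
Support for NVIDIA GPUs and AMD GPUs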

Because vLLM does not support Windows, the rest of this article describes how to deploy it on WSL2 (a Linux environment).

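1. Installing WSL2

For WSL installation, see the article on this site titled "WSL2 Installation (Detailed Process)".

Note: the WSL version must be 2; otherwise the GPU driver cannot be loaded in the later steps.
Differences between WSL 1 and WSL 2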
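
Since the later steps depend on WSL 2, it can help to confirm the version from inside the Linux shell. The sketch below is a heuristic, not an official check: it assumes that WSL 2 kernels carry a release string such as `...-microsoft-standard-WSL2` while WSL 1 reports something like `...-Microsoft`. The function name `detect_wsl_version` is made up for this illustration. Running `wsl -l -v` on the Windows side remains the authoritative way to see the version of each distribution.

```python
import platform


def detect_wsl_version(kernel_release: str) -> int:
    """Guess the WSL version from a kernel release string (heuristic).

    Returns 2 for WSL 2, 1 for WSL 1, and 0 if the string does not look like WSL.
    """
    release = kernel_release.lower()
    if "microsoft" not in release:
        return 0
    # Assumption: WSL 2 kernels are tagged "...-microsoft-standard-WSL2".
    if "wsl2" in release or "microsoft-standard" in release:
        return 2
    return 1


if __name__ == "__main__":
    # platform.release() returns the running kernel's release string.
    print("Detected WSL version:", detect_wsl_version(platform.release()))
```

A result of 1 or 0 suggests the distribution should be converted to WSL 2 (or that the script is not running under WSL at all) before continuing.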

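2. Installing vLLM

The vLLM website recommends installing CUDA 12.1 and managing Python packages with conda.

From the Anaconda3 download page, choose the Linux x86 installer and download it locally. After downloading it to a directory of your choice (I used the D: drive), open a WSL2 terminal, run cd /mnt/d, and then run bash Anaconda3-2024.02-1-Linux-x86_64.sh to start the installation. Press Enter and answer yes through the prompts. Answering yes at the final step is recommended, because it writes the conda environment setup into ~/.bashrc.
To configure the Tsinghua mirror for conda, run vim ~/.condarc and enter the following content: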

channels:
  - defaults
show_channel_urls: true
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud

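Create the Python environment: conda create -n vllm python=3.9 -y
Activate the vllm conda environment: conda activate vllm
Configure the Tsinghua PyPI mirror for pip: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
Once CUDA has been installed (section 3), install the vllm package: pip install vllm
After installing vllm, complete the PyTorch 2.3.0 installation: pip3 install torch torchvision torchaudio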
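
After these installs, a quick look at what PyTorch reports can catch a CPU-only build or a missing driver early. The following is a minimal sketch using standard PyTorch calls (`torch.cuda.is_available`, `torch.version.cuda`, `torch.cuda.get_device_name`). The helper name `cuda_report` and the shape of the returned dictionary are assumptions made for illustration.

```python
def cuda_report() -> dict:
    """Collect basic PyTorch/CUDA status for the active Python environment."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment.
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Run it inside the activated vllm conda environment. If `cuda_available` is False, the driver and CUDA steps are worth revisiting before starting vLLM.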

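3. Installing CUDA

Before installing CUDA, install the NVIDIA driver on Windows, choosing the version that matches your graphics card. Once it is installed, running nvidia-smi in either the Windows cmd terminal or the WSL2 terminal should print normal output.
Each CUDA release requires a matching driver version. For details, see the official "CUDA Toolkit and Corresponding Driver Versions" table; mismatched versions will prevent CUDA from working correctly.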
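
To compare the installed driver against NVIDIA's compatibility table, the driver version and GPU memory can be read programmatically. This sketch relies on the `nvidia-smi --query-gpu=... --format=csv,noheader` query mode. The function names and the sample row used in the comment are hypothetical and are not output captured from the author's machine.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_rows(csv_text: str) -> list:
    """Turn 'name, driver, memory' CSV rows into a list of dictionaries."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "driver_version": driver, "memory_total": memory})
    return gpus


def query_gpus() -> list:
    """Run nvidia-smi if it is on PATH; return an empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    # A row is expected to look like "<GPU name>, <driver version>, <N> MiB".
    for gpu in query_gpus():
        print(gpu)
```

An empty result means nvidia-smi was not found, which usually points to the Windows driver not being installed or WSL not being version 2.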

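vLLM officially uses CUDA 12.1. Installing with the deb (network) method uses the commands below. Note that the cuda meta-package pulls the newest release available in the repository; a version-specific package such as cuda-toolkit-12-1 pins the version instead.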

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

After installation, add CUDA to the environment variables.

echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

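If nvcc -V prints its version information, CUDA was installed successfully. To verify that CUDA actually works, run cd /usr/local/cuda/extras/demo_suite/ && ./deviceQuery && cd -; a PASS result in the output confirms that CUDA is functioning correctly.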
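
The nvcc check can also be scripted, for example to confirm that the installed toolkit really is 12.1. The sketch below assumes that nvcc prints a line containing `release <major>.<minor>` (as in "Cuda compilation tools, release 12.1, V12.1.105"). The function names are made up for this illustration.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output: str):
    """Extract (major, minor) from nvcc -V output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release():
    """Run nvcc -V if nvcc is on PATH and parse its release number."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = installed_cuda_release()
    if release is None:
        print("nvcc not found or version not recognized; check PATH in ~/.bashrc")
    else:
        print("CUDA toolkit release: %d.%d" % release)
```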

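4. Functional verification

My PC has an RTX 2070 with only 8 GB of VRAM, so it can only run models with fewer than 4B parameters. Google's gemma-2b is used for the demonstration below.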
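
The 4B limit follows from simple arithmetic on weight size. As a rough sketch, assume float16 weights (2 bytes per parameter) and nominal parameter counts of 2e9 and 4e9. The estimate covers the weights only and ignores the KV cache, activations, and CUDA context, so real usage is higher. The function name is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    for label, params in [("2B", 2e9), ("4B", 4e9)]:
        # float16 stores each parameter in 2 bytes.
        print(f"{label} model in float16: {weight_memory_gib(params):.2f} GiB")
```

Under these assumptions a 2B model needs about 3.7 GiB for weights, which leaves room for the KV cache on an 8 GB card, while a 4B model needs about 7.5 GiB and leaves almost nothing.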
Referring to the vLLM engine startup arguments, I start the vLLM OpenAI server with the following command.

python3 -m vllm.entrypoints.openai.api_server --model /data/gemma_test/gemma-2b/ --dtype float16 --served-model-name gemma-2b --host 127.0.0.1 --port 8080

Output like the following indicates that the server started successfully.
Figure: vLLM server output
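
Besides reading the log, the server can be probed over HTTP. OpenAI-compatible servers expose a `/v1/models` endpoint that lists the served models. The sketch below assumes the host and port from the startup command (127.0.0.1:8080) and the standard response shape with a `data` list of objects carrying an `id`. The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # host and port from the startup command


def extract_model_ids(models_response: dict) -> list:
    """Pull the model ids out of a /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_served_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list:
    """Query the running server and return the ids of the served models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as response:
        return extract_model_ids(json.load(response))


if __name__ == "__main__":
    # Raises URLError if the server is not up yet.
    print(list_served_models())
```

With the `--served-model-name gemma-2b` flag used above, the returned list should contain `gemma-2b`, which is the name clients must send in the `model` field.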

Open another terminal and enter the following command to get the model's output.

curl http://127.0.0.1:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gemma-2b",
        "prompt": "请你介绍一下你自己"
    }'

Figure: vLLM response
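
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server address and model name from the startup command. The `max_tokens` value of 128 is an arbitrary choice for illustration (the field is a standard completions parameter), and the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model="gemma-2b",
                             base_url="http://127.0.0.1:8080", max_tokens=128):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to the server and return the first completion's text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Requires the vLLM server to be running.
    print(complete("Please introduce yourself"))
```

Setting `max_tokens` explicitly is useful because the completions API otherwise falls back to a short default length, which can cut the answer off early.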