A Detailed Tutorial on Deploying Large Models with Xinference, Using gemma-it as the Example

放飞自我的Coder

Last modified 2024-03-12 23:20:41

Tags: xinference llm openai API

First published 2024-03-10 18:12:33

Article link: https://blog.csdn.net/qq_39749966/article/details/136605916

Xinference documentation link

Speeding up downloads

source /etc/network_turbo # AutoDL platform only
pip config set global.index-url https://mirrors.pku.edu.cn/pypi/web/simple

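Step 1

Install Xinference and vLLM:

vLLM is a high-performance inference engine for large models that supports high concurrency. Xinference automatically selects vLLM as the engine, for higher throughput, when all of the following conditions hold:
The model format is PyTorch or GPTQ
The quantization is GPTQ 4-bit or none
The operating system is Linux and at least one CUDA-capable GPU is available
The model is on the list of models supported by the vLLM engine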
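
The four conditions above can be read as a simple checklist. The sketch below is illustrative only: `meets_vllm_conditions` and its parameters are hypothetical names, not part of Xinference, and the real engine-selection logic may differ. It assumes that 4-bit quantization only qualifies when the format is GPTQ.

```python
import platform
from typing import Optional


# Hypothetical helper, NOT an Xinference API: it only restates the
# four conditions listed above as a checklist.
def meets_vllm_conditions(
    model_format: str,
    quant_bits: Optional[int],   # None means "no quantization"
    model_in_vllm_list: bool,
    has_cuda_gpu: bool,
    system: Optional[str] = None,
) -> bool:
    system = system or platform.system()
    fmt = model_format.lower()

    # Condition 1: the format is PyTorch or GPTQ.
    if fmt not in {"pytorch", "gptq"}:
        return False
    # Condition 2: unquantized, or 4-bit GPTQ (assumption: 4-bit
    # counts only together with the GPTQ format).
    if quant_bits is not None and not (fmt == "gptq" and quant_bits == 4):
        return False
    # Condition 3: Linux with at least one CUDA-capable GPU.
    if system != "Linux" or not has_cuda_gpu:
        return False
    # Condition 4: the model is on vLLM's supported list.
    return model_in_vllm_list
```

If any check fails, installing `xinference[vllm]` is harmless, but you should not expect the vLLM engine to be picked for that model.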

pip install "xinference[vllm]"

The PyTorch (transformers) engine supports `almost all recent models` and is the default engine for PyTorch-format models:

pip install "xinference[transformers]"

When using the `GGML` engine, it is recommended to install the dependencies manually for your hardware to get the best acceleration.

pip install xinference ctransformers

Nvidia

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

AMD

CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python

Mac

CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
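
If you script the installation across machines, the three variants above reduce to a lookup table. This is a minimal sketch: the flag values are copied from the commands above, while the function name and the backend keys (`nvidia`, `amd`, `mac`) are hypothetical labels chosen for illustration. Newer llama-cpp-python releases may use different flag names, so check the version you install.

```python
# Flag values copied from the commands above; the dictionary keys are
# hypothetical labels used only in this sketch.
CMAKE_ARGS_BY_BACKEND = {
    "nvidia": "-DLLAMA_CUBLAS=on",
    "amd": "-DLLAMA_HIPBLAS=on",
    "mac": "-DLLAMA_METAL=on",
}


def llama_cpp_install_command(backend: str) -> str:
    """Return the shell command that builds llama-cpp-python for a backend."""
    key = backend.lower()
    if key not in CMAKE_ARGS_BY_BACKEND:
        raise ValueError(f"unknown backend: {backend!r}")
    cmake_args = CMAKE_ARGS_BY_BACKEND[key]
    return f'CMAKE_ARGS="{cmake_args}" pip install llama-cpp-python'


print(llama_cpp_install_command("nvidia"))
```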

If you use quantization:

pip install accelerate
pip install bitsandbytes

Step 2

Set up the development environment (see the documentation)

Install Node.js

conda install nodejs

If that fails, try running:

conda clean -i
# or delete the conda config; $HOME is usually root's home, i.e. /root/.condarc
rm $HOME/.condarc

Download the source code

git clone https://github.com/xorbitsai/inference.git
cd inference

pip install -e .
xinference-local

Build the frontend (not needed if you only use the API). Run the npm commands from the repository's frontend directory.

npm cache clean --force
npm install
npm run build

# -> then return to the directory that contains setup.cfg and setup.py

pip install -e .

Getting started

First, start the local service

XINFERENCE_HOME=./models/ xinference-local --host 0.0.0.0 --port 9997

Then launch the model

quantization options: none, 4-bit, 8-bit
To speed up downloads in mainland China, set HF_ENDPOINT=https://hf-mirror.com XINFERENCE_MODEL_SRC=modelscope

HF_ENDPOINT=https://hf-mirror.com XINFERENCE_MODEL_SRC=modelscope xinference launch --model-name gemma-it --size-in-billions 2 --model-format pytorch --quantization 8-bit
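
To launch different sizes or quantization levels from a script, the same command can be assembled in Python. This is a rough sketch: `build_launch_command` and `build_env` are hypothetical helpers, and the flags and environment variables are exactly the ones shown in the command above.

```python
import os
import shlex
from typing import Dict, List


def build_launch_command(
    model_name: str, size_in_billions: int, model_format: str, quantization: str
) -> List[str]:
    """Assemble the `xinference launch` argument list used above."""
    return [
        "xinference", "launch",
        "--model-name", model_name,
        "--size-in-billions", str(size_in_billions),
        "--model-format", model_format,
        "--quantization", quantization,
    ]


def build_env(use_mirror: bool = True) -> Dict[str, str]:
    """Copy the current environment and optionally add the mirror settings."""
    env = dict(os.environ)
    if use_mirror:
        env["HF_ENDPOINT"] = "https://hf-mirror.com"
        env["XINFERENCE_MODEL_SRC"] = "modelscope"
    return env


cmd = build_launch_command("gemma-it", 2, "pytorch", "8-bit")
print(shlex.join(cmd))

# With the local service already running, execute it with:
#   import subprocess
#   subprocess.run(cmd, env=build_env(), check=True)
```

Passing the arguments as a list avoids shell-quoting problems if a model name ever contains unusual characters.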

Testing

List the running models

from xinference.client import Client

# The server listens on 0.0.0.0; connect to it through localhost.
client = Client("http://localhost:9997")
print(client.list_models())
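
To check from a script that the model really is running, you can filter the listing by model name. The sketch below assumes `list_models()` returns a dict that maps each model uid to a description dict containing a `model_name` key; verify that shape against your Xinference version. `uids_for_model` is a hypothetical helper, and the sample data is made up to keep the example runnable without a server.

```python
from typing import Any, Dict, List


def uids_for_model(models: Dict[str, Dict[str, Any]], model_name: str) -> List[str]:
    """Return the uids of running models whose description names model_name."""
    return [
        uid
        for uid, description in models.items()
        if description.get("model_name") == model_name
    ]


# Made-up sample shaped like the ASSUMED return value; with a live
# server, pass client.list_models() instead.
sample = {"gemma-it": {"model_type": "LLM", "model_name": "gemma-it"}}
print(uids_for_model(sample, "gemma-it"))
```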

Calling the OpenAI-compatible API

import openai

# Assume that the model is already launched.
# The api_key can't be empty, any string is OK.
model_uid = 'gemma-it'
client = openai.Client(api_key="not empty", base_url="http://localhost:9997/v1")
response = client.chat.completions.create(
    model=model_uid,
    messages=[
        {
            "content": "What is the largest animal?",
            "role": "user",
        }
    ],
    max_tokens=1024
)
# Print the text of the first reply.
print(response.choices[0].message.content)
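
Because the endpoint follows the OpenAI chat-completions format, the same call can be made without the `openai` package. This is a minimal sketch using only the standard library; `build_chat_request` is a hypothetical helper, and it assumes the server was started without authentication, so no API key header is sent.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model_uid: str, prompt: str, max_tokens: int = 1024
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": model_uid,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:9997/v1", "gemma-it", "What is the largest animal?"
)
print(request.full_url)

# With the server running, send the request and read the reply:
#   with urllib.request.urlopen(request) as resp:
#       body = json.load(resp)
#       print(body["choices"][0]["message"]["content"])
```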