参考连接:官方网站
一、Xinference 是什么?
Xorbits Inference (Xinference) 是一个开源平台,用于简化各种 AI 模型的运行和集成。借助 Xinference,您可以使用任何开源 LLM、嵌入模型和多模态模型在云端或本地环境中运行推理,并创建强大的 AI 应用。简单来讲,就是一个可以安装各种模型可视化的安装平台。
1.1 准备工作
- Xinference 使用 GPU 加速推理,该镜像需要在有 GPU 显卡并且安装 CUDA 的机器上运行。
- 保证 CUDA 在机器上正确安装。可以使用
nvidia-smi
检查是否正确运行。 - 镜像中的 CUDA 版本为
12.4
。为了不出现预期之外的问题,请将宿主机的 CUDA 版本和 NVIDIA Driver 版本分别升级到12.4
和550
以上。
注意: 在安装之前可以先cmd执行一下 nvidia-smi 命令,看看本机的gpu版本多少的。按照官网的要求,CUDA和NVIDIA Driver的版本必须得12.4和550以上。我本地是 CUDA Version: 12.2 、Driver Version: 537.34 的,但也能运行起来,我升级升不上去,不知道为啥,如果有知道的小伙伴,欢迎交流。
二、通过Docker安装
因为我是Windows操作系统,也不想直接通过本地安装的方式安装,就直接参考官网,通过Docker的方式安装,很简单,一条命令即可。
注意:通过docker方式安装,电脑必须要有GPU(显卡),否则安装失败。
快捷命令:
Windows 执行命令:(注意盘符问题)
docker run -d --name xinference --gpus all -v e:/xinference/models:/root/models -v e:/xinference/.xinference:/root/.xinference -v e:/xinference/.cache/huggingface:/root/.cache/huggingface -e XINFERENCE_HOME=/root/models -p 9997:9997 registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest xinference-local -H 0.0.0.0
Linux执行命令:
docker run -d --name xinference --gpus all -v /opt/xinference/models:/root/models -v /opt/xinference/.xinference:/root/.xinference -v /opt/xinference/.cache/huggingface:/root/.cache/huggingface -e XINFERENCE_HOME=/root/models -p 9997:9997 registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest xinference-local -H 0.0.0.0
参数解释 (重点):
参数名 | 解释 | 是否必填 |
---|---|---|
–name | 设置容器名称 | 是 |
–gpus | 使用gpu | 是 |
-v e:/xinference/models:/root/models | 默认情况下,镜像中不包含任何模型文件,使用过程中会在容器内下载模型。如果需要使用已经下载好的模型,需要将宿主机的目录挂载到容器内。这种情况下,需要在运行容器时指定本地卷,并且为 Xinference 配置环境变量。 | 是 (自定义挂载目录,与下面默认挂载方式二选一) |
-e XINFERENCE_HOME=/root/models | 将主机上指定的目录挂载到容器中,并设置 XINFERENCE_HOME 环境变量指向容器内的该目录。这样,所有下载的模型文件将存储在您在主机上指定的目录中。您无需担心在 Docker 容器停止时丢失这些文件,下次运行容器时,您可以直接使用现有的模型,无需重复下载。 | 是 (如果选择自定义目录,则需要指定环境变量) |
-v e:/xinference/.xinference:/root/.xinference -v e:/xinference/.cache/huggingface:/root/.cache/huggingface | 如果你在宿主机使用的默认路径下载的模型,由于 xinference cache 目录是用的软链的方式存储模型,需要将原文件所在的目录也挂载到容器内。例如你使用 huggingface 和 modelscope 作为模型仓库,那么需要将这两个对应的目录挂载到容器内,一般对应的 cache 目录分别在 <home_path>/.cache/huggingface 和 <home_path>/.cache/modelscope | 是(默认挂载方式与上面自定义挂载方式二选一) |
-p 9997:9997 | 端口映射 | 是 |
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest | 当前,可以通过两个渠道拉取 Xinference 的官方镜像。 1. 在 Dockerhub 的 xprobe/xinference 仓库里。2. Dockerhub 中的镜像会同步上传一份到阿里云公共镜像仓库中,供访问 Dockerhub 有困难的用户拉取。 目前可用的标签包括: nightly-main : 这个镜像会每天从 GitHub main 分支更新制作,不保证稳定可靠。v<release version> : 这个镜像会在 Xinference 每次发布的时候制作,通常可以认为是稳定可靠的。latest : 这个镜像会在 Xinference 发布时指向最新的发布版本 。 对于 CPU 版本,增加 -cpu后缀,如 nightly-main-cpu`。 | 是 |
-H 0.0.0.0 | -H 0.0.0.0 必须指定的,否则在容器外无法连接到 Xinference 服务。 | 是 |
2.1 页面访问
通过以上命令启动之后,即可通过 localhost:9997 也可以通过本机IP地址访问,比如 192.168.1.152:9997 去访问。
2.2 操作页面介绍
这里简单介绍一下操作页面,界面很简单,基本都能看懂,需要什么模型,去到对应的模型库里面下载即可。
三、部署一个简单模型
部署好之后,我们在线部署一个简单对话模型: 以 qwen-chat 为例
3.1 搜索 qwen-chat 回车
3.2 开始部署
参数说明:
参数填写完之后,点击小火箭,即可部署。这里需要等待,因为需要去模型仓库里面拉取模型,默认两个:huggingface和modelscope 。下载模型需要开代理,我这边下载默认是从huggingface里面下载的,所以全程代理下载。部署速度由代理速度决定。
3.3 开始对话
部署好之后,我们在 “Running Models” 看到模型。需要注意的是,能够跑的模型数量取决于GPU数量,如果你只有一颗GPU,那只能跑一个模型,以此类推。
点击后面的开始对话,即可跳转对面页面。
可以看到,这么模型还不错。
四、对接本地模型
Xinference作为一个模型集成平台,当然也可以对接本地的模型。
4.1 下载模型
首先,去到对应的模型仓库,下载好自己想要对接的模型。此处下载的时候需要注意,一定要参考官方能够支持的模型种类,否则是注册不了的。
模型仓库:
https://huggingface.co/
https://www.modelscope.cn/
目前Xinference 支持的模型家族有,大部分都是支持的。
MODEL NAME | ABILITIES | COTNEXT_LENGTH | DESCRIPTION |
---|---|---|---|
aquila2 | generate | 2048 | Aquila2 series models are the base language models |
aquila2-chat | chat | 2048 | Aquila2-chat series models are the chat models |
aquila2-chat-16k | chat | 16384 | AquilaChat2-16k series models are the long-text chat models |
baichuan | generate | 4096 | Baichuan is an open-source Transformer based LLM that is trained on both Chinese and English data. |
baichuan-2 | generate | 4096 | Baichuan2 is an open-source Transformer based LLM that is trained on both Chinese and English data. |
baichuan-2-chat | chat | 4096 | Baichuan2-chat is a fine-tuned version of the Baichuan LLM, specializing in chatting. |
baichuan-chat | chat | 4096 | Baichuan-chat is a fine-tuned version of the Baichuan LLM, specializing in chatting. |
c4ai-command-r-v01 | chat | 131072 | C4AI Command-R(+) is a research release of a 35 and 104 billion parameter highly performant generative model. |
chatglm | chat | 2048 | ChatGLM is an open-source General Language Model (GLM) based LLM trained on both Chinese and English data. |
chatglm2 | chat | 8192 | ChatGLM2 is the second generation of ChatGLM, still open-source and trained on Chinese and English data. |
chatglm2-32k | chat | 32768 | ChatGLM2-32k is a special version of ChatGLM2, with a context window of 32k tokens instead of 8k. |
chatglm3 | chat, tools | 8192 | ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data. |
chatglm3-128k | chat | 131072 | ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data. |
chatglm3-32k | chat | 32768 | ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data. |
code-llama | generate | 100000 | Code-Llama is an open-source LLM trained by fine-tuning LLaMA2 for generating and discussing code. |
code-llama-instruct | chat | 100000 | Code-Llama-Instruct is an instruct-tuned version of the Code-Llama LLM. |
code-llama-python | generate | 100000 | Code-Llama-Python is a fine-tuned version of the Code-Llama LLM, specializing in Python. |
codegeex4 | chat | 131072 | the open-source version of the latest CodeGeeX4 model series |
codeqwen1.5 | generate | 65536 | CodeQwen1.5 is the Code-Specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of data of codes. |
codeqwen1.5-chat | chat | 65536 | CodeQwen1.5 is the Code-Specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of data of codes. |
codeshell | generate | 8194 | CodeShell is a multi-language code LLM developed by the Knowledge Computing Lab of Peking University. |
codeshell-chat | chat | 8194 | CodeShell is a multi-language code LLM developed by the Knowledge Computing Lab of Peking University. |
codestral-v0.1 | generate | 32768 | Codestrall-22B-v0.1 is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash |
cogvlm2 | chat, vision | 8192 | CogVLM2 have achieved good results in many lists compared to the previous generation of CogVLM open source models. Its excellent performance can compete with some non-open source models. |
csg-wukong-chat-v0.1 | chat | 32768 | csg-wukong-1B is a 1 billion-parameter small language model(SLM) pretrained on 1T tokens. |
deepseek | generate | 4096 | DeepSeek LLM, trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. |
deepseek-chat | chat | 4096 | DeepSeek LLM is an advanced language model comprising 67 billion parameters. It has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. |
deepseek-coder | generate | 16384 | Deepseek Coder is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese. |
deepseek-coder-instruct | chat | 16384 | deepseek-coder-instruct is a model initialized from deepseek-coder-base and fine-tuned on 2B tokens of instruction data. |
deepseek-vl-chat | chat, vision | 4096 | DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios. |
falcon | generate | 2048 | Falcon is an open-source Transformer based LLM trained on the RefinedWeb dataset. |
falcon-instruct | chat | 2048 | Falcon-instruct is a fine-tuned version of the Falcon LLM, specializing in chatting. |
gemma-2-it | chat | 8192 | Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. |
gemma-it | chat | 8192 | Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. |
glaive-coder | chat | 16384 | A code model trained on a dataset of ~140k programming related problems and solutions generated from Glaive’s synthetic data generation platform. |
glm-4v | chat, vision | 8192 | GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
glm4-chat | chat, tools | 131072 | GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
glm4-chat-1m | chat, tools | 1048576 | GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
gorilla-openfunctions-v1 | chat | 4096 | OpenFunctions is designed to extend Large Language Model (LLM) Chat Completion feature to formulate executable APIs call given natural language instructions and API context. |
gorilla-openfunctions-v2 | chat | 4096 | OpenFunctions is designed to extend Large Language Model (LLM) Chat Completion feature to formulate executable APIs call given natural language instructions and API context. |
gpt-2 | generate | 1024 | GPT-2 is a Transformer-based LLM that is trained on WebTest, a 40 GB dataset of Reddit posts with 3+ upvotes. |
internlm-20b | generate | 16384 | Pre-trained on over 2.3T Tokens containing high-quality English, Chinese, and code data. |
internlm-7b | generate | 8192 | InternLM is a Transformer-based LLM that is trained on both Chinese and English data, focusing on practical scenarios. |
internlm-chat-20b | chat | 16384 | Pre-trained on over 2.3T Tokens containing high-quality English, Chinese, and code data. The Chat version has undergone SFT and RLHF training. |
internlm-chat-7b | chat | 4096 | Internlm-chat is a fine-tuned version of the Internlm LLM, specializing in chatting. |
internlm2-chat | chat | 32768 | The second generation of the InternLM model, InternLM2. |
internlm2.5-chat | chat | 32768 | InternLM2.5 series of the InternLM model. |
internlm2.5-chat-1m | chat | 262144 | InternLM2.5 series of the InternLM model supports 1M long-context |
internvl-chat | chat, vision | 32768 | InternVL 1.5 is an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. |
llama-2 | generate | 4096 | Llama-2 is the second generation of Llama, open-source and trained on a larger amount of data. |
llama-2-chat | chat | 4096 | Llama-2-Chat is a fine-tuned version of the Llama-2 LLM, specializing in chatting. |
llama-3 | generate | 8192 | Llama 3 is an auto-regressive language model that uses an optimized transformer architecture |
llama-3-instruct | chat | 8192 | The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks… |
llama-3.1 | generate | 131072 | Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture |
llama-3.1-instruct | chat | 131072 | The Llama 3.1 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks… |
minicpm-2b-dpo-bf16 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
minicpm-2b-dpo-fp16 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
minicpm-2b-dpo-fp32 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
minicpm-2b-sft-bf16 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
minicpm-2b-sft-fp32 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
minicpm-llama3-v-2_5 | chat, vision | 2048 | MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. |
mistral-instruct-v0.1 | chat | 8192 | Mistral-7B-Instruct is a fine-tuned version of the Mistral-7B LLM on public datasets, specializing in chatting. |
mistral-instruct-v0.2 | chat | 8192 | The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1. |
mistral-instruct-v0.3 | chat | 32768 | The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1. |
mistral-large-instruct | chat | 131072 | Mistral-Large-Instruct-2407 is an advanced dense Large Language Model (LLM) of 123B parameters with state-of-the-art reasoning, knowledge and coding capabilities. |
mistral-nemo-instruct | chat | 1024000 | The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-Nemo-Base-2407 |
mistral-v0.1 | generate | 8192 | Mistral-7B is a unmoderated Transformer based LLM claiming to outperform Llama2 on all benchmarks. |
mixtral-8x22b-instruct-v0.1 | chat | 65536 | The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of the Mixtral-8x22B-v0.1, specializing in chatting. |
mixtral-instruct-v0.1 | chat | 32768 | Mistral-8x7B-Instruct is a fine-tuned version of the Mistral-8x7B LLM, specializing in chatting. |
mixtral-v0.1 | generate | 32768 | The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. |
omnilmm | chat, vision | 2048 | OmniLMM is a family of open-source large multimodal models (LMMs) adept at vision & language modeling. |
openbuddy | chat | 2048 | OpenBuddy is a powerful open multilingual chatbot model aimed at global users. |
openhermes-2.5 | chat | 8192 | Openhermes 2.5 is a fine-tuned version of Mistral-7B-v0.1 on primarily GPT-4 generated data. |
opt | generate | 2048 | Opt is an open-source, decoder-only, Transformer based LLM that was designed to replicate GPT-3. |
orca | chat | 2048 | Orca is an LLM trained by fine-tuning LLaMA on explanation traces obtained from GPT-4. |
orion-chat | chat | 4096 | Orion-14B series models are open-source multilingual large language models trained from scratch by OrionStarAI. |
orion-chat-rag | chat | 4096 | Orion-14B series models are open-source multilingual large language models trained from scratch by OrionStarAI. |
phi-2 | generate | 2048 | Phi-2 is a 2.7B Transformer based LLM used for research on model safety, trained with data similar to Phi-1.5 but augmented with synthetic texts and curated websites. |
phi-3-mini-128k-instruct | chat | 128000 | The Phi-3-Mini-128K-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. |
phi-3-mini-4k-instruct | chat | 4096 | The Phi-3-Mini-4k-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. |
platypus2-70b-instruct | generate | 4096 | Platypus-70B-instruct is a merge of garage-bAInd/Platypus2-70B and upstage/Llama-2-70b-instruct-v2. |
qwen-chat | chat, tools | 32768 | Qwen-chat is a fine-tuned version of the Qwen LLM trained with alignment techniques, specializing in chatting. |
qwen-vl-chat | chat, vision | 4096 | Qwen-VL-Chat supports more flexible interaction, such as multiple image inputs, multi-round question answering, and creative capabilities. |
qwen1.5-chat | chat, tools | 32768 | Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. |
qwen1.5-moe-chat | chat, tools | 32768 | Qwen1.5-MoE is a transformer-based MoE decoder-only language model pretrained on a large amount of data. |
qwen2-instruct | chat, tools | 32768 | Qwen2 is the new series of Qwen large language models |
qwen2-moe-instruct | chat, tools | 32768 | Qwen2 is the new series of Qwen large language models. |
seallm_v2 | generate | 8192 | We introduce SeaLLM-7B-v2, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages |
seallm_v2.5 | generate | 8192 | We introduce SeaLLM-7B-v2.5, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages |
skywork | generate | 4096 | Skywork is a series of large models developed by the Kunlun Group · Skywork team. |
skywork-math | generate | 4096 | Skywork is a series of large models developed by the Kunlun Group · Skywork team. |
starchat-beta | chat | 8192 | Starchat-beta is a fine-tuned version of the Starcoderplus LLM, specializing in coding assistance. |
starcoder | generate | 8192 | Starcoder is an open-source Transformer based LLM that is trained on permissively licensed data from GitHub. |
starcoderplus | generate | 8192 | Starcoderplus is an open-source LLM trained by fine-tuning Starcoder on RedefinedWeb and StarCoderData datasets. |
starling-lm | chat | 4096 | We introduce Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model harnesses the power of our new GPT-4 labeled ranking dataset |
telechat | chat | 8192 | The TeleChat is a large language model developed and trained by China Telecom Artificial Intelligence Technology Co., LTD. The 7B model base is trained with 1.5 trillion Tokens and 3 trillion Tokens and Chinese high-quality corpus. |
tiny-llama | generate | 2048 | The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. |
vicuna-v1.3 | chat | 2048 | Vicuna is an open-source LLM trained by fine-tuning LLaMA on data collected from ShareGPT. |
vicuna-v1.5 | chat | 4096 | Vicuna is an open-source LLM trained by fine-tuning LLaMA on data collected from ShareGPT. |
vicuna-v1.5-16k | chat | 16384 | Vicuna-v1.5-16k is a special version of Vicuna-v1.5, with a context window of 16k tokens instead of 4k. |
wizardcoder-python-v1.0 | chat | 100000 | |
wizardlm-v1.0 | chat | 2048 | WizardLM is an open-source LLM trained by fine-tuning LLaMA with Evol-Instruct. |
wizardmath-v1.0 | chat | 2048 | WizardMath is an open-source LLM trained by fine-tuning Llama2 with Evol-Instruct, specializing in math. |
xverse | generate | 2048 | XVERSE is a multilingual large language model, independently developed by Shenzhen Yuanxiang Technology. |
xverse-chat | chat | 2048 | XVERSEB-Chat is the aligned version of model XVERSE. |
yi | generate | 4096 | The Yi series models are large language models trained from scratch by developers at 01.AI. |
yi-1.5 | generate | 4096 | Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
yi-1.5-chat | chat | 4096 | Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
yi-1.5-chat-16k | chat | 16384 | Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
yi-200k | generate | 262144 | The Yi series models are large language models trained from scratch by developers at 01.AI. |
yi-chat | chat | 4096 | The Yi series models are large language models trained from scratch by developers at 01.AI. |
yi-vl-chat | chat, vision | 4096 | Yi Vision Language (Yi-VL) model is the open-source, multimodal version of the Yi Large Language Model (LLM) series, enabling content comprehension, recognition, and multi-round conversations about images. |
zephyr-7b-alpha | chat | 8192 | Zephyr-7B-α is the first model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1. |
zephyr-7b-beta | chat | 8192 | Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 |
4.2 注册模型
下载好对应模型之后,放到容器挂载的目录下面。此处目录注意,如果是第一次注册本地模型,直接放到你启动Xinference的挂载目录即可。比如 e:/xinference/models ,注册之后会自动创建对应的模型仓库,然后移动模型,以后就可以直接放到对应仓库下面即可。
页面注册:
模型参考信息参考:模型家族
确认信息无误之后,点击注册模型,如果没有报错,即注册成功。
4.3 启动模型
点击注册好的模型,填写相关信息,开始启动,相关信息填写参考上面部署简单模型那里。
4.4 开始对话
启动好之后,还是在“Running Models” 当中,点击后面的对话,即可开始对话。
五、总结
以上就是 Windows11 通过 Docker 部署Xinference 平台的操作步骤。
关于上面显卡没办法升级的问题,还没解决,如果有会的小伙伴,欢迎交流。
【亲测】MaxKB如何对接 Xinference 大模型