[Tested] Installing the Xinference Platform on Windows 11 via Docker

Reference link: official website

1. What is Xinference?

Xorbits Inference (Xinference) is an open-source platform that simplifies running and integrating all kinds of AI models. With Xinference, you can run inference with any open-source LLM, embedding model, or multimodal model, in the cloud or on-premises, and build powerful AI applications. In short, it is a platform with a visual interface for installing and serving various models.

1.1 Prerequisites

  • Xinference uses GPU-accelerated inference, so the image must run on a machine with an NVIDIA GPU and CUDA installed.
  • Make sure CUDA is installed correctly on the machine; you can run nvidia-smi to check that it works.
  • The CUDA version inside the image is 12.4. To avoid unexpected problems, upgrade the host's CUDA version to 12.4 or later and the NVIDIA driver version to 550 or later.

Note: before installing, run the nvidia-smi command in cmd to see what GPU driver and CUDA versions the machine has. According to the official requirements, CUDA must be at least 12.4 and the NVIDIA driver at least 550. My machine reports CUDA Version: 12.2 and Driver Version: 537.34, yet Xinference still runs; I was unable to upgrade the driver and don't know why, so if anyone knows the cause, feel free to share.
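For reference, this is the check in question: the first command prints a summary table whose header line shows the Driver Version and the CUDA Version, and the second (supported by recent drivers) prints just the driver version and GPU name.

# summary table; read Driver Version / CUDA Version from the top line
nvidia-smi
# optional: print only the driver version and GPU name as CSV
nvidia-smi --query-gpu=driver_version,name --format=csv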

2. Installing via Docker

Since I'm on Windows and didn't want to install Xinference natively, I followed the official docs and installed it via Docker, which only takes a single command.
Note: when installing via Docker, the machine must have a GPU (graphics card); otherwise the installation fails.
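Before pulling the Xinference image, you can optionally confirm that Docker can see the GPU at all by running nvidia-smi inside a throwaway CUDA container (the nvidia/cuda tag below is only an example; any available CUDA base image tag should work):

# should print the same GPU table as on the host; an error here means GPU passthrough is not set up
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi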
Quick commands:

Command for Windows (mind the drive letter):

docker run  -d  --name xinference --gpus all  -v e:/xinference/models:/root/models  -v e:/xinference/.xinference:/root/.xinference -v e:/xinference/.cache/huggingface:/root/.cache/huggingface -e XINFERENCE_HOME=/root/models  -p 9997:9997  registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest  xinference-local -H 0.0.0.0

Command for Linux:

docker run  -d  --name xinference --gpus all  -v /opt/xinference/models:/root/models  -v /opt/xinference/.xinference:/root/.xinference -v /opt/xinference/.cache/huggingface:/root/.cache/huggingface -e XINFERENCE_HOME=/root/models  -p 9997:9997  registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest  xinference-local -H 0.0.0.0

Parameter explanation (important):

  • --name — sets the container name.
  • --gpus all — lets the container use the GPU.
  • -v e:/xinference/models:/root/models — by default the image contains no model files, and models are downloaded inside the container as you use it. If you want to use models you have already downloaded, mount a host directory into the container; in that case you need to specify a local volume when running the container and set an environment variable for Xinference. (Custom mount directory; choose either this or the default mounts below.)
  • -e XINFERENCE_HOME=/root/models — used together with the directory mount above: it sets the XINFERENCE_HOME environment variable to point at that directory inside the container, so all downloaded model files are stored in the host directory you specified. You don't have to worry about losing the files when the Docker container stops; the next time you run the container, the existing models can be used directly without re-downloading. (Required if you choose a custom directory.)
  • -v e:/xinference/.xinference:/root/.xinference -v e:/xinference/.cache/huggingface:/root/.cache/huggingface — if you downloaded models to the default paths on the host, note that the Xinference cache directory stores models as symlinks, so the directories containing the original files must also be mounted into the container. For example, if you use huggingface and modelscope as model hubs, mount their cache directories, which are usually <home_path>/.cache/huggingface and <home_path>/.cache/modelscope. Required: yes. (Default mount approach; choose either this or the custom mount above.)
  • -p 9997:9997 — port mapping.
  • registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest — the image to run. Official Xinference images can currently be pulled from two channels:
    1. the xprobe/xinference repository on Dockerhub;
    2. a copy of the Dockerhub image synced to the Alibaba Cloud public registry, for users who have trouble reaching Dockerhub.
    Currently available tags:
    nightly-main: built daily from the GitHub main branch; not guaranteed to be stable.
    v<release version>: built for each Xinference release; generally considered stable.
    latest: points to the most recent release at release time. CPU-only builds add a -cpu suffix, e.g. nightly-main-cpu.
  • -H 0.0.0.0 — must be specified; otherwise the Xinference service cannot be reached from outside the container.
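After running the command, a quick way to confirm the container came up properly is to check its status and follow its logs (this assumes the container name xinference used above):

# the container should show as "Up ..." with port 9997 published
docker ps --filter name=xinference
# follow the startup logs until the service reports it is listening on 0.0.0.0:9997
docker logs -f xinference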

2.1 Accessing the Web UI

Once the container is started with the command above, the web UI can be reached at localhost:9997, or via the machine's IP address, for example 192.168.1.152:9997.
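You can also check the service from the command line: Xinference exposes an OpenAI-compatible HTTP API on the same port, so a request like the one below should return the list of currently running models (empty right after startup; the exact response shape may vary between versions):

# list the models currently running on the Xinference server
curl http://localhost:9997/v1/models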

2.2 A Quick Look at the UI

Here is a brief introduction to the UI. The interface is simple and mostly self-explanatory: when you need a model, just find it in the corresponding model library and download it.

3. Deploying a Simple Model

With the platform up, let's deploy a simple chat model online, using qwen-chat as the example.

3.1 Search for qwen-chat and press Enter


3.2 Start the Deployment

Fill in the deployment parameters, then click the small rocket icon to deploy. Expect to wait here, because the model has to be pulled from a model hub; two hubs are used by default, huggingface and modelscope. Downloading the model requires a proxy; in my case the download came from huggingface by default, so the whole process went through the proxy, and the deployment speed is determined by the proxy speed.
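If a proxy is not an option, Xinference can also pull models from modelscope instead of huggingface. As far as I can tell from the docs this is controlled by the XINFERENCE_MODEL_SRC environment variable (treat the exact variable name as something to verify for your version); you would add it to the docker run command, for example:

docker run -d --name xinference --gpus all -e XINFERENCE_MODEL_SRC=modelscope -v e:/xinference/models:/root/models -v e:/xinference/.xinference:/root/.xinference -v e:/xinference/.cache/huggingface:/root/.cache/huggingface -e XINFERENCE_HOME=/root/models -p 9997:9997 registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest xinference-local -H 0.0.0.0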

3.3 Start Chatting

Once the deployment finishes, the model appears under "Running Models". Note that the number of models you can run depends on the number of GPUs: with only one GPU you can run only one model, and so on.
Click the chat button next to the model to jump to the chat page.
As you can see, the model performs quite well.
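The chat page is just a front end over the HTTP API, so once the model shows up under "Running Models" you can also talk to it from the command line. A minimal sketch, assuming the model UID is qwen-chat (copy the actual UID from the Running Models list; bash/WSL quoting shown, cmd needs the inner double quotes escaped):

# send a single chat message to the deployed model via the OpenAI-compatible endpoint
curl http://localhost:9997/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "qwen-chat", "messages": [{"role": "user", "content": "Hello, introduce yourself"}]}'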

4. Connecting a Local Model

As a model integration platform, Xinference can of course also serve models stored locally.

4.1 Download the Model

First, go to the appropriate model hub and download the model you want to connect. When downloading, make sure the model belongs to one of the model families officially supported by Xinference; otherwise it cannot be registered.
Model hubs:
https://huggingface.co/
https://www.modelscope.cn/
The model families currently supported by Xinference are listed below; most mainstream models are covered.

| MODEL NAME | ABILITIES | CONTEXT_LENGTH | DESCRIPTION |
| --- | --- | --- | --- |
| aquila2 | generate | 2048 | Aquila2 series models are the base language models |
| aquila2-chat | chat | 2048 | Aquila2-chat series models are the chat models |
| aquila2-chat-16k | chat | 16384 | AquilaChat2-16k series models are the long-text chat models |
| baichuan | generate | 4096 | Baichuan is an open-source Transformer based LLM that is trained on both Chinese and English data. |
| baichuan-2 | generate | 4096 | Baichuan2 is an open-source Transformer based LLM that is trained on both Chinese and English data. |
| baichuan-2-chat | chat | 4096 | Baichuan2-chat is a fine-tuned version of the Baichuan LLM, specializing in chatting. |
| baichuan-chat | chat | 4096 | Baichuan-chat is a fine-tuned version of the Baichuan LLM, specializing in chatting. |
| c4ai-command-r-v01 | chat | 131072 | C4AI Command-R(+) is a research release of a 35 and 104 billion parameter highly performant generative model. |
| chatglm | chat | 2048 | ChatGLM is an open-source General Language Model (GLM) based LLM trained on both Chinese and English data. |
| chatglm2 | chat | 8192 | ChatGLM2 is the second generation of ChatGLM, still open-source and trained on Chinese and English data. |
| chatglm2-32k | chat | 32768 | ChatGLM2-32k is a special version of ChatGLM2, with a context window of 32k tokens instead of 8k. |
| chatglm3 | chat, tools | 8192 | ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data. |
| chatglm3-128k | chat | 131072 | ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data. |
| chatglm3-32k | chat | 32768 | ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data. |
| code-llama | generate | 100000 | Code-Llama is an open-source LLM trained by fine-tuning LLaMA2 for generating and discussing code. |
| code-llama-instruct | chat | 100000 | Code-Llama-Instruct is an instruct-tuned version of the Code-Llama LLM. |
| code-llama-python | generate | 100000 | Code-Llama-Python is a fine-tuned version of the Code-Llama LLM, specializing in Python. |
| codegeex4 | chat | 131072 | the open-source version of the latest CodeGeeX4 model series |
| codeqwen1.5 | generate | 65536 | CodeQwen1.5 is the Code-Specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of data of codes. |
| codeqwen1.5-chat | chat | 65536 | CodeQwen1.5 is the Code-Specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of data of codes. |
| codeshell | generate | 8194 | CodeShell is a multi-language code LLM developed by the Knowledge Computing Lab of Peking University. |
| codeshell-chat | chat | 8194 | CodeShell is a multi-language code LLM developed by the Knowledge Computing Lab of Peking University. |
| codestral-v0.1 | generate | 32768 | Codestrall-22B-v0.1 is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash |
| cogvlm2 | chat, vision | 8192 | CogVLM2 have achieved good results in many lists compared to the previous generation of CogVLM open source models. Its excellent performance can compete with some non-open source models. |
| csg-wukong-chat-v0.1 | chat | 32768 | csg-wukong-1B is a 1 billion-parameter small language model(SLM) pretrained on 1T tokens. |
| deepseek | generate | 4096 | DeepSeek LLM, trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. |
| deepseek-chat | chat | 4096 | DeepSeek LLM is an advanced language model comprising 67 billion parameters. It has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. |
| deepseek-coder | generate | 16384 | Deepseek Coder is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese. |
| deepseek-coder-instruct | chat | 16384 | deepseek-coder-instruct is a model initialized from deepseek-coder-base and fine-tuned on 2B tokens of instruction data. |
| deepseek-vl-chat | chat, vision | 4096 | DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios. |
| falcon | generate | 2048 | Falcon is an open-source Transformer based LLM trained on the RefinedWeb dataset. |
| falcon-instruct | chat | 2048 | Falcon-instruct is a fine-tuned version of the Falcon LLM, specializing in chatting. |
| gemma-2-it | chat | 8192 | Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. |
| gemma-it | chat | 8192 | Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. |
| glaive-coder | chat | 16384 | A code model trained on a dataset of ~140k programming related problems and solutions generated from Glaive's synthetic data generation platform. |
| glm-4v | chat, vision | 8192 | GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
| glm4-chat | chat, tools | 131072 | GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
| glm4-chat-1m | chat, tools | 1048576 | GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
| gorilla-openfunctions-v1 | chat | 4096 | OpenFunctions is designed to extend Large Language Model (LLM) Chat Completion feature to formulate executable APIs call given natural language instructions and API context. |
| gorilla-openfunctions-v2 | chat | 4096 | OpenFunctions is designed to extend Large Language Model (LLM) Chat Completion feature to formulate executable APIs call given natural language instructions and API context. |
| gpt-2 | generate | 1024 | GPT-2 is a Transformer-based LLM that is trained on WebTest, a 40 GB dataset of Reddit posts with 3+ upvotes. |
| internlm-20b | generate | 16384 | Pre-trained on over 2.3T Tokens containing high-quality English, Chinese, and code data. |
| internlm-7b | generate | 8192 | InternLM is a Transformer-based LLM that is trained on both Chinese and English data, focusing on practical scenarios. |
| internlm-chat-20b | chat | 16384 | Pre-trained on over 2.3T Tokens containing high-quality English, Chinese, and code data. The Chat version has undergone SFT and RLHF training. |
| internlm-chat-7b | chat | 4096 | Internlm-chat is a fine-tuned version of the Internlm LLM, specializing in chatting. |
| internlm2-chat | chat | 32768 | The second generation of the InternLM model, InternLM2. |
| internlm2.5-chat | chat | 32768 | InternLM2.5 series of the InternLM model. |
| internlm2.5-chat-1m | chat | 262144 | InternLM2.5 series of the InternLM model supports 1M long-context |
| internvl-chat | chat, vision | 32768 | InternVL 1.5 is an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. |
| llama-2 | generate | 4096 | Llama-2 is the second generation of Llama, open-source and trained on a larger amount of data. |
| llama-2-chat | chat | 4096 | Llama-2-Chat is a fine-tuned version of the Llama-2 LLM, specializing in chatting. |
| llama-3 | generate | 8192 | Llama 3 is an auto-regressive language model that uses an optimized transformer architecture |
| llama-3-instruct | chat | 8192 | The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks… |
| llama-3.1 | generate | 131072 | Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture |
| llama-3.1-instruct | chat | 131072 | The Llama 3.1 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks… |
| minicpm-2b-dpo-bf16 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
| minicpm-2b-dpo-fp16 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
| minicpm-2b-dpo-fp32 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
| minicpm-2b-sft-bf16 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
| minicpm-2b-sft-fp32 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
| minicpm-llama3-v-2_5 | chat, vision | 2048 | MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. |
| mistral-instruct-v0.1 | chat | 8192 | Mistral-7B-Instruct is a fine-tuned version of the Mistral-7B LLM on public datasets, specializing in chatting. |
| mistral-instruct-v0.2 | chat | 8192 | The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1. |
| mistral-instruct-v0.3 | chat | 32768 | The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1. |
| mistral-large-instruct | chat | 131072 | Mistral-Large-Instruct-2407 is an advanced dense Large Language Model (LLM) of 123B parameters with state-of-the-art reasoning, knowledge and coding capabilities. |
| mistral-nemo-instruct | chat | 1024000 | The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-Nemo-Base-2407 |
| mistral-v0.1 | generate | 8192 | Mistral-7B is a unmoderated Transformer based LLM claiming to outperform Llama2 on all benchmarks. |
| mixtral-8x22b-instruct-v0.1 | chat | 65536 | The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of the Mixtral-8x22B-v0.1, specializing in chatting. |
| mixtral-instruct-v0.1 | chat | 32768 | Mistral-8x7B-Instruct is a fine-tuned version of the Mistral-8x7B LLM, specializing in chatting. |
| mixtral-v0.1 | generate | 32768 | The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. |
| omnilmm | chat, vision | 2048 | OmniLMM is a family of open-source large multimodal models (LMMs) adept at vision & language modeling. |
| openbuddy | chat | 2048 | OpenBuddy is a powerful open multilingual chatbot model aimed at global users. |
| openhermes-2.5 | chat | 8192 | Openhermes 2.5 is a fine-tuned version of Mistral-7B-v0.1 on primarily GPT-4 generated data. |
| opt | generate | 2048 | Opt is an open-source, decoder-only, Transformer based LLM that was designed to replicate GPT-3. |
| orca | chat | 2048 | Orca is an LLM trained by fine-tuning LLaMA on explanation traces obtained from GPT-4. |
| orion-chat | chat | 4096 | Orion-14B series models are open-source multilingual large language models trained from scratch by OrionStarAI. |
| orion-chat-rag | chat | 4096 | Orion-14B series models are open-source multilingual large language models trained from scratch by OrionStarAI. |
| phi-2 | generate | 2048 | Phi-2 is a 2.7B Transformer based LLM used for research on model safety, trained with data similar to Phi-1.5 but augmented with synthetic texts and curated websites. |
| phi-3-mini-128k-instruct | chat | 128000 | The Phi-3-Mini-128K-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. |
| phi-3-mini-4k-instruct | chat | 4096 | The Phi-3-Mini-4k-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. |
| platypus2-70b-instruct | generate | 4096 | Platypus-70B-instruct is a merge of garage-bAInd/Platypus2-70B and upstage/Llama-2-70b-instruct-v2. |
| qwen-chat | chat, tools | 32768 | Qwen-chat is a fine-tuned version of the Qwen LLM trained with alignment techniques, specializing in chatting. |
| qwen-vl-chat | chat, vision | 4096 | Qwen-VL-Chat supports more flexible interaction, such as multiple image inputs, multi-round question answering, and creative capabilities. |
| qwen1.5-chat | chat, tools | 32768 | Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. |
| qwen1.5-moe-chat | chat, tools | 32768 | Qwen1.5-MoE is a transformer-based MoE decoder-only language model pretrained on a large amount of data. |
| qwen2-instruct | chat, tools | 32768 | Qwen2 is the new series of Qwen large language models |
| qwen2-moe-instruct | chat, tools | 32768 | Qwen2 is the new series of Qwen large language models. |
| seallm_v2 | generate | 8192 | We introduce SeaLLM-7B-v2, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages |
| seallm_v2.5 | generate | 8192 | We introduce SeaLLM-7B-v2.5, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages |
| skywork | generate | 4096 | Skywork is a series of large models developed by the Kunlun Group · Skywork team. |
| skywork-math | generate | 4096 | Skywork is a series of large models developed by the Kunlun Group · Skywork team. |
| starchat-beta | chat | 8192 | Starchat-beta is a fine-tuned version of the Starcoderplus LLM, specializing in coding assistance. |
| starcoder | generate | 8192 | Starcoder is an open-source Transformer based LLM that is trained on permissively licensed data from GitHub. |
| starcoderplus | generate | 8192 | Starcoderplus is an open-source LLM trained by fine-tuning Starcoder on RedefinedWeb and StarCoderData datasets. |
| starling-lm | chat | 4096 | We introduce Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model harnesses the power of our new GPT-4 labeled ranking dataset |
| telechat | chat | 8192 | The TeleChat is a large language model developed and trained by China Telecom Artificial Intelligence Technology Co., LTD. The 7B model base is trained with 1.5 trillion Tokens and 3 trillion Tokens and Chinese high-quality corpus. |
| tiny-llama | generate | 2048 | The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. |
| vicuna-v1.3 | chat | 2048 | Vicuna is an open-source LLM trained by fine-tuning LLaMA on data collected from ShareGPT. |
| vicuna-v1.5 | chat | 4096 | Vicuna is an open-source LLM trained by fine-tuning LLaMA on data collected from ShareGPT. |
| vicuna-v1.5-16k | chat | 16384 | Vicuna-v1.5-16k is a special version of Vicuna-v1.5, with a context window of 16k tokens instead of 4k. |
| wizardcoder-python-v1.0 | chat | 100000 | |
| wizardlm-v1.0 | chat | 2048 | WizardLM is an open-source LLM trained by fine-tuning LLaMA with Evol-Instruct. |
| wizardmath-v1.0 | chat | 2048 | WizardMath is an open-source LLM trained by fine-tuning Llama2 with Evol-Instruct, specializing in math. |
| xverse | generate | 2048 | XVERSE is a multilingual large language model, independently developed by Shenzhen Yuanxiang Technology. |
| xverse-chat | chat | 2048 | XVERSEB-Chat is the aligned version of model XVERSE. |
| yi | generate | 4096 | The Yi series models are large language models trained from scratch by developers at 01.AI. |
| yi-1.5 | generate | 4096 | Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
| yi-1.5-chat | chat | 4096 | Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
| yi-1.5-chat-16k | chat | 16384 | Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
| yi-200k | generate | 262144 | The Yi series models are large language models trained from scratch by developers at 01.AI. |
| yi-chat | chat | 4096 | The Yi series models are large language models trained from scratch by developers at 01.AI. |
| yi-vl-chat | chat, vision | 4096 | Yi Vision Language (Yi-VL) model is the open-source, multimodal version of the Yi Large Language Model (LLM) series, enabling content comprehension, recognition, and multi-round conversations about images. |
| zephyr-7b-alpha | chat | 8192 | Zephyr-7B-α is the first model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1. |
| zephyr-7b-beta | chat | 8192 | Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 |

4.2 Register the Model

After downloading the model, place it in the directory mounted into the container. A note on the location: if this is the first time you register a local model, just put it in the mount directory you used when starting Xinference, for example e:/xinference/models. After registration, the corresponding model repository is created automatically and the model files are moved into it; from then on you can put new models directly into that repository directory.
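As a concrete sketch of the host/container mapping (the folder name qwen1_5-7b-chat is only an example): with the volumes from the docker run command above, a folder created on the Windows host shows up under /root/models inside the container, and that container-side path is what you enter as the model location when registering.

mkdir e:\xinference\models\qwen1_5-7b-chat

Copy config.json, the tokenizer files, and the *.safetensors weights into that folder; inside the container the same files are then visible at /root/models/qwen1_5-7b-chat.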

Register via the web page:
For the model's reference information, refer to the model family list above.
After confirming that the information is correct, click Register Model; if no error is reported, the registration succeeded.

4.3 Launch the Model

Click the registered model, fill in the relevant fields, and start it; fill in the fields the same way as in the simple-model deployment section above.
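For reference, a registered model can reportedly also be launched from the command line inside the container instead of through the UI. The flags below are assumptions about the xinference CLI (and my-local-qwen is a hypothetical registered model name), so verify them with xinference launch --help before relying on them:

# open a shell inside the running container
docker exec -it xinference bash
# launch the registered model (assumed flags; check: xinference launch --help)
xinference launch --model-name my-local-qwen --model-format pytorch --size-in-billions 7 --endpoint http://127.0.0.1:9997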

4.4 Start Chatting

Once it is started, the model again appears under "Running Models"; click the chat button next to it to start a conversation.


5. Summary

That covers the steps for deploying the Xinference platform on Windows 11 via Docker.
The driver/CUDA upgrade problem mentioned above is still unresolved; if anyone knows how to fix it, feel free to share.
