[Tested] Installing the Xinference Platform on Windows 11 via Docker

Reference link: official website

1. What is Xinference?

Xorbits Inference (Xinference) is an open-source platform that simplifies running and integrating all kinds of AI models. With Xinference you can run inference with any open-source LLM, embedding model, or multimodal model, in the cloud or on-premises, and build powerful AI applications on top of it. In short, it is a platform with a visual interface for installing and serving models.

1.1 Prerequisites

  • Xinference uses the GPU to accelerate inference, so the image must run on a machine that has an NVIDIA GPU with CUDA installed.
  • Make sure CUDA is installed correctly on the machine. You can run nvidia-smi to check that it works.
  • The CUDA version inside the image is 12.4. To avoid unexpected problems, upgrade the host's CUDA version to 12.4 or above and the NVIDIA driver version to 550 or above.

Note: before installing, run nvidia-smi in cmd to see your local GPU, CUDA, and driver versions. According to the official docs, CUDA and the NVIDIA driver must be at least 12.4 and 550 respectively. My machine reports CUDA Version: 12.2 and Driver Version: 537.34, yet Xinference still runs. I have not managed to upgrade the driver and don't know why; if anyone knows the fix, feel free to share.
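As a quick sanity check, the commands below (a minimal sketch; the nvidia/cuda image tag is only an example) verify that the driver is visible on the host and that Docker can pass the GPU through to a container:

# on the host: shows the driver version and the CUDA version supported by the driver
nvidia-smi
# through Docker: confirms --gpus passthrough works (any CUDA base image will do)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi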

2. Installing via Docker

Since I am on Windows and did not want to install Xinference natively, I followed the official docs and installed it via Docker. It is simple: a single command is enough.
Note: for the Docker-based install, the machine must have a GPU (graphics card); otherwise the installation will fail.
Quick commands:

Command for Windows (mind the drive letters):

docker run  -d  --name xinference --gpus all  -v e:/xinference/models:/root/models  -v e:/xinference/.xinference:/root/.xinference -v e:/xinference/.cache/huggingface:/root/.cache/huggingface -e XINFERENCE_HOME=/root/models  -p 9997:9997  registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest  xinference-local -H 0.0.0.0

Command for Linux:

docker run  -d  --name xinference --gpus all  -v /opt/xinference/models:/root/models  -v /opt/xinference/.xinference:/root/.xinference -v /opt/xinference/.cache/huggingface:/root/.cache/huggingface -e XINFERENCE_HOME=/root/models  -p 9997:9997  registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest  xinference-local -H 0.0.0.0
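After running either command, you can confirm that the container is up and watch its startup output (the container name matches the --name parameter above):

# list the container and follow its log; press Ctrl+C to stop following
docker ps --filter name=xinference
docker logs -f xinference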

Parameter explanation (important):

| Parameter | Explanation | Required |
| --- | --- | --- |
| --name | Sets the container name. | |
| --gpus | Runs the container with GPU access. | |
| -v e:/xinference/models:/root/models | By default the image contains no model files; models are downloaded inside the container as you use it. If you want to reuse models you have already downloaded, mount a host directory into the container. In that case you need to specify the local volume when running the container and configure the corresponding environment variable for Xinference. | Custom mount directory; choose either this or the default mount below |
| -e XINFERENCE_HOME=/root/models | Sets the XINFERENCE_HOME environment variable to point at the mounted directory inside the container. All downloaded model files are then stored in the host directory you specified, so they are not lost when the Docker container stops, and the next time you run the container you can reuse the existing models without downloading them again. | Required if you choose the custom mount directory |
| -v e:/xinference/.xinference:/root/.xinference -v e:/xinference/.cache/huggingface:/root/.cache/huggingface | If the models were downloaded to the default paths on the host, note that the Xinference cache directory stores models via symbolic links, so the directories holding the original files must also be mounted into the container. For example, if you use huggingface and modelscope as model hubs, mount both of their cache directories, which are usually <home_path>/.cache/huggingface and <home_path>/.cache/modelscope. | Yes (default mount method; choose either this or the custom mount above) |
| -p 9997:9997 | Port mapping. | |
| registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest | The official Xinference image, currently available from two sources: 1) the xprobe/xinference repository on Docker Hub; 2) a copy of the Docker Hub image synced to the Alibaba Cloud public registry, for users who have trouble reaching Docker Hub. Available tags: nightly-main (rebuilt daily from the GitHub main branch, not guaranteed to be stable), v<release version> (built for each Xinference release, generally considered stable), and latest (points to the most recent release). CPU-only variants add a -cpu suffix, e.g. nightly-main-cpu. | |
| -H 0.0.0.0 | Must be specified; otherwise the Xinference service cannot be reached from outside the container. | Yes |
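Because the model and cache directories live on the host, stopping or recreating the container should not lose any downloaded models. A small sketch of the usual container lifecycle:

# stop and restart the same container; files under e:/xinference are kept
docker stop xinference
docker start xinference
# to change run parameters, remove the container first and run docker run again
docker rm -f xinference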

2.1 Accessing the Web UI

Once the container has been started with the command above, the UI is available at localhost:9997, or via the machine's IP address, for example 192.168.1.152:9997.
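Besides the browser, you can also probe the service from the command line. Xinference exposes an OpenAI-compatible HTTP API on the same port, so a request like the following (assuming the defaults above) should return the currently launched models:

curl http://localhost:9997/v1/models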

2.2 Overview of the Web UI

A brief tour of the UI: the interface is simple and largely self-explanatory. Whenever you need a model, just go to the corresponding model library and download it.

3. Deploying a Simple Model

With the platform running, let's deploy a simple chat model online, using qwen-chat as an example.

3.1 Search for qwen-chat and press Enter


3.2 Start the Deployment

The launch parameters are described on the deployment page itself. After filling them in, click the rocket icon to deploy. Expect to wait at this point, because the model has to be pulled from a model hub; two hubs are supported by default, huggingface and modelscope. Downloading the model may require a proxy. In my case the download came from huggingface by default, so the entire download went through the proxy, and deployment speed is therefore determined by the proxy speed.
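If you prefer the command line over the web UI, the same deployment can be started with the xinference CLI inside the container. This is only a sketch: the flag names below follow the official CLI as I recall it and may differ between versions, and the size/format values must match the qwen-chat variant you would pick in the UI.

# launch qwen-chat (7B, pytorch format) via the CLI
docker exec xinference xinference launch --model-name qwen-chat --model-format pytorch --size-in-billions 7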

3.3 Start Chatting

Once deployment completes, the model shows up under "Running Models". Note that the number of models you can run depends on the number of GPUs: with a single GPU you can run only one model, and so on.
Click the chat button next to the model to jump to the chat page.
As you can see, the model's responses are quite decent.
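Outside the web UI, a running chat model can also be called through the OpenAI-compatible endpoint. A hedged example (run it from a shell such as WSL or Git Bash; the model field assumes the model UID defaults to the model name):

curl http://localhost:9997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-chat", "messages": [{"role": "user", "content": "Hello, please introduce yourself."}]}'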

4. Connecting a Local Model

As a model integration platform, Xinference can of course also serve models stored locally.

4.1 Download the Model

First, go to a model hub and download the model you want to connect. When downloading, be sure to check the model families officially supported by Xinference; unsupported models cannot be registered.
Model hubs:
https://huggingface.co/
https://www.modelscope.cn/
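For example, a model can be pulled from Hugging Face straight into the mounted models directory with huggingface-cli (a sketch only: the repo id and target path are placeholders, and the tool comes from pip install -U huggingface_hub):

huggingface-cli download Qwen/Qwen-7B-Chat --local-dir e:/xinference/models/Qwen-7B-Chat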
The model families currently supported by Xinference are listed below; most mainstream models are covered.

| MODEL NAME | ABILITIES | CONTEXT_LENGTH | DESCRIPTION |
| --- | --- | --- | --- |
| aquila2 | generate | 2048 | Aquila2 series models are the base language models |
| aquila2-chat | chat | 2048 | Aquila2-chat series models are the chat models |
| aquila2-chat-16k | chat | 16384 | AquilaChat2-16k series models are the long-text chat models |
| baichuan | generate | 4096 | Baichuan is an open-source Transformer based LLM that is trained on both Chinese and English data. |
| baichuan-2 | generate | 4096 | Baichuan2 is an open-source Transformer based LLM that is trained on both Chinese and English data. |
| baichuan-2-chat | chat | 4096 | Baichuan2-chat is a fine-tuned version of the Baichuan LLM, specializing in chatting. |
| baichuan-chat | chat | 4096 | Baichuan-chat is a fine-tuned version of the Baichuan LLM, specializing in chatting. |
| c4ai-command-r-v01 | chat | 131072 | C4AI Command-R(+) is a research release of a 35 and 104 billion parameter highly performant generative model. |
| chatglm | chat | 2048 | ChatGLM is an open-source General Language Model (GLM) based LLM trained on both Chinese and English data. |
| chatglm2 | chat | 8192 | ChatGLM2 is the second generation of ChatGLM, still open-source and trained on Chinese and English data. |
| chatglm2-32k | chat | 32768 | ChatGLM2-32k is a special version of ChatGLM2, with a context window of 32k tokens instead of 8k. |
| chatglm3 | chat, tools | 8192 | ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data. |
| chatglm3-128k | chat | 131072 | ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data. |
| chatglm3-32k | chat | 32768 | ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data. |
| code-llama | generate | 100000 | Code-Llama is an open-source LLM trained by fine-tuning LLaMA2 for generating and discussing code. |
| code-llama-instruct | chat | 100000 | Code-Llama-Instruct is an instruct-tuned version of the Code-Llama LLM. |
| code-llama-python | generate | 100000 | Code-Llama-Python is a fine-tuned version of the Code-Llama LLM, specializing in Python. |
| codegeex4 | chat | 131072 | the open-source version of the latest CodeGeeX4 model series |
| codeqwen1.5 | generate | 65536 | CodeQwen1.5 is the Code-Specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of data of codes. |
| codeqwen1.5-chat | chat | 65536 | CodeQwen1.5 is the Code-Specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of data of codes. |
| codeshell | generate | 8194 | CodeShell is a multi-language code LLM developed by the Knowledge Computing Lab of Peking University. |
| codeshell-chat | chat | 8194 | CodeShell is a multi-language code LLM developed by the Knowledge Computing Lab of Peking University. |
| codestral-v0.1 | generate | 32768 | Codestrall-22B-v0.1 is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash |
| cogvlm2 | chat, vision | 8192 | CogVLM2 have achieved good results in many lists compared to the previous generation of CogVLM open source models. Its excellent performance can compete with some non-open source models. |
| csg-wukong-chat-v0.1 | chat | 32768 | csg-wukong-1B is a 1 billion-parameter small language model(SLM) pretrained on 1T tokens. |
| deepseek | generate | 4096 | DeepSeek LLM, trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. |
| deepseek-chat | chat | 4096 | DeepSeek LLM is an advanced language model comprising 67 billion parameters. It has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. |
| deepseek-coder | generate | 16384 | Deepseek Coder is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese. |
| deepseek-coder-instruct | chat | 16384 | deepseek-coder-instruct is a model initialized from deepseek-coder-base and fine-tuned on 2B tokens of instruction data. |
| deepseek-vl-chat | chat, vision | 4096 | DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios. |
| falcon | generate | 2048 | Falcon is an open-source Transformer based LLM trained on the RefinedWeb dataset. |
| falcon-instruct | chat | 2048 | Falcon-instruct is a fine-tuned version of the Falcon LLM, specializing in chatting. |
| gemma-2-it | chat | 8192 | Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. |
| gemma-it | chat | 8192 | Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. |
| glaive-coder | chat | 16384 | A code model trained on a dataset of ~140k programming related problems and solutions generated from Glaive's synthetic data generation platform. |
| glm-4v | chat, vision | 8192 | GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
| glm4-chat | chat, tools | 131072 | GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
| glm4-chat-1m | chat, tools | 1048576 | GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
| gorilla-openfunctions-v1 | chat | 4096 | OpenFunctions is designed to extend Large Language Model (LLM) Chat Completion feature to formulate executable APIs call given natural language instructions and API context. |
| gorilla-openfunctions-v2 | chat | 4096 | OpenFunctions is designed to extend Large Language Model (LLM) Chat Completion feature to formulate executable APIs call given natural language instructions and API context. |
| gpt-2 | generate | 1024 | GPT-2 is a Transformer-based LLM that is trained on WebTest, a 40 GB dataset of Reddit posts with 3+ upvotes. |
| internlm-20b | generate | 16384 | Pre-trained on over 2.3T Tokens containing high-quality English, Chinese, and code data. |
| internlm-7b | generate | 8192 | InternLM is a Transformer-based LLM that is trained on both Chinese and English data, focusing on practical scenarios. |
| internlm-chat-20b | chat | 16384 | Pre-trained on over 2.3T Tokens containing high-quality English, Chinese, and code data. The Chat version has undergone SFT and RLHF training. |
| internlm-chat-7b | chat | 4096 | Internlm-chat is a fine-tuned version of the Internlm LLM, specializing in chatting. |
| internlm2-chat | chat | 32768 | The second generation of the InternLM model, InternLM2. |
| internlm2.5-chat | chat | 32768 | InternLM2.5 series of the InternLM model. |
| internlm2.5-chat-1m | chat | 262144 | InternLM2.5 series of the InternLM model supports 1M long-context |
| internvl-chat | chat, vision | 32768 | InternVL 1.5 is an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. |
| llama-2 | generate | 4096 | Llama-2 is the second generation of Llama, open-source and trained on a larger amount of data. |
| llama-2-chat | chat | 4096 | Llama-2-Chat is a fine-tuned version of the Llama-2 LLM, specializing in chatting. |
| llama-3 | generate | 8192 | Llama 3 is an auto-regressive language model that uses an optimized transformer architecture |
| llama-3-instruct | chat | 8192 | The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks… |
| llama-3.1 | generate | 131072 | Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture |
| llama-3.1-instruct | chat | 131072 | The Llama 3.1 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks… |
| minicpm-2b-dpo-bf16 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
| minicpm-2b-dpo-fp16 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
| minicpm-2b-dpo-fp32 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
| minicpm-2b-sft-bf16 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
| minicpm-2b-sft-fp32 | chat | 4096 | MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
| minicpm-llama3-v-2_5 | chat, vision | 2048 | MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. |
| mistral-instruct-v0.1 | chat | 8192 | Mistral-7B-Instruct is a fine-tuned version of the Mistral-7B LLM on public datasets, specializing in chatting. |
| mistral-instruct-v0.2 | chat | 8192 | The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1. |
| mistral-instruct-v0.3 | chat | 32768 | The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1. |
| mistral-large-instruct | chat | 131072 | Mistral-Large-Instruct-2407 is an advanced dense Large Language Model (LLM) of 123B parameters with state-of-the-art reasoning, knowledge and coding capabilities. |
| mistral-nemo-instruct | chat | 1024000 | The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-Nemo-Base-2407 |
| mistral-v0.1 | generate | 8192 | Mistral-7B is a unmoderated Transformer based LLM claiming to outperform Llama2 on all benchmarks. |
| mixtral-8x22b-instruct-v0.1 | chat | 65536 | The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of the Mixtral-8x22B-v0.1, specializing in chatting. |
| mixtral-instruct-v0.1 | chat | 32768 | Mistral-8x7B-Instruct is a fine-tuned version of the Mistral-8x7B LLM, specializing in chatting. |
| mixtral-v0.1 | generate | 32768 | The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. |
| omnilmm | chat, vision | 2048 | OmniLMM is a family of open-source large multimodal models (LMMs) adept at vision & language modeling. |
| openbuddy | chat | 2048 | OpenBuddy is a powerful open multilingual chatbot model aimed at global users. |
| openhermes-2.5 | chat | 8192 | Openhermes 2.5 is a fine-tuned version of Mistral-7B-v0.1 on primarily GPT-4 generated data. |
| opt | generate | 2048 | Opt is an open-source, decoder-only, Transformer based LLM that was designed to replicate GPT-3. |
| orca | chat | 2048 | Orca is an LLM trained by fine-tuning LLaMA on explanation traces obtained from GPT-4. |
| orion-chat | chat | 4096 | Orion-14B series models are open-source multilingual large language models trained from scratch by OrionStarAI. |
| orion-chat-rag | chat | 4096 | Orion-14B series models are open-source multilingual large language models trained from scratch by OrionStarAI. |
| phi-2 | generate | 2048 | Phi-2 is a 2.7B Transformer based LLM used for research on model safety, trained with data similar to Phi-1.5 but augmented with synthetic texts and curated websites. |
| phi-3-mini-128k-instruct | chat | 128000 | The Phi-3-Mini-128K-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. |
| phi-3-mini-4k-instruct | chat | 4096 | The Phi-3-Mini-4k-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. |
| platypus2-70b-instruct | generate | 4096 | Platypus-70B-instruct is a merge of garage-bAInd/Platypus2-70B and upstage/Llama-2-70b-instruct-v2. |
| qwen-chat | chat, tools | 32768 | Qwen-chat is a fine-tuned version of the Qwen LLM trained with alignment techniques, specializing in chatting. |
| qwen-vl-chat | chat, vision | 4096 | Qwen-VL-Chat supports more flexible interaction, such as multiple image inputs, multi-round question answering, and creative capabilities. |
| qwen1.5-chat | chat, tools | 32768 | Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. |
| qwen1.5-moe-chat | chat, tools | 32768 | Qwen1.5-MoE is a transformer-based MoE decoder-only language model pretrained on a large amount of data. |
| qwen2-instruct | chat, tools | 32768 | Qwen2 is the new series of Qwen large language models |
| qwen2-moe-instruct | chat, tools | 32768 | Qwen2 is the new series of Qwen large language models. |
| seallm_v2 | generate | 8192 | We introduce SeaLLM-7B-v2, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages |
| seallm_v2.5 | generate | 8192 | We introduce SeaLLM-7B-v2.5, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages |
| skywork | generate | 4096 | Skywork is a series of large models developed by the Kunlun Group · Skywork team. |
| skywork-math | generate | 4096 | Skywork is a series of large models developed by the Kunlun Group · Skywork team. |
| starchat-beta | chat | 8192 | Starchat-beta is a fine-tuned version of the Starcoderplus LLM, specializing in coding assistance. |
| starcoder | generate | 8192 | Starcoder is an open-source Transformer based LLM that is trained on permissively licensed data from GitHub. |
| starcoderplus | generate | 8192 | Starcoderplus is an open-source LLM trained by fine-tuning Starcoder on RedefinedWeb and StarCoderData datasets. |
| starling-lm | chat | 4096 | We introduce Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model harnesses the power of our new GPT-4 labeled ranking dataset |
| telechat | chat | 8192 | The TeleChat is a large language model developed and trained by China Telecom Artificial Intelligence Technology Co., LTD. The 7B model base is trained with 1.5 trillion Tokens and 3 trillion Tokens and Chinese high-quality corpus. |
| tiny-llama | generate | 2048 | The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. |
| vicuna-v1.3 | chat | 2048 | Vicuna is an open-source LLM trained by fine-tuning LLaMA on data collected from ShareGPT. |
| vicuna-v1.5 | chat | 4096 | Vicuna is an open-source LLM trained by fine-tuning LLaMA on data collected from ShareGPT. |
| vicuna-v1.5-16k | chat | 16384 | Vicuna-v1.5-16k is a special version of Vicuna-v1.5, with a context window of 16k tokens instead of 4k. |
| wizardcoder-python-v1.0 | chat | 100000 | |
| wizardlm-v1.0 | chat | 2048 | WizardLM is an open-source LLM trained by fine-tuning LLaMA with Evol-Instruct. |
| wizardmath-v1.0 | chat | 2048 | WizardMath is an open-source LLM trained by fine-tuning Llama2 with Evol-Instruct, specializing in math. |
| xverse | generate | 2048 | XVERSE is a multilingual large language model, independently developed by Shenzhen Yuanxiang Technology. |
| xverse-chat | chat | 2048 | XVERSEB-Chat is the aligned version of model XVERSE. |
| yi | generate | 4096 | The Yi series models are large language models trained from scratch by developers at 01.AI. |
| yi-1.5 | generate | 4096 | Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
| yi-1.5-chat | chat | 4096 | Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
| yi-1.5-chat-16k | chat | 16384 | Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
| yi-200k | generate | 262144 | The Yi series models are large language models trained from scratch by developers at 01.AI. |
| yi-chat | chat | 4096 | The Yi series models are large language models trained from scratch by developers at 01.AI. |
| yi-vl-chat | chat, vision | 4096 | Yi Vision Language (Yi-VL) model is the open-source, multimodal version of the Yi Large Language Model (LLM) series, enabling content comprehension, recognition, and multi-round conversations about images. |
| zephyr-7b-alpha | chat | 8192 | Zephyr-7B-α is the first model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1. |
| zephyr-7b-beta | chat | 8192 | Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 |

4.2 Register the Model

Once the model is downloaded, place it under the directory mounted into the container. Note: if this is the first time you register a local model, simply put it in the mount directory you used when starting Xinference, for example e:/xinference/models. After registration, Xinference automatically creates the corresponding model directory and moves the model there; from then on you can place models directly in that directory.
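Since e:/xinference/models is mounted as /root/models inside the container, a quick way to confirm the model files are actually visible to Xinference is to list that directory from inside the container:

docker exec xinference ls /root/models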

Register through the web UI.
For the model's reference information, refer to the model family table above.

After confirming the information is correct, click Register Model; if no error is reported, the registration succeeded.
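You can also double-check the registration from the command line. This assumes the xinference CLI's registrations subcommand, whose flags may vary slightly between versions:

# list registered LLM model families; the newly registered model should appear here
docker exec xinference xinference registrations --model-type LLM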

4.3 Launch the Model

Click the registered model, fill in the launch information, and start it; for how to fill in the fields, refer to the simple-model deployment in section 3.
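The launch can likewise be done from the CLI instead of the UI. A sketch only: my-local-model is a placeholder for whatever name you registered in step 4.2, and the format/size values must match your model files.

docker exec xinference xinference launch --model-name my-local-model --model-format pytorch --size-in-billions 7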

4.4 Start Chatting

Once the model has started, it again appears under "Running Models"; click the chat button next to it to start a conversation.


5. Summary

That covers the steps for deploying the Xinference platform on Windows 11 via Docker.
The driver upgrade issue mentioned above is still unsolved; if anyone knows how to fix it, feel free to share.
Related post: [Tested] How to connect MaxKB to Xinference models
