【亲测】Windows 11通过Docker安装Xinference 平台

灬囖

已于 2024-08-12 11:39:44 修改

阅读量1.1w

点赞数 26

分类专栏： MaxKB 文章标签： docker 容器 xinference 大模型

于 2024-08-11 23:51:02 首次发布

本文链接：https://blog.csdn.net/hao65103940/article/details/141114358

版权

MaxKB 专栏收录该内容

14 篇文章

订阅专栏

参考连接：官方网站

一、Xinference 是什么？

Xorbits Inference (Xinference) 是一个开源平台，用于简化各种 AI 模型的运行和集成。借助 Xinference，您可以使用任何开源 LLM、嵌入模型和多模态模型在云端或本地环境中运行推理，并创建强大的 AI 应用。简单来讲，就是一个可以安装各种模型可视化的安装平台。

1.1 准备工作

Xinference 使用 GPU 加速推理，该镜像需要在有 GPU 显卡并且安装 CUDA 的机器上运行。
保证 CUDA 在机器上正确安装。可以使用 nvidia-smi 检查是否正确运行。
镜像中的 CUDA 版本为 12.4 。为了不出现预期之外的问题，请将宿主机的 CUDA 版本和 NVIDIA Driver 版本分别升级到 12.4 和 550 以上。

注意： 在安装之前可以先cmd执行一下 nvidia-smi 命令，看看本机的gpu版本多少的。按照官网的要求，CUDA和NVIDIA Driver的版本必须得12.4和550以上。我本地是 CUDA Version: 12.2 、Driver Version: 537.34 的，但也能运行起来，我升级升不上去，不知道为啥，如果有知道的小伙伴，欢迎交流。

二、通过Docker安装

因为我是Windows操作系统，也不想直接通过本地安装的方式安装，就直接参考官网，通过Docker的方式安装，很简单，一条命令即可。
注意：通过docker方式安装，电脑必须要有GPU（显卡），否则安装失败。
快捷命令：

Windows 执行命令：（注意盘符问题）

docker run  -d  --name xinference --gpus all  -v e:/xinference/models:/root/models  -v e:/xinference/.xinference:/root/.xinference -v e:/xinference/.cache/huggingface:/root/.cache/huggingface -e XINFERENCE_HOME=/root/models  -p 9997:9997  registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest  xinference-local -H 0.0.0.0

Linux执行命令：

docker run  -d  --name xinference --gpus all  -v /opt/xinference/models:/root/models  -v /opt/xinference/.xinference:/root/.xinference -v /opt/xinference/.cache/huggingface:/root/.cache/huggingface -e XINFERENCE_HOME=/root/models  -p 9997:9997  registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest  xinference-local -H 0.0.0.0

参数解释（重点）：

参数名	解释	是否必填
–name	设置容器名称	是
–gpus	使用gpu	是
-v e:/xinference/models:/root/models	默认情况下，镜像中不包含任何模型文件，使用过程中会在容器内下载模型。如果需要使用已经下载好的模型，需要将宿主机的目录挂载到容器内。这种情况下，需要在运行容器时指定本地卷，并且为 Xinference 配置环境变量。	是（自定义挂载目录，与下面默认挂载方式二选一）
-e XINFERENCE_HOME=/root/models	将主机上指定的目录挂载到容器中，并设置 `XINFERENCE_HOME` 环境变量指向容器内的该目录。这样，所有下载的模型文件将存储在您在主机上指定的目录中。您无需担心在 Docker 容器停止时丢失这些文件，下次运行容器时，您可以直接使用现有的模型，无需重复下载。	是（如果选择自定义目录，则需要指定环境变量）
-v e:/xinference/.xinference:/root/.xinference -v e:/xinference/.cache/huggingface:/root/.cache/huggingface	如果你在宿主机使用的默认路径下载的模型，由于 xinference cache 目录是用的软链的方式存储模型，需要将原文件所在的目录也挂载到容器内。例如你使用 huggingface 和 modelscope 作为模型仓库，那么需要将这两个对应的目录挂载到容器内，一般对应的 cache 目录分别在 <home_path>/.cache/huggingface 和 <home_path>/.cache/modelscope	是（默认挂载方式与上面自定义挂载方式二选一）
-p 9997:9997	端口映射	是
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest	当前，可以通过两个渠道拉取 Xinference 的官方镜像。 1. 在 Dockerhub 的 `xprobe/xinference` 仓库里。 2. Dockerhub 中的镜像会同步上传一份到阿里云公共镜像仓库中，供访问 Dockerhub 有困难的用户拉取。目前可用的标签包括： `nightly-main`: 这个镜像会每天从 GitHub main 分支更新制作，不保证稳定可靠。 `v<release version>`: 这个镜像会在 Xinference 每次发布的时候制作，通常可以认为是稳定可靠的。 latest`: 这个镜像会在 Xinference 发布时指向最新的发布版本。对于 CPU 版本，增加` -cpu`后缀，如`nightly-main-cpu`。	是
-H 0.0.0.0	`-H 0.0.0.0` 必须指定的，否则在容器外无法连接到 Xinference 服务。	是

2.1 页面访问

通过以上命令启动之后，即可通过 localhost:9997 也可以通过本机IP地址访问，比如 192.168.1.152:9997 去访问。

2.2 操作页面介绍

这里简单介绍一下操作页面，界面很简单，基本都能看懂，需要什么模型，去到对应的模型库里面下载即可。

三、部署一个简单模型

部署好之后，我们在线部署一个简单对话模型: 以 qwen-chat 为例

3.1 搜索 qwen-chat 回车

3.2 开始部署

参数说明：

参数填写完之后，点击小火箭，即可部署。这里需要等待，因为需要去模型仓库里面拉取模型，默认两个：huggingface和modelscope 。下载模型需要开代理，我这边下载默认是从huggingface里面下载的，所以全程代理下载。部署速度由代理速度决定。

3.3 开始对话

部署好之后，我们在 “Running Models” 看到模型。需要注意的是，能够跑的模型数量取决于GPU数量，如果你只有一颗GPU，那只能跑一个模型，以此类推。

点击后面的开始对话，即可跳转对面页面。

可以看到，这么模型还不错。

四、对接本地模型

Xinference作为一个模型集成平台，当然也可以对接本地的模型。

4.1 下载模型

首先，去到对应的模型仓库，下载好自己想要对接的模型。此处下载的时候需要注意，一定要参考官方能够支持的模型种类，否则是注册不了的。
模型仓库：
https://huggingface.co/
https://www.modelscope.cn/
目前Xinference 支持的模型家族有，大部分都是支持的。

MODEL NAME	ABILITIES	COTNEXT_LENGTH	DESCRIPTION
aquila2	generate	2048	Aquila2 series models are the base language models
aquila2-chat	chat	2048	Aquila2-chat series models are the chat models
aquila2-chat-16k	chat	16384	AquilaChat2-16k series models are the long-text chat models
baichuan	generate	4096	Baichuan is an open-source Transformer based LLM that is trained on both Chinese and English data.
baichuan-2	generate	4096	Baichuan2 is an open-source Transformer based LLM that is trained on both Chinese and English data.
baichuan-2-chat	chat	4096	Baichuan2-chat is a fine-tuned version of the Baichuan LLM, specializing in chatting.
baichuan-chat	chat	4096	Baichuan-chat is a fine-tuned version of the Baichuan LLM, specializing in chatting.
c4ai-command-r-v01	chat	131072	C4AI Command-R(+) is a research release of a 35 and 104 billion parameter highly performant generative model.
chatglm	chat	2048	ChatGLM is an open-source General Language Model (GLM) based LLM trained on both Chinese and English data.
chatglm2	chat	8192	ChatGLM2 is the second generation of ChatGLM, still open-source and trained on Chinese and English data.
chatglm2-32k	chat	32768	ChatGLM2-32k is a special version of ChatGLM2, with a context window of 32k tokens instead of 8k.
chatglm3	chat, tools	8192	ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data.
chatglm3-128k	chat	131072	ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data.
chatglm3-32k	chat	32768	ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data.
code-llama	generate	100000	Code-Llama is an open-source LLM trained by fine-tuning LLaMA2 for generating and discussing code.
code-llama-instruct	chat	100000	Code-Llama-Instruct is an instruct-tuned version of the Code-Llama LLM.
code-llama-python	generate	100000	Code-Llama-Python is a fine-tuned version of the Code-Llama LLM, specializing in Python.
codegeex4	chat	131072	the open-source version of the latest CodeGeeX4 model series
codeqwen1.5	generate	65536	CodeQwen1.5 is the Code-Specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of data of codes.
codeqwen1.5-chat	chat	65536	CodeQwen1.5 is the Code-Specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of data of codes.
codeshell	generate	8194	CodeShell is a multi-language code LLM developed by the Knowledge Computing Lab of Peking University.
codeshell-chat	chat	8194	CodeShell is a multi-language code LLM developed by the Knowledge Computing Lab of Peking University.
codestral-v0.1	generate	32768	Codestrall-22B-v0.1 is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash
cogvlm2	chat, vision	8192	CogVLM2 have achieved good results in many lists compared to the previous generation of CogVLM open source models. Its excellent performance can compete with some non-open source models.
csg-wukong-chat-v0.1	chat	32768	csg-wukong-1B is a 1 billion-parameter small language model(SLM) pretrained on 1T tokens.
deepseek	generate	4096	DeepSeek LLM, trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese.
deepseek-chat	chat	4096	DeepSeek LLM is an advanced language model comprising 67 billion parameters. It has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese.
deepseek-coder	generate	16384	Deepseek Coder is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese.
deepseek-coder-instruct	chat	16384	deepseek-coder-instruct is a model initialized from deepseek-coder-base and fine-tuned on 2B tokens of instruction data.
deepseek-vl-chat	chat, vision	4096	DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios.
falcon	generate	2048	Falcon is an open-source Transformer based LLM trained on the RefinedWeb dataset.
falcon-instruct	chat	2048	Falcon-instruct is a fine-tuned version of the Falcon LLM, specializing in chatting.
gemma-2-it	chat	8192	Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.
gemma-it	chat	8192	Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.
glaive-coder	chat	16384	A code model trained on a dataset of ~140k programming related problems and solutions generated from Glaive’s synthetic data generation platform.
glm-4v	chat, vision	8192	GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI.
glm4-chat	chat, tools	131072	GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI.
glm4-chat-1m	chat, tools	1048576	GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI.
gorilla-openfunctions-v1	chat	4096	OpenFunctions is designed to extend Large Language Model (LLM) Chat Completion feature to formulate executable APIs call given natural language instructions and API context.
gorilla-openfunctions-v2	chat	4096	OpenFunctions is designed to extend Large Language Model (LLM) Chat Completion feature to formulate executable APIs call given natural language instructions and API context.
gpt-2	generate	1024	GPT-2 is a Transformer-based LLM that is trained on WebTest, a 40 GB dataset of Reddit posts with 3+ upvotes.
internlm-20b	generate	16384	Pre-trained on over 2.3T Tokens containing high-quality English, Chinese, and code data.
internlm-7b	generate	8192	InternLM is a Transformer-based LLM that is trained on both Chinese and English data, focusing on practical scenarios.
internlm-chat-20b	chat	16384	Pre-trained on over 2.3T Tokens containing high-quality English, Chinese, and code data. The Chat version has undergone SFT and RLHF training.
internlm-chat-7b	chat	4096	Internlm-chat is a fine-tuned version of the Internlm LLM, specializing in chatting.
internlm2-chat	chat	32768	The second generation of the InternLM model, InternLM2.
internlm2.5-chat	chat	32768	InternLM2.5 series of the InternLM model.
internlm2.5-chat-1m	chat	262144	InternLM2.5 series of the InternLM model supports 1M long-context
internvl-chat	chat, vision	32768	InternVL 1.5 is an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
llama-2	generate	4096	Llama-2 is the second generation of Llama, open-source and trained on a larger amount of data.
llama-2-chat	chat	4096	Llama-2-Chat is a fine-tuned version of the Llama-2 LLM, specializing in chatting.
llama-3	generate	8192	Llama 3 is an auto-regressive language model that uses an optimized transformer architecture
llama-3-instruct	chat	8192	The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks…
llama-3.1	generate	131072	Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture
llama-3.1-instruct	chat	131072	The Llama 3.1 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks…
minicpm-2b-dpo-bf16	chat	4096	MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings.
minicpm-2b-dpo-fp16	chat	4096	MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings.
minicpm-2b-dpo-fp32	chat	4096	MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings.
minicpm-2b-sft-bf16	chat	4096	MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings.
minicpm-2b-sft-fp32	chat	4096	MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings.
minicpm-llama3-v-2_5	chat, vision	2048	MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters.
mistral-instruct-v0.1	chat	8192	Mistral-7B-Instruct is a fine-tuned version of the Mistral-7B LLM on public datasets, specializing in chatting.
mistral-instruct-v0.2	chat	8192	The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1.
mistral-instruct-v0.3	chat	32768	The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1.
mistral-large-instruct	chat	131072	Mistral-Large-Instruct-2407 is an advanced dense Large Language Model (LLM) of 123B parameters with state-of-the-art reasoning, knowledge and coding capabilities.
mistral-nemo-instruct	chat	1024000	The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-Nemo-Base-2407
mistral-v0.1	generate	8192	Mistral-7B is a unmoderated Transformer based LLM claiming to outperform Llama2 on all benchmarks.
mixtral-8x22b-instruct-v0.1	chat	65536	The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of the Mixtral-8x22B-v0.1, specializing in chatting.
mixtral-instruct-v0.1	chat	32768	Mistral-8x7B-Instruct is a fine-tuned version of the Mistral-8x7B LLM, specializing in chatting.
mixtral-v0.1	generate	32768	The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.
omnilmm	chat, vision	2048	OmniLMM is a family of open-source large multimodal models (LMMs) adept at vision & language modeling.
openbuddy	chat	2048	OpenBuddy is a powerful open multilingual chatbot model aimed at global users.
openhermes-2.5	chat	8192	Openhermes 2.5 is a fine-tuned version of Mistral-7B-v0.1 on primarily GPT-4 generated data.
opt	generate	2048	Opt is an open-source, decoder-only, Transformer based LLM that was designed to replicate GPT-3.
orca	chat	2048	Orca is an LLM trained by fine-tuning LLaMA on explanation traces obtained from GPT-4.
orion-chat	chat	4096	Orion-14B series models are open-source multilingual large language models trained from scratch by OrionStarAI.
orion-chat-rag	chat	4096	Orion-14B series models are open-source multilingual large language models trained from scratch by OrionStarAI.
phi-2	generate	2048	Phi-2 is a 2.7B Transformer based LLM used for research on model safety, trained with data similar to Phi-1.5 but augmented with synthetic texts and curated websites.
phi-3-mini-128k-instruct	chat	128000	The Phi-3-Mini-128K-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets.
phi-3-mini-4k-instruct	chat	4096	The Phi-3-Mini-4k-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets.
platypus2-70b-instruct	generate	4096	Platypus-70B-instruct is a merge of garage-bAInd/Platypus2-70B and upstage/Llama-2-70b-instruct-v2.
qwen-chat	chat, tools	32768	Qwen-chat is a fine-tuned version of the Qwen LLM trained with alignment techniques, specializing in chatting.
qwen-vl-chat	chat, vision	4096	Qwen-VL-Chat supports more flexible interaction, such as multiple image inputs, multi-round question answering, and creative capabilities.
qwen1.5-chat	chat, tools	32768	Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data.
qwen1.5-moe-chat	chat, tools	32768	Qwen1.5-MoE is a transformer-based MoE decoder-only language model pretrained on a large amount of data.
qwen2-instruct	chat, tools	32768	Qwen2 is the new series of Qwen large language models
qwen2-moe-instruct	chat, tools	32768	Qwen2 is the new series of Qwen large language models.
seallm_v2	generate	8192	We introduce SeaLLM-7B-v2, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages
seallm_v2.5	generate	8192	We introduce SeaLLM-7B-v2.5, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages
skywork	generate	4096	Skywork is a series of large models developed by the Kunlun Group · Skywork team.
skywork-math	generate	4096	Skywork is a series of large models developed by the Kunlun Group · Skywork team.
starchat-beta	chat	8192	Starchat-beta is a fine-tuned version of the Starcoderplus LLM, specializing in coding assistance.
starcoder	generate	8192	Starcoder is an open-source Transformer based LLM that is trained on permissively licensed data from GitHub.
starcoderplus	generate	8192	Starcoderplus is an open-source LLM trained by fine-tuning Starcoder on RedefinedWeb and StarCoderData datasets.
starling-lm	chat	4096	We introduce Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model harnesses the power of our new GPT-4 labeled ranking dataset
telechat	chat	8192	The TeleChat is a large language model developed and trained by China Telecom Artificial Intelligence Technology Co., LTD. The 7B model base is trained with 1.5 trillion Tokens and 3 trillion Tokens and Chinese high-quality corpus.
tiny-llama	generate	2048	The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens.
vicuna-v1.3	chat	2048	Vicuna is an open-source LLM trained by fine-tuning LLaMA on data collected from ShareGPT.
vicuna-v1.5	chat	4096	Vicuna is an open-source LLM trained by fine-tuning LLaMA on data collected from ShareGPT.
vicuna-v1.5-16k	chat	16384	Vicuna-v1.5-16k is a special version of Vicuna-v1.5, with a context window of 16k tokens instead of 4k.
wizardcoder-python-v1.0	chat	100000
wizardlm-v1.0	chat	2048	WizardLM is an open-source LLM trained by fine-tuning LLaMA with Evol-Instruct.
wizardmath-v1.0	chat	2048	WizardMath is an open-source LLM trained by fine-tuning Llama2 with Evol-Instruct, specializing in math.
xverse	generate	2048	XVERSE is a multilingual large language model, independently developed by Shenzhen Yuanxiang Technology.
xverse-chat	chat	2048	XVERSEB-Chat is the aligned version of model XVERSE.
yi	generate	4096	The Yi series models are large language models trained from scratch by developers at 01.AI.
yi-1.5	generate	4096	Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples.
yi-1.5-chat	chat	4096	Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples.
yi-1.5-chat-16k	chat	16384	Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples.
yi-200k	generate	262144	The Yi series models are large language models trained from scratch by developers at 01.AI.
yi-chat	chat	4096	The Yi series models are large language models trained from scratch by developers at 01.AI.
yi-vl-chat	chat, vision	4096	Yi Vision Language (Yi-VL) model is the open-source, multimodal version of the Yi Large Language Model (LLM) series, enabling content comprehension, recognition, and multi-round conversations about images.
zephyr-7b-alpha	chat	8192	Zephyr-7B-α is the first model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1.
zephyr-7b-beta	chat	8192	Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1