xinference command-line usage notes

gs80140

Article link: https://blog.csdn.net/gs80140/article/details/142603478

1. List the parameter combinations available for the qwen-chat model, to see how it can run on each inference engine

Command

xinference engine -e http://0.0.0.0:9997 --model-name qwen-chat

Result

2. I want to run qwen-chat on the vLLM inference engine, but I do not know which other parameters are compatible with that choice.

Command:

xinference engine -e http://0.0.0.0:9997 --model-name qwen-chat --model-engine vllm

3. I want to load qwen-chat in GGUF format and need to know the remaining parameter combinations

Command

xinference engine -e http://0.0.0.0:9997 --model-name qwen-chat -f ggufv2
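
The three queries above differ only in their optional filters, so they can be wrapped in a small helper. This is a sketch, not part of xinference: the function name engine_query_cmd and the variable XINF_ENDPOINT are hypothetical, and the helper only prints the command so it can be inspected before being run. It uses only the flags that appear in the commands above.

```shell
#!/bin/sh
# Hypothetical helper (not part of xinference): print the `xinference engine`
# command for a model, optionally filtered by engine and/or model format.
XINF_ENDPOINT="${XINF_ENDPOINT:-http://0.0.0.0:9997}"

engine_query_cmd() {
    # $1 = model name (required), $2 = engine (optional), $3 = format (optional)
    cmd="xinference engine -e $XINF_ENDPOINT --model-name $1"
    if [ -n "${2:-}" ]; then
        cmd="$cmd --model-engine $2"
    fi
    if [ -n "${3:-}" ]; then
        cmd="$cmd -f $3"
    fi
    echo "$cmd"
}
```

For example, engine_query_cmd qwen-chat vllm prints the command from step 2, and engine_query_cmd qwen-chat "" ggufv2 prints the one from step 3.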

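4. Run a built-in llama-2-chat model. The first time you run a model, its weights have to be downloaded from HuggingFace, which typically takes 10 to 30 minutes depending on the model size. Once the download finishes, Xinference caches the model locally, so later runs of the same model do not download it again. Because HuggingFace cannot be reached from mainland China, set the variable export HF_ENDPOINT=https://hf-mirror.com before starting xinference-local to point at a domestic mirror.

Query the available options first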
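
One way to apply the mirror without changing the rest of the shell session is to set the variable only for the command being started. The sketch below assumes a POSIX shell; the wrapper name with_hf_mirror is hypothetical, and the host and port in the usage note are assumptions chosen to match the endpoint used in this article.

```shell
#!/bin/sh
# Hypothetical wrapper: run any command with HF_ENDPOINT pointing at the
# mirror, leaving the calling shell's environment unchanged.
with_hf_mirror() {
    HF_ENDPOINT=https://hf-mirror.com "$@"
}

# Show what a child process would see:
with_hf_mirror sh -c 'echo "HF_ENDPOINT=$HF_ENDPOINT"'
```

The server would then be started as with_hf_mirror xinference-local --host 0.0.0.0 --port 9997.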

xinference engine -e http://0.0.0.0:9997 --model-name llama-2-chat --model-engine vllm

Launch command

xinference launch --model-engine vllm -u my-llama-2 -n llama-2-chat -s 13 -f pytorch

The launch fails with a GPU out-of-memory error; the card has only 24 GB of memory

RuntimeError: Failed to launch model, detail:
 [address=0.0.0.0:44231, pid=47189] CUDA out of memory. 
 Tried to allocate 270.00 MiB. 
 GPU 0 has a total capacity of 23.64 GiB of which 213.69 MiB is free. 
 Including non-PyTorch memory, 
 this process has 23.43 GiB memory in use. 
Of the allocated memory 22.99 GiB is allocated by PyTorch, 
and 1.76 MiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large 
try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
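
A rough estimate shows why a 24 GB card is tight for this model. The sketch assumes the pytorch-format weights are loaded in 16-bit precision (2 bytes per parameter), which the log does not state, and treats the -s value as the parameter count in billions. The helper name fp16_weight_gib is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate the memory needed for the weights alone,
# in GiB, assuming 2 bytes per parameter (16-bit precision).
fp16_weight_gib() {
    # $1 = model size in billions of parameters
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b * 1e9 * 2 / (1024 * 1024 * 1024) }'
}

fp16_weight_gib 13   # prints 24.2
fp16_weight_gib 7    # prints 13.0
```

Under these assumptions the 13B weights alone need about 24.2 GiB, more than the 23.64 GiB total capacity reported in the log, before any KV cache or activations are counted. The log also shows only 1.76 MiB reserved but unallocated, so fragmentation appears to be a small part of the shortfall.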

Solution: add the following environment variable setting to the xinf.sh startup script

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
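
The contents of xinf.sh are not shown here, so the following is only a guess at its shape. The function name start_server, the host, and the port are assumptions; the one line taken from the text is the allocator setting, which must be exported so that the server process started by the script inherits it.

```shell
#!/bin/sh
# Hypothetical xinf.sh: the real script is not shown in this article.

# Ask the PyTorch CUDA allocator to use expandable segments, as the error
# message suggests, to reduce fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

start_server() {
    # Host and port are assumptions matching the endpoint used above.
    xinference-local --host 0.0.0.0 --port 9997
}

# Uncomment to start the server when the script is run:
# start_server
```

Running printenv PYTORCH_CUDA_ALLOC_CONF after the export line is a quick way to confirm that child processes will see the setting.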