Install Xinference with pip
Highly recommended! ModelScope now offers free CPU and time-limited GPU instances. On one of these I successfully installed the Xinference framework and deployed the Qwen1.5 model at about 7 tokens/s.
# Install Xinference
pip3 install xinference
# Set environment variables: use the HF mirror and pull models from ModelScope
export HF_ENDPOINT=https://hf-mirror.com
export XINFERENCE_MODEL_SRC=modelscope
export XINFERENCE_HOME=/mnt/workspace/xinf-data
# Start the local Xinference server
xinference-local --host 0.0.0.0 --port 9997
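Once the server is up, a quick sanity check is to query its OpenAI-compatible model listing; a minimal sketch, assuming the host and port from the command above:
# Returns a JSON list of currently running models (empty right after startup)
curl http://127.0.0.1:9997/v1/models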
Launch a model:
xinference launch --model-engine transformers --model-name qwen1.5-chat --size-in-billions 0_5 --model-format pytorch --quantization none
# Success
Launch model name: qwen1.5-chat with kwargs: {}
Model uid: qwen1.5-chat
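To confirm the model is actually up, the Xinference CLI can list running models (a quick check, assuming the default local endpoint):
# Shows running models together with their uids
xinference list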
Launching the model with the llama.cpp engine fails, however; the llama.cpp library cannot be used:
xinference launch --model-engine llama.cpp --model-name qwen1.5-chat --size-in-billions 0_5 --model-format ggufv2 --quantization none
Traceback (most recent call last):
  File "/opt/conda/bin/xinference", line 8, in <module>
    sys.exit(cli())
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/xinference/deploy/cmdline.py", line 814, in model_launch
    model_uid = client.launch_model(
  File "/opt/conda/lib/python3.10/site-packages/xinference/client/restful/restful_client.py", line 864, in launch_model
    raise RuntimeError(
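The RuntimeError message is cut off above, so the exact cause is not visible. A plausible explanation (an assumption, not confirmed by the log) is that the llama-cpp-python package, which Xinference's llama.cpp engine depends on, is not installed in this environment:
# Assumption: the llama.cpp engine requires llama-cpp-python
pip3 install llama-cpp-python
Note also that GGUF builds are usually published under a specific quantization tag (e.g. q4_0), so --quantization none may simply not match any available file; that is another possible cause.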
Testing the HTTP API (non-streaming request):
curl -X 'POST' 'http://0.0.0.0:9997/v1/chat/completions' \
-H 'Content-Type: application/json' -d '{
"model": "qwen1.5-chat",
"messages": [
{
"role": "user",
"content": "北京景点?"
}
],
"temperature": 1
}'
A streaming variant of the same request:
curl -X 'POST' 'http://0.0.0.0:9997/v1/chat/completions' \
-H 'Content-Type: application/json' -d '{
"model": "qwen1.5-chat",
"stream": true,
"messages": [
{
"role": "user",
"content": "北京景点?"
}
],
"temperature": 1
}'
Speed test & summary
2024-05-29 07:13:45,715 xinference.model.llm.pytorch.utils 3916 INFO Average generation speed: 7.19 tokens/s.
Caveat: idle ModelScope instances get deleted, and data is not saved! A single session is limited to at most 10 hours!!
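When you are done, the running model can be shut down to free memory; a minimal sketch using the Xinference CLI (the uid comes from the launch output above, assuming the default local endpoint):
# Stop the running model instance
xinference terminate --model-uid qwen1.5-chat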