LLM
First open a shell script with vim, e.g. vim start_model.shell, change the command below to point at your own model and paste it in, then run chmod 777 start_model.shell to make the script executable, and finally run it. The specific commands are:
vim start_model.shell #edit the shell script
chmod 777 start_model.shell #make it executable
./start_model.shell #run the script
curl 'http://127.0.0.1:9997/v1/models' \
-H 'Accept: */*' \
-H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-H 'Cookie: token=no_auth' \
-H 'Origin: http://127.0.0.1:9997' \
-H 'Referer: http://127.0.0.1:9997/ui/' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: "Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Linux"' \
--data-raw '{"model_uid":null,"model_name":"qwen2-instruct","model_type":"LLM","model_engine":"Vllm","model_format":"pytorch","model_size_in_billions":7,"quantization":"none","n_gpu":"auto","replica":1,"request_limits":null,"worker_ip":null,"gpu_idx":null}'
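Before sending the request, it can help to confirm that the JSON body is well-formed, since a stray quote or comma will make the server reject it. A minimal local check (assuming python3 is available on the machine):

```shell
# Validate the launch payload locally before POSTing it (python3 assumed).
PAYLOAD='{"model_uid":null,"model_name":"qwen2-instruct","model_type":"LLM","model_engine":"Vllm","model_format":"pytorch","model_size_in_billions":7,"quantization":"none","n_gpu":"auto","replica":1,"request_limits":null,"worker_ip":null,"gpu_idx":null}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload OK"
```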
Explanation of each part
URL:
curl 'http://127.0.0.1:9997/v1/models'
Request headers:
-H 'Accept: */*'
-H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8'
-H 'Connection: keep-alive'
-H 'Content-Type: application/json'
-H 'Cookie: token=no_auth'
-H 'Origin: http://127.0.0.1:9997'
-H 'Referer: http://127.0.0.1:9997/ui/'
-H 'Sec-Fetch-Dest: empty'
-H 'Sec-Fetch-Mode: cors'
-H 'Sec-Fetch-Site: same-origin'
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
-H 'sec-ch-ua: "Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"'
-H 'sec-ch-ua-mobile: ?0'
-H 'sec-ch-ua-platform: "Linux"'
Accept: the content types the client can accept.
Accept-Language: the languages the client can accept.
Connection: how the connection is managed.
Content-Type: the content type of the request body, here application/json.
Cookie: cookies sent along with the request.
Origin: the origin of the request.
Referer: the page the request came from.
Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site: security-related headers describing the context of the request.
User-Agent: the client's user-agent string.
sec-ch-ua, sec-ch-ua-mobile, sec-ch-ua-platform: client-hint headers describing the client's hardware and software environment.
Request body:
--data-raw '{"model_uid":null,"model_name":"qwen2-instruct","model_type":"LLM","model_engine":"Vllm","model_format":"pytorch","model_size_in_billions":7,"quantization":"none","n_gpu":"auto","replica":1,"request_limits":null,"worker_ip":null,"gpu_idx":null}'
model_uid: the unique identifier of the model.
model_name: the name of the model.
model_type: the type of model (e.g. LLM for a large language model).
model_engine: the engine the model runs on (e.g. Transformers, Vllm, llama.cpp; choose according to your model).
model_format: the format of the model (e.g. pytorch, ggufv2).
model_size_in_billions: the model size in billions of parameters. (Use the model's actual parameter count; for example, I chose qwen2-7b-instruct, so the value is 7.)
quantization: the quantization method (e.g. none, 8-bit, 4-bit).
n_gpu: the number of GPUs to use (in my testing, only auto works).
replica: the number of model replicas.
request_limits: request limits (null here).
worker_ip: the IP address of the worker node (null here).
gpu_idx: the GPU index (null here).
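In practice only a couple of these fields change from launch to launch. As a sketch, a hypothetical helper (build_launch_payload, not part of Xinference) that fills in model_name and model_size_in_billions and keeps the other fields as in the request body above:

```shell
# Hypothetical helper: build the launch payload for a given model name/size.
# All other fields are kept as in the request body shown above.
build_launch_payload() {
  local name="$1" size="$2"
  printf '{"model_uid":null,"model_name":"%s","model_type":"LLM","model_engine":"Vllm","model_format":"pytorch","model_size_in_billions":%s,"quantization":"none","n_gpu":"auto","replica":1,"request_limits":null,"worker_ip":null,"gpu_idx":null}' "$name" "$size"
}

build_launch_payload qwen2-instruct 7
```

It could then be sent with curl 'http://127.0.0.1:9997/v1/models' -H 'Content-Type: application/json' --data-raw "$(build_launch_payload qwen2-instruct 7)"; the browser headers captured above were copied from DevTools and are likely not required by the server.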
To pick a model_name, go into the cache directory under the xinference directory, where you can see the model names.
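For example (assuming the default cache location; Xinference also honors the XINFERENCE_HOME environment variable, so the path below is derived from it when set):

```shell
# List locally cached models to see valid model_name values.
# Assumes the default cache path; XINFERENCE_HOME overrides it when set.
CACHE_DIR="${XINFERENCE_HOME:-$HOME/.xinference}/cache"
echo "$CACHE_DIR"
if [ -d "$CACHE_DIR" ]; then
  ls "$CACHE_DIR"
fi
```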
emb_model
curl 'http://127.0.0.1:9997/v1/models' \
-H 'Accept: */*' \
-H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-H 'Cookie: token=no_auth' \
-H 'Origin: http://127.0.0.1:9997' \
-H 'Referer: http://127.0.0.1:9997/ui/' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: "Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Linux"' \
--data-raw '{"model_uid":"bge-large-zh-v1.5","model_name":"bge-large-zh-v1.5","model_type":"embedding","replica":1,"n_gpu":"auto","worker_ip":null,"gpu_idx":null}'
Commands to run the script
vim start_model_emb.shell #edit the shell script
chmod 777 start_model_emb.shell #make the script executable
./start_model_emb.shell #run the script
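The two launch requests can also be combined into one script. The sketch below is a dry run: launch_model is a hypothetical helper that prints each curl command instead of executing it, so the commands can be reviewed first (remove the echo to actually send the requests):

```shell
# Dry run: print the curl command that would launch each model.
launch_model() {  # $1 = JSON launch payload
  echo curl "http://127.0.0.1:9997/v1/models" -H "Content-Type: application/json" --data-raw "$1"
}

# Embedding model first, then the LLM.
launch_model '{"model_uid":"bge-large-zh-v1.5","model_name":"bge-large-zh-v1.5","model_type":"embedding","replica":1,"n_gpu":"auto","worker_ip":null,"gpu_idx":null}'
launch_model '{"model_uid":null,"model_name":"qwen2-instruct","model_type":"LLM","model_engine":"Vllm","model_format":"pytorch","model_size_in_billions":7,"quantization":"none","n_gpu":"auto","replica":1,"request_limits":null,"worker_ip":null,"gpu_idx":null}'
```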
Note: if you only have one GPU, it is best to start the embedding model first. If you start the LLM first, you may get the error below, which itself suggests starting the embedding model first. I tested this on a 4090 with 24 GB of VRAM and it works. The specific error is: