Calling the Qwen API

This post covers the vLLM project on GitHub, a high-throughput and memory-efficient inference and serving engine for large language models (LLMs), and shows how to deploy and use different model variants locally on the FastChat platform, such as Qwen-72B and quantized models.

GitHub - QwenLM/vllm-gptq: A high-throughput and memory-efficient inference and serving engine for LLMs

Install FastChat, then launch the controller, a vLLM worker, and the OpenAI-compatible API server (each in its own terminal):

pip install fschat

python -m fastchat.serve.controller

python -m fastchat.serve.vllm_worker --model-path $model_path --tensor-parallel-size 2 --trust-remote-code

python -m fastchat.serve.openai_api_server --host localhost --port 8000
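Once all three processes are up, the deployment can be sanity-checked: FastChat's OpenAI-compatible server exposes a /v1/models endpoint that lists the registered models (a quick check, assuming the default host and port used above):

curl http://localhost:8000/v1/models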

Install the OpenAI Python client; the 0.28 series still provides the legacy ChatCompletion interface used below:

pip install --upgrade openai==0.28

import openai
# To get proper authentication, make sure to use a valid key that's listed in
# the --api-keys flag; if no flag value is provided, the `api_key` is ignored.
openai.api_key = "EMPTY"
# Point the client at the local FastChat OpenAI-compatible server.
openai.api_base = "http://localhost:8000/v1"

model = "qwen"
call_args = {
    'temperature': 1.0,
    'top_p': 1.0,
    'top_k': -1,              # -1 disables top-k sampling
    'max_tokens': 2048,       # maximum number of generated tokens (output length)
    'presence_penalty': 1.0,
    'frequency_penalty': 0.0,
}
# Create a chat completion.
completion = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
    **call_args
)
# Print the model's reply.
print(completion.choices[0].message.content)
To expose the API server on the network rather than only on localhost, bind it to the host's IP address instead:

python -m fastchat.serve.openai_api_server --host IP --port 8000
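The same chat completions endpoint can also be exercised directly with curl, which is a convenient way to confirm the worker is reachable before wiring up a client. A minimal request, assuming the server is reachable at localhost:8000 and the worker registered under the name qwen:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello! What is your name?"}], "max_tokens": 128}'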

Web UI:

GitHub - lm-sys/FastChat: An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

To serve the same model through FastChat's Gradio web UI instead, start the controller, a model worker, and the web server (again, each in its own terminal):

python3 -m fastchat.serve.controller

python3 -m fastchat.serve.model_worker --model-path QWen-72B-Chat --num-gpus 2 --max-gpu-memory xxGiB

python3 -m fastchat.serve.gradio_web_server --host IP --port 8000

If the components fail to connect, the web UI may only show a generic error:

**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**

The commands above must be started in exactly this order, or startup fails. (This conclusion later turned out to be wrong.)
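Rather than relying on launch order, it is easier to confirm that the worker has actually registered with the controller before starting the web server. FastChat ships a small test client for this; a quick check, assuming the worker registered under the name QWen-72B-Chat:

python3 -m fastchat.serve.test_message --model-name QWen-72B-Chat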

Local large-model deployment, option 2: FastChat + LLM (vLLM) - CSDN blog

FastChat/docs/vllm_integration.md at main · lm-sys/FastChat · GitHub

  1. When you launch a model worker, replace the normal worker (fastchat.serve.model_worker) with the vLLM worker (fastchat.serve.vllm_worker). All other commands such as controller, gradio web server, and OpenAI API server are kept the same.

    python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.5
    

    If you see tokenizer errors, try

    python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer hf-internal-testing/llama-tokenizer
    

    If you use an AWQ quantized model, try

    python3 -m fastchat.serve.vllm_worker --model-path TheBloke/vicuna-7B-v1.5-AWQ --quantization awq
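The client examples above pin openai==0.28 because they use the legacy openai.ChatCompletion interface. With the current openai>=1.0 client, the equivalent call against the same FastChat endpoint looks roughly like this (a sketch; the base_url and model name must match the deployment above):

from openai import OpenAI

# Point the client at the FastChat OpenAI-compatible server.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

completion = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
    temperature=1.0,
    max_tokens=2048,
)
print(completion.choices[0].message.content)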
