Deploying LLMs with vLLM (qwen2.5, llama, deepseek)

Table of Contents

Environment

qwen2.5-1.5b-instruct

Model download

Installing vLLM

Verifying the installation

Starting vLLM

Listing the current models

OpenAI Completions API (text generation)

OpenAI Chat Completions API (chat)

Viewing and killing the vLLM process

llama3

DeepSeek distilled Qwen


Environment

Name: vllm
Version: 0.7.3

Name: torch
Version: 2.5.1

Name: transformers
Version: 4.49.0

GPU: V100-32GB

CUDA Version: 12.1

qwen2.5-1.5b-instruct

Model download

from modelscope import snapshot_download
model_dir = snapshot_download('Qwen/Qwen2.5-1.5B-Instruct', cache_dir='/root/autodl-tmp/')

cache_dir is the directory the model is saved to.

Installing vLLM

Just install it with pip:

pip install vllm

Verifying the installation

Once vLLM is installed and the model is downloaded, you can test vLLM inference with the following code; if everything is working, it will print results.

import os
import warnings
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
warnings.filterwarnings('ignore')

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams


class vllmModel():

    def __init__(self, model_path, temperature=0.1, max_tokens=4096, tp_size=1):
        """
        model_path: path to the model
        temperature: sampling temperature
        max_tokens: maximum number of output tokens
        tp_size: number of GPUs; can be 1, and with more than one card the count must be even, e.g. 2, 4, 6
        """
        print(f'Loading local model: {model_path}')
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.llm = LLM(
            model=model_path,
            tensor_parallel_size=tp_size,
            max_model_len=4096,
            trust_remote_code=True,
            enforce_eager=True,
            dtype="float16",
            # If you run into OOM errors, consider enabling the parameters below
            # enable_chunked_prefill=True,
            # max_num_batched_tokens=8192
        )
        self.sampling_params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
        print("模型加载完毕")

    def infer(self, prompts):
        """
        prompts: a list of prompt strings
        """
        prompts = [{"role": "user", "content": prompt} for prompt in prompts]
        inputs = []
        for prompt in prompts:
            _input = self.tokenizer.apply_chat_template([prompt], tokenize=False, add_generation_prompt=True)
            inputs.append(_input)
        outputs = self.llm.generate(prompts=inputs, sampling_params=self.sampling_params)
        result = []
        for output in outputs:
            text = output.outputs[0].text
            result.append(text)
        return result


# Load the model
model_path = "/root/autodl-tmp/Qwen/Qwen2___5-1___5B-Instruct/"
llm = vllmModel(model_path)

# infer
print(llm.infer(['你好', '你能做什么?']))

Starting vLLM

qwen2.5 works with the OpenAI-compatible API server, which can be started with the following command:

VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve /root/autodl-tmp/Qwen/Qwen2___5-1___5B-Instruct \
--trust-remote-code \
--served-model-name qwen2_5_1_5 \
--gpu-memory-utilization 0.2 \
--tensor-parallel-size 1 \
--port 8000 \
--dtype=half

--dtype=half depends on the GPU; I got an error telling me to add this flag. It sets the precision to half precision.

--port 8000 serves the API on port 8000.

--trust-remote-code allows executing remote (custom model) code.

--served-model-name qwen2_5_1_5 defines the model name exposed by the API; it is not the model path.

--gpu-memory-utilization 0.2 sets the fraction of GPU memory to use. If it is too high, calls can crash; I initially used 0.98 and the server crashed on calls.

--tensor-parallel-size 1 sets the number of GPUs to use.

serve /root/autodl-tmp/Qwen/Qwen2___5-1___5B-Instruct is the path the model is loaded from.

--quantization awq is needed when deploying a quantized model, i.e. one whose name ends with AWQ.

VLLM_USE_V1=1, written at the very beginning, sets that environment variable to 1, meaning you want to use vLLM's V1 API/engine. With this set, my server not only failed to answer calls, it crashed as soon as I called it, or died on its own shortly after, so I removed it.

VLLM_WORKER_MULTIPROC_METHOD=spawn is also written at the very beginning.

This variable specifies how worker processes are started in multi-process mode; here it is set to spawn. Python's multiprocessing module supports several ways of creating new processes, including fork, spawn, and forkserver. With spawn, each new process is created by launching a brand-new Python interpreter instance. This is safer than fork, especially on non-Unix platforms such as Windows, because with fork the child inherits all of the parent's resources, which can lead to deadlocks or other unexpected behavior (see the small illustration after this list).

--max-model-len 4096 is the maximum sequence length (in tokens) the model can handle; if you run out of GPU memory, lower it.

CUDA_VISIBLE_DEVICES=0,2, written at the very beginning, specifies which GPUs to use; here it selects the 1st and 3rd cards (indices 0 and 2).
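
As a generic illustration of what the spawn start method means (this snippet is plain Python multiprocessing, not part of the vLLM launch itself):

import multiprocessing as mp

def worker(i):
    # Runs in a fresh interpreter when the start method is "spawn"
    print(f"worker {i} started")

if __name__ == "__main__":
    # "spawn" launches each worker as a brand-new Python interpreter instead of
    # forking the parent, so CUDA handles and other parent state are not inherited.
    mp.set_start_method("spawn", force=True)
    with mp.Pool(2) as pool:
        pool.map(worker, range(2))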

If the server starts successfully, the startup logs will be printed.

Listing the current models

curl http://localhost:8000/v1/models

{
  "object": "list",
  "data": [
    {
      "id": "qwen2_5_1_5",
      "object": "model",
      "created": 1740235117,
      "owned_by": "vllm",
      "root": "/root/autodl-tmp/Qwen/Qwen2___5-1___5B-Instruct",
      "parent": null,
      "max_model_len": 32768,
      "permission": [
        {
          "id": "modelperm-514811fdf7464bc9bbd72db39850ef49",
          "object": "model_permission",
          "created": 1740235117,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
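
The same check can be done from Python with requests (a minimal sketch; it assumes the server started above is listening on localhost:8000):

import requests

# Ask the vLLM OpenAI-compatible server which models it is serving
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"], "max_model_len:", model.get("max_model_len"))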

OpenAI Completions API (text generation)

This generally refers to plain text generation, i.e. non-chat, non-conversational usage.

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen2_5_1_5",
        "prompt": "你好,你是?",
        "max_tokens": 2048,
        "temperature": 0.7,
        "top_p": 0.8,
        "repetition_penalty": 1.05
    }'

Before adding repetition_penalty, the output contains a lot of repetition; after adding it, the repetition largely disappears.

"max_tokens": 2048, 生成文本的最大长度(以token为单位)
"temperature": 0.7, 生成文本的随机性,值越小,创造能力越强,输出更加多样,适合创造场景
"top_p": 0.8, 核采样,类似温度,
"repetition_penalty": 1.05 用来减少重复词语出现的一个调节器。当其值大于1时(例如这里的 repetition_penalty: 1.05),会降低之前已经出现在生成文本中的词语被再次选中的概率。这有助于提高生成文本的新颖性,避免不必要的重复

{
  "id": "cmpl-adb0ffa50a1142a1880620126e341a70",
  "object": "text_completion",
  "created": 1740236869,
  "model": "qwen2_5_1_5",
  "choices": [
    {
      "index": 0,
      "text": " 答案:我是小明。 A. 错误 B. 正确\nA. 错误\n\n解析:\"我是小明\"是一个常见的自我介绍,但它并不一定正确,因为每个人在回答\"你是谁\"的问题时,通常会提供自己真实的姓名或名字。\"小明\"可能是一个昵称或别名,这取决。。。。。。。。。。。。。。。。。。。",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 2052,
    "completion_tokens": 2048,
    "prompt_tokens_details": null
  }
}
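
These request parameters map directly onto vLLM's offline SamplingParams, so the same settings can be reused with the vllmModel class from the verification step (a minimal sketch):

from vllm import SamplingParams

# Same knobs as the HTTP request above, for offline use with LLM.generate()
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    max_tokens=2048,
)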

Python, using the openai client:

from openai import OpenAI

# Set the OpenAI API key and API base so the client talks to vLLM's API server.
openai_api_key = "EMPTY"  # if no API key is required, it can be left empty or set to "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Create a Completion request
completion_response = client.completions.create(
    model="qwen2_5_1_5",
    prompt="你好,你是?",
    max_tokens=2048,
    temperature=0.7,
    top_p=0.8,
    extra_body={"repetition_penalty": 1.05},
)

# Check the response and print the result
if hasattr(completion_response, 'choices') and len(completion_response.choices) > 0:
    print("Successfully got data:")
    print(completion_response.choices[0].text)
else:
    print(f"Request failed, response: {completion_response}")

Python, calling the vLLM HTTP endpoint directly:

import requests
import json

# Set the request URL
url = "http://localhost:8000/v1/completions"

# Set the request headers
headers = {
    "Content-Type": "application/json"
}

# Prepare the request payload
data = {
    "model": "qwen2_5_1_5",
    "prompt": "你好,你是?",
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05
}

# Send the POST request
response = requests.post(url, headers=headers, data=json.dumps(data))

# Check the response status and print the result
if response.status_code == 200:
    try:
        # Try to read the 'choices' field of the response and get the generated text
        choices = response.json().get('choices', [])
        if choices:
            print("Successfully got data:")
            # Note: the 'text' field may need adjusting to match the actual API response structure
            print(choices[0].get('text', ''))
        else:
            print("No choices returned")
    except json.JSONDecodeError:
        print("Failed to parse the JSON response")
else:
    print(f"Request failed, status code: {response.status_code}, response: {response.text}")

OpenAI Chat Completions API (chat)

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen2_5_1_5",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "你好,你是?"}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 2048
}'

{
  "id": "chatcmpl-189df22813aa4b358e795596f0e2c420",
  "object": "chat.completion",
  "created": 1740238345,
  "model": "qwen2_5_1_5",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "你好!我是阿里云开发的一款语言模型,我叫通义千问。",
        "tool_calls": [
          
        ]
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 33,
    "total_tokens": 51,
    "completion_tokens": 18,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Python, using the openai client:

from openai import OpenAI

# Set the OpenAI API key and API base so the client talks to vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="qwen2_5_1_5",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "你是谁?"},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "repetition_penalty": 1.05,
    },
)

# Access the response object's attributes directly instead of calling .json()
print("Chat response:", chat_response.choices[0].message.content)

Python, calling the vLLM HTTP endpoint directly:

import requests
import json

# Set the API base URL and endpoint
base_url = "http://localhost:8000/v1"
endpoint = "/chat/completions"

url = base_url + endpoint

# Set the request headers
headers = {
    "Content-Type": "application/json",
}

# Prepare the request payload
data = {
    "model": "qwen2_5_1_5",
    "messages": [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "你是谁?"},
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 512,
    "repetition_penalty": 1.05,  # 直接包含在请求体中
}

# Send the POST request
response = requests.post(url, headers=headers, data=json.dumps(data))

# Check the response status and print the result
if response.status_code == 200:
    print("Chat response:", response.json()['choices'][0]['message'])
else:
    print(f"Request failed, status code: {response.status_code}, response: {response.text}")

Viewing and killing the vLLM process

ps aux | grep vllm

kill -9 6947

Sometimes the service has been killed but the GPU memory is not released.
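
To see which processes are still holding GPU memory after the kill, you can query nvidia-smi (a minimal sketch; assumes nvidia-smi is on PATH). Any surviving vLLM worker process listed there can be killed the same way, after which the memory should be freed:

import subprocess

# List compute processes still holding GPU memory (PID, process name, memory used)
result = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)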

llama3

Same procedure as qwen2.5.

The llama model tends to repeat itself in its answers; passing extra_body={"repetition_penalty": 1.2} works well, with 1.2 giving the best results in my tests. Values of 1.3, 1.4, and 1.5 produce garbled, redundant text, while 1.1 still repeats.
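
For example (a sketch; the served model name llama3_8b is hypothetical and should match whatever you passed to --served-model-name):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Same chat call as before, with the stronger repetition penalty that worked well for llama3
chat_response = client.chat.completions.create(
    model="llama3_8b",  # hypothetical served model name; use your own --served-model-name
    messages=[{"role": "user", "content": "你好,你是?"}],
    temperature=0.7,
    max_tokens=512,
    extra_body={"repetition_penalty": 1.2},
)
print(chat_response.choices[0].message.content)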

DeepSeek distilled Qwen

Starting vLLM requires one extra flag: --enforce-eager.

--enforce-eager always uses eager-mode PyTorch; if it is False, vLLM uses a hybrid of eager mode and CUDA graphs for maximum performance and flexibility.

vllm serve /root/autodl-tmp/deepseek-ai/DeepSeek-R1-Distill-Qwen-1___5B/ \
--trust-remote-code \
--served-model-name qwen2_5_1_5 \
--gpu-memory-utilization 0.7 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--dtype=half \
--enforce-eager

Parameter reference: VLLM参数解释-中文表格形式 (CSDN blog)

References:

https://github.com/datawhalechina/self-llm/blob/master/models/Qwen2.5/03-Qwen2.5-7B-Instruct%20vLLM%20%E9%83%A8%E7%BD%B2%E8%B0%83%E7%94%A8.md

使用vLLM部署Qwen2.5-VL-7B-Instruct模型的详细指南 (CSDN blog)

vLLM - Qwen
