vLLM - Basic Usage


I. Installation

conda create -n e39 python=3.9

conda activate e39

pip install vllm
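
A quick sanity check that the install succeeded is to print the package version (the exact version will vary with your environment):

python -c "import vllm; print(vllm.__version__)"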

II. Serving

1. vllm serve

vllm serve Qwen/Qwen2.5-1.5B-Instruct

This used about 20 GB of GPU memory. By default vLLM preallocates roughly 90% of the GPU's memory for the weights and KV cache (gpu_memory_utilization defaults to 0.9), so even a 1.5B model claims most of the card.
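
The port and memory footprint can be changed with command-line flags; for example (illustrative values, adjust for your hardware):

vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.5 \
    --max-model-len 8192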


2. Offline inference with Python

from vllm import LLM, SamplingParams  

# llm = LLM(model="facebook/opt-125m")
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)  
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

III. Calling the API

curl - list all models

curl http://10.0.1.23:8000/v1/models

{
	"object": "list",
	"data": [{
		"id": "Qwen/Qwen2.5-1.5B-Instruct",
		"object": "model",
		"created": 1741263998,
		"owned_by": "vllm",
		"root": "Qwen/Qwen2.5-1.5B-Instruct",
		"parent": null,
		"max_model_len": 32768,
		"permission": [{
			"id": "modelperm-79dc42186c2f46d085c3c98615f71e47",
			"object": "model_permission",
			"created": 1741263998,
			"allow_create_engine": false,
			"allow_sampling": true,
			"allow_logprobs": true,
			"allow_search_indices": false,
			"allow_view": true,
			"allow_fine_tuning": false,
			"organization": "*",
			"group": null,
			"is_blocking": false
		}]
	}]
}

curl - completions

curl http://10.0.1.23:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

{
	"id": "cmpl-3fba4fc307e04e3f8d656049437be215",
	"object": "text_completion",
	"created": 1741264121,
	"model": "Qwen/Qwen2.5-1.5B-Instruct",
	"choices": [{
		"index": 0,
		"text": " city in the state of California,",
		"logprobs": null,
		"finish_reason": "length",
		"stop_reason": null,
		"prompt_logprobs": null
	}],
	"usage": {
		"prompt_tokens": 4,
		"total_tokens": 11,
		"completion_tokens": 7,
		"prompt_tokens_details": null
	}
}

curl - chat/completions

curl http://10.0.1.23:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

{
	"id": "chatcmpl-eaa9d73518ae4900ac3ca03445b94856",
	"object": "chat.completion",
	"created": 1741264227,
	"model": "Qwen/Qwen2.5-1.5B-Instruct",
	"choices": [{
		"index": 0,
		"message": {
			"role": "assistant",
			"reasoning_content": null,
			"content": "The World Series in 2020 was played between the New York Yankees and the Boston Red Sox. The Yankees won in seven games, defeating the Red Sox across the regular and季后赛 (playoffs) seasons.",
			"tool_calls": []
		},
		"logprobs": null,
		"finish_reason": "stop",
		"stop_reason": null
	}],
	"usage": {
		"prompt_tokens": 31,
		"total_tokens": 76,
		"completion_tokens": 45,
		"prompt_tokens_details": null
	},
	"prompt_logprobs": null
}
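
The server also implements the OpenAI streaming protocol: adding "stream": true makes the response arrive as a sequence of server-sent events ("data: ..." chunks) rather than one JSON object. For example:

curl http://10.0.1.23:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [{"role": "user", "content": "Tell me a joke."}],
        "stream": true
    }'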

Python - completions

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://10.0.1.23:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
)
print("Completion result:", completion)

Completion result: Completion(
  id='cmpl-accdc1011da548a08ac8f297146e7791', 
  choices=[
    CompletionChoice(
      finish_reason='length', index=0, logprobs=None, 
      text=' great place with a lot of rich history. But in recent years, the people', 
      stop_reason=None, prompt_logprobs=None)
  ], 
  created=1741264345, model='Qwen/Qwen2.5-1.5B-Instruct', 
  object='text_completion', system_fingerprint=None, 
  usage=CompletionUsage(
    completion_tokens=16, prompt_tokens=4, 
    total_tokens=20, completion_tokens_details=None, 
    prompt_tokens_details=None))

Python - chat.completions

from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://10.0.1.23:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)

print("Chat response:", chat_response)

Chat response: ChatCompletion(
  id='chatcmpl-ae641c01fc54450b9a02bba0260c445b', 
  choices=[Choice(
    finish_reason='stop', index=0, logprobs=None, 
    message=ChatCompletionMessage(
      content='Why could the statue of liberty sleep 8 hours a day?\n\nBecause she had a full moon each month.', 
      refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], reasoning_content=None), stop_reason=None)], 
  created=1741264394, model='Qwen/Qwen2.5-1.5B-Instruct', 
  object='chat.completion', service_tier=None, system_fingerprint=None, 
  usage=CompletionUsage(
    completion_tokens=23, prompt_tokens=24, total_tokens=47, 
    completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
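
Streaming works through the OpenAI client as well by passing stream=True, which returns an iterator of deltas instead of a finished message; a minimal sketch:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://10.0.1.23:8000/v1")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)

# Each chunk carries a small delta of the generated text.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()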

2025-03-06 (Thu)

### vLLM DeepSeek-R1-14B Model Introduction

DeepSeek-R1 is a series of large-scale pretrained language models developed by DeepSeek, designed to deliver strong natural-language processing through advanced training techniques and optimization strategies. The 14B-parameter member of the series (DeepSeek-R1-14B) inherits these traits and further improves performance and efficiency[^2].

The model was improved in several areas:

- **Architectural innovation**: adopts a more efficient Transformer variant;
- **Data augmentation**: pretrained on a large-scale, high-quality corpus;
- **Distillation**: knowledge distillation is applied to preserve quality in the smaller distilled models.

For researchers who want to deploy high-performance NLP applications, it is a strong choice.

### Usage Tutorial

To help users get started quickly, the following is a basic guide to launching DeepSeek-R1-14B with `vllm`.

#### Install dependencies

With Python and an environment manager such as conda or virtualenv installed, set up the required packages following the official documentation.

#### Configure the service

Adjust the command-line arguments to your hardware and serving scenario. For example, the command below configures a single-GPU instance with a maximum input length of 16384 tokens and a few other options[^3]:

```bash
vllm serve DeepSeek-R1-Distill-Qwen-14B \
    --tensor-parallel-size 1 \
    --max-model-len 16384 \
    --enforce-eager \
    --dtype half \
    --gpu_memory_utilization 0.95
```

This starts an HTTP API listening on a local port (8000 by default) that external applications can send requests to.

### Download

The ModelScope (魔搭) community provides direct links to the different sizes of the DeepSeek-R1 models. To download DeepSeek-R1-14B, go to the corresponding page and locate the matching entry.

Please follow the platform's rules and copyright terms so that downloads stay legal and compliant.

### Performance Evaluation

Evaluation results show that even under constrained resources, the smaller models distilled and fine-tuned by the DeepSeek team achieve satisfying results. Their scores on several standard benchmarks confirm this: they maintain high accuracy while markedly reducing compute cost and latency.

In addition, because multiple precision modes are supported (such as half-precision floating point), the most suitable mode can be chosen for each inference workload, giving the best cost-performance ratio.