[LLM Inference] Basic Usage of the vLLM Inference Framework and Things to Watch Out For

Latest recommended article published on 2025-05-15 22:39:15

Mr.zwX

Categories: [NLP] Natural Language Processing, [Deep Learning / Neural Networks] Deep Learning. Tags: LLM, DeepSeek

Article link: https://blog.csdn.net/qq_16763983/article/details/146152772

Basic workflow

Installing vLLM

pip install vllm

Importing vLLM

from vllm import LLM, SamplingParams

LLM is the class that loads a model and runs inference with it; SamplingParams holds the sampling parameters used during generation.
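
To make the two sampling parameters used below more concrete, here is a minimal pure-Python sketch of what temperature and top_p conceptually do to a single next-token distribution. It is an illustration only, not vLLM's implementation; the function name sample_token and the toy logits are hypothetical, and the sketch assumes temperature > 0.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Pick one token index from raw logits with temperature + top-p sampling."""
    rng = rng or random.Random()
    # Temperature rescales the logits: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p (nucleus sampling).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample among the kept tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Hypothetical logits for a 3-token vocabulary; token 0 dominates.
print(sample_token([10.0, 0.0, 0.0]))
```

With a lower temperature or a smaller top_p, fewer candidates survive the cut, so the output becomes more deterministic; larger values allow more variety.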

Defining the prompts and sampling parameters

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

Loading the model

llm = LLM(model="facebook/opt-125m")

Running inference and printing the results

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Full code

import os
# Set before importing vllm so the restriction applies before CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"

from vllm import LLM, SamplingParams

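llm = LLM(
    model='/home/llm_ckpts/DeepSeek-R1-Distill-Qwen-7B',  # model name or local checkpoint path
    max_model_len=32768,  # maximum context length in tokens (prompt + generated output)
    gpu_memory_utilization=0.95,  # let vLLM use up to 95% of the GPU's memory
    # tensor_parallel_size defaults to 1, so only the first visible GPU is used
)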

prompts = [
    # "Hello, my name is",
    # "The president of the United States is",
    # "The capital of France is",
    # "The future of AI is",
    # "I want to compute the result of 2 * (1 + 4), please think step by step and give me the answer.",
    "Solve the following math problem efficiently and clearly.  The last line of your response should be of the following format: 'Therefore, the final answer is: $\\boxed{{ANSWER}}$. I hope it is correct' (without quotes) where ANSWER is just the final number or expression that solves the problem. Think step by step before answering. What is the result of 2 * (1 + 4)?"
]
sampling_params = SamplingParams(
    temperature=0.6, 
    top_p=0.95,  
    max_tokens=1024,  # max number of tokens to generate
    seed=42,  # random seed
    )

outputs = llm.generate(prompts, sampling_params)

# print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r},\nGenerated text: {generated_text!r}\n")

[Figure: screenshot of the generated output]
The answer and its format are both completely correct.
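
Rather than checking the format by eye, the final answer can be pulled out of the generated text programmatically. The following is a minimal sketch, assuming the model follows the requested format; the helper name extract_boxed is hypothetical, and the pattern only handles non-nested answers.

```python
import re


def extract_boxed(text):
    """Return the content of the last boxed answer in text, or None if absent."""
    # Accepts both \boxed{10} and \boxed{{10}}; nested braces are not handled.
    matches = re.findall(r"\\boxed\{+([^{}]*)\}+", text)
    return matches[-1].strip() if matches else None


sample = "2 * (1 + 4) = 10. Therefore, the final answer is: $\\boxed{10}$. I hope it is correct"
print(extract_boxed(sample))  # 10
```

In the full code above, this would be applied to output.outputs[0].text for each result.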

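Things to watch out for

Random seed
With sampling enabled, the seed must be fixed in SamplingParams for the same input to produce the same output on every run.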
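
As a toy illustration of why the seed matters, independent of vLLM, the sketch below draws "tokens" from a fixed distribution. The vocabulary, weights, and function name are all hypothetical; the point is only that a seeded generator reproduces the same sequence of random draws.

```python
import random

# Hypothetical 4-token vocabulary with a fixed next-token distribution.
vocab = ["Paris", "London", "Berlin", "Rome"]
weights = [0.7, 0.1, 0.1, 0.1]


def sample_sequence(seed, length=8):
    """Draw a token sequence; a fixed seed makes the draws repeatable."""
    rng = random.Random(seed)  # private generator, independent of global state
    return rng.choices(vocab, weights=weights, k=length)


run_a = sample_sequence(seed=42)
run_b = sample_sequence(seed=42)
print(run_a == run_b)  # True: same seed, same draws
```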

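Maximum number of generated tokens
max_tokens in SamplingParams must be raised so that more complex reasoning tasks can be worked through to the end rather than cut off partway.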
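
One way to tell whether max_tokens was too small is to look at finish_reason on each completion: vLLM reports "length" when generation stopped at the token limit and "stop" when it ended normally. The sketch below uses stand-in objects shaped like vLLM's results so that it runs without a GPU; the helper name truncated_indices and the fake data are hypothetical.

```python
from types import SimpleNamespace


def truncated_indices(request_outputs):
    """Return indices of requests whose first completion hit the token limit."""
    return [
        i
        for i, req in enumerate(request_outputs)
        if req.outputs[0].finish_reason == "length"
    ]


# Stand-ins for the objects returned by llm.generate (assumed shape).
fake_outputs = [
    SimpleNamespace(outputs=[SimpleNamespace(text="... the answer is 10", finish_reason="stop")]),
    SimpleNamespace(outputs=[SimpleNamespace(text="First, we expand", finish_reason="length")]),
]
print(truncated_indices(fake_outputs))  # [1]
```

Any prompt whose index is returned is a candidate for rerunning with a larger max_tokens.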

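Prompt phrasing
For a model that is not especially large, such as a 7B one, answer quality depends heavily on how the prompt is phrased. For example, merely changing “What is the result of 2 * (1 + 4)?” above to “Compute the result of 2 * (1 + 4).” made the generated content considerably more muddled, and the model no longer reached the correct answer.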

[Figure: screenshot of the generated output]
As another example, asking “The capital of France is” gives the following result:

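[Figure: screenshot of the generated output]
The result is correct, but it is followed by a large amount of redundant content, probably because the prompt places no explicit constraint on the format of the answer.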
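
A format constraint of this kind can be attached to any question with a small template helper. This is a sketch under assumptions: the names ANSWER_FORMAT and with_answer_format are hypothetical, and the doubled braces around ANSWER are treated as str.format escapes, which is a guess about where they came from in the prompt text.

```python
ANSWER_FORMAT = (
    "The last line of your response should be of the following format: "
    "'Therefore, the final answer is: $\\boxed{{ANSWER}}$. I hope it is correct' "
    "(without quotes) where ANSWER is just the final answer that solves the problem. "
    "{question}"
)


def with_answer_format(question):
    """Prefix a question with the explicit answer-format instruction."""
    # str.format fills {question} and collapses the escaped {{ANSWER}} to {ANSWER}.
    return ANSWER_FORMAT.format(question=question)


print(with_answer_format("What is the capital of France?"))
```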
If I write the prompt out in full, including the answer-format constraint, “The last line of your response should be of the following format: 'Therefore, the final answer is: $\\boxed{{ANSWER}}$. I hope it is correct' (without quotes) where ANSWER is just the final answer that solves the problem. What is the capital of France?”, the result is much better: