vLLM Official Tutorial (Translated): Two Ways to Use vLLM (Offline Inference and the vLLM Server)

Latest recommended article published on 2025-03-28 19:28:43

OpenAppAI



Tags: vllm, large model inference, inference framework

Article link: https://blog.csdn.net/my_name_is_learn/article/details/145885639


English original: https://docs.vllm.ai/en/stable/serving/env_vars.html


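Offline Inference

You can run vLLM in your own code on a list of prompts.

The offline API is based on the LLM class. To initialize the vLLM engine, create a new LLM instance and specify the model to run.

For example, the following code downloads the facebook/opt-125m model from HuggingFace and runs it in vLLM with the default configuration.

from vllm import LLM
llm = LLM(model="facebook/opt-125m")

After initializing the LLM instance, you can perform model inference using various APIs. The available APIs depend on the type of model being run:

Generative models output logprobs, which are sampled from to obtain the final output text.
Pooling models output their hidden states directly.

For details on each API, see the corresponding pages and the API Reference.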

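Configuration Options

This section lists the most commonly used options for running the vLLM engine. For the full list, see the "Engine Arguments" page.

Model Resolution

vLLM loads HuggingFace-compatible models by inspecting the architectures field in the config.json of the model repository and finding the corresponding implementation registered with vLLM. However, model resolution may fail for the following reasons:

The config.json of the model repository lacks the architectures field.
Unofficial repositories refer to a model using alternative names that are not recorded in vLLM.
Multiple models share the same architecture name, which makes it ambiguous which model should be loaded.

To fix this, explicitly specify the model architecture by passing config.json overrides to the hf_overrides option. For example:

from vllm import LLM

model = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
)

Our list of supported models shows the model architectures that vLLM recognizes.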
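To make the resolution rule above concrete, here is a minimal sketch of the lookup, written in plain Python. It is not vLLM's actual implementation: the registry set, the function name resolve_architecture, and the way overrides are merged are all assumptions made for illustration.

```python
import json

# Hypothetical registry for illustration only; vLLM's real registry is far larger.
REGISTERED_ARCHITECTURES = {"GPT2LMHeadModel", "OPTForCausalLM", "LlamaForCausalLM"}


def resolve_architecture(config, hf_overrides=None):
    """Return the first architecture in the (overridden) config that is registered."""
    # Overrides win over whatever config.json says (assumed merge behavior).
    merged = {**config, **(hf_overrides or {})}
    architectures = merged.get("architectures") or []
    if not architectures:
        raise ValueError("config.json has no 'architectures' field")
    for name in architectures:
        if name in REGISTERED_ARCHITECTURES:
            return name
    raise ValueError(f"no registered implementation for {architectures}")


# A config.json without 'architectures' only resolves once an override is supplied.
config = json.loads('{"model_type": "gpt2"}')
print(resolve_architecture(config, {"architectures": ["GPT2LMHeadModel"]}))
```

The first two failure modes listed above map onto the two ValueError branches; the third (ambiguous names) is not modeled here.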

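Reducing Memory Usage

Large models may cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.

Tensor Parallelism

Tensor parallelism (the tensor_parallel_size option) can be used to split the model across multiple GPUs.

The following code splits the model across 2 GPUs.

from vllm import LLM
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)

To ensure that vLLM initializes CUDA correctly, avoid calling related functions (e.g. torch.cuda.set_device()) before initializing vLLM. Otherwise, you may run into an error like RuntimeError: Cannot re-initialize CUDA in forked subprocess.

To control which devices are used, set the CUDA_VISIBLE_DEVICES environment variable instead.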
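As a small sketch of that advice, the variable can be set from Python itself, as long as this happens before anything initializes CUDA. The helper name restrict_visible_gpus is hypothetical; only the CUDA_VISIBLE_DEVICES variable comes from the text.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Set CUDA_VISIBLE_DEVICES; call this before anything initializes CUDA."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose only GPUs 0 and 1 to this process.
restrict_visible_gpus([0, 1])
print(os.environ["CUDA_VISIBLE_DEVICES"])

# Only after this point would one import vllm and create the engine, e.g.
# LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
```

Setting the variable in the shell (CUDA_VISIBLE_DEVICES=0,1 python script.py) has the same effect and avoids any ordering concerns inside the script.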

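Quantization

Quantized models take less memory at the cost of lower precision.

Statically quantized models can be downloaded from HF Hub (some popular ones are provided by Neural Magic) and used directly without extra configuration.

Dynamic quantization is also supported via the quantization option; see the quantization documentation for details.

Context Length and Batch Size

You can further reduce memory usage by limiting the model's context length (the max_model_len option) and the maximum batch size (the max_num_seqs option).

from vllm import LLM
llm = LLM(model="adept/fuyu-8b",
          max_model_len=2048,
          max_num_seqs=2)

Context length (max_model_len):
This parameter defines the maximum number of tokens (prompt plus generated output) that the model can handle in a single sequence. It determines how much preceding text the model can take into account when generating the next token. A longer context lets the model capture longer-range dependencies, but it also increases compute cost and memory requirements.
Maximum batch size (max_num_seqs):
This parameter is the maximum number of sequences (requests) processed together in a single iteration. vLLM only performs inference, so no model weights are updated here. A larger batch generally improves throughput, but it also consumes more memory.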
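To see why both options affect memory, here is a back-of-the-envelope estimate of the key/value cache needed if every sequence in a batch reached the maximum length. This is only a worst-case sketch, not how vLLM actually sizes its cache, and the model dimensions (32 layers, 8 KV heads, head size 128, 2-byte elements) are hypothetical values chosen for illustration.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes needed to store keys and values for one token across all layers."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def worst_case_kv_cache_bytes(max_model_len, max_num_seqs, bytes_per_token):
    """Upper bound if max_num_seqs sequences all grow to max_model_len tokens."""
    return max_model_len * max_num_seqs * bytes_per_token


# Hypothetical model dimensions, fp16 cache (2 bytes per element).
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8,
                                     head_dim=128, bytes_per_element=2)
total = worst_case_kv_cache_bytes(max_model_len=2048, max_num_seqs=2,
                                  bytes_per_token=per_token)
print(f"{per_token} bytes per token, {total / 2**30:.2f} GiB worst case")
```

Halving either max_model_len or max_num_seqs halves this bound, which is the intuition behind using the two options to save memory.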

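Performance Optimization and Tuning

You can potentially improve the performance of vLLM by tuning various options. See the optimization and tuning guide in the vLLM documentation for details.

OpenAI-Compatible Server

vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more!

You can start the server via the vllm serve command or through Docker: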

vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123

To call the server, you can use the official OpenAI Python client or any other HTTP client.

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
  model="NousResearch/Meta-Llama-3-8B-Instruct",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message)
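For the "any other HTTP client" route, the following sketch builds the same chat request with Python's standard library only. The helper name build_chat_request is hypothetical, and the request is only constructed here, not sent; sending it assumes a server started as in the vllm serve command above.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, user_message):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The API key is sent as a bearer token, as the OpenAI client does.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1",
    "token-abc123",
    "NousResearch/Meta-Llama-3-8B-Instruct",
    "Hello!",
)
print(request.get_method(), request.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"])
```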

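Supported APIs

We currently support the following OpenAI APIs:

Completions API (/v1/completions)

Only applicable to text generation models (--task generate).
Note: the suffix parameter is not supported.

Chat Completions API (/v1/chat/completions)

Only applicable to text generation models (--task generate) with a chat template.
Note: the parallel_tool_calls and user parameters are not supported.

Embeddings API (/v1/embeddings)

Only applicable to embedding models (--task embed).

Transcriptions API (/v1/audio/transcriptions)

Only applicable to automatic speech recognition (ASR) models (OpenAI Whisper) (--task generate).

Chat Template

For a language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.

An example chat template for NousResearch/Meta-Llama-3-8B-Instruct is linked from the original documentation.

Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models, you can specify the chat template manually with the --chat-template parameter, passing either the file path to the chat template or the template in string form. Without a chat template, the server cannot process chat and all chat requests will fail.

vllm serve <model> --chat-template ./path-to-chat-template.jinja

The vLLM community provides a set of chat templates for popular models. You can find them in the examples directory.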


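With the inclusion of multimodal chat APIs, the OpenAI spec now accepts chat messages in a new format that specifies both a type and a text field. An example is provided below: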

completion = client.chat.completions.create(
  model="NousResearch/Meta-Llama-3-8B-Instruct",
  messages=[
    {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
  ]
)

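Most LLM chat templates expect the content field to be a string, but some newer models (such as meta-llama/Llama-Guard-3-1B) expect the content to be formatted according to the OpenAI schema in the request.

vLLM provides best-effort support to detect this automatically, which is logged as a string like "Detected the chat template content format to be…", and internally converts incoming requests to match the detected format, which can be one of:

"string": A string.

Example: "Hello world"

"openai": A list of dictionaries, similar to the OpenAI schema.

Example: [{"type": "text", "text": "Hello world!"}]

If the result is not what you expect, you can set the --chat-template-content-format CLI argument to override which format to use.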
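The two content formats carry the same information for text-only messages, so converting between them is mechanical. The sketch below illustrates the idea; the function names are hypothetical, and joining multiple text parts with a newline is an assumption rather than a statement about vLLM's internal behavior.

```python
def to_openai_format(content):
    """Wrap a plain string as a list of typed parts; pass lists through unchanged."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content


def to_string_format(content):
    """Flatten a list of typed parts to one string; pass strings through unchanged."""
    if isinstance(content, str):
        return content
    texts = [part["text"] for part in content if part.get("type") == "text"]
    # Assumption: multiple text parts are joined with a newline.
    return "\n".join(texts)


print(to_openai_format("Hello world!"))
print(to_string_format([{"type": "text", "text": "Hello world!"}]))
```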

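Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API. To use them, pass them as extra parameters in the OpenAI client. If you call the server directly over HTTP, you can also merge them straight into the JSON payload.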

completion = client.chat.completions.create(
  model="NousResearch/Meta-Llama-3-8B-Instruct",
  messages=[
    {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
  ],
  extra_body={
    "guided_choice": ["positive", "negative"]
  }
)
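For the direct HTTP route, "merging into the JSON payload" simply means the extra keys sit next to model and messages at the top level of the request body. A minimal sketch, with a hypothetical helper name build_payload:

```python
import json


def build_payload(model, messages, **extra_params):
    """Return a chat request body with vLLM-specific parameters merged at top level."""
    payload = {"model": model, "messages": messages}
    payload.update(extra_params)
    return payload


payload = build_payload(
    "NousResearch/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    guided_choice=["positive", "negative"],
)
print(json.dumps(payload, indent=2))
```

This is the body that the extra_body argument of the OpenAI client produces for the example above, expressed without the client.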

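Extra HTTP Headers

Only the X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.

Note that enabling the headers can significantly impact performance at high QPS rates. We therefore recommend implementing HTTP headers at the router level (e.g. via Istio) rather than within the vLLM layer. See the related PR for more details.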

completion = client.chat.completions.create(
  model="NousResearch/Meta-Llama-3-8B-Instruct",
  messages=[
    {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
  ],
  extra_headers={
    "x-request-id": "sentiment-classification-00001",
  }
)
print(completion._request_id)

completion = client.completions.create(
  model="NousResearch/Meta-Llama-3-8B-Instruct",
  prompt="A robot may not injure a human being",
  extra_headers={
    "x-request-id": "completion-test",
  }
)
print(completion._request_id)

CLI Arguments

The vllm serve command is used to launch the OpenAI-compatible server.

usage: vllm serve [-h] [--host HOST] [--port PORT]
                  [--uvicorn-log-level {debug,info,warning,error,critical,trace}]
                  [--allow-credentials] [--allowed-origins ALLOWED_ORIGINS]
                  [--allowed-methods ALLOWED_METHODS]
                  [--allowed-headers ALLOWED_HEADERS] [--api-key API_KEY]
                  [--lora-modules LORA_MODULES [LORA_MODULES ...]]
                  [--prompt-adapters PROMPT_ADAPTERS [PROMPT_ADAPTERS ...]]
                  [--chat-template CHAT_TEMPLATE]
                  [--chat-template-content-format {auto,string,openai}]
                  [--response-role RESPONSE_ROLE] [--ssl-keyfile SSL_KEYFILE]
                  [--ssl-certfile SSL_CERTFILE] [--ssl-ca-certs SSL_CA_CERTS]
                  [--ssl-cert-reqs SSL_CERT_REQS] [--root-path ROOT_PATH]
                  [--middleware MIDDLEWARE] [--return-tokens-as-token-ids]
                  [--disable-frontend-multiprocessing]
                  [--enable-request-id-headers] [--enable-auto-tool-choice]
                  [--enable-reasoning] [--reasoning-parser {deepseek_r1}]
                  [--tool-call-parser {granite-20b-fc,granite,hermes,internlm,jamba,llama3_json,mistral,pythonic} or name registered in --tool-parser-plugin]
                  [--tool-parser-plugin TOOL_PARSER_PLUGIN] [--model MODEL]
                  [--task {auto,generate,embedding,embed,classify,score,reward,transcription}]
                  [--tokenizer TOKENIZER] [--skip-tokenizer-init]
                  [--revision REVISION] [--code-revision CODE_REVISION]
                  [--tokenizer-revision TOKENIZER_REVISION]
                  [--tokenizer-mode {auto,slow,mistral,custom}]
                  [--trust-remote-code]
                  [--allowed-local-media-path ALLOWED_LOCAL_MEDIA_PATH]
                  [--download-dir DOWNLOAD_DIR]
                  [--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer}]
                  [--config-format {auto,hf,mistral}]
                  [--dtype {auto,half,float16,bfloat16,float,float32}]
                  [--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}]
                  [--max-model-len MAX_MODEL_LEN]
                  [--guided-decoding-backend {outlines,lm-format-enforcer,xgrammar}]
                  [--logits-processor-pattern LOGITS_PROCESSOR_PATTERN]
                  [--model-impl {auto,vllm,transformers}]
                  [--distributed-executor-backend {ray,mp,uni,external_launcher}]
                  [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                  [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
                  [--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS]
                  [--ray-workers-use-nsight] [--block-size {8,16,32,64,128}]
                  [--enable-prefix-caching | --no-enable-prefix-caching]
                  [--disable-sliding-window] [--use-v2-block-manager]
                  [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS] [--seed SEED]
                  [--swap-space SWAP_SPACE] [--cpu-offload-gb CPU_OFFLOAD_GB]
                  [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
                  [--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE]
                  [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS]
                  [--max-num-partial-prefills MAX_NUM_PARTIAL_PREFILLS]
                  [--max-long-partial-prefills MAX_LONG_PARTIAL_PREFILLS]
                  [--long-prefill-token-threshold LONG_PREFILL_TOKEN_THRESHOLD]
                  [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS]
                  [--disable-log-stats]
                  [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,ptpc_fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,quark,moe_wna16,None}]
                  [--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA]
                  [--hf-overrides HF_OVERRIDES] [--enforce-eager]
                  [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE]
                  [--disable-custom-all-reduce]
                  [--tokenizer-pool-size TOKENIZER_POOL_SIZE]
                  [--tokenizer-pool-type TOKENIZER_POOL_TYPE]
                  [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG]
                  [--limit-mm-per-prompt LIMIT_MM_PER_PROMPT]
                  [--mm-processor-kwargs MM_PROCESSOR_KWARGS]
                  [--disable-mm-preprocessor-cache] [--enable-lora]
                  [--enable-lora-bias] [--max-loras MAX_LORAS]
                  [--max-lora-rank MAX_LORA_RANK]
                  [--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE]
                  [--lora-dtype {auto,float16,bfloat16}]
                  [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS]
                  [--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras]
                  [--enable-prompt-adapter]
                  [--max-prompt-adapters MAX_PROMPT_ADAPTERS]
                  [--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN]
                  [--device {auto,cuda,neuron,cpu,openvino,tpu,xpu,hpu}]
                  [--num-scheduler-steps NUM_SCHEDULER_STEPS]
                  [--multi-step-stream-outputs [MULTI_STEP_STREAM_OUTPUTS]]
                  [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR]
                  [--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]
                  [--speculative-model SPECULATIVE_MODEL]
                  [--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,ptpc_fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,quark,moe_wna16,None}]
                  [--num-speculative-tokens NUM_SPECULATIVE_TOKENS]
                  [--speculative-disable-mqa-scorer]
                  [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
                  [--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN]
                  [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE]
                  [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
                  [--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN]
                  [--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]
                  [--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD]
                  [--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]
                  [--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]]
                  [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]
                  [--ignore-patterns IGNORE_PATTERNS]
                  [--preemption-mode PREEMPTION_MODE]
                  [--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]]
                  [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH]
                  [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
                  [--collect-detailed-traces COLLECT_DETAILED_TRACES]
                  [--disable-async-output-proc]
                  [--scheduling-policy {fcfs,priority}]
                  [--scheduler-cls SCHEDULER_CLS]
                  [--override-neuron-config OVERRIDE_NEURON_CONFIG]
                  [--override-pooler-config OVERRIDE_POOLER_CONFIG]
                  [--compilation-config COMPILATION_CONFIG]
                  [--kv-transfer-config KV_TRANSFER_CONFIG]
                  [--worker-cls WORKER_CLS]
                  [--generation-config GENERATION_CONFIG]
                  [--override-generation-config OVERRIDE_GENERATION_CONFIG]
                  [--enable-sleep-mode] [--calculate-kv-scales]
                  [--additional-config ADDITIONAL_CONFIG]
                  [--disable-log-requests] [--max-log-len MAX_LOG_LEN]
                  [--disable-fastapi-docs] [--enable-prompt-tokens-details]

API Details

Completions API

Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.

# SPDX-License-Identifier: Apache-2.0

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Completion API
stream = False
completion = client.completions.create(
    model=model,
    prompt="A robot may not injure a human being",
    echo=False,
    n=2,
    stream=stream,
    logprobs=3)

print("Completion results:")
if stream:
    for c in completion:
        print(c)
else:
    print(completion)

Extra parameters

The following sampling parameters are supported:

    use_beam_search: bool = False
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    length_penalty: float = 1.0
    stop_token_ids: Optional[List[int]] = Field(default_factory=list)
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    allowed_token_ids: Optional[List[int]] = None
    prompt_logprobs: Optional[int] = None

The following extra parameters are supported:

    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."),
    )
    response_format: Optional[ResponseFormat] = Field(
        default=None,
        description=
        ("Similar to chat completion, this parameter specifies the format of "
         "output. Only {'type': 'json_object'}, {'type': 'json_schema'} or "
         "{'type': 'text' } is supported."),
    )
    guided_json: Optional[Union[str, dict, BaseModel]] = Field(
        default=None,
        description="If specified, the output will follow the JSON schema.",
    )
    guided_regex: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the regex pattern."),
    )
    guided_choice: Optional[List[str]] = Field(
        default=None,
        description=(
            "If specified, the output will be exactly one of the choices."),
    )
    guided_grammar: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the context free grammar."),
    )
    guided_decoding_backend: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default guided decoding backend "
            "of the server for this specific request. If set, must be one of "
            "'outlines' / 'lm-format-enforcer'"))
    guided_whitespace_pattern: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default whitespace pattern "
            "for guided json decoding."))
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."))
    logits_processors: Optional[LogitsProcessors] = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."))

Chat API

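Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.

We support both vision-related and audio-related parameters; see the multimodal inputs guide for more information.
Note: the image_url.detail parameter is not supported.

Code example: examples/online_serving/openai_chat_completion_client.py, reproduced below: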

# SPDX-License-Identifier: Apache-2.0

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant."
    }, {
        "role": "user",
        "content": "Who won the world series in 2020?"
    }, {
        "role":
        "assistant",
        "content":
        "The Los Angeles Dodgers won the World Series in 2020."
    }, {
        "role": "user",
        "content": "Where was it played?"
    }],
    model=model,
)

print("Chat completion results:")
print(chat_completion)

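Code example: passing an image. The snippet below is written against the OpenAI service; to target a vLLM server, pass base_url and api_key to OpenAI() as in the previous example and use a vision-capable model that your server is serving.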

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0])

Extra parameters

The following sampling parameters are supported:


    best_of: Optional[int] = None
    use_beam_search: bool = False
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    length_penalty: float = 1.0
    stop_token_ids: Optional[List[int]] = Field(default_factory=list)
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    prompt_logprobs: Optional[int] = None

The following extra parameters are supported:

    echo: bool = Field(
        default=False,
        description=(
            "If true, the new message will be prepended with the last message "
            "if they belong to the same role."),
    )
    add_generation_prompt: bool = Field(
        default=True,
        description=
        ("If true, the generation prompt will be added to the chat template. "
         "This is a parameter used by chat template in tokenizer config of the "
         "model."),
    )
    continue_final_message: bool = Field(
        default=False,
        description=
        ("If this is set, the chat will be formatted so that the final "
         "message in the chat is open-ended, without any EOS tokens. The "
         "model will continue this message rather than starting a new one. "
         "This allows you to \"prefill\" part of the model's response for it. "
         "Cannot be used at the same time as `add_generation_prompt`."),
    )
    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."),
    )
    documents: Optional[List[Dict[str, str]]] = Field(
        default=None,
        description=
        ("A list of dicts representing documents that will be accessible to "
         "the model if it is performing RAG (retrieval-augmented generation)."
         " If the template does not support RAG, this argument will have no "
         "effect. We recommend that each document should be a dict containing "
         "\"title\" and \"text\" keys."),
    )
    chat_template: Optional[str] = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."),
    )
    chat_template_kwargs: Optional[Dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the template renderer. "
                     "Will be accessible by the chat template."),
    )
    mm_processor_kwargs: Optional[Dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    guided_json: Optional[Union[str, dict, BaseModel]] = Field(
        default=None,
        description=("If specified, the output will follow the JSON schema."),
    )
    guided_regex: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the regex pattern."),
    )
    guided_choice: Optional[List[str]] = Field(
        default=None,
        description=(
            "If specified, the output will be exactly one of the choices."),
    )
    guided_grammar: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the context free grammar."),
    )
    guided_decoding_backend: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default guided decoding backend "
            "of the server for this specific request. If set, must be either "
            "'outlines' / 'lm-format-enforcer'"))
    guided_whitespace_pattern: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default whitespace pattern "
            "for guided json decoding."))
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."))
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."))
    logits_processors: Optional[LogitsProcessors] = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."))

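Distributed Inference and Serving

How to decide the distributed inference strategy?

Before going into the details of distributed inference and serving, let us first make clear when to use distributed inference and which strategies are available. The common practice is:

Single GPU (no distributed inference): If your model fits on a single GPU, you probably do not need distributed inference. Just run inference on that GPU.
Single node, multiple GPUs (tensor parallel inference): If your model is too large for a single GPU but fits on a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if a single node has 4 GPUs, you can set the tensor parallel size to 4.
Multiple nodes, multiple GPUs (tensor parallel plus pipeline parallel inference): If your model is too large for a single node, you can use tensor parallelism together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use on each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs across 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.

In short, you should increase the number of GPUs and the number of nodes until there is enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.

After adding enough GPUs and nodes to hold the model, you can run vLLM first. It prints some logs such as # GPU blocks: 790. Multiply that number by 16 (the block size) to get a rough estimate of the maximum number of tokens that can be served with the current configuration. If this number is not satisfactory, for example because you want higher throughput, you can further increase the number of GPUs or nodes until the number of blocks is sufficient.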
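The estimate described above is a single multiplication. The sketch below spells it out for the logged value of 790 blocks; the function name approx_max_tokens is hypothetical, and 16 is the block size quoted in the text.

```python
def approx_max_tokens(num_gpu_blocks, block_size=16):
    """Rough number of tokens the KV cache can hold: blocks times tokens per block."""
    return num_gpu_blocks * block_size


# "# GPU blocks: 790" from the log, with the block size of 16 mentioned above.
print(approx_max_tokens(790))  # 790 * 16 = 12640
```

If the server is started with a different --block-size, pass that value as block_size instead.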

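Note: There is one edge case. If the model fits on a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.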
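The rules of thumb in this section, including the edge case, can be summarized as a small decision function. This is only a sketch of the guidance given here: the function name and the splits_evenly flag are hypothetical, and deciding whether a model actually splits evenly across a given number of GPUs is left to the caller.

```python
def choose_parallel_sizes(num_nodes, gpus_per_node, splits_evenly=True):
    """Suggest tensor and pipeline parallel sizes following the rules of thumb above."""
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    if num_nodes == 1 and not splits_evenly:
        # Edge case: single node, model cannot be split evenly across its GPUs.
        return {"tensor_parallel_size": 1, "pipeline_parallel_size": gpus_per_node}
    # General rule: TP = GPUs per node, PP = number of nodes.
    return {"tensor_parallel_size": gpus_per_node, "pipeline_parallel_size": num_nodes}


print(choose_parallel_sizes(num_nodes=1, gpus_per_node=4))
print(choose_parallel_sizes(num_nodes=2, gpus_per_node=8))
print(choose_parallel_sizes(num_nodes=1, gpus_per_node=3, splits_evenly=False))
```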

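Running vLLM on a single node

vLLM supports distributed tensor parallel and pipeline parallel inference and serving. Currently, we support Megatron-LM's tensor parallel algorithm. The distributed runtime is managed with either Ray or Python native multiprocessing. Multiprocessing can be used when deploying on a single node; multi-node inference currently requires Ray.

Ray manages and coordinates the distributed runtime across multiple compute nodes, which vLLM relies on for multi-node tensor parallel and pipeline parallel inference and serving. As a general framework for distributed training and inference, it handles task scheduling, resource allocation, and data exchange across physical or virtual machines.

By default, multiprocessing is used when vLLM is not running in a Ray placement group and the same node has enough GPUs to satisfy the configured tensor_parallel_size; otherwise Ray is used.

This default can be overridden via the distributed_executor_backend argument of the LLM class or the --distributed-executor-backend argument of the API server. Set it to mp for multiprocessing or ray for Ray. Ray does not need to be installed in the multiprocessing case.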
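The default selection rule described above can be written as a tiny predicate. This sketch only restates the rule from the text; the function and parameter names are hypothetical, and it is not vLLM's actual selection code.

```python
def default_executor_backend(in_ray_placement_group, gpus_on_this_node,
                             tensor_parallel_size):
    """Return "mp" or "ray" following the default rule described in the text."""
    if not in_ray_placement_group and gpus_on_this_node >= tensor_parallel_size:
        return "mp"
    return "ray"


print(default_executor_backend(False, gpus_on_this_node=4, tensor_parallel_size=4))
print(default_executor_backend(False, gpus_on_this_node=4, tensor_parallel_size=8))
print(default_executor_backend(True, gpus_on_this_node=8, tensor_parallel_size=4))
```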

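To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:

from vllm import LLM
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")

To run multi-GPU serving, pass the --tensor-parallel-size argument when starting the server. For example, to run the API server on 4 GPUs: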

 vllm serve facebook/opt-13b \
     --tensor-parallel-size 4

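You can additionally specify --pipeline-parallel-size to enable pipeline parallelism. For example, to run the API server on 8 GPUs with both pipeline parallelism and tensor parallelism: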

 vllm serve gpt2 \
     --tensor-parallel-size 4 \
     --pipeline-parallel-size 2

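Running vLLM on multiple nodes

If a single node does not have enough GPUs to hold the model, you can run the model on multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path and the Python environment. The recommended way is to use docker images to ensure identical environments and to hide host heterogeneity by mapping the hosts into the same docker configuration.

The first step is to start the containers and organize them into a cluster. We provide the helper script examples/online_serving/run_cluster.sh to start the cluster, shown below: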

#!/bin/bash

# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi

# Assign the first four arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3"  # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4

# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS=("$@")

# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Define a function to cleanup on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT

# Command setup for head or worker node
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi

# Run the docker command with the user specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"

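Note that this script launches docker without administrative privileges, which are required to access GPU performance counters when running profiling and tracing tools. For that purpose, the script can add CAP_SYS_ADMIN to the docker container by using the --cap-add option in the docker run command.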
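The only part of run_cluster.sh that differs between the head node and the workers is the Ray start command. The sketch below mirrors that branch in Python to make the difference explicit; the function name is hypothetical, while the flags and port 6379 are taken from the script above.

```python
def build_ray_start_command(node_type, head_node_address):
    """Mirror the head/worker branch of run_cluster.sh as a command string."""
    if node_type not in ("--head", "--worker"):
        raise ValueError("node_type must be --head or --worker")
    command = "ray start --block"
    if node_type == "--head":
        # The head node opens the port that workers connect to.
        command += " --head --port=6379"
    else:
        # Workers join the cluster by pointing at the head node.
        command += f" --address={head_node_address}:6379"
    return command


print(build_ray_start_command("--head", "ip_of_head_node"))
print(build_ray_start_command("--worker", "ip_of_head_node"))
```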

Pick a node as the head node and run the following command:

bash run_cluster.sh \
                vllm/vllm-openai \
                ip_of_head_node \
                --head \
                /path/to/the/huggingface/home/in/this/node \
                -e VLLM_HOST_IP=ip_of_this_node

On the remaining worker nodes, run the following command:

bash run_cluster.sh \
                vllm/vllm-openai \
                ip_of_head_node \
                --worker \
                /path/to/the/huggingface/home/in/this/node \
                -e VLLM_HOST_IP=ip_of_this_node

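You then get a Ray cluster of containers. Note that the shells running these commands must stay alive to hold the cluster; any shell disconnect terminates the cluster. Also note that the argument ip_of_head_node should be the IP address of the head node, which must be reachable from all worker nodes.

The IP address of each worker node should be specified in the VLLM_HOST_IP environment variable, and it should be different for each worker node. Check the network configuration of your cluster to make sure the nodes can communicate with each other through the specified IP addresses.

Warning: Since this is a Ray cluster of containers, all of the following commands should be executed inside the containers; otherwise you are executing them on the host, which is not connected to the Ray cluster. To enter a container, use docker exec -it node /bin/bash.

Then, on any node, use docker exec -it node /bin/bash to enter the container and run ray status to check the status of the Ray cluster. You should see the correct number of nodes and GPUs.

After that, on any node, use docker exec -it node /bin/bash to enter the container again. Inside the container, you can use vLLM as usual, just as if all the GPUs were on one node.

The common practice is to set the tensor parallel size to the number of GPUs in each node and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs across 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2: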

 vllm serve /path/to/the/model/in/the/container \
     --tensor-parallel-size 8 \
     --pipeline-parallel-size 2

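You can also use tensor parallelism without pipeline parallelism by setting the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs across 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16: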

vllm serve /path/to/the/model/in/the/container \
     --tensor-parallel-size 16

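For tensor parallelism to perform well, you must make sure that communication between nodes is efficient, for example by using high-speed network cards such as Infiniband. To set up the cluster to use Infiniband correctly, append additional arguments such as --privileged -e NCCL_IB_HCA=mlx5 to the run_cluster.sh script. Contact your system administrator for more information on how to set the Infiniband flags.

One way to confirm that Infiniband is working is to run vLLM with the NCCL_DEBUG=TRACE environment variable set, for example NCCL_DEBUG=TRACE vllm serve …, and check the logs for the NCCL version and the network used. If you find [send] via NET/Socket in the logs, NCCL is using a raw TCP socket, which is not efficient for cross-node tensor parallelism. If you find [send] via NET/IB/GDRDMA in the logs, NCCL is using Infiniband with GPU-Direct RDMA, which is efficient.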
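Checking a long trace log by eye is tedious, so the two markers mentioned above can be searched for programmatically. The sketch below is a hypothetical helper, and the sample log lines are made up for illustration; only the substrings via NET/Socket and via NET/IB/GDRDMA come from the text.

```python
def classify_nccl_transport(log_text):
    """Report which transport marker appears in NCCL_DEBUG=TRACE output."""
    if "via NET/IB/GDRDMA" in log_text:
        return "infiniband-gdrdma"
    if "via NET/Socket" in log_text:
        return "tcp-socket"
    return "unknown"


# Made-up log excerpts; real NCCL lines contain more fields.
slow_log = "node1: Channel 00 : 0[0] -> 1[0] [send] via NET/Socket/0"
fast_log = "node1: Channel 00 : 0[0] -> 1[0] [send] via NET/IB/GDRDMA/0"

print(classify_nccl_transport(slow_log))
print(classify_nccl_transport(fast_log))
```

A result of "tcp-socket" suggests revisiting the Infiniband settings before relying on cross-node tensor parallelism.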

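Note that after you start the Ray cluster, it is a good idea to also check the GPU-GPU communication between nodes, which can be non-trivial to set up. See the "sanity check script" for more information. If you need to set some environment variables for the communication configuration, you can add them to the run_cluster.sh script, e.g. -e NCCL_SOCKET_IFNAME=eth0

Note that setting environment variables in the shell (e.g. NCCL_SOCKET_IFNAME=eth0 vllm serve …) only works for processes on the same node, not for processes on other nodes. Setting environment variables when you create the cluster is the recommended way. See issue #6803 for details.

Make sure the model has been downloaded to all nodes (under the same path), or to a distributed file system that all nodes can access.

When you refer to a model by its huggingface repo id, you should add your huggingface token to the run_cluster.sh script, e.g. -e HF_TOKEN=. The recommended way is to download the model first and then refer to it by path.

If you keep receiving the error message Error: No available node types can fulfill resource request even though you have enough GPUs in the cluster, chances are that your nodes have multiple IP addresses and vLLM cannot find the right one, especially when you are using multi-node inference. Make sure vLLM and Ray use the same IP address.
You can set the VLLM_HOST_IP environment variable to the correct IP address in the run_cluster.sh script (different for each node!) and check ray status to see the IP address Ray uses. See issue #7815 for more information.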