Introduction to vLLM Parameters


1. An important vLLM parameter, enable-prefix-caching, can improve performance in the KV-cache stage by up to 5x in certain scenarios. On some platforms the equivalent parameter is named use_cache.

For details, see the video: https://www.toutiao.com/video/7355331984845734435/?channel=&source=search_tab

2. The same parameter, enable-prefix-caching, can also improve prefill performance by about 30% in certain scenarios.

https://www.toutiao.com/video/7358370119023919651/?from_scene=all&log_from=21379bf6743c9_1713448066292
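
As a minimal sketch of what the two videos above describe (the model name below is only a placeholder), prefix caching is switched on when constructing the engine, after which prompts sharing a common prefix can reuse the KV-cache blocks computed for that prefix:

    from vllm import LLM, SamplingParams

    # Enable automatic prefix caching; equivalent to the --enable-prefix-caching
    # CLI flag documented below. The model name is a placeholder.
    llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

    # Both prompts share a long prefix, so the second request can reuse the
    # KV-cache blocks already computed for the first.
    prefix = "You are a helpful assistant. Answer the question concisely.\n"
    prompts = [prefix + "Q: What is paged attention?",
               prefix + "Q: What is continuous batching?"]
    for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
        print(out.outputs[0].text)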

Engine Arguments

Below, you can find an explanation of every engine argument for vLLM:
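
All of these flags are parsed into vLLM's EngineArgs, so the same options can also be set programmatically by passing keyword arguments to the LLM class, with dashes replaced by underscores. A minimal sketch (placeholder model name):

    from vllm import LLM

    # Each CLI flag maps to a keyword argument with dashes replaced by
    # underscores, e.g. --gpu-memory-utilization -> gpu_memory_utilization.
    llm = LLM(
        model="facebook/opt-125m",   # --model
        dtype="auto",                # --dtype auto
        gpu_memory_utilization=0.9,  # --gpu-memory-utilization 0.9
    )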

--model <model_name_or_path>

Name or path of the huggingface model to use.

--tokenizer <tokenizer_name_or_path>

Name or path of the huggingface tokenizer to use.

--revision <revision>

The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

--tokenizer-revision <revision>

The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
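
For reproducible deployments, both the weights and the tokenizer can be pinned to fixed revisions. A sketch, where the repository name and revisions are placeholders:

    from vllm import LLM

    # Pin model weights and tokenizer to specific revisions (branch, tag,
    # or commit id). Repo name and revisions are placeholders.
    llm = LLM(
        model="org/some-model",     # --model
        revision="main",            # --revision
        tokenizer_revision="main",  # --tokenizer-revision
    )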

--tokenizer-mode {auto,slow}

The tokenizer mode.

  • “auto” will use the fast tokenizer if available.

  • “slow” will always use the slow tokenizer.

--trust-remote-code

Trust remote code from huggingface.
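
Models whose architecture or tokenizer code ships inside the Hugging Face repository require this flag. A sketch combining it with the slow tokenizer mode (repo name is a placeholder):

    from vllm import LLM

    # Allow custom modeling/tokenizer code from the HF repo to run, and
    # force the slow (Python) tokenizer. Repo name is a placeholder.
    llm = LLM(
        model="org/custom-arch-model",
        trust_remote_code=True,  # --trust-remote-code
        tokenizer_mode="slow",   # --tokenizer-mode slow
    )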

--download-dir <directory>

Directory in which to download and load the weights; defaults to the default huggingface cache directory.

--load-format {auto,pt,safetensors,npcache,dummy}

The format of the model weights to load.

  • “auto” will try to load the weights in the safetensors format and fall back to the pytorch bin format if safetensors format is not available.

  • “pt” will load the weights in the pytorch bin format.

  • “safetensors” will load the weights in the safetensors format.

  • “npcache” will load the weights in pytorch format and store a numpy cache to speed up the loading.

  • “dummy” will initialize the weights with random values, mainly for profiling.
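
The "dummy" format is useful for profiling engine overhead without downloading a real checkpoint: only the config and tokenizer are fetched, and the generated text is meaningless. A minimal sketch (placeholder model name):

    from vllm import LLM

    # Random-initialized weights: profile scheduling and memory behavior
    # without loading a real checkpoint.
    llm = LLM(model="facebook/opt-125m", load_format="dummy")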

--dtype {auto,half,float16,bfloat16,float,float32}

Data type for model weights and activations.

  • “auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.

  • “half” for FP16. Recommended for AWQ quantization.

  • “float16” is the same as “half”.

  • “bfloat16” for a balance between precision and range.

  • “float” is shorthand for FP32 precision.

  • “float32” for FP32 precision.
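
For instance, forcing BF16 regardless of the checkpoint's native precision (placeholder model name):

    from vllm import LLM

    # "auto" would keep the checkpoint's precision (FP16 for FP32/FP16
    # checkpoints, BF16 for BF16 checkpoints); here BF16 is forced.
    llm = LLM(model="facebook/opt-125m", dtype="bfloat16")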

--max-model-len <length>

Model context length. If unspecified, will be automatically derived from the model config.

--worker-use-ray

Use Ray for distributed serving, will be automatically set when using more than 1 GPU.

--pipeline-parallel-size (-pp) <size>

Number of pipeline stages.

--tensor-parallel-size (-tp) <size>

Number of tensor parallel replicas.

--max-parallel-loading-workers <workers>

Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models.
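
For example, sharding a large model across 4 GPUs with tensor parallelism while limiting concurrent weight-loading workers (placeholder model name; requires 4 visible GPUs):

    from vllm import LLM

    # Shard the model across 4 GPUs; cap concurrent weight loaders to
    # avoid CPU RAM OOM while loading a large checkpoint.
    llm = LLM(
        model="org/large-model",         # placeholder
        tensor_parallel_size=4,          # --tensor-parallel-size 4
        max_parallel_loading_workers=2,  # --max-parallel-loading-workers 2
    )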

--block-size {8,16,32}

Token block size for contiguous chunks of tokens.

--enable-prefix-caching

Enables automatic prefix caching.

--seed <seed>

Random seed for operations.

--swap-space <size>

CPU swap space size (GiB) per GPU.

--gpu-memory-utilization <fraction>

The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.
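
A sketch combining the memory-related knobs (placeholder model name):

    from vllm import LLM

    # Reserve 80% of GPU memory for the engine, allow 8 GiB of CPU swap
    # per GPU for preempted sequences, and fix the random seed.
    llm = LLM(
        model="facebook/opt-125m",
        gpu_memory_utilization=0.8,  # --gpu-memory-utilization 0.8
        swap_space=8,                # --swap-space 8
        seed=42,                     # --seed 42
    )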

--max-num-batched-tokens <tokens>

Maximum number of batched tokens per iteration.

--max-num-seqs <sequences>

Maximum number of sequences per iteration.

--max-paddings <paddings>

Maximum number of paddings in a batch.
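
A sketch bounding the per-iteration batch (placeholder model name); smaller limits lower peak memory at the cost of throughput:

    from vllm import LLM

    # Cap the scheduler at 8192 tokens and 64 sequences per engine step.
    llm = LLM(
        model="facebook/opt-125m",
        max_num_batched_tokens=8192,  # --max-num-batched-tokens 8192
        max_num_seqs=64,              # --max-num-seqs 64
    )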

--disable-log-stats

Disable logging statistics.

--quantization (-q) {awq,squeezellm,None}

Method used to quantize the weights.
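
For an AWQ checkpoint, this pairs with --dtype half as recommended above. A sketch (repository name is a placeholder):

    from vllm import LLM

    # Load AWQ-quantized weights; FP16 is the recommended dtype for AWQ.
    llm = LLM(model="org/model-awq", quantization="awq", dtype="half")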

Async Engine Arguments

Below are the additional arguments related to the asynchronous engine:

--engine-use-ray

Use Ray to start the LLM engine in a separate process as the server process.

--disable-log-requests

Disable logging requests.

--max-log-len

Max number of prompt characters or prompt ID numbers being printed in log. Defaults to unlimited.
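
The async engine is what the OpenAI-compatible API server drives internally; it can also be used directly from Python. A minimal sketch, assuming a vLLM version contemporary with this argument list, in which generate() streams partial RequestOutputs as an async generator:

    import asyncio

    from vllm import SamplingParams
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine

    # Build the async engine; the extra async flags map to kwargs as well.
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
        model="facebook/opt-125m",  # placeholder model
        disable_log_requests=True,  # --disable-log-requests
        max_log_len=100,            # --max-log-len 100
    ))

    async def main():
        params = SamplingParams(max_tokens=32)
        final = None
        # Each yielded RequestOutput contains the tokens generated so far.
        async for output in engine.generate("Hello, my name is", params,
                                            request_id="req-0"):
            final = output
        print(final.outputs[0].text)

    asyncio.run(main())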
