1. An important vLLM parameter, enable-prefix-caching, can improve performance in the KV-cache stage by up to 5x in certain scenarios. On some platforms the equivalent parameter is called use_cache.
For details, see the video: https://www.toutiao.com/video/7355331984845734435/?channel=&source=search_tab
2. The same enable-prefix-caching parameter can also improve prefill performance by about 30% in certain scenarios; a short sketch of how it is enabled follows.
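To illustrate how this is typically switched on, here is a minimal sketch using vLLM's offline Python API; the model name and prompts are placeholders, not taken from the article. When running vLLM as a server, the same effect comes from passing --enable-prefix-caching on the command line.

    from vllm import LLM, SamplingParams

    # Enable automatic prefix caching so that requests sharing a long common
    # prefix (e.g. the same system prompt) can reuse cached KV blocks.
    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", enable_prefix_caching=True)

    shared_prefix = "You are a helpful assistant. " * 50  # long shared prefix
    prompts = [shared_prefix + "Summarize document A.",
               shared_prefix + "Summarize document B."]

    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    for out in outputs:
        print(out.outputs[0].text)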
Engine Arguments
Below, you can find an explanation of every engine argument for vLLM:
--model <model_name_or_path>
Name or path of the huggingface model to use.
--tokenizer <tokenizer_name_or_path>
Name or path of the huggingface tokenizer to use.
--revision <revision>
The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
--tokenizer-revision <revision>
The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
--tokenizer-mode {auto,slow}
The tokenizer mode.
- “auto” will use the fast tokenizer if available.
- “slow” will always use the slow tokenizer.
--trust-remote-code
Trust remote code from huggingface.
--download-dir <directory>
Directory in which to download and load the weights; defaults to the Hugging Face cache directory.
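Each of these command-line flags corresponds to a constructor keyword of the same name (dashes replaced by underscores) when the engine is created from Python. A minimal sketch of the loading-related options; the model name and cache path are placeholders:

    from vllm import LLM

    llm = LLM(
        model="mistralai/Mistral-7B-v0.1",   # --model (placeholder name)
        revision="main",                     # --revision: branch, tag, or commit id
        trust_remote_code=True,              # --trust-remote-code
        download_dir="/data/hf-cache",       # --download-dir (placeholder path)
    )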
--load-format {auto,pt,safetensors,npcache,dummy}
The format of the model weights to load.
- “auto” will try to load the weights in the safetensors format and fall back to the pytorch bin format if safetensors format is not available.
- “pt” will load the weights in the pytorch bin format.
- “safetensors” will load the weights in the safetensors format.
- “npcache” will load the weights in pytorch format and store a numpy cache to speed up the loading.
- “dummy” will initialize the weights with random values, mainly for profiling.
--dtype {auto,half,float16,bfloat16,float,float32}
Data type for model weights and activations.
- “auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- “half” for FP16. Recommended for AWQ quantization.
- “float16” is the same as “half”.
- “bfloat16” for a balance between precision and range.
- “float” is shorthand for FP32 precision.
- “float32” for FP32 precision.
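As a rough example of choosing a weight format and data type from Python (the model name is a placeholder; “half” is what the docs recommend for AWQ checkpoints, while “auto” simply keeps the precision declared in the model config):

    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen1.5-7B-Chat",   # placeholder model
        load_format="safetensors",      # --load-format safetensors
        dtype="bfloat16",               # --dtype bfloat16
    )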
--max-model-len <length>
Model context length. If unspecified, will be automatically derived from the model config.
--worker-use-ray
Use Ray for distributed serving; this is set automatically when using more than 1 GPU.
--pipeline-parallel-size (-pp) <size>
Number of pipeline stages.
--tensor-parallel-size (-tp) <size>
Number of tensor parallel replicas.
--max-parallel-loading-workers <workers>
Load the model sequentially in multiple batches, to avoid CPU RAM OOM when using tensor parallelism with large models.
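A sketch of serving a larger model across several GPUs on one node (the 4-GPU setup and model name are assumptions for illustration):

    from vllm import LLM

    # Shard the weight matrices across 4 GPUs with tensor parallelism.
    # --worker-use-ray is enabled automatically once more than one GPU is used.
    llm = LLM(
        model="meta-llama/Llama-2-70b-hf",   # placeholder model
        tensor_parallel_size=4,              # --tensor-parallel-size 4
        max_parallel_loading_workers=2,      # --max-parallel-loading-workers 2
    )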
--block-size {8,16,32}
Token block size for contiguous chunks of tokens.
--enable-prefix-caching
Enables automatic prefix caching.
--seed <seed>
Random seed for operations.
--swap-space <size>
CPU swap space size (GiB) per GPU.
--gpu-memory-utilization <fraction>
The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.
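The memory-related knobs can be combined like this (all values here are illustrative, not recommendations):

    from vllm import LLM

    llm = LLM(
        model="facebook/opt-6.7b",     # placeholder model
        block_size=16,                 # --block-size 16: tokens per KV-cache block
        swap_space=4,                  # --swap-space 4: GiB of CPU swap per GPU
        gpu_memory_utilization=0.8,    # --gpu-memory-utilization 0.8: use 80% of GPU memory
    )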
--max-num-batched-tokens <tokens>
Maximum number of batched tokens per iteration.
--max-num-seqs <sequences>
Maximum number of sequences per iteration.
--max-paddings <paddings>
Maximum number of paddings in a batch.
--disable-log-stats
Disable logging statistics.
--quantization (-q) {awq,squeezellm,None}
Method used to quantize the weights.
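A sketch combining the batching limits with AWQ quantization (the checkpoint name and limit values are placeholders):

    from vllm import LLM

    llm = LLM(
        model="TheBloke/Llama-2-7B-Chat-AWQ",   # placeholder AWQ checkpoint
        quantization="awq",                     # --quantization awq
        dtype="half",                           # FP16 is recommended for AWQ
        max_num_batched_tokens=8192,            # --max-num-batched-tokens 8192
        max_num_seqs=128,                       # --max-num-seqs 128
    )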
Async Engine Arguments
Below are the additional arguments related to the asynchronous engine:
--engine-use-ray
Use Ray to start the LLM engine in a process separate from the server process.
--disable-log-requests
Disable logging requests.
--max-log-len
Maximum number of prompt characters or prompt token IDs printed in the log. Defaults to unlimited.
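These flags belong to the asynchronous engine used by the API server rather than the offline LLM class. A minimal sketch of constructing the async engine directly (the model name and values are placeholders):

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine

    engine_args = AsyncEngineArgs(
        model="facebook/opt-125m",    # placeholder model
        disable_log_requests=True,    # --disable-log-requests
        max_log_len=100,              # --max-log-len 100: truncate logged prompts
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)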