1. An important vLLM parameter, enable-prefix-caching, can improve performance in the KV-cache stage by up to 5x in certain scenarios. On some platforms the equivalent parameter is called use_cache.
For details, see the video: https://www.toutiao.com/video/7355331984845734435/?channel=&source=search_tab
2. The same enable-prefix-caching parameter can also improve prefill performance by about 30% in certain scenarios; a short sketch of how it is enabled follows.
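To illustrate how this is typically switched on, here is a minimal sketch using vLLM's offline Python API; the model name and prompts are placeholders, not taken from the article. When running vLLM as a server, the same effect comes from passing --enable-prefix-caching on the command line.

    from vllm import LLM, SamplingParams

    # Enable automatic prefix caching so that requests sharing a long common
    # prefix (e.g. the same system prompt) can reuse cached KV blocks.
    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", enable_prefix_caching=True)

    shared_prefix = "You are a helpful assistant. " * 50  # long shared prefix
    prompts = [shared_prefix + "Summarize document A.",
               shared_prefix + "Summarize document B."]

    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    for out in outputs:
        print(out.outputs[0].text)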
Engine Arguments
Below, you can find an explanation of every engine argument for vLLM:
--model <model_name_or_path>
Name or path of the huggingface model to use.
--tokenizer <tokenizer_name_or_path>
Name or path of the huggingface tokenizer to use.
--revision <revision>
The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
--tokenizer-revision <revision>
The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
--tokenizer-mode {auto,slow}
The tokenizer mode.
- “auto” will use the fast tokenizer if available.
- “slow” will always use the slow tokenizer.
--trust-remote-code
Trust remote code from huggingface.
--download-dir <directory>
Directory in which to download and load the weights; defaults to the Hugging Face cache directory.
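Each of these command-line flags corresponds to a constructor keyword of the same name (dashes replaced by underscores) when the engine is created from Python. A minimal sketch of the loading-related options; the model name and cache path are placeholders:

    from vllm import LLM

    llm = LLM(
        model="mistralai/Mistral-7B-v0.1",   # --model (placeholder name)
        revision="main",                     # --revision: branch, tag, or commit id
        trust_remote_code=True,              # --trust-remote-code
        download_dir="/data/hf-cache",       # --download-dir (placeholder path)
    )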
--load-format {auto,pt,safetensors,npcache,dummy}
The format of the model weights to load.
- “auto” will try to load the weights in the safetensors format and fall back to the pytorch bin format if safetensors format is not available.
- “pt” will load the weights in the pytorch bin format.
- “safetensors” will load the weights in the safetensors format.
- “npcache” will load the weights in pytorch format and store a numpy cache to speed up the loading.
- “dummy” will initialize the weights with random values, mainly for profiling.
--dtype {auto,half,float16,bfloat16,float,float32}
Data type for model weights and activations.
- “auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- “half” for FP16. Recommended for AWQ quantization.
- “float16” is the same as “half”.
- “bfloat16” for a balance between precision and range.
- “float” is shorthand for FP32 precision.
- “float32” for FP32 precision.
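As a rough example of choosing a weight format and data type from Python (the model name is a placeholder; “half” is what the docs recommend for AWQ checkpoints, while “auto” simply keeps the precision declared in the model config):

    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen1.5-7B-Chat",   # placeholder model
        load_format="safetensors",      # --load-format safetensors
        dtype="bfloat16",               # --dtype bfloat16
    )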
--max-model-len <length>
Model context length. If unspecified, will be automatically derived from the model config.
--worker-use-ray
Use Ray for distributed serving; this is set automatically when using more than 1 GPU.
--pipeline-parallel-size (-pp) <size>
Number of pipeline stages.
--tensor-parallel-size (-tp) <size>
Number of tensor parallel replicas.
--max-parallel-loading-workers <workers>
Load the model sequentially in multiple batches, to avoid CPU RAM OOM when using tensor parallelism with large models.
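A sketch of serving a larger model across several GPUs on one node (the 4-GPU setup and model name are assumptions for illustration):

    from vllm import LLM

    # Shard the weight matrices across 4 GPUs with tensor parallelism.
    # --worker-use-ray is enabled automatically once more than one GPU is used.
    llm = LLM(
        model="meta-llama/Llama-2-70b-hf",   # placeholder model
        tensor_parallel_size=4,              # --tensor-parallel-size 4
        max_parallel_loading_workers=2,      # --max-parallel-loading-workers 2
    )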
--block-size {8,16,32}
Token block size for contiguous chunks of tokens.
--enable-prefix-caching
Enables automatic prefix caching.
--seed <seed>
Random seed for operations.
--swap-space <size>
CPU swap space size (GiB) per GPU.
--gpu-memory-utilization <fraction>
The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.
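The memory-related knobs can be combined like this (all values here are illustrative, not recommendations):

    from vllm import LLM

    llm = LLM(
        model="facebook/opt-6.7b",     # placeholder model
        block_size=16,                 # --block-size 16: tokens per KV-cache block
        swap_space=4,                  # --swap-space 4: GiB of CPU swap per GPU
        gpu_memory_utilization=0.8,    # --gpu-memory-utilization 0.8: use 80% of GPU memory
    )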
--max-num-batched-tokens <tokens>
Maximum number of batched tokens per iteration.
--max-num-seqs <sequences>
Maximum number of sequences per iteration.
--max-paddings <paddings>
Maximum number of paddings in a batch.
--disable-log-stats
Disable logging statistics.
--quantization (-q) {awq,squeezellm,None}
Method used to quantize the weights.
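A sketch combining the batching limits with AWQ quantization (the checkpoint name and limit values are placeholders):

    from vllm import LLM

    llm = LLM(
        model="TheBloke/Llama-2-7B-Chat-AWQ",   # placeholder AWQ checkpoint
        quantization="awq",                     # --quantization awq
        dtype="half",                           # FP16 is recommended for AWQ
        max_num_batched_tokens=8192,            # --max-num-batched-tokens 8192
        max_num_seqs=128,                       # --max-num-seqs 128
    )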
Async Engine Arguments
Below are the additional arguments related to the asynchronous engine:
--engine-use-ray
Use Ray to start the LLM engine in a process separate from the server process.
--disable-log-requests
Disable logging requests.
--max-log-len
Maximum number of prompt characters or prompt token IDs printed in the log. Defaults to unlimited.
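These flags belong to the asynchronous engine used by the API server rather than the offline LLM class. A minimal sketch of constructing the async engine directly (the model name and values are placeholders):

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine

    engine_args = AsyncEngineArgs(
        model="facebook/opt-125m",    # placeholder model
        disable_log_requests=True,    # --disable-log-requests
        max_log_len=100,              # --max-log-len 100: truncate logged prompts
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)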