vLLM Usage Tutorial [V5.0.4]
vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
For setting up the vLLM environment, see my other article: vLLM Environment Installation and Running Examples [V5.0.4].
This first pass only covers the parameters I have used so far; the article will be updated as I try the others.
Notes on vLLM Parameters
The inference tests below are based on Meta-Llama-3.1-8B-Instruct.
1. LLM: Loading the Model
Parameter descriptions from the source code:
class LLM:
"""An LLM for generating texts from given prompts and sampling parameters.
This class includes a tokenizer, a language model (possibly distributed
across multiple GPUs), and GPU memory space allocated for intermediate
states (aka KV cache). Given a batch of prompts and sampling parameters,
this class generates texts from the model, using an intelligent batching
mechanism and efficient memory management.
Args:
model: The name or path of a HuggingFace Transformers model.
tokenizer: The name or path of a HuggingFace Transformers tokenizer.
tokenizer_mode: The tokenizer mode. "auto" will use the fast tokenizer
if available, and "slow" will always use the slow tokenizer.
skip_tokenizer_init: If true, skip initialization of tokenizer and
detokenizer. Expect valid prompt_token_ids and None for prompt
from the input.
trust_remote_code: Trust remote code (e.g., from HuggingFace) when
downloading the model and tokenizer.
tensor_parallel_size: The number of GPUs to use for distributed
execution with tensor parallelism.
dtype: The data type for the model weights and activations. Currently,
we support `float32`, `float16`, and `bfloat16`. If `auto`, we use
the `torch_dtype` attribute specified in the model config file.
However, if the `torch_dtype` in the config is `float32`, we will
use `float16` instead.
quantization: The method used to quantize the model weights. Currently,
we support "awq", "gptq", "squeezellm", and "fp8" (experimental).
If None, we first check the `quantization_config` attribute in the
model config file. If that is None, we assume the model weights are
not quantized and use `dtype` to determine the data type of
the weights.
revision: The specific model version to use. It can be a branch name,
a tag name, or a commit id.
tokenizer_revision: The specific tokenizer version to use. It can be a
branch name, a tag name, or a commit id.
seed: The seed to initialize the random number generator for sampling.
gpu_memory_utilization: The ratio (between 0 and 1) of GPU memory to
reserve for the model weights, activations, and KV cache. Higher
values will increase the KV cache size and thus improve the model's
throughput. However, if the value is too high, it may cause out-of-
memory (OOM) errors.
swap_space: The size (GiB) of CPU memory per GPU to use as swap space.
This can be used for temporarily storing the states of the requests
when their `best_of` sampling parameters are larger than 1. If all
requests will have `best_of=1`, you can safely set this to 0.
Otherwise, too small values may cause out-of-memory (OOM) errors.
cpu_offload_gb: The size (GiB) of CPU memory to use for offloading
the model weights. This virtually increases the GPU memory space
you can use to hold the model weights, at the cost of CPU-GPU data
transfer for every forward pass.
enforce_eager: Whether to enforce eager execution. If True, we will
disable CUDA graph and always execute the model in eager mode.
If False, we will use CUDA graph and eager execution in hybrid.
max_context_len_to_capture: Maximum context len covered by CUDA graphs.
When a sequence has context length larger than this, we fall back
to eager mode (DEPRECATED. Use `max_seq_len_to_capture` instead).
max_seq_len_to_capture: Maximum sequence len covered by CUDA graphs.
When a sequence has context length larger than this, we fall back
to eager mode.
disable_custom_all_reduce: See ParallelConfig
**kwargs: Arguments for :class:`~vllm.EngineArgs`. (See
:ref:`engine_args`)
Note:
This class is intended to be used for offline inference. For online
serving, use the :class:`~vllm.AsyncLLMEngine` class instead.
"""
(1) Default usage
Set model to the path of the model directory.
test.py:
from vllm import LLM
llm = LLM(model="Meta-Llama-3.1-8B-Instruct")
print("model load success!")
Run it with:
CUDA_VISIBLE_DEVICES=0 python3 test.py
CUDA_VISIBLE_DEVICES selects which GPU to use.
Error message:
(2) Adding the max_model_len parameter
According to the error message, the KV cache can only hold 43200 tokens, while Meta-Llama-3.1-8B-Instruct has a maximum length of 131072, so max_model_len must be used to cap the model's maximum length.
from vllm import LLM
llm = LLM(
    model="Meta-Llama-3.1-8B-Instruct",
    max_model_len=43200,
)
print("model load success!")
Result:
About 22 GB of GPU memory is required.
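If 22 GB is more than the GPU can spare, the memory-related arguments from the docstring above are the knobs to try. A minimal sketch, with values that are assumptions rather than measured settings for this model:
from vllm import LLM

llm = LLM(
    model="Meta-Llama-3.1-8B-Instruct",
    max_model_len=8192,            # a shorter context means a smaller KV cache
    gpu_memory_utilization=0.85,   # reserve a smaller fraction of GPU memory
    cpu_offload_gb=4,              # offload part of the weights to CPU RAM
    enforce_eager=True,            # skip CUDA graph capture to save memory
)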
2. SamplingParams
Parameter descriptions from the source code:
class SamplingParams:
"""Sampling parameters for text generation.
Overall, we follow the sampling parameters from the OpenAI text completion
API (https://platform.openai.com/docs/api-reference/completions/create).
In addition, we support beam search, which is not supported by OpenAI.
Args:
n: Number of output sequences to return for the given prompt.
best_of: Number of output sequences that are generated from the prompt.
From these `best_of` sequences, the top `n` sequences are returned.
`best_of` must be greater than or equal to `n`. This is treated as
the beam width when `use_beam_search` is True. By default, `best_of`
is set to `n`.
presence_penalty: Float that penalizes new tokens based on whether they
appear in the generated text so far. Values > 0 encourage the model
to use new tokens, while values < 0 encourage the model to repeat
tokens.
frequency_penalty: Float that penalizes new tokens based on their
frequency in the generated text so far. Values > 0 encourage the
model to use new tokens, while values < 0 encourage the model to
repeat tokens.
repetition_penalty: Float that penalizes new tokens based on whether
they appear in the prompt and the generated text so far. Values > 1
encourage the model to use new tokens, while values < 1 encourage
the model to repeat tokens.
temperature: Float that controls the randomness of the sampling. Lower
values make the model more deterministic, while higher values make
the model more random. Zero means greedy sampling.
top_p: Float that controls the cumulative probability of the top tokens
to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
top_k: Integer that controls the number of top tokens to consider. Set
to -1 to consider all tokens.
min_p: Float that represents the minimum probability for a token to be
considered, relative to the probability of the most likely token.
Must be in [0, 1]. Set to 0 to disable this.
seed: Random seed to use for the generation.
use_beam_search: Whether to use beam search instead of sampling.
length_penalty: Float that penalizes sequences based on their length.
Used in beam search.
early_stopping: Controls the stopping condition for beam search. It
accepts the following values: `True`, where the generation stops as
soon as there are `best_of` complete candidates; `False`, where an
heuristic is applied and the generation stops when is it very
unlikely to find better candidates; `"never"`, where the beam search
procedure only stops when there cannot be better candidates
(canonical beam search algorithm).
stop: List of strings that stop the generation when they are generated.
The returned output will not contain the stop strings.
stop_token_ids: List of tokens that stop the generation when they are
generated. The returned output will contain the stop tokens unless
the stop tokens are special tokens.
include_stop_str_in_output: Whether to include the stop strings in
output text. Defaults to False.
ignore_eos: Whether to ignore the EOS token and continue generating
tokens after the EOS token is generated.
max_tokens: Maximum number of tokens to generate per output sequence.
min_tokens: Minimum number of tokens to generate per output sequence
before EOS or stop_token_ids can be generated
logprobs: Number of log probabilities to return per output token.
When set to None, no probability is returned. If set to a non-None
value, the result includes the log probabilities of the specified
number of most likely tokens, as well as the chosen tokens.
Note that the implementation follows the OpenAI API: The API will
always return the log probability of the sampled token, so there
may be up to `logprobs+1` elements in the response.
prompt_logprobs: Number of log probabilities to return per prompt token.
detokenize: Whether to detokenize the output. Defaults to True.
skip_special_tokens: Whether to skip special tokens in the output.
spaces_between_special_tokens: Whether to add spaces between special
tokens in the output. Defaults to True.
logits_processors: List of functions that modify logits based on
previously generated tokens, and optionally prompt tokens as
a first argument.
truncate_prompt_tokens: If set to an integer k, will use only the last k
tokens from the prompt (i.e., left truncation). Defaults to None
(i.e., no truncation).
"""
(1) Usage
To keep results consistent with LLaMA-Factory, the same inference settings are used:
temperature=0.6,
top_p=0.9
from vllm import SamplingParams
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=100, stop=['<|eom_id|>', '<|eot_id|>', '<|end_of_text|>'])
Note:
- max_tokens must be set; the default is 16, which usually needs to be increased.
- Set stop=['<|eom_id|>', '<|eot_id|>', '<|end_of_text|>'], the end tokens for Llama 3.1, to prevent repeated generation.
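For reference, a couple of other SamplingParams combinations described in the docstring above; the values here are only illustrative:
from vllm import SamplingParams

# Greedy decoding: temperature=0 always picks the most likely token.
greedy_params = SamplingParams(temperature=0, max_tokens=100)

# Return three sampled candidates per prompt along with top-1 log probabilities.
multi_params = SamplingParams(n=3, temperature=0.8, top_p=0.95,
                              max_tokens=100, logprobs=1)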
3. generate
generate returns a list of RequestOutput objects.
Complete test
Note:
1. vLLM does support batched inference: although tqdm reports progress prompt by prompt, passing multiple prompts at once is noticeably faster, which shows they are processed as a batch.
prompts = ["你是谁?", "你好"]
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n, Generated text: {generated_text!r}")
Note:
By default, generate shows a tqdm progress bar; pass use_tqdm=False to turn it off.
The default output looks like this:
4. RequestOutput: Output Structure
4.1 RequestOutput
class RequestOutput:
"""The output data of a completion request to the LLM.
Args:
request_id: The unique ID of the request.
prompt: The prompt string of the request.
prompt_token_ids: The token IDs of the prompt.
prompt_logprobs: The log probabilities to return per prompt token.
outputs: The output sequences of the request.
finished: Whether the whole request is finished.
metrics: Metrics associated with the request.
lora_request: The LoRA request that was used to generate the output.
"""
Attributes we use:
No. | Attribute | Description |
---|---|---|
1 | prompt | the input text |
2 | prompt_token_ids | the input tokens; useful for counting input tokens |
3 | outputs | the outputs, a list of CompletionOutput objects |
4 | metrics | a RequestMetrics object holding the timing records for the request |
4.2 CompletionOutput
class CompletionOutput:
"""The output data of one completion output of a request.
Args:
index: The index of the output in the request.
text: The generated output text.
token_ids: The token IDs of the generated output text.
cumulative_logprob: The cumulative log probability of the generated
output text.
logprobs: The log probabilities of the top probability words at each
position if the logprobs are requested.
finish_reason: The reason why the sequence is finished.
stop_reason: The stop string or token id that caused the completion
to stop, None if the completion finished for some other reason
including encountering the EOS token.
lora_request: The LoRA request that was used to generate the output.
"""
Attributes we use:
No. | Attribute | Description |
---|---|---|
1 | text | the output text |
2 | token_ids | the output tokens; useful for counting output tokens |
4.3 RequestMetrics
class RequestMetrics:
"""Metrics associated with a request.
Attributes:
arrival_time: The time when the request arrived.
first_scheduled_time: The time when the request was first scheduled.
first_token_time: The time when the first token was generated.
time_in_queue: The time the request spent in the queue.
finished_time: The time when the request was finished.
"""
Attributes we use:
No. | Attribute | Description |
---|---|---|
1 | arrival_time | when the request started |
2 | finished_time | when the request finished |
These two can serve as reference timestamps for measuring speed.
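Putting the three classes together, a small sketch that counts tokens and estimates throughput from the metrics; it assumes llm and sampling_params have already been built as in the sections above:
outputs = llm.generate(["你好"], sampling_params, use_tqdm=False)
for output in outputs:
    in_tokens = len(output.prompt_token_ids)        # RequestOutput.prompt_token_ids
    out_tokens = len(output.outputs[0].token_ids)   # CompletionOutput.token_ids
    elapsed = output.metrics.finished_time - output.metrics.arrival_time
    print(f"in={in_tokens}, out={out_tokens}, "
          f"speed={(in_tokens + out_tokens) / elapsed:.2f} token/s")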
vLLM usage demo
from vllm import LLM, SamplingParams
import argparse
import time
import logging
from tqdm import tqdm
class Generate:
    def __init__(self, model_path, temperature=0.6, top_p=0.9, max_tokens=43200, debug=False):
        self.logger = logging.getLogger("vLLM")
        if debug:
            self.logger.setLevel(logging.DEBUG)
        else:
            self.logger.setLevel(logging.INFO)
        stream_handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        stream_handler.setFormatter(formatter)
        self.logger.addHandler(stream_handler)

        start_time = time.time()
        self.llm = LLM(
            model=model_path,
            max_model_len=43200
        )
        end_time = time.time()
        self.logger.info("load {} model use {:.2f}s".format(model_path, end_time - start_time))

        self.sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens)
        self.all_num = 0
        self.all_input_tokens_num = 0
        self.all_output_tokens_num = 0
        self.all_time = 0

    def generate_sentences(self, prompts):
        # Generate one batch of prompts and accumulate token/time statistics.
        start_time = time.time()
        outputs = self.llm.generate(prompts, self.sampling_params, use_tqdm=False)
        end_time = time.time()
        self.all_time = self.all_time + (end_time - start_time)
        result_list = []
        for output in outputs:
            self.all_num += 1
            self.all_input_tokens_num += len(output.prompt_token_ids)
            self.all_output_tokens_num += len(output.outputs[0].token_ids)
            result_list.append(output.outputs[0].text)
        return result_list

    def generate_all_data(self, input_list, batch):
        # Reset the statistics, then run through the input list batch by batch.
        self.all_num = 0
        self.all_input_tokens_num = 0
        self.all_output_tokens_num = 0
        self.all_time = 0
        output_list = []
        for i in tqdm(range(0, len(input_list), batch)):
            output = self.generate_sentences(input_list[i : i + batch])
            output_list.extend(output)
        self.logger.info("-" * 20)
        self.logger.info("process {} sentences use {:.2f}s".format(self.all_num, self.all_time))
        self.logger.info("average input token num: {:.2f}, average output token num: {:.2f}".format(
            self.all_input_tokens_num / self.all_num, self.all_output_tokens_num / self.all_num))
        self.logger.info(
            "speed: all:{:.2f} token/s, input:{:.2f} token/s, output:{:.2f} token/s".format(
                (self.all_input_tokens_num + self.all_output_tokens_num) / self.all_time,
                self.all_input_tokens_num / self.all_time,
                self.all_output_tokens_num / self.all_time
            )
        )
        self.logger.info("-" * 20 + "\n\n")
        return output_list

    def generate_file(self, input_file_path, output_file_path, batch):
        input_list = []
        with open(input_file_path, "r", encoding="utf8") as fin:
            for line in fin.readlines():
                # strip the trailing \n from each line read from the file
                if line[-1] == "\n":
                    data = line[:-1]
                else:
                    data = line
                data = data.replace("\\n", "\n")  # turn the literal "\n" back into a real newline
                input_list.append(data)
        output_list = self.generate_all_data(input_list, batch)
        with open(output_file_path, "w", encoding="utf8") as fout:
            for line in output_list:
                fout.write(line + "\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="vLLM Llama Demo")
    parser.add_argument('-m', '--model', type=str, required=True, help='model path')
    parser.add_argument('-i', '--input', type=str, required=True, help='input file path')
    parser.add_argument('-o', '--output', type=str, required=True, help='output file path')
    parser.add_argument('-b', '--batch', type=int, default=1)
    parser.add_argument('--temperature', type=float, default=0.6)
    parser.add_argument('--top_p', type=float, default=0.9)
    parser.add_argument('--max_token', type=int, default=43200)
    parser.add_argument('--debug', action="store_true", help="whether to show debug information")
    args = parser.parse_args()

    generate = Generate(args.model, args.temperature, args.top_p, args.max_token, args.debug)
    generate.generate_file(args.input, args.output, args.batch)
Command to run:
CUDA_VISIBLE_DEVICES=0 python3 inference.py -m ./Meta-Llama-3.1-8B-Instruct/ -i test.txt -o output.txt -b 4 --max_token 100
Output: