vLLM Usage Tutorial [V5.0.4]
vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
For setting up the vLLM environment, see my other article: vLLM Environment Installation and Running Examples [V5.0.4].
This first pass only covers the parameters I have used so far; the article will be updated as I try the others.
Notes on vLLM Parameters
The inference tests below are based on Meta-Llama-3.1-8B-Instruct.
1. LLM: Loading the Model
Parameter descriptions from the source code:
class LLM:
"""An LLM for generating texts from given prompts and sampling parameters.
This class includes a tokenizer, a language model (possibly distributed
across multiple GPUs), and GPU memory space allocated for intermediate
states (aka KV cache). Given a batch of prompts and sampling parameters,
this class generates texts from the model, using an intelligent batching
mechanism and efficient memory management.
Args:
model: The name or path of a HuggingFace Transformers model.
tokenizer: The name or path of a HuggingFace Transformers tokenizer.
tokenizer_mode: The tokenizer mode. "auto" will use the fast tokenizer
if available, and "slow" will always use the slow tokenizer.
skip_tokenizer_init: If true, skip initialization of tokenizer and
detokenizer. Expect valid prompt_token_ids and None for prompt
from the input.
trust_remote_code: Trust remote code (e.g., from HuggingFace) when
downloading the model and tokenizer.
tensor_parallel_size: The number of GPUs to use for distributed
execution with tensor parallelism.
dtype: The data type for the model weights and activations. Currently,
we support `float32`, `float16`, and `bfloat16`. If `auto`, we use
the `torch_dtype` attribute specified in the model config file.
However, if the `torch_dtype` in the config is `float32`, we will
use `float16` instead.
quantization: The method used to quantize the model weights. Currently,
we support "awq", "gptq", "squeezellm", and "fp8" (experimental).
If None, we first check the `quantization_config` attribute in the
model config file. If that is None, we assume the model weights are
not quantized and use `dtype` to determine the data type of
the weights.
revision: The specific model version to use. It can be a branch name,
a tag name, or a commit id.
tokenizer_revision: The specific tokenizer version to use. It can be a
branch name, a tag name, or a commit id.
seed: The seed to initialize the random number generator for sampling.
gpu_memory_utilization: The ratio (between 0 and 1) of GPU memory to
reserve for the model weights, activations, and KV cache. Higher
values will increase the KV cache size and thus improve the model's
throughput. However, if the value is too high, it may cause out-of-
memory (OOM) errors.
swap_space: The size (GiB) of CPU memory per GPU to use as swap space.
This can be used for temporarily storing the states of the requests
when their `best_of` sampling parameters are larger than 1. If all
requests will have `best_of=1`, you can safely set this to 0.
Otherwise, too small values may cause out-of-memory (OOM) errors.
cpu_offload_gb: The size (GiB) of CPU memory to use for offloading
the model weights. This virtually increases the GPU memory space
you can use to hold the model weights, at the cost of CPU-GPU data
transfer for every forward pass.
enforce_eager: Whether to enforce eager execution. If True, we will
disable CUDA graph and always execute the model in eager mode.
If False, we will use CUDA graph and eager execution in hybrid.
max_context_len_to_capture: Maximum context len covered by CUDA graphs.
When a sequence has context length larger than this, we fall back
to eager mode (DEPRECATED. Use `max_seq_len_to_capture` instead).
max_seq_len_to_capture: Maximum sequence len covered by CUDA graphs.
When a sequence has context length larger than this, we fall back
to eager mode.
disable_custom_all_reduce: See ParallelConfig
**kwargs: Arguments for :class:`~vllm.EngineArgs`. (See
:ref:`engine_args`)
Note:
This class is intended to be used for offline inference. For online
serving, use the :class:`~vllm.AsyncLLMEngine` class instead.
"""
(1) Default usage
Set model to the path of the model directory.
test.py:
from vllm import LLM
llm = LLM(model="Meta-Llama-3.1-8B-Instruct")
print("model load success!")
Run it with:
CUDA_VISIBLE_DEVICES=0 python3 test.py
CUDA_VISIBLE_DEVICES selects which GPU to use.
Error message:
(2) Adding the max_model_len parameter
According to the error message, the KV cache can only hold 43200 tokens, while Meta-Llama-3.1-8B-Instruct has a maximum length of 131072, so max_model_len must be used to cap the model's maximum length.
from vllm import LLM
llm = LLM(
    model="Meta-Llama-3.1-8B-Instruct",
    max_model_len=43200,
)
print("model load success!")
Result:
About 22 GB of GPU memory is required.
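If 22 GB is more than the GPU can spare, the memory-related arguments from the docstring above are the knobs to try. A minimal sketch, with values that are assumptions rather than measured settings for this model:
from vllm import LLM

llm = LLM(
    model="Meta-Llama-3.1-8B-Instruct",
    max_model_len=8192,            # a shorter context means a smaller KV cache
    gpu_memory_utilization=0.85,   # reserve a smaller fraction of GPU memory
    cpu_offload_gb=4,              # offload part of the weights to CPU RAM
    enforce_eager=True,            # skip CUDA graph capture to save memory
)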
2. SamplingParams
Parameter descriptions from the source code:
class SamplingParams:
"""Sampling parameters for text generation.
Overall, we follow the sampling parameters from the OpenAI text completion
API (https://platform.openai.com/docs/api-reference/completions/create).
In addition, we support beam search, which is not supported by OpenAI.
Args:
n: Number of output sequences to return for the given prompt.
best_of: Number of output sequences that are generated from the prompt.
From these `best_of` sequences, the top `n` sequences are returned.
`best_of` must be greater than or equal to `n`. This is treated as
the beam width when `use_beam_search` is True. By default, `best_of`
is set to `n`.
presence_penalty: Float that penalizes new tokens based on whether they
appear in the generated text so far. Values > 0 encourage the model
to use new tokens, while values < 0 encourage the model to repeat
tokens.
frequency_penalty: Float that penalizes new tokens based on their
frequency in the generated text so far. Values > 0 encourage the
model to use new tokens, while values < 0 encourage the model to
repeat tokens.
repetition_penalty: Float that penalizes new tokens based on whether
they appear in the prompt and the generated text so far. Values > 1
encourage the model to use new tokens, while values < 1 encourage
the model to repeat tokens.
temperature: Float that controls the randomness of the sampling. Lower
values make the model more deterministic, while higher values make
the model more random. Zero means greedy sampling.
top_p: Float that controls the cumulative probability of the top tokens
to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
top_k: Integer that controls the number of top tokens to consider. Set
to -1 to consider all tokens.
min_p: Float that represents the minimum probability for a token to be
considered, relative to the probability of the most likely token.
Must be in [0, 1]. Set to 0 to disable this.
seed: Random seed to use for the generation.
use_beam_search: Whether to use beam search instead of sampling.
length_penalty: Float that penalizes sequences based on their length.
Used in beam search.
early_stopping: Controls the stopping condition for beam search. It
accepts the following values: `True`, where the generation stops as
soon as there are `best_of` complete candidates; `False`, where an
heuristic is applied and the generation stops when is it very
unlikely to find better candidates; `"never"`, where the beam search
procedure only stops when there cannot be better candidates
(canonical beam search algorithm).
stop: List of strings that stop the generation when they are generated.
The returned output will not contain the stop strings.
stop_token_ids: List of tokens that stop the generation when they are
generated. The returned output will contain the stop tokens unless
the stop tokens are special tokens.
include_stop_str_in_output: Whether to include the stop strings in
output text. Defaults to False.
ignore_eos: Whether to ignore the EOS token and continue generating
tokens after the EOS token is generated.
max_tokens: Maximum number of tokens to generate per output sequence.
min_tokens: Minimum number of tokens to generate per output sequence
before EOS or stop_token_ids can be generated
logprobs: Number of log probabilities to return per output token.
When set to None, no probability is returned. If set to a non-None
value, the result includes the log probabilities of the specified
number of most likely tokens, as well as the chosen tokens.
Note that the implementation follows the OpenAI API: The API will
always return the log probability of the sampled token, so there
may be up to `logprobs+1` elements in the response.
prompt_logprobs: Number of log probabilities to return per prompt token.
detokenize: Whether to detokenize the output. Defaults to True.
skip_special_tokens: Whether to skip special tokens in the output.
spaces_between_special_tokens: Whether to add spaces between special
tokens in the output. Defaults to True.
logits_processors: List of functions that modify logits based on
previously generated tokens, and optionally prompt tokens as
a first argument.
truncate_prompt_tokens: If set to an integer k, will use only the last k
tokens from the prompt (i.e., left truncation). Defaults to None
(i.e., no truncation).
"""
(1) Usage
To keep results consistent with LLaMA-Factory, the same inference settings are used:
temperature=0.6,
top_p=0.9
from vllm import SamplingParams
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=100, stop=['<|eom_id|>', '<|eot_id|>', '<|end_of_text|>'])
Note:
- max_tokens must be set; the default is 16, which usually needs to be increased.
- Set stop=['<|eom_id|>', '<|eot_id|>', '<|end_of_text|>'], the end tokens for Llama 3.1, to prevent repeated generation.
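For reference, a couple of other SamplingParams combinations described in the docstring above; the values here are only illustrative:
from vllm import SamplingParams

# Greedy decoding: temperature=0 always picks the most likely token.
greedy_params = SamplingParams(temperature=0, max_tokens=100)

# Return three sampled candidates per prompt along with top-1 log probabilities.
multi_params = SamplingParams(n=3, temperature=0.8, top_p=0.95,
                              max_tokens=100, logprobs=1)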
3. generate
generate returns a list of RequestOutput objects.
Complete test
Note:
1. vLLM does support batched inference: although tqdm reports progress prompt by prompt, passing multiple prompts at once is noticeably faster, which shows they are processed as a batch.
prompts = ["你是谁?", "你好"]
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n, Generated text: {generated_text!r}")
Note:
By default, generate shows a tqdm progress bar; pass use_tqdm=False to turn it off.
The default output looks like this:
4. RequestOutput: Output Structure
4.1 RequestOutput
class RequestOutput:
"""The output data of a completion request to the LLM.
Args:
request_id: The unique ID of the request.
prompt: The prompt string of the request.
prompt_token_ids: The token IDs of the prompt.
prompt_logprobs: The log probabilities to return per prompt token.
outputs: The output sequences of the request.
finished: Whether the whole request is finished.
metrics: Metrics associated with the request.
lora_request: The LoRA request that was used to generate the output.
"""
Attributes we use:
No. | Attribute | Description |
---|---|---|
1 | prompt | the input text |
2 | prompt_token_ids | the input tokens; useful for counting input tokens |
3 | outputs | the outputs, a list of CompletionOutput objects |
4 | metrics | a RequestMetrics object holding the timing records for the request |
4.2 CompletionOutput
class CompletionOutput:
"""The output data of one completion output of a request.
Args:
index: The index of the output in the request.
text: The generated output text.
token_ids: The token IDs of the generated output text.
cumulative_logprob: The cumulative log probability of the generated
output text.
logprobs: The log probabilities of the top probability words at each
position if the logprobs are requested.
finish_reason: The reason why the sequence is finished.
stop_reason: The stop string or token id that caused the completion
to stop, None if the completion finished for some other reason
including encountering the EOS token.
lora_request: The LoRA request that was used to generate the output.
"""
Attributes we use:
No. | Attribute | Description |
---|---|---|
1 | text | the output text |
2 | token_ids | the output tokens; useful for counting output tokens |
4.3 RequestMetrics
class RequestMetrics:
"""Metrics associated with a request.
Attributes:
arrival_time: The time when the request arrived.
first_scheduled_time: The time when the request was first scheduled.
first_token_time: The time when the first token was generated.
time_in_queue: The time the request spent in the queue.
finished_time: The time when the request was finished.
"""
Attributes we use:
No. | Attribute | Description |
---|---|---|
1 | arrival_time | when the request started |
2 | finished_time | when the request finished |
These two can serve as reference timestamps for measuring speed.
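Putting the three classes together, a small sketch that counts tokens and estimates throughput from the metrics; it assumes llm and sampling_params have already been built as in the sections above:
outputs = llm.generate(["你好"], sampling_params, use_tqdm=False)
for output in outputs:
    in_tokens = len(output.prompt_token_ids)        # RequestOutput.prompt_token_ids
    out_tokens = len(output.outputs[0].token_ids)   # CompletionOutput.token_ids
    elapsed = output.metrics.finished_time - output.metrics.arrival_time
    print(f"in={in_tokens}, out={out_tokens}, "
          f"speed={(in_tokens + out_tokens) / elapsed:.2f} token/s")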
vLLM usage demo
from vllm import LLM, SamplingParams
import argparse
import time
import logging
from tqdm import tqdm
class Generate:
    def __init__(self, model_path, temperature=0.6, top_p=0.9, max_tokens=43200, debug=False):
        self.logger = logging.getLogger("vLLM")
        if debug:
            self.logger.setLevel(logging.DEBUG)
        else:
            self.logger.setLevel(logging.INFO)
        stream_handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        stream_handler.setFormatter(formatter)
        self.logger.addHandler(stream_handler)

        start_time = time.time()
        self.llm = LLM(
            model=model_path,
            max_model_len=43200
        )
        end_time = time.time()
        self.logger.info("load {} model use {:.2f}s".format(model_path, end_time - start_time))

        self.sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens)
        self.all_num = 0
        self.all_input_tokens_num = 0
        self.all_output_tokens_num = 0
        self.all_time = 0

    def generate_sentences(self, prompts):
        # Generate one batch of prompts and accumulate token/time statistics.
        start_time = time.time()
        outputs = self.llm.generate(prompts, self.sampling_params, use_tqdm=False)
        end_time = time.time()
        self.all_time = self.all_time + (end_time - start_time)
        result_list = []
        for output in outputs:
            self.all_num += 1
            self.all_input_tokens_num += len(output.prompt_token_ids)
            self.all_output_tokens_num += len(output.outputs[0].token_ids)
            result_list.append(output.outputs[0].text)
        return result_list

    def generate_all_data(self, input_list, batch):
        # Reset the statistics, then run through the input list batch by batch.
        self.all_num = 0
        self.all_input_tokens_num = 0
        self.all_output_tokens_num = 0
        self.all_time = 0
        output_list = []
        for i in tqdm(range(0, len(input_list), batch)):
            output = self.generate_sentences(input_list[i : i + batch])
            output_list.extend(output)
        self.logger.info("-" * 20)
        self.logger.info("process {} sentences use {:.2f}s".format(self.all_num, self.all_time))
        self.logger.info("average input token num: {:.2f}, average output token num: {:.2f}".format(
            self.all_input_tokens_num / self.all_num, self.all_output_tokens_num / self.all_num))
        self.logger.info(
            "speed: all:{:.2f} token/s, input:{:.2f} token/s, output:{:.2f} token/s".format(
                (self.all_input_tokens_num + self.all_output_tokens_num) / self.all_time,
                self.all_input_tokens_num / self.all_time,
                self.all_output_tokens_num / self.all_time
            )
        )
        self.logger.info("-" * 20 + "\n\n")
        return output_list

    def generate_file(self, input_file_path, output_file_path, batch):
        input_list = []
        with open(input_file_path, "r", encoding="utf8") as fin:
            for line in fin.readlines():
                # strip the trailing \n from each line read from the file
                if line[-1] == "\n":
                    data = line[:-1]
                else:
                    data = line
                data = data.replace("\\n", "\n")  # turn the literal "\n" back into a real newline
                input_list.append(data)
        output_list = self.generate_all_data(input_list, batch)
        with open(output_file_path, "w", encoding="utf8") as fout:
            for line in output_list:
                fout.write(line + "\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="vLLM Llama Demo")
    parser.add_argument('-m', '--model', type=str, required=True, help='model path')
    parser.add_argument('-i', '--input', type=str, required=True, help='input file path')
    parser.add_argument('-o', '--output', type=str, required=True, help='output file path')
    parser.add_argument('-b', '--batch', type=int, default=1)
    parser.add_argument('--temperature', type=float, default=0.6)
    parser.add_argument('--top_p', type=float, default=0.9)
    parser.add_argument('--max_token', type=int, default=43200)
    parser.add_argument('--debug', action="store_true", help="whether to show debug information")
    args = parser.parse_args()

    generate = Generate(args.model, args.temperature, args.top_p, args.max_token, args.debug)
    generate.generate_file(args.input, args.output, args.batch)
Command to run:
CUDA_VISIBLE_DEVICES=0 python3 inference.py -m ./Meta-Llama-3.1-8B-Instruct/ -i test.txt -o output.txt -b 4 --max_token 100
Output: