vLLM Usage Tutorial [V5.0.4]

vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.
For setting up the vLLM environment, see my other article, "vLLM Environment Installation and Running Examples [V5.0.4]".

This first pass only covers the parameters I have used so far; I will update the post as I try the others.

vLLM parameter reference

The inference tests below are based on Meta-Llama-3.1-8B-Instruct.

1. LLM: loading the model

Parameter descriptions from the source docstring:

class LLM:
    """An LLM for generating texts from given prompts and sampling parameters.

    This class includes a tokenizer, a language model (possibly distributed
    across multiple GPUs), and GPU memory space allocated for intermediate
    states (aka KV cache). Given a batch of prompts and sampling parameters,
    this class generates texts from the model, using an intelligent batching
    mechanism and efficient memory management.

    Args:
        model: The name or path of a HuggingFace Transformers model.
        tokenizer: The name or path of a HuggingFace Transformers tokenizer.
        tokenizer_mode: The tokenizer mode. "auto" will use the fast tokenizer
            if available, and "slow" will always use the slow tokenizer.
        skip_tokenizer_init: If true, skip initialization of tokenizer and
            detokenizer. Expect valid prompt_token_ids and None for prompt
            from the input.
        trust_remote_code: Trust remote code (e.g., from HuggingFace) when
            downloading the model and tokenizer.
        tensor_parallel_size: The number of GPUs to use for distributed
            execution with tensor parallelism.
        dtype: The data type for the model weights and activations. Currently,
            we support `float32`, `float16`, and `bfloat16`. If `auto`, we use
            the `torch_dtype` attribute specified in the model config file.
            However, if the `torch_dtype` in the config is `float32`, we will
            use `float16` instead.
        quantization: The method used to quantize the model weights. Currently,
            we support "awq", "gptq", "squeezellm", and "fp8" (experimental).
            If None, we first check the `quantization_config` attribute in the
            model config file. If that is None, we assume the model weights are
            not quantized and use `dtype` to determine the data type of
            the weights.
        revision: The specific model version to use. It can be a branch name,
            a tag name, or a commit id.
        tokenizer_revision: The specific tokenizer version to use. It can be a
            branch name, a tag name, or a commit id.
        seed: The seed to initialize the random number generator for sampling.
        gpu_memory_utilization: The ratio (between 0 and 1) of GPU memory to
            reserve for the model weights, activations, and KV cache. Higher
            values will increase the KV cache size and thus improve the model's
            throughput. However, if the value is too high, it may cause out-of-
            memory (OOM) errors.
        swap_space: The size (GiB) of CPU memory per GPU to use as swap space.
            This can be used for temporarily storing the states of the requests
            when their `best_of` sampling parameters are larger than 1. If all
            requests will have `best_of=1`, you can safely set this to 0.
            Otherwise, too small values may cause out-of-memory (OOM) errors.
        cpu_offload_gb: The size (GiB) of CPU memory to use for offloading
            the model weights. This virtually increases the GPU memory space
            you can use to hold the model weights, at the cost of CPU-GPU data
            transfer for every forward pass.
        enforce_eager: Whether to enforce eager execution. If True, we will
            disable CUDA graph and always execute the model in eager mode.
            If False, we will use CUDA graph and eager execution in hybrid.
        max_context_len_to_capture: Maximum context len covered by CUDA graphs.
            When a sequence has context length larger than this, we fall back
            to eager mode (DEPRECATED. Use `max_seq_len_to_capture` instead).
        max_seq_len_to_capture: Maximum sequence len covered by CUDA graphs.
            When a sequence has context length larger than this, we fall back
            to eager mode.
        disable_custom_all_reduce: See ParallelConfig
        **kwargs: Arguments for :class:`~vllm.EngineArgs`. (See
            :ref:`engine_args`)
    
    Note:
        This class is intended to be used for offline inference. For online
        serving, use the :class:`~vllm.AsyncLLMEngine` class instead.
    """

(1) Default usage
Point model at the local model directory.
test.py:

from vllm import LLM
llm = LLM(model="Meta-Llama-3.1-8B-Instruct")
print("model load success!")

Run it with:

CUDA_VISIBLE_DEVICES=0 python3 test.py

CUDA_VISIBLE_DEVICES selects which GPU to use.
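
If more than one GPU is available, the tensor_parallel_size argument described in the docstring above shards the model across GPUs. A minimal sketch, assuming two GPUs are visible (not part of the original test):

# Hypothetical multi-GPU variant of test.py: shard the model with tensor
# parallelism. Run with, for example:
#   CUDA_VISIBLE_DEVICES=0,1 python3 test_tp.py
from vllm import LLM

llm = LLM(
    model="Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,  # number of GPUs used for tensor-parallel execution
)
print("model load success!")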

Error message:
(screenshot: the KV cache length error raised with the default settings)
(2) Add the max_model_len parameter
The error says the KV cache can only hold 43200 tokens, while Meta-Llama-3.1-8B-Instruct's maximum model length is 131072, so max_model_len is needed to cap the model's maximum sequence length.

from vllm import LLM
llm = LLM(
    model="Meta-Llama-3.1-8B-Instruct",
    max_model_len=43200,
)
print("model load success!")

Result:
(screenshot: the model loads successfully)
It needs about 22 GB of GPU memory.
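
If you need to leave headroom on the card, gpu_memory_utilization (see the docstring above) controls the fraction of GPU memory vLLM reserves for weights, activations, and the KV cache; lowering it shrinks the KV cache and therefore throughput. A hedged sketch, with 0.8 chosen purely as an example:

from vllm import LLM

llm = LLM(
    model="Meta-Llama-3.1-8B-Instruct",
    max_model_len=43200,
    gpu_memory_utilization=0.8,  # example value: reserve at most 80% of GPU memory
)
print("model load success!")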

2. SamplingParams

Parameter descriptions from the source docstring:

class SamplingParams:
    """Sampling parameters for text generation.

    Overall, we follow the sampling parameters from the OpenAI text completion
    API (https://platform.openai.com/docs/api-reference/completions/create).
    In addition, we support beam search, which is not supported by OpenAI.

    Args:
        n: Number of output sequences to return for the given prompt.
        best_of: Number of output sequences that are generated from the prompt.
            From these `best_of` sequences, the top `n` sequences are returned.
            `best_of` must be greater than or equal to `n`. This is treated as
            the beam width when `use_beam_search` is True. By default, `best_of`
            is set to `n`.
        presence_penalty: Float that penalizes new tokens based on whether they
            appear in the generated text so far. Values > 0 encourage the model
            to use new tokens, while values < 0 encourage the model to repeat
            tokens.
        frequency_penalty: Float that penalizes new tokens based on their
            frequency in the generated text so far. Values > 0 encourage the
            model to use new tokens, while values < 0 encourage the model to
            repeat tokens.
        repetition_penalty: Float that penalizes new tokens based on whether
            they appear in the prompt and the generated text so far. Values > 1
            encourage the model to use new tokens, while values < 1 encourage
            the model to repeat tokens.
        temperature: Float that controls the randomness of the sampling. Lower
            values make the model more deterministic, while higher values make
            the model more random. Zero means greedy sampling.
        top_p: Float that controls the cumulative probability of the top tokens
            to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
        top_k: Integer that controls the number of top tokens to consider. Set
            to -1 to consider all tokens.
        min_p: Float that represents the minimum probability for a token to be
            considered, relative to the probability of the most likely token.
            Must be in [0, 1]. Set to 0 to disable this.
        seed: Random seed to use for the generation.
        use_beam_search: Whether to use beam search instead of sampling.
        length_penalty: Float that penalizes sequences based on their length.
            Used in beam search.
        early_stopping: Controls the stopping condition for beam search. It
            accepts the following values: `True`, where the generation stops as
            soon as there are `best_of` complete candidates; `False`, where an
            heuristic is applied and the generation stops when is it very
            unlikely to find better candidates; `"never"`, where the beam search
            procedure only stops when there cannot be better candidates
            (canonical beam search algorithm).
        stop: List of strings that stop the generation when they are generated.
            The returned output will not contain the stop strings.
        stop_token_ids: List of tokens that stop the generation when they are
            generated. The returned output will contain the stop tokens unless
            the stop tokens are special tokens.
        include_stop_str_in_output: Whether to include the stop strings in
            output text. Defaults to False.
        ignore_eos: Whether to ignore the EOS token and continue generating
            tokens after the EOS token is generated.
        max_tokens: Maximum number of tokens to generate per output sequence.
        min_tokens: Minimum number of tokens to generate per output sequence
            before EOS or stop_token_ids can be generated
        logprobs: Number of log probabilities to return per output token.
            When set to None, no probability is returned. If set to a non-None
            value, the result includes the log probabilities of the specified
            number of most likely tokens, as well as the chosen tokens.
            Note that the implementation follows the OpenAI API: The API will
            always return the log probability of the sampled token, so there
            may be up to `logprobs+1` elements in the response.
        prompt_logprobs: Number of log probabilities to return per prompt token.
        detokenize: Whether to detokenize the output. Defaults to True.
        skip_special_tokens: Whether to skip special tokens in the output.
        spaces_between_special_tokens: Whether to add spaces between special
            tokens in the output.  Defaults to True.
        logits_processors: List of functions that modify logits based on
            previously generated tokens, and optionally prompt tokens as
            a first argument.
        truncate_prompt_tokens: If set to an integer k, will use only the last k
            tokens from the prompt (i.e., left truncation). Defaults to None
            (i.e., no truncation).
    """

(1) Usage
To keep results consistent with the LLaMA-Factory inference settings, use the same sampling values:
temperature=0.6,
top_p=0.9

from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.9,
    max_tokens=100,
    stop=["<|eom_id|>", "<|eot_id|>", "<|end_of_text|>"],
)

Notes:

  1. Set max_tokens explicitly; the default is 16, which is usually far too small.
  2. Set stop=['<|eom_id|>', '<|eot_id|>', '<|end_of_text|>'], the Llama 3.1 end-of-turn tokens, to keep the model from generating repeatedly past the end of its answer. A deterministic variant of these settings is sketched after this list.
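
For fully reproducible output, the docstring above notes that temperature=0 means greedy sampling. A minimal sketch, reusing the same Llama 3.1 stop tokens:

from vllm import SamplingParams

# Greedy decoding: temperature=0 disables sampling, so repeated runs on the
# same prompt should return the same text.
greedy_params = SamplingParams(
    temperature=0,
    max_tokens=100,
    stop=["<|eom_id|>", "<|eot_id|>", "<|end_of_text|>"],
)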

3. generate

generate returns a list of RequestOutput objects.
(screenshot: full test run)
Note:

vLLM does support batched inference: although tqdm reports progress one prompt at a time, the clear speedup when several prompts are passed together shows that they are processed as a batch. A quick timing check is sketched after the example below.

prompts = ["你是谁?", "你好"]
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n, Generated text: {generated_text!r}")

Note:
By default, generate shows a tqdm progress bar; pass use_tqdm=False to turn it off.
The default looks like this:
(screenshot: the tqdm progress bar shown during generate)

4. RequestOutput: the output structure

4.1 RequestOutput

class RequestOutput:
    """The output data of a completion request to the LLM.

    Args:
        request_id: The unique ID of the request.
        prompt: The prompt string of the request.
        prompt_token_ids: The token IDs of the prompt.
        prompt_logprobs: The log probabilities to return per prompt token.
        outputs: The output sequences of the request.
        finished: Whether the whole request is finished.
        metrics: Metrics associated with the request.
        lora_request: The LoRA request that was used to generate the output.
    """

Attributes we use:

| No. | Attribute | Description |
| --- | --- | --- |
| 1 | prompt | The input text |
| 2 | prompt_token_ids | Token IDs of the input; useful for counting input tokens |
| 3 | outputs | The outputs, a list of CompletionOutput objects |
| 4 | metrics | A RequestMetrics object holding the request's timing information |
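
A short sketch of reading these attributes from a generate() call, reusing llm and sampling_params from above:

outputs = llm.generate(["你好"], sampling_params, use_tqdm=False)

for output in outputs:
    # output is a RequestOutput
    print("prompt:", output.prompt)
    print("input tokens:", len(output.prompt_token_ids))
    print("num completions:", len(output.outputs))  # list of CompletionOutput
    print("metrics:", output.metrics)               # RequestMetrics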

4.2 CompletionOutput

class CompletionOutput:
    """The output data of one completion output of a request.

    Args:
        index: The index of the output in the request.
        text: The generated output text.
        token_ids: The token IDs of the generated output text.
        cumulative_logprob: The cumulative log probability of the generated
            output text.
        logprobs: The log probabilities of the top probability words at each
            position if the logprobs are requested.
        finish_reason: The reason why the sequence is finished.
        stop_reason: The stop string or token id that caused the completion
            to stop, None if the completion finished for some other reason
            including encountering the EOS token.
        lora_request: The LoRA request that was used to generate the output.
    """

Attributes we use:

| No. | Attribute | Description |
| --- | --- | --- |
| 1 | text | The generated text |
| 2 | token_ids | Token IDs of the generated text; useful for counting output tokens |
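
When SamplingParams is created with n greater than 1, outputs holds several CompletionOutput entries per prompt. A hedged sketch of inspecting them (n=2 is just an example):

from vllm import SamplingParams

multi_params = SamplingParams(n=2, temperature=0.6, top_p=0.9, max_tokens=100)

outputs = llm.generate(["你好"], multi_params, use_tqdm=False)
for completion in outputs[0].outputs:
    # completion is a CompletionOutput
    print(completion.index, completion.finish_reason)
    print(completion.text)
    print("output tokens:", len(completion.token_ids))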

4.3 RequestMetrics

class RequestMetrics:
    """Metrics associated with a request.

    Attributes:
        arrival_time: The time when the request arrived.
        first_scheduled_time: The time when the request was first scheduled.
        first_token_time: The time when the first token was generated.
        time_in_queue: The time the request spent in the queue.
        finished_time: The time when the request was finished.
    """

Attributes we use:

| No. | Attribute | Description |
| --- | --- | --- |
| 1 | arrival_time | Time the request arrived (start) |
| 2 | finished_time | Time the request finished |

These can serve as reference timestamps when measuring speed.
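
A rough sketch of turning these timestamps into a latency and output-speed estimate, assuming metrics is populated as it is for completed offline requests:

outputs = llm.generate(["你好"], sampling_params, use_tqdm=False)

for output in outputs:
    m = output.metrics
    duration = m.finished_time - m.arrival_time      # wall-clock seconds for this request
    n_out = len(output.outputs[0].token_ids)
    print(f"latency: {duration:.2f}s, output speed: {n_out / duration:.2f} token/s")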

vLLM usage demo

from vllm import LLM, SamplingParams
import argparse
import time
import logging
from tqdm import tqdm

class Generate:
    def __init__(self, model_path, temperature=0.6, top_p=0.9, max_tokens=43200, debug=False):
        self.logger = logging.getLogger("vLLM")
        if debug:
            self.logger.setLevel(logging.DEBUG)
        else:
            self.logger.setLevel(logging.INFO)
        stream_handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        stream_handler.setFormatter(formatter)
        self.logger.addHandler(stream_handler)

        start_time = time.time()
        self.llm = LLM(
            model=model_path, 
            max_model_len=43200
        )
        end_time = time.time()
        self.logger.info("load {} model use {:.2f}s".format(model_path, end_time - start_time))

        self.sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens)

        self.all_num = 0
        self.all_input_tokens_num = 0
        self.all_output_tokens_num = 0
        self.all_time = 0

    def generate_sentences(self, prompts):
        start_time = time.time()
        outputs = self.llm.generate(prompts, self.sampling_params, use_tqdm=False)
        end_time = time.time()
        self.all_time = self.all_time + (end_time - start_time)
        result_list = []
        for output in outputs:
            self.all_num += 1
            self.all_input_tokens_num += len(output.prompt_token_ids)
            self.all_output_tokens_num += len(output.outputs[0].token_ids)
            result_list.append(output.outputs[0].text)
        return result_list

    def generate_all_data(self, input_list, batch):
        self.all_num = 0
        self.all_input_tokens_num = 0
        self.all_output_tokens_num = 0
        self.all_time = 0
        output_list = []
        for i in tqdm(range(0, len(input_list), batch)):
            output = self.generate_sentences(input_list[i : i + batch])
            output_list.extend(output)
        
        self.logger.info("-" * 20)
        self.logger.info("process {} sentences use {:.2f}s".format(self.all_num, self.all_time))
        self.logger.info("everage input token num: {:.2f}, everage output token num: {:.2f}".format(self.all_input_tokens_num / self.all_num, self.all_output_tokens_num / self.all_num))
        self.logger.info(
            "speed: all:{:.2f} token/s, input:{:.2f} token/s, output:{:.2f} token/s".format(
                (self.all_input_tokens_num + self.all_output_tokens_num) / self.all_time, 
                self.all_input_tokens_num / self.all_time, 
                self.all_output_tokens_num / self.all_time
            )
        )
        self.logger.info("-" * 20 + "\n\n")

        return output_list
    
    def generate_file(self, input_file_path, output_file_path, batch):
        input_list = []
        with open(input_file_path, "r", encoding="utf8") as fin:
            for line in fin.readlines():
                # strip the trailing newline read from the file
                if line[-1] == "\n":
                    data = line[:-1]
                else:
                    data = line
                data = data.replace("\\n", "\n")  # turn the literal "\n" back into a real newline
                input_list.append(data)

        output_list = self.generate_all_data(input_list, batch)

        with open(output_file_path, "w", encoding="utf8") as fout:
            for line in output_list:
                fout.write(line + "\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="vLLM Llama Demo")

    parser.add_argument('-m', '--model', type=str, required=True, help='model path')
    parser.add_argument('-i', '--input', type=str, required=True, help='input file path')
    parser.add_argument('-o', '--output', type=str, required=True, help='output file path')
    parser.add_argument('-b', '--batch', type=int, default=1)
    parser.add_argument('--temperature', type=float, default=0.6)
    parser.add_argument('--top_p', type=float, default=0.9)
    parser.add_argument('--max_token', type=int, default=43200)
    parser.add_argument('--debug', action="store_true", help="whether to show debug information")

    args = parser.parse_args()

    generate = Generate(args.model, args.temperature, args.top_p, args.max_token, args.debug)

    generate.generate_file(args.input, args.output, args.batch)
    
    

Run command:

 CUDA_VISIBLE_DEVICES=0 python3 inference.py -m ./Meta-Llama-3.1-8B-Instruct/ -i test.txt -o output.txt -b 4 --max_token 100
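
generate_file reads one prompt per line and turns the literal characters \n back into real newlines, so a multi-line prompt is written on a single line. A hypothetical test.txt could look like this (the third prompt is only an illustration of the \n convention):

你是谁?
你好
Introduce vLLM in one sentence.\nKeep the answer short.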

Run result:
(screenshot: the run output and logged speed statistics)
