vLLM (4) - LLMEngine (Part 1)

Series Contents

vLLM (1) - Qwen2 Inference & Deployment
vLLM (2) - Architecture Overview
vLLM (3) - Sequence & SequenceGroup
vLLM (4) - LLMEngine (Part 1)
vLLM (5) - LLMEngine (Part 2)



Preface

After the groundwork laid in the previous two articles, we finally get to LLMEngine. As shown in the figure below, LLMEngine consists of two parts. The right half contains important classes such as Worker, CacheEngine, and ModelRunner, which are used when LLMEngine is initialized; their job includes loading the model and initializing the KV cache, and they are the focus of this article. The left half contains Scheduler and BlockSpaceManager, which schedule user requests while managing GPU and CPU memory; that part comes into play during LLMEngine's generate phase and will be covered in later articles.
[Figure: LLMEngine components — Worker, CacheEngine, ModelRunner (initialization) and Scheduler, BlockSpaceManager (generation)]


1. Class Diagram

This article focuses on the initialization of LLMEngine. Since the call chain is fairly involved, I use a class diagram to show how the classes relate to one another. The diagram lists only the attributes and methods relevant to this article, so that other members do not get in the way. It is best read alongside the code that follows.

# Class diagram

+-------------------------+
|           LLM           |
+-------------------------+
| + llm_engine: LLMEngine |
+-------------------------+
            |
            v
+-------------------------------+
|           LLMEngine           |
+-------------------------------+
| + model_executor: GPUExecutor |  # executor; the name is slightly ambiguous, the project also has a subdirectory called model_executor
| - _initialize_kv_caches()     |  # initialize the kv_caches
| + scheduler: Scheduler        |  # scheduler
| + output_processor            |  # output processor
+-------------------------------+
            |
            v
+-----------------------------------------------------+
|                     GPUExecutor                     |
+-----------------------------------------------------+
| - _init_executor()                                  |  # initialize the executor
| + driver_worker: Worker                             |  # worker
| + determine_num_available_blocks(): Tuple[int, int] |  # determine the available gpu blocks and cpu blocks
| + initialize_cache()                                |  # initialize the cache; all-zero tensors reserve the memory for kv_cache
+-----------------------------------------------------+
            |
            v
+------------------------------+
|            Worker            |
+------------------------------+
| + model_runner: ModelRunner  |  # loads and runs the model
| + cache_engine: CacheEngine  |  # initializes and updates the kv_cache
| + init_device()              |  # initialize the device (gpu)
| + load_model()               |  # load the model
+------------------------------+
            |                                |
            v                                v
+-------------------------+    +--------------------------------------------+
|       ModelRunner       |    |                CacheEngine                 |
+-------------------------+    +--------------------------------------------+
| + load_model()          |    | + gpu_cache                                |
| + profile_run()         |    | - _allocate_kv_cache(): List[torch.Tensor] |
| + capture_model()       |    | + get_cache_block_size(...): int           |
+-------------------------+    +--------------------------------------------+

2. LLM

LLM is a class that, given a prompt and sampling parameters, generates text with the specified large language model. Its core component is self.llm_engine (an instance of LLMEngine), which does most of LLM's work.
Example usage of LLM is shown below. 1) Construct an LLM object; its initialization creates llm_engine: LLMEngine (the focus of this article). 2) Handle requests via self.generate(), which schedules resources to serve user requests efficiently and produces the output text (covered in later articles).

# See the Qwen2 inference article in this series for the complete example
from vllm import LLM

llm = LLM(model=DEFAULT_CKPT_PATH)      # DEFAULT_CKPT_PATH is the model name or a local download directory
outputs = llm.generate(text, sampling_params)  # text is the input text, sampling_params are the sampling parameters
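The snippet above passes a sampling_params object without constructing it. As a minimal sketch, it comes from vllm.SamplingParams; the field values below are arbitrary examples, not recommendations:

# Arbitrary example values; SamplingParams has many more options
from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)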

3. LLMEngine

LLMEngine consists of two main parts: 1) model_executor and 2) scheduler. model_executor is responsible for everything model related, such as selecting the device and loading the model, while scheduler handles resource scheduling, which is used heavily during inference.
Let's look at the code to see what LLMEngine does during initialization:

  1. Create model_executor: build the model executor from model_config and the other configs. Those of us on a modest budget typically run vLLM on a single GPU, in which case model_executor is a GPUExecutor; on Neuron or TPU hardware it would be a NeuronExecutor or TPUExecutor instead. The configs (model_config and friends) are simply the input and default arguments split up by function, so I won't go through them here.
  2. Initialize kv_caches: through self.model_executor (expanded in the next section), determine how much memory is available for kv_caches and create tensors to occupy it. We already observed and analyzed this behavior in the "actual GPU memory usage" part of the Qwen2 inference & deployment article; have a look there if it is unclear.
  3. Build the scheduler: resource scheduling mostly happens during inference.
  4. Other: e.g. create the output_processor; not the focus here.
# vllm/engine/llm_engine.py
class LLMEngine:
    def __init__(self, ...):
        # ...
        
        self.model_executor = executor_class(
            model_config=model_config,
            cache_config=cache_config,
            parallel_config=parallel_config,
            scheduler_config=scheduler_config,
            device_config=device_config,
            lora_config=lora_config,
            vision_language_config=vision_language_config,
            speculative_config=speculative_config,
            load_config=load_config,
        )    # 1) build the model_executor from the input configs

        if not self.model_config.embedding_mode:
            self._initialize_kv_caches()   # 2) initialize the kv caches

        # 3) build the scheduler
        self.scheduler = Scheduler(scheduler_config, cache_config, lora_config)

        # 4) Create the sequence output processor (used when producing the final
        # output), e.g. for beam search or speculative decoding.
        self.output_processor = (
            SequenceGroupOutputProcessor.create_output_processor(
                self.scheduler_config,
                self.detokenizer,
                self.scheduler,
                self.seq_counter,
                self.get_tokenizer_for_seq,
                stop_checker=StopChecker(
                    self.scheduler_config.max_model_len,
                    self.get_tokenizer_for_seq,
                ),
            ))

    def _initialize_kv_caches(self) -> None:
        """Initialize the KV cache in the worker(s).

        The workers will determine the number of blocks in both the GPU cache
        and the swap CPU cache.
        """
        num_gpu_blocks, num_cpu_blocks = (
            self.model_executor.determine_num_available_blocks())

        if self.cache_config.num_gpu_blocks_override is not None:
            num_gpu_blocks_override = self.cache_config.num_gpu_blocks_override
            logger.info(
                "Overriding num_gpu_blocks=%d with "
                "num_gpu_blocks_override=%d", num_gpu_blocks,
                num_gpu_blocks_override)
            num_gpu_blocks = num_gpu_blocks_override

        self.cache_config.num_gpu_blocks = num_gpu_blocks
        self.cache_config.num_cpu_blocks = num_cpu_blocks

        self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)

4. GPUExecutor

What does a model_executor such as GPUExecutor do during initialization? GPUExecutor inherits from the base class ExecutorBase, whose self.__init__() calls self._init_executor(), which does the following:

  1. Create the worker with self._create_worker(): the worker is actually created through WorkerWrapperBase, and different configurations map to different worker types. The default is Worker; with speculative decoding enabled it is a SpecDecodeWorker (used well, speculative decoding improves decoding efficiency).
  2. Have the worker initialize the device: self.driver_worker.init_device().
  3. Have the worker load the model: self.driver_worker.load_model().

As mentioned earlier, after GPUExecutor is created it is also used to initialize the kv_caches, as shown in LLMEngine._initialize_kv_caches() in the previous section. This mainly involves two GPUExecutor methods:

  4. self.determine_num_available_blocks(): returns the number of gpu_blocks and cpu_blocks currently available. "Block" here means the GPU/CPU memory is divided into chunks of size block_size, each corresponding to a fixed amount of GPU/CPU memory.
  5. initialize_cache(): once num_gpu_blocks and num_cpu_blocks are known, i.e. once we know how much GPU and CPU memory can be used for kv_caches, that memory is claimed and the cache is initialized.

This is a brief overview of GPUExecutor's early work, but these operations largely delegate to the worker it creates, which we look at in the next section.
# vllm/executor/gpu_executor.py
class GPUExecutor(ExecutorBase):

    def _init_executor(self) -> None:
        """Initialize the worker and load the model.
        """
        assert self.parallel_config.world_size == 1, (
            "GPUExecutor only supports single GPU.")

        self.driver_worker = self._create_worker()   # create the worker
        self.driver_worker.init_device()   # initialize the device
        self.driver_worker.load_model()    # load the model

    def _create_worker(self,
                       local_rank: int = 0,
                       rank: int = 0,
                       distributed_init_method: Optional[str] = None):

        if self.speculative_config is None:
            worker_module_name = "vllm.worker.worker"
            worker_class_name = "Worker"
        else:
            worker_module_name = "vllm.spec_decode.spec_decode_worker"
            worker_class_name = "create_spec_worker"

        wrapper = WorkerWrapperBase(
            worker_module_name=worker_module_name,
            worker_class_name=worker_class_name,
        )
        wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
                                                      distributed_init_method))
        return wrapper.worker

    def determine_num_available_blocks(self) -> Tuple[int, int]:
        """Determine the number of available KV blocks by invoking the
        underlying worker.
        """
        return self.driver_worker.determine_num_available_blocks()

    def initialize_cache(self, num_gpu_blocks: int, num_cpu_blocks: int) -> None:
        """Initialize the KV cache by invoking the underlying worker.
        """
        # NOTE: This is logged in the executor because there can be >1 worker
        # with other executors. We could log in the engine level, but work
        # remains to abstract away the device for non-GPU configurations.
        logger.info("# GPU blocks: %d, # CPU blocks: %d", num_gpu_blocks,
                    num_cpu_blocks)

        self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)

5. Worker

Worker carries two responsibilities, the model and the cache, corresponding to the members model_runner and cache_engine:

  1. self.model_runner: for generation with a large model (our example) it is a ModelRunner instance; in embedding_mode it is an EmbeddingModelRunner instance.
  2. self.cache_engine: a CacheEngine instance. self.initialize_cache() mainly initializes self.cache_engine; the details are two sections further down.

As for self.determine_num_available_blocks(), it returns num_gpu_blocks and num_cpu_blocks, obtained as follows:

  • num_gpu_blocks: after clearing the CUDA cache, run one forward pass to profile the model's memory usage, then read the device's free and total memory to compute the peak usage peak_memory. The memory available for kv_caches is then total_gpu_memory * self.cache_config.gpu_memory_utilization - peak_memory, where gpu_memory_utilization is the target GPU utilization (default 0.9). Because the cache is organized in blocks, dividing by cache_block_size gives num_gpu_blocks, where cache_block_size is the number of bytes one block occupies (covered in the CacheEngine section). A small worked example follows this list.
  • num_cpu_blocks: the model never computes on the CPU, but the CPU can hold cache that is swapped to the GPU when needed; the size of this memory is self.cache_config.swap_space_bytes, 4 GB by default.
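To make the formulas concrete, here is a purely illustrative calculation with made-up numbers (a 24 GiB GPU, a profiled peak of 18 GiB, the default gpu_memory_utilization of 0.9 and swap space of 4 GiB, and the cache_block_size of 917,504 bytes computed in the CacheEngine example later in this article); the real values depend entirely on your GPU and model:

# Hypothetical numbers, for illustration only
GiB = 1024 ** 3
total_gpu_memory = 24 * GiB      # from torch.cuda.mem_get_info()
peak_memory = 18 * GiB           # weights + activations measured during profile_run()
gpu_memory_utilization = 0.9     # vLLM default
swap_space_bytes = 4 * GiB       # vLLM default CPU swap space
cache_block_size = 917_504       # bytes per block, see the CacheEngine example below

num_gpu_blocks = int((total_gpu_memory * gpu_memory_utilization - peak_memory)
                     // cache_block_size)                    # 4213 blocks
num_cpu_blocks = int(swap_space_bytes // cache_block_size)   # 4681 blocks
# 4213 blocks * block_size 16 ≈ 67k tokens of KV cache on the GPU

The actual Worker code is shown below.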
# vllm/worker/worker.py
class Worker(WorkerBase):
    def __init__(self, ...):   # the arguments are a set of config objects, omitted here
        # unrelated code omitted

        ModelRunnerClass = (EmbeddingModelRunner if
                            self.model_config.embedding_mode else ModelRunner)
        self.model_runner = ModelRunnerClass(
            model_config,
            parallel_config,
            scheduler_config,
            device_config,
            cache_config,
            load_config=load_config,
            lora_config=self.lora_config,
            kv_cache_dtype=self.cache_config.cache_dtype,
            is_driver_worker=is_driver_worker,
            vision_language_config=vision_language_config,
        )
        # Uninitialized cache engine. Will be initialized by
        # initialize_cache.
        self.cache_engine: CacheEngine
        # Initialize gpu_cache as embedding models don't initialize kv_caches
        self.gpu_cache: Optional[List[torch.tensor]] = None

    # ------------------- parts called by GPUExecutor during initialization ------------------- # 
    def init_device(self) -> None:
        if self.device_config.device.type == "cuda":
            os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

            # This env var set by Ray causes exceptions with graph building.
            os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
            self.device = torch.device(f"cuda:{self.local_rank}")
            torch.cuda.set_device(self.device)

            _check_if_gpu_supports_dtype(self.model_config.dtype)
            torch.cuda.empty_cache()
            self.init_gpu_memory = torch.cuda.mem_get_info()[0]
        else:
            raise RuntimeError(
                f"Not support device type: {self.device_config.device}")
        # Initialize the distributed environment.
        init_worker_distributed_environment(self.parallel_config, self.rank,
                                            self.distributed_init_method,
                                            self.local_rank)
        # Set random seed.
        set_random_seed(self.model_config.seed)

    def load_model(self):
        self.model_runner.load_model()

    # ------------------- model-runner related ------------------- # 
    @torch.inference_mode()
    def determine_num_available_blocks(self) -> Tuple[int, int]:
        """Profiles the peak memory usage of the model to determine how many
        KV blocks may be allocated without OOMs.
        """
        # Profile the memory usage of the model and get the maximum number of
        # cache blocks that can be allocated with the remaining free memory.
        torch.cuda.empty_cache()

        # Execute a forward pass with dummy inputs to profile the memory usage
        # of the model.
        self.model_runner.profile_run()

        # Calculate the number of blocks that can be allocated with the
        # profiled peak memory.
        torch.cuda.synchronize()
        free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
        # NOTE(woosuk): Here we assume that the other processes using the same
        # GPU did not change their memory usage during the profiling.
        peak_memory = self.init_gpu_memory - free_gpu_memory
        assert peak_memory > 0, (
            "Error in memory profiling. This happens when the GPU memory was "
            "not properly cleaned up before initializing the vLLM instance.")

        cache_block_size = self.get_cache_block_size_bytes()
        num_gpu_blocks = int(
            (total_gpu_memory * self.cache_config.gpu_memory_utilization -
             peak_memory) // cache_block_size)
        num_cpu_blocks = int(self.cache_config.swap_space_bytes //
                             cache_block_size)
        num_gpu_blocks = max(num_gpu_blocks, 0)
        num_cpu_blocks = max(num_cpu_blocks, 0)
        if self.model_runner.lora_manager:
            self.model_runner.remove_all_loras()
        gc.collect()
        torch.cuda.empty_cache()
        return num_gpu_blocks, num_cpu_blocks

    # --------------------- cache related ------------------------
    def initialize_cache(self, num_gpu_blocks: int,
                         num_cpu_blocks: int) -> None:
        """Allocate GPU and CPU KV cache with the specified number of blocks.

        This also warms up the model, which may record CUDA graphs.
        """
        raise_if_cache_size_invalid(num_gpu_blocks,
                                    self.cache_config.block_size,
                                    self.model_config.max_model_len)

        self.cache_config.num_gpu_blocks = num_gpu_blocks
        self.cache_config.num_cpu_blocks = num_cpu_blocks

        self._init_cache_engine()
        self._warm_up_model()

    def _init_cache_engine(self):
        assert self.cache_config.num_gpu_blocks is not None
        self.cache_engine = CacheEngine(self.cache_config, self.model_config,
                                        self.parallel_config)
        self.gpu_cache = self.cache_engine.gpu_cache

    def _warm_up_model(self) -> None:
        if not self.model_config.enforce_eager:
            self.model_runner.capture_model(self.gpu_cache)
        # Reset the seed to ensure that the random state is not affected by
        # the model initialization and profiling.
        set_random_seed(self.model_config.seed)

6. ModelRunner

This section focuses on two ModelRunner methods: self.profile_run() and self.capture_model().

  1. self.profile_run() runs dummy input through the model to see the actual memory usage. The key call is self.execute_model(seqs, kv_caches), so we need to prepare the input seqs and the cache kv_caches; kv_caches is used in the Attention computation and is initially all None. Detailed comments are given below, and a small numeric illustration of how the dummy tokens are split across sequences follows this paragraph.
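Before diving into the code, here is a quick sanity check on the even split, using hypothetical scheduler settings (max_num_batched_tokens=32768, max_num_seqs=256); every dummy sequence ends up 128 tokens long:

# Hypothetical scheduler settings, chosen only to illustrate the split
max_num_batched_tokens, max_num_seqs = 32768, 256
seq_lens = [
    max_num_batched_tokens // max_num_seqs
    + (group_id < max_num_batched_tokens % max_num_seqs)
    for group_id in range(max_num_seqs)
]
assert sum(seq_lens) == max_num_batched_tokens  # the dummy batch uses the full token budget
assert set(seq_lens) == {128}                   # 32768 / 256 = 128 tokens per sequence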
class ModelRunner:
    # ...
    
    @torch.inference_mode()
    def profile_run(self) -> None:
        # Top-k sampling over (nearly) the full vocab, used to profile memory usage
        sampling_params = SamplingParams(top_p=0.99, top_k=self.vocab_size - 1)
        # Maximum number of tokens processed in one batch, typically 32k
        max_num_batched_tokens = self.scheduler_config.max_num_batched_tokens
        # Maximum number of sequences, typically 256
        max_num_seqs = self.scheduler_config.max_num_seqs

        # Profiling uses max_num_seqs sequences with max_num_batched_tokens tokens in total
        seqs: List[SequenceGroupMetadata] = []
        model_config = self.model_config

        # lora: omitted
        # vlm: omitted

        for group_id in range(max_num_seqs):
            # Split the tokens evenly across sequences to get each sequence's length
            seq_len = (max_num_batched_tokens // max_num_seqs +
                       (group_id < max_num_batched_tokens % max_num_seqs))

            # SequenceData and SequenceGroupMetadata were covered in an earlier article in this series
            # Build the dummy input
            seq_data = SequenceData([0] * seq_len)
            dummy_multi_modal_data = None

            seq = SequenceGroupMetadata(
                request_id=str(group_id),
                is_prompt=True,
                seq_data={group_id: seq_data},
                sampling_params=sampling_params,
                block_tables=None,
                lora_request=dummy_lora_requests_per_seq[group_id]
                if dummy_lora_requests_per_seq else None,
                multi_modal_data=dummy_multi_modal_data,
            )
            seqs.append(seq)

        # Build the kv caches; inference hasn't started yet, so initialize them to None
        num_layers = self.model_config.get_num_layers(self.parallel_config)
        kv_caches = [None] * num_layers

        # Run the model
        self.execute_model(seqs, kv_caches)
        # Synchronize CUDA
        torch.cuda.synchronize()
        return

    @torch.inference_mode()
    def execute_model(
        self,
        seq_group_metadata_list: Optional[List[SequenceGroupMetadata]],
        kv_caches: List[torch.Tensor],
    ) -> Optional[SamplerOutput]:
        # Prepare the input tensors
        (input_tokens, input_positions, attn_metadata, sampling_metadata,
         lora_requests, lora_mapping, multi_modal_kwargs
         ) = self.prepare_input_tensors(seq_group_metadata_list)

        # lora: omitted

        # CUDA graphs are used only in the decode phase (they improve efficiency)
        prefill_meta = attn_metadata.prefill_metadata  # its exact contents don't matter for now
        decode_meta = attn_metadata.decode_metadata
        if prefill_meta is None and decode_meta.use_cuda_graph:
            graph_batch_size = input_tokens.shape[0]
            model_executable = self.graph_runners[graph_batch_size]
        else:
            model_executable = self.model

        # Actual model execution; models are defined under vllm/model_executor/models/, here qwen2.py
        hidden_states = model_executable(
            input_ids=input_tokens,
            positions=input_positions,
            kv_caches=kv_caches,
            attn_metadata=attn_metadata,
            **multi_modal_kwargs,
        )

        # Compute the logits.
        logits = self.model.compute_logits(hidden_states, sampling_metadata)

        # Only perform sampling in the driver worker.
        if not self.is_driver_worker:
            return None

        # Sample the next token.
        output = self.model.sample(
            logits=logits,
            sampling_metadata=sampling_metadata,
        )

        return output
  2. self.capture_model() uses CUDA Graphs (decode phase only) to capture the model's execution so that the captured graph can be replayed in later inference for better performance; brief comments are given in the code.
class ModelRunner:
    # ...

    @torch.inference_mode()
    def capture_model(self, kv_caches: List[torch.Tensor]) -> None:
        """Cuda graph capture a model. ...
        CUDA Graph主要用于解码阶段,因为对于较大的批量大小,
        CUDA Graph的性能提升不明显,并且由于CUDA Graph需要固定大小的张量,
        支持大或可变批量大小需要较高的GPU内存开销
        """
        # Informational log messages worth a quick read
        assert not self.model_config.enforce_eager
        logger.info("Capturing the model for CUDA graphs. This may lead to "
                    "unexpected consequences if the model is not static. To "
                    "run the model in eager mode, set 'enforce_eager=True' or "
                    "use '--enforce-eager' in the CLI.")
        logger.info("CUDA graphs can take additional 1~3 GiB memory per GPU. "
                    "If you are running out of memory, consider decreasing "
                    "`gpu_memory_utilization` or enforcing eager mode. "
                    "You can also reduce the `max_num_seqs` as needed "
                    "to decrease memory usage.")
        start_time = time.perf_counter()

        # dummy inputs
        max_batch_size = max(_BATCH_SIZES_TO_CAPTURE)
        input_tokens = torch.zeros(max_batch_size, dtype=torch.long).cuda()
        input_positions = torch.zeros(max_batch_size, dtype=torch.long).cuda()
        slot_mapping = torch.empty(max_batch_size, dtype=torch.long).cuda()
        slot_mapping.fill_(_PAD_SLOT_ID)
        seq_lens = torch.ones(max_batch_size, dtype=torch.int32).cuda()
        block_tables = torch.from_numpy(self.graph_block_tables).cuda()

        # Buffer for the output hidden states; it will be filled in
        # after the first graph capture.
        hidden_states: Optional[torch.Tensor] = None

        # The batch sizes to capture
        graph_batch_size = _get_graph_batch_size(
            self.scheduler_config.max_num_seqs)
        batch_size_capture_list = [
            bs for bs in _BATCH_SIZES_TO_CAPTURE if bs <= graph_batch_size
        ]

        # Capture the CUDA graphs; graph_capture() is a context manager (sets up some parallel state)
        with graph_capture() as graph_capture_context:
            # NOTE: Capturing the largest batch size first may help reduce the
            # memory usage of CUDA graph.
            for batch_size in reversed(batch_size_capture_list):
                # Create dummy attn_metadata.
                attn_metadata = self.attn_backend.make_metadata(
                    num_prefills=0,
                    num_prefill_tokens=0,
                    num_decode_tokens=batch_size,
                    slot_mapping=slot_mapping[:batch_size],
                    seq_lens=None,
                    seq_lens_tensor=seq_lens[:batch_size],
                    max_query_len=None,
                    max_prefill_seq_len=0,
                    max_decode_seq_len=self.max_seq_len_to_capture,
                    query_start_loc=None,
                    seq_start_loc=None,
                    context_lens_tensor=None,
                    block_tables=block_tables[:batch_size],
                    use_cuda_graph=True,
                )

                if self.lora_config:
                    lora_mapping = LoRAMapping(
                        [0] * batch_size,
                        [0] * batch_size,
                    )
                    self.set_active_loras(set(), lora_mapping)

                # Create a CUDAGraphRunner and capture the model's execution with its capture method
                graph_runner = CUDAGraphRunner(self.model)
                hidden_states = graph_runner.capture(
                    input_tokens[:batch_size],
                    input_positions[:batch_size],
                    hidden_states[:batch_size]
                    if hidden_states is not None else None,
                    kv_caches,
                    attn_metadata,
                    memory_pool=self.graph_memory_pool,
                    stream=graph_capture_context.stream,
                )
                self.graph_memory_pool = graph_runner.graph.pool()
                # store the graph_runner
                self.graph_runners[batch_size] = graph_runner

        end_time = time.perf_counter()
        elapsed_time = end_time - start_time
        # This usually takes < 10 seconds.
        logger.info("Graph capturing finished in %.0f secs.", elapsed_time)

7. CacheEngine

  1. self._allocate_kv_cache(): allocates (initializes) the kv_cache; its purpose is to claim the GPU and CPU memory up front.
  2. self.get_cache_block_size(): computes the number of bytes per block, i.e. the cache_block_size seen in Worker above. 1) Each block holds the kv_caches of block_size tokens; 2) a single token's k consists of num_heads * head_size * num_layers elements, and likewise for v; 3) from 1) and 2) we get the number of elements in a block, and the data type then gives the number of bytes the block occupies (see the comments; a worked example follows the code below).
  3. self.swap_in(), self.swap_out(), self.copy(): these are not used during initialization, but briefly: when handling many user requests, resources need to be reallocated. If some data was previously cached on the CPU and GPU memory has since freed up, self.swap_in() moves it to the GPU for computation; conversely, when GPU memory is full, part of the GPU cache may be moved to the CPU to wait for a chance to come back, using self.swap_out().
# vllm/worker/cache_engine.py
class CacheEngine:
    """Manages the KV cache.

    This class is responsible for initializing and managing the GPU and CPU KV
    caches. It also provides methods for performing KV cache operations, such
    as swapping and copying.
    """

    def __init__(
        self,
        cache_config: CacheConfig,
        model_config: ModelConfig,
        parallel_config: ParallelConfig,
    ) -> None:
        self.cache_config = cache_config   # the configs passed in
        self.model_config = model_config
        self.parallel_config = parallel_config

        self.head_size = model_config.get_head_size()   # dimension of each attention head
        self.num_layers = model_config.get_num_layers(parallel_config)  # number of layers per pipeline-parallel rank
        self.num_kv_heads = model_config.get_num_kv_heads(parallel_config) # number of kv heads per tensor-parallel rank

        self.block_size = cache_config.block_size
        self.num_gpu_blocks = cache_config.num_gpu_blocks
        self.num_cpu_blocks = cache_config.num_cpu_blocks

        if cache_config.cache_dtype == "auto":
            self.dtype = model_config.dtype
        else:
            self.dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]

        # Get attention backend.
        self.attn_backend = get_attn_backend(
            model_config.get_num_attention_heads(parallel_config),
            self.head_size,
            self.num_kv_heads,
            model_config.get_sliding_window(),
            model_config.dtype,
            cache_config.cache_dtype,
            self.block_size,
        )

        # Initialize the cache.
        self.gpu_cache = self._allocate_kv_cache(self.num_gpu_blocks, "cuda")
        self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")

    def _allocate_kv_cache(
        self,
        num_blocks: int,
        device: str,
    ) -> List[torch.Tensor]:
        """Allocates KV cache on the specified device."""
        kv_cache_shape = self.attn_backend.get_kv_cache_shape(
            num_blocks, self.block_size, self.num_kv_heads, self.head_size)  # compute the kv_cache shape
        pin_memory = is_pin_memory_available() if device == "cpu" else False
        kv_cache: List[torch.Tensor] = []
        for _ in range(self.num_layers):   # append each layer's kv cache
            # null block in CpuGpuBlockAllocator requires at least that
            # block to be zeroed-out.
            # We zero-out everything for simplicity.
            kv_cache.append(
                torch.zeros(kv_cache_shape,
                            dtype=self.dtype,
                            pin_memory=pin_memory,
                            device=device))
        return kv_cache

    def swap_in(self, src_to_dst: torch.Tensor) -> None:
        for i in range(self.num_layers):
            self.attn_backend.swap_blocks(self.cpu_cache[i], self.gpu_cache[i],
                                          src_to_dst)

    def swap_out(self, src_to_dst: torch.Tensor) -> None:
        for i in range(self.num_layers):
            self.attn_backend.swap_blocks(self.gpu_cache[i], self.cpu_cache[i],
                                          src_to_dst)

    def copy(self, src_to_dsts: torch.Tensor) -> None:
        self.attn_backend.copy_blocks(self.gpu_cache, src_to_dsts)

    @staticmethod
    def get_cache_block_size(
        cache_config: CacheConfig,
        model_config: ModelConfig,
        parallel_config: ParallelConfig,
    ) -> int:
        head_size = model_config.get_head_size()
        num_heads = model_config.get_num_kv_heads(parallel_config)
        num_layers = model_config.get_num_layers(parallel_config)

        # block_size defaults to 16, i.e. one block holds the kv_caches of 16 tokens
        # the byte size must account for both k and v, each with num_heads * head_size * num_layers elements
        key_cache_block = cache_config.block_size * num_heads * head_size
        value_cache_block = key_cache_block
        total = num_layers * (key_cache_block + value_cache_block)
        if cache_config.cache_dtype == "auto":
            dtype = model_config.dtype
        else:
            dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]
        dtype_size = get_dtype_size(dtype)   # dtype_size is 2 for bf16
        return dtype_size * total
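To make the block-size arithmetic concrete, here is a small worked example using a Qwen2-7B-like configuration with tensor parallelism of 1 (block_size 16, 4 KV heads, head_size 128, 28 layers, bf16); treat the numbers as illustrative rather than read from any real config:

# Illustrative values for a Qwen2-7B-like model; not read from any config file
block_size = 16        # tokens per block
num_kv_heads = 4       # kv heads per tensor-parallel rank
head_size = 128
num_layers = 28
dtype_size = 2         # bytes per element for bf16

key_cache_block = block_size * num_kv_heads * head_size              # 8_192 elements
value_cache_block = key_cache_block                                  # 8_192 elements
total_elements = num_layers * (key_cache_block + value_cache_block)  # 458_752 elements
cache_block_size = dtype_size * total_elements                       # 917_504 bytes ≈ 0.875 MiB

This is the cache_block_size value used in the num_gpu_blocks example in Section 5.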


Summary

This article covered the initialization of LLMEngine, touching on methods of GPUExecutor, Worker, ModelRunner, CacheEngine, and other classes; it should help explain what vLLM does before text generation starts (the initialization phase). The other major component of LLMEngine, the Scheduler, will be covered in later articles on the request-handling phase.
