vLLM Inference Serving Library

Latest recommended article published on 2024-10-10 10:30:20

云帆@

Category: AI. Tags: natural language processing

Article link: https://blog.csdn.net/weixin_40777649/article/details/138862748

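I. Definitions

Definition and purpose
Why does the cache store only K and V, and not Q?
Why do vLLM and HF inference results differ?
How vLLM works
How PagedAttention works
Model deployment demo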

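II. Implementation

Definition and purpose
Purpose: reduce GPU memory usage and increase throughput. During generation the KV cache grows with every token of every active request, so optimizing it to save GPU memory and raise inference throughput is a central problem for any LLM inference framework.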
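Why does the cache store only K and V, and not Q?
In the Transformer attention mechanism, Q (Query) carries the query for the token currently being processed. It is matched against the K (Key) of every earlier token and of the current token to decide how attention is distributed. Q belongs to the current token only, so it is different at every step: each new token in the sequence produces a new Q, with or without a cache.
K (Key) and V (Value), in contrast, belong to tokens that have already been processed, and every later token attends to them again. Caching K and V lets the model reuse that information instead of recomputing it at each step, which saves a large amount of computation. Caching Q would save nothing: decoding step t uses only the Q of token t, and the Q of earlier tokens is never read again. That is why only K and V are cached.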
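The argument above can be checked with a small sketch. It is a toy single-head attention in pure Python, not vLLM code: the weight matrices `Wq`, `Wk`, `Wv`, the helper names, and the random inputs are all made up for illustration. It decodes once with a K/V cache and once by recomputing K and V for the whole prefix at every step, and compares the outputs.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(dim)]

random.seed(0)
d = 4  # toy hidden size (hypothetical)

def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(5)]

# Decoding WITH a cache: K and V of each token are computed once and kept.
k_cache, v_cache, cached_outputs = [], [], []
for x in tokens:
    q = matvec(Wq, x)              # used for this step only, then discarded
    k_cache.append(matvec(Wk, x))
    v_cache.append(matvec(Wv, x))
    cached_outputs.append(attend(q, k_cache, v_cache))

# Decoding WITHOUT a cache: K and V are recomputed for the whole prefix.
def step_without_cache(t):
    prefix = tokens[: t + 1]
    keys = [matvec(Wk, x) for x in prefix]
    values = [matvec(Wv, x) for x in prefix]
    return attend(matvec(Wq, prefix[-1]), keys, values)

uncached_outputs = [step_without_cache(t) for t in range(len(tokens))]

max_diff = max(
    abs(a - b)
    for out_c, out_u in zip(cached_outputs, uncached_outputs)
    for a, b in zip(out_c, out_u)
)
print("largest difference between cached and uncached outputs:", max_diff)
```

Both paths give the same outputs, but the cached path computes each K and V exactly once, while no earlier Q is ever needed again.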
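Why do vLLM and HF inference results differ?
1. The QKV projection has a different shape. Replacing vLLM's fused qkv_proj[768,2304] with three separate layers q_proj[768,768], k_proj, and v_proj changes the generated result slightly. This suggests that the shape of the Linear() layer affects numerical precision, since fused and separate matrix multiplications can round differently, but the effect is negligible.
2. vLLM's attention uses PagedAttention, which is implemented differently from Hugging Face's OPTAttention().
3. The sampling method affects the result. After changing vLLM's penalty handling to match HF's, the two produced exactly the same output.
The biggest difference is the sampling strategy: in this comparison HF applied only a repetition penalty followed by argmax, whereas vLLM processed the logits according to the top_p, top_k, and temperature parameters.
Reference: https://zhuanlan.zhihu.com/p/658780653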
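To make the sampling difference concrete, here is a minimal pure-Python sketch. It is not the sampler of either library: the functions `greedy` and `sample_top_k_top_p` and the example logits are hypothetical, and the repetition penalty is left out. It contrasts argmax decoding with sampling after temperature, top-k, and top-p filtering.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Argmax decoding: always returns the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_top_k_top_p(logits, temperature, top_k, top_p, rng):
    """Temperature scaling, then top-k and top-p (nucleus) filtering, then sampling."""
    probs = softmax([x / temperature for x in logits])
    # Keep the top_k most probable token ids, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the remaining weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [2.0, 1.5, 0.3, -1.0]   # made-up scores for a 4-token vocabulary
rng = random.Random(0)

greedy_token = greedy(logits)
sampled_tokens = [sample_top_k_top_p(logits, 0.8, 3, 0.95, rng) for _ in range(200)]

print("greedy always picks token", greedy_token)
print("sampling picked tokens", sorted(set(sampled_tokens)))
```

Greedy decoding returns the same token on every call, while the sampled path can return any token that survives filtering, so two frameworks agree token for token only when their logit processing and decoding strategy match. In vLLM, passing `temperature=0` to `SamplingParams` selects greedy decoding, which is the usual starting point for such a comparison.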
How vLLM works
1. vLLM uses PagedAttention to manage the memory that holds the attention keys and values, which lowers GPU memory usage and speeds up inference.
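How PagedAttention works
1. Keys and values are stored in blocks that do not have to be contiguous in memory, so large amounts of GPU memory are no longer wasted. Method: a block table maps each sequence's logical blocks to physical blocks, which preserves the logical order of the sequence.
2. Physical blocks can be shared between sequences, so the computation for a common prompt can be shared.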
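The two points above can be sketched with a toy block manager. This is an illustration of the idea only, not vLLM's implementation: the class `BlockManager`, its methods, and the block size of 4 are all hypothetical. It tracks which physical block each logical block maps to, and uses reference counts with copy-on-write so that two sequences can share the blocks of a common prompt.

```python
class BlockManager:
    """Toy block table: maps each sequence's logical blocks to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # ids of unused physical blocks
        self.ref_count = {}                   # physical block id -> number of users
        self.block_tables = {}                # seq id -> list of physical block ids
        self.lengths = {}                     # seq id -> number of tokens stored

    def _allocate(self):
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def append_token(self, seq_id):
        """Reserve a slot for one more token of the given sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # No block yet, or the last block is full: take any free block.
            table.append(self._allocate())
        elif self.ref_count[table[-1]] > 1:
            # The last block is shared: copy-on-write into a private block.
            # (A real system would also copy the K/V data already in it.)
            self.ref_count[table[-1]] -= 1
            table[-1] = self._allocate()
        self.lengths[seq_id] = length + 1

    def fork(self, parent_id, child_id):
        """Start a new sequence that shares all blocks of the parent."""
        self.block_tables[child_id] = list(self.block_tables[parent_id])
        self.lengths[child_id] = self.lengths[parent_id]
        for block in self.block_tables[child_id]:
            self.ref_count[block] += 1

    def physical_slot(self, seq_id, position):
        """Translate a logical token position into (physical block, offset)."""
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size


manager = BlockManager(num_blocks=8, block_size=4)
for _ in range(6):                 # a 6-token prompt fills 1.5 blocks
    manager.append_token("a")
manager.fork("a", "b")             # "b" reuses the prompt's blocks
manager.append_token("b")          # "b" diverges: its last block is copied

table_a = manager.block_tables["a"]
table_b = manager.block_tables["b"]
print("block table a:", table_a)
print("block table b:", table_b)
print("token 5 of a lives at", manager.physical_slot("a", 5))
```

The first block of the prompt stays shared between both sequences, and only the partially filled block is duplicated when the second sequence starts writing its own tokens.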
Model deployment demo
1. Check the supported models
https://docs.vllm.ai/en/latest/models/supported_models.html
2. Offline inference

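import os

# Make only GPUs 6 and 7 visible. Set this before importing vllm,
# so that it takes effect before CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"

from vllm import LLM, SamplingParams

# Load the model from a local directory in Hugging Face format.
# Note: to shard the model across both visible GPUs,
# pass tensor_parallel_size=2 as well.
llm = LLM('/data-ai/model/llama2/llama2_hf/Llama-2-13b-chat-hf')

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Sample with temperature 0.8 and nucleus (top-p) sampling at 0.95.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() processes the prompts as a batch and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")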
    


Sample output (it varies between runs because sampling is random):
Prompt: 'Hello, my name is', Generated text: " Sherry and I'm a stay at home mom of three beautiful children."
Prompt: 'The president of the United States is', Generated text: ' one of the most powerful people in the world, and yet, many people do'
Prompt: 'The capital of France is', Generated text: ' Paris. This is a fact that is well known to most people, but there'
Prompt: 'The future of AI is', Generated text: ' likely to be shaped by a combination of technological advancements and soci'