Aligning vLLM generate inference with Huggingface generate inference (long samples)

Contents

1. Offline inference
1.1 Development environment
1.2 Inference alignment
1.2.1 Aligning the generation parameters (gen_kwargs)
1.2.2 Aligning the eos token for text generation
1.2.3 Aligning the model initialization dtype
1.2.4 Choosing an appropriate batch size for vllm generate inference
1.3 Inference code
1.4 Inference results
1.4.1 Result alignment
1.4.2 Inference efficiency
1.5 Sequence suppression penalty at inference time
1.5.1 vLLM generate
1.5.2 Huggingface generate
2. Online deployment
2.1 Deployment basics
2.2 OpenAI-compatible vLLM server deployment
2.2.1 OpenAI-compatible vLLM server argument reference
2.2.2 Deploying with a Python script that replicates python -m vllm.entrypoints.openai.api_server, with room for extra custom features
2.2.3 Modifying the vllm.entrypoints.openai.api_server source to support the inference-time sequence suppression penalty
2.2.4 Client requests


1. Offline inference

All of the conclusions in this post come from my own evaluation in the development environment below; treat them only as a reference.

1.1 Development environment

My data: prompts are generally longer than 9k tokens.

Linux: Ubuntu
GPU: a single A6000 (48 GB)
python=3.10
torch==2.3.0
transformers==4.41.2
vllm==0.5.0.post1
vllm-flash-attn==2.5.9
flash-attn==2.5.9.post1

Goal:
greedy-decoding results from the vllm generate function on batched prompts (generally longer than 9k tokens) should be exactly identical to greedy-decoding results from the Huggingface generate function

1.2 Inference alignment

To make vllm generate inference exactly match Huggingface generate inference, the following changes are the main ones required.

1.2.1 Aligning the generation parameters (gen_kwargs)
  • First, the vllm generation parameters must be aligned with the Huggingface ones. Some of the parameters are named differently in the two frameworks, so we need the mapping between the important ones:
vLLM (generate function) | Huggingface (generate function) | Notes
max_tokens | max_new_tokens | maximum number of new tokens allowed to be generated
top_p | top_p |
top_k (default -1, i.e. the full vocabulary) | top_k (default 50) |
temperature | temperature | temperature; Huggingface does not allow temperature=0, while in vllm temperature=0 means greedy decoding
repetition_penalty | repetition_penalty | repetition penalty
/ | do_sample | in huggingface, do_sample=True means probabilistic sampling and do_sample=False means greedy decoding; vllm has no such parameter, setting temperature=0 gives greedy decoding
  • Based on the table above, we first align the generation parameters of the two frameworks. To keep the experiments stable, both sides use greedy decoding. Note that even with greedy decoding, the value of top-p still affects the generated result, so both sides are configured as follows:
# Both sides use greedy decoding to keep inference consistent
gen_kwargs_vllm = {
    "max_tokens": 1150,
    "top_p": 0.9,
    "top_k": 50,
    "temperature": 0.0,
    "repetition_penalty": 1.0,
}
gen_kwargs_hug = {
    "max_new_tokens": 1150,
    "top_p": 0.9,
    "top_k": 50,
    "temperature": 0.35,   # Huggingface does not allow 0; this value is ignored when do_sample=False
    "repetition_penalty": 1.0,
    "do_sample": False
}
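If the Huggingface-style dict is treated as the source of truth, a small helper can derive the vLLM version from it. This is a minimal sketch based on the mapping table above; hf_to_vllm_gen_kwargs is my own helper name, not part of either library.
def hf_to_vllm_gen_kwargs(hf_kwargs: dict) -> dict:
    """Map Huggingface generate kwargs onto vLLM SamplingParams kwargs (sketch)."""
    vllm_kwargs = {
        "max_tokens": hf_kwargs["max_new_tokens"],               # max_new_tokens -> max_tokens
        "top_p": hf_kwargs.get("top_p", 1.0),
        "top_k": hf_kwargs.get("top_k", -1),                     # vLLM default -1 = full vocabulary
        "repetition_penalty": hf_kwargs.get("repetition_penalty", 1.0),
    }
    # vLLM has no do_sample flag: greedy decoding is expressed as temperature=0.0
    if hf_kwargs.get("do_sample", True):
        vllm_kwargs["temperature"] = hf_kwargs.get("temperature", 1.0)
    else:
        vllm_kwargs["temperature"] = 0.0
    return vllm_kwargs

# hf_to_vllm_gen_kwargs(gen_kwargs_hug) -> {'max_tokens': 1150, 'top_p': 0.9, 'top_k': 50, 'repetition_penalty': 1.0, 'temperature': 0.0}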
1.2.2 Aligning the eos token for text generation
  • Set the end-of-generation tokens for the Huggingface generate function. The llama model used here is llama3-instruct; adjust this for your own model.
# Generation hyperparameters
if model_type == 'llama':
    # set the end-of-generation token IDs
    gen_kwargs_hug['eos_token_id'] = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
    gen_kwargs_hug['pad_token_id'] = self.tokenizer.eos_token_id
elif model_type == 'qwen2':
    # set the end-of-generation token IDs
    gen_kwargs_hug['eos_token_id'] = [151645,151643]
    gen_kwargs_hug['pad_token_id'] = 151643
else:
    raise ValueError(f"Only support 'llama or qwen2' now, but got '{model_type}'")
  • Set the end-of-generation tokens for the vllm generate function. The llama model used here is llama3-instruct; adjust this for your own model.
# Generation hyperparameters
if self.model_type == 'llama':
    # set the end-of-generation token IDs  # gen_kwargs_vllm['pad_token_id'] = self.tokenizer.eos_token_id
    gen_kwargs_vllm['stop_token_ids'] = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
elif self.model_type == 'qwen2':
    # set the end-of-generation token IDs  # gen_kwargs_vllm['pad_token_id'] = 151643
    gen_kwargs_vllm['stop_token_ids'] = [151645,151643]
else:
    raise ValueError(f"Only support llama and qwen2, model_type {self.model_type} is not supported")
1.2.3 Aligning the model initialization dtype

The dtype passed to Huggingface from_pretrained must match the dtype used when initializing the vLLM model.

  •  Loading the model with Huggingface
# Huggingface
model_ = AutoModelForCausalLM.from_pretrained(
            model_name_or_path,
            trust_remote_code=True,
            low_cpu_mem_usage=True,     
            torch_dtype = torch.float16,   # load in float16; the vLLM model below must also be loaded in float16
).eval()
  • Loading the model with vLLM

Some other parameters (to be confirmed):

max_num_batched_tokens=4096,  # maximum number of batched tokens per iteration

max_num_seqs=256,  # maximum number of sequences per iteration

class VLLMInference():
    def __init__(self,
                 model_name_or_path:str,
                 model_type:str,
                 # dtype 模型加载的数据类型, 'auto' 表示自动, torch.float32 表示 fp32, torch.float16 表示 fp16
                 # 与 Huggingface from_pretrained dtype 尽量保持一致, 为了Huggingface generate结果对齐
                 dtype: str,  
                 seed: int=0, # VLLM 默认为0, 默认即可
                 trust_remote_code: bool = True, 
                 tensor_parallel_size: int = 1,   # GPU 张量并行的卡数
                 gpu_memory_utilization: float = 0.9, # GPU 内存占用率
                 max_seq_len_to_capture: int = 9800, # 提升效率的cuda, 可以选择样本中中位数适中的文本长度
                **kwargs
                ):
        
        self.SYSTEM_PROMPT = SYSTEM_PROMPT
        self.model_name_or_path = model_name_or_path
        self.model_name_suffix = os.path.split(model_name_or_path)[-1]
        assert model_type in ['llama', 'qwen2'], f"model_type must be in ['llama', 'qwen2'], but got {model_type}"
        self.model_type = model_type
        
        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                                    trust_remote_code=True,
                                                    use_fast=True
                                                    # use_fast=False if model.config.model_type == 'llama' else True
                                                )
        # VLLM 模型初始化
        # vllm class 详情见: https://docs.vllm.ai/en/latest/dev/offline_inference/llm.html
        self.model = LLM(model=model_name_or_path,
                         trust_remote_code=trust_remote_code,
                         tensor_parallel_size=tensor_parallel_size,
                         dtype=dtype,  # 与 Huggingface from_pretrained dtype 尽量保持一致, 为了Huggingface generate结果对齐
                         gpu_memory_utilization=gpu_memory_utilization,
                         max_seq_len_to_capture=max_seq_len_to_capture,
                         **kwargs
                        )
        self.model.set_tokenizer(tokenizer)
        self.tokenizer = self.model.get_tokenizer()
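To double-check that the two sides really ended up with the same dtype, here is a minimal sketch (the llm_engine.model_config path is an internal vLLM attribute; treat it as an assumption for vllm==0.5.0.post1):
# Huggingface side: dtype of the loaded weights
print(next(model_.parameters()).dtype)                  # expect torch.float16

# vLLM side: the dtype the engine was initialized with (internal attribute, may change between versions)
vllm_infer = VLLMInference(model_name_or_path, model_type='llama', dtype='float16')
print(vllm_infer.model.llm_engine.model_config.dtype)   # expect torch.float16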
1.2.4 Choosing an appropriate batch size for vllm generate inference
  1. My experiments have repeatedly shown that although the vllm generate function can accept all of the data in a single call, passing in too many prompts at once can make the vllm results differ from the Huggingface results.
  2. Most prompts in my dataset are around 9k tokens, i.e. fairly long samples. On a 48 GB A6000, with batch size <= 15, greedy decoding repeatedly produced exactly the same results as Huggingface generate greedy decoding.

1.3 Inference code

  • vLLM code. Note that if vllm runs inference with a model fine-tuned with PI (position interpolation) length extension, vllm automatically scales its max_seq_len according to the model's config.rope_scaling factor, so there is no need to worry about vllm truncating inputs that exceed the original model's context length. For example, my prompts are all around 9k tokens, beyond llama3's 8k context, but after PI length-extension fine-tuning config.rope_scaling=2, and once vllm loads the model max_seq_len=16384 rather than 8192.
class VLLMInference():
    def __init__(self,
                 model_name_or_path:str,
                 model_type:str,
                 # dtype 模型加载的数据类型, 'auto' 表示自动, torch.float32 表示 fp32, torch.float16 表示 fp16
                 # 与 Huggingface from_pretrained dtype 尽量保持一致, 为了Huggingface generate结果对齐
                 dtype: str,  
                 seed: int=0, # VLLM 默认为0, 默认即可
                 trust_remote_code: bool = True, 
                 tensor_parallel_size: int = 1,   # GPU 张量并行的卡数
                 gpu_memory_utilization: float = 0.9, # GPU 内存占用率
                 max_seq_len_to_capture: int = 9800, # 提升效率的cuda, 可以选择样本中中位数适中的文本长度
                **kwargs
                ):
        
        self.SYSTEM_PROMPT = SYSTEM_PROMPT
        self.model_name_or_path = model_name_or_path
        self.model_name_suffix = os.path.split(model_name_or_path)[-1]
        assert model_type in ['llama', 'qwen2'], f"model_type must be in ['llama', 'qwen2'], but got {model_type}"
        self.model_type = model_type
    
        # vllm class 详情见: https://docs.vllm.ai/en/latest/dev/offline_inference/llm.html
        # 需要注意的是,如果vllm使用微调PI扩展长度后的model推理,vllm 会自动根据model config.rope_scaling 的缩放值调整vllm 的max_seq_len 这个参数,
        # 所以不用担心vllm截断超过原始模型输入的长度,比如本人数据prompt长度均在9k,已经超出了llama3 8k的输入长度,
        # 但通过PI扩展长度微调后,config.rope_scaling=2,而vllm加载model后max_seq_len=16384 而不是 8192
        self.model = LLM(model=model_name_or_path,
                         trust_remote_code=trust_remote_code,
                         tensor_parallel_size=tensor_parallel_size,
                         dtype=dtype,
                         gpu_memory_utilization=gpu_memory_utilization,
                         max_seq_len_to_capture=max_seq_len_to_capture,
                         **kwargs
                        )
        
        # tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
        #                                             trust_remote_code=True,
        #                                             padding_side="left",   # 推理侧需要左pad
        #                                             use_fast=True
        #                                             # # llama不支持fast
        #                                             # use_fast=False if model.config.model_type == 'llama' else True
        #                                         )
        # self.model.set_tokenizer(tokenizer)
        self.tokenizer = self.model.get_tokenizer()
        logger.info(f"vllm tokenizer: \n{self.tokenizer}")
        

    def _generate(self, 
                 text:Union[str,List[str]],
                 gen_kwargs:dict,
                 batch_size:int=10,  # vllm 处理的 batch size, 如果不传入合适的batch size 会导致与Huggingface generate 不能对齐 (可能是传入大量的数据内存优化导致的)
                ):
        """
        解释一下 vLLM 在处理大量输入数据时的行为:
        输入处理:
            当你将所有数据作为一个列表传入 model.generate() 方法时,vLLM 确实会接收所有这些输入。但是,它不会简单地将整个输入列表的大小作为单一的 batch size 来处理。
        动态批处理:
            vLLM 使用动态批处理(dynamic batching)技术。这意味着它不会一次性处理所有输入,而是根据当前的计算资源和模型容量,动态地决定如何最efficiently地处理这些输入。
        内部批处理机制:
            vLLM 会内部管理一个请求队列。
            它会根据当前可用的 GPU 内存和计算资源,动态地决定每次处理多少输入。
            这个动态决定的批大小通常小于你提供的全部输入数量,特别是当输入数量很大时。
        最大批大小限制:
            虽然你可能输入了成千上万的提示,但 vLLM 有内部的最大批大小限制。
            这个限制是为了确保efficient的内存使用和计算效率。
            默认情况下,这个限制通常远小于你可能提供的全部数据量。
        连续处理:
            vLLM 会连续地处理输入队列,每次取出一定数量的输入进行处理。
            这个过程会持续进行,直到所有输入都被处理完毕。
        性能优化:
            这种方法允许 vLLM 在处理大量数据时保持高效率。
            它可以充分利用 GPU 资源,同时避免因为一次性加载过多数据而导致的内存问题。
        用户角度:
            从用户的角度来看,当你调用 model.generate(all_prompts, sampling_params) 时,vLLM 会处理所有输入,但内部是分批进行的。
            你会得到所有输入的输出,就好像它们是一次性处理的一样。
        控制批大小:
            如果你确实需要更精细地控制批处理大小,可以考虑使用 AsyncLLMEngine 或者在服务器模式下设置 max_batch_size 参数。
            但在大多数情况下,让 vLLM 自动管理批处理会得到更好的性能。
        """
        gen_kwargs = copy.deepcopy(gen_kwargs)
        # 组织生成文本的超参
        if self.model_type == 'llama':
            # 设置文本生成的结束 token ID,  # gen_kwargs['pad_token_id'] = self.tokenizer.eos_token_id
            gen_kwargs['stop_token_ids'] = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
        elif self.model_type == 'qwen2':
            # 设置文本生成的结束 token ID, # gen_kwargs['pad_token_id'] = 151643
            gen_kwargs['stop_token_ids'] = [151645,151643]
        else:
            raise ValueError(f"Only support llama and qwen2, model_type {self.model_type} is not supported")
        
        # SamplingParams 参数详情见: https://docs.vllm.ai/en/latest/dev/sampling_params.html#vllm.SamplingParams
        sampling_params = SamplingParams(**gen_kwargs)
        logger.warning(f"Now vllm running sampling_params \n{sampling_params}")

        # 组织文本
        if isinstance(text, str):
            text = [text]
        text = [i.strip() for i in text]

        text = [[
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": i}
            ] for i in text]
        
        text = [self.tokenizer.apply_chat_template(i, 
                                                    add_generation_prompt=True, 
                                                    tokenize=False   # False 表示返回 str,不转化为id
                                                ) for i in text]
        # 这里选择转化为id传给prompt_token_ids, 因为vllm默认的tokenizer encode 可能会加入前缀特殊token
        text = [self.tokenizer.encode(i, add_special_tokens=False) for i in text]
        logger.info(f"vllm one input:\n{self.tokenizer.decode(text[0])}")
        batch_num = math.ceil(len(text)/batch_size)
        logger.info(f"VLLM batch size: {batch_size}, length of datas num: {len(text)}, batch_num: {batch_num}")

        # 每次处理一个batch size
        outputs = []
        for b_idx,i in enumerate(range(0,len(text),batch_size)):
            s = i
            e = i + batch_size
            if e >= len(text):
                e = len(text)
            batch_ids = text[s: e]
            if (b_idx + 1) % 10 == 0:
                logger.info(f"batch id/batch num: {b_idx+1}/{batch_num}")
            batch_outputs = self.model.generate(prompt_token_ids = batch_ids, 
                                                sampling_params=sampling_params) 
            outputs = outputs + batch_outputs
        return outputs
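For reference, a minimal usage sketch of the class above (the model path and prompts are placeholders; batch_size=15 follows the observation in 1.2.4):
if __name__ == "__main__":
    infer = VLLMInference(
        model_name_or_path="/path/to/llama3-instruct",   # placeholder path
        model_type="llama",
        dtype="float16",                                 # keep consistent with the Huggingface side
    )
    gen_kwargs_vllm = {"max_tokens": 1150, "top_p": 0.9, "top_k": 50,
                       "temperature": 0.0, "repetition_penalty": 1.0}
    prompts = ["..."]                                    # your long prompts (9k+ tokens)
    vllm_outputs = infer._generate(prompts, gen_kwargs_vllm, batch_size=15)
    print(vllm_outputs[0].outputs[0].text)               # vLLM returns RequestOutput objects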
  • Huggingface code (ModelUtils.load_model is a custom helper class that simply loads the model in float16)
class HuggingFaceInference():
    def __init__(self, 
            model_name_or_path:str, # base model 模型的路径或名称
            model_max_length:int=16384, # 模型允许的最大长度, 默认为 16384(即16k), 原模型长度不够的使用 PI 插值
            adapter_name_or_path:str=None,     # lora 适配器模型的路径或名称, None 表示不使用
            use_cache:bool=True,    # 推理时是否使用模型的缓存
            load_in_4bit:bool=False, # 是否使用 bnb 4bit进行推理,能够节省很多显存,但效果可能会有一定的下降
        ):
        self.model_name_or_path = model_name_or_path
        self.adapter_name_or_path = adapter_name_or_path
        self.model_max_length = model_max_length
        # 定义提示模版
        self.SYSTEM_PROMPT = SYSTEM_PROMPT
        logger.info(f"model_name_or_path: {model_name_or_path}")
        logger.info(f"adapter_name_or_path: {adapter_name_or_path}")
        logger.info(f"SYSTEM_PROMPT: {self.SYSTEM_PROMPT}")

        # 定义base model 和 lora 适配器模型的后缀名称
        self.model_name_suffix = os.path.split(model_name_or_path)[-1]
        if self.adapter_name_or_path is not None:
            self.adapter_name_suffix = os.path.split(adapter_name_or_path)[-1]
        else:
            self.adapter_name_suffix = 'no_adapter'
        
        logger.info(f"Loading model: {model_name_or_path} and adapter: {adapter_name_or_path}")
        self.config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
        self.config.use_cache = use_cache  
        self.model_type = self.config.model_type
        # NOTE 从预训练模型配置中获取原始上下文长度, 
        # 如果设置的上下文窗口大小超过了原始长度,则需要计算 RoPE 缩放因子
        if self.config.model_type == 'llama':
            # 使用PI插值, 是否扩展RoPE的position
            orig_ctx_len = getattr(self.config, "max_position_embeddings", None)
            if orig_ctx_len and model_max_length > orig_ctx_len:
                scaling_factor = float(math.ceil(model_max_length / orig_ctx_len))
                self.config.rope_scaling = {"type": "linear", "factor": scaling_factor}
                logger.success(f'Use PI/NTK change {self.config.model_type} model_max_length from {orig_ctx_len} to {model_max_length}')
            else:
                logger.warning(f'Do not use PI/NTK, because {self.config.model_type} model_max_length {model_max_length} <= max_position_embeddings {orig_ctx_len}')
        elif self.config.model_type == 'qwen2':
            orig_ctx_len = getattr(self.config, "max_position_embeddings", None)
            logger.warning(f"{self.config.model_type} 'max_position_embeddings' is {orig_ctx_len}, {self.config.model_type} not support PI/NTK now. Default use '--model_max_length {model_max_length}'")
        else:
            raise ValueError(f"Only support 'llama or qwen2' now, but got '{self.config.model_type}'")

        # 加载模型
        self.model = ModelUtils.load_model(
                                    model_name_or_path,
                                    config=self.config,
                                    load_in_4bit=load_in_4bit,  # 是否 4bit 量化加载
                                    adapter_name_or_path=adapter_name_or_path,  # adapter_name_or_path 为 None 表示不使用lora适配器
                                    device_map='auto',
                                    ).eval()   # 开始 eval 模式

        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                                    trust_remote_code=True,
                                                    padding_side="left",   # 推理侧需要左pad
                                                    use_fast=True
                                                    # # llama不支持fast
                                                    # use_fast=False if model.config.model_type == 'llama' else True
                                                )
        logger.info(f"config:\n{self.config}")
        logger.info(f"model structure:\n{self.model}")
        logger.info(f"tokenizer:\n{self.tokenizer}")
        
    def _generate(self, 
                    texts:Union[List[str], str],   # 推理文本
                    model_type:str, # llama or qwen2
                    gen_kwargs:dict = {}, # 生成文本的超参
                ):
        """
        Currently texts are processed one at a time; concurrent batched inference is not supported yet.
        """
        if isinstance(texts, str):
            texts = [texts]
        texts = [text.strip() for text in texts]

        # 生成超参配置
        if gen_kwargs == {}:
            # gen_kwargs = {
            #     'max_new_tokens': 900, # 生成文本的最大新 token 数
            #     'top_p': 0.9,  # 使用 top-p 采样策略,只考虑概率最高的一部分 token
            #     'temperature': 0.35,  # 控制生成文本的随机性,温度越高,随机性越大
            #     'repetition_penalty': 1.0,  # 重复惩罚系数,防止生成过于重复的文本
            #     'do_sample': True  # 是否进行采样,即根据概率分布随机生成 token
            # }
            logger.warning(f"gen_kwargs is empty, use default gen_kwargs")
            gen_kwargs = {
                'max_new_tokens': 900, 
                'top_p': 0.9,
                # "top_k": 20,
                'temperature': 0.35,  
                'repetition_penalty': 1.0,  
                'do_sample': True  
            }
        # TODO++: 深拷贝 gen_kwargs,防止影响到gen_kwargs_grid的原始参数
        gen_kwargs = copy.deepcopy(gen_kwargs)

        # 组织生成文本的超参
        if model_type == 'llama':
            # 设置文本生成的结束 token ID
            gen_kwargs['eos_token_id'] = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
            gen_kwargs['pad_token_id'] = self.tokenizer.eos_token_id
        elif model_type == 'qwen2':
            # 设置文本生成的结束 token ID
            gen_kwargs['eos_token_id'] = [151645,151643]
            gen_kwargs['pad_token_id'] = 151643
        else:
            raise ValueError(f"Only support 'llama or qwen2' now, but got '{model_type}'")
        logger.info(f"generate gen_kwargs: \n{gen_kwargs}")
        logger.info(f"generate model_type: {model_type}")

        outputs = []  # 保存推理输出
        # 模型推理(逐行推理)
        for idx,text in enumerate(tqdm.tqdm(texts)):
            messages = [
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": text}
            ]
            text_prompt = self.tokenizer.apply_chat_template(messages, 
                                                             add_generation_prompt=True, 
                                                             tokenize=False   # False 表示返回 str,不转化为id
                                                        )
            # 转化为id
            text_input = self.tokenizer(text_prompt,
                               add_special_tokens=False,
                               return_tensors='pt').to(self.model.device)
            if idx <= 0:
                # 打印一条提示文本
                logger.info(f"text_input: \n{text_input}")  
                logger.info(f"text_promt:\n{self.tokenizer.decode(text_input['input_ids'][0],skip_special_tokens=False)}")
            # 生成最终的 .generate() 函数输入
            input_id = text_input['input_ids']
            gen_kwargs["input_ids"] = input_id  # input_ids 放进去
            gen_kwargs["attention_mask"] = text_input['attention_mask']  # attention_mask 放进去

            with torch.no_grad():
                s = time.time()
                # 推理
                output_id = self.model.generate(**gen_kwargs)
                e = time.time()
                output = self.tokenizer.decode(output_id[0][len(input_id[0]): ], skip_special_tokens=True)
                print(f"use_time: {e-s}s\n{output}")
                logger.success(f"model: {self.model_name_suffix} adapter: {self.adapter_name_suffix} success inference line_num: {idx+1} paper, generate gen_kwargs: \n{gen_kwargs}")
                # 逐行写入缓存数据, 仅用于debug
                with open(f"./{self.model_name_suffix}_{self.adapter_name_suffix}_response.tmp", 'a', encoding='utf-8') as jsonl_file:
                    jsonl_file.write(str(output) + '\n\n\n')
                outputs.append(output)
        return outputs
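And a matching usage sketch for the Huggingface class (same placeholders as above):
if __name__ == "__main__":
    hf_infer = HuggingFaceInference(
        model_name_or_path="/path/to/llama3-instruct",   # placeholder path
        model_max_length=16384,
        adapter_name_or_path=None,
    )
    gen_kwargs_hug = {"max_new_tokens": 1150, "top_p": 0.9, "top_k": 50,
                      "temperature": 0.35, "repetition_penalty": 1.0, "do_sample": False}
    hf_outputs = hf_infer._generate(["..."], model_type="llama", gen_kwargs=gen_kwargs_hug)
    print(hf_outputs[0])                                 # plain strings, one per input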

1.4 Inference results

1.4.1 Result alignment

Below is a comparison of the two inference outputs; they are indeed strictly aligned, item by item.

  • Huggingface output
[How to evaluate the idea of the paper]
<Innovative approach to improving GANs> The paper proposes a novel approach to improving GANs by incorporating image quality assessment techniques into the energy function of the discriminator, which is a unique and innovative idea in the field of generative adversarial networks.
<Use of multi-component energy function> The use of a multi-component energy function based on image quality assessment techniques, such as l1 score, gradient magnitude similarity score, and chrominance score, is a novel and interesting approach to enhancing the performance of GANs.

[Compared to previous similar works, what are the essential differences, such as any fundamental differences, improvements, innovations]
<Novelty in energy function> The paper introduces a novel energy function for the discriminator, which is a departure from the traditional mean squared error (MSE) used in previous GANs. This new energy function is based on image quality assessment techniques, providing a fundamental difference from previous works.
<Experimental evaluation> The paper provides a comprehensive experimental evaluation of the proposed approach, including both quantitative and qualitative assessments, which sets it apart from previous works in the field.

[How to evaluate the experimental results in the paper]
<Thorough experimental evaluation> The paper presents a thorough experimental evaluation of the proposed approach, including both quantitative and qualitative assessments, which provides a comprehensive understanding of the performance of the method.
<Need for additional experiments> While the experimental evaluation is thorough, there are suggestions for additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, which could further strengthen the evaluation.

[Potential reasons for acceptance]
<Innovative approach> The paper presents an innovative approach to improving GANs by incorporating image quality assessment techniques into the energy function of the discriminator, which contributes to the advancement of the field.
<Thorough experimental evaluation> The paper provides a comprehensive experimental evaluation of the proposed approach, demonstrating the thoroughness and rigor of the research.

[Potential reasons for rejection]
<Clarity and presentation> The paper lacks clarity in presenting the proposed approach and experimental results, which may hinder the understanding and evaluation of the method.
<Need for additional experiments> There are suggestions for additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, which could strengthen the paper.

[Other suggestions for further improving the quality of the paper]
<Clarity and presentation> Improving the clarity and presentation of the proposed approach and experimental results could enhance the overall quality of the paper and facilitate better understanding and evaluation of the method.
<Additional experiments> Conducting additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, could further strengthen the paper and provide a more comprehensive evaluation of the proposed method.

[Other important review comments]
<No related terms> -
  • vLLM output
[How to evaluate the idea of the paper]
<Innovative approach to improving GANs> The paper proposes a novel approach to improving GANs by incorporating image quality assessment techniques into the energy function of the discriminator, which is a unique and innovative idea in the field of generative adversarial networks.
<Use of multi-component energy function> The use of a multi-component energy function based on image quality assessment techniques, such as l1 score, gradient magnitude similarity score, and chrominance score, is a novel and interesting approach to enhancing the performance of GANs.

[Compared to previous similar works, what are the essential differences, such as any fundamental differences, improvements, innovations]
<Novelty in energy function> The paper introduces a novel energy function for the discriminator, which is a departure from the traditional mean squared error (MSE) used in previous GANs. This new energy function is based on image quality assessment techniques, providing a fundamental difference from previous works.
<Experimental evaluation> The paper provides a comprehensive experimental evaluation of the proposed approach, including both quantitative and qualitative assessments, which sets it apart from previous works in the field.

[How to evaluate the experimental results in the paper]
<Thorough experimental evaluation> The paper presents a thorough experimental evaluation of the proposed approach, including both quantitative and qualitative assessments, which provides a comprehensive understanding of the performance of the method.
<Need for additional experiments> While the experimental evaluation is thorough, there are suggestions for additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, which could further strengthen the evaluation.

[Potential reasons for acceptance]
<Innovative approach> The paper presents an innovative approach to improving GANs by incorporating image quality assessment techniques into the energy function of the discriminator, which contributes to the advancement of the field.
<Thorough experimental evaluation> The paper provides a comprehensive experimental evaluation of the proposed approach, demonstrating the thoroughness and rigor of the research.

[Potential reasons for rejection]
<Clarity and presentation> The paper lacks clarity in presenting the proposed approach and experimental results, which may hinder the understanding and evaluation of the method.
<Need for additional experiments> There are suggestions for additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, which could strengthen the paper.

[Other suggestions for further improving the quality of the paper]
<Clarity and presentation> Improving the clarity and presentation of the proposed approach and experimental results could enhance the overall quality of the paper and facilitate better understanding and evaluation of the method.
<Additional experiments> Conducting additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, could further strengthen the paper and provide a more comprehensive evaluation of the proposed method.

[Other important review comments]
<No related terms> -
1.4.2 Inference efficiency

On a single 48 GB A6000, Huggingface generate takes about 30 s per sample on average, while vLLM takes about 3.6 s per sample, a speedup of close to 8x, which is lower than the speedups reported officially by vLLM.

1.5 Sequence suppression penalty at inference time

Sometimes I do not want the model to generate certain sequences, which requires penalizing the logits of those sequences at inference time. The Huggingface and vLLM implementations are as follows.

"""
适用性:
对于Hugging Face,你可以将这个处理器添加到 GenerationConfig或generate函数 的 logits_processor 列表中。
对于VLLM,如示例所示,你可以将它添加到 SamplingParams 的 logits_processors 列表中。

工作原理:
处理器在每次生成新token时被调用,接收当前的 input_ids 和 scores。
它检查当前序列是否与目标序列(完全或部分)匹配。
如果匹配,它会降低相应token的生成概率。

参数作用:
tokenizer: 用于将目标序列转换为token ID。如果传入的是字符串(分词器名称),则自动加载相应的分词器。
target_sequences: 需要抑制的序列列表。
penalty_factor: 控制抑制的强度。取值范围为0.0到1.0,对于乘法模式,应小于1(如0.5);对于减法模式,应为正数(如5.0)。
use_multiplicative: 决定使用乘法还是减法来应用惩罚。

有效抑制序列生成:
完全匹配:如果检测到完整的目标序列,目前不进行惩罚(可以根据需求修改)。
部分匹配:如果检测到部分匹配,只降低下一个预期token的概率,允许序列在其他方向发展。

灵活性:
可以轻松添加或修改目标序列。
通过调整 penalty_factor 和 use_multiplicative,可以控制抑制的强度和方式。

通过这种方法,你可以有效地降低特定序列的生成概率,而不会完全禁止它们,从而在保持模型灵活性的同时实现一定程度的内容控制。
"""
1.5.1 vLLM generate
import torch
from vllm import LLM, SamplingParams
from typing import Union, List, Tuple, Dict, Any, Callable, Optional, Iterable
from transformers.generation.logits_process import LogitsProcessor, LogitsProcessorList
from transformers import PreTrainedTokenizer,PreTrainedTokenizerBase,AutoTokenizer
from loguru import logger

class VllmPenaltySequenceLogitsProcessor(LogitsProcessor):
    def __init__(self, 
                 tokenizer: Union[PreTrainedTokenizer,str], 
                 target_sequences:List[str] = [], 
                 penalty_factor:Union[float, int]=0.5,  # 0.0 - 1.0 之间, 1.0 表示不惩罚, use_multiplicative 为False时, penalty_factor 输入整数
                 use_multiplicative:bool=True
                ):
        """
        vllm 初始化抑制某些目标序列的处理器。

        参数:
        - tokenizer: 分词器对象,用于将目标序列转换为token ID
        - target_sequences: 需要抑制的目标序列列表
        - penalty_factor: 惩罚因子,1. 用于调整目标序列的生成概率float  2. 用于控制抑制的强度int
        - use_multiplicative: 是否使用乘法方式应用惩罚(True为乘法,False为减法)
        """
        self.tokenizer = tokenizer
        if isinstance(tokenizer, str): # 传入的是分词器名字,如bert-base-uncased,需要加载分词器
            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer, trust_remote_code=True)
        
        self.target_sequences = None
        self.max_seq_length = 0
        if target_sequences:
            # 将目标序列转换为token ID列表
            self.target_sequences = [self.tokenizer.encode(seq, add_special_tokens=False) for seq in target_sequences]
            # 计算最长目标序列的长度,用于确定需要检查的上下文长度
            self.max_seq_length = max(len(seq) for seq in self.target_sequences)

        self.penalty_factor = penalty_factor
        self.use_multiplicative = use_multiplicative
        logger.success(f"'VllmPenaltySequenceLogitsProcessor' init success! Will penalty target sequences: '{target_sequences}', penalty factor: '{penalty_factor}', use_multiplicative: '{use_multiplicative}'.")

    def __call__(self, input_ids, scores):
        """
        处理输入序列和分数。
        
        参数:
        - input_ids: 形状为已生成序列的 [sequence_length] 的1D列表, 最开始没有生成新的token时是空[]
        - scores: 形状为 [vocab_size] 的浮点1D张量
        
        返回:
        - 调整后的scores,形状与输入的scores相同
        """
        if self.target_sequences is None:
            return scores

        # VLLM 要求输入input_ids为list, scores为1D torch.Tensor
        assert isinstance(input_ids, list) and isinstance(scores, torch.Tensor) and scores.ndim == 1, f"'input_ids' and 'scores' must be list and torch.Tensor with 1D shape, but got {type(input_ids)} and {type(scores)} with {scores.ndim}D shape."

        # 获取当前上下文,长度最大不超过 max_seq_length
        context = input_ids[-self.max_seq_length:] if input_ids else []
        context_length = len(context)
        
        # 遍历每个目标序列进行检查
        for target_sequence in self.target_sequences:
            seq_length = len(target_sequence)
            # 序列=1时, 直接惩罚下一个token
            if seq_length == 1:
                self._adjust_scores(scores, target_sequence)
            else:
                # 序列>=2时检查从完全匹配到部分匹配的可能性
                for i in range(1, min(seq_length, context_length) + 1):
                    # 如果上下文的尾部与目标序列的头部匹配
                    if context[-i:] == target_sequence[:i]:
                        # 如果匹配长度等于目标序列长度,说明找到完全匹配
                        if i == seq_length:
                            # 目前没有惩罚,直接跳过
                            pass
                        else:
                            # 如果是部分匹配,只降低下一个预期token的分数
                            next_token = target_sequence[i]
                            self._adjust_scores(scores, [next_token])
                        break  # 匹配成功后,跳出循环处理下一个序列
        return scores

    def _adjust_scores(self, scores, tokens):
        """
        调整指定token的分数。
        
        参数:
        - scores: 形状为 [vocab_size] 的一维张量
        - tokens: 需要调整的token ID列表
        """
        tokens = torch.tensor(tokens, device=scores.device) # scores.device 确保与scores相同的设备
        
        if self.use_multiplicative:
            scores[tokens] = scores[tokens] * self.penalty_factor
        else:
            scores[tokens] = scores[tokens] - self.penalty_factor
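A minimal usage sketch for the vLLM side (the tokenizer path and target strings are placeholders): the processor is passed to SamplingParams through logits_processors and then used with llm.generate as usual.
penalty_processor = VllmPenaltySequenceLogitsProcessor(
    tokenizer="/path/to/llama3-instruct",      # placeholder tokenizer path
    target_sequences=["No related terms"],     # example sequences to suppress
    penalty_factor=0.5,
    use_multiplicative=True,
)
sampling_params = SamplingParams(
    max_tokens=1150, temperature=0.0, top_p=0.9,
    logits_processors=[penalty_processor],     # plug the processor into vLLM
)
# outputs = llm.generate(prompt_token_ids=batch_ids, sampling_params=sampling_params)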
1.5.2 Huggingface generate
class HuggingFacePenaltySequenceLogitsProcessor(LogitsProcessor):
    def __init__(self, 
                 tokenizer: Union[PreTrainedTokenizer,str], 
                 target_sequences:List[str] = [], 
                 penalty_factor:Union[float, int]=0.5,  # 0.0 - 1.0 之间, 1.0 表示不惩罚, use_multiplicative 为False时, penalty_factor 输入整数
                 use_multiplicative:bool=True
                ):
        """
        Hugging Face 初始化抑制某些目标序列的处理器。

        参数:
        - tokenizer: 分词器对象,用于将目标序列转换为token ID
        - target_sequences: 需要抑制的目标序列列表
        - penalty_factor: 惩罚因子,1. 用于调整目标序列的生成概率float  2. 用于控制抑制的强度int
        - use_multiplicative: 是否使用乘法方式应用惩罚(True为乘法,False为减法)
        """
        self.tokenizer = tokenizer
        if isinstance(tokenizer, str): # 传入的是分词器名字,如bert-base-uncased,需要加载分词器
            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer, trust_remote_code=True)

        self.target_sequences = None
        self.max_seq_length = 0
        if target_sequences:
            # 将目标序列转换为token ID列表
            self.target_sequences = [self.tokenizer.encode(seq, add_special_tokens=False) for seq in target_sequences]
            # 计算最长目标序列的长度,用于确定需要检查的上下文长度
            self.max_seq_length = max(len(seq) for seq in self.target_sequences)

        self.penalty_factor = penalty_factor
        self.use_multiplicative = use_multiplicative
        logger.success(f"'HuggingFacePenaltySequenceLogitsProcessor' init success! Will penalty target sequences: '{target_sequences}', penalty factor: '{penalty_factor}', use_multiplicative: '{use_multiplicative}'.")

    def __call__(self, input_ids, scores):
        """
        处理输入序列和分数。
        
        参数:
        - input_ids: 形状为已生成序列的 [batch_size, sequence_length] 的2D tensor, 最开始没有生成新的token时该tensor是prompt的token,
                     一般来说, batch_size 为 1, batch_size 跟generate函数的input_ids一致。
        - scores: 形状为 [batch_size, vocab_size] 的浮点2D tensor
        
        返回:
        - 调整后的scores,形状与输入的scores相同
        """
        if self.target_sequences is None:
            return scores

        # Huggingface 生成的 input_ids 和 scores 都是 2D tensor
        assert input_ids.ndim == 2 and scores.ndim == 2, f"'input_ids' and 'scores' must be 2D tensors, but got {input_ids.ndim}D and {scores.ndim}D tensors."
        # 获取批次大小与序列长度
        B,T = input_ids.shape
        input_ids = input_ids.tolist() # 将tensor转换为list, 方便后续判断
        for b_idx in range(B):
            # 获取当前上下文,长度最大不超过 max_seq_length
            context = input_ids[b_idx][-self.max_seq_length:] if input_ids[b_idx] else []
            context_length = len(context)
            
            # 遍历每个目标序列进行检查
            for target_sequence in self.target_sequences:
                seq_length = len(target_sequence)
                # 序列=1时, 直接惩罚下一个token
                if seq_length == 1:
                    self._adjust_scores(scores, target_sequence, batch_idx = b_idx)
                else:
                    # 序列>=2时检查从完全匹配到部分匹配的可能性
                    for i in range(1, min(seq_length, context_length) + 1):
                        # 如果上下文的尾部与目标序列的头部匹配
                        if context[-i:] == target_sequence[:i]:
                            # 如果匹配长度等于目标序列长度,说明找到完全匹配
                            if i == seq_length:
                                # 目前没有惩罚,直接跳过
                                pass
                            else:
                                # 如果是部分匹配,只降低下一个预期token的分数
                                next_token = target_sequence[i]
                                self._adjust_scores(scores, [next_token], batch_idx = b_idx)
                            break  # 匹配成功后,跳出循环处理下一个序列
        return scores

    def _adjust_scores(self, scores, tokens, batch_idx):
        """
        调整指定token的分数。
        
        参数:
        - scores: 形状为 [batch_size, vocab_size] 的二维张量
        - tokens: 需要调整的token ID列表
        - batch_idx: 当前处理的batch索引
        """
        tokens = torch.tensor(tokens, device=scores.device) # scores.device 确保与scores相同的设备
        
        if self.use_multiplicative:
            scores[batch_idx, tokens] = scores[batch_idx, tokens] * self.penalty_factor
        else:
            scores[batch_idx, tokens] = scores[batch_idx, tokens] - self.penalty_factor
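And the corresponding Huggingface usage sketch (same placeholders): wrap the processor in a LogitsProcessorList and pass it to generate through the logits_processor argument.
from transformers.generation.logits_process import LogitsProcessorList

hf_penalty_processor = HuggingFacePenaltySequenceLogitsProcessor(
    tokenizer="/path/to/llama3-instruct",      # placeholder tokenizer path
    target_sequences=["No related terms"],
    penalty_factor=0.5,
    use_multiplicative=True,
)
output_id = model.generate(
    input_ids=text_input['input_ids'],
    attention_mask=text_input['attention_mask'],
    max_new_tokens=1150,
    do_sample=False,
    logits_processor=LogitsProcessorList([hf_penalty_processor]),
)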

2. Online deployment

Taking qwen2 as an example: see the official qwen2 vLLM online deployment guide.

2.1 Deployment basics

1. A few ways to test whether the remote server 70.49.214.80:41084 is reachable
1) telnet can be used to test whether a specific port accepts connections
telnet 70.49.214.80 41084
If the connection succeeds, you will see something like:
Trying 70.49.214.80...
Connected to 70.49.214.80.
Escape character is '^]'.
If the connection fails, you will see something like:
Trying 70.49.214.80...
telnet: Unable to connect to remote host: Connection refused

2) nc is a powerful network tool that can also be used to test connectivity.
nc -zv 70.49.214.80 41084
If the connection succeeds, you will see something like:
Connection to 70.49.214.80 41084 port [tcp/*] succeeded!
If the connection fails, you will see something like:
nc: connect to 70.49.214.80 port 41084 (tcp) failed: Connection refused

3) With curl: if the port serves HTTP/HTTPS, curl can be used to test it.
curl http://70.49.214.80:41084
If the service is available, it returns an HTTP response; otherwise an error message is returned.
----------------------------------------------------------------------------

2.2 OpenAI-compatible vLLM server deployment

2.2.1 OpenAI-compatible vLLM server argument reference

vLLM Engine Arguments: https://docs.vllm.ai/en/latest/models/engine_args.html#engine-args
usage:
python -m vllm.entrypoints.openai.api_server 
                                             [--host HOST]        #  设置为 0.0.0.0 以允许外部访问。
                                             [--port PORT]        #  端口地址, 如 4667
                                             [--chat-template]      # 用于生成对话模板的 Jinja2 模板, 未传入时使用tokenizer的默认模板。 例如: --chat-template ./examples/template_chatml.jinja
                                             [--api-key VLLM_API_KEY]             # 您可以传入参数 --api-key 或环境变量 VLLM_API_KEY 以使服务器在标头中检查 API 密钥。
                                             
                                             [--model MODEL]        # 要使用的 Hugging Face 模型的名称或路径
                                             [--tokenizer TOKENIZER]     # 要使用的 Hugging Face 分词器的名称或路径。如果未指定,将使用模型名称或路径。
                                             [--skip-tokenizer-init]
                                             [--revision REVISION]
                                             [--code-revision CODE_REVISION]
                                             [--tokenizer-revision TOKENIZER_REVISION]
                                             [--tokenizer-mode {auto,slow}]                 # Default: “auto”
                                             [--trust-remote-code]                          # 信任来自 huggingface 的远程代码。
                                             [--download-dir DOWNLOAD_DIR]                  #  下载和加载权重的目录,默认为 huggingface 的默认缓存目录。
                                             [--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,bitsandbytes}]          # Default: “auto”
                                             [--dtype {auto,half,float16,bfloat16,float,float32}]             # Default: “auto”
                                             [--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}]                  # Default: “auto”
                                             [--quantization-param-path QUANTIZATION_PARAM_PATH]
                                             [--max-model-len MAX_MODEL_LEN]                                  # 模型上下文长度。如果未指定,将自动从模型config中派生。派生时会自动考虑 rope_scaling 缩放值扩展原始模型的最大长度。
                                             [--guided-decoding-backend {outlines,lm-format-enforcer}]
                                             [--distributed-executor-backend {ray,mp}]                        #  pip install ray, 用于分布式服务的后端。当使用多于 1 个 GPU 时,如果安装了 ray 则会自动设置为"ray",否则为"mp"(多处理)。
                                             [--worker-use-ray]
                                             [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]                #  PP模型并行数量。
                                             [--tensor-parallel-size TENSOR_PARALLEL_SIZE]                    #  TP模型并行数量, 一般用这个
                                             [--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS]    #  分批顺序加载模型,以避免在使用张量并行和大型模型时出现内存溢出。
                                             [--ray-workers-use-nsight]
                                             [--block-size {8,16,32}]                                         # Token连续块的块大小。 默认值16
                                             [--enable-prefix-caching]                                        # 启用自动前缀缓存。
                                             [--disable-sliding-window]
                                             [--use-v2-block-manager]
                                             [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS]
                                             [--seed SEED]                                                     # 操作的随机种子。
                                             [--swap-space SWAP_SPACE]                                         # 每个 GPU 的 CPU 交换空间大小(GiB)。
                                             [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]                 # GPU 内存占用率。
                                             [--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE]
                                             [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS]                 # 每次迭代最大批处理 tokens 数量。
                                             [--max-num-seqs MAX_NUM_SEQS]                                     # 每次迭代的最大序列数。 默认值256
                                             [--max-logprobs MAX_LOGPROBS]                                     # 在 SamplingParams 中指定的最大 log probs 返回数量。默认为20
                                             [--disable-log-stats]                                             # 禁用日志统计。
                                             [--quantization {aqlm,awq,deepspeedfp,fp8,marlin,gptq_marlin_24,gptq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes,None}]    # 用于量化权重的方法。如果为 None,我们首先检查模型配置文件中的 quantization_config 属性。如果该属性为 None,我们假设模型权重未经过量化,并使用 dtype 确定权重的数据类型。
                                             [--rope-scaling ROPE_SCALING]                                     # RoPE 缩放配置,采用 JSON 格式。例如 {"type":"dynamic","factor":2.0}
                                             [--rope-theta ROPE_THETA]                                         # RoPE theta。与 rope_scaling 一起使用。在某些情况下,更改 RoPE theta 可以提高缩放模型的性能。
                                             [--enforce-eager]                                                 # 始终使用 eager-mode PyTorch。如果为 False,将使用 eager 模式和 CUDA 图形进行混合,以获得最大的性能和灵活性。
                                             [--max-context-len-to-capture MAX_CONTEXT_LEN_TO_CAPTURE]         # 该参数已经弃用, 使用下面的 max-seq-len-to-capture 参数替代
                                             [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE]                 # CUDA 图形覆盖的最大序列长度。当序列的上下文长度大于此值时,我们将退回到 eager 模式。
                                             [--disable-custom-all-reduce]
                                             [--tokenizer-pool-size TOKENIZER_POOL_SIZE]
                                             [--tokenizer-pool-type TOKENIZER_POOL_TYPE]
                                             [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG]
                                             [--enable-lora]                                                   # 如果为 True,则启用对 LoRA 适配器的处理。
                                             [--max-loras MAX_LORAS]                                           # 单个批次中最大的 LoRA 数量。
                                             [--max-lora-rank MAX_LORA_RANK]                                   # Max LoRA rank. 默认为16
                                             [--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE]                   # LoRA 适配器中可以存在的额外词汇表的最大大小(添加到基础模型词汇表中)。默认为256
                                             [--lora-dtype {auto,float16,bfloat16,float32}]                    # Default: “auto”
                                             [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS]           # 指定多个缩放因子(可以与基础模型缩放因子不同 - 请参见例如 Long LoRA),以允许同时使用多个使用这些缩放因子训练的 LoRA 适配器。如果未指定,只允许使用使用基础模型缩放因子训练的适配器。
                                             [--max-cpu-loras MAX_CPU_LORAS]                                   # 在 CPU 内存中存储的最大 LoRA 数量。必须大于或等于 max_num_seqs。默认为 max_num_seqs。
                                             [--fully-sharded-loras]                                           # 默认情况下,只有一半的 LoRA 计算是使用张量并行计算的。启用此功能将使用完全分片的层。对于高序列长度、最大秩或张量并行大小,这可能更快
                                             [--device {auto,cuda,neuron,cpu,tpu,xpu}]                         # Default: “auto”
                                             [--image-input-type {pixel_values,image_features}]
                                             [--image-token-id IMAGE_TOKEN_ID]
                                             [--image-input-shape IMAGE_INPUT_SHAPE]
                                             [--image-feature-size IMAGE_FEATURE_SIZE]
                                             [--image-processor IMAGE_PROCESSOR]
                                             [--image-processor-revision IMAGE_PROCESSOR_REVISION]
                                             [--disable-image-processor]
                                             [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR]
                                             [--enable-chunked-prefill]                                        # 如果设置,前置请求可以基于 max_num_batched_tokens 进行分块。
                                             [--speculative-model SPECULATIVE_MODEL]
                                             [--num-speculative-tokens NUM_SPECULATIVE_TOKENS]
                                             [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
                                             [--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN]
                                             [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE]
                                             [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
                                             [--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN]
                                             [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]
                                             [--preemption-mode PREEMPTION_MODE]
                                             [--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]]      # API 中使用的模型名称(s)。如果提供了多个名称,服务器将响应任何提供的名称。响应中 model 字段的模型名称将是该列表中的第一个名称。如果未指定,模型名称将与--model 参数相同。请注意,此名称(s)也将用于 prometheus 指标的 model_name 标签内容,如果提供了多个名称,指标标签将使用第一个。
                                             [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH]            # QLoRA 适配器的名称或路径。
                                             [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
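As a concrete example (model path, port and parallel size are placeholders), a launch command for the setup used in this post might look like:
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 41084 \
    --model /path/to/llama3-instruct \
    --dtype float16 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-seq-len-to-capture 9800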
2.2.2 Deploying with a Python script that replicates python -m vllm.entrypoints.openai.api_server, with room for extra custom features; code below

The API service is launched in the same way as vllm.entrypoints.openai.api_server:

python my_api.py --port 41084 --model Qwen/Qwen2-0.5B-Instruct
#!/usr/bin/env python3
import os,sys
current_file_path = os.path.realpath(__file__) # 获取当前运行文件的路径
parent_directory = os.path.dirname(current_file_path) # 获取当前运行文件的父目录
append_path = os.path.split(parent_directory)[0]
sys.path.append(append_path)
import subprocess  # 用于在子进程中执行命令
from typing import Any, Dict, List,Optional,Union
from loguru import logger
# 重定义终端logger显示颜色
logger.configure(handlers=[
    {
        "sink": sys.stderr,
        "format": "{time:YYYY-MM-DD HH:mm:ss.SSS} |<cyan><lvl>{level:8}</></>| {name} : {module}:{line:4} | <cyan>mymodule</> | - <lvl>{message}</>",
        "colorize": True
    },
])

# 定义vLLM默认参数配置, 包含了服务器地址、端口、模型、并行大小、GPU内存利用率等关键参数
# None 表示使用vLLM默认值or环境变量or命令行参数
# NOTE 这些默认值可以被环境变量或命令行参数覆盖
DEFAULT_ARGS: Dict[str, Any] = {
    "host": "0.0.0.0",   # 设置为 0.0.0.0 以允许外部访问。
    "port": "8000",      # 端口地址
    "model": "Qwen/Qwen2-0.5B-Instruct",  # 模型路径
    "served-model-name": None,  # 服务模型名称
    "api-key": None,     # 设定 API 密钥, 服务端设定了API密钥后, 客户端就必须使用相同的API密钥请求. 如果未设置,客户端可以请求任何API密钥, vLLM不会验证客户端的API密钥
    "tokenizer": None,   # 用于分词的tokenizer, 不给定则使用model的tokenizer
    "chat-template": None,         # 用于生成对话模板的 Jinja2 模板, 未传入时使用tokenizer的默认模板。 例如: --chat-template ./examples/template_chatml.jinja
    "tensor-parallel-size": "1",   #  TP模型并行数量, 一般用这个
    "pipeline-parallel-size": None, #  PP模型并行数量
    "max-parallel-loading-workers": None,  # 分批顺序加载模型,以避免在使用张量并行和大型模型时出现内存溢出。
    "gpu-memory-utilization": "0.9",  # GPU 内存占用率。
    "max-seq-len-to-capture": "12288",  # CUDA 图形覆盖的最大序列长度。当序列的上下文长度大于此值时,我们将退回到 eager 模式。
    "quantization": None,         
    "max-num-batched-tokens": None,  # 每次迭代最大批处理 tokens 数量。
    "max-num-seqs": None,            # 每次迭代的最大序列数。 默认值256
    "max-model-len": None,           # 模型最大长度, 超过此长度将被截断。如果未指定,将自动从模型config中派生。派生时会自动考虑 rope_scaling 缩放值扩展原始模型的最大长度。
    "trust-remote-code": True,       # 信任远程代码
    "dtype": "bfloat16",              # 模型加载的数据类型,支持 "float16" 和 "bfloat16"。
    "device": "auto",                # 模型加载的设备。
    "download-dir": None,
    "enable-lora": None,             # 是否开启LoRA, bool
    "lora_modules": None,            # lora 路径
}

def get_env_value(key: str, default: Any) -> Any:
    """
    技术说明:
    1. 使用os.getenv函数获取环境变量的值
    2. 将键名转换为大写并替换连字符为下划线,以匹配常见的环境变量命名规范
    3. 如果环境变量不存在,返回提供的默认值
    """
    return os.getenv(key.upper().replace("-", "_"), default)

def prepare_args() -> List[str]:
    """
    准备命令行参数,优先使用命令行输入的参数,其次是环境变量,最后是默认值。

    返回:
    List[str]: 准备好的命令行参数列表

    技术说明:
    1. 使用sys.argv[1:]获取命令行参数,跳过脚本名称
    2. 遍历DEFAULT_ARGS字典,检查每个参数是否已在命令行中指定
    3. 如果参数未在命令行中指定,则尝试从环境变量获取值
    4. 如果环境变量也不存在,则使用默认值
    5. 根据参数类型(布尔值或其他)构造适当的命令行参数格式
    """
    args = sys.argv[1:]  # 直接使用 sys.argv[1:] 来跳过脚本名称
    logger.info(f"The parameters passed into the `args`: {args}")

    used_keys = set()
    processed_args = []
    # 处理命令行中指定的参数
    i = 0
    while i < len(args):
        arg = args[i]
        if arg.startswith('--'):
            key = arg[2:].replace('_', '-')  # 移除 '--' 前缀并替换下划线为连字符
            used_keys.add(key)
            if i + 1 < len(args) and not args[i + 1].startswith('--'):
                # 这是一个键值对参数
                processed_args.extend([arg, args[i + 1]])
                i += 2
            else:
                # 这是一个标志参数, 解析标志参数
                processed_args.append(arg)
                i += 1
        else:
            # 处理非预期的参数格式
            logger.warning(f"Unexpected argument format: {arg}")
            i += 1

    # 添加环境变量中的参数(如果命令行中没有指定)
    for key, default_value in DEFAULT_ARGS.items():
        if key not in used_keys:
            value = get_env_value(key, default_value)
            if value is not None:
                if isinstance(value, bool):
                    # 对于布尔值,如果为True,添加标志参数
                    if value:
                        processed_args.append(f"--{key}")
                elif value != "":
                    # 对于非空的非布尔值,添加键值对参数
                    processed_args.extend([f"--{key}", str(value)])

    return processed_args

def format_args(args: List[str]) -> str:
    """
    技术说明:
    1. 遍历参数列表,每两个元素为一组(参数名和值)
    2. 将每组参数格式化为"参数名 参数值"的形式
    3. 使用换行符连接所有格式化后的参数,实现每行一个参数的效果
    """
    formatted_args = []
    i = 0
    while i < len(args):
        if args[i].startswith('--'):
            if i + 1 < len(args) and not args[i + 1].startswith('--'):
                # 这是一个键值对参数
                formatted_args.append(f"{args[i]} {args[i + 1]}")
                i += 2
            else:
                # 这是一个标志参数, 解析标志参数
                formatted_args.append(args[i])
                i += 1
        else:
            # 处理非预期的参数格式
            formatted_args.append(args[i])
            i += 1

    return "\n".join([str(i) for i in formatted_args])

"""
启动vLLM(Vector Language Model)服务器的Python脚本。主要技术和概念:
日志管理:使用loguru库进行高级日志配置,提供了自定义格式和彩色输出。
参数管理:通过字典存储默认参数,并结合环境变量和命令行参数实现灵活的参数配置。
环境变量处理:使用os.getenv函数获取环境变量,实现配置的灵活性。
命令行参数处理:使用sys.argv获取命令行参数,并进行解析和处理。
字符串格式化:使用f-strings和join方法进行高效的字符串格式化。
异常处理:使用try-except块捕获和处理可能出现的异常,提高程序的健壮性。
子进程管理:使用subprocess.run函数启动vLLM服务器作为子进程。
模块化设计:将不同功能分解为独立的函数,如prepare_args、format_args和run_server,提高代码的可读性和可维护性。
这个脚本的主要目的是提供一个灵活且用户友好的方式来启动vLLM服务器,允许用户通过命令行参数或环境变量来自定义服务器的配置。它特别适用于在不同环境中部署和管理vLLM服务,如开发、测试或生产环境。
1. 例如,您可以这样运行:
python your_script.py --model /path/to/your/model --tensor-parallel-size 2
2. 或者设置环境变量后运行:
export MODEL=/path/to/your/model
export TENSOR_PARALLEL_SIZE=2
python your_script.py
这样,这个脚本就可以完全替代直接在命令行运行 python -m vllm.entrypoints.openai.api_server 的功能,同时提供了更多的灵活性和可配置性。
"""
# 替代直接在命令行运行 python -m vllm.entrypoints.openai.api_server 的功能,同时提供了更多的灵活性和可配置性。
def run_server() -> None:
    """
    运行vLLM服务器的主函数

    技术说明:
    1. 使用prepare_args函数获取准备好的命令行参数
    2. 使用format_args函数格式化参数,并通过logger输出启动信息
    3. 构造完整的命令,包括Python解释器路径和vLLM入口点
    4. 使用subprocess.run执行vLLM服务器命令
    5. 捕获可能的异常并通过logger输出错误信息
    """
    try:
        # 准备参数
        vllm_args = prepare_args() # 解析后是列表形式

        # 打印格式化后的启动信息
        logger.success(f"Starting vLLM server with args: \n{format_args(vllm_args)}")
    
        # 使用subprocess运行vLLM服务器
        cmd = [sys.executable, "-m", "server_api.api_server"] + vllm_args
        subprocess.run(cmd, check=True)

    except subprocess.CalledProcessError as e:
        logger.error(f"Error starting vLLM server: {e}")
        sys.exit(1)
    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        sys.exit(1)

if __name__ == "__main__":
    run_server()
2.2.3 python -m vllm.entrypoints.openai.api_server does not support a custom sequence penalty out of the box, so we modify the vllm.entrypoints.openai.api_server source to support the inference-time sequence suppression penalty
# Based on: vllm.entrypoints.openai.api_server
import asyncio
import importlib
import inspect
import re
from contextlib import asynccontextmanager
from http import HTTPStatus
from typing import Optional, Set

import fastapi
import uvicorn
from fastapi import Request
from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse, Response, StreamingResponse
from prometheus_client import make_asgi_app
from starlette.routing import Mount

import vllm.envs as envs
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.openai.cli_args import make_arg_parser
from vllm.entrypoints.openai.protocol import (ChatCompletionRequest as _ChatCompletionRequest,
                                              ChatCompletionResponse,
                                              CompletionRequest as _CompletionRequest,
                                              EmbeddingRequest, ErrorResponse)
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
from vllm.entrypoints.openai.serving_embedding import OpenAIServingEmbedding
from vllm.logger import init_logger
from vllm.usage.usage_lib import UsageContext
from vllm.version import __version__ as VLLM_VERSION


# ======================================== 修改 vLLM 源码 start =================================================
from vllm.sampling_params import SamplingParams
import torch
from typing import List, ClassVar, Union
import os,sys
current_file_path = os.path.realpath(__file__) # 获取当前运行文件的路径
parent_directory = os.path.dirname(current_file_path) # 获取当前运行文件的父目录
append_path = os.path.split(parent_directory)[0]
sys.path.append(append_path)

# TODO++ 修改vllm源码, 类的继承, requests 接受新的参数, 实现序列抑制
class ChatCompletionRequest(_ChatCompletionRequest):
    # TODO++ 新增序列惩罚参数
    target_sequences_ids: Optional[List[List[Union[int, str]]]] = None
    sequences_penalty_factor: Optional[float] = None

    # 继承魔改父类的方法
    def to_sampling_params(self) -> SamplingParams:
        # We now allow logprobs being true without top_logrobs.

        logits_processors = None
        if self.logit_bias:

            def logit_bias_logits_processor(
                    token_ids: List[int],
                    logits: torch.Tensor) -> torch.Tensor:
                assert self.logit_bias is not None
                for token_id, bias in self.logit_bias.items():
                    # Clamp the bias between -100 and 100 per OpenAI API spec
                    bias = min(100, max(-100, bias))
                    logits[int(token_id)] += bias
                return logits

            logits_processors = [logit_bias_logits_processor]

        # TODO ++ 新增的序列惩罚processor
        if self.target_sequences_ids is not None and self.sequences_penalty_factor is not None:
            def sequences_penalty_logits_processor(
                                                    token_ids: List[int],
                                                    logits: torch.Tensor) -> torch.Tensor:
                
                assert isinstance(token_ids, list) and isinstance(logits, torch.Tensor) and logits.ndim == 1, f"'token_ids' and 'logits' must be list and torch.Tensor with 1D shape, but got {type(token_ids)} and {type(logits)} with {logits.ndim}D shape."
                max_seq_length = max([len(i) for i in self.target_sequences_ids])
                # 获取当前上下文,长度最大不超过 max_seq_length
                context = token_ids[-max_seq_length:] if token_ids else []
                context_length = len(context)
                
                # 遍历每个目标序列进行检查
                for target_sequence in self.target_sequences_ids:
                    seq_length = len(target_sequence)
                    
                    # 序列=1时, 直接惩罚下一个token
                    if seq_length == 1:
                        next_token = torch.tensor(target_sequence, device=logits.device) # logits.device 确保与logits相同的设备
                        logits[next_token] = logits[next_token] * self.sequences_penalty_factor
                    else:
                        # 序列>=2时检查从完全匹配到部分匹配的可能性
                        for i in range(1, min(seq_length, context_length) + 1):
                            # 如果上下文的尾部与目标序列的头部匹配
                            if context[-i:] == target_sequence[:i]:
                                # 如果匹配长度等于目标序列长度,说明找到完全匹配
                                if i == seq_length:
                                    # 目前没有惩罚,直接跳过
                                    pass
                                else:
                                    # 如果是部分匹配,只降低下一个预期token的分数
                                    next_token = target_sequence[i]
                                    next_token = torch.tensor([next_token], device=logits.device) # logits.device 确保与logits相同的设备
                                    logits[next_token] = logits[next_token] * self.sequences_penalty_factor
                                break  # 匹配成功后,跳出循环处理下一个序列
                return logits

            if logits_processors is None:
                logits_processors = [sequences_penalty_logits_processor]
            else:
                logits_processors.append(sequences_penalty_logits_processor)
    
        return SamplingParams(
            n=self.n,
            presence_penalty=self.presence_penalty,
            frequency_penalty=self.frequency_penalty,
            repetition_penalty=self.repetition_penalty,
            temperature=self.temperature,
            top_p=self.top_p,
            min_p=self.min_p,
            seed=self.seed,
            stop=self.stop,
            stop_token_ids=self.stop_token_ids,
            max_tokens=self.max_tokens,
            min_tokens=self.min_tokens,
            logprobs=self.top_logprobs if self.logprobs else None,
            prompt_logprobs=self.top_logprobs if self.echo else None,
            best_of=self.best_of,
            top_k=self.top_k,
            ignore_eos=self.ignore_eos,
            use_beam_search=self.use_beam_search,
            early_stopping=self.early_stopping,
            skip_special_tokens=self.skip_special_tokens,
            spaces_between_special_tokens=self.spaces_between_special_tokens,
            include_stop_str_in_output=self.include_stop_str_in_output,
            length_penalty=self.length_penalty,
            logits_processors=logits_processors,
        )

class CompletionRequest(_CompletionRequest):
    # TODO++ 新增序列惩罚参数
    target_sequences_ids: Optional[List[List[Union[int, str]]]] = None
    sequences_penalty_factor: Optional[float] = None

    # 继承魔改父类的方法
    def to_sampling_params(self):
        echo_without_generation = self.echo and self.max_tokens == 0

        logits_processors = None
        if self.logit_bias:

            def logit_bias_logits_processor(
                    token_ids: List[int],
                    logits: torch.Tensor) -> torch.Tensor:
                assert self.logit_bias is not None
                for token_id, bias in self.logit_bias.items():
                    # Clamp the bias between -100 and 100 per OpenAI API spec
                    bias = min(100, max(-100, bias))
                    logits[int(token_id)] += bias
                return logits

            logits_processors = [logit_bias_logits_processor]

        # TODO ++ 新增的序列惩罚processor
        if self.target_sequences_ids is not None and self.sequences_penalty_factor is not None:
            def sequences_penalty_logits_processor(
                                                    token_ids: List[int],
                                                    logits: torch.Tensor) -> torch.Tensor:
                
                assert isinstance(token_ids, list) and isinstance(logits, torch.Tensor) and logits.ndim == 1, f"'token_ids' and 'logits' must be list and torch.Tensor with 1D shape, but got {type(token_ids)} and {type(logits)} with {logits.ndim}D shape."
                max_seq_length = max([len(i) for i in self.target_sequences_ids])
                # 获取当前上下文,长度最大不超过 max_seq_length
                context = token_ids[-max_seq_length:] if token_ids else []
                context_length = len(context)
                
                # 遍历每个目标序列进行检查
                for target_sequence in self.target_sequences_ids:
                    seq_length = len(target_sequence)
                    
                    # 序列=1时, 直接惩罚下一个token
                    if seq_length == 1:
                        next_token = torch.tensor(target_sequence, device=logits.device) # logits.device 确保与logits相同的设备
                        logits[next_token] = logits[next_token] * self.sequences_penalty_factor
                    else:
                        # 序列>=2时检查从完全匹配到部分匹配的可能性
                        for i in range(1, min(seq_length, context_length) + 1):
                            # 如果上下文的尾部与目标序列的头部匹配
                            if context[-i:] == target_sequence[:i]:
                                # 如果匹配长度等于目标序列长度,说明找到完全匹配
                                if i == seq_length:
                                    # 目前没有惩罚,直接跳过
                                    pass
                                else:
                                    # 如果是部分匹配,只降低下一个预期token的分数
                                    next_token = target_sequence[i]
                                    next_token = torch.tensor([next_token], device=logits.device) # logits.device 确保与logits相同的设备
                                    logits[next_token] = logits[next_token] * self.sequences_penalty_factor
                                break  # 匹配成功后,跳出循环处理下一个序列
                return logits

            if logits_processors is None:
                logits_processors = [sequences_penalty_logits_processor]
            else:
                logits_processors.append(sequences_penalty_logits_processor)

        return SamplingParams(
            n=self.n,
            best_of=self.best_of,
            presence_penalty=self.presence_penalty,
            frequency_penalty=self.frequency_penalty,
            repetition_penalty=self.repetition_penalty,
            temperature=self.temperature,
            top_p=self.top_p,
            top_k=self.top_k,
            min_p=self.min_p,
            seed=self.seed,
            stop=self.stop,
            stop_token_ids=self.stop_token_ids,
            ignore_eos=self.ignore_eos,
            max_tokens=self.max_tokens if not echo_without_generation else 1,
            min_tokens=self.min_tokens,
            logprobs=self.logprobs,
            use_beam_search=self.use_beam_search,
            early_stopping=self.early_stopping,
            prompt_logprobs=self.logprobs if self.echo else None,
            skip_special_tokens=self.skip_special_tokens,
            spaces_between_special_tokens=(self.spaces_between_special_tokens),
            include_stop_str_in_output=self.include_stop_str_in_output,
            length_penalty=self.length_penalty,
            logits_processors=logits_processors,
            truncate_prompt_tokens=self.truncate_prompt_tokens,
        )
# ======================================== 修改 vLLM 源码 end =================================================


TIMEOUT_KEEP_ALIVE = 5  # seconds

openai_serving_chat: OpenAIServingChat
openai_serving_completion: OpenAIServingCompletion
openai_serving_embedding: OpenAIServingEmbedding

logger = init_logger('vllm.entrypoints.openai.api_server')

_running_tasks: Set[asyncio.Task] = set()


@asynccontextmanager
async def lifespan(app: fastapi.FastAPI):

    async def _force_log():
        while True:
            await asyncio.sleep(10)
            await engine.do_log_stats()

    if not engine_args.disable_log_stats:
        task = asyncio.create_task(_force_log())
        _running_tasks.add(task)
        task.add_done_callback(_running_tasks.remove)

    yield


app = fastapi.FastAPI(lifespan=lifespan)


def parse_args():
    parser = make_arg_parser()
    return parser.parse_args()


# Add prometheus asgi middleware to route /metrics requests
route = Mount("/metrics", make_asgi_app())
# Workaround for 307 Redirect for /metrics
route.path_regex = re.compile('^/metrics(?P<path>.*)$')
app.routes.append(route)


@app.exception_handler(RequestValidationError)
async def validation_exception_handler(_, exc):
    err = openai_serving_chat.create_error_response(message=str(exc))
    return JSONResponse(err.model_dump(), status_code=HTTPStatus.BAD_REQUEST)


@app.get("/health")
async def health() -> Response:
    """Health check."""
    await openai_serving_chat.engine.check_health()
    return Response(status_code=200)


@app.get("/v1/models")
async def show_available_models():
    models = await openai_serving_chat.show_available_models()
    return JSONResponse(content=models.model_dump())


@app.get("/version")
async def show_version():
    ver = {"version": VLLM_VERSION}
    return JSONResponse(content=ver)


@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest,
                                 raw_request: Request):
    generator = await openai_serving_chat.create_chat_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return StreamingResponse(content=generator,
                                 media_type="text/event-stream")
    else:
        assert isinstance(generator, ChatCompletionResponse)
        return JSONResponse(content=generator.model_dump())


@app.post("/v1/completions")
async def create_completion(request: CompletionRequest, raw_request: Request):
    generator = await openai_serving_completion.create_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return StreamingResponse(content=generator,
                                 media_type="text/event-stream")
    else:
        return JSONResponse(content=generator.model_dump())


@app.post("/v1/embeddings")
async def create_embedding(request: EmbeddingRequest, raw_request: Request):
    generator = await openai_serving_embedding.create_embedding(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    else:
        return JSONResponse(content=generator.model_dump())


if __name__ == "__main__":
    args = parse_args()

    app.add_middleware(
        CORSMiddleware,
        allow_origins=args.allowed_origins,
        allow_credentials=args.allow_credentials,
        allow_methods=args.allowed_methods,
        allow_headers=args.allowed_headers,
    )

    if token := envs.VLLM_API_KEY or args.api_key:

        @app.middleware("http")
        async def authentication(request: Request, call_next):
            root_path = "" if args.root_path is None else args.root_path
            if request.method == "OPTIONS":
                return await call_next(request)
            if not request.url.path.startswith(f"{root_path}/v1"):
                return await call_next(request)
            if request.headers.get("Authorization") != "Bearer " + token:
                return JSONResponse(content={"error": "Unauthorized"},
                                    status_code=401)
            return await call_next(request)

    for middleware in args.middleware:
        module_path, object_name = middleware.rsplit(".", 1)
        imported = getattr(importlib.import_module(module_path), object_name)
        if inspect.isclass(imported):
            app.add_middleware(imported)
        elif inspect.iscoroutinefunction(imported):
            app.middleware("http")(imported)
        else:
            raise ValueError(f"Invalid middleware {middleware}. "
                             f"Must be a function or a class.")

    logger.info("vLLM API server version %s", VLLM_VERSION)
    logger.info("args: %s", args)

    if args.served_model_name is not None:
        served_model_names = args.served_model_name
    else:
        served_model_names = [args.model]

    engine_args = AsyncEngineArgs.from_cli_args(args)

    # Enforce pixel values as image input type for vision language models
    # when serving with API server
    if engine_args.image_input_type is not None and \
        engine_args.image_input_type.upper() != "PIXEL_VALUES":
        raise ValueError(
            f"Invalid image_input_type: {engine_args.image_input_type}. "
            "Only --image-input-type 'pixel_values' is supported for serving "
            "vision language models with the vLLM API server.")

    engine = AsyncLLMEngine.from_engine_args(
        engine_args, usage_context=UsageContext.OPENAI_API_SERVER)

    event_loop: Optional[asyncio.AbstractEventLoop]
    try:
        event_loop = asyncio.get_running_loop()
    except RuntimeError:
        event_loop = None

    if event_loop is not None and event_loop.is_running():
        # If the current is instanced by Ray Serve,
        # there is already a running event loop
        model_config = event_loop.run_until_complete(engine.get_model_config())
    else:
        # When using single vLLM without engine_use_ray
        model_config = asyncio.run(engine.get_model_config())

    openai_serving_chat = OpenAIServingChat(engine, model_config,
                                            served_model_names,
                                            args.response_role,
                                            args.lora_modules,
                                            args.chat_template)
    openai_serving_completion = OpenAIServingCompletion(
        engine, model_config, served_model_names, args.lora_modules)
    openai_serving_embedding = OpenAIServingEmbedding(engine, model_config,
                                                      served_model_names)
    app.root_path = args.root_path
    uvicorn.run(app,
                host=args.host,
                port=args.port,
                log_level=args.uvicorn_log_level,
                timeout_keep_alive=TIMEOUT_KEEP_ALIVE,
                ssl_keyfile=args.ssl_keyfile,
                ssl_certfile=args.ssl_certfile,
                ssl_ca_certs=args.ssl_ca_certs,
                ssl_cert_reqs=args.ssl_cert_reqs)
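
As a quick sanity check of the penalty rule implemented above, the following standalone sketch re-implements the same prefix-matching logic outside the server. It is illustrative only; the vocabulary size, token ids, and penalty factor are made up. It confirms that when the generated context ends with a proper prefix of a target sequence, only the next expected token of that sequence is scaled down.

# Standalone sanity check of the sequence-penalty matching rule (illustrative only)
import torch
from typing import List

def apply_sequence_penalty(token_ids: List[int],
                           logits: torch.Tensor,
                           target_sequences_ids: List[List[int]],
                           penalty_factor: float) -> torch.Tensor:
    # Same rule as the logits processor above: if the context ends with a proper
    # prefix of a target sequence, scale the logit of the next expected token.
    max_seq_length = max(len(seq) for seq in target_sequences_ids)
    context = token_ids[-max_seq_length:] if token_ids else []
    for target_sequence in target_sequences_ids:
        seq_length = len(target_sequence)
        if seq_length == 1:
            # Single-token sequences are penalized unconditionally
            logits[target_sequence[0]] *= penalty_factor
            continue
        for i in range(1, min(seq_length, len(context)) + 1):
            if context[-i:] == target_sequence[:i]:
                if i < seq_length:
                    # Partial match: penalize the next expected token of the sequence
                    logits[target_sequence[i]] *= penalty_factor
                # Full match (i == seq_length): no penalty, as in the server code
                break
    return logits

# Toy example: 6-token vocabulary, target sequence [2, 3, 4], context ending in [2, 3]
logits = torch.ones(6)
out = apply_sequence_penalty(token_ids=[0, 2, 3], logits=logits,
                             target_sequences_ids=[[2, 3, 4]], penalty_factor=0.5)
print(out)  # only index 4 is scaled down: [1.0, 1.0, 1.0, 1.0, 0.5, 1.0]

Note that the penalty multiplies raw logits, so a factor below 1.0 lowers a score only when the logit is positive; a negative logit would actually be pushed up, which is worth keeping in mind when choosing the factor.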
2.2.4 Client requests
  • Method 1 (generic): plain HTTP requests with the requests library
  • Verified: as long as the API request parameters match the offline Huggingface generate / vLLM generate parameters, the model load dtype, and the eos_token_ids, the online and offline inference results align exactly
import os,sys
current_file_path = os.path.realpath(__file__) # 获取当前运行文件的路径
parent_directory = os.path.dirname(current_file_path) # 获取当前运行文件的父目录
append_path = os.path.split(parent_directory)[0]
sys.path.append(append_path)
from typing import Optional, Union, List, Dict
from loguru import logger
# 重定义终端logger显示颜色
logger.configure(handlers=[
    {
        "sink": sys.stderr,
        "format": "{time:YYYY-MM-DD HH:mm:ss.SSS} |<cyan><lvl>{level:8}</></>| {name} : {module}:{line:4} | <cyan>mymodule</> | - <lvl>{message}</>",
        "colorize": True
    },
])

import requests

class GptApi():
    def __init__(self):
        self.api_key = "sk-XXXXXXXXXXXXXXXXXXXX"
        self.headers = {"Authorization": 'Bearer ' + self.api_key,}   
        self.url = "https://api.openai.com/v1/chat/completions"
        self.SYSTEM_PROMPT = """You are a helpful assistant""".strip()
 

    def get_response(self, 
                     question:str,
                     model:str="llama3-8b",
                     temperature:float=0.35,       # pass temperature=0.0 for greedy decoding
                     top_p:float = 0.9,
                     top_k:int = 50,
                     repetition_penalty:float = 1.0,
                     max_new_tokens:int = 1150,
                     eos_token_id:List[int] = None,
                     verbose:bool = True,   # 是否打印调试信息
                     url:str=None,
                     system_prompt:str=None,
                     api_key:str=None,
                     **kwargs
                    ):
        assert isinstance(question, str), "question必须是str类型"
        # 更新self值
        if url is not None:
            self.url = url
        if system_prompt is not None:
            self.SYSTEM_PROMPT = system_prompt
        if api_key is not None:
            self.api_key = api_key
            self.headers = {"Authorization": 'Bearer ' + self.api_key,} 

        # post 参数
        params = {
            "messages": [
                    {"role": "system", "content": self.SYSTEM_PROMPT},
                    {"role": "user", "content": question}
                    ],
            "model": model,      # 如果需要切换模型,则在这里修改
            "temperature": temperature,    # temperature 为0, 表示贪婪采样
            "top_p": top_p,
            "top_k": top_k,
            "repetition_penalty": repetition_penalty,
            "max_tokens": max_new_tokens,
            "stop_token_ids": eos_token_id,
            # "response_format":{ "type": "json_object" }  # 输出强制为 json格式
        }

        # 加入用户自定义参数
        for k,v in kwargs.items():
            if v is not None:
                params[k] = v

        params_json = {}
        # 去除 v == None 的参数, 使用模型的默认参数
        for k,v in params.items():
            if v is not None:
                params_json[k] = v

        if verbose:
            logger.info(f"request params: \n{params_json}")

        response = requests.post(
            self.url,  # 三方站 url
            headers=self.headers,
            json=params_json,
            stream=False
        )
        return response.json()


api = GptApi()

model = "llama3-8b"
if "llama3" in model:
    model_name_or_path = "meta-llama/Meta-Llama-3-8B-Instruct"
    eos_token_id = [128001, 128009]


else:
    raise ValueError(f"model type {model} not supported")


# Build the sequence-penalty parameters
from component.logitsprocessor import VllmPenaltySequenceLogitsProcessor
vllm_penalty_kwargs = {
    "tokenizer": model_name_or_path,
    'target_sequences': ["<No related terms>"],   # list of sequences to penalize
    'penalty_factor': 0.95,         # penalty factor: 1.0 means no penalty, None disables it
}
processor = VllmPenaltySequenceLogitsProcessor(**vllm_penalty_kwargs)


# Request (with sequence-penalty parameters)
a = api.get_response(question="介绍一下杭州!",
                        model=model,
                        temperature=0.0,
                        eos_token_id = eos_token_id,
                        # sequence-penalty parameters
                        target_sequences_ids = processor.target_sequences,
                        sequences_penalty_factor = processor.penalty_factor,
                        verbose=True,
    )
print(a)
  • Method 2: the openai client
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "test"
openai_api_base = "http://<IP>:<PORT>/v1"     # base URL only; do not append /chat/completions, the client adds it automatically

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2-0.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ]
)
print("Chat response:\n", chat_response)
print("Chat response content:\n", chat_response.choices[0].message.content)

  • Method 3: curl

(to be updated ...)
