Aligning vLLM generate inference with Huggingface generate inference (long samples)

Contents

1. Offline inference
1.1 Development environment
1.2 Inference alignment
1.2.1 Aligning the generation parameters (gen_kwargs)
1.2.2 Aligning the eos token for text generation
1.2.3 Aligning the model initialization dtype
1.2.4 Choosing an appropriate batch size for vllm generate inference
1.3 Inference code
1.4 Inference results
1.4.1 Result alignment
1.4.2 Inference efficiency
1.5 Sequence suppression penalty at inference time
1.5.1 vLLM generate
1.5.2 Huggingface generate
2. Online deployment
2.1 Deployment basics
2.2 OpenAI-compatible vLLM server deployment
2.2.1 OpenAI-compatible vLLM server argument reference
2.2.2 Deploying with a Python script that replicates python -m vllm.entrypoints.openai.api_server, with room for extra custom features
2.2.3 Modifying the vllm.entrypoints.openai.api_server source to support the inference-time sequence suppression penalty
2.2.4 Client requests


1. Offline inference

All of the conclusions in this post come from my own evaluation in the development environment below; treat them only as a reference.

1.1 Development environment

My data: prompts are generally longer than 9k tokens.

Linux: Ubuntu
GPU: a single A6000 (48 GB)
python=3.10
torch==2.3.0
transformers==4.41.2
vllm==0.5.0.post1
vllm-flash-attn==2.5.9
flash-attn==2.5.9.post1

Goal:
greedy-decoding results from the vllm generate function on batched prompts (generally longer than 9k tokens) should be exactly identical to greedy-decoding results from the Huggingface generate function

1.2 Inference alignment

To make vllm generate inference exactly match Huggingface generate inference, the following changes are the main ones required.

1.2.1 Aligning the generation parameters (gen_kwargs)
  • First, the vllm generation parameters must be aligned with the Huggingface ones. Some of the parameters are named differently in the two frameworks, so we need the mapping between the important ones:
vLLM (generate function) | Huggingface (generate function) | Notes
max_tokens | max_new_tokens | maximum number of new tokens allowed to be generated
top_p | top_p |
top_k (default -1, i.e. the full vocabulary) | top_k (default 50) |
temperature | temperature | temperature; Huggingface does not allow temperature=0, while in vllm temperature=0 means greedy decoding
repetition_penalty | repetition_penalty | repetition penalty
/ | do_sample | in huggingface, do_sample=True means probabilistic sampling and do_sample=False means greedy decoding; vllm has no such parameter, setting temperature=0 gives greedy decoding
  • Based on the table above, we first align the generation parameters of the two frameworks. To keep the experiments stable, both sides use greedy decoding. Note that even with greedy decoding, the value of top-p still affects the generated result, so both sides are configured as follows:
# Both sides use greedy decoding to keep inference consistent
gen_kwargs_vllm = {
    "max_tokens": 1150,
    "top_p": 0.9,
    "top_k": 50,
    "temperature": 0.0,
    "repetition_penalty": 1.0,
}
gen_kwargs_hug = {
    "max_new_tokens": 1150,
    "top_p": 0.9,
    "top_k": 50,
    "temperature": 0.35,   # Huggingface does not allow 0; this value is ignored when do_sample=False
    "repetition_penalty": 1.0,
    "do_sample": False
}
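If the Huggingface-style dict is treated as the source of truth, a small helper can derive the vLLM version from it. This is a minimal sketch based on the mapping table above; hf_to_vllm_gen_kwargs is my own helper name, not part of either library.
def hf_to_vllm_gen_kwargs(hf_kwargs: dict) -> dict:
    """Map Huggingface generate kwargs onto vLLM SamplingParams kwargs (sketch)."""
    vllm_kwargs = {
        "max_tokens": hf_kwargs["max_new_tokens"],               # max_new_tokens -> max_tokens
        "top_p": hf_kwargs.get("top_p", 1.0),
        "top_k": hf_kwargs.get("top_k", -1),                     # vLLM default -1 = full vocabulary
        "repetition_penalty": hf_kwargs.get("repetition_penalty", 1.0),
    }
    # vLLM has no do_sample flag: greedy decoding is expressed as temperature=0.0
    if hf_kwargs.get("do_sample", True):
        vllm_kwargs["temperature"] = hf_kwargs.get("temperature", 1.0)
    else:
        vllm_kwargs["temperature"] = 0.0
    return vllm_kwargs

# hf_to_vllm_gen_kwargs(gen_kwargs_hug) -> {'max_tokens': 1150, 'top_p': 0.9, 'top_k': 50, 'repetition_penalty': 1.0, 'temperature': 0.0}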
1.2.2 Aligning the eos token for text generation
  • Set the end-of-generation tokens for the Huggingface generate function. The llama model used here is llama3-instruct; adjust this for your own model.
# Generation hyperparameters
if model_type == 'llama':
    # set the end-of-generation token IDs
    gen_kwargs_hug['eos_token_id'] = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
    gen_kwargs_hug['pad_token_id'] = self.tokenizer.eos_token_id
elif model_type == 'qwen2':
    # set the end-of-generation token IDs
    gen_kwargs_hug['eos_token_id'] = [151645,151643]
    gen_kwargs_hug['pad_token_id'] = 151643
else:
    raise ValueError(f"Only support 'llama or qwen2' now, but got '{model_type}'")
  • Set the end-of-generation tokens for the vllm generate function. The llama model used here is llama3-instruct; adjust this for your own model.
# Generation hyperparameters
if self.model_type == 'llama':
    # set the end-of-generation token IDs  # gen_kwargs_vllm['pad_token_id'] = self.tokenizer.eos_token_id
    gen_kwargs_vllm['stop_token_ids'] = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
elif self.model_type == 'qwen2':
    # set the end-of-generation token IDs  # gen_kwargs_vllm['pad_token_id'] = 151643
    gen_kwargs_vllm['stop_token_ids'] = [151645,151643]
else:
    raise ValueError(f"Only support llama and qwen2, model_type {self.model_type} is not supported")
1.2.3 Aligning the model initialization dtype

The dtype passed to Huggingface from_pretrained must match the dtype used when initializing the vLLM model.

  •  Loading the model with Huggingface
# Huggingface
model_ = AutoModelForCausalLM.from_pretrained(
            model_name_or_path,
            trust_remote_code=True,
            low_cpu_mem_usage=True,     
            torch_dtype = torch.float16,   # load in float16; the vLLM model below must also be loaded in float16
).eval()
  • Loading the model with vLLM

Some other parameters (to be confirmed):

max_num_batched_tokens=4096,  # maximum number of batched tokens per iteration

max_num_seqs=256,  # maximum number of sequences per iteration

class VLLMInference():
    def __init__(self,
                 model_name_or_path:str,
                 model_type:str,
                 # dtype 模型加载的数据类型, 'auto' 表示自动, torch.float32 表示 fp32, torch.float16 表示 fp16
                 # 与 Huggingface from_pretrained dtype 尽量保持一致, 为了Huggingface generate结果对齐
                 dtype: str,  
                 seed: int=0, # VLLM 默认为0, 默认即可
                 trust_remote_code: bool = True, 
                 tensor_parallel_size: int = 1,   # GPU 张量并行的卡数
                 gpu_memory_utilization: float = 0.9, # GPU 内存占用率
                 max_seq_len_to_capture: int = 9800, # 提升效率的cuda, 可以选择样本中中位数适中的文本长度
                **kwargs
                ):
        
        self.SYSTEM_PROMPT = SYSTEM_PROMPT
        self.model_name_or_path = model_name_or_path
        self.model_name_suffix = os.path.split(model_name_or_path)[-1]
        assert model_type in ['llama', 'qwen2'], f"model_type must be in ['llama', 'qwen2'], but got {model_type}"
        self.model_type = model_type
        
        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                                    trust_remote_code=True,
                                                    use_fast=True
                                                    # use_fast=False if model.config.model_type == 'llama' else True
                                                )
        # VLLM 模型初始化
        # vllm class 详情见: https://docs.vllm.ai/en/latest/dev/offline_inference/llm.html
        self.model = LLM(model=model_name_or_path,
                         trust_remote_code=trust_remote_code,
                         tensor_parallel_size=tensor_parallel_size,
                         dtype=dtype,  # 与 Huggingface from_pretrained dtype 尽量保持一致, 为了Huggingface generate结果对齐
                         gpu_memory_utilization=gpu_memory_utilization,
                         max_seq_len_to_capture=max_seq_len_to_capture,
                         **kwargs
                        )
        self.model.set_tokenizer(tokenizer)
        self.tokenizer = self.model.get_tokenizer()
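To double-check that the two sides really ended up with the same dtype, here is a minimal sketch (the llm_engine.model_config path is an internal vLLM attribute; treat it as an assumption for vllm==0.5.0.post1):
# Huggingface side: dtype of the loaded weights
print(next(model_.parameters()).dtype)                  # expect torch.float16

# vLLM side: the dtype the engine was initialized with (internal attribute, may change between versions)
vllm_infer = VLLMInference(model_name_or_path, model_type='llama', dtype='float16')
print(vllm_infer.model.llm_engine.model_config.dtype)   # expect torch.float16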
1.2.4 Choosing an appropriate batch size for vllm generate inference
  1. My experiments have repeatedly shown that although the vllm generate function can accept all of the data in a single call, passing in too many prompts at once can make the vllm results differ from the Huggingface results.
  2. Most prompts in my dataset are around 9k tokens, i.e. fairly long samples. On a 48 GB A6000, with batch size <= 15, greedy decoding repeatedly produced exactly the same results as Huggingface generate greedy decoding.

1.3 Inference code

  • vLLM code. Note that if vllm runs inference with a model fine-tuned with PI (position interpolation) length extension, vllm automatically scales its max_seq_len according to the model's config.rope_scaling factor, so there is no need to worry about vllm truncating inputs that exceed the original model's context length. For example, my prompts are all around 9k tokens, beyond llama3's 8k context, but after PI length-extension fine-tuning config.rope_scaling=2, and once vllm loads the model max_seq_len=16384 rather than 8192.
class VLLMInference():
    def __init__(self,
                 model_name_or_path:str,
                 model_type:str,
                 # dtype 模型加载的数据类型, 'auto' 表示自动, torch.float32 表示 fp32, torch.float16 表示 fp16
                 # 与 Huggingface from_pretrained dtype 尽量保持一致, 为了Huggingface generate结果对齐
                 dtype: str,  
                 seed: int=0, # VLLM 默认为0, 默认即可
                 trust_remote_code: bool = True, 
                 tensor_parallel_size: int = 1,   # GPU 张量并行的卡数
                 gpu_memory_utilization: float = 0.9, # GPU 内存占用率
                 max_seq_len_to_capture: int = 9800, # 提升效率的cuda, 可以选择样本中中位数适中的文本长度
                **kwargs
                ):
        
        self.SYSTEM_PROMPT = SYSTEM_PROMPT
        self.model_name_or_path = model_name_or_path
        self.model_name_suffix = os.path.split(model_name_or_path)[-1]
        assert model_type in ['llama', 'qwen2'], f"model_type must be in ['llama', 'qwen2'], but got {model_type}"
        self.model_type = model_type
    
        # vllm class 详情见: https://docs.vllm.ai/en/latest/dev/offline_inference/llm.html
        # 需要注意的是,如果vllm使用微调PI扩展长度后的model推理,vllm 会自动根据model config.rope_scaling 的缩放值调整vllm 的max_seq_len 这个参数,
        # 所以不用担心vllm截断超过原始模型输入的长度,比如本人数据prompt长度均在9k,已经超出了llama3 8k的输入长度,
        # 但通过PI扩展长度微调后,config.rope_scaling=2,而vllm加载model后max_seq_len=16384 而不是 8192
        self.model = LLM(model=model_name_or_path,
                         trust_remote_code=trust_remote_code,
                         tensor_parallel_size=tensor_parallel_size,
                         dtype=dtype,
                         gpu_memory_utilization=gpu_memory_utilization,
                         max_seq_len_to_capture=max_seq_len_to_capture,
                         **kwargs
                        )
        
        # tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
        #                                             trust_remote_code=True,
        #                                             padding_side="left",   # 推理侧需要左pad
        #                                             use_fast=True
        #                                             # # llama不支持fast
        #                                             # use_fast=False if model.config.model_type == 'llama' else True
        #                                         )
        # self.model.set_tokenizer(tokenizer)
        self.tokenizer = self.model.get_tokenizer()
        logger.info(f"vllm tokenizer: \n{self.tokenizer}")
        

    def _generate(self, 
                 text:Union[str,List[str]],
                 gen_kwargs:dict,
                 batch_size:int=10,  # vllm 处理的 batch size, 如果不传入合适的batch size 会导致与Huggingface generate 不能对齐 (可能是传入大量的数据内存优化导致的)
                ):
        """
        解释一下 vLLM 在处理大量输入数据时的行为:
        输入处理:
            当你将所有数据作为一个列表传入 model.generate() 方法时,vLLM 确实会接收所有这些输入。但是,它不会简单地将整个输入列表的大小作为单一的 batch size 来处理。
        动态批处理:
            vLLM 使用动态批处理(dynamic batching)技术。这意味着它不会一次性处理所有输入,而是根据当前的计算资源和模型容量,动态地决定如何最efficiently地处理这些输入。
        内部批处理机制:
            vLLM 会内部管理一个请求队列。
            它会根据当前可用的 GPU 内存和计算资源,动态地决定每次处理多少输入。
            这个动态决定的批大小通常小于你提供的全部输入数量,特别是当输入数量很大时。
        最大批大小限制:
            虽然你可能输入了成千上万的提示,但 vLLM 有内部的最大批大小限制。
            这个限制是为了确保efficient的内存使用和计算效率。
            默认情况下,这个限制通常远小于你可能提供的全部数据量。
        连续处理:
            vLLM 会连续地处理输入队列,每次取出一定数量的输入进行处理。
            这个过程会持续进行,直到所有输入都被处理完毕。
        性能优化:
            这种方法允许 vLLM 在处理大量数据时保持高效率。
            它可以充分利用 GPU 资源,同时避免因为一次性加载过多数据而导致的内存问题。
        用户角度:
            从用户的角度来看,当你调用 model.generate(all_prompts, sampling_params) 时,vLLM 会处理所有输入,但内部是分批进行的。
            你会得到所有输入的输出,就好像它们是一次性处理的一样。
        控制批大小:
            如果你确实需要更精细地控制批处理大小,可以考虑使用 AsyncLLMEngine 或者在服务器模式下设置 max_batch_size 参数。
            但在大多数情况下,让 vLLM 自动管理批处理会得到更好的性能。
        """
        gen_kwargs = copy.deepcopy(gen_kwargs)
        # 组织生成文本的超参
        if self.model_type == 'llama':
            # 设置文本生成的结束 token ID,  # gen_kwargs['pad_token_id'] = self.tokenizer.eos_token_id
            gen_kwargs['stop_token_ids'] = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
        elif self.model_type == 'qwen2':
            # 设置文本生成的结束 token ID, # gen_kwargs['pad_token_id'] = 151643
            gen_kwargs['stop_token_ids'] = [151645,151643]
        else:
            raise ValueError(f"Only support llama and qwen2, model_type {self.model_type} is not supported")
        
        # SamplingParams 参数详情见: https://docs.vllm.ai/en/latest/dev/sampling_params.html#vllm.SamplingParams
        sampling_params = SamplingParams(**gen_kwargs)
        logger.warning(f"Now vllm running sampling_params \n{sampling_params}")

        # 组织文本
        if isinstance(text, str):
            text = [text]
        text = [i.strip() for i in text]

        text = [[
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": i}
            ] for i in text]
        
        text = [self.tokenizer.apply_chat_template(i, 
                                                    add_generation_prompt=True, 
                                                    tokenize=False   # False 表示返回 str,不转化为id
                                                ) for i in text]
        # 这里选择转化为id传给prompt_token_ids, 因为vllm默认的tokenizer encode 可能会加入前缀特殊token
        text = [self.tokenizer.encode(i, add_special_tokens=False) for i in text]
        logger.info(f"vllm one input:\n{self.tokenizer.decode(text[0])}")
        batch_num = math.ceil(len(text)/batch_size)
        logger.info(f"VLLM batch size: {batch_size}, length of datas num: {len(text)}, batch_num: {batch_num}")

        # 每次处理一个batch size
        outputs = []
        for b_idx,i in enumerate(range(0,len(text),batch_size)):
            s = i
            e = i + batch_size
            if e >= len(text):
                e = len(text)
            batch_ids = text[s: e]
            if (b_idx + 1) % 10 == 0:
                logger.info(f"batch id/batch num: {b_idx+1}/{batch_num}")
            batch_outputs = self.model.generate(prompt_token_ids = batch_ids, 
                                                sampling_params=sampling_params) 
            outputs = outputs + batch_outputs
        return outputs
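For reference, a minimal usage sketch of the class above (the model path and prompts are placeholders; batch_size=15 follows the observation in 1.2.4):
if __name__ == "__main__":
    infer = VLLMInference(
        model_name_or_path="/path/to/llama3-instruct",   # placeholder path
        model_type="llama",
        dtype="float16",                                 # keep consistent with the Huggingface side
    )
    gen_kwargs_vllm = {"max_tokens": 1150, "top_p": 0.9, "top_k": 50,
                       "temperature": 0.0, "repetition_penalty": 1.0}
    prompts = ["..."]                                    # your long prompts (9k+ tokens)
    vllm_outputs = infer._generate(prompts, gen_kwargs_vllm, batch_size=15)
    print(vllm_outputs[0].outputs[0].text)               # vLLM returns RequestOutput objects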
  • Huggingface code (ModelUtils.load_model is a custom helper class that simply loads the model in float16)
class HuggingFaceInference():
    def __init__(self, 
            model_name_or_path:str, # base model 模型的路径或名称
            model_max_length:int=16384, # 模型允许的最大长度, 默认为 16384(即16k), 原模型长度不够的使用 PI 插值
            adapter_name_or_path:str=None,     # lora 适配器模型的路径或名称, None 表示不使用
            use_cache:bool=True,    # 推理时是否使用模型的缓存
            load_in_4bit:bool=False, # 是否使用 bnb 4bit进行推理,能够节省很多显存,但效果可能会有一定的下降
        ):
        self.model_name_or_path = model_name_or_path
        self.adapter_name_or_path = adapter_name_or_path
        self.model_max_length = model_max_length
        # 定义提示模版
        self.SYSTEM_PROMPT = SYSTEM_PROMPT
        logger.info(f"model_name_or_path: {model_name_or_path}")
        logger.info(f"adapter_name_or_path: {adapter_name_or_path}")
        logger.info(f"SYSTEM_PROMPT: {self.SYSTEM_PROMPT}")

        # 定义base model 和 lora 适配器模型的后缀名称
        self.model_name_suffix = os.path.split(model_name_or_path)[-1]
        if self.adapter_name_or_path is not None:
            self.adapter_name_suffix = os.path.split(adapter_name_or_path)[-1]
        else:
            self.adapter_name_suffix = 'no_adapter'
        
        logger.info(f"Loading model: {model_name_or_path} and adapter: {adapter_name_or_path}")
        self.config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
        self.config.use_cache = use_cache  
        self.model_type = self.config.model_type
        # NOTE 从预训练模型配置中获取原始上下文长度, 
        # 如果设置的上下文窗口大小超过了原始长度,则需要计算 RoPE 缩放因子
        if self.config.model_type == 'llama':
            # 使用PI插值, 是否扩展RoPE的position
            orig_ctx_len = getattr(self.config, "max_position_embeddings", None)
            if orig_ctx_len and model_max_length > orig_ctx_len:
                scaling_factor = float(math.ceil(model_max_length / orig_ctx_len))
                self.config.rope_scaling = {"type": "linear", "factor": scaling_factor}
                logger.success(f'Use PI/NTK change {self.config.model_type} model_max_length from {orig_ctx_len} to {model_max_length}')
            else:
                logger.warning(f'Do not use PI/NTK, because {self.config.model_type} model_max_length {model_max_length} <= max_position_embeddings {orig_ctx_len}')
        elif self.config.model_type == 'qwen2':
            orig_ctx_len = getattr(self.config, "max_position_embeddings", None)
            logger.warning(f"{self.config.model_type} 'max_position_embeddings' is {orig_ctx_len}, {self.config.model_type} not support PI/NTK now. Default use '--model_max_length {model_max_length}'")
        else:
            raise ValueError(f"Only support 'llama or qwen2' now, but got '{self.config.model_type}'")

        # 加载模型
        self.model = ModelUtils.load_model(
                                    model_name_or_path,
                                    config=self.config,
                                    load_in_4bit=load_in_4bit,  # 是否 4bit 量化加载
                                    adapter_name_or_path=adapter_name_or_path,  # adapter_name_or_path 为 None 表示不使用lora适配器
                                    device_map='auto',
                                    ).eval()   # 开始 eval 模式

        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                                    trust_remote_code=True,
                                                    padding_side="left",   # 推理侧需要左pad
                                                    use_fast=True
                                                    # # llama不支持fast
                                                    # use_fast=False if model.config.model_type == 'llama' else True
                                                )
        logger.info(f"config:\n{self.config}")
        logger.info(f"model structure:\n{self.model}")
        logger.info(f"tokenizer:\n{self.tokenizer}")
        
    def _generate(self, 
                    texts:Union[List[str], str],   # 推理文本
                    model_type:str, # llama or qwen2
                    gen_kwargs:dict = {}, # 生成文本的超参
                ):
        """
        Currently texts are processed one at a time; concurrent batched inference is not supported yet.
        """
        if isinstance(texts, str):
            texts = [texts]
        texts = [text.strip() for text in texts]

        # 生成超参配置
        if gen_kwargs == {}:
            # gen_kwargs = {
            #     'max_new_tokens': 900, # 生成文本的最大新 token 数
            #     'top_p': 0.9,  # 使用 top-p 采样策略,只考虑概率最高的一部分 token
            #     'temperature': 0.35,  # 控制生成文本的随机性,温度越高,随机性越大
            #     'repetition_penalty': 1.0,  # 重复惩罚系数,防止生成过于重复的文本
            #     'do_sample': True  # 是否进行采样,即根据概率分布随机生成 token
            # }
            logger.warning(f"gen_kwargs is empty, use default gen_kwargs")
            gen_kwargs = {
                'max_new_tokens': 900, 
                'top_p': 0.9,
                # "top_k": 20,
                'temperature': 0.35,  
                'repetition_penalty': 1.0,  
                'do_sample': True  
            }
        # TODO++: 深拷贝 gen_kwargs,防止影响到gen_kwargs_grid的原始参数
        gen_kwargs = copy.deepcopy(gen_kwargs)

        # 组织生成文本的超参
        if model_type == 'llama':
            # 设置文本生成的结束 token ID
            gen_kwargs['eos_token_id'] = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
            gen_kwargs['pad_token_id'] = self.tokenizer.eos_token_id
        elif model_type == 'qwen2':
            # 设置文本生成的结束 token ID
            gen_kwargs['eos_token_id'] = [151645,151643]
            gen_kwargs['pad_token_id'] = 151643
        else:
            raise ValueError(f"Only support 'llama or qwen2' now, but got '{model_type}'")
        logger.info(f"generate gen_kwargs: \n{gen_kwargs}")
        logger.info(f"generate model_type: {model_type}")

        outputs = []  # 保存推理输出
        # 模型推理(逐行推理)
        for idx,text in enumerate(tqdm.tqdm(texts)):
            messages = [
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": text}
            ]
            text_prompt = self.tokenizer.apply_chat_template(messages, 
                                                             add_generation_prompt=True, 
                                                             tokenize=False   # False 表示返回 str,不转化为id
                                                        )
            # 转化为id
            text_input = self.tokenizer(text_prompt,
                               add_special_tokens=False,
                               return_tensors='pt').to(self.model.device)
            if idx <= 0:
                # 打印一条提示文本
                logger.info(f"text_input: \n{text_input}")  
                logger.info(f"text_promt:\n{self.tokenizer.decode(text_input['input_ids'][0],skip_special_tokens=False)}")
            # 生成最终的 .generate() 函数输入
            input_id = text_input['input_ids']
            gen_kwargs["input_ids"] = input_id  # input_ids 放进去
            gen_kwargs["attention_mask"] = text_input['attention_mask']  # attention_mask 放进去

            with torch.no_grad():
                s = time.time()
                # 推理
                output_id = self.model.generate(**gen_kwargs)
                e = time.time()
                output = self.tokenizer.decode(output_id[0][len(input_id[0]): ], skip_special_tokens=True)
                print(f"use_time: {e-s}s\n{output}")
                logger.success(f"model: {self.model_name_suffix} adapter: {self.adapter_name_suffix} success inference line_num: {idx+1} paper, generate gen_kwargs: \n{gen_kwargs}")
                # 逐行写入缓存数据, 仅用于debug
                with open(f"./{self.model_name_suffix}_{self.adapter_name_suffix}_response.tmp", 'a', encoding='utf-8') as jsonl_file:
                    jsonl_file.write(str(output) + '\n\n\n')
                outputs.append(output)
        return outputs
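And a matching usage sketch for the Huggingface class (same placeholders as above):
if __name__ == "__main__":
    hf_infer = HuggingFaceInference(
        model_name_or_path="/path/to/llama3-instruct",   # placeholder path
        model_max_length=16384,
        adapter_name_or_path=None,
    )
    gen_kwargs_hug = {"max_new_tokens": 1150, "top_p": 0.9, "top_k": 50,
                      "temperature": 0.35, "repetition_penalty": 1.0, "do_sample": False}
    hf_outputs = hf_infer._generate(["..."], model_type="llama", gen_kwargs=gen_kwargs_hug)
    print(hf_outputs[0])                                 # plain strings, one per input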

1.4 Inference results

1.4.1 Result alignment

Below is a comparison of the two inference outputs; they are indeed strictly aligned, item by item.

  • Huggingface output
[How to evaluate the idea of the paper]
<Innovative approach to improving GANs> The paper proposes a novel approach to improving GANs by incorporating image quality assessment techniques into the energy function of the discriminator, which is a unique and innovative idea in the field of generative adversarial networks.
<Use of multi-component energy function> The use of a multi-component energy function based on image quality assessment techniques, such as l1 score, gradient magnitude similarity score, and chrominance score, is a novel and interesting approach to enhancing the performance of GANs.

[Compared to previous similar works, what are the essential differences, such as any fundamental differences, improvements, innovations]
<Novelty in energy function> The paper introduces a novel energy function for the discriminator, which is a departure from the traditional mean squared error (MSE) used in previous GANs. This new energy function is based on image quality assessment techniques, providing a fundamental difference from previous works.
<Experimental evaluation> The paper provides a comprehensive experimental evaluation of the proposed approach, including both quantitative and qualitative assessments, which sets it apart from previous works in the field.

[How to evaluate the experimental results in the paper]
<Thorough experimental evaluation> The paper presents a thorough experimental evaluation of the proposed approach, including both quantitative and qualitative assessments, which provides a comprehensive understanding of the performance of the method.
<Need for additional experiments> While the experimental evaluation is thorough, there are suggestions for additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, which could further strengthen the evaluation.

[Potential reasons for acceptance]
<Innovative approach> The paper presents an innovative approach to improving GANs by incorporating image quality assessment techniques into the energy function of the discriminator, which contributes to the advancement of the field.
<Thorough experimental evaluation> The paper provides a comprehensive experimental evaluation of the proposed approach, demonstrating the thoroughness and rigor of the research.

[Potential reasons for rejection]
<Clarity and presentation> The paper lacks clarity in presenting the proposed approach and experimental results, which may hinder the understanding and evaluation of the method.
<Need for additional experiments> There are suggestions for additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, which could strengthen the paper.

[Other suggestions for further improving the quality of the paper]
<Clarity and presentation> Improving the clarity and presentation of the proposed approach and experimental results could enhance the overall quality of the paper and facilitate better understanding and evaluation of the method.
<Additional experiments> Conducting additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, could further strengthen the paper and provide a more comprehensive evaluation of the proposed method.

[Other important review comments]
<No related terms> -
  • vLLM output
[How to evaluate the idea of the paper]
<Innovative approach to improving GANs> The paper proposes a novel approach to improving GANs by incorporating image quality assessment techniques into the energy function of the discriminator, which is a unique and innovative idea in the field of generative adversarial networks.
<Use of multi-component energy function> The use of a multi-component energy function based on image quality assessment techniques, such as l1 score, gradient magnitude similarity score, and chrominance score, is a novel and interesting approach to enhancing the performance of GANs.

[Compared to previous similar works, what are the essential differences, such as any fundamental differences, improvements, innovations]
<Novelty in energy function> The paper introduces a novel energy function for the discriminator, which is a departure from the traditional mean squared error (MSE) used in previous GANs. This new energy function is based on image quality assessment techniques, providing a fundamental difference from previous works.
<Experimental evaluation> The paper provides a comprehensive experimental evaluation of the proposed approach, including both quantitative and qualitative assessments, which sets it apart from previous works in the field.

[How to evaluate the experimental results in the paper]
<Thorough experimental evaluation> The paper presents a thorough experimental evaluation of the proposed approach, including both quantitative and qualitative assessments, which provides a comprehensive understanding of the performance of the method.
<Need for additional experiments> While the experimental evaluation is thorough, there are suggestions for additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, which could further strengthen the evaluation.

[Potential reasons for acceptance]
<Innovative approach> The paper presents an innovative approach to improving GANs by incorporating image quality assessment techniques into the energy function of the discriminator, which contributes to the advancement of the field.
<Thorough experimental evaluation> The paper provides a comprehensive experimental evaluation of the proposed approach, demonstrating the thoroughness and rigor of the research.

[Potential reasons for rejection]
<Clarity and presentation> The paper lacks clarity in presenting the proposed approach and experimental results, which may hinder the understanding and evaluation of the method.
<Need for additional experiments> There are suggestions for additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, which could strengthen the paper.

[Other suggestions for further improving the quality of the paper]
<Clarity and presentation> Improving the clarity and presentation of the proposed approach and experimental results could enhance the overall quality of the paper and facilitate better understanding and evaluation of the method.
<Additional experiments> Conducting additional experiments, such as testing the approach on other datasets and conducting more detailed analyses of the results, could further strengthen the paper and provide a more comprehensive evaluation of the proposed method.

[Other important review comments]
<No related terms> -
1.4.2 Inference efficiency

On a single 48 GB A6000, Huggingface generate takes about 30 s per sample on average, while vLLM takes about 3.6 s per sample, a speedup of close to 8x, which is lower than the speedups reported officially by vLLM.

1.5 Sequence suppression penalty at inference time

Sometimes I do not want the model to generate certain sequences, which requires penalizing the logits of those sequences at inference time. The Huggingface and vLLM implementations are as follows.

"""
适用性:
对于Hugging Face,你可以将这个处理器添加到 GenerationConfig或generate函数 的 logits_processor 列表中。
对于VLLM,如示例所示,你可以将它添加到 SamplingParams 的 logits_processors 列表中。

工作原理:
处理器在每次生成新token时被调用,接收当前的 input_ids 和 scores。
它检查当前序列是否与目标序列(完全或部分)匹配。
如果匹配,它会降低相应token的生成概率。

参数作用:
tokenizer: 用于将目标序列转换为token ID。如果传入的是字符串(分词器名称),则自动加载相应的分词器。
target_sequences: 需要抑制的序列列表。
penalty_factor: 控制抑制的强度。取值范围为0.0到1.0,对于乘法模式,应小于1(如0.5);对于减法模式,应为正数(如5.0)。
use_multiplicative: 决定使用乘法还是减法来应用惩罚。

有效抑制序列生成:
完全匹配:如果检测到完整的目标序列,目前不进行惩罚(可以根据需求修改)。
部分匹配:如果检测到部分匹配,只降低下一个预期token的概率,允许序列在其他方向发展。

灵活性:
可以轻松添加或修改目标序列。
通过调整 penalty_factor 和 use_multiplicative,可以控制抑制的强度和方式。

通过这种方法,你可以有效地降低特定序列的生成概率,而不会完全禁止它们,从而在保持模型灵活性的同时实现一定程度的内容控制。
"""
1.5.1 vLLM generate
import torch
from vllm import LLM, SamplingParams
from typing import Union, List, Tuple, Dict, Any, Callable, Optional, Iterable
from transformers.generation.logits_process import LogitsProcessor, LogitsProcessorList
from transformers import PreTrainedTokenizer,PreTrainedTokenizerBase,AutoTokenizer
from loguru import logger

class VllmPenaltySequenceLogitsProcessor(LogitsProcessor):
    def __init__(self, 
                 tokenizer: Union[PreTrainedTokenizer,str], 
                 target_sequences:List[str] = [], 
                 penalty_factor:Union[float, int]=0.5,  # 0.0 - 1.0 之间, 1.0 表示不惩罚, use_multiplicative 为False时, penalty_factor 输入整数
                 use_multiplicative:bool=True
                ):
        """
        vllm 初始化抑制某些目标序列的处理器。

        参数:
        - tokenizer: 分词器对象,用于将目标序列转换为token ID
        - target_sequences: 需要抑制的目标序列列表
        - penalty_factor: 惩罚因子,1. 用于调整目标序列的生成概率float  2. 用于控制抑制的强度int
        - use_multiplicative: 是否使用乘法方式应用惩罚(True为乘法,False为减法)
        """
        self.tokenizer = tokenizer
        if isinstance(tokenizer, str): # 传入的是分词器名字,如bert-base-uncased,需要加载分词器
            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer, trust_remote_code=True)
        
        self.target_sequences = None
        self.max_seq_length = 0
        if target_sequences:
            # 将目标序列转换为token ID列表
            self.target_sequences = [self.tokenizer.encode(seq, add_special_tokens=False) for seq in target_sequences]
            # 计算最长目标序列的长度,用于确定需要检查的上下文长度
            self.max_seq_length = max(len(seq) for seq in self.target_sequences)

        self.penalty_factor = penalty_factor
        self.use_multiplicative = use_multiplicative
        logger.success(f"'VllmPenaltySequenceLogitsProcessor' init success! Will penalty target sequences: '{target_sequences}', penalty factor: '{penalty_factor}', use_multiplicative: '{use_multiplicative}'.")

    def __call__(self, input_ids, scores):
        """
        处理输入序列和分数。
        
        参数:
        - input_ids: 形状为已生成序列的 [sequence_length] 的1D列表, 最开始没有生成新的token时是空[]
        - scores: 形状为 [vocab_size] 的浮点1D张量
        
        返回:
        - 调整后的scores,形状与输入的scores相同
        """
        if self.target_sequences is None:
            return scores

        # VLLM 要求输入input_ids为list, scores为1D torch.Tensor
        assert isinstance(input_ids, list) and isinstance(scores, torch.Tensor) and scores.ndim == 1, f"'input_ids' and 'scores' must be list and torch.Tensor with 1D shape, but got {type(input_ids)} and {type(scores)} with {scores.ndim}D shape."

        # 获取当前上下文,长度最大不超过 max_seq_length
        context = input_ids[-self.max_seq_length:] if input_ids else []
        context_length = len(context)
        
        # 遍历每个目标序列进行检查
        for target_sequence in self.target_sequences:
            seq_length = len(target_sequence)
            # 序列=1时, 直接惩罚下一个token
            if seq_length == 1:
                self._adjust_scores(scores, target_sequence)
            else:
                # 序列>=2时检查从完全匹配到部分匹配的可能性
                for i in range(1, min(seq_length, context_length) + 1):
                    # 如果上下文的尾部与目标序列的头部匹配
                    if context[-i:] == target_sequence[:i]:
                        # 如果匹配长度等于目标序列长度,说明找到完全匹配
                        if i == seq_length:
                            # 目前没有惩罚,直接跳过
                            pass
                        else:
                            # 如果是部分匹配,只降低下一个预期token的分数
                            next_token = target_sequence[i]
                            self._adjust_scores(scores, [next_token])
                        break  # 匹配成功后,跳出循环处理下一个序列
        return scores

    def _adjust_scores(self, scores, tokens):
        """
        调整指定token的分数。
        
        参数:
        - scores: 形状为 [vocab_size] 的一维张量
        - tokens: 需要调整的token ID列表
        """
        tokens = torch.tensor(tokens, device=scores.device) # scores.device 确保与scores相同的设备
        
        if self.use_multiplicative:
            scores[tokens] = scores[tokens] * self.penalty_factor
        else:
            scores[tokens] = scores[tokens] - self.penalty_factor
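A minimal usage sketch for the vLLM side (the tokenizer path and target strings are placeholders): the processor is passed to SamplingParams through logits_processors and then used with llm.generate as usual.
penalty_processor = VllmPenaltySequenceLogitsProcessor(
    tokenizer="/path/to/llama3-instruct",      # placeholder tokenizer path
    target_sequences=["No related terms"],     # example sequences to suppress
    penalty_factor=0.5,
    use_multiplicative=True,
)
sampling_params = SamplingParams(
    max_tokens=1150, temperature=0.0, top_p=0.9,
    logits_processors=[penalty_processor],     # plug the processor into vLLM
)
# outputs = llm.generate(prompt_token_ids=batch_ids, sampling_params=sampling_params)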
1.5.2 Huggingface generate
class HuggingFacePenaltySequenceLogitsProcessor(LogitsProcessor):
    def __init__(self, 
                 tokenizer: Union[PreTrainedTokenizer,str], 
                 target_sequences:List[str] = [], 
                 penalty_factor:Union[float, int]=0.5,  # 0.0 - 1.0 之间, 1.0 表示不惩罚, use_multiplicative 为False时, penalty_factor 输入整数
                 use_multiplicative:bool=True
                ):
        """
        Hugging Face 初始化抑制某些目标序列的处理器。

        参数:
        - tokenizer: 分词器对象,用于将目标序列转换为token ID
        - target_sequences: 需要抑制的目标序列列表
        - penalty_factor: 惩罚因子,1. 用于调整目标序列的生成概率float  2. 用于控制抑制的强度int
        - use_multiplicative: 是否使用乘法方式应用惩罚(True为乘法,False为减法)
        """
        self.tokenizer = tokenizer
        if isinstance(tokenizer, str): # 传入的是分词器名字,如bert-base-uncased,需要加载分词器
            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer, trust_remote_code=True)

        self.target_sequences = None
        self.max_seq_length = 0
        if target_sequences:
            # 将目标序列转换为token ID列表
            self.target_sequences = [self.tokenizer.encode(seq, add_special_tokens=False) for seq in target_sequences]
            # 计算最长目标序列的长度,用于确定需要检查的上下文长度
            self.max_seq_length = max(len(seq) for seq in self.target_sequences)

        self.penalty_factor = penalty_factor
        self.use_multiplicative = use_multiplicative
        logger.success(f"'HuggingFacePenaltySequenceLogitsProcessor' init success! Will penalty target sequences: '{target_sequences}', penalty factor: '{penalty_factor}', use_multiplicative: '{use_multiplicative}'.")

    def __call__(self, input_ids, scores):
        """
        处理输入序列和分数。
        
        参数:
        - input_ids: 形状为已生成序列的 [batch_size, sequence_length] 的2D tensor, 最开始没有生成新的token时该tensor是prompt的token,
                     一般来说, batch_size 为 1, batch_size 跟generate函数的input_ids一致。
        - scores: 形状为 [batch_size, vocab_size] 的浮点2D tensor
        
        返回:
        - 调整后的scores,形状与输入的scores相同
        """
        if self.target_sequences is None:
            return scores

        # Huggingface 生成的 input_ids 和 scores 都是 2D tensor
        assert input_ids.ndim == 2 and scores.ndim == 2, f"'input_ids' and 'scores' must be 2D tensors, but got {input_ids.ndim}D and {scores.ndim}D tensors."
        # 获取批次大小与序列长度
        B,T = input_ids.shape
        input_ids = input_ids.tolist() # 将tensor转换为list, 方便后续判断
        for b_idx in range(B):
            # 获取当前上下文,长度最大不超过 max_seq_length
            context = input_ids[b_idx][-self.max_seq_length:] if input_ids[b_idx] else []
            context_length = len(context)
            
            # 遍历每个目标序列进行检查
            for target_sequence in self.target_sequences:
                seq_length = len(target_sequence)
                # 序列=1时, 直接惩罚下一个token
                if seq_length == 1:
                    self._adjust_scores(scores, target_sequence, batch_idx = b_idx)
                else:
                    # 序列>=2时检查从完全匹配到部分匹配的可能性
                    for i in range(1, min(seq_length, context_length) + 1):
                        # 如果上下文的尾部与目标序列的头部匹配
                        if context[-i:] == target_sequence[:i]:
                            # 如果匹配长度等于目标序列长度,说明找到完全匹配
                            if i == seq_length:
                                # 目前没有惩罚,直接跳过
                                pass
                            else:
                                # 如果是部分匹配,只降低下一个预期token的分数
                                next_token = target_sequence[i]
                                self._adjust_scores(scores, [next_token], batch_idx = b_idx)
                            break  # 匹配成功后,跳出循环处理下一个序列
        return scores

    def _adjust_scores(self, scores, tokens, batch_idx):
        """
        调整指定token的分数。
        
        参数:
        - scores: 形状为 [batch_size, vocab_size] 的二维张量
        - tokens: 需要调整的token ID列表
        - batch_idx: 当前处理的batch索引
        """
        tokens = torch.tensor(tokens, device=scores.device) # scores.device 确保与scores相同的设备
        
        if self.use_multiplicative:
            scores[batch_idx, tokens] = scores[batch_idx, tokens] * self.penalty_factor
        else:
            scores[batch_idx, tokens] = scores[batch_idx, tokens] - self.penalty_factor
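And the corresponding Huggingface usage sketch (same placeholders): wrap the processor in a LogitsProcessorList and pass it to generate through the logits_processor argument.
from transformers.generation.logits_process import LogitsProcessorList

hf_penalty_processor = HuggingFacePenaltySequenceLogitsProcessor(
    tokenizer="/path/to/llama3-instruct",      # placeholder tokenizer path
    target_sequences=["No related terms"],
    penalty_factor=0.5,
    use_multiplicative=True,
)
output_id = model.generate(
    input_ids=text_input['input_ids'],
    attention_mask=text_input['attention_mask'],
    max_new_tokens=1150,
    do_sample=False,
    logits_processor=LogitsProcessorList([hf_penalty_processor]),
)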

2. Online deployment

Taking qwen2 as an example: see the official qwen2 vLLM online deployment guide.

2.1 Deployment basics

1. A few ways to test whether the remote server 70.49.214.80:41084 is reachable
1) telnet can be used to test whether a specific port accepts connections
telnet 70.49.214.80 41084
If the connection succeeds, you will see something like:
Trying 70.49.214.80...
Connected to 70.49.214.80.
Escape character is '^]'.
If the connection fails, you will see something like:
Trying 70.49.214.80...
telnet: Unable to connect to remote host: Connection refused

2) nc is a powerful network tool that can also be used to test connectivity.
nc -zv 70.49.214.80 41084
If the connection succeeds, you will see something like:
Connection to 70.49.214.80 41084 port [tcp/*] succeeded!
If the connection fails, you will see something like:
nc: connect to 70.49.214.80 port 41084 (tcp) failed: Connection refused

3) With curl: if the port serves HTTP/HTTPS, curl can be used to test it.
curl http://70.49.214.80:41084
If the service is available, it returns an HTTP response; otherwise an error message is returned.
----------------------------------------------------------------------------

2.2 OpenAI-compatible vLLM server deployment

2.2.1 OpenAI-compatible vLLM server argument reference

vLLM Engine Arguments: https://docs.vllm.ai/en/latest/models/engine_args.html#engine-args
usage:
python -m vllm.entrypoints.openai.api_server 
                                             [--host HOST]        #  设置为 0.0.0.0 以允许外部访问。
                                             [--port PORT]        #  端口地址, 如 4667
                                             [--chat-template]      # 用于生成对话模板的 Jinja2 模板, 未传入时使用tokenizer的默认模板。 例如: --chat-template ./examples/template_chatml.jinja
                                             [--api-key VLLM_API_KEY]             # 您可以传入参数 --api-key 或环境变量 VLLM_API_KEY 以使服务器在标头中检查 API 密钥。
                                             
                                             [--model MODEL]        # 要使用的 Hugging Face 模型的名称或路径
                                             [--tokenizer TOKENIZER]     # 要使用的 Hugging Face 分词器的名称或路径。如果未指定,将使用模型名称或路径。
                                             [--skip-tokenizer-init]
                                             [--revision REVISION]
                                             [--code-revision CODE_REVISION]
                                             [--tokenizer-revision TOKENIZER_REVISION]
                                             [--tokenizer-mode {auto,slow}]                 # Default: “auto”
                                             [--trust-remote-code]                          # 信任来自 huggingface 的远程代码。
                                             [--download-dir DOWNLOAD_DIR]                  #  下载和加载权重的目录,默认为 huggingface 的默认缓存目录。
                                             [--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,bitsandbytes}]          # Default: “auto”
                                             [--dtype {auto,half,float16,bfloat16,float,float32}]             # Default: “auto”
                                             [--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}]                  # Default: “auto”
                                             [--quantization-param-path QUANTIZATION_PARAM_PATH]
                                             [--max-model-len MAX_MODEL_LEN]                                  # 模型上下文长度。如果未指定,将自动从模型config中派生。派生时会自动考虑 rope_scaling 缩放值扩展原始模型的最大长度。
                                             [--guided-decoding-backend {outlines,lm-format-enforcer}]
                                             [--distributed-executor-backend {ray,mp}]                        #  pip install ray, 用于分布式服务的后端。当使用多于 1 个 GPU 时,如果安装了 ray 则会自动设置为"ray",否则为"mp"(多处理)。
                                             [--worker-use-ray]
                                             [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]                #  PP模型并行数量。
                                             [--tensor-parallel-size TENSOR_PARALLEL_SIZE]                    #  TP模型并行数量, 一般用这个
                                             [--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS]    #  分批顺序加载模型,以避免在使用张量并行和大型模型时出现内存溢出。
                                             [--ray-workers-use-nsight]
                                             [--block-size {8,16,32}]                                         # Token连续块的块大小。 默认值16
                                             [--enable-prefix-caching]                                        # 启用自动前缀缓存。
                                             [--disable-sliding-window]
                                             [--use-v2-block-manager]
                                             [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS]
                                             [--seed SEED]                                                     # 操作的随机种子。
                                             [--swap-space SWAP_SPACE]                                         # 每个 GPU 的 CPU 交换空间大小(GiB)。
                                             [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]                 # GPU 内存占用率。
                                             [--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE]
                                             [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS]                 # 每次迭代最大批处理 tokens 数量。
                                             [--max-num-seqs MAX_NUM_SEQS]                                     # 每次迭代的最大序列数。 默认值256
                                             [--max-logprobs MAX_LOGPROBS]                                     # 在 SamplingParams 中指定的最大 log probs 返回数量。默认为20
                                             [--disable-log-stats]                                             # 禁用日志统计。
                                             [--quantization {aqlm,awq,deepspeedfp,fp8,marlin,gptq_marlin_24,gptq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes,None}]    # 用于量化权重的方法。如果为 None,我们首先检查模型配置文件中的 quantization_config 属性。如果该属性为 None,我们假设模型权重未经过量化,并使用 dtype 确定权重的数据类型。
                                             [--rope-scaling ROPE_SCALING]                                     # RoPE 缩放配置,采用 JSON 格式。例如 {"type":"dynamic","factor":2.0}
                                             [--rope-theta ROPE_THETA]                                         # RoPE theta。与 rope_scaling 一起使用。在某些情况下,更改 RoPE theta 可以提高缩放模型的性能。
                                             [--enforce-eager]                                                 # 始终使用 eager-mode PyTorch。如果为 False,将使用 eager 模式和 CUDA 图形进行混合,以获得最大的性能和灵活性。
                                             [--max-context-len-to-capture MAX_CONTEXT_LEN_TO_CAPTURE]         # 该参数已经弃用, 使用下面的 max-seq-len-to-capture 参数替代
                                             [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE]                 # CUDA 图形覆盖的最大序列长度。当序列的上下文长度大于此值时,我们将退回到 eager 模式。
                                             [--disable-custom-all-reduce]
                                             [--tokenizer-pool-size TOKENIZER_POOL_SIZE]
                                             [--tokenizer-pool-type TOKENIZER_POOL_TYPE]
                                             [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG]
                                             [--enable-lora]                                                   # 如果为 True,则启用对 LoRA 适配器的处理。
                                             [--max-loras MAX_LORAS]                                           # 单个批次中最大的 LoRA 数量。
                                             [--max-lora-rank MAX_LORA_RANK]                                   # Max LoRA rank. 默认为16
                                             [--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE]                   # LoRA 适配器中可以存在的额外词汇表的最大大小(添加到基础模型词汇表中)。默认为256
                                             [--lora-dtype {auto,float16,bfloat16,float32}]                    # Default: “auto”
                                             [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS]           # 指定多个缩放因子(可以与基础模型缩放因子不同 - 请参见例如 Long LoRA),以允许同时使用多个使用这些缩放因子训练的 LoRA 适配器。如果未指定,只允许使用使用基础模型缩放因子训练的适配器。
                                             [--max-cpu-loras MAX_CPU_LORAS]                                   # 在 CPU 内存中存储的最大 LoRA 数量。必须大于或等于 max_num_seqs。默认为 max_num_seqs。
                                             [--fully-sharded-loras]                                           # 默认情况下,只有一半的 LoRA 计算是使用张量并行计算的。启用此功能将使用完全分片的层。对于高序列长度、最大秩或张量并行大小,这可能更快
                                             [--device {auto,cuda,neuron,cpu,tpu,xpu}]                         # Default: “auto”
                                             [--image-input-type {pixel_values,image_features}]
                                             [--image-token-id IMAGE_TOKEN_ID]
                                             [--image-input-shape IMAGE_INPUT_SHAPE]
                                             [--image-feature-size IMAGE_FEATURE_SIZE]
                                             [--image-processor IMAGE_PROCESSOR]
                                             [--image-processor-revision IMAGE_PROCESSOR_REVISION]
                                             [--disable-image-processor]
                                             [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR]
                                             [--enable-chunked-prefill]                                        # 如果设置,前置请求可以基于 max_num_batched_tokens 进行分块。
                                             [--speculative-model SPECULATIVE_MODEL]
                                             [--num-speculative-tokens NUM_SPECULATIVE_TOKENS]
                                             [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
                                             [--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN]
                                             [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE]
                                             [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
                                             [--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN]
                                             [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]
                                             [--preemption-mode PREEMPTION_MODE]
                                             [--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]]      # API 中使用的模型名称(s)。如果提供了多个名称,服务器将响应任何提供的名称。响应中 model 字段的模型名称将是该列表中的第一个名称。如果未指定,模型名称将与--model 参数相同。请注意,此名称(s)也将用于 prometheus 指标的 model_name 标签内容,如果提供了多个名称,指标标签将使用第一个。
                                             [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH]            # QLoRA 适配器的名称或路径。
                                             [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
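As a concrete example (model path, port and parallel size are placeholders), a launch command for the setup used in this post might look like:
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 41084 \
    --model /path/to/llama3-instruct \
    --dtype float16 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-seq-len-to-capture 9800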
2.2.2 Deploying with a Python script that replicates python -m vllm.entrypoints.openai.api_server, with room for extra custom features; code below

The API service is launched in the same way as vllm.entrypoints.openai.api_server:

python my_api.py --port 41084 --model Qwen/Qwen2-0.5B-Instruct
#!/usr/bin/env python3
import os,sys
current_file_path = os.path.realpath(__file__) # 获取当前运行文件的路径
parent_directory = os.path.dirname(current_file_path) # 获取当前运行文件的父目录
append_path = os.path.split(parent_directory)[0]
sys.path.append(append_path)
import subprocess  # 用于在子进程中执行命令
from typing import Any, Dict, List,Optional,Union
from loguru import logger
# 重定义终端logger显示颜色
logger.configure(handlers=[
    {
        "sink": sys.stderr,
        "format": "{time:YYYY-MM-DD HH:mm:ss.SSS} |<cyan><lvl>{level:8}</></>| {name} : {module}:{line:4} | <cyan>mymodule</> | - <lvl>{message}</>",
        "colorize": True
    },
])

# 定义vLLM默认参数配置, 包含了服务器地址、端口、模型、并行大小、GPU内存利用率等关键参数
# None 表示使用vLLM默认值or环境变量or命令行参数
# NOTE 这些默认值可以被环境变量或命令行参数覆盖
DEFAULT_ARGS: Dict[str, Any] = {
    "host": "0.0.0.0",   # 设置为 0.0.0.0 以允许外部访问。
    "port": "8000",      # 端口地址
    "model": "Qwen/Qwen2-0.5B-Instruct",  # 模型路径
    "served-model-name": None,  # 服务模型名称
    "api-key": None,     # 设定 API 密钥, 服务端设定了API密钥后, 客户端就必须使用相同的API密钥请求. 如果未设置,客户端可以请求任何API密钥, vLLM不会验证客户端的API密钥
    "tokenizer": None,   # 用于分词的tokenizer, 不给定则使用model的tokenizer
    "chat-template": None,         # 用于生成对话模板的 Jinja2 模板, 未传入时使用tokenizer的默认模板。 例如: --chat-template ./examples/template_chatml.jinja
    "tensor-parallel-size": "1",   #  TP模型并行数量, 一般用这个
    "pipeline-parallel-size": None, #  PP模型并行数量
    "max-parallel-loading-workers": None,  # 分批顺序加载模型,以避免在使用张量并行和大型模型时出现内存溢出。
    "gpu-memory-utilization": "0.9",  # GPU 内存占用率。
    "max-seq-len-to-capture": "12288",  # CUDA 图形覆盖的最大序列长度。当序列的上下文长度大于此值时,我们将退回到 eager 模式。
    "quantization": None,         
    "max-num-batched-tokens": None,  # 每次迭代最大批处理 tokens 数量。
    "max-num-seqs": None,            # 每次迭代的最大序列数。 默认值256
    "max-model-len": None,           # 模型最大长度, 超过此长度将被截断。如果未指定,将自动从模型config中派生。派生时会自动考虑 rope_scaling 缩放值扩展原始模型的最大长度。
    "trust-remote-code": True,       # 信任远程代码
    "dtype": "bfloat16",              # 模型加载的数据类型,支持 "float16" 和 "bfloat16"。
    "device": "auto",                # 模型加载的设备。
    "download-dir": None,
    "enable-lora": None,             # 是否开启LoRA, bool
    "lora_modules": None,            # lora 路径
}

def get_env_value(key: str, default: Any) -> Any:
    """
    技术说明:
    1. 使用os.getenv函数获取环境变量的值
    2. 将键名转换为大写并替换连字符为下划线,以匹配常见的环境变量命名规范
    3. 如果环境变量不存在,返回提供的默认值
    """
    return os.getenv(key.upper().replace("-", "_"), default)

def prepare_args() -> List[str]:
    """
    准备命令行参数,优先使用命令行输入的参数,其次是环境变量,最后是默认值。

    返回:
    List[str]: 准备好的命令行参数列表

    技术说明:
    1. 使用sys.argv[1:]获取命令行参数,跳过脚本名称
    2. 遍历DEFAULT_ARGS字典,检查每个参数是否已在命令行中指定
    3. 如果参数未在命令行中指定,则尝试从环境变量获取值
    4. 如果环境变量也不存在,则使用默认值
    5. 根据参数类型(布尔值或其他)构造适当的命令行参数格式
    """
    args = sys.argv[1:]  # 直接使用 sys.argv[1:] 来跳过脚本名称
    logger.info(f"The parameters passed into the `args`: {args}")

    used_keys = set()
    processed_args = []
    # 处理命令行中指定的参数
    i = 0
    while i < len(args):
        arg = args[i]
        if arg.startswith('--'):
            key = arg[2:].replace('_', '-')  # 移除 '--' 前缀并替换下划线为连字符
            used_keys.add(key)
            if i + 1 < len(args) and not args[i + 1].startswith('--'):
                # 这是一个键值对参数
                processed_args.extend([arg, args[i + 1]])
                i += 2
            else:
                # 这是一个标志参数, 解析标志参数
                processed_args.append(arg)
                i += 1
        else:
            # 处理非预期的参数格式
            logger.warning(f"Unexpected argument format: {arg}")
            i += 1

    # 添加环境变量中的参数(如果命令行中没有指定)
    for key, default_value in DEFAULT_ARGS.items():
        if key not in used_keys:
            value = get_env_value(key, default_value)
            if value is not None:
                if isinstance(value, bool):
                    # 对于布尔值,如果为True,添加标志参数
                    if value:
                        processed_args.append(f"--{key}")
                elif value != "":
                    # 对于非空的非布尔值,添加键值对参数
                    processed_args.extend([f"--{key}", str(value)])

    return processed_args

def format_args(args: List[str]) -> str:
    """
    技术说明:
    1. 遍历参数列表,每两个元素为一组(参数名和值)
    2. 将每组参数格式化为"参数名 参数值"的形式
    3. 使用换行符连接所有格式化后的参数,实现每行一个参数的效果
    """
    formatted_args = []
    i = 0
    while i < len(args):
        if args[i].startswith('--'):
            if i + 1 < len(args) and not args[i + 1].startswith('--'):
                # 这是一个键值对参数
                formatted_args.append(f"{args[i]} {args[i + 1]}")
                i += 2
            else:
                # 这是一个标志参数, 解析标志参数
                formatted_args.append(args[i])
                i += 1
        else:
            # 处理非预期的参数格式
            formatted_args.append(args[i])
            i += 1

    return "\n".join([str(i) for i in formatted_args])

"""
启动vLLM(Vector Language Model)服务器的Python脚本。主要技术和概念:
日志管理:使用loguru库进行高级日志配置,提供了自定义格式和彩色输出。
参数管理:通过字典存储默认参数,并结合环境变量和命令行参数实现灵活的参数配置。
环境变量处理:使用os.getenv函数获取环境变量,实现配置的灵活性。
命令行参数处理:使用sys.argv获取命令行参数,并进行解析和处理。
字符串格式化:使用f-strings和join方法进行高效的字符串格式化。
异常处理:使用try-except块捕获和处理可能出现的异常,提高程序的健壮性。
子进程管理:使用subprocess.run函数启动vLLM服务器作为子进程。
模块化设计:将不同功能分解为独立的函数,如prepare_args、format_args和run_server,提高代码的可读性和可维护性。
这个脚本的主要目的是提供一个灵活且用户友好的方式来启动vLLM服务器,允许用户通过命令行参数或环境变量来自定义服务器的配置。它特别适用于在不同环境中部署和管理vLLM服务,如开发、测试或生产环境。
1. 例如,您可以这样运行:
python your_script.py --model /path/to/your/model --tensor-parallel-size 2
2. 或者设置环境变量后运行:
export MODEL=/path/to/your/model
export TENSOR_PARALLEL_SIZE=2
python your_script.py
这样,这个脚本就可以完全替代直接在命令行运行 python -m vllm.entrypoints.openai.api_server 的功能,同时提供了更多的灵活性和可配置性。
"""
# 替代直接在命令行运行 python -m vllm.entrypoints.openai.api_server 的功能,同时提供了更多的灵活性和可配置性。
def run_server() -> None:
    """
    运行vLLM服务器的主函数

    技术说明:
    1. 使用prepare_args函数获取准备好的命令行参数
    2. 使用format_args函数格式化参数,并通过logger输出启动信息
    3. 构造完整的命令,包括Python解释器路径和vLLM入口点
    4. 使用subprocess.run执行vLLM服务器命令
    5. 捕获可能的异常并通过logger输出错误信息
    """
    try:
        # 准备参数
        vllm_args = prepare_args() # 解析后是列表形式

        # 打印格式化后的启动信息
        logger.success(f"Starting vLLM server with args: \n{format_args(vllm_args)}")
    
        # 使用subprocess运行vLLM服务器
        cmd = [sys.executable, "-m", "server_api.api_server"] + vllm_args
        subprocess.run(cmd, check=True)

    except subprocess.CalledProcessError as e:
        logger.error(f"Error starting vLLM server: {e}")
        sys.exit(1)
    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        sys.exit(1)

if __name__ == "__main__":
    run_server()
2.2.3 python -m vllm.entrypoints.openai.api_server does not support a custom sequence penalty out of the box, so we modify the vllm.entrypoints.openai.api_server source to support the inference-time sequence suppression penalty
# Based on: vllm.entrypoints.openai.api_server
import asyncio
import importlib
import inspect
import re
from contextlib import asynccontextmanager
from http import HTTPStatus
from typing import Optional, Set

import fastapi
import uvicorn
from fastapi import Request
from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse, Response, StreamingResponse
from prometheus_client import make_asgi_app
from starlette.routing import Mount

import vllm.envs as envs
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.openai.cli_args import make_arg_parser
from vllm.entrypoints.openai.protocol import (ChatCompletionRequest as _ChatCompletionRequest,
                                              ChatCompletionResponse,
                                              CompletionRequest as _CompletionRequest,
                                              EmbeddingRequest, ErrorResponse)
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
from vllm.entrypoints.openai.serving_embedding import OpenAIServingEmbedding
from vllm.logger import init_logger
from vllm.usage.usage_lib import UsageContext
from vllm.version import __version__ as VLLM_VERSION


# ======================================== 修改 vLLM 源码 start =================================================
from vllm.sampling_params import SamplingParams
import torch
from typing import List, ClassVar, Union
import os,sys
current_file_path = os.path.realpath(__file__) # 获取当前运行文件的路径
parent_directory = os.path.dirname(current_file_path) # 获取当前运行文件的父目录
append_path = os.path.split(parent_directory)[0]
sys.path.append(append_path)

# TODO++ 修改vllm源码, 类的继承, requests 接受新的参数, 实现序列抑制
class ChatCompletionRequest(_ChatCompletionRequest):
    # TODO++ 新增序列惩罚参数
    target_sequences_ids: Optional[List[List[Union[int, str]]]] = None
    sequences_penalty_factor: Optional[float] = None

    # 继承魔改父类的方法
    def to_sampling_params(self) -> SamplingParams:
        # We now allow logprobs being true without top_logrobs.

        logits_processors = None
        if self.logit_bias:

            def logit_bias_logits_processor(
                    token_ids: List[int],
                    logits: torch.Tensor) -> torch.Tensor:
                assert self.logit_bias is not None
                for token_id, bias in self.logit_bias.items():
                    # Clamp the bias between -100 and 100 per OpenAI API spec
                    bias = min(100, max(-100, bias))
                    logits[int(token_id)] += bias
                return logits

            logits_processors = [logit_bias_logits_processor]

        # TODO ++ 新增的序列惩罚processor
        if self.target_sequences_ids is not None and self.sequences_penalty_factor is not None:
            def sequences_penalty_logits_processor(
                                                    token_ids: List[int],
                                                    logits: torch.Tensor) -> torch.Tensor:
                
                assert isinstance(token_ids, list) and isinstance(logits, torch.Tensor) and logits.ndim == 1, f"'token_ids' and 'logits' must be list and torch.Tensor with 1D shape, but got {type(token_ids)} and {type(logits)} with {logits.ndim}D shape."
                max_seq_length = max([len(i) for i in self.target_sequences_ids])
                # 获取当前上下文,长度最大不超过 max_seq_length
                context = token_ids[-max_seq_length:] if token_ids else []
                context_length = len(context)
                
                # 遍历每个目标序列进行检查
                for target_sequence in self.target_sequences_ids:
                    seq_length = len(target_sequence)
                    
                    # 序列=1时, 直接惩罚下一个token
                    if seq_length == 1:
                        next_token = torch.tensor(target_sequence, device=logits.device) # logits.device 确保与logits相同的设备
                        logits[next_token] = logits[next_token] * self.sequences_penalty_factor
                    else:
                        # 序列>=2时检查从完全匹配到部分匹配的可能性
                        for i in range(1, min(seq_length, context_length) + 1):
                            # 如果上下文的尾部与目标序列的头部匹配
                            if context[-i:] == target_sequence[:i]:
                                # 如果匹配长度等于目标序列长度,说明找到完全匹配
                                if i == seq_length:
                                    # 目前没有惩罚,直接跳过
                                    pass
                                else:
                                    # 如果是部分匹配,只降低下一个预期token的分数
                                    next_token = target_sequence[i]
                                    next_token = torch.tensor([next_token], device=logits.device) # logits.device 确保与logits相同的设备
                                    logits[next_token] = logits[next_token] * self.sequences_penalty_factor
                                break  # 匹配成功后,跳出循环处理下一个序列
                return logits

            if logits_processors is None:
                logits_processors = [sequences_penalty_logits_processor]
            else:
                logits_processors.append(sequences_penalty_logits_processor)
    
        return SamplingParams(
            n=self.n,
            presence_penalty=self.presence_penalty,
            frequency_penalty=self.frequency_penalty,
            repetition_penalty=self.repetition_penalty,
            temperature=self.temperature,
            top_p=self.top_p,
            min_p=self.min_p,
            seed=self.seed,
            stop=self.stop,
            stop_token_ids=self.stop_token_ids,
            max_tokens=self.max_tokens,
            min_tokens=self.min_tokens,
            logprobs=self.top_logprobs if self.logprobs else None,
            prompt_logprobs=self.top_logprobs if self.echo else None,
            best_of=self.best_of,
            top_k=self.top_k,
            ignore_eos=self.ignore_eos,
            use_beam_search=self.use_beam_search,
            early_stopping=self.early_stopping,
            skip_special_tokens=self.skip_special_tokens,
            spaces_between_special_tokens=self.spaces_between_special_tokens,
            include_stop_str_in_output=self.include_stop_str_in_output,
            length_penalty=self.length_penalty,
            logits_processors=logits_processors,
        )

class CompletionRequest(_CompletionRequest):
    # TODO++ 新增序列惩罚参数
    target_sequences_ids: Optional[List[List[Union[int, str]]]] = None
    sequences_penalty_factor: Optional[float] = None

    # 继承魔改父类的方法
    def to_sampling_params(self):
        echo_without_generation = self.echo and self.max_tokens == 0

        logits_processors = None
        if self.logit_bias:

            def logit_bias_logits_processor(
                    token_ids: List[int],
                    logits: torch.Tensor) -> torch.Tensor:
                assert self.logit_bias is not None
                for token_id, bias in self.logit_bias.items():
                    # Clamp the bias between -100 and 100 per OpenAI API spec
                    bias = min(100, max(-100, bias))
                    logits[int(token_id)] += bias
                return logits

            logits_processors = [logit_bias_logits_processor]

        # TODO ++ 新增的序列惩罚processor
        if self.target_sequences_ids is not None and self.sequences_penalty_factor is not None:
            def sequences_penalty_logits_processor(
                                                    token_ids: List[int],
                                                    logits: torch.Tensor) -> torch.Tensor:
                
                assert isinstance(token_ids, list) and isinstance(logits, torch.Tensor) and logits.ndim == 1, f"'token_ids' and 'logits' must be list and torch.Tensor with 1D shape, but got {type(token_ids)} and {type(logits)} with {logits.ndim}D shape."
                max_seq_length = max([len(i) for i in self.target_sequences_ids])
                # 获取当前上下文,长度最大不超过 max_seq_length
                context = token_ids[-max_seq_length:] if token_ids else []
                context_length = len(context)
                
                # 遍历每个目标序列进行检查
                for target_sequence in self.target_sequences_ids:
                    seq_length = len(target_sequence)
                    
                    # 序列=1时, 直接惩罚下一个token
                    if seq_length == 1:
                        next_token = torch.tensor(target_sequence, device=logits.device) # logits.device 确保与logits相同的设备
                        logits[next_token] = logits[next_token] * self.sequences_penalty_factor
                    else:
                        # 序列>=2时检查从完全匹配到部分匹配的可能性
                        for i in range(1, min(seq_length, context_length) + 1):
                            # 如果上下文的尾部与目标序列的头部匹配
                            if context[-i:] == target_sequence[:i]:
                                # 如果匹配长度等于目标序列长度,说明找到完全匹配
                                if i == seq_length:
                                    # 目前没有惩罚,直接跳过
                                    pass
                                else:
                                    # 如果是部分匹配,只降低下一个预期token的分数
                                    next_token = target_sequence[i]
                                    next_token = torch.tensor([next_token], device=logits.device) # logits.device 确保与logits相同的设备
                                    logits[next_token] = logits[next_token] * self.sequences_penalty_factor
                                break  # 匹配成功后,跳出循环处理下一个序列
                return logits

            if logits_processors is None:
                logits_processors = [sequences_penalty_logits_processor]
            else:
                logits_processors.append(sequences_penalty_logits_processor)

        return SamplingParams(
            n=self.n,
            best_of=self.best_of,
            presence_penalty=self.presence_penalty,
            frequency_penalty=self.frequency_penalty,
            repetition_penalty=self.repetition_penalty,
            temperature=self.temperature,
            top_p=self.top_p,
            top_k=self.top_k,
            min_p=self.min_p,
            seed=self.seed,
            stop=self.stop,
            stop_token_ids=self.stop_token_ids,
            ignore_eos=self.ignore_eos,
            max_tokens=self.max_tokens if not echo_without_generation else 1,
            min_tokens=self.min_tokens,
            logprobs=self.logprobs,
            use_beam_search=self.use_beam_search,
            early_stopping=self.early_stopping,
            prompt_logprobs=self.logprobs if self.echo else None,
            skip_special_tokens=self.skip_special_tokens,
            spaces_between_special_tokens=(self.spaces_between_special_tokens),
            include_stop_str_in_output=self.include_stop_str_in_output,
            length_penalty=self.length_penalty,
            logits_processors=logits_processors,
            truncate_prompt_tokens=self.truncate_prompt_tokens,
        )
# ======================================== 修改 vLLM 源码 end =================================================


TIMEOUT_KEEP_ALIVE = 5  # seconds

openai_serving_chat: OpenAIServingChat
openai_serving_completion: OpenAIServingCompletion
openai_serving_embedding: OpenAIServingEmbedding

logger = init_logger('vllm.entrypoints.openai.api_server')

_running_tasks: Set[asyncio.Task] = set()


@asynccontextmanager
async def lifespan(app: fastapi.FastAPI):

    async def _force_log():
        while True:
            await asyncio.sleep(10)
            await engine.do_log_stats()

    if not engine_args.disable_log_stats:
        task = asyncio.create_task(_force_log())
        _running_tasks.add(task)
        task.add_done_callback(_running_tasks.remove)

    yield


app = fastapi.FastAPI(lifespan=lifespan)


def parse_args():
    parser = make_arg_parser()
    return parser.parse_args()


# Add prometheus asgi middleware to route /metrics requests
route = Mount("/metrics", make_asgi_app())
# Workaround for 307 Redirect for /metrics
route.path_regex = re.compile('^/metrics(?P<path>.*)$')
app.routes.append(route)


@app.exception_handler(RequestValidationError)
async def validation_exception_handler(_, exc):
    err = openai_serving_chat.create_error_response(message=str(exc))
    return JSONResponse(err.model_dump(), status_code=HTTPStatus.BAD_REQUEST)


@app.get("/health")
async def health() -> Response:
    """Health check."""
    await openai_serving_chat.engine.check_health()
    return Response(status_code=200)


@app.get("/v1/models")
async def show_available_models():
    models = await openai_serving_chat.show_available_models()
    return JSONResponse(content=models.model_dump())


@app.get("/version")
async def show_version():
    ver = {"version": VLLM_VERSION}
    return JSONResponse(content=ver)


@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest,
                                 raw_request: Request):
    generator = await openai_serving_chat.create_chat_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return StreamingResponse(content=generator,
                                 media_type="text/event-stream")
    else:
        assert isinstance(generator, ChatCompletionResponse)
        return JSONResponse(content=generator.model_dump())


@app.post("/v1/completions")
async def create_completion(request: CompletionRequest, raw_request: Request):
    generator = await openai_serving_completion.create_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return StreamingResponse(content=generator,
                                 media_type="text/event-stream")
    else:
        return JSONResponse(content=generator.model_dump())


@app.post("/v1/embeddings")
async def create_embedding(request: EmbeddingRequest, raw_request: Request):
    generator = await openai_serving_embedding.create_embedding(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    else:
        return JSONResponse(content=generator.model_dump())


if __name__ == "__main__":
    args = parse_args()

    app.add_middleware(
        CORSMiddleware,
        allow_origins=args.allowed_origins,
        allow_credentials=args.allow_credentials,
        allow_methods=args.allowed_methods,
        allow_headers=args.allowed_headers,
    )

    if token := envs.VLLM_API_KEY or args.api_key:

        @app.middleware("http")
        async def authentication(request: Request, call_next):
            root_path = "" if args.root_path is None else args.root_path
            if request.method == "OPTIONS":
                return await call_next(request)
            if not request.url.path.startswith(f"{root_path}/v1"):
                return await call_next(request)
            if request.headers.get("Authorization") != "Bearer " + token:
                return JSONResponse(content={"error": "Unauthorized"},
                                    status_code=401)
            return await call_next(request)

    for middleware in args.middleware:
        module_path, object_name = middleware.rsplit(".", 1)
        imported = getattr(importlib.import_module(module_path), object_name)
        if inspect.isclass(imported):
            app.add_middleware(imported)
        elif inspect.iscoroutinefunction(imported):
            app.middleware("http")(imported)
        else:
            raise ValueError(f"Invalid middleware {middleware}. "
                             f"Must be a function or a class.")

    logger.info("vLLM API server version %s", VLLM_VERSION)
    logger.info("args: %s", args)

    if args.served_model_name is not None:
        served_model_names = args.served_model_name
    else:
        served_model_names = [args.model]

    engine_args = AsyncEngineArgs.from_cli_args(args)

    # Enforce pixel values as image input type for vision language models
    # when serving with API server
    if engine_args.image_input_type is not None and \
        engine_args.image_input_type.upper() != "PIXEL_VALUES":
        raise ValueError(
            f"Invalid image_input_type: {engine_args.image_input_type}. "
            "Only --image-input-type 'pixel_values' is supported for serving "
            "vision language models with the vLLM API server.")

    engine = AsyncLLMEngine.from_engine_args(
        engine_args, usage_context=UsageContext.OPENAI_API_SERVER)

    event_loop: Optional[asyncio.AbstractEventLoop]
    try:
        event_loop = asyncio.get_running_loop()
    except RuntimeError:
        event_loop = None

    if event_loop is not None and event_loop.is_running():
        # If the current is instanced by Ray Serve,
        # there is already a running event loop
        model_config = event_loop.run_until_complete(engine.get_model_config())
    else:
        # When using single vLLM without engine_use_ray
        model_config = asyncio.run(engine.get_model_config())

    openai_serving_chat = OpenAIServingChat(engine, model_config,
                                            served_model_names,
                                            args.response_role,
                                            args.lora_modules,
                                            args.chat_template)
    openai_serving_completion = OpenAIServingCompletion(
        engine, model_config, served_model_names, args.lora_modules)
    openai_serving_embedding = OpenAIServingEmbedding(engine, model_config,
                                                      served_model_names)
    app.root_path = args.root_path
    uvicorn.run(app,
                host=args.host,
                port=args.port,
                log_level=args.uvicorn_log_level,
                timeout_keep_alive=TIMEOUT_KEEP_ALIVE,
                ssl_keyfile=args.ssl_keyfile,
                ssl_certfile=args.ssl_certfile,
                ssl_ca_certs=args.ssl_ca_certs,
                ssl_cert_reqs=args.ssl_cert_reqs)
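
As a quick sanity check of the penalty rule implemented above, the following standalone sketch re-implements the same prefix-matching logic outside the server. It is illustrative only; the vocabulary size, token ids, and penalty factor are made up. It confirms that when the generated context ends with a proper prefix of a target sequence, only the next expected token of that sequence is scaled down.

# Standalone sanity check of the sequence-penalty matching rule (illustrative only)
import torch
from typing import List

def apply_sequence_penalty(token_ids: List[int],
                           logits: torch.Tensor,
                           target_sequences_ids: List[List[int]],
                           penalty_factor: float) -> torch.Tensor:
    # Same rule as the logits processor above: if the context ends with a proper
    # prefix of a target sequence, scale the logit of the next expected token.
    max_seq_length = max(len(seq) for seq in target_sequences_ids)
    context = token_ids[-max_seq_length:] if token_ids else []
    for target_sequence in target_sequences_ids:
        seq_length = len(target_sequence)
        if seq_length == 1:
            # Single-token sequences are penalized unconditionally
            logits[target_sequence[0]] *= penalty_factor
            continue
        for i in range(1, min(seq_length, len(context)) + 1):
            if context[-i:] == target_sequence[:i]:
                if i < seq_length:
                    # Partial match: penalize the next expected token of the sequence
                    logits[target_sequence[i]] *= penalty_factor
                # Full match (i == seq_length): no penalty, as in the server code
                break
    return logits

# Toy example: 6-token vocabulary, target sequence [2, 3, 4], context ending in [2, 3]
logits = torch.ones(6)
out = apply_sequence_penalty(token_ids=[0, 2, 3], logits=logits,
                             target_sequences_ids=[[2, 3, 4]], penalty_factor=0.5)
print(out)  # only index 4 is scaled down: [1.0, 1.0, 1.0, 1.0, 0.5, 1.0]

Note that the penalty multiplies raw logits, so a factor below 1.0 lowers a score only when the logit is positive; a negative logit would actually be pushed up, which is worth keeping in mind when choosing the factor.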
2.2.4 Client requests
  • Method 1 (generic): plain HTTP requests with the requests library
  • Verified: as long as the API request parameters match the offline Huggingface generate / vLLM generate parameters, the model load dtype, and the eos_token_ids, the online and offline inference results align exactly
import os,sys
current_file_path = os.path.realpath(__file__) # 获取当前运行文件的路径
parent_directory = os.path.dirname(current_file_path) # 获取当前运行文件的父目录
append_path = os.path.split(parent_directory)[0]
sys.path.append(append_path)
from typing import Optional, Union, List, Dict
from loguru import logger
# 重定义终端logger显示颜色
logger.configure(handlers=[
    {
        "sink": sys.stderr,
        "format": "{time:YYYY-MM-DD HH:mm:ss.SSS} |<cyan><lvl>{level:8}</></>| {name} : {module}:{line:4} | <cyan>mymodule</> | - <lvl>{message}</>",
        "colorize": True
    },
])

import requests

class GptApi():
    def __init__(self):
        self.api_key = "sk-XXXXXXXXXXXXXXXXXXXX"
        self.headers = {"Authorization": 'Bearer ' + self.api_key,}   
        self.url = "https://api.openai.com/v1/chat/completions"
        self.SYSTEM_PROMPT = """You are a helpful assistant""".strip()
 

    def get_response(self, 
                     question:str,
                     model:str="llama3-8b",
                     temperature:float=0.35,       # pass temperature=0.0 for greedy decoding
                     top_p:float = 0.9,
                     top_k:int = 50,
                     repetition_penalty:float = 1.0,
                     max_new_tokens:int = 1150,
                     eos_token_id:List[int] = None,
                     verbose:bool = True,   # 是否打印调试信息
                     url:str=None,
                     system_prompt:str=None,
                     api_key:str=None,
                     **kwargs
                    ):
        assert isinstance(question, str), "question必须是str类型"
        # 更新self值
        if url is not None:
            self.url = url
        if system_prompt is not None:
            self.SYSTEM_PROMPT = system_prompt
        if api_key is not None:
            self.api_key = api_key
            self.headers = {"Authorization": 'Bearer ' + self.api_key,} 

        # post 参数
        params = {
            "messages": [
                    {"role": "system", "content": self.SYSTEM_PROMPT},
                    {"role": "user", "content": question}
                    ],
            "model": model,      # 如果需要切换模型,则在这里修改
            "temperature": temperature,    # temperature 为0, 表示贪婪采样
            "top_p": top_p,
            "top_k": top_k,
            "repetition_penalty": repetition_penalty,
            "max_tokens": max_new_tokens,
            "stop_token_ids": eos_token_id,
            # "response_format":{ "type": "json_object" }  # 输出强制为 json格式
        }

        # 加入用户自定义参数
        for k,v in kwargs.items():
            if v is not None:
                params[k] = v

        params_json = {}
        # 去除 v == None 的参数, 使用模型的默认参数
        for k,v in params.items():
            if v is not None:
                params_json[k] = v

        if verbose:
            logger.info(f"request params: \n{params_json}")

        response = requests.post(
            self.url,  # 三方站 url
            headers=self.headers,
            json=params_json,
            stream=False
        )
        return response.json()


api = GptApi()

model = "llama3-8b"
if "llama3" in model:
    model_name_or_path = "meta-llama/Meta-Llama-3-8B-Instruct"
    eos_token_id = [128001, 128009]


else:
    raise ValueError(f"model type {model} not supported")


# Build the sequence-penalty parameters
from component.logitsprocessor import VllmPenaltySequenceLogitsProcessor
vllm_penalty_kwargs = {
    "tokenizer": model_name_or_path,
    'target_sequences': ["<No related terms>"],   # list of sequences to penalize
    'penalty_factor': 0.95,         # penalty factor: 1.0 means no penalty, None disables it
}
processor = VllmPenaltySequenceLogitsProcessor(**vllm_penalty_kwargs)


# Request (with sequence-penalty parameters)
a = api.get_response(question="介绍一下杭州!",
                        model=model,
                        temperature=0.0,
                        eos_token_id = eos_token_id,
                        # sequence-penalty parameters
                        target_sequences_ids = processor.target_sequences,
                        sequences_penalty_factor = processor.penalty_factor,
                        verbose=True,
    )
print(a)
  • Method 2: the openai client
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "test"
openai_api_base = "http://<IP>:<PORT>/v1"     # base URL only; do not append /chat/completions, the client adds it automatically

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2-0.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ]
)
print("Chat response:\n", chat_response)
print("Chat response content:\n", chat_response.choices[0].message.content)

  • Method 3: curl

(to be updated ...)
