[LLM] Deploying a LoRA Fine-Tuned LLM (Using Baichuan as an Example)

Article link: https://blog.csdn.net/mengmengz07/article/details/137558500

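This continues from the previous post, where we fine-tuned an LLM with LoRA.

The counterpart to fine-tuning is serving: how do we actually use the model? Loading it directly with transformers works, but the inference path is not optimized for serving, so in practice it handles one request at a time. To let many users share one model, you need a dedicated inference acceleration framework such as vllm or tgi.

See the official documentation for how vllm and tgi load a model. Once the framework has loaded the LLM, two big pitfalls remain: the chat template, and loading the fine-tuned model.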

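First, the chat template. An LLM is at heart a completion model: you give it some text and it keeps writing from there. That is quite different from a conversation. To get conversational behavior, the input has to be wrapped in a special prompt that tells the model to continue in a dialogue setting. Each model has its own prompt format, and the actual question must be wrapped inside it. When you load an open-source LLM with transformers, the model usually exposes a chat interface that does this wrapping behind the scenes. With another serving framework you have to handle this step yourself: either send the question already in prompt format, or use the framework's template hook to do the conversion automatically on the server side from a custom template (usually a jinja file). tgi's support for chat templates is poor, whereas vllm lets you pass --chat-template explicitly at startup. For that reason I recommend vllm.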
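To make the wrapping step concrete, here is a minimal sketch of what a chat template does. It assumes the ChatML-style markers (<|im_start|>, <|im_end|>) used by the Qwen template at the end of this article. The function name build_chatml_prompt and the default system text are illustrative assumptions, not any framework's API, and other models (Baichuan included) use different markers.

```python
def build_chatml_prompt(messages, add_generation_prompt=True,
                        default_system="You are a helpful assistant"):
    """Wrap chat messages into one ChatML-style prompt string (illustrative only)."""
    parts = []
    # Insert a default system turn when the caller did not supply one.
    if not messages or messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{default_system}<|im_end|>\n")
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Open an assistant turn so the model continues the text as the assistant's reply.
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

The string that comes out is what the completion model actually sees. Its "answer" is simply the continuation after the last assistant marker.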

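Next, loading the fine-tuned model. In my experience vllm supports this from version 0.4.0 onward but not before (earlier versions could not load the model even after the LoRA weights had been merged back into the base model). tgi can load the merged model directly. Because of the template problem, though, tgi is still unsuitable for LLMs that do not publish their prompt format in plain text.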

Appendix: code for merging the fine-tuned parameters back into the base model

import argparse

import torch
from peft import PeftModel, PeftConfig
from transformers import (
    AutoModel,
    AutoTokenizer,
    BloomForCausalLM,
    BloomTokenizerFast,
    AutoModelForCausalLM,
    LlamaTokenizer,
    LlamaForCausalLM,
    AutoModelForSequenceClassification,
)

MODEL_CLASSES = {
    "bloom": (BloomForCausalLM, BloomTokenizerFast),
    "chatglm": (AutoModel, AutoTokenizer),
    "llama": (LlamaForCausalLM, LlamaTokenizer),
    "baichuan": (AutoModelForCausalLM, AutoTokenizer),
    "auto": (AutoModelForCausalLM, AutoTokenizer),
}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_type', default="baichuan", type=str, required=False)
    parser.add_argument('--tokenizer_path', default=None, type=str,
                        help="Please specify tokenization path.")

    parser.add_argument('--output_dir', default='./merged', type=str)
    args = parser.parse_args()


    base_model_path = "../../Baichuan_Inc/Baichuan_Inc/Baichuan"
    lora_model_path = "baichuan_fineturn/risk_combine_output"
    output_dir = args.output_dir
    peft_config = PeftConfig.from_pretrained(lora_model_path)
    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]

    # Load the base model
    if peft_config.task_type == "SEQ_CLS":
        if args.model_type == "chatglm":
            raise ValueError("chatglm does not support sequence classification")
        base_model = AutoModelForSequenceClassification.from_pretrained(
            base_model_path,
            num_labels=1,
            load_in_8bit=False,
            torch_dtype=torch.float32,
            trust_remote_code=True,
            device_map="auto",
        )
    else:
        base_model = model_class.from_pretrained(
            base_model_path,
            load_in_8bit=False,
            torch_dtype=torch.float16,
            trust_remote_code=True,
            device_map="auto",
        )
    
    # Load the tokenizer
    if args.tokenizer_path:
        tokenizer = tokenizer_class.from_pretrained(args.tokenizer_path, trust_remote_code=True)
    else:
        tokenizer = tokenizer_class.from_pretrained(base_model_path, trust_remote_code=True)

    # Resize the vocabulary (disabled)
    # if args.resize_emb:
    #     base_model_token_size = base_model.get_input_embeddings().weight.size(0)
    #     if base_model_token_size != len(tokenizer):
    #         base_model.resize_token_embeddings(len(tokenizer))

    # Wrap the base model with the LoRA adapter, then merge the adapter weights into it
    new_model = PeftModel.from_pretrained(
        base_model,
        lora_model_path,
        device_map="auto",
        torch_dtype=torch.float16,
    )
    new_model.eval()
    new_base_model = new_model.merge_and_unload()

    tokenizer.save_pretrained(output_dir)
    new_base_model.save_pretrained(output_dir, safe_serialization=False, max_shard_size='10GB')

if __name__ == '__main__':
    main()
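
For intuition about what merge_and_unload() does to each adapted layer: LoRA stores a low-rank update, and merging folds it into the base weight as W' = W + (alpha / r) * B A. The toy example below checks this on 2x2 matrices in plain Python. The numbers and helper names are made up for illustration, and this is not PEFT's actual implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight of one layer."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy layer: 2x2 base weight with a rank-1 adapter (A is 1x2, B is 2x1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [1.0]]
alpha, rank = 2, 1
merged = merge_lora(W, A, B, alpha, rank)

# The merged layer must give the same output as base output + scaled adapter output.
x = [[1.0], [1.0]]
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
expected = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(base_out, adapter_out)]
assert matmul(merged, x) == expected
print(merged)
```

After merging, the adapter is no longer needed at inference time, which is why the saved output directory can be served like an ordinary checkpoint.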

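Demo: deploying vllm on OpenShift as a Deployment

(This demo deploys a quantized Tongyi Qianwen (Qwen) model; a Baichuan model with merged parameters can be deployed the same way. Pay special attention to the gpu-memory-utilization parameter: vllm tries to claim that fraction of GPU memory at startup, and it crashes with an error if that much cannot be allocated. If that much is allocated but still does not cover what the model needs, it also fails. So make sure the fraction fits within the GPU memory that is currently free, and that it is large enough to load the model.)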
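
The two failure modes above can be written as a quick back-of-the-envelope check. This is a simplified sketch, not vllm's actual memory accounting: the function name, the GiB figures, and the assumption that the budget equals the utilization fraction times total GPU memory are all illustrative.

```python
def check_gpu_budget(total_gib, free_gib, utilization, model_gib):
    """Classify a --gpu-memory-utilization setting (simplified, illustrative)."""
    budget_gib = total_gib * utilization  # memory the server will try to claim
    if budget_gib > free_gib:
        return "budget exceeds free GPU memory"
    if budget_gib <= model_gib:
        # Nothing would be left for the KV cache after loading the weights.
        return "budget too small for the model"
    return "ok"


# Hypothetical 80 GiB card with 30 GiB free and a 10 GiB quantized model.
print(check_gpu_budget(total_gib=80, free_gib=30, utilization=0.3, model_gib=10))
print(check_gpu_budget(total_gib=80, free_gib=30, utilization=0.5, model_gib=10))
print(check_gpu_budget(total_gib=80, free_gib=30, utilization=0.1, model_gib=10))
```

On a shared GPU it is worth checking free memory with nvidia-smi before picking the fraction.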

kind: Deployment
apiVersion: apps/v1
metadata:
  name: base-llm-server-vllm
  namespace: YOUR_NAMESPACE
spec:
  replicas: 1
  selector:
    matchLabels:
      app: base-llm-server-vllm
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: base-llm-server-vllm
    spec:
      nodeSelector:
        kubernetes.io/hostname: YOUR_COMPUTE_NODE
        kubernetes.io/role: gpu
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      securityContext:
        runAsUser: 0
      containers:
        - resources:
            limits:
              cpu: '6'
              memory: 64Gi
              nvidia.com/gpu: '1'
            requests:
              cpu: 200m
              memory: 16Gi
              nvidia.com/gpu: '1'
          terminationMessagePath: /dev/termination-log
          name: vllm-latest-container
          command:
            - python3
            - '-m'
            - vllm.entrypoints.openai.api_server
            - '--model=/data/Qwen-AWQ'
            - '--gpu-memory-utilization=0.3'
            - '--trust-remote-code'
            - '--disable-log-requests'
            - '--chat-template=/data/Qwen-AWQ/qwen1_5_14B_template.jinja'
            - '--dtype=float16'
            - '--quantization=awq'
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              value: <secret>
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: CUDA_VISIBLE_DEVICES
              value: '1'
          ports:
            - containerPort: 8000
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: data-volume
              mountPath: /data
              subPath: models
          terminationMessagePolicy: File
          image: VLLM-OPENAI-IMAGE:v0.4.0
      volumes:
        - name: data-volume
          persistentVolumeClaim:
            claimName: text-gen-pvc-local
      dnsPolicy: ClusterFirst
      tolerations:
        - key: nvidia.com/gpu
          effect: NoSchedule
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600

Custom chat template (jinja)

{% for message in messages %}
{% if loop.first and messages[0]['role'] != 'system' %}
{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}
{% endif %}
{{'<|im_start|>' + message['role'] + '\n' + message['content']}}
{% if (loop.last and add_generation_prompt) or not loop.last %}
{{ '<|im_end|>' + '\n'}}
{% endif %}
{% endfor %}
{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}
{{ '<|im_start|>assistant\n' }}
{% endif %}

Calling the model over REST

POST http://base-llm-server-vllm.com/v1/chat/completions 

Request body ("stream": true enables streaming responses; set it to false if you do not need them):

{
    "model": "/data/Qwen-AWQ",
    "stream": true,
    "messages": [
        {"role": "user", "content": "Who are you?"}
    ]
}
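
Here is a minimal Python sketch of the same call, split into two parts so it can be read without a live server: building the request body, and decoding a streamed reply. It assumes the OpenAI-style streaming format, where each chunk arrives as a line starting with "data: " and the stream ends with "data: [DONE]". The helper names and the sample lines are made up for illustration.

```python
import json


def build_chat_request(model, question, stream=True):
    """Build the JSON-serializable body for /v1/chat/completions."""
    return {
        "model": model,
        "stream": stream,
        "messages": [{"role": "user", "content": question}],
    }


def collect_stream_text(lines):
    """Join the text deltas found in server-sent-event lines."""
    pieces = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        pieces.append(delta.get("content") or "")
    return "".join(pieces)


body = json.dumps(build_chat_request("/data/Qwen-AWQ", "Who are you?"))

# Hand-written sample of what a streamed reply could look like.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "I am "}}]}',
    'data: {"choices": [{"delta": {"content": "Qwen."}}]}',
    "data: [DONE]",
]
reply = collect_stream_text(sample_lines)
print(reply)
```

To actually send it, POST body to the /v1/chat/completions endpoint with the header Content-Type: application/json, then feed the response lines to collect_stream_text.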