I've recently been working on a RAG project. After trying a number of models, I found that chatglm3-6b-32k clearly outperforms the alternatives on Chinese text. Having validated it with the transformers library in our test environment, the next step was production deployment, and that is where NVIDIA's Triton Inference Server comes in.
Our production server has 8 Tesla T4 GPUs. With the non-quantized model, each 16 GB card can host one instance (a single instance uses roughly 12 GB of VRAM); with the 4-bit quantized version, each card can host at least two instances.
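As a quick sanity check before picking an instance count, you can query each card's free memory from Python. This is a minimal sketch; the 12 GB figure is the measured footprint quoted above, not something the API reports:

import torch

INSTANCE_VRAM_GB = 12  # measured footprint of one non-quantized instance (see above)

for i in range(torch.cuda.device_count()):
    free_b, total_b = torch.cuda.mem_get_info(i)  # returns (free, total) in bytes
    free_gb = free_b / 1024 ** 3
    print(f"GPU {i}: {free_gb:.1f} GiB free of {total_b / 1024 ** 3:.1f} GiB, "
          f"fits {int(free_gb // INSTANCE_VRAM_GB)} instance(s)")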
1. Pull the Triton image:
docker pull nvcr.io/nvidia/tritonserver:23.12-py3
2. Create the container. There are two options: start tritonserver directly, or start the container detached and then launch tritonserver from inside it. Note that with --net=host the -p port mappings are ignored; Triton's ports (8000/8001/8002) are exposed directly on the host network.
Direct start:
docker run -it --name chatglmtest --gpus all --shm-size=1g --ulimit memlock=-1 -p 8000:8000 -p 8001:8001 -p 8002:8002 --net=host -v /home/server/model_repository:/models --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:23.12-py3 tritonserver --model-repository=/models
Detached mode:
docker run -itd --name chatglmtest --gpus all --shm-size=1g --ulimit memlock=-1 -p 8000:8000 -p 8001:8001 -p 8002:8002 --net=host -v /home/server/model_repository:/models --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:23.12-py3
3. Enter the container and pip-install the model's dependencies. The torch CUDA build must match the host's CUDA version:
docker exec -it chatglmtest bash
# the CUDA build (cu121 here) must match the host's CUDA version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install sentence_transformers transformers tiktoken accelerate packaging ninja transformers_stream_generator einops optimum bitsandbytes
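With the dependencies in place, a quick check from the container's python confirms that torch was built for the right CUDA and can see the cards (a sketch; on the server described above this should report 8 devices):

import torch

print(torch.__version__, torch.version.cuda)  # torch build and its CUDA version
print(torch.cuda.is_available())              # True if driver and runtime match
print(torch.cuda.device_count())              # 8 on the server described above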
4. Configure the model. Models live in the directory that was mapped into the container at creation time, /home/server/model_repository.
The layout of /home/server/model_repository is sketched below. I placed only one model there; ignore the __pycache__ and work directories, which are generated automatically once Triton runs.
The directory named 1 is the model version; it holds the model files downloaded from Hugging Face plus model.py (the serving script).
At the same level as directory 1 there must be a config.pbtxt that declares the input/output contract and the GPU assignment of each instance.
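Putting that together, the tree looks like this (work and __pycache__ appear only after the first run):

/home/server/model_repository/
└── chatglm3-6b-32k/
    ├── config.pbtxt
    └── 1/
        ├── model.py
        ├── chatglm3-6b-32k/    <- weights and tokenizer downloaded from Hugging Face
        ├── work/               <- generated at runtime
        └── __pycache__/        <- generated at runtime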
Now let's write config.pbtxt and model.py.
config.pbtxt
name: "chatglm3-6b-32k" // 模型名,与模型的文件夹名字相同
backend: "python" // 模型所使用的后端引擎
max_batch_size: 0
input [ // 输入定义
{
name: "prompt" //名称
data_type: TYPE_STRING //类型
dims: [ -1 ] //数据维度,-1 表示可变维度
},
{
name: "history"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "temperature"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "max_token"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "history_len"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
output [ //输出定义
{
name: "response"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "history"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
//实例配置,我使用了3个显卡,每个显卡配置了一个实例
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
},
{
count: 1
kind: KIND_GPU
gpus: [ 1 ]
},
{
count: 1
kind: KIND_GPU
gpus: [ 2 ]
}
]
model.py
import os
# Cap the maximum split size for free CUDA memory blocks (reduces fragmentation)
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32'
# Point the Hugging Face cache directories at a local work directory
os.environ['TRANSFORMERS_CACHE'] = os.path.dirname(os.path.abspath(__file__))+"/work/"
os.environ['HF_MODULES_CACHE'] = os.path.dirname(os.path.abspath(__file__))+"/work/"
import json
# triton_python_backend_utils is available in every Triton Python model. You
# need to use this module to create inference requests and responses. It also
# contains some utility functions for extracting information from model_config
# and converting Triton input/output types to numpy types.
import triton_python_backend_utils as pb_utils
import sys
import gc
import time
import logging
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
gc.collect()
torch.cuda.empty_cache()
logging.basicConfig(format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s',
                    level=logging.INFO)


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = json.loads(args['model_config'])
        output_response_config = pb_utils.get_output_config_by_name(self.model_config, "response")
        output_history_config = pb_utils.get_output_config_by_name(self.model_config, "history")
        # Convert Triton types to numpy types
        self.output_response_dtype = pb_utils.triton_string_to_numpy(output_response_config['data_type'])
        self.output_history_dtype = pb_utils.triton_string_to_numpy(output_history_config['data_type'])
        ChatGLM_path = os.path.dirname(os.path.abspath(__file__)) + "/chatglm3-6b-32k"
        self.tokenizer = AutoTokenizer.from_pretrained(ChatGLM_path, trust_remote_code=True)
        # The .to('cuda:' + args['model_instance_device_id']) below is essential: it pins this
        # instance to the GPU it was assigned in instance_group. Without it the model may be
        # spread across all GPUs or piled onto a single one, and either causes problems.
        model = AutoModelForCausalLM.from_pretrained(ChatGLM_path,
                                                     torch_dtype=torch.float16,
                                                     trust_remote_code=True).half().to('cuda:' + args['model_instance_device_id'])
        self.model = model.eval()
        logging.info("model init success")

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output_response_dtype = self.output_response_dtype
        output_history_dtype = self.output_history_dtype

        responses = []
        # Every Python backend must iterate over each request
        # and create a pb_utils.InferenceResponse for it.
        for request in requests:
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt").as_numpy()[0]
            prompt = prompt.decode('utf-8')
            history_origin = pb_utils.get_input_tensor_by_name(request, "history").as_numpy()
            if len(history_origin) > 0:
                history = np.array([item.decode('utf-8') for item in history_origin]).reshape((-1, 2)).tolist()
            else:
                history = []
            temperature = pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy()[0]
            temperature = float(temperature.decode('utf-8'))
            max_token = pb_utils.get_input_tensor_by_name(request, "max_token").as_numpy()[0]
            max_token = int(max_token.decode('utf-8'))
            history_len = pb_utils.get_input_tensor_by_name(request, "history_len").as_numpy()[0]
            history_len = int(history_len.decode('utf-8'))
            # Log the incoming request
            in_log_info = {
                "in_prompt": prompt,
                "in_history": history,
                "in_temperature": temperature,
                "in_max_token": max_token,
                "in_history_len": history_len
            }
            logging.info(in_log_info)
            response, history = self.model.chat(self.tokenizer,
                                                prompt,
                                                history=history[-history_len:] if history_len > 0 else [],
                                                max_length=max_token,
                                                temperature=temperature)
            # Log the generated result
            out_log_info = {
                "out_response": response,
                "out_history": history
            }
            logging.info(out_log_info)
            response = np.array(response)
            history = np.array(history)
            response_output_tensor = pb_utils.Tensor("response", response.astype(output_response_dtype))
            history_output_tensor = pb_utils.Tensor("history", history.astype(output_history_dtype))
            final_inference_response = pb_utils.InferenceResponse(
                output_tensors=[response_output_tensor, history_output_tensor])
            responses.append(final_inference_response)
            # Create InferenceResponse. You can set an error here in case
            # there was a problem with handling this inference request.
            # Below is an example of how you can set errors in inference
            # response:
            #
            # pb_utils.InferenceResponse(
            #     output_tensors=..., TritonError("An error occurred"))

        # You should return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
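If you want the 4-bit variant mentioned at the start (at least two instances per T4), the only change is the load in initialize(). A hedged sketch, relying on the quantize() method that ChatGLM's trust_remote_code model class has historically exposed; verify it exists in the modeling code shipped with your downloaded checkpoint:

# 4-bit load; quantize(4) comes from ChatGLM's remote code, not from transformers itself
model = AutoModelForCausalLM.from_pretrained(ChatGLM_path,
                                             trust_remote_code=True).quantize(4).half().to(
                                                 'cuda:' + args['model_instance_device_id'])
self.model = model.eval()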
5. Start Triton Server
# detached mode (container created with -itd): exec into the container and run
tritonserver --model-repository=/models
# non-detached mode (container created with -it): on the host, run
docker start chatglmtest
6. Verify
curl -X POST localhost:8000/v2/models/chatglm3-6b-32k/generate \
-d '{"prompt": "你好,请问你叫什么?", "history":[], "temperature":"0.3","max_token":"100","history_len":"0"}'
Response:
{"history":["{'role': 'user', 'content': '你好,请问你叫什么?'}","{'role': 'assistant', 'metadata': '', 'content': '你好!我是一个名为 ChatGLM3-6B 的人工智能助手,是基于清华大学 KEG 实验室和智谱 AI 公司于 2023 年共同训练的语言模型开发的。我的任务是针对用户的问题和要求提供适当的答复和支持。'}"],"model_name":"chatglm3-6b-32k","model_version":"1","response":"你好!我是一个名为 ChatGLM3-6B 的人工智能助手,是基于清华大学 KEG 实验室和智谱 AI 公司于 2023 年共同训练的语言模型开发的。我的任务是针对用户的问题和要求提供适当的答复和支持。"}