server
- The Triton backend for Python. The goal of Python backend is to let you serve models written in Python by Triton Inference Server without having to write any C++ code.
Model Config File
- Every Python Triton model must provide a config.pbtxt file describing the model configuration. The backend field must be set to python, and the platform field must not be set. The file structure is as follows:
- Single-module model
models
└── add_sub
├── 1
│ └── model.py
└── config.pbtxt
- Multi-module model
├─decoder
│ │ config.pbtxt
│ │
│ └─1
│ decoder.plan
│
├─encoder
│ │ config.pbtxt
│ │
│ └─1
│ encoder.plan
│
└─main_name
│ config.pbtxt
│
└─1
model.py
other files used by model.py
Configuring config.pbtxt
You need to write a config.pbtxt for each submodule as well as one for the model.py module. In model.py's config.pbtxt the size (dims) can be set to -1; for naming, read the section below.
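As a minimal sketch, the config.pbtxt for the add_sub model above might look like this (tensor names, dtypes, and shapes are illustrative):
name: "add_sub"
backend: "python"
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  },
  {
    name: "INPUT1"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]  # dims: [ -1 ] would allow a variable-length dimension
  }
]
instance_group [{ kind: KIND_CPU }]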
model.py
Keyword | Explanation
---|---
class TritonPythonModel: | Your Python model must use this exact class name: every Python model you create must have "TritonPythonModel" as its class name.
def initialize(self, args): | initialize is called only once, when the model is being loaded. Implementing it is optional; it lets the model set up any state associated with it.
def execute(self, requests): | execute is called whenever an inference request is made for this model. It receives a list of pb_utils.InferenceRequest objects as its only argument. Depending on the batching configuration used (e.g. dynamic batching), requests may contain multiple requests. The model must create one pb_utils.InferenceResponse for every pb_utils.InferenceRequest in requests; if an error occurs, set the error argument when creating the pb_utils.InferenceResponse.
initialize
- In this function you are given a variable args, a Python dictionary whose keys and values are both strings. The keys of this dictionary and their descriptions:
key | Description | Example
---|---|---
model_config | JSON string containing the model configuration |
model_instance_kind | String containing the model instance kind | self.device = "cpu" if args["model_instance_kind"] == "CPU" else "cuda"
model_instance_device_id | String containing the model instance device ID |
model_repository | Model repository path |
model_version | Model version |
model_name | Model name |
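As a sketch, an initialize that uses these keys (the output name OUTPUT0 is illustrative):
import json
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        # every value in args is a string; the model config arrives as a JSON string
        self.model_config = json.loads(args["model_config"])
        # choose a device from the instance kind, as in the table above
        self.device = "cpu" if args["model_instance_kind"] == "CPU" else "cuda"
        # look up the numpy dtype configured for an output named OUTPUT0
        output_config = pb_utils.get_output_config_by_name(self.model_config, "OUTPUT0")
        self.output0_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])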
execute
This is the most generic way to implement your model: the execute function must return exactly one response per request. In this mode, execute must return a list of response objects with the same length as requests. The workflow in this mode is:
- The execute function receives a batch of pb_utils.InferenceRequest objects as a length-N array.
- Perform inference on the pb_utils.InferenceRequest objects and append the corresponding pb_utils.InferenceResponse objects to a response list.
- Return the response list.
The length of the returned response list must be N. Each element in the list should be the response for the corresponding element in the requests array. Each element must contain a response (a response can be either output tensors or an error); an element cannot be None.
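A sketch of execute in this default mode, continuing the class sketch above (tensor names are illustrative; self.output0_dtype comes from the initialize sketch):
    def execute(self, requests):
        responses = []
        for request in requests:
            # read the input, compute, and wrap the result in a response
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            out = pb_utils.Tensor("OUTPUT0", in0.astype(self.output0_dtype))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        # must hold: len(responses) == len(requests) == N
        return responses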
Getting input values: get_input_tensor_by_name
- On the client side, set the value of namefrommainconfig with tritonclient.http.InferInput's set_data_from_numpy (the name comes from main_name -> config.pbtxt).
- On the backend, read the value with
pb_utils.get_input_tensor_by_name(request, "namefrommainconfig").as_numpy().tolist()
# https://github.com/triton-inference-server/python_backend/blob/main/src/resources/triton_python_backend_utils.py
def get_input_tensor_by_name(inference_request, name):
    """Find an input Tensor in the inference_request that has the given
    name
    Parameters
    ----------
    inference_request : InferenceRequest
        InferenceRequest object
    name : str
        name of the input Tensor object
    Returns
    -------
    Tensor
        The input Tensor with the specified name, or None if no
        input Tensor with this name exists
    """
    input_tensors = inference_request.inputs()
    for input_tensor in input_tensors:
        if input_tensor.name() == name:
            return input_tensor
    return None
in_0 = pb_utils.Tensor("INPUT__0", input_ids.numpy().astype(self.input0_dtype))
The name INPUT__0 is taken from the config file in the model_name='facenet' model folder.
pb_utils.InferenceRequest
- Note how inputs is constructed in the examples below:
inputs=[pb_utils.Tensor(self.Facenet_inputs[0], face_img.astype(np.float32))]
- tritonclient.utils.InferenceServerException: Failed to process the request(s) for model instance 'XXXXX', message: TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
- c_python_backend_utils.InferenceRequest(request_id: str = '', correlation_id: int = 0, inputs: List[triton::backend::python::PbTensor], requested_output_names: List[str], model_name: str, model_version: int = -1, flags: int = 0)
# https://www.cnblogs.com/zzk0/p/15535828.html
inference_request = pb_utils.InferenceRequest(
    model_name='facenet',
    requested_output_names=[self.Facenet_outputs[0]],
    inputs=[pb_utils.Tensor(self.Facenet_inputs[0], face_img.astype(np.float32))])
inference_response = inference_request.exec()
pre = utils.pb_tensor_to_numpy(pb_utils.get_output_tensor_by_name(inference_response, self.Facenet_outputs[0]))
For GPU data you must use PyTorch's to_dlpack to place the tensor contents in shared memory, and then from_dlpack to turn the shared-memory contents into a PyTorch tensor:
output = pb_utils.get_output_tensor_by_name(inference_response, "your_requested_output_names_from_config")
out: torch.Tensor = torch.from_dlpack(output.to_dlpack())
- torch.from_dlpack: https://pytorch.org/docs/stable/dlpack.html
- pb_utils.Tensor.from_dlpack. There are two ways to convert a Triton tensor into a PyTorch tensor (see the sketch after this list):
input_ids = from_dlpack(in_0.to_dlpack())
input_ids = torch.from_numpy(in_0.as_numpy())
Using to_dlpack and from_dlpack has lower overhead.
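A sketch of the full round trip inside execute, assuming in_0 is the Triton input tensor and the output name OUTPUT__0 is illustrative; pb_utils.Tensor.from_dlpack builds the output tensor without going through numpy:
from torch.utils.dlpack import from_dlpack, to_dlpack

input_ids = from_dlpack(in_0.to_dlpack())  # Triton tensor -> torch tensor
result = input_ids * 2                     # any torch computation
out_0 = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(result))  # torch tensor -> Triton tensor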
Getting the model output: get_output_tensor_by_name
# pb_utils.get_output_tensor_by_name(inference_response, self.Facenet_outputs[0])
# https://github.com/triton-inference-server/python_backend/blob/main/src/resources/triton_python_backend_utils.py
def get_output_tensor_by_name(inference_response, name):
    """Find an output Tensor in the inference_response that has the given
    name
    Parameters
    ----------
    inference_response : InferenceResponse
        InferenceResponse object
    name : str
        name of the output Tensor object
    Returns
    -------
    Tensor
        The output Tensor with the specified name, or None if no
        output Tensor with this name exists
    """
    output_tensors = inference_response.output_tensors()
    for output_tensor in output_tensors:
        if output_tensor.name() == name:
            return output_tensor
    return None
Error output
inference_response = inference_request.exec()
if inference_response.has_error():
    print(inference_response.error().message())
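Conversely, to report an error from your own model, the error argument of pb_utils.InferenceResponse can be set with a pb_utils.TritonError; a sketch (the message text is illustrative):
responses.append(pb_utils.InferenceResponse(
    output_tensors=[],
    error=pb_utils.TritonError("input shape mismatch")))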
client
- pip install tritonclient[all]
  https://github.com/triton-inference-server/client
- gRPC reference: https://programtalk.com/python-more-examples/tritonclient.grpc.InferInput/
- HTTP reference: https://www.cnblogs.com/zzk0/p/15535828.html
import numpy as np
import tritonclient.http as httpclient
if __name__ == '__main__':
    triton_client = httpclient.InferenceServerClient(url='127.0.0.1:8000')
    inputs = []
    inputs.append(httpclient.InferInput('INPUT0', [4], "FP32"))
    inputs.append(httpclient.InferInput('INPUT1', [4], "FP32"))
    input_data0 = np.random.randn(4).astype(np.float32)
    input_data1 = np.random.randn(4).astype(np.float32)
    inputs[0].set_data_from_numpy(input_data0, binary_data=False)
    inputs[1].set_data_from_numpy(input_data1, binary_data=False)
    outputs = []
    outputs.append(httpclient.InferRequestedOutput('OUTPUT0', binary_data=False))
    outputs.append(httpclient.InferRequestedOutput('OUTPUT1', binary_data=False))
    results = triton_client.infer('example_python', inputs=inputs, outputs=outputs)
    output_data0 = results.as_numpy('OUTPUT0')
    output_data1 = results.as_numpy('OUTPUT1')
    print(input_data0)
    print(input_data1)
    print(output_data0)
    print(output_data1)
- HTTP reference: https://github.com/kamalkraj/stable-diffusion-tritonserver/blob/master/Inference.ipynb; it mainly covers setting up input/output placeholders and running inference
import numpy as np
import tritonclient.http
# model
model_name = "stable_diffusion"
url = "0.0.0.0:8000"
model_version = "1"
batch_size = 1
# model input params
prompt = "A small cabin on top of a snowy mountain in the style of Disney, artstation"
samples = 1 # no.of images to generate
steps = 45
guidance_scale = 7.5
seed = 1024
triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)
assert triton_client.is_model_ready(
    model_name=model_name, model_version=model_version
), f"model {model_name} not yet ready"
model_metadata = triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
model_config = triton_client.get_model_config(model_name=model_name, model_version=model_version)
# Input placeholder
prompt_in = tritonclient.http.InferInput(name="PROMPT", shape=(batch_size,), datatype="BYTES")  # for an image input the shape could be e.g. (batch_size, 64, 64, 3)
samples_in = tritonclient.http.InferInput("SAMPLES", (batch_size, ), "INT32")
steps_in = tritonclient.http.InferInput("STEPS", (batch_size, ), "INT32")
guidance_scale_in = tritonclient.http.InferInput("GUIDANCE_SCALE", (batch_size, ), "FP32")
seed_in = tritonclient.http.InferInput("SEED", (batch_size, ), "INT64")
images = tritonclient.http.InferRequestedOutput(name="IMAGES", binary_data=False)
%%time
# Setting inputs
prompt_in.set_data_from_numpy(np.asarray([prompt] * batch_size, dtype=object))
samples_in.set_data_from_numpy(np.asarray([samples], dtype=np.int32))
steps_in.set_data_from_numpy(np.asarray([steps], dtype=np.int32))
guidance_scale_in.set_data_from_numpy(np.asarray([guidance_scale], dtype=np.float32))
seed_in.set_data_from_numpy(np.asarray([seed], dtype=np.int64))
response = triton_client.infer(
    model_name=model_name, model_version=model_version,
    inputs=[prompt_in, samples_in, steps_in, guidance_scale_in, seed_in],
    outputs=[images]
)
CPU times: user 92.3 ms, sys: 39.5 ms, total: 132 ms
Wall time: 6.31 s
images = response.as_numpy("IMAGES")
from PIL import Image
if images.ndim == 3:
    images = images[None, ...]
images = (images * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
def image_grid(imgs, rows, cols):
    assert len(imgs) == rows*cols
    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))
    grid_w, grid_h = grid.size
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid
rows = 1 # change according to no.of samples
cols = 1 # change according to no.of samples
# rows * cols == no.of samples
image_grid(pil_images, rows, cols)
- The name of each tritonclient.http.InferInput must match the server-side config
error: tritonclient.utils.InferenceServerException: got unexpected datatype BYTES from numpy array, expected FP32
- Fix the datatype according to the message: tritonclient.http.InferInput(name="PROMPT", shape=(batch_size,), datatype="BYTES"); the matching server-side entry is sketched below
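A client-side datatype of BYTES corresponds to TYPE_STRING in the server-side config; a sketch of the matching input entry (the shape is illustrative):
input [
  {
    name: "PROMPT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]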
CG
- self.encoder = CLIPTokenizer.from_pretrained(" ")
  https://huggingface.co/CompVis/stable-diffusion-v1-4/tree/main/tokenizer
- Input encoding
# https://bobbyhadz.com/blog/python-unicodeencodeerror-ascii-codec-cant-encode-character-in-position#:~:text=The%20Python%20%22UnicodeEncodeError%3A%20%27ascii%27%20codec%20can%27t%20encode%20character,the%20error%2C%20specify%20the%20correct%20encoding%2C%20e.g.%20utf-8.
my_str = 'one ф'
# 👇️ encode str to bytes
my_bytes = my_str.encode('utf-8')
print(my_bytes) # 👉️ b'one \xd1\x84'
# 👇️ decode bytes to str
my_str_again = my_bytes.decode('utf-8')
print(my_str_again) # 👉️ "one ф"
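On the server side, a BYTES/TYPE_STRING input arrives in the Python backend as a numpy array of bytes objects; a sketch of decoding it (the input name PROMPT is illustrative):
prompt_tensor = pb_utils.get_input_tensor_by_name(request, "PROMPT")
# each element is a bytes object; decode it back to str
prompts = [p.decode("utf-8") for p in prompt_tensor.as_numpy().tolist()]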
- Note: if you are unsure what type to put in the config file, you can write an arbitrary one first and then correct it based on the error message.