server
- The Triton backend for Python. The goal of Python backend is to let you serve models written in Python by Triton Inference Server without having to write any C++ code.
Model Config File
- Every Python Triton model must provide a config.pbtxt file describing the model configuration. The backend field must be set to python, and the platform field must not be set. The file structure is as follows:
- Single-module model
models
└── add_sub
├── 1
│ └── model.py
└── config.pbtxt
- Multi-module model
├─decoder
│ │ config.pbtxt
│ │
│ └─1
│ decoder.plan
│
├─encoder
│ │ config.pbtxt
│ │
│ └─1
│ encoder.plan
│
└─main_name
│ config.pbtxt
│
└─1
model.py
other files used by model.py
Configuring config.pbtxt
You need to write a config.pbtxt for each submodule as well as one for the model.py module. In model.py's config.pbtxt the size (dims) can be set to -1; for naming, read the section below.
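As a minimal sketch, the config.pbtxt for the add_sub model above might look like this (tensor names, dtypes, and shapes are illustrative):
name: "add_sub"
backend: "python"
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  },
  {
    name: "INPUT1"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]  # dims: [ -1 ] would allow a variable-length dimension
  }
]
instance_group [{ kind: KIND_CPU }]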
model.py
Keyword | Explanation
---|---
class TritonPythonModel: | Your Python model must use this exact class name: every Python model you create must have "TritonPythonModel" as its class name.
def initialize(self, args): | initialize is called only once, when the model is being loaded. Implementing it is optional; it lets the model set up any state associated with it.
def execute(self, requests): | execute is called whenever an inference request is made for this model. It receives a list of pb_utils.InferenceRequest objects as its only argument. Depending on the batching configuration used (e.g. dynamic batching), requests may contain multiple requests. The model must create one pb_utils.InferenceResponse for every pb_utils.InferenceRequest in requests; if an error occurs, set the error argument when creating the pb_utils.InferenceResponse.
initialize
- In this function you are given a variable args, a Python dictionary whose keys and values are both strings. The keys of this dictionary and their descriptions:
key | Description | Example
---|---|---
model_config | JSON string containing the model configuration |
model_instance_kind | String containing the model instance kind | self.device = "cpu" if args["model_instance_kind"] == "CPU" else "cuda"
model_instance_device_id | String containing the model instance device ID |
model_repository | Model repository path |
model_version | Model version |
model_name | Model name |
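As a sketch, an initialize that uses these keys (the output name OUTPUT0 is illustrative):
import json
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        # every value in args is a string; the model config arrives as a JSON string
        self.model_config = json.loads(args["model_config"])
        # choose a device from the instance kind, as in the table above
        self.device = "cpu" if args["model_instance_kind"] == "CPU" else "cuda"
        # look up the numpy dtype configured for an output named OUTPUT0
        output_config = pb_utils.get_output_config_by_name(self.model_config, "OUTPUT0")
        self.output0_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])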
execute
This is the most generic way to implement your model: the execute function must return exactly one response per request. In this mode, execute must return a list of response objects with the same length as requests. The workflow in this mode is:
- The execute function receives a batch of pb_utils.InferenceRequest objects as a length-N array.
- Perform inference on the pb_utils.InferenceRequest objects and append the corresponding pb_utils.InferenceResponse objects to a response list.
- Return the response list.
The length of the returned response list must be N. Each element in the list should be the response for the corresponding element in the requests array. Each element must contain a response (a response can be either output tensors or an error); an element cannot be None.
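A sketch of execute in this default mode, continuing the class sketch above (tensor names are illustrative; self.output0_dtype comes from the initialize sketch):
    def execute(self, requests):
        responses = []
        for request in requests:
            # read the input, compute, and wrap the result in a response
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            out = pb_utils.Tensor("OUTPUT0", in0.astype(self.output0_dtype))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        # must hold: len(responses) == len(requests) == N
        return responses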
Getting input values: get_input_tensor_by_name
- On the client side, set the value of namefrommainconfig with tritonclient.http.InferInput's set_data_from_numpy (the name comes from main_name -> config.pbtxt).
- On the backend, read the value with
pb_utils.get_input_tensor_by_name(request, "namefrommainconfig").as_numpy().tolist()
# https://github.com/triton-inference-server/python_backend/blob/main/src/resources/triton_python_backend_utils.py
def get_input_tensor_by_name(inference_request, name):
    """Find an input Tensor in the inference_request that has the given
    name
    Parameters
    ----------
    inference_request : InferenceRequest
        InferenceRequest object
    name : str
        name of the input Tensor object
    Returns
    -------
    Tensor
        The input Tensor with the specified name, or None if no
        input Tensor with this name exists
    """
    input_tensors = inference_request.inputs()
    for input_tensor in input_tensors:
        if input_tensor.name() == name:
            return input_tensor
    return None
in_0 = pb_utils.Tensor("INPUT__0", input_ids.numpy().astype(self.input0_dtype))
The name INPUT__0 is taken from the config file in the model_name='facenet' model folder.
pb_utils.InferenceRequest
- Note how inputs is constructed in the examples below:
inputs=[pb_utils.Tensor(self.Facenet_inputs[0], face_img.astype(np.float32))]
- tritonclient.utils.InferenceServerException: Failed to process the request(s) for model instance 'XXXXX', message: TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
- c_python_backend_utils.InferenceRequest(request_id: str = '', correlation_id: int = 0, inputs: List[triton::backend::python::PbTensor], requested_output_names: List[str], model_name: str, model_version: int = -1, flags: int = 0)
# https://www.cnblogs.com/zzk0/p/15535828.html
inference_request = pb_utils.InferenceRequest(
    model_name='facenet',
    requested_output_names=[self.Facenet_outputs[0]],
    inputs=[pb_utils.Tensor(self.Facenet_inputs[0], face_img.astype(np.float32))])
inference_response = inference_request.exec()
pre = utils.pb_tensor_to_numpy(pb_utils.get_output_tensor_by_name(inference_response, self.Facenet_outputs[0]))
For GPU data you must use PyTorch's to_dlpack to place the tensor contents in shared memory, and then from_dlpack to turn the shared-memory contents into a PyTorch tensor:
output = pb_utils.get_output_tensor_by_name(inference_response, "your_requested_output_names_from_config")
out: torch.Tensor = torch.from_dlpack(output.to_dlpack())
- torch.from_dlpack: https://pytorch.org/docs/stable/dlpack.html
- pb_utils.Tensor.from_dlpack. There are two ways to convert a Triton tensor into a PyTorch tensor (see the sketch after this list):
input_ids = from_dlpack(in_0.to_dlpack())
input_ids = torch.from_numpy(in_0.as_numpy())
Using to_dlpack and from_dlpack has lower overhead.
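A sketch of the full round trip inside execute, assuming in_0 is the Triton input tensor and the output name OUTPUT__0 is illustrative; pb_utils.Tensor.from_dlpack builds the output tensor without going through numpy:
from torch.utils.dlpack import from_dlpack, to_dlpack

input_ids = from_dlpack(in_0.to_dlpack())  # Triton tensor -> torch tensor
result = input_ids * 2                     # any torch computation
out_0 = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(result))  # torch tensor -> Triton tensor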
Getting the model output: get_output_tensor_by_name
# pb_utils.get_output_tensor_by_name(inference_response, self.Facenet_outputs[0])
# https://github.com/triton-inference-server/python_backend/blob/main/src/resources/triton_python_backend_utils.py
def get_output_tensor_by_name(inference_response, name):
    """Find an output Tensor in the inference_response that has the given
    name
    Parameters
    ----------
    inference_response : InferenceResponse
        InferenceResponse object
    name : str
        name of the output Tensor object
    Returns
    -------
    Tensor
        The output Tensor with the specified name, or None if no
        output Tensor with this name exists
    """
    output_tensors = inference_response.output_tensors()
    for output_tensor in output_tensors:
        if output_tensor.name() == name:
            return output_tensor
    return None
Error output
inference_response = inference_request.exec()
if inference_response.has_error():
    print(inference_response.error().message())
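Conversely, to report an error from your own model, the error argument of pb_utils.InferenceResponse can be set with a pb_utils.TritonError; a sketch (the message text is illustrative):
responses.append(pb_utils.InferenceResponse(
    output_tensors=[],
    error=pb_utils.TritonError("input shape mismatch")))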
client
- pip install tritonclient[all]
  https://github.com/triton-inference-server/client
- gRPC reference: https://programtalk.com/python-more-examples/tritonclient.grpc.InferInput/
- HTTP reference: https://www.cnblogs.com/zzk0/p/15535828.html
import numpy as np
import tritonclient.http as httpclient
if __name__ == '__main__':
    triton_client = httpclient.InferenceServerClient(url='127.0.0.1:8000')
    inputs = []
    inputs.append(httpclient.InferInput('INPUT0', [4], "FP32"))
    inputs.append(httpclient.InferInput('INPUT1', [4], "FP32"))
    input_data0 = np.random.randn(4).astype(np.float32)
    input_data1 = np.random.randn(4).astype(np.float32)
    inputs[0].set_data_from_numpy(input_data0, binary_data=False)
    inputs[1].set_data_from_numpy(input_data1, binary_data=False)
    outputs = []
    outputs.append(httpclient.InferRequestedOutput('OUTPUT0', binary_data=False))
    outputs.append(httpclient.InferRequestedOutput('OUTPUT1', binary_data=False))
    results = triton_client.infer('example_python', inputs=inputs, outputs=outputs)
    output_data0 = results.as_numpy('OUTPUT0')
    output_data1 = results.as_numpy('OUTPUT1')
    print(input_data0)
    print(input_data1)
    print(output_data0)
    print(output_data1)
- HTTP reference: https://github.com/kamalkraj/stable-diffusion-tritonserver/blob/master/Inference.ipynb; it mainly covers setting up input/output placeholders and running inference
import numpy as np
import tritonclient.http
# model
model_name = "stable_diffusion"
url = "0.0.0.0:8000"
model_version = "1"
batch_size = 1
# model input params
prompt = "A small cabin on top of a snowy mountain in the style of Disney, artstation"
samples = 1 # no.of images to generate
steps = 45
guidance_scale = 7.5
seed = 1024
triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)
assert triton_client.is_model_ready(
    model_name=model_name, model_version=model_version
), f"model {model_name} not yet ready"
model_metadata = triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
model_config = triton_client.get_model_config(model_name=model_name, model_version=model_version)
# Input placeholder
prompt_in = tritonclient.http.InferInput(name="PROMPT", shape=(batch_size,), datatype="BYTES")  # for an image input the shape could be e.g. (batch_size, 64, 64, 3)
samples_in = tritonclient.http.InferInput("SAMPLES", (batch_size, ), "INT32")
steps_in = tritonclient.http.InferInput("STEPS", (batch_size, ), "INT32")
guidance_scale_in = tritonclient.http.InferInput("GUIDANCE_SCALE", (batch_size, ), "FP32")
seed_in = tritonclient.http.InferInput("SEED", (batch_size, ), "INT64")
images = tritonclient.http.InferRequestedOutput(name="IMAGES", binary_data=False)
%%time
# Setting inputs
prompt_in.set_data_from_numpy(np.asarray([prompt] * batch_size, dtype=object))
samples_in.set_data_from_numpy(np.asarray([samples], dtype=np.int32))
steps_in.set_data_from_numpy(np.asarray([steps], dtype=np.int32))
guidance_scale_in.set_data_from_numpy(np.asarray([guidance_scale], dtype=np.float32))
seed_in.set_data_from_numpy(np.asarray([seed], dtype=np.int64))
response = triton_client.infer(
    model_name=model_name, model_version=model_version,
    inputs=[prompt_in, samples_in, steps_in, guidance_scale_in, seed_in],
    outputs=[images]
)
CPU times: user 92.3 ms, sys: 39.5 ms, total: 132 ms
Wall time: 6.31 s
images = response.as_numpy("IMAGES")
from PIL import Image
if images.ndim == 3:
    images = images[None, ...]
images = (images * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
def image_grid(imgs, rows, cols):
    assert len(imgs) == rows*cols
    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))
    grid_w, grid_h = grid.size
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid
rows = 1 # change according to no.of samples
cols = 1 # change according to no.of samples
# rows * cols == no.of samples
image_grid(pil_images, rows, cols)
- The name of each tritonclient.http.InferInput must match the server-side config
error: tritonclient.utils.InferenceServerException: got unexpected datatype BYTES from numpy array, expected FP32
- Fix the datatype according to the message: tritonclient.http.InferInput(name="PROMPT", shape=(batch_size,), datatype="BYTES"); the matching server-side entry is sketched below
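A client-side datatype of BYTES corresponds to TYPE_STRING in the server-side config; a sketch of the matching input entry (the shape is illustrative):
input [
  {
    name: "PROMPT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]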
CG
- self.encoder = CLIPTokenizer.from_pretrained(" ")
  https://huggingface.co/CompVis/stable-diffusion-v1-4/tree/main/tokenizer
- Input encoding
# https://bobbyhadz.com/blog/python-unicodeencodeerror-ascii-codec-cant-encode-character-in-position#:~:text=The%20Python%20%22UnicodeEncodeError%3A%20%27ascii%27%20codec%20can%27t%20encode%20character,the%20error%2C%20specify%20the%20correct%20encoding%2C%20e.g.%20utf-8.
my_str = 'one ф'
# 👇️ encode str to bytes
my_bytes = my_str.encode('utf-8')
print(my_bytes) # 👉️ b'one \xd1\x84'
# 👇️ decode bytes to str
my_str_again = my_bytes.decode('utf-8')
print(my_str_again) # 👉️ "one ф"
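On the server side, a BYTES/TYPE_STRING input arrives in the Python backend as a numpy array of bytes objects; a sketch of decoding it (the input name PROMPT is illustrative):
prompt_tensor = pb_utils.get_input_tensor_by_name(request, "PROMPT")
# each element is a bytes object; decode it back to str
prompts = [p.decode("utf-8") for p in prompt_tensor.as_numpy().tolist()]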
- Note: if you are unsure what type to put in the config file, you can write an arbitrary one first and then correct it based on the error message.