Deploying an mmdeploy-exported TensorRT model with Triton: a failed-attempt log

A record of the process. The deployment ultimately did not succeed, most likely because of the Ubuntu version on the host. I don't have time to dig further right now, so I'm writing everything down and will fill in the gaps when I come back to it.

Triton demo

git clone -b r22.06 https://github.com/triton-inference-server/server.git

cd server/docs/examples

./fetch_models.sh

# Build and start the server in container 1
docker run --gpus=1 --rm --net=host -v /home/xbsj/gaoying/triton/triton_demo/server/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:22.06-py3 tritonserver --model-repository=/models

# Enter container 2, from which we will send requests
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:22.06-py3-sdk

# Send a request from inside container 2
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
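
Besides the C++ image_client, you can also poke the demo server from the host with a few lines of Python. A minimal sketch (assumes pip install tritonclient[http] on the host):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url='localhost:8000')
print(client.is_server_ready())                    # True once the demo models are loaded
print(client.get_model_metadata('densenet_onnx'))  # input/output names, shapes and datatypes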

Installing Triton and starting the service (Docker)

The mapping between Triton container versions and CUDA / TensorRT versions: Release Notes :: NVIDIA Deep Learning Triton Inference Server Documentation

More detail here: Frameworks Support Matrix :: NVIDIA Deep Learning Frameworks Documentation

Container Version | Triton Inference Server | Ubuntu | CUDA Toolkit | TensorRT
21.07 | 2.12.0 | 20.04 | NVIDIA CUDA 11.4.0 | TensorRT 8.0.1.6
21.06.1 | 2.11.0 | 20.04 | NVIDIA CUDA 11.3.1 | TensorRT 7.2.3.4
21.06 | 2.11.0 | 20.04 | NVIDIA CUDA 11.3.1 | TensorRT 7.2.3.4
21.05 | 2.10.0 | 20.04 | NVIDIA CUDA 11.3.1 | TensorRT 7.2.3.4
21.04 | 2.9.0 | 20.04 | NVIDIA CUDA 11.3.1 | TensorRT 7.2.3.4
21.03 | 2.8.0 | 20.04 | NVIDIA CUDA 11.2.1 | TensorRT 7.2.2.3
21.02 | 2.7.0 | 20.04 | NVIDIA CUDA 11.2.0 | TensorRT 7.2.2.3+cuda11.1.0.024
20.12 | 2.6.0 | 20.04 | NVIDIA CUDA 11.1.1 | TensorRT 7.2.2
20.11 | 2.5.0 | 18.04 | NVIDIA CUDA 11.1.0 | TensorRT 7.2.1
20.10 | 2.4.0 | 18.04 | NVIDIA CUDA 11.1.0 | TensorRT 7.2.1
20.09 | 2.3.0 | 18.04 | NVIDIA CUDA 11.0.3 | TensorRT 7.1.3
20.08 | 2.2.0 | 18.04 | NVIDIA CUDA 11.0.3 | TensorRT 7.1.3
20.07 | 1.15.0 / 2.1.0 | 18.04 | NVIDIA CUDA 11.0.194 | TensorRT 7.1.3
20.06 | 1.14.0 / 2.0.0 | 18.04 | NVIDIA CUDA 11.0.167 | TensorRT 7.1.2
20.03.1 | 1.13.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0
20.03 | 1.12.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0
20.02 | 1.11.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0
20.01 | 1.10.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0
19.12 | 1.9.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 6.0.1
19.11 | 1.8.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 6.0.1
19.10 | 1.7.0 | 18.04 | NVIDIA CUDA 10.1.243 | TensorRT 6.0.1
19.09 | 1.6.0 | 18.04 | NVIDIA CUDA 10.1.243 | TensorRT 6.0.1
19.08 | 1.5.0 | 18.04 | NVIDIA CUDA 10.1.243 | TensorRT 5.1.5
1️⃣ Triton installation

Pull the Docker image. 20.11 is the version tag; you can pick one here: Triton Inference Server (Formerly TensorRT Inference Server) | NVIDIA NGC

Create a file named Dockerfile.triton with the following content:

FROM nvcr.io/nvidia/tritonserver:20.11-py3

# RUN <extra setup commands can be appended here later>

Save and exit, then run the command below to build the Triton Docker image. The benefit of creating Dockerfile.triton first and building from it is that you can name the image triton:2011, which makes it easier to keep track of; and if you later want to customize the Triton image, you can simply keep appending steps to Dockerfile.triton.

nvidia-docker build -f Dockerfile.triton -t triton:2011 . 
2️⃣ Writing the model configuration files

Create a local directory that will be mounted into the Docker container.

Layout of the mounted directory

.
└── model_rep                # root directory mounted from the host
    ├── demo1                # model 1
    │   ├── 1                # model version number
    │   │   └── model.pt     # model file
    │   ├── 2                # model version number
    │   │   └── model.pt     # model file
    │   └── config.pbtxt
    └── demo2                # model 2
        ├── 1
        │   └── model.pt
        └── config.pbtxt

Writing config.pbtxt

Below is an ONNX model opened with Netron, where we can read off the input and output names and types; these are what we fill into the input and output sections of the config file. Shown here are the ONNX model for faster_rcnn_r50_trt and its configuration file.
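
If you don't have Netron at hand, the onnx Python package can print the same input/output information. A minimal sketch, assuming the mmdeploy export also left an end2end.onnx next to the engine file (the file name is an assumption; adjust the path to your own export):

import onnx

model = onnx.load('end2end.onnx')  # path to the mmdeploy-exported ONNX file (assumed name)
for kind, tensors in [('input', model.graph.input), ('output', model.graph.output)]:
    for t in tensors:
        tt = t.type.tensor_type
        dims = [d.dim_value if d.dim_value > 0 else -1 for d in tt.shape.dim]
        print(kind, t.name, onnx.TensorProto.DataType.Name(tt.elem_type), dims)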

Below is the config.pbtxt corresponding to the model above.

name: "faster_rcnn_r50_trt"               # 模型名,也是目录名
platform: "tensorrt_plan"    # 模型对应的平台,参考文章下面给出的表格
max_batch_size : 8              # 一次送入模型的最大batch_size。
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3,-1,-1 ]            # 第一个维度默认是batch size,不用咱们配置。因此我们从第二个维度开始配置。
                                # 如果是可变维度,我们就用 -1
  }
]
output [
  {
    name: "dets"
    data_type: TYPE_FP32
    dims: [-1,-1]
  },
  {
    name: "labels"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]

default_model_filename: "end2end.engine"
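
Once the server from section 3 is running, you can check how Triton actually parsed this file through the client API. A minimal sketch (the model name is the one used above):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url='localhost:8000')
cfg = client.get_model_config('faster_rcnn_r50_trt')   # returns the parsed config as a dict
print(cfg['max_batch_size'], cfg['input'], cfg['output'])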

Framework to platform mapping:

Framework              | platform
TensorRT               | tensorrt_plan
TensorFlow SavedModel  | tensorflow_savedmodel
TensorFlow GraphDef    | tensorflow_graphdef
ONNX                   | onnxruntime_onnx
Torch                  | pytorch_libtorch

Input/output data_type mapping:

Model Config | TensorRT | TensorFlow | ONNX Runtime | PyTorch | API    | NumPy
TYPE_BOOL    | kBOOL    | DT_BOOL    | BOOL         | kBool   | BOOL   | bool
TYPE_UINT8   |          | DT_UINT8   | UINT8        | kByte   | UINT8  | uint8
TYPE_UINT16  |          | DT_UINT16  | UINT16       |         | UINT16 | uint16
TYPE_UINT32  |          | DT_UINT32  | UINT32       |         | UINT32 | uint32
TYPE_UINT64  |          | DT_UINT64  | UINT64       |         | UINT64 | uint64
TYPE_INT8    | kINT8    | DT_INT8    | INT8         | kChar   | INT8   | int8
TYPE_INT16   |          | DT_INT16   | INT16        | kShort  | INT16  | int16
TYPE_INT32   | kINT32   | DT_INT32   | INT32        | kInt    | INT32  | int32
TYPE_INT64   |          | DT_INT64   | INT64        | kLong   | INT64  | int64
TYPE_FP16    | kHALF    | DT_HALF    | FLOAT16      |         | FP16   | float16
TYPE_FP32    | kFLOAT   | DT_FLOAT   | FLOAT        | kFloat  | FP32   | float32
TYPE_FP64    |          | DT_DOUBLE  | DOUBLE       | kDouble | FP64   | float64
TYPE_STRING  |          | DT_STRING  | STRING       |         | BYTES  | dtype(object)
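
When writing the client code below you don't have to remember this mapping by hand; tritonclient ships helpers that convert between NumPy dtypes and the Triton type strings:

import numpy as np
from tritonclient.utils import np_to_triton_dtype, triton_to_np_dtype

print(np_to_triton_dtype(np.float32))  # FP32
print(np_to_triton_dtype(np.int64))    # INT64
print(triton_to_np_dtype('FP16'))      # numpy.float16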
3️⃣ Starting the service
🔸 Start and run the service:

--gpus all enables the GPUs.

-v /home/xbsj/gaoying/triton/model_rep:/model_rep maps the host model directory into the container.

8000 is the HTTP port, 8001 is the gRPC port (8002 is the metrics port).

Remember to change the image name (triton:2201 in the command below, or e.g. nvcr.io/nvidia/tritonserver:21.11-py3) to the version you are actually using.

docker run --gpus all -p8000:8000 -p8001:8001 -p8002:8002 -v /home/xbsj/gaoying/triton/model_rep:/model_rep -v /home/xbsj/gaoying/triton/plugin_rep:/plugin_rep --env LD_PRELOAD=/plugin_rep/libmmdeploy_tensorrt_ops.so triton:2201 tritonserver --model-repository=/model_rep
🔸 Enter the container, then start the service manually
docker run --gpus=all --network=host --shm-size=2g -v /home/xbsj/gaoying/triton/model_rep/:/models  -it nvcr.io/nvidia/tritonserver:21.04-py3  # enter the container
./bin/tritonserver --model-repository=/models  # start triton
docker run --gpus=all --network=host -v /home/xbsj/gaoying/triton/model_rep:/opt/ml/model  -it triton:2104  # enter the container
./bin/tritonserver --model-repository=/opt/ml/model  # start triton (the repository path must match the mount above)
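
After the container is up, a quick sanity check from the host that both the server and your model loaded correctly (a minimal sketch; change the model name to your own):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url='localhost:8000')
print('server live :', client.is_server_live())
print('server ready:', client.is_server_ready())
print('model ready :', client.is_model_ready('faster_rcnn_r50_trt'))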

Client-side tests

1️⃣ Command-line test

Check whether the server is ready; run on the host:

curl -v localhost:8000/v2/health/ready

Output on success:

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.58.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host localhost left intact

2️⃣ tritonclient interface test
🔸 grpc

gRPC, faster rcnn r50, 10 iterations: 1.0688064098358154 s

import os
import time
import numpy as np
import tritonclient.grpc as grpcclient
from PIL import Image


def client_init(url="localhost:8001",
                ssl=False, private_key=None, root_certificates=None, certificate_chain=None,
                verbose=False):
    triton_client = grpcclient.InferenceServerClient(
        url=url,
        verbose=verbose,
        ssl=ssl,
        root_certificates=root_certificates,
        private_key=private_key,
        certificate_chain=certificate_chain)
    return triton_client


def infer_faster_rcnn_r50_trt_grpc(triton_client, model_name, input='input', dets='dets', labels='labels',
                                   compression_algorithm=None):
    inputs = []
    outputs = []

    # declare the input tensor
    inputs.append(grpcclient.InferInput(input, [1, 3, 427, 640], "FP32"))

    # fill the input with data
    root_dir = os.getcwd()
    img_path = os.path.join(root_dir, 'demo.jpg')  # put an image named demo.jpg in the working directory
    img = np.array(Image.open(img_path))
    img = img.astype(np.float32)
    img = img.transpose((2, 0, 1))
    img = np.expand_dims(img, axis=0)  # (1, 3, 427, 640)
    inputs[0].set_data_from_numpy(img)

    # declare the requested outputs
    outputs.append(grpcclient.InferRequestedOutput(dets))
    outputs.append(grpcclient.InferRequestedOutput(labels))

    results = triton_client.infer(
        model_name=model_name,
        inputs=inputs,
        outputs=outputs,
        compression_algorithm=compression_algorithm
        # client_timeout=0.1
    )
    # print('=' * 50)
    print(results)
    # print('=' * 50)
    # # convert the outputs to numpy arrays
    # print(results.as_numpy(dets))
    # print('=' * 50)
    # print(results.as_numpy(labels))
    # print('=' * 50)


if __name__ == '__main__':
    client = client_init()

    st = time.time()
    for i in range(10):
        infer_faster_rcnn_r50_trt_grpc(triton_client=client, model_name='faster_rcnn_r50_trt')
    print("grpc faster rcnn r50 十个迭代用时: {}".format(time.time() - st))
🔸 http

HTTP, faster rcnn r50, 10 iterations: 1.1643376350402832 s

import os
import time

import gevent.ssl
import numpy as np
import tritonclient.http as httpclient
from PIL import Image


def client_init(url="localhost:8000",
                ssl=False, key_file=None, cert_file=None, ca_certs=None, insecure=False,
                verbose=False):
    if ssl:
        ssl_options = {}
        if key_file is not None:
            ssl_options['keyfile'] = key_file
        if cert_file is not None:
            ssl_options['certfile'] = cert_file
        if ca_certs is not None:
            ssl_options['ca_certs'] = ca_certs
        ssl_context_factory = None
        if insecure:
            ssl_context_factory = gevent.ssl._create_unverified_context
        triton_client = httpclient.InferenceServerClient(
            url=url,
            verbose=verbose,
            ssl=True,
            ssl_options=ssl_options,
            insecure=insecure,
            ssl_context_factory=ssl_context_factory)
    else:
        triton_client = httpclient.InferenceServerClient(
            url=url, verbose=verbose)
    return triton_client


def infer_faster_rcnn_r50_trt_http(triton_client, model_name='faster_rcnn_r50_trt',
                              input='input', output0='dets', output1='labels',
                              request_compression_algorithm=None,
                              response_compression_algorithm=None):
    inputs = []
    outputs = []

    # declare the input tensor
    inputs.append(httpclient.InferInput(input, [1, 3, 427, 640], "FP32"))

    # fill the input with data
    root_dir = os.getcwd()
    img_path = os.path.join(root_dir, 'demo.jpg')  # put an image named demo.jpg in the working directory
    img = np.array(Image.open(img_path))
    img = img.astype(np.float32)
    img = img.transpose((2, 0, 1))
    img = np.expand_dims(img, axis=0)  # (1, 3, 427, 640)
    inputs[0].set_data_from_numpy(img)

    # output0 / output1 are the output node names from config.pbtxt
    outputs.append(httpclient.InferRequestedOutput(output0, binary_data=False))
    outputs.append(httpclient.InferRequestedOutput(output1, binary_data=False))

    results = triton_client.infer(
        model_name=model_name,
        inputs=inputs,
        outputs=outputs,
        request_compression_algorithm=request_compression_algorithm,
        response_compression_algorithm=response_compression_algorithm)
    # print('=' * 50)
    print(results)
    # print('=' * 50)
    # # convert the outputs to numpy arrays
    # print(results.as_numpy(output0))
    # print('=' * 50)
    # print(results.as_numpy(output1))
    # print('=' * 50)


if __name__ == '__main__':
    triton_client = client_init()
    st = time.time()
    for i in range(10):
        infer_faster_rcnn_r50_trt_http(triton_client)
    print("http faster rcnn r50, 10 iterations: {}".format(time.time() - st))
3️⃣ requests interface test

requests, faster rcnn r50, 10 iterations: 3.843385934829712 s (noticeably slower than tritonclient, largely because the whole image tensor is serialized as JSON)

import os
import time

import numpy as np
from PIL import Image
import requests


def infer_demo_torch_http():
    url = 'http://localhost:8000/v2/models/demo_torch/versions/1/infer'
    data = {
        "inputs": [{
            "name": "input__0",
            "shape": [2, 3],
            "datatype": "INT64",
            "data": [[1, 2, 3], [4, 5, 6]]
        }],
        "outputs": [{"name": "output__0"}, {"name": "output__1"}]
    }
    headers = {'Content-Type': 'application/json'}
    res = requests.post(url, json=data, headers=headers).json()
    print(res)


def infer_demo_onnx_http():
    url = 'http://localhost:8000/v2/models/demo_onnx/versions/1/infer'
    data = {
        "inputs": [{
            "name": "INPUT0",
            "shape": [8, 2],
            "datatype": "FP32",
            "data": [[0.1] * 2 for _ in range(8)]
        }, {
            "name": "INPUT1",
            "shape": [8, 2],
            "datatype": "INT32",
            "data": [[1] * 2 for _ in range(8)]
        }],
        "outputs": [{"name": "OUTPUT0"}, {"name": "OUTPUT1"}]
    }
    headers = {'Content-Type': 'application/json'}
    res = requests.post(url, json=data, headers=headers).json()
    print(res)


def infer_faster_rcnn_r50_onnx_http():
    root_dir = os.getcwd()
    img_path = os.path.join(root_dir, 'demo.jpg')
    img = np.array(Image.open(img_path))
    img = img.astype(np.float32)
    img = img.transpose((2, 0, 1))
    img = np.expand_dims(img, axis=0)  # (1, 3, 427, 640)
    # img = np.repeat(img, repeats=2, axis=0)  # (2, 3, 427, 640)
    img = img.tolist()
    url = 'http://localhost:8000/v2/models/faster_rcnn_r50_onnx/versions/1/infer'

    data = {
        "inputs": [{
            "name": "input",
            "shape": [1, 3, 427, 640],
            "datatype": "FP32",
            "data": img
        }, ],
        "outputs": [{"name": "dets"}, {"name": "labels"}]
    }
    headers = {'Content-Type': 'application/json'}
    res = requests.post(url, json=data, headers=headers).json()
    print(res)


def infer_faster_rcnn_r50_trt_http():
    root_dir = os.getcwd()
    img_path = os.path.join(root_dir, 'demo.jpg')
    img = np.array(Image.open(img_path))
    img = img.astype(np.float32)
    img = img.transpose((2, 0, 1))
    img = np.expand_dims(img, axis=0)  # (1, 3, 427, 640)
    img = img.tolist()
    url = 'http://localhost:8000/v2/models/faster_rcnn_r50_trt/versions/1/infer'

    data = {
        "inputs": [{
            "name": "input",
            "shape": [1, 3, 427, 640],
            "datatype": "FP32",
            "data": img
        }, ],
        "outputs": [{"name": "dets"}, {"name": "labels"}]
    }
    headers = {'Content-Type': 'application/json'}
    res = requests.post(url, json=data, headers=headers).json()
    print(res)


if __name__ == "__main__":
    print('=' * 50)
    print('| Infer demo_torch')
    print('_' * 20)
    infer_demo_torch_http()
    print('=' * 50)
    print('| Infer demo_onnx')
    print('_' * 20)
    infer_demo_onnx_http()
    print('=' * 50)
    print('| Infer faster_rcnn_r50_onnx')
    print('_' * 20)
    infer_faster_rcnn_r50_onnx_http()

    print('=' * 50)
    print('| Infer faster_rcnn_r50_trt')
    print('_' * 20)
    st = time.time()
    for _ in range(10):
        infer_faster_rcnn_r50_trt_http()
    print("requests faster rcnn r50 十个迭代用时: {}".format(time.time() - st))
    print('=' * 50)

Triton load testing

First prepare the input data, input.json:

{
    "inputs": [{
        "name": "input__0",
        "shape": [2, 3],
        "datatype": "INT64",
        "data": [[1, 2, 3], [4, 5, 6]]
    }],
    "outputs": [{"name": "output__0"}, {"name": "output__1"}]
}

Install the tool:

sudo apt install apache2-utils

Load-test command:

ab -k -c 5 -n 500 -p input.json http://localhost:8000/v2/models/demo/versions/1/infer 

This sends 500 requests in total over 5 concurrent connections, with input.json as the request body, against version 1 of the demo model (make sure the model name in the URL matches the model your payload was built for).
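
If you prefer to stay in Python, a rough equivalent of the ab command is a small concurrent load test with requests (a sketch only; the URL and the input.json payload from above are assumptions, so adjust them to your own model):

import json
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = 'http://localhost:8000/v2/models/demo_torch/versions/1/infer'  # model name assumed
with open('input.json') as f:
    PAYLOAD = json.load(f)

def one_request(_):
    t0 = time.time()
    requests.post(URL, json=PAYLOAD).raise_for_status()
    return time.time() - t0

with ThreadPoolExecutor(max_workers=5) as pool:          # 5 concurrent workers
    latencies = list(pool.map(one_request, range(500)))  # 500 requests in total
print('mean {:.4f}s  max {:.4f}s'.format(sum(latencies) / len(latencies), max(latencies)))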

Triton error collection:

⚠️ INVALID_ARGUMENT: getPluginCreator could not find plugin TRTBatchedNMS version 1

The TensorRT model converted inside the mmdeploy Docker image cannot be used in the Triton Docker image and fails with the errors below. (Triton's log output is long and hard to read at first; the trick is that the lines starting with E are the actual errors.)

E0630 01:31:22.566631 1 logging.cc:43] INVALID_ARGUMENT: getPluginCreator could not find plugin TRTBatchedNMS version 1
E0630 01:31:22.566657 1 logging.cc:43] safeDeserializationUtils.cpp (322) - Serialization Error in load: 0 (Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
E0630 01:31:22.566739 1 logging.cc:43] INVALID_STATE: std::exception
E0630 01:31:22.572629 1 logging.cc:43] INVALID_CONFIG: Deserialize the cuda engine failed.
E0630 01:31:22.587565 1 model_repository_manager.cc:1215] failed to load 'faster_rcnn_r50_tensorrt' version 1: Internal: unable to create TensorRT engine
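
Before trying the fixes below, it can help to confirm that libmmdeploy_tensorrt_ops.so really registers the TRTBatchedNMS plugin. A hedged sketch using the TensorRT Python API (run it in an environment with the same TensorRT version as the Triton image; the .so path is an assumption):

import ctypes
import tensorrt as trt

ctypes.CDLL('/plugin_rep/libmmdeploy_tensorrt_ops.so')          # load the mmdeploy custom ops
trt.init_libnvinfer_plugins(trt.Logger(trt.Logger.WARNING), '')
creators = trt.get_plugin_registry().plugin_creator_list
print([c.name for c in creators if 'NMS' in c.name])            # TRTBatchedNMS should appear here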

🔸 Method 1 (recommended)

Reference (Chinese blog post): yolo模型部署——tensorRT模型加速+triton服务器模型部署

Simply run the command below (adjust the paths and image tag to your own setup):

docker run --gpus all -p8000:8000 -p8001:8001 -p8002:8002 -v /home/xbsj/gaoying/triton/model_rep:/model_rep -v /home/xbsj/gaoying/triton/plugin_rep:/plugin_rep --env LD_PRELOAD=/plugin_rep/libmmdeploy_tensorrt_ops.so triton:2104 tritonserver --model-repository=/model_rep
🔸 Method 2

Source of the fix: end2end.engine to Triton · Issue #465 · open-mmlab/mmdeploy (github.com)

Steps (I tried this and it did not work for me; most likely I did something wrong):

1️⃣ Copy /root/workspace/mmdeploy/build/lib/libmmdeploy_tensorrt_ops.so from the mmdeploy Docker image into /opt/tritonserver/lib/ inside the Triton Docker image.

docker run --gpus=all --network=host -v /home/xbsj/gaoying/triton/model_rep:/opt/ml/model  -it triton:2104     # run on the host: enter the triton container without starting the server

docker ps    # run on the host: look up the triton container id

docker cp /data/imagetd/xbsj/gaoying//mmdeploy_out/libmmdeploy_tensorrt_ops.so 7725e367f0f0:/opt/tritonserver/lib/libmmdeploy_tensorrt_ops.so      # copy the file from the host into the triton container

2️⃣ Append LD_PRELOAD=libmmdeploy_tensorrt_ops.so near the end of /bin/serve, before the tritonserver command.

vim /bin/serve

Add the following line (around line 105 of that file):

LD_PRELOAD=libmmdeploy_tensorrt_ops.so

Start the service:

./bin/tritonserver --model-store=/models
⚠️ ImportError: cannot import name 'ORTWrapper' from 'mmdeploy.backend.onnxruntime' (/data/imagetd/xbsj/gaoying/mmdeploy/mmdeploy/backend/onnxruntime/__init__.py)

Source of the fix: Bug using ORTwrapper · Issue #37 · open-mmlab/mmdeploy (github.com)

🔸 Fix

In mmdeploy/codebase/mmdet/core/post_processing/bbox_nms.py::select_nms_index, changing return batched_dets, batched_labels to return batched_dets[:, 0:-1, :], batched_labels[:, 0:-1] may fix the bug.

Then run

python setup.py install

and redo the model conversion afterwards.

⚠️ Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.

This is a warning rather than an error: it usually means the TensorRT builder wanted a larger workspace while converting the model, and by itself it does not stop Triton from serving the engine.

Reference: Bug using ORTwrapper · Issue #37 · open-mmlab/mmdeploy (github.com)
