TensorRT Learning and Paddle-TRT Practice (Part 1)

TensorRT is NVIDIA's deep learning inference acceleration library; it supports Caffe and ONNX models and provides broad operator coverage. Its workflow consists of building a network definition, generating a CUDA engine, and executing inference, and it improves model performance through optimizations such as layer fusion and data precision calibration. The case study demonstrates converting a Caffe model to ONNX and then to a Paddle model, and accelerating it with Paddle-TRT.

GitHub source code: https://github.com/NVIDIA/TensorRT

Official documentation:

NVIDIA Deep Learning TensorRT Documentation: https://docs.nvidia.com/deeplearning/tensorrt/index.html

1. Operator Support

As of October 19, 2021, TensorRT had been updated to version 8.2.0. Its operator support (listed in the parsers folder of the source tree) is as follows:

Caffe:

    {"Convolution", parseConvolution},
    {"Pooling", parsePooling},
    {"InnerProduct", parseInnerProduct},
    {"ReLU", parseReLU},
    {"Softmax", parseSoftMax},
    {"SoftmaxWithLoss", parseSoftMax},
    {"LRN", parseLRN},
    {"Power", parsePower},
    {"Eltwise", parseEltwise},
    {"Concat", parseConcat},
    {"Deconvolution", parseDeconvolution},
    {"Sigmoid", parseSigmoid},
    {"TanH", parseTanH},
    {"BatchNorm", parseBatchNormalization},
    {"Scale", parseScale},
    {"Crop", parseCrop},
    {"Reduction", parseReduction},
    {"Reshape", parseReshape},
    {"Permute", parsePermute},
    {"ELU", parseELU},
    {"BNLL", parseBNLL},
    {"Clip", parseClip},
    {"AbsVal", parseAbsVal},
    {"PReLU", parsePReLU}

ONNX:

| Operator                  | Supported  | Supported Types | Restrictions                                                                                                           |
|---------------------------|------------|-----------------|------------------------------------------------------------------------------------------------------------------------|
| Abs                       | Y          | FP32, FP16, INT32 |
| Acos                      | Y          | FP32, FP16 |
| Acosh                     | Y          | FP32, FP16 |
| Add                       | Y          | FP32, FP16, INT32 |
| And                       | Y          | BOOL | 
| ArgMax                    | Y          | FP32, FP16 |
| ArgMin                    | Y          | FP32, FP16 |
| Asin                      | Y          | FP32, FP16 |
| Asinh                     | Y          | FP32, FP16 |
| Atan                      | Y          | FP32, FP16 |
| Atanh                     | Y          | FP32, FP16 |
| AveragePool               | Y          | FP32, FP16, INT8, INT32 | 2D or 3D Pooling only                                                                                                                    |
| BatchNormalization        | Y          | FP32, FP16 |
| BitShift                  | N          |
| Cast                      | Y          | FP32, FP16, INT32, INT8, BOOL |                                                                                                       |
| Ceil                      | Y          | FP32, FP16 |
| Celu                      | Y          | FP32, FP16 |
| Clip                      | Y          | FP32, FP16, INT8 | `min` and `max` clip values must be initializers                                                                                         |
| Compress                  | N          |
| Concat                    | Y          | FP32, FP16, INT32, INT8, BOOL |
| ConcatFromSequence        | N          |
| Constant                  | Y          | FP32, FP16, INT32, INT8, BOOL |
| ConstantOfShape           | Y          | FP32 |
| Conv                      | Y          | FP32, FP16, INT8 | 2D or 3D convolutions only                                                                                                               |
| ConvInteger               | N          |
| ConvTranspose             | Y          | FP32, FP16, INT8 | 2D or 3D deconvolutions only\. Weights `W` must be an initializer                                                                        |
| Cos                       | Y          | FP32, FP16 |
| Cosh                      | Y          | FP32, FP16 |
| CumSum                    | Y          | FP32, FP16 | `axis` must be an initializer                                                                                                            |
| DepthToSpace              | Y          | FP32, FP16, INT32 |
| DequantizeLinear          | Y          | INT8 | `x_zero_point` must be zero                                                                                    |
| Det                       | N          |
| Div                       | Y          | FP32, FP16, INT32 |
| Dropout                   | Y          | FP32, FP16 |
| DynamicQuantizeLinear     | N          |
| Einsum                    | Y          | FP32, FP16 | Ellipsis and diagonal operations are not supported.
| Elu                       | Y          | FP32, FP16, INT8 |
| Equal                     | Y          | FP32, FP16, INT32 |
| Erf                       | Y          | FP32, FP16 |
| Exp                       | Y          | FP32, FP16 |
| Expand                    | Y          | FP32, FP16, INT32, BOOL |
| EyeLike                   | Y          | FP32, FP16, INT32, BOOL |
| Flatten                   | Y          | FP32, FP16, INT32, BOOL |
| Floor                     | Y          | FP32, FP16 |
| Gather                    | Y          | FP32, FP16, INT8, INT32 |
| GatherElements            | Y          | FP32, FP16, INT8, INT32 |
| GatherND                  | Y          | FP32, FP16, INT8, INT32 |
| Gemm                      | Y          | FP32, FP16, INT8 |
| GlobalAveragePool         | Y          | FP32, FP16, INT8 |
| GlobalLpPool              | Y          | FP32, FP16, INT8 |
| GlobalMaxPool             | Y          | FP32, FP16, INT8 |
| Greater                   | Y          | FP32, FP16, INT32 |
| GreaterOrEqual            | Y          | FP32, FP16, INT32 |
| GRU                       | Y          | FP32, FP16 | For bidirectional GRUs, activation functions must be the same for both the forward and reverse pass
| HardSigmoid               | Y          | FP32, FP16, INT8 |
| Hardmax                   | N          |
| Identity                  | Y          | FP32, FP16, INT32, INT8, BOOL |
| If                        | Y          | FP32, FP16, INT32, BOOL | Output tensors of the two conditional branches must have broadcastable shapes, and must have different names
| ImageScaler               | Y          | FP32, FP16 |
| InstanceNormalization     | Y          | FP32, FP16 | Scales `scale` and biases `B` must be initializers. Input rank must be >=3 & <=5                                                                                  |
| IsInf                     | N          |
| IsNaN                     | Y          | FP32, FP16, INT32 |
| LeakyRelu                 | Y          | FP32, FP16, INT8 |
| Less                      | Y          | FP32, FP16, INT32 |
| LessOrEqual               | Y          | FP32, FP16, INT32 |
| Log                       | Y          | FP32, FP16 |
| LogSoftmax                | Y          | FP32, FP16 |
| Loop                      | Y          | FP32, FP16, INT32, BOOL |
| LRN                       | Y          | FP32, FP16 |
| LSTM                      | Y          | FP32, FP16 | For bidirectional LSTMs, activation functions must be the same for both the forward and reverse pass
| LpNormalization           | Y          | FP32, FP16 |
| LpPool                    | Y          | FP32, FP16, INT8 |
| MatMul                    | Y          | FP32, FP16 |
| MatMulInteger             | N          |
| Max                       | Y          | FP32, FP16, INT32 |
| MaxPool                   | Y          | FP32, FP16, INT8 |
| MaxRoiPool                | N          |
| MaxUnpool                 | N          |
| Mean                      | Y          | FP32, FP16, INT32 |
| MeanVarianceNormalization | N          |
| Min                       | Y          | FP32, FP16, INT32 |
| Mod                       | N          |
| Mul                       | Y          | FP32, FP16, INT32 |
| Multinomial               | N          |
| Neg                       | Y          | FP32, FP16, INT32 |
| NegativeLogLikelihoodLoss | N          |
| NonMaxSuppression         | Y [EXPERIMENTAL] | FP32, FP16 | Inputs `max_output_boxes_per_class`, `iou_threshold`, and `score_threshold` must be initializers. Output has fixed shape and is padded to [`max_output_boxes_per_class`, 3].
| NonZero                   | N          |
| Not                       | Y          | BOOL |
| OneHot                    | N          |
| Or                        | Y          | BOOL |
| Pad                       | Y          | FP32, FP16, INT8, INT32 |
| ParametricSoftplus        | Y          | FP32, FP16, INT8 |
| Pow                       | Y          | FP32, FP16 |
| PRelu                     | Y          | FP32, FP16, INT8 |
| QLinearConv               | N          |
| QLinearMatMul             | N          |
| QuantizeLinear            | Y          | FP32, FP16 | `y_zero_point` must be 0                                                                   |
| RandomNormal              | N          |
| RandomNormalLike          | N          |
| RandomUniform             | Y          | FP32, FP16 |
| RandomUniformLike         | Y          | FP32, FP16 |
| Range                     | Y          | FP32, FP16, INT32 | Floating point inputs are only supported if `start`, `limit`, and `delta` inputs are initializers                                                 |
| Reciprocal                | N          |
| ReduceL1                  | Y          | FP32, FP16 |
| ReduceL2                  | Y          | FP32, FP16 |
| ReduceLogSum              | Y          | FP32, FP16 |
| ReduceLogSumExp           | Y          | FP32, FP16 |
| ReduceMax                 | Y          | FP32, FP16 |
| ReduceMean                | Y          | FP32, FP16 |
| ReduceMin                 | Y          | FP32, FP16 |
| ReduceProd                | Y          | FP32, FP16 |
| ReduceSum                 | Y          | FP32, FP16 |
| ReduceSumSquare           | Y          | FP32, FP16 |
| Relu                      | Y          | FP32, FP16, INT8 |
| Reshape                   | Y          | FP32, FP16, INT32, INT8, BOOL |
| Resize                    | Y          | FP32, FP16 | Supported resize transformation modes: `half_pixel`, `pytorch_half_pixel`, `tf_half_pixel_for_nn`, `asymmetric`, and `align_corners`.<br />Supported resize modes: `nearest`, `linear`.<br />Supported nearest modes: `floor`, `ceil`, `round_prefer_floor`, `round_prefer_ceil`   |
| ReverseSequence           | Y          | FP32, FP16 | Dynamic input shapes are unsupported
| RNN                       | Y          | FP32, FP16 | For bidirectional RNNs, activation functions must be the same for both the forward and reverse pass
| RoiAlign                  | N          |
| Round                     | Y          | FP32, FP16, INT8 |
| ScaledTanh                | Y          | FP32, FP16, INT8 |
| Scan                      | Y          | FP32, FP16 |
| Scatter                   | Y          | FP32, FP16, INT8, INT32 |
| ScatterElements           | Y          | FP32, FP16, INT8, INT32 |
| ScatterND                 | Y          | FP32, FP16, INT8, INT32 |
| Selu                      | Y          | FP32, FP16, INT8|
| SequenceAt                | N          |
| SequenceConstruct         | N          |
| SequenceEmpty             | N          |
| SequenceErase             | N          |
| SequenceInsert            | N          |
| SequenceLength            | N          |
| Shape                     | Y          | FP32, FP16, INT32, INT8, BOOL |
| Shrink                    | N          |
| Sigmoid                   | Y          | FP32, FP16, INT8 |
| Sign                      | Y          | FP32, FP16, INT8, INT32 |
| Sin                       | Y          | FP32, FP16 |
| Sinh                      | Y          | FP32, FP16 |
| Size                      | Y          | FP32, FP16, INT32, INT8, BOOL |
| Slice                     | Y          | FP32, FP16, INT32, INT8, BOOL | `axes` must be an initializer                                                                                                            |
| Softmax                   | Y          | FP32, FP16 |
| SoftmaxCrossEntropyLoss   | N          |
| Softplus                  | Y          | FP32, FP16, INT8 |
| Softsign                  | Y          | FP32, FP16, INT8 |
| SpaceToDepth              | Y          | FP32, FP16, INT32 |
| Split                     | Y          | FP32, FP16, INT32, BOOL | `split` must be an initializer                                                                                                           |
| SplitToSequence           | N          |
| Sqrt                      | Y          | FP32, FP16 |
| Squeeze                   | Y          | FP32, FP16, INT32, INT8, BOOL | `axes` must be an initializer                                                                                                            |
| StringNormalizer          | N          |
| Sub                       | Y          | FP32, FP16, INT32 |
| Sum                       | Y          | FP32, FP16, INT32 |
| Tan                       | Y          | FP32, FP16 |
| Tanh                      | Y          | FP32, FP16, INT8 |
| TfIdfVectorizer           | N          |
| ThresholdedRelu           | Y          | FP32, FP16, INT8 |
| Tile                      | Y          | FP32, FP16, INT32, BOOL |
| TopK                      | Y          | FP32, FP16 |
| Transpose                 | Y          | FP32, FP16, INT32, INT8, BOOL |
| Unique                    | N          |
| Unsqueeze                 | Y          | FP32, FP16, INT32, INT8, BOOL | `axes` must be a constant tensor                                                                                                         |
| Upsample                  | Y          | FP32, FP16 |
| Where                     | Y          | FP32, FP16, INT32, BOOL |
| Xor                       | N          |

Note: the ONNX parser is maintained by a third-party open-source community, so its operator coverage is much more complete than the Caffe parser's.

Other support information can be found on the Support Matrix page of the official documentation:

Support Matrix :: NVIDIA Deep Learning TensorRT Documentation

2. Workflow

The general workflow for accelerating a neural network with TensorRT, as described in the official Python API documentation (a minimal Python sketch follows the list):

The general TensorRT workflow consists of 3 steps:

  1. Populate a tensorrt.INetworkDefinition either with a parser or by using the TensorRT Network API (see tensorrt.INetworkDefinition for more details). The tensorrt.Builder can be used to generate an empty tensorrt.INetworkDefinition.

  2. Use the tensorrt.Builder to build a tensorrt.ICudaEngine using the populated tensorrt.INetworkDefinition.

  3. Create a tensorrt.IExecutionContext from the tensorrt.ICudaEngine and use it to perform optimized inference.
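
As a concrete illustration of the three steps, here is a minimal Python sketch. It assumes an ONNX model file; allocating device buffers and actually feeding inputs (e.g. with pycuda) is omitted:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path):
    # Step 1: populate an INetworkDefinition with the ONNX parser.
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse the ONNX model")

    # Step 2: use the Builder to build a CUDA engine from the network.
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB of build-time scratch space
    serialized_engine = builder.build_serialized_network(network, config)

    # Step 3: deserialize the engine and create an execution context
    # that performs the optimized inference.
    runtime = trt.Runtime(TRT_LOGGER)
    engine = runtime.deserialize_cuda_engine(serialized_engine)
    context = engine.create_execution_context()
    return engine, context
```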

The process can also be split into two phases: build and runtime.

1. build: Import and optimize trained models to generate inference engines

The build phase performs the model conversion (e.g., from Caffe or TensorFlow to TensorRT); during conversion it applies the optimizations described later, such as layer fusion and precision calibration. The output of this step is a TensorRT model optimized for a specific GPU platform and network, which can be serialized to disk or to memory. The file stored on disk is called a plan file.


2. runtime (deploy): Generate runtime inference engine for inference

The runtime (deploy) phase performs the actual inference: the plan file produced in the previous step is first deserialized to create a runtime engine, after which input data (e.g., images from a test set or elsewhere) can be fed in to produce classification vectors or detection results.
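
In code, the round trip through a plan file is just serialization plus deserialization; a minimal sketch reusing `serialized_engine` and `TRT_LOGGER` from the snippet above:

```python
# Serialize the optimized engine to a plan file on disk ...
with open("model.plan", "wb") as f:
    f.write(serialized_engine)

# ... and later, in the deploy environment, load it back and
# create an execution context for inference.
with open("model.plan", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
```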


A separate note on TensorRT's optimization mechanisms:

Reference article: TensorRT-优化-原理 - 吴建明wujianming - 博客园 (cnblogs.com)

TensorRT's optimizations mainly include the following, of which the first two matter most.

  • Layer & Tensor Fusion

Take the computation graph of a GoogLeNet Inception module as an example. The structure contains many layers, and at inference time every layer's operation is performed by the GPU, which launches different CUDA (Compute Unified Device Architecture) kernels to do the work. The kernels compute tensors very quickly, but a large share of the time is wasted on kernel launches and on reading/writing each layer's input/output tensors, which creates a memory-bandwidth bottleneck and wastes GPU resources. TensorRT greatly reduces the number of layers by fusing them vertically and horizontally. Vertical fusion merges convolution, bias, and activation layers into a single CBR structure (so named because the convolution, bias, and ReLU layers are fused to form a single layer), which occupies only one CUDA kernel. Horizontal fusion merges layers that have the same structure but different weights into a single wider layer, again occupying only one CUDA kernel. The fused computation graph has fewer layers and uses fewer CUDA kernels, so the whole model becomes smaller, faster, and more efficient.

  • Weight & Activation Precision Calibration

Most deep learning frameworks train neural networks with 32-bit floating-point tensors (full 32-bit precision, FP32). Once training is finished, backpropagation is no longer needed for inference, so the data precision can safely be reduced, for example to FP16 or INT8. Lower precision means lower memory usage and latency and a smaller model.

The dynamic ranges of the different precisions are shown below:

| Precision | Dynamic Range           |
|-----------|-------------------------|
| FP32      | −3.4×10³⁸ ~ +3.4×10³⁸   |
| FP16      | −65504 ~ +65504         |
| INT8      | −128 ~ +127             |
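
To make the limited INT8 range concrete, here is a toy example of symmetric per-tensor INT8 quantization (max-style calibration; this is an illustration only, not TensorRT's internal algorithm):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.7, 1.9], dtype=np.float32)

# Choose a scale that maps the observed FP32 range onto [-127, 127].
scale = np.abs(x).max() / 127.0

# Quantize, then dequantize to see the rounding error.
q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
x_hat = q.astype(np.float32) * scale

print(q)      # [-127  -32    0   44  121]
print(x_hat)  # an approximation of x with small rounding error
```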

INT8 has only 256 distinct values, so representing FP32-precision values in INT8 inevitably loses information and can degrade accuracy. TensorRT, however, provides a fully automated calibration process that maps FP32 data down to INT8 with the best possible fit, minimizing the loss. Four calibration algorithms are available (a sketch of a custom calibrator follows the list below):

IInt8Calibrator — NVIDIA TensorRT Standard Python API Documentation 8.2.0: https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Int8/Calibrator.html

tensorrt.CalibrationAlgoType: the version of the calibration algorithm to use. Members:

  • LEGACY_CALIBRATION
  • ENTROPY_CALIBRATION
  • ENTROPY_CALIBRATION_2
  • MINMAX_CALIBRATION
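
For ENTROPY_CALIBRATION_2, the usual pattern in the Python API is to subclass tensorrt.IInt8EntropyCalibrator2. Below is a minimal sketch; the calibration batches, the cache file name, and the use of pycuda for device buffers are all assumptions for illustration:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)          # list of NCHW float32 arrays
        self.cache_file = cache_file
        self.batch_size = batches[0].shape[0]
        self.device_input = cuda.mem_alloc(batches[0].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None  # tells TensorRT the calibration data is exhausted
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    # TensorRT caches the computed scales so later builds can skip calibration.
    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

The calibrator is then attached to the builder config before building the engine, via `config.set_flag(trt.BuilderFlag.INT8)` and `config.int8_calibrator = calibrator`.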

  • Kernel Auto-Tuning

During inference, the model's computations are executed by the GPU's CUDA kernels. TensorRT can tune the choice of CUDA kernels for different algorithms, network models, and GPU platforms, ensuring the model runs with optimal performance on the target platform.

TensorRT will pick the implementation from a library of kernels that delivers the best performance for the target GPU, input data size, filter size, tensor layout, batch size and other parameters.

  • Dynamic Tensor Memory

TensorRT assigns device memory to each tensor only for its period of use, which avoids repeated allocations, reduces memory footprint, and improves memory reuse.

  • Multi-Stream Execution

Scalable design to process multiple input streams in parallel; this is presumably an optimization at the GPU level.

3. Case Study

This article uses a Paddle-TRT case study to try to accelerate a Paddle model:

Paddle Inference documentation, Paddle-TRT page: https://paddleinference.paddlepaddle.org.cn/optimize/paddle_trt.html

Practice 1: Porting OpenPose to the PaddlePaddle framework and accelerating it with TRT

PaddlePaddle's compatibility with Caffe is not as good as with ONNX, so the first step is to convert the OpenPose model from Caffe format to ONNX, and then convert it to a Paddle model with the x2paddle tool.
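
For the second, ONNX-to-Paddle step, x2paddle is normally driven from the command line; below is a sketch that simply shells out to it from Python (the file names are hypothetical):

```python
import subprocess

# Convert an ONNX model into a Paddle inference model directory.
# "openpose.onnx" and "pd_model" are placeholder names.
subprocess.run(
    ["x2paddle",
     "--framework=onnx",
     "--model=openpose.onnx",
     "--save_dir=pd_model"],
    check=True)
```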

For the Caffe-to-ONNX step, the tool provided in this article is recommended:

【caffe】Caffe模型转换为ONNX模型(新版)_花丸大老师的博客-CSDN博客: https://blog.csdn.net/u013597931/article/details/85236288

After the model is converted, first test whether it runs correctly on the GPU. The test code is as follows:

```python
import numpy as np
import paddle.inference as paddle_infer
import time

def create_predictor():
    config = paddle_infer.Config("/data/models/paddle/pose/inference_model/model.pdmodel", "/data/models/paddle/pose/inference_model/model.pdiparams")
    config.enable_use_gpu(1000, 0)  # 1000 MB initial GPU memory pool, device 0
    config.switch_ir_optim()        # enable IR graph optimizations
    config.enable_memory_optim()

    predictor = paddle_infer.create_predictor(config)
    return predictor

def run(predictor, img):
    # Prepare the inputs
    input_names = predictor.get_input_names()
    for i, name in enumerate(input_names):
        input_tensor = predictor.get_input_handle(name)
        input_tensor.reshape(img[i].shape)
        input_tensor.copy_from_cpu(img[i].copy())
    # Run inference
    predictor.run()
    results = []
    # Fetch the outputs
    output_names = predictor.get_output_names()
    for i, name in enumerate(output_names):
        output_tensor = predictor.get_output_handle(name)
        output_data = output_tensor.copy_to_cpu()
        results.append(output_data)
    return results

if __name__ == '__main__':
    pred = create_predictor()
    img = np.ones((1, 3, 480, 640)).astype(np.float32)
    while True:
        start = time.time()
        result = run(pred, [img])
        end = time.time()
        print("run time: ", end - start)
```

Next, write the OpenPose post-processing code and test it.

The post-processing uses the official OpenPose C++ code, wrapped for Python with pybind11.
