TensorRT Learning and Paddle-TRT Practice (Part 1)

TensorRT is NVIDIA's deep learning inference acceleration library; it supports Caffe and ONNX models and provides broad operator coverage. Its workflow consists of building a network definition, generating a CUDA engine, and executing inference, and it improves model performance through optimizations such as layer fusion and data precision calibration. The case study demonstrates converting a Caffe model to ONNX and then to a Paddle model, and accelerating it with Paddle-TRT.

GitHub source code: https://github.com/NVIDIA/TensorRT

Official documentation:

NVIDIA Deep Learning TensorRT Documentation: https://docs.nvidia.com/deeplearning/tensorrt/index.html

1. Operator Support

As of October 19, 2021, TensorRT had been updated to version 8.2.0. Its operator support (listed in the parsers folder of the source tree) is as follows:

Caffe:

    {"Convolution", parseConvolution},
    {"Pooling", parsePooling},
    {"InnerProduct", parseInnerProduct},
    {"ReLU", parseReLU},
    {"Softmax", parseSoftMax},
    {"SoftmaxWithLoss", parseSoftMax},
    {"LRN", parseLRN},
    {"Power", parsePower},
    {"Eltwise", parseEltwise},
    {"Concat", parseConcat},
    {"Deconvolution", parseDeconvolution},
    {"Sigmoid", parseSigmoid},
    {"TanH", parseTanH},
    {"BatchNorm", parseBatchNormalization},
    {"Scale", parseScale},
    {"Crop", parseCrop},
    {"Reduction", parseReduction},
    {"Reshape", parseReshape},
    {"Permute", parsePermute},
    {"ELU", parseELU},
    {"BNLL", parseBNLL},
    {"Clip", parseClip},
    {"AbsVal", parseAbsVal},
    {"PReLU", parsePReLU}

ONNX:

| Operator                  | Supported  | Supported Types | Restrictions                                                                                                           |
|---------------------------|------------|-----------------|------------------------------------------------------------------------------------------------------------------------|
| Abs                       | Y          | FP32, FP16, INT32 |
| Acos                      | Y          | FP32, FP16 |
| Acosh                     | Y          | FP32, FP16 |
| Add                       | Y          | FP32, FP16, INT32 |
| And                       | Y          | BOOL | 
| ArgMax                    | Y          | FP32, FP16 |
| ArgMin                    | Y          | FP32, FP16 |
| Asin                      | Y          | FP32, FP16 |
| Asinh                     | Y          | FP32, FP16 |
| Atan                      | Y          | FP32, FP16 |
| Atanh                     | Y          | FP32, FP16 |
| AveragePool               | Y          | FP32, FP16, INT8, INT32 | 2D or 3D Pooling only                                                                                                                    |
| BatchNormalization        | Y          | FP32, FP16 |
| BitShift                  | N          |
| Cast                      | Y          | FP32, FP16, INT32, INT8, BOOL |                                                                                                       |
| Ceil                      | Y          | FP32, FP16 |
| Celu                      | Y          | FP32, FP16 |
| Clip                      | Y          | FP32, FP16, INT8 | `min` and `max` clip values must be initializers                                                                                         |
| Compress                  | N          |
| Concat                    | Y          | FP32, FP16, INT32, INT8, BOOL |
| ConcatFromSequence        | N          |
| Constant                  | Y          | FP32, FP16, INT32, INT8, BOOL |
| ConstantOfShape           | Y          | FP32 |
| Conv                      | Y          | FP32, FP16, INT8 | 2D or 3D convolutions only                                                                                                               |
| ConvInteger               | N          |
| ConvTranspose             | Y          | FP32, FP16, INT8 | 2D or 3D deconvolutions only\. Weights `W` must be an initializer                                                                        |
| Cos                       | Y          | FP32, FP16 |
| Cosh                      | Y          | FP32, FP16 |
| CumSum                    | Y          | FP32, FP16 | `axis` must be an initializer                                                                                                            |
| DepthToSpace              | Y          | FP32, FP16, INT32 |
| DequantizeLinear          | Y          | INT8 | `x_zero_point` must be zero                                                                                    |
| Det                       | N          |
| Div                       | Y          | FP32, FP16, INT32 |
| Dropout                   | Y          | FP32, FP16 |
| DynamicQuantizeLinear     | N          |
| Einsum                    | Y          | FP32, FP16 | Ellipsis and diagonal operations are not supported.
| Elu                       | Y          | FP32, FP16, INT8 |
| Equal                     | Y          | FP32, FP16, INT32 |
| Erf                       | Y          | FP32, FP16 |
| Exp                       | Y          | FP32, FP16 |
| Expand                    | Y          | FP32, FP16, INT32, BOOL |
| EyeLike                   | Y          | FP32, FP16, INT32, BOOL |
| Flatten                   | Y          | FP32, FP16, INT32, BOOL |
| Floor                     | Y          | FP32, FP16 |
| Gather                    | Y          | FP32, FP16, INT8, INT32 |
| GatherElements            | Y          | FP32, FP16, INT8, INT32 |
| GatherND                  | Y          | FP32, FP16, INT8, INT32 |
| Gemm                      | Y          | FP32, FP16, INT8 |
| GlobalAveragePool         | Y          | FP32, FP16, INT8 |
| GlobalLpPool              | Y          | FP32, FP16, INT8 |
| GlobalMaxPool             | Y          | FP32, FP16, INT8 |
| Greater                   | Y          | FP32, FP16, INT32 |
| GreaterOrEqual            | Y          | FP32, FP16, INT32 |
| GRU                       | Y          | FP32, FP16 | For bidirectional GRUs, activation functions must be the same for both the forward and reverse pass
| HardSigmoid               | Y          | FP32, FP16, INT8 |
| Hardmax                   | N          |
| Identity                  | Y          | FP32, FP16, INT32, INT8, BOOL |
| If                        | Y          | FP32, FP16, INT32, BOOL | Output tensors of the two conditional branches must have broadcastable shapes, and must have different names
| ImageScaler               | Y          | FP32, FP16 |
| InstanceNormalization     | Y          | FP32, FP16 | Scales `scale` and biases `B` must be initializers. Input rank must be >=3 & <=5                                                                                  |
| IsInf                     | N          |
| IsNaN                     | Y          | FP32, FP16, INT32 |
| LeakyRelu                 | Y          | FP32, FP16, INT8 |
| Less                      | Y          | FP32, FP16, INT32 |
| LessOrEqual               | Y          | FP32, FP16, INT32 |
| Log                       | Y          | FP32, FP16 |
| LogSoftmax                | Y          | FP32, FP16 |
| Loop                      | Y          | FP32, FP16, INT32, BOOL |
| LRN                       | Y          | FP32, FP16 |
| LSTM                      | Y          | FP32, FP16 | For bidirectional LSTMs, activation functions must be the same for both the forward and reverse pass
| LpNormalization           | Y          | FP32, FP16 |
| LpPool                    | Y          | FP32, FP16, INT8 |
| MatMul                    | Y          | FP32, FP16 |
| MatMulInteger             | N          |
| Max                       | Y          | FP32, FP16, INT32 |
| MaxPool                   | Y          | FP32, FP16, INT8 |
| MaxRoiPool                | N          |
| MaxUnpool                 | N          |
| Mean                      | Y          | FP32, FP16, INT32 |
| MeanVarianceNormalization | N          |
| Min                       | Y          | FP32, FP16, INT32 |
| Mod                       | N          |
| Mul                       | Y          | FP32, FP16, INT32 |
| Multinomial               | N          |
| Neg                       | Y          | FP32, FP16, INT32 |
| NegativeLogLikelihoodLoss | N          |
| NonMaxSuppression         | Y [EXPERIMENTAL] | FP32, FP16 | Inputs `max_output_boxes_per_class`, `iou_threshold`, and `score_threshold` must be initializers. Output has fixed shape and is padded to [`max_output_boxes_per_class`, 3].
| NonZero                   | N          |
| Not                       | Y          | BOOL |
| OneHot                    | N          |
| Or                        | Y          | BOOL |
| Pad                       | Y          | FP32, FP16, INT8, INT32 |
| ParametricSoftplus        | Y          | FP32, FP16, INT8 |
| Pow                       | Y          | FP32, FP16 |
| PRelu                     | Y          | FP32, FP16, INT8 |
| QLinearConv               | N          |
| QLinearMatMul             | N          |
| QuantizeLinear            | Y          | FP32, FP16 | `y_zero_point` must be 0                                                                   |
| RandomNormal              | N          |
| RandomNormalLike          | N          |
| RandomUniform             | Y          | FP32, FP16 |
| RandomUniformLike         | Y          | FP32, FP16 |
| Range                     | Y          | FP32, FP16, INT32 | Floating point inputs are only supported if `start`, `limit`, and `delta` inputs are initializers                                                 |
| Reciprocal                | N          |
| ReduceL1                  | Y          | FP32, FP16 |
| ReduceL2                  | Y          | FP32, FP16 |
| ReduceLogSum              | Y          | FP32, FP16 |
| ReduceLogSumExp           | Y          | FP32, FP16 |
| ReduceMax                 | Y          | FP32, FP16 |
| ReduceMean                | Y          | FP32, FP16 |
| ReduceMin                 | Y          | FP32, FP16 |
| ReduceProd                | Y          | FP32, FP16 |
| ReduceSum                 | Y          | FP32, FP16 |
| ReduceSumSquare           | Y          | FP32, FP16 |
| Relu                      | Y          | FP32, FP16, INT8 |
| Reshape                   | Y          | FP32, FP16, INT32, INT8, BOOL |
| Resize                    | Y          | FP32, FP16 | Supported resize transformation modes: `half_pixel`, `pytorch_half_pixel`, `tf_half_pixel_for_nn`, `asymmetric`, and `align_corners`.<br />Supported resize modes: `nearest`, `linear`.<br />Supported nearest modes: `floor`, `ceil`, `round_prefer_floor`, `round_prefer_ceil`   |
| ReverseSequence           | Y          | FP32, FP16 | Dynamic input shapes are unsupported
| RNN                       | Y          | FP32, FP16 | For bidirectional RNNs, activation functions must be the same for both the forward and reverse pass
| RoiAlign                  | N          |
| Round                     | Y          | FP32, FP16, INT8 |
| ScaledTanh                | Y          | FP32, FP16, INT8 |
| Scan                      | Y          | FP32, FP16 |
| Scatter                   | Y          | FP32, FP16, INT8, INT32 |
| ScatterElements           | Y          | FP32, FP16, INT8, INT32 |
| ScatterND                 | Y          | FP32, FP16, INT8, INT32 |
| Selu                      | Y          | FP32, FP16, INT8|
| SequenceAt                | N          |
| SequenceConstruct         | N          |
| SequenceEmpty             | N          |
| SequenceErase             | N          |
| SequenceInsert            | N          |
| SequenceLength            | N          |
| Shape                     | Y          | FP32, FP16, INT32, INT8, BOOL |
| Shrink                    | N          |
| Sigmoid                   | Y          | FP32, FP16, INT8 |
| Sign                      | Y          | FP32, FP16, INT8, INT32 |
| Sin                       | Y          | FP32, FP16 |
| Sinh                      | Y          | FP32, FP16 |
| Size                      | Y          | FP32, FP16, INT32, INT8, BOOL |
| Slice                     | Y          | FP32, FP16, INT32, INT8, BOOL | `axes` must be an initializer                                                                                                            |
| Softmax                   | Y          | FP32, FP16 |
| SoftmaxCrossEntropyLoss   | N          |
| Softplus                  | Y          | FP32, FP16, INT8 |
| Softsign                  | Y          | FP32, FP16, INT8 |
| SpaceToDepth              | Y          | FP32, FP16, INT32 |
| Split                     | Y          | FP32, FP16, INT32, BOOL | `split` must be an initializer                                                                                                           |
| SplitToSequence           | N          |
| Sqrt                      | Y          | FP32, FP16 |
| Squeeze                   | Y          | FP32, FP16, INT32, INT8, BOOL | `axes` must be an initializer                                                                                                            |
| StringNormalizer          | N          |
| Sub                       | Y          | FP32, FP16, INT32 |
| Sum                       | Y          | FP32, FP16, INT32 |
| Tan                       | Y          | FP32, FP16 |
| Tanh                      | Y          | FP32, FP16, INT8 |
| TfIdfVectorizer           | N          |
| ThresholdedRelu           | Y          | FP32, FP16, INT8 |
| Tile                      | Y          | FP32, FP16, INT32, BOOL |
| TopK                      | Y          | FP32, FP16 |
| Transpose                 | Y          | FP32, FP16, INT32, INT8, BOOL |
| Unique                    | N          |
| Unsqueeze                 | Y          | FP32, FP16, INT32, INT8, BOOL | `axes` must be a constant tensor                                                                                                         |
| Upsample                  | Y          | FP32, FP16 |
| Where                     | Y          | FP32, FP16, INT32, BOOL |
| Xor                       | N          |

Note: the ONNX parser is maintained by a third-party open-source community, so its operator coverage is much more complete than the Caffe parser's.

Other support information can be found on the Support Matrix page of the official documentation:

Support Matrix :: NVIDIA Deep Learning TensorRT Documentation

2. Workflow

The general workflow for accelerating a neural network with TensorRT, as described in the official Python API documentation (a minimal Python sketch follows the list):

The general TensorRT workflow consists of 3 steps:

  1. Populate a tensorrt.INetworkDefinition either with a parser or by using the TensorRT Network API (see tensorrt.INetworkDefinition for more details). The tensorrt.Builder can be used to generate an empty tensorrt.INetworkDefinition.

  2. Use the tensorrt.Builder to build a tensorrt.ICudaEngine using the populated tensorrt.INetworkDefinition.

  3. Create a tensorrt.IExecutionContext from the tensorrt.ICudaEngine and use it to perform optimized inference.
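
As a concrete illustration of the three steps, here is a minimal Python sketch. It assumes an ONNX model file; allocating device buffers and actually feeding inputs (e.g. with pycuda) is omitted:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path):
    # Step 1: populate an INetworkDefinition with the ONNX parser.
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse the ONNX model")

    # Step 2: use the Builder to build a CUDA engine from the network.
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB of build-time scratch space
    serialized_engine = builder.build_serialized_network(network, config)

    # Step 3: deserialize the engine and create an execution context
    # that performs the optimized inference.
    runtime = trt.Runtime(TRT_LOGGER)
    engine = runtime.deserialize_cuda_engine(serialized_engine)
    context = engine.create_execution_context()
    return engine, context
```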

The process can also be split into two phases: build and runtime.

1. build: Import and optimize trained models to generate inference engines

The build phase performs the model conversion (e.g., from Caffe or TensorFlow to TensorRT); during conversion it applies the optimizations described later, such as layer fusion and precision calibration. The output of this step is a TensorRT model optimized for a specific GPU platform and network, which can be serialized to disk or to memory. The file stored on disk is called a plan file.


2. runtime (deploy): Generate runtime inference engine for inference

The runtime (deploy) phase performs the actual inference: the plan file produced in the previous step is first deserialized to create a runtime engine, after which input data (e.g., images from a test set or elsewhere) can be fed in to produce classification vectors or detection results.
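
In code, the round trip through a plan file is just serialization plus deserialization; a minimal sketch reusing `serialized_engine` and `TRT_LOGGER` from the snippet above:

```python
# Serialize the optimized engine to a plan file on disk ...
with open("model.plan", "wb") as f:
    f.write(serialized_engine)

# ... and later, in the deploy environment, load it back and
# create an execution context for inference.
with open("model.plan", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
```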


A separate note on TensorRT's optimization mechanisms:

Reference article: TensorRT-优化-原理 - 吴建明wujianming - 博客园 (cnblogs.com)

TensorRT's optimizations mainly include the following, of which the first two matter most.

  • Layer & Tensor Fusion

Take the computation graph of a GoogLeNet Inception module as an example. The structure contains many layers, and at inference time every layer's operation is performed by the GPU, which launches different CUDA (Compute Unified Device Architecture) kernels to do the work. The kernels compute tensors very quickly, but a large share of the time is wasted on kernel launches and on reading/writing each layer's input/output tensors, which creates a memory-bandwidth bottleneck and wastes GPU resources. TensorRT greatly reduces the number of layers by fusing them vertically and horizontally. Vertical fusion merges convolution, bias, and activation layers into a single CBR structure (so named because the convolution, bias, and ReLU layers are fused to form a single layer), which occupies only one CUDA kernel. Horizontal fusion merges layers that have the same structure but different weights into a single wider layer, again occupying only one CUDA kernel. The fused computation graph has fewer layers and uses fewer CUDA kernels, so the whole model becomes smaller, faster, and more efficient.

  • Weight & Activation Precision Calibration

Most deep learning frameworks train neural networks with 32-bit floating-point tensors (full 32-bit precision, FP32). Once training is finished, backpropagation is no longer needed for inference, so the data precision can safely be reduced, for example to FP16 or INT8. Lower precision means lower memory usage and latency and a smaller model.

The dynamic ranges of the different precisions are shown below:

| Precision | Dynamic Range           |
|-----------|-------------------------|
| FP32      | −3.4×10³⁸ ~ +3.4×10³⁸   |
| FP16      | −65504 ~ +65504         |
| INT8      | −128 ~ +127             |
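
To make the limited INT8 range concrete, here is a toy example of symmetric per-tensor INT8 quantization (max-style calibration; this is an illustration only, not TensorRT's internal algorithm):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.7, 1.9], dtype=np.float32)

# Choose a scale that maps the observed FP32 range onto [-127, 127].
scale = np.abs(x).max() / 127.0

# Quantize, then dequantize to see the rounding error.
q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
x_hat = q.astype(np.float32) * scale

print(q)      # [-127  -32    0   44  121]
print(x_hat)  # an approximation of x with small rounding error
```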

INT8 has only 256 distinct values, so representing FP32-precision values in INT8 inevitably loses information and can degrade accuracy. TensorRT, however, provides a fully automated calibration process that maps FP32 data down to INT8 with the best possible fit, minimizing the loss. Four calibration algorithms are available (a sketch of a custom calibrator follows the list below):

IInt8Calibrator — NVIDIA TensorRT Standard Python API Documentation 8.2.0: https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Int8/Calibrator.html

tensorrt.CalibrationAlgoType: the version of the calibration algorithm to use. Members:

  • LEGACY_CALIBRATION
  • ENTROPY_CALIBRATION
  • ENTROPY_CALIBRATION_2
  • MINMAX_CALIBRATION
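
For ENTROPY_CALIBRATION_2, the usual pattern in the Python API is to subclass tensorrt.IInt8EntropyCalibrator2. Below is a minimal sketch; the calibration batches, the cache file name, and the use of pycuda for device buffers are all assumptions for illustration:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)          # list of NCHW float32 arrays
        self.cache_file = cache_file
        self.batch_size = batches[0].shape[0]
        self.device_input = cuda.mem_alloc(batches[0].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None  # tells TensorRT the calibration data is exhausted
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    # TensorRT caches the computed scales so later builds can skip calibration.
    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

The calibrator is then attached to the builder config before building the engine, via `config.set_flag(trt.BuilderFlag.INT8)` and `config.int8_calibrator = calibrator`.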

  • Kernel Auto-Tuning

During inference, the model's computations are executed by the GPU's CUDA kernels. TensorRT can tune the choice of CUDA kernels for different algorithms, network models, and GPU platforms, ensuring the model runs with optimal performance on the target platform.

TensorRT will pick the implementation from a library of kernels that delivers the best performance for the target GPU, input data size, filter size, tensor layout, batch size and other parameters.

  • Dynamic Tensor Memory

TensorRT assigns device memory to each tensor only for its period of use, which avoids repeated allocations, reduces memory footprint, and improves memory reuse.

  • Multi-Stream Execution

Scalable design to process multiple input streams in parallel; this is presumably an optimization at the GPU level.

3. Case Study

This article uses a Paddle-TRT case study to try to accelerate a Paddle model:

Paddle Inference documentation, Paddle-TRT page: https://paddleinference.paddlepaddle.org.cn/optimize/paddle_trt.html

Practice 1: Porting OpenPose to the PaddlePaddle framework and accelerating it with TRT

PaddlePaddle's compatibility with Caffe is not as good as with ONNX, so the first step is to convert the OpenPose model from Caffe format to ONNX, and then convert it to a Paddle model with the x2paddle tool.
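
For the second, ONNX-to-Paddle step, x2paddle is normally driven from the command line; below is a sketch that simply shells out to it from Python (the file names are hypothetical):

```python
import subprocess

# Convert an ONNX model into a Paddle inference model directory.
# "openpose.onnx" and "pd_model" are placeholder names.
subprocess.run(
    ["x2paddle",
     "--framework=onnx",
     "--model=openpose.onnx",
     "--save_dir=pd_model"],
    check=True)
```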

For the Caffe-to-ONNX step, the tool provided in this article is recommended:

【caffe】Caffe模型转换为ONNX模型(新版)_花丸大老师的博客-CSDN博客: https://blog.csdn.net/u013597931/article/details/85236288

After the model is converted, first test whether it runs correctly on the GPU. The test code is as follows:

```python
import numpy as np
import paddle.inference as paddle_infer
import time

def create_predictor():
    config = paddle_infer.Config("/data/models/paddle/pose/inference_model/model.pdmodel", "/data/models/paddle/pose/inference_model/model.pdiparams")
    config.enable_use_gpu(1000, 0)  # 1000 MB initial GPU memory pool, device 0
    config.switch_ir_optim()        # enable IR graph optimizations
    config.enable_memory_optim()

    predictor = paddle_infer.create_predictor(config)
    return predictor

def run(predictor, img):
    # Prepare the inputs
    input_names = predictor.get_input_names()
    for i, name in enumerate(input_names):
        input_tensor = predictor.get_input_handle(name)
        input_tensor.reshape(img[i].shape)
        input_tensor.copy_from_cpu(img[i].copy())
    # Run inference
    predictor.run()
    results = []
    # Fetch the outputs
    output_names = predictor.get_output_names()
    for i, name in enumerate(output_names):
        output_tensor = predictor.get_output_handle(name)
        output_data = output_tensor.copy_to_cpu()
        results.append(output_data)
    return results

if __name__ == '__main__':
    pred = create_predictor()
    img = np.ones((1, 3, 480, 640)).astype(np.float32)
    while True:
        start = time.time()
        result = run(pred, [img])
        end = time.time()
        print("run time: ", end - start)
```

Next, write the OpenPose post-processing code and test it.

The post-processing uses the official OpenPose C++ code, wrapped for Python with pybind11.
