Introduction
Computer vision algorithms have advanced rapidly in recent years, and a large number of new algorithms have been proposed. To deploy them effectively in real-world applications, high-performance inference frameworks keep emerging: from ncnn and TF-Lite on mobile, to TensorRT, the engine NVIDIA introduced after cuDNN specifically for neural-network inference. After several release iterations, the set of supported operations has grown considerably, and the available plugins now largely cover deployment needs. In my experience, since TensorRT 5.0 both the API and the bundled samples have become very convenient to integrate.
Version Selection and Basic Concepts
FP16 & INT8
The easiest way to benefit from mixed precision in your application is to take advantage of the support for FP16 and INT8 computation in NVIDIA GPU libraries. Key libraries from the NVIDIA SDK now support a variety of precisions for both computation and storage.
The table below shows the current support for FP16 and INT8 in key CUDA libraries, as well as in PTX assembly and CUDA C/C++ intrinsics.
| Feature | FP16x2 | INT8/16 DP4A/DP2A |
|---|---|---|
| PTX instructions | CUDA 7.5 | CUDA 8 |
| CUDA C/C++ intrinsics | CUDA 7.5 | CUDA 8 |
| cuBLAS GEMM | CUDA 7.5 | CUDA 8 |
| cuFFT | CUDA 7.5 | I/O via cuFFT callbacks |
| cuDNN | 5.1 | 6 |
| TensorRT | v1 | v2 Tech Preview |
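As a quick, hedged illustration of what FP16 storage means in practice (independent of any library in the table), Python's standard `struct` module can round-trip a value through the IEEE-754 half-precision format. The 10-bit mantissa keeps only about 3 decimal digits, which is why mixed precision pairs FP16 storage with higher-precision accumulation:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision (FP16)."""
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(0.1))     # -> 0.0999755859375: 0.1 is not exactly representable
print(to_fp16(512.5))   # -> 512.5: spacing between FP16 values is 0.5 here
print(to_fp16(2049.0))  # -> 2048.0: spacing between FP16 values is 2.0 here
```

Note how absolute precision degrades as magnitude grows: above 2048, consecutive FP16 values are 2 apart, so `2049.0` cannot survive the round trip.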
PTX
PTX (Parallel Thread Execution) is an intermediate form of compiled GPU code. Developers can keep the generated PTX by passing the "-keep" compile option, and can also write PTX-level code directly. Because PTX is independent of any specific GPU architecture, the same PTX code can be reused across different GPU generations. For details, see the CUDA documentation "PTX ISA reference document".
We recommend CUDA 8.0 or later, with at least a GeForce GTX 1060. To use features such as INT8/DP4A, you will want a GTX 1080 or a P40.
TensorRT Features for High-Performance Algorithms
Optimization Principles
Network Model Pruning and Restructuring
The figures above illustrate the vertical fusion optimization that TRT performs. The convolution (C), bias (B), and activation (R, ReLU in this case) are all collapsed into a single node (implementation-wise, this means a single CUDA kernel launch for C, B, and R).
There is also a horizontal fusion: when multiple nodes performing the same operation feed into multiple downstream nodes, they are converted into a single node feeding those nodes. The three 1x1 CBRs are fused into one, and its output is routed to the appropriate nodes. Beyond these graph optimizations, TRT runs experiments and, based on parameters such as batch size and convolution kernel (filter) sizes, chooses the most efficient algorithms and CUDA kernels for the operations in the network.
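The effect of vertical fusion can be sketched in plain Python (a toy 1-D convolution, not TRT's actual implementation): computing convolution, bias, and ReLU as three separate passes materializes two intermediate arrays, while the fused version produces identical results in a single pass over the data, which is what collapsing C, B, and R into one kernel launch buys you:

```python
def conv1d(x, w):
    """'Valid' 1-D convolution (cross-correlation, as in deep learning)."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def unfused_cbr(x, w, b):
    # Three passes: conv -> bias -> ReLU, writing two intermediate arrays.
    c = conv1d(x, w)
    biased = [v + b for v in c]
    return [max(0.0, v) for v in biased]

def fused_cbr(x, w, b):
    # One pass: each output element is convolved, biased, and activated
    # in place, analogous to a single fused CBR kernel launch.
    k = len(w)
    return [max(0.0, sum(x[i + j] * w[j] for j in range(k)) + b)
            for i in range(len(x) - k + 1)]

x = [1.0, -2.0, 3.0, 0.5, -1.0]
w = [0.5, -0.25]
b = 0.1
assert unfused_cbr(x, w, b) == fused_cbr(x, w, b)
```

On a GPU the win is not just fewer arithmetic passes: fusing avoids two round trips through global memory and two extra kernel launches per layer.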
Support for Low-Precision Computation
- Support for FP16 & INT8 instructions
- DP4A(Dot Product of 4 8-bits Accumulated to a 32-bit)
TensorRT optimizes INT8 computation using DP4A (Dot Product of 4 8-bits Accumulated to a 32-bit), as shown in the figure below:
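A minimal Python model of what a single DP4A instruction computes (the function name is mine, not a CUDA API): four signed 8-bit products are summed and accumulated into a 32-bit register:

```python
def dp4a(a, b, c):
    """Simulate DP4A: dot product of four int8 lanes, accumulated into int32 c."""
    assert len(a) == len(b) == 4
    assert all(-128 <= v <= 127 for v in a + b)
    acc = c + sum(x * y for x, y in zip(a, b))
    # Wrap to 32-bit two's complement, as the hardware register would.
    acc &= 0xFFFFFFFF
    return acc - (1 << 32) if acc >= (1 << 31) else acc

# Worst-case magnitude of one instruction: 4 * (-128) * (-128) = 65536,
# far below int32 limits, so long dot products can chain many DP4As safely.
print(dp4a([127, -128, 5, 0], [2, 3, -4, 100], 10))  # -> -140
```

Chaining the accumulator `c` across calls is exactly how a long INT8 dot product (e.g. one output element of a quantized convolution) is built from 4-element pieces.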