Introduction
Computer vision algorithms have advanced rapidly in recent years, and a large number of new algorithms have been proposed. To deploy them effectively in real-world applications, high-performance inference frameworks keep emerging: from ncnn and TF-Lite on mobile, to TensorRT, the engine NVIDIA introduced after cuDNN specifically for neural-network inference. After several release iterations, the set of supported operations has grown considerably, and the available plugins now largely cover deployment needs. In my experience, since TensorRT 5.0 both the API and the bundled samples have become very convenient to integrate.
Version Selection and Basic Concepts
FP16 & INT8
The easiest way to benefit from mixed precision in your application is to take advantage of the support for FP16 and INT8 computation in NVIDIA GPU libraries. Key libraries from the NVIDIA SDK now support a variety of precisions for both computation and storage.
The table below shows the current support for FP16 and INT8 in key CUDA libraries, as well as in PTX assembly and CUDA C/C++ intrinsics.
| Feature | FP16x2 | INT8/16 DP4A/DP2A |
|---|---|---|
| PTX instructions | CUDA 7.5 | CUDA 8 |
| CUDA C/C++ intrinsics | CUDA 7.5 | CUDA 8 |
| cuBLAS GEMM | CUDA 7.5 | CUDA 8 |
| cuFFT | CUDA 7.5 | I/O via cuFFT callbacks |
| cuDNN | 5.1 | 6 |
| TensorRT | v1 | v2 Tech Preview |
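As a quick, hedged illustration of what FP16 storage means in practice (independent of any library in the table), Python's standard `struct` module can round-trip a value through the IEEE-754 half-precision format. The 10-bit mantissa keeps only about 3 decimal digits, which is why mixed precision pairs FP16 storage with higher-precision accumulation:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision (FP16)."""
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(0.1))     # -> 0.0999755859375: 0.1 is not exactly representable
print(to_fp16(512.5))   # -> 512.5: spacing between FP16 values is 0.5 here
print(to_fp16(2049.0))  # -> 2048.0: spacing between FP16 values is 2.0 here
```

Note how absolute precision degrades as magnitude grows: above 2048, consecutive FP16 values are 2 apart, so `2049.0` cannot survive the round trip.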
PTX
PTX (Parallel Thread Execution) is an intermediate form of compiled GPU code. Developers can keep the generated PTX by passing the "-keep" compile option, and can also write PTX-level code directly. Because PTX is independent of any specific GPU architecture, the same PTX code can be reused across different GPU generations. For details, see the CUDA documentation "PTX ISA reference document".
We recommend CUDA 8.0 or later, with at least a GeForce GTX 1060. To use features such as INT8/DP4A, you will want a GTX 1080 or a P40.
TensorRT Features for High-Performance Algorithms
Optimization Principles
Network Model Pruning and Restructuring
The figures above illustrate the vertical fusion optimization that TRT performs. The convolution (C), bias (B), and activation (R, ReLU in this case) are all collapsed into a single node (implementation-wise, this means a single CUDA kernel launch for C, B, and R).
There is also a horizontal fusion: when multiple nodes performing the same operation feed into multiple downstream nodes, they are converted into a single node feeding those nodes. The three 1x1 CBRs are fused into one, and its output is routed to the appropriate nodes. Beyond these graph optimizations, TRT runs experiments and, based on parameters such as batch size and convolution kernel (filter) sizes, chooses the most efficient algorithms and CUDA kernels for the operations in the network.
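The effect of vertical fusion can be sketched in plain Python (a toy 1-D convolution, not TRT's actual implementation): computing convolution, bias, and ReLU as three separate passes materializes two intermediate arrays, while the fused version produces identical results in a single pass over the data, which is what collapsing C, B, and R into one kernel launch buys you:

```python
def conv1d(x, w):
    """'Valid' 1-D convolution (cross-correlation, as in deep learning)."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def unfused_cbr(x, w, b):
    # Three passes: conv -> bias -> ReLU, writing two intermediate arrays.
    c = conv1d(x, w)
    biased = [v + b for v in c]
    return [max(0.0, v) for v in biased]

def fused_cbr(x, w, b):
    # One pass: each output element is convolved, biased, and activated
    # in place, analogous to a single fused CBR kernel launch.
    k = len(w)
    return [max(0.0, sum(x[i + j] * w[j] for j in range(k)) + b)
            for i in range(len(x) - k + 1)]

x = [1.0, -2.0, 3.0, 0.5, -1.0]
w = [0.5, -0.25]
b = 0.1
assert unfused_cbr(x, w, b) == fused_cbr(x, w, b)
```

On a GPU the win is not just fewer arithmetic passes: fusing avoids two round trips through global memory and two extra kernel launches per layer.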
Support for Low-Precision Computation
- Support for FP16 & INT8 instructions
- DP4A(Dot Product of 4 8-bits Accumulated to a 32-bit)
TensorRT optimizes INT8 computation using DP4A (Dot Product of 4 8-bits Accumulated to a 32-bit), as shown in the figure below:
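A minimal Python model of what a single DP4A instruction computes (the function name is mine, not a CUDA API): four signed 8-bit products are summed and accumulated into a 32-bit register:

```python
def dp4a(a, b, c):
    """Simulate DP4A: dot product of four int8 lanes, accumulated into int32 c."""
    assert len(a) == len(b) == 4
    assert all(-128 <= v <= 127 for v in a + b)
    acc = c + sum(x * y for x, y in zip(a, b))
    # Wrap to 32-bit two's complement, as the hardware register would.
    acc &= 0xFFFFFFFF
    return acc - (1 << 32) if acc >= (1 << 31) else acc

# Worst-case magnitude of one instruction: 4 * (-128) * (-128) = 65536,
# far below int32 limits, so long dot products can chain many DP4As safely.
print(dp4a([127, -128, 5, 0], [2, 3, -4, 100], 10))  # -> -140
```

Chaining the accumulator `c` across calls is exactly how a long INT8 dot product (e.g. one output element of a quantized convolution) is built from 4-element pieces.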