Project: Inference Framework based TensorRT

引言

视觉算法经过几年高速发展,大量的算法被提出。为了能真正将算法在实际应用场景中更好地应用,高性能的 inference框架层出不穷。从手机端上的ncnn到tf-lite,NVIDIA在cudnn之后,推出专用于神经网络推理的TensorRT. 经过几轮迭代,支持的操作逐渐丰富,补充的插件已经基本满足落地的需求。笔者觉得,尤其是tensorrt 5.0之后,无论是接口还是使用samples都变得非常方便集成。

版本选型与基本概念

FP16 INT8

The easiest way to benefit from mixed precision in your application is to take advantage of the support for FP16 and INT8 computation in NVIDIA GPU libraries. Key libraries from the NVIDIA SDK now support a variety of precisions for both computation and storage.

Table shows the current support for FP16 and INT8 in key CUDA libraries as well as in PTX assembly and CUDA C/C++ intrinsics.

Feature FP16x2 INT8/16 DP4A/DP2A
PTX instructions CUDA 7.5 CUDA 8
CUDA C/C++ intrinsics CUDA 7.5 CUDA 8
cuBLAS GEMM CUDA 7.5 CUDA 8
cuFFT CUDA 7.5 I/O via cuFFT callbacks
cuDNN 5.1 6
TensorRT v1 v2 Tech Preview

PTX

PTX(parallel-thread-execution,并行线程执行) 预编译后GPU代码的一种形式,开发者可以通过编译选项 “-keep”选择输出PTX代码,当然开发人员也可以直接编写PTX级代码。另外,PTX是独立于GPU架构的,因此可以重用相同的代码适用于不同的GPU架构。 具体可参考CUDA-PDF之《PTX ISA reference document》

建议我们的CUDA 版本为CUDA 8.0以上, 显卡至少为GeForce 1060, 如果想支持Int8/DP4A等feature,还是需要RTX 1080或者P40

TensorRT特性助力高性能算法

优化原理

网络模型的裁剪与重构

The above figures explain the vertical fusion optimization that TRT does. The Convolution (C), Bias(B) and Activation(R, ReLU in this case) are all collapsed into one single node (implementation wise this would mean a single CUDA kernel launch for C, B and R).

There is also a horizontal fusion where if multiple nodes with same operation are feeding to multiple nodes then it is converted to one single node feeding multiple nodes. The three 1x1 CBRs are fused to one and their output is directed to appropriate nodes. Other optimizations Apart from the graph optimizations, TRT, through experiments and based on parameters like batch size, convolution kernel(filter) sizes, chooses efficient algorithms and kernels(CUDA kernels) for operations in network.

低精度计算的支持

  • FP16 & Int8指令的支持
  • DP4A(Dot Product of 4 8-bits Accumulated to a 32-bit)

TensorRT 进行优化的方式是 DP4A (Dot Product of 4 8-bits Accumulated to a 32-bit),如下图:

这是PASCAL 系列GPU的硬件指令,INT8卷积就是使用这种方式进行的卷积计算。更多关于DP4A的信息可以参考<

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值