TensorRT Overview

The lifecycle of deep learning model development consists of five steps: defining the objective, modeling the task, collecting and annotating data, training the model, and deploying the model. As practitioners we spend most of our time on the first four, but model deployment is just as important: it is the final step that actually brings a model into production.


At the deployment stage, model inference has to meet five requirements:

  • Throughput
    • The volume of output within a given period. Often measured in inferences/second or samples/second, per-server throughput is critical to cost-effective scaling in data centers.
  • Efficiency
    • Amount of throughput delivered per unit-power, often expressed as performance/watt. Efficiency is another key factor to cost-effective data center scaling, since servers, server racks, and entire data centers must operate within fixed power budgets.
  • Latency
    • Time to execute an inference, usually measured in milliseconds. Low latency is critical to delivering rapidly growing, real-time inference-based services (a simple measurement sketch follows this list).
  • Accuracy
    • A trained neural network's ability to deliver the correct answer. For image-classification use cases, the critical metric is expressed as a top-5 or top-1 percentage.
  • Memory Usage
    • The host and device memory that need to be reserved to do inference on a network depend on the algorithms used. This constrains what networks and what combinations of networks can run on a given inference platform. This is particularly important for systems where multiple networks are needed and memory resources are limited - such as cascading multi-class detection networks used in intelligent video analytics and multi-camera, multi-network autonomous driving systems.
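As a rough illustration of how the latency and throughput figures above are obtained in practice, the sketch below times repeated inference calls. `run_inference` and `batch` are hypothetical placeholders for an actual inference call and its input; only the measurement logic is meant literally.

```python
import time

def benchmark(run_inference, batch, batch_size, iterations=100, warmup=10):
    # Warm-up iterations so one-time costs (initialization, caches) are excluded.
    for _ in range(warmup):
        run_inference(batch)

    start = time.perf_counter()
    for _ in range(iterations):
        run_inference(batch)
    elapsed = time.perf_counter() - start

    latency_ms = elapsed / iterations * 1000.0      # average time per inference
    throughput = iterations * batch_size / elapsed  # samples per second
    return latency_ms, throughput
```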


If the deployment hardware is an NVIDIA product, all five of the above can be optimized with TensorRT, which provides the following five capabilities:

  • Quantization
    • Most deep learning frameworks train neural networks in full 32-bit precision (FP32). Once the model is fully trained, inference computations can use half-precision FP16 or even INT8 tensor operations, since gradient backpropagation is not required for inference. Using lower precision results in smaller model size, lower memory utilization and latency, and higher throughput (a minimal example of requesting reduced precision is sketched after this feature list).
  • Kernel Auto Tuning
    • During the optimization phase TensorRT also chooses from hundreds of specialized kernels, many of them hand-tuned and optimized for a range of parameters and target platforms. As an example, there are several different algorithms to do convolutions. TensorRT will pick the implementation from a library of kernels that delivers the best performance for the target GPU, input data size, filter size, tensor layout, batch size and other parameters. This ensures that the deployed model is performance tuned for the specific deployment platform as well as for the specific neural network being deployed.
  • Elimination of Redundant Layers and Operations
    • Layers whose outputs are not used and operations that are equivalent to no-ops are eliminated from the graph.

Figure 1. TensorRT’s vertical and horizontal layer fusion and layer elimination optimizations simplify the GoogLeNet Inception module graph, reducing computation and memory overhead.

  • Layer & Tensor Fusion
    • TensorRT parses the network computational graph and looks for opportunities to perform graph optimizations. These graph optimizations do not change the underlying computation in the graph: instead, they look to restructure the graph to perform the operations much faster and more efficiently.
    • When a deep learning framework executes this graph during inference, it makes multiple function calls for each layer. Since each operation is performed on the GPU, this translates to multiple CUDA kernel launches. The kernel computation is often very fast relative to the kernel launch overhead and the cost of reading and writing the tensor data for each layer. This results in a memory bandwidth bottleneck and underutilization of available GPU resources.
    • TensorRT addresses this by vertically fusing kernels to perform the sequential operations together. This layer fusion reduces kernel launches and avoids writing into and reading from memory between layers. In the network on the left of Figure 1, the convolution, bias and ReLU layers of various sizes can be combined into a single kernel called CBR, as the right side of Figure 1 shows. A simple analogy is making three separate trips to the supermarket to buy three items versus buying all three in a single trip.
    • TensorRT also recognizes layers that share the same input data and filter size, but have different weights. Instead of three separate kernels, TensorRT fuses them horizontally into a single wider kernel as shown for the 1×1 CBR layer in the right side of Figure 1.
    • TensorRT can also eliminate the concatenation layers in Figure 1 ("concat") by preallocating output buffers and writing into them in a strided fashion.
    • Overall the result is a smaller, faster and more efficient graph with fewer layers and kernel launches, therefore reducing inference latency.
  • Dynamic Tensor Memory
    • TensorRT also reduces memory footprint and improves memory reuse by designating memory for each tensor only for the duration of its usage, avoiding memory allocation overhead for fast and efficient execution.
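Quantization, kernel selection, and fusion are all applied when the optimized engine is built; from the user's side, reduced precision is simply requested on the builder configuration. A minimal sketch, assuming the TensorRT 8.x Python API (exact flag and method names vary between versions):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# Ask for reduced precision; TensorRT falls back to FP32 for layers
# where lower precision is unsupported or slower.
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)
if builder.platform_has_fast_int8:
    config.set_flag(trt.BuilderFlag.INT8)
    # INT8 additionally requires calibration data (or explicit dynamic
    # ranges); the calibrator object itself is not shown here.
    # config.int8_calibrator = my_calibrator
```

Layer and tensor fusion, elimination of redundant layers, kernel auto-tuning, and dynamic tensor memory need no user code: they are performed automatically when the engine is built from this network and configuration.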

Using TensorRT involves two steps:

  • Build
    • The build phase needs to be run on the target deployment GPU platform. For example, if your application is going to run on a Jetson TX2, the build needs to be performed on a Jetson TX2, and likewise if your inference services will run in the cloud on AWS P3 instances with Tesla V100 GPUs, then the build phase needs to run on a system with a Tesla V100. This step is only performed once, so typical applications build one or many engines once, and then serialize them for later use.
    • We use TensorRT to parse a trained model and perform optimizations for specified parameters such as batch size, precision, and workspace memory for the target deployment GPU. The output of this step is an optimized inference execution engine which we serialize to a file on disk called a plan file (see the build-phase sketch after this list).
    • A plan file includes serialized data that the runtime engine uses to execute the network. It's called a plan file because it includes not only the weights, but also the schedule for the kernels to execute the network. It also includes information about the network that the application can query in order to determine how to bind input and output buffers.
  • Deploy
    • This is the deployment step. We load and deserialize a saved plan file to create a TensorRT engine object, and use it to run inference on new data on the target deployment platform (see the deploy-phase sketch after this list).
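As a concrete illustration of the build phase, the sketch below parses an ONNX export of a trained model, builds an optimized engine, and serializes it to a plan file. This is a sketch assuming the TensorRT 8.x Python API and an ONNX export of the model; the file names are placeholders.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_plan(onnx_path, plan_path):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the trained model (exported to ONNX) into a TensorRT network.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse the ONNX model")

    config = builder.create_builder_config()
    # Workspace memory the optimizer may use while auditioning kernels (1 GiB here).
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
    # Precision flags from the earlier sketch would also be set on `config`.

    # Build and serialize the engine; the resulting bytes are the plan file.
    plan = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as f:
        f.write(plan)

build_plan("model.onnx", "model.plan")
```

As noted above, this build step must run on the target deployment GPU and is performed once; the plan file is what gets shipped.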
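For the deployment step, a matching sketch deserializes the plan file and creates an execution context, again assuming the TensorRT 8.x Python API. Allocating device buffers (e.g. with cuda-python or PyCUDA) is omitted, and `input_buf`/`output_buf` are hypothetical device allocations.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(plan_path):
    # Deserialize the plan file produced by the build phase.
    runtime = trt.Runtime(TRT_LOGGER)
    with open(plan_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine("model.plan")
context = engine.create_execution_context()
# With device buffers bound for each network input and output:
# context.execute_v2([input_buf, output_buf])
```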

     
