Deploying NVIDIA Edge Computing Products: An Introduction to TensorRT and Its Applications

Model development with TensorRT:

The lifecycle of deep-learning model development has five steps: defining the objective, modeling the task and building the pipeline, collecting and annotating data, training the model, and deploying the model. Deployment is a critical stage, the last step before a model delivers value in production; reportedly, 95% of companies manage to train a model yet stall at the deployment stage.

The deployment stage places five requirements on model inference:

Throughput
The volume of output within a given period. Often measured in inferences/second or samples/second, per-server throughput is critical to cost-effective scaling in data centers.
Efficiency
Amount of throughput delivered per unit-power, often expressed as performance/watt. Efficiency is another key factor to cost-effective data center scaling, since servers, server racks, and entire data centers must operate within fixed power budgets.
Latency
Time to execute an inference, usually measured in milliseconds. Low latency is critical to delivering rapidly growing, real-time inference-based services.
Accuracy
A trained neural network’s ability to deliver the correct answer. For image-classification use cases, the key metric is expressed as a top-5 or top-1 percentage.
Memory Usage
The host and device memory that need to be reserved to do inference on a network depend on the algorithms used. This constrains what networks and what combinations of networks can run on a given inference platform. This is particularly important for systems where multiple networks are needed and memory resources are limited - such as cascading multi-class detection networks used in intelligent video analytics and multi-camera, multi-network autonomous driving systems.
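To make the latency and throughput metrics above concrete, here is a minimal, framework-agnostic timing sketch. The `infer` callable and the matrix-multiply "model" are hypothetical stand-ins for a real inference call: latency is the average time per call in milliseconds, throughput is samples processed per second.

```python
import time
import numpy as np

def measure(infer, batch, batch_size, warmup=10, iters=100):
    """Time an inference callable; return latency (ms) and throughput (samples/s)."""
    for _ in range(warmup):                # warm up caches / lazy initialization
        infer(batch)
    start = time.perf_counter()
    for _ in range(iters):
        infer(batch)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / iters * 1000.0
    throughput = iters * batch_size / elapsed
    return latency_ms, throughput

if __name__ == "__main__":
    # Dummy "model": a matrix multiply standing in for a real network.
    weights = np.random.rand(1024, 1024).astype(np.float32)
    batch = np.random.rand(8, 1024).astype(np.float32)
    lat, thr = measure(lambda x: x @ weights, batch, batch_size=8)
    print(f"latency: {lat:.2f} ms, throughput: {thr:.1f} samples/s")
```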

If the deployment hardware is an NVIDIA product, all five of these aspects can be optimized with TensorRT, which provides the following five capabilities:

Quantization
Most deep learning frameworks train neural networks in full 32-bit precision (FP32). Once the model is fully trained, inference computations can use half precision FP16 or even INT8 tensor operations, since gradient backpropagation is not required for inference. Using lower precision results in smaller model size, lower memory utilization and latency, and higher throughput.
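As a sketch of how reduced precision is requested with the TensorRT Python API (flag and property names as in TensorRT 8.x; treat this as an illustration rather than the only way), precision is selected on the builder configuration:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Request FP16 kernels where the hardware supports them; TensorRT may still
# fall back to FP32 for individual layers.
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# INT8 additionally needs calibration data (or explicit per-tensor dynamic
# ranges) to map FP32 activations onto 8-bit integers.
if builder.platform_has_fast_int8:
    config.set_flag(trt.BuilderFlag.INT8)
    # config.int8_calibrator = my_calibrator  # hypothetical IInt8EntropyCalibrator2 instance
```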
Kernel Auto Tuning
During the optimization phase TensorRT also chooses from hundreds of specialized kernels, many of them hand-tuned and optimized for a range of parameters and target platforms. As an example, there are several different algorithms to do convolutions. TensorRT will pick the implementation from a library of kernels that delivers the best performance for the target GPU, input data size, filter size, tensor layout, batch size and other parameters. This ensures that the deployed model is performance tuned for the specific deployment platform as well as for the specific neural network being deployed.
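Kernel selection happens automatically during the build, but its results can be persisted and reused. The sketch below uses the timing-cache API that, to my understanding, is available in TensorRT 8.x; verify the exact calls against your installed version, and treat the file name as a placeholder.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# An empty byte string starts a fresh cache; passing previously saved bytes
# reuses earlier kernel-tuning results and shortens subsequent builds.
cache = config.create_timing_cache(b"")
config.set_timing_cache(cache, ignore_mismatch=False)

# ... build the engine as usual, then persist the cache for the next build:
# with open("timing.cache", "wb") as f:           # hypothetical file name
#     f.write(memoryview(cache.serialize()))
```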
Elimination of Redundant Layers and Operations
TensorRT removes layers whose outputs are not used and operations that are equivalent to no-ops.

Figure 1. TensorRT’s vertical and horizontal layer fusion and layer elimination optimizations simplify the GoogLeNet Inception module graph, reducing computation and memory overhead.
Layer & Tensor Fusion
TensorRT parses the network computational graph and looks for opportunities to perform graph optimizations. These graph optimizations do not change the underlying computation in the graph: instead, they look to restructure the graph to perform the operations much faster and more efficiently.
When a deep learning framework executes this graph during inference, it makes multiple function calls for each layer. Since each operation is performed on the GPU, this translates to multiple CUDA kernel launches. The kernel computation is often very fast relative to the kernel launch overhead and the cost of reading and writing the tensor data for each layer. This results in a memory bandwidth bottleneck and underutilization of available GPU resources.
TensorRT addresses this by vertically fusing kernels to perform the sequential operations together. This layer fusion reduces kernel launches and avoids writing into and reading from memory between layers. In the network on the left of Figure 1, the convolution, bias and ReLU layers of various sizes can be combined into a single kernel called CBR, as the right side of Figure 1 shows. A simple analogy is making three separate trips to the supermarket to buy three items versus buying all three in a single trip.
TensorRT also recognizes layers that share the same input data and filter size, but have different weights. Instead of three separate kernels, TensorRT fuses them horizontally into a single wider kernel as shown for the 1×1 CBR layer in the right side of Figure 1.
TensorRT can also eliminate the concatenation layers in Figure 1 (“concat”) by preallocating output buffers and writing into them in a strided fashion.
Overall the result is a smaller, faster and more efficient graph with fewer layers and kernel launches, therefore reducing inference latency.
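Fusion is performed inside the optimizer, but its outcome can be observed. Below is a brief sketch using the engine inspector (introduced around TensorRT 8.2, an assumption about the installed version); fused layers such as combined conv+bias+ReLU kernels appear as single entries, and `model.plan` is a placeholder path.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Load a previously built engine ("model.plan" is a placeholder path).
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# List the layers that survived fusion and elimination. Per-layer detail is
# richest when the engine was built with ProfilingVerbosity.DETAILED.
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.ONELINE))
```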
Dynamic Tensor Memory
TensorRT also reduces memory footprint and improves memory reuse by designating memory for each tensor only for the duration of its usage, avoiding memory allocation overhead for fast and efficient execution.
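One practical consequence of this memory planning is that an engine reports up front how much scratch device memory its activations require, and the application can even supply that memory itself. A short sketch (attribute names as in the TensorRT 8.x Python API; the plan file and the GPU buffer are placeholders):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("model.plan", "rb") as f:          # placeholder plan file
    engine = runtime.deserialize_cuda_engine(f.read())

# Total scratch memory TensorRT planned for intermediate activations.
print("activation memory (bytes):", engine.device_memory_size)

# Optionally create a context that uses caller-managed device memory, e.g. one
# buffer shared by several engines that never execute concurrently.
context = engine.create_execution_context_without_device_memory()
# context.device_memory = my_gpu_scratch_buffer  # hypothetical pre-allocated buffer
```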

Using TensorRT involves two steps:

Build
The build phase needs to be run on the target deployment GPU platform. For example, if your application is going to run on a Jetson TX2, the build needs to be performed on a Jetson TX2, and likewise if your inference services will run in the cloud on AWS P3 instances with Tesla V100 GPUs, then the build phase needs to run on a system with a Tesla V100. This step is only performed once, so typical applications build one or many engines once, and then serialize them for later use.
We use TensorRT to parse a trained model and perform optimizations for specified parameters such as batch size, precision, and workspace memory for the target deployment GPU. The output of this step is an optimized inference execution engine, which we serialize to a file on disk called a plan file.
A plan file includes serialized data that the runtime engine uses to execute the network. It’s called a plan file because it includes not only the weights, but also the schedule for the kernels to execute the network. It also includes information about the network that the application can query in order to determine how to bind input and output buffers.
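A minimal build-phase sketch with the TensorRT Python API, assuming the trained network has been exported to ONNX beforehand; `model.onnx` and `model.plan` are placeholder names, and the exact config calls vary slightly between TensorRT releases:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the trained model (exported to ONNX in the training environment).
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30    # 1 GiB scratch; newer releases use set_memory_pool_limit()
config.set_flag(trt.BuilderFlag.FP16)  # optional reduced precision

# Build on the target GPU and serialize the optimized engine to a plan file.
plan = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(plan)
```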
Deploy
This is the deployment step. We load and deserialize a saved plan file to create a TensorRT engine object, and use it to run inference on new data on the target deployment platform.
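And a matching deployment sketch: deserialize the plan on the target platform and run inference on new data. Device buffers are handled with pycuda here, which is an assumption about the environment (any CUDA allocator works); the sketch also assumes binding 0 is the input and the last binding is the output, and it uses the legacy binding-index API of TensorRT 7.x/8.x.

```python
import numpy as np
import pycuda.autoinit            # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Deserialize the saved plan file into an engine on the deployment GPU.
with open("model.plan", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one host and one device buffer per binding.
host_bufs, dev_bufs, bindings = [], [], []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.empty(trt.volume(engine.get_binding_shape(i)), dtype=dtype)
    dev = cuda.mem_alloc(host.nbytes)
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

# Run inference: copy the input in, execute, copy the output back.
host_bufs[0][:] = np.random.rand(host_bufs[0].size).astype(host_bufs[0].dtype)  # placeholder input
cuda.memcpy_htod(dev_bufs[0], host_bufs[0])
context.execute_v2(bindings)
cuda.memcpy_dtoh(host_bufs[-1], dev_bufs[-1])
print("first outputs:", host_bufs[-1][:10])
```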

Perception architecture for industrial applications such as object detection and autonomous driving:

By bringing deep neural networks and other complex processing tasks into a stream-processing pipeline, video and other sensor data can be analyzed in near real time. Extracting meaningful insights from these sensors creates opportunities to improve operational efficiency and safety. Cameras, for example, are currently the most widely used sensor for object detection; they can be found in our homes, on streets, in parking lots, shopping malls, warehouses, and factories, virtually everywhere. The potential uses of video analytics are enormous: autonomous driving, access control, loss prevention, automated checkout, surveillance, security, automated inspection (QA), parcel sorting (smart logistics), traffic control and engineering, industrial automation, and more.
A generic streaming-analytics architecture defines a scalable video-processing pipeline that can perform inference, object tracking, and reporting, and even stream results to the cloud for storage. As the application analyzes each video frame, plugins extract information and store it as part of a cascaded metadata record, keeping the record associated with its source frame. The complete metadata collection at the end of the pipeline represents the full set of information that the deep-learning models and other analytics plugins extracted from the frame. The application can use this information for display, or transmit it externally as part of a message for further analysis or long-term archiving.
The metadata uses an extensible, standard structure. The basic metadata structure, NvDsBatchMeta, starts with batch-level metadata created inside the required Gst-nvstreammux plugin. Secondary metadata structures hold frame, object, classifier, and label data.
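With the DeepStream Python bindings (pyds, assuming they are installed alongside the SDK), this hierarchy is typically walked inside a GStreamer pad probe: NvDsBatchMeta at the batch level, then per-frame NvDsFrameMeta, then per-object NvDsObjectMeta. A minimal traversal sketch, modeled on the public deepstream_python_apps samples:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds

def sink_pad_probe(pad, info, user_data):
    """Pad probe that walks NvDsBatchMeta -> NvDsFrameMeta -> NvDsObjectMeta."""
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK

    # Batch-level metadata attached by the Gst-nvstreammux plugin.
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        l_obj = frame_meta.obj_meta_list
        while l_obj is not None:
            obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)
            # Each detected object carries its class id, confidence and bounding box.
            print(frame_meta.frame_num, obj_meta.class_id, obj_meta.confidence)
            l_obj = l_obj.next
        l_frame = l_frame.next
    return Gst.PadProbeReturn.OK
```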
