TensorRT (1): Installation and Command-Line Testing

1. Choosing a TensorRT Version

Before installing TensorRT, you first need to know your GPU's compute capability and architecture; look your card up in NVIDIA's compute capability list (https://developer.nvidia.com/cuda-gpus).
[Figure: compute capability table, RTX and Titan series]
Only the RTX and Titan series are shown here; other series can be found on the same page.
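
You can also read the compute capability straight from the driver with a quick command-line check (a sketch; the compute_cap query field requires a reasonably recent driver):

:: Prints e.g. "NVIDIA GeForce RTX 2060, 7.5" on the machine used in this article.
nvidia-smi --query-gpu=name,compute_cap --format=csv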

1.1 CUDA Version

CUDA must be installed first. The newest release is not automatically the right one; choose a version suited to your GPU architecture, roughly by matching the CUDA release to the time your card went on sale.

Downloading CUDA requires registering an account; the release archive is at https://developer.nvidia.com/cuda-toolkit-archive
[Figure: CUDA Toolkit archive download page]

For example, the latest RTX 40 series (released October 2022, Ada Lovelace architecture) requires CUDA 11.8. The RTX 20 series uses the Turing architecture and first launched in late 2018; CUDA 11.8 is backward compatible with it, but a suitable 10.x release is the recommended choice. Each release's online documentation shows which compute capabilities it supports.
[Figure: compute capability support table from the CUDA online documentation]
The GPU in this machine is an RTX 2060; for convenience, the latest release, CUDA 11.8, is used directly.
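
After installation, a quick sanity check that the toolkit and driver are in place (a sketch, assuming a default install):

:: The toolkit compiler should report release 11.8.
nvcc --version
:: The nvidia-smi header shows the driver version and the highest CUDA version it supports.
nvidia-smi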

1.2 TensorRT Version

The official NVIDIA TensorRT repository is https://github.com/NVIDIA/TensorRT, and prebuilt packages are offered from two download locations:

[Figure: TensorRT package download pages]
Note: having chosen CUDA 11.8, you should download the TensorRT 8.5 build.

1.3 cuDNN Version

For TensorRT 8.5 with CUDA 11.8, the matching cuDNN version is 8.6. For learning purposes, though, TensorRT 8.4 was chosen here because it ships with more samples, and it runs fine as well.

The download is TensorRT-8.4.3.1.Windows10.x86_64.cuda-11.6.cudnn8.4.zip; as the filename shows, it is built against cuDNN 8.4. So next install cuDNN, choosing the latest 8.4.x release.

The download page is https://developer.nvidia.com/rdp/cudnn-download, and archived versions are at
https://developer.nvidia.com/rdp/cudnn-archive
[Figure: cuDNN archive download page]
After downloading, simply extract the cuDNN files into the CUDA installation directory.
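
A minimal sketch of that copy step, assuming the default CUDA install path; the cuDNN archive folder name below is illustrative, so substitute the one you actually downloaded (the archive layout may also differ slightly between cuDNN releases):

:: Copy cuDNN DLLs, headers, and import libraries into the CUDA v11.8 tree.
xcopy /y cudnn-8.4-archive\bin\*.dll "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\"
xcopy /y cudnn-8.4-archive\include\*.h "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\"
xcopy /y cudnn-8.4-archive\lib\x64\*.lib "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64\"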

1.4 Environment Configuration and Basic Test

After extracting, add the TensorRT lib directory (here D:\Librarys\TensorRT-8.4.3.1\lib) to the PATH environment variable.
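
One way to append it from a command prompt (a sketch; note that setx only affects newly opened shells and truncates values longer than 1024 characters, so the graphical environment-variable editor is the safer route):

setx PATH "%PATH%;D:\Librarys\TensorRT-8.4.3.1\lib"
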
On Windows, some programs may later fail with a missing zlibwapi.dll; the official instructions for obtaining it are at https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-zlib-windows
[Figure: zlib download section of the cuDNN install guide]
For example, take the x64 build and extract its .lib and .dll into a directory already on PATH, e.g. directly into D:\Librarys\TensorRT-8.4.3.1\lib.
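
To confirm that Windows can now resolve the required DLLs from PATH (nvinfer.dll is TensorRT's core runtime DLL; zlibwapi.dll is the one added above):

where nvinfer.dll
where zlibwapi.dll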

2. Testing

With the environment installed, it can now be tested; the simplest way is with the trtexec.exe tool. Later on, you can move to Python or C++ for actual development.
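
If you plan on the Python route, the extracted package's python directory ships prebuilt wheels; a sketch of installing and sanity-checking one (the cp39 filename below is illustrative and varies with your Python version):

:: Install the TensorRT wheel matching your Python version.
pip install D:\Librarys\TensorRT-8.4.3.1\python\tensorrt-8.4.3.1-cp39-none-win_amd64.whl
:: Quick import check.
python -c "import tensorrt; print(tensorrt.__version__)"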

2.1 Testing with the trtexec.exe Tool

The bin directory contains an executable, trtexec.exe, which makes it possible to use TensorRT quickly without writing any code. Its main uses (sketched after this list) are:

  • benchmarking a network on random or user-supplied input data
  • converting an input model into a serialized engine (an optimized TensorRT model file)
  • generating a serialized timing cache (mainly used to reduce build time)
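
Hedged command sketches of those three uses, against the same MNIST model tested below (flag names can be confirmed with trtexec.exe --help):

:: Build an engine from an ONNX model and serialize it to disk.
trtexec.exe --onnx=../data/mnist/mnist.onnx --saveEngine=mnist.engine
:: Build while recording a timing cache so later rebuilds are faster.
trtexec.exe --onnx=../data/mnist/mnist.onnx --timingCacheFile=mnist.cache
:: Benchmark a previously built engine with random inputs.
trtexec.exe --loadEngine=mnist.engine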

The full set of options can be listed with trtexec.exe --help, or see the trtexec documentation. Here we directly test the MNIST ONNX model under the data directory: from the bin directory, run trtexec.exe --onnx=../data/mnist/mnist.onnx. The result is:

D:\Librarys\TensorRT-8.4.3.1\bin>trtexec.exe --onnx=../data/mnist/mnist.onnx
&&&& RUNNING TensorRT.trtexec [TensorRT v8403] # trtexec.exe --onnx=../data/mnist/mnist.onnx
[11/06/2022-13:54:02] [I] === Model Options ===
[11/06/2022-13:54:02] [I] Format: ONNX
[11/06/2022-13:54:02] [I] Model: ../data/mnist/mnist.onnx
[11/06/2022-13:54:02] [I] Output:
[11/06/2022-13:54:02] [I] === Build Options ===
[11/06/2022-13:54:02] [I] Max batch: explicit batch
[11/06/2022-13:54:02] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[11/06/2022-13:54:02] [I] minTiming: 1
[11/06/2022-13:54:02] [I] avgTiming: 8
[11/06/2022-13:54:02] [I] Precision: FP32
[11/06/2022-13:54:02] [I] LayerPrecisions:
[11/06/2022-13:54:02] [I] Calibration:
[11/06/2022-13:54:02] [I] Refit: Disabled
[11/06/2022-13:54:02] [I] Sparsity: Disabled
[11/06/2022-13:54:02] [I] Safe mode: Disabled
[11/06/2022-13:54:02] [I] DirectIO mode: Disabled
[11/06/2022-13:54:02] [I] Restricted mode: Disabled
[11/06/2022-13:54:02] [I] Build only: Disabled
[11/06/2022-13:54:02] [I] Save engine:
[11/06/2022-13:54:02] [I] Load engine:
[11/06/2022-13:54:02] [I] Profiling verbosity: 0
[11/06/2022-13:54:02] [I] Tactic sources: Using default tactic sources
[11/06/2022-13:54:02] [I] timingCacheMode: local
[11/06/2022-13:54:02] [I] timingCacheFile:
[11/06/2022-13:54:02] [I] Input(s)s format: fp32:CHW
[11/06/2022-13:54:02] [I] Output(s)s format: fp32:CHW
[11/06/2022-13:54:02] [I] Input build shapes: model
[11/06/2022-13:54:02] [I] Input calibration shapes: model
[11/06/2022-13:54:02] [I] === System Options ===
[11/06/2022-13:54:02] [I] Device: 0
[11/06/2022-13:54:02] [I] DLACore:
[11/06/2022-13:54:02] [I] Plugins:
[11/06/2022-13:54:02] [I] === Inference Options ===
[11/06/2022-13:54:02] [I] Batch: Explicit
[11/06/2022-13:54:02] [I] Input inference shapes: model
[11/06/2022-13:54:02] [I] Iterations: 10
[11/06/2022-13:54:02] [I] Duration: 3s (+ 200ms warm up)
[11/06/2022-13:54:02] [I] Sleep time: 0ms
[11/06/2022-13:54:02] [I] Idle time: 0ms
[11/06/2022-13:54:02] [I] Streams: 1
[11/06/2022-13:54:02] [I] ExposeDMA: Disabled
[11/06/2022-13:54:02] [I] Data transfers: Enabled
[11/06/2022-13:54:02] [I] Spin-wait: Disabled
[11/06/2022-13:54:02] [I] Multithreading: Disabled
[11/06/2022-13:54:02] [I] CUDA Graph: Disabled
[11/06/2022-13:54:02] [I] Separate profiling: Disabled
[11/06/2022-13:54:02] [I] Time Deserialize: Disabled
[11/06/2022-13:54:02] [I] Time Refit: Disabled
[11/06/2022-13:54:02] [I] Inputs:
[11/06/2022-13:54:02] [I] === Reporting Options ===
[11/06/2022-13:54:02] [I] Verbose: Disabled
[11/06/2022-13:54:02] [I] Averages: 10 inferences
[11/06/2022-13:54:02] [I] Percentile: 99
[11/06/2022-13:54:02] [I] Dump refittable layers:Disabled
[11/06/2022-13:54:02] [I] Dump output: Disabled
[11/06/2022-13:54:02] [I] Profile: Disabled
[11/06/2022-13:54:02] [I] Export timing to JSON file:
[11/06/2022-13:54:02] [I] Export output to JSON file:
[11/06/2022-13:54:02] [I] Export profile to JSON file:
[11/06/2022-13:54:02] [I]
[11/06/2022-13:54:03] [I] === Device Information ===
[11/06/2022-13:54:03] [I] Selected Device: NVIDIA GeForce RTX 2060
[11/06/2022-13:54:03] [I] Compute Capability: 7.5
[11/06/2022-13:54:03] [I] SMs: 30
[11/06/2022-13:54:03] [I] Compute Clock Rate: 1.2 GHz
[11/06/2022-13:54:03] [I] Device Global Memory: 6143 MiB
[11/06/2022-13:54:03] [I] Shared Memory per SM: 64 KiB
[11/06/2022-13:54:03] [I] Memory Bus Width: 192 bits (ECC disabled)
[11/06/2022-13:54:03] [I] Memory Clock Rate: 5.501 GHz
[11/06/2022-13:54:03] [I]
[11/06/2022-13:54:03] [I] TensorRT version: 8.4.3
[11/06/2022-13:54:03] [I] [TRT] [MemUsageChange] Init CUDA: CPU +428, GPU +0, now: CPU 7897, GPU 1147 (MiB)
[11/06/2022-13:54:04] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +257, GPU +68, now: CPU 8347, GPU 1215 (MiB)
[11/06/2022-13:54:04] [I] Start parsing network model
[11/06/2022-13:54:04] [I] [TRT] ----------------------------------------------------------------
[11/06/2022-13:54:04] [I] [TRT] Input filename:   ../data/mnist/mnist.onnx
[11/06/2022-13:54:04] [I] [TRT] ONNX IR version:  0.0.3
[11/06/2022-13:54:04] [I] [TRT] Opset version:    8
[11/06/2022-13:54:04] [I] [TRT] Producer name:    CNTK
[11/06/2022-13:54:04] [I] [TRT] Producer version: 2.5.1
[11/06/2022-13:54:04] [I] [TRT] Domain:           ai.cntk
[11/06/2022-13:54:04] [I] [TRT] Model version:    1
[11/06/2022-13:54:04] [I] [TRT] Doc string:
[11/06/2022-13:54:04] [I] [TRT] ----------------------------------------------------------------
[11/06/2022-13:54:04] [W] [TRT] onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[11/06/2022-13:54:04] [I] Finish parsing network model
[11/06/2022-13:54:05] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +514, GPU +192, now: CPU 8667, GPU 1407 (MiB)
[11/06/2022-13:54:05] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +132, GPU +52, now: CPU 8799, GPU 1459 (MiB)
[11/06/2022-13:54:05] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[11/06/2022-13:54:06] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[11/06/2022-13:54:06] [I] [TRT] Total Host Persistent Memory: 7552
[11/06/2022-13:54:06] [I] [TRT] Total Device Persistent Memory: 0
[11/06/2022-13:54:06] [I] [TRT] Total Scratch Memory: 0
[11/06/2022-13:54:06] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 0 MiB
[11/06/2022-13:54:06] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.0236ms to assign 3 blocks to 6 nodes requiring 31748 bytes.
[11/06/2022-13:54:06] [I] [TRT] Total Activation Memory: 31748
[11/06/2022-13:54:06] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[11/06/2022-13:54:06] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[11/06/2022-13:54:06] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[11/06/2022-13:54:06] [I] Engine built in 3.05504 sec.
[11/06/2022-13:54:06] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 9095, GPU 1543 (MiB)
[11/06/2022-13:54:06] [I] [TRT] Loaded engine size: 0 MiB
[11/06/2022-13:54:06] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[11/06/2022-13:54:06] [I] Engine deserialized in 0.0020684 sec.
[11/06/2022-13:54:06] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[11/06/2022-13:54:06] [I] Using random values for input Input3
[11/06/2022-13:54:06] [I] Created input binding for Input3 with dimensions 1x1x28x28
[11/06/2022-13:54:06] [I] Using random values for output Plus214_Output_0
[11/06/2022-13:54:06] [I] Created output binding for Plus214_Output_0 with dimensions 1x10
[11/06/2022-13:54:06] [I] Starting inference
[11/06/2022-13:54:09] [I] Warmup completed 2019 queries over 200 ms
[11/06/2022-13:54:09] [I] Timing trace has 32116 queries over 3.00011 s
[11/06/2022-13:54:09] [I]
[11/06/2022-13:54:09] [I] === Trace details ===
[11/06/2022-13:54:09] [I] Trace averages of 10 runs:
[11/06/2022-13:54:09] [I] Average on 10 runs - GPU latency: 0.0410919 ms - Host latency: 0.055571 ms (enqueue 0.054744 ms)
[11/06/2022-13:54:09] [I] Average on 10 runs - GPU latency: 0.0387924 ms - Host latency: 0.0540939 ms (enqueue 0.0451706 ms)
...
- (intermediate averages omitted)
...
[11/06/2022-13:54:10] [I] Average on 10 runs - GPU latency: 0.0338623 ms - Host latency: 0.0516113 ms (enqueue 0.0414185 ms)
[11/06/2022-13:54:11] [I] Average on 10 runs - GPU latency: 0.0352783 ms - Host latency: 0.0510498 ms (enqueue 0.0429199 ms)
[11/06/2022-13:54:11] [I]
[11/06/2022-13:54:11] [I] === Performance summary ===
[11/06/2022-13:54:11] [I] Throughput: 10704.9 qps
[11/06/2022-13:54:11] [I] Latency: min = 0.0288086 ms, max = 0.373047 ms, mean = 0.0534397 ms, median = 0.0523682 ms, percentile(99%) = 0.100342 ms
[11/06/2022-13:54:11] [I] Enqueue Time: min = 0.0231323 ms, max = 0.29541 ms, mean = 0.0426776 ms, median = 0.0424805 ms, percentile(99%) = 0.0822754 ms
[11/06/2022-13:54:11] [I] H2D Latency: min = 0.00415039 ms, max = 0.122131 ms, mean = 0.0132315 ms, median = 0.0090332 ms, percentile(99%) = 0.0356445 ms
[11/06/2022-13:54:11] [I] GPU Compute Time: min = 0.017334 ms, max = 0.302734 ms, mean = 0.0350428 ms, median = 0.0380859 ms, percentile(99%) = 0.0758057 ms
[11/06/2022-13:54:11] [I] D2H Latency: min = 0.00195313 ms, max = 0.0830078 ms, mean = 0.00516533 ms, median = 0.00463867 ms, percentile(99%) = 0.0202637 ms
[11/06/2022-13:54:11] [I] Total Host Walltime: 3.00011 s
[11/06/2022-13:54:11] [I] Total GPU Compute Time: 1.12544 s
[11/06/2022-13:54:11] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[11/06/2022-13:54:11] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[11/06/2022-13:54:11] [W] * GPU compute time is unstable, with coefficient of variance = 33.1687%.
[11/06/2022-13:54:11] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[11/06/2022-13:54:11] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/06/2022-13:54:11] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8403] # trtexec.exe --onnx=../data/mnist/mnist.onnx

The test ends with PASSED, having printed the model information, build options, device information, and inference benchmark results. From here, the bundled samples are a good next step for learning.
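
As a follow-up, the two warnings in the performance summary point at flags worth trying (a sketch; both options come from the log's own suggestions):

:: Retry with CUDA graphs and spin-wait for steadier, possibly higher throughput.
trtexec.exe --onnx=../data/mnist/mnist.onnx --useCudaGraph --useSpinWait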

