tensorrt, batch
1. Building trtexec
trtexec repository
Build the project following the official instructions.
2. Model conversion
- When exporting from PyTorch to ONNX, the dynamic axes must be declared up front, for example:
```python
import torch

# Declare dimension 0 (the batch dimension) of the input 'input' as dynamic.
dynamic_axes = {
    'input': {0: 'batch_size'},
}
torch.onnx.export(model,               # model being run
                  x,                   # model input (or a tuple for multiple inputs)
                  'save.onnx',         # where to save the exported model
                  export_params=True,  # store the trained weights in the model file
                  opset_version=11,    # the ONNX opset version to export to
                  do_constant_folding=True,
                  input_names=['input'],   # the model's input names
                  dynamic_axes=dynamic_axes)
```
This is what marks the batch dimension as dynamic. After exporting, check the ONNX model's input shape: it should read batch_size×3×480×640; the script below shows one way to check.
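A minimal sketch for verifying the exported input shape with the `onnx` Python package; the file name `save.onnx` matches the export call above.

```python
import onnx

model = onnx.load('save.onnx')
onnx.checker.check_model(model)
for inp in model.graph.input:
    # dim_param carries the symbolic name ('batch_size'); dim_value the fixed size
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)  # expected: input ['batch_size', 3, 480, 640]
```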
- trtexec model conversion
The conversion can be done directly with the following command (a Python-API equivalent is sketched after the flag notes):
```bash
./trtexec --onnx=xxx.onnx --saveEngine=xxx.trt --workspace=1024 --minShapes=input:1x3x480x640 --optShapes=input:16x3x480x640 --maxShapes=input:32x3x480x640 --fp16
```
Flag notes:
- onnx: the input ONNX model
- saveEngine: where to save the converted TensorRT engine
- workspace: the GPU memory workspace in MB; if the build runs out of memory, increase it
- minShapes: the minimum shape for the dynamic dimensions, in NCHW format; the name before the colon must match the ONNX input tensor name ('input' in the export above)
- optShapes: the shape TensorRT optimizes for; trtexec also runs its inference benchmark at this shape
- maxShapes: the maximum shape; here only the batch dimension is dynamic, the other dimensions are fixed
- fp16: enable float16 inference
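For reference, here is a sketch of the same conversion through the TensorRT Python API. It assumes TensorRT 8.x (some of these calls were deprecated or removed in later major versions) and reuses the file and tensor names from the command above.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open('xxx.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.max_workspace_size = 1024 << 20   # --workspace=1024 (MB)
config.set_flag(trt.BuilderFlag.FP16)    # --fp16

# One optimization profile covering batch 1..32, tuned for batch 16.
profile = builder.create_optimization_profile()
profile.set_shape('input', min=(1, 3, 480, 640),
                  opt=(16, 3, 480, 640), max=(32, 3, 480, 640))
config.add_optimization_profile(profile)

with open('xxx.trt', 'wb') as f:
    f.write(builder.build_serialized_network(network, config))
```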
3. Benchmarking batch inference latency
- Benchmark commands
As described in the previous section, different batch sizes can be benchmarked on the same dynamic engine simply by changing optShapes. For example (a sketch of how the batch size is picked at runtime in your own code follows these commands):
```bash
# batch = 1
./trtexec --onnx=xxx.onnx --saveEngine=xxx.trt --workspace=1024 --minShapes=input:1x3x480x640 --optShapes=input:1x3x480x640 --maxShapes=input:32x3x480x640 --fp16
# batch = 2
./trtexec --onnx=xxx.onnx --saveEngine=xxx.trt --workspace=1024 --minShapes=input:1x3x480x640 --optShapes=input:2x3x480x640 --maxShapes=input:32x3x480x640 --fp16
# batch = 4
./trtexec --onnx=xxx.onnx --saveEngine=xxx.trt --workspace=1024 --minShapes=input:1x3x480x640 --optShapes=input:4x3x480x640 --maxShapes=input:32x3x480x640 --fp16
# batch = 8
./trtexec --onnx=xxx.onnx --saveEngine=xxx.trt --workspace=1024 --minShapes=input:1x3x480x640 --optShapes=input:8x3x480x640 --maxShapes=input:32x3x480x640 --fp16
```
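Outside of trtexec, the batch size of a dynamic engine is selected at runtime by setting the input binding shape before execution. Below is a hedged sketch using the TensorRT 8.x Python API with pycuda; the single float32 input/output layout and the binding order are assumptions for illustration.

```python
import numpy as np
import pycuda.autoinit            # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open('xxx.trt', 'rb') as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

batch = 4                                           # any value within [min, max] = [1, 32]
context.set_binding_shape(0, (batch, 3, 480, 640))  # binding 0 assumed to be 'input'

# Allocate one device buffer per binding at the now-concrete shapes
# (all tensors assumed float32 here).
bufs, host_out = [], None
for i in range(engine.num_bindings):
    shape = tuple(context.get_binding_shape(i))
    bufs.append(cuda.mem_alloc(trt.volume(shape) * np.float32().nbytes))
    if not engine.binding_is_input(i):
        host_out = np.empty(shape, dtype=np.float32)

x = np.random.rand(batch, 3, 480, 640).astype(np.float32)
cuda.memcpy_htod(bufs[0], x)                 # H2D copy (part of Host Latency)
context.execute_v2([int(b) for b in bufs])   # synchronous GPU compute
cuda.memcpy_dtoh(host_out, bufs[-1])         # D2H copy (part of Host Latency)
```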
- Timing results
trtexec prints several timing statistics; each is explained below so you can pick whichever matters for your evaluation. The "Average on 10 runs" lines are rolling averages over the last 10 inference runs; the summary output looks like this:
```
[09/06/2021-13:50:34] [I] Average on 10 runs - GPU latency: 2.74553 ms - Host latency: 3.74192 ms (end to end 4.93066 ms, enqueue 0.624805 ms)
[09/06/2021-13:50:34] [I] Host Latency
[09/06/2021-13:50:34] [I] min: 3.65332 ms (end to end 3.67603 ms)
[09/06/2021-13:50:34] [I] max: 5.95093 ms (end to end 6.88892 ms)
[09/06/2021-13:50:34] [I] mean: 3.71375 ms (end to end 5.30082 ms)
[09/06/2021-13:50:34] [I] median: 3.70032 ms (end to end 5.32935 ms)
[09/06/2021-13:50:34] [I] percentile: 4.10571 ms at 99% (end to end 6.11792 ms at 99%)
[09/06/2021-13:50:34] [I] throughput: 356.786 qps
[09/06/2021-13:50:34] [I] walltime: 3.00741 s
[09/06/2021-13:50:34] [I] Enqueue Time
[09/06/2021-13:50:34] [I] min: 0.248474 ms
[09/06/2021-13:50:34] [I] max: 2.12134 ms
[09/06/2021-13:50:34] [I] median: 0.273987 ms
[09/06/2021-13:50:34] [I] GPU Compute
[09/06/2021-13:50:34] [I] min: 2.69702 ms
[09/06/2021-13:50:34] [I] max: 4.99219 ms
[09/06/2021-13:50:34] [I] mean: 2.73299 ms
[09/06/2021-13:50:34] [I] median: 2.71875 ms
[09/06/2021-13:50:34] [I] percentile: 3.10791 ms at 99%
[09/06/2021-13:50:34] [I] total compute time: 2.93249 s
```
- Host Latency: input transfer + GPU compute + output transfer
- end to end: GPU end-to-end latency, measured as event_out - event_in around the whole transfer-plus-compute pipeline
- Enqueue Time: the CPU-side asynchronous enqueue time (not meaningful as a latency metric, since the GPU may not have finished computing when the enqueue call returns)
- GPU Compute: the GPU computation time alone
In short, every timing except Enqueue Time is meaningful; a sketch reproducing the GPU Compute measurement follows.
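To reproduce the GPU Compute number in your own code, the usual approach is CUDA events around the enqueued inference. This sketch continues from the runtime example above (same `context` and `bufs`, pycuda assumed):

```python
start, end = cuda.Event(), cuda.Event()
stream = cuda.Stream()

start.record(stream)                     # timestamp before the inference
context.execute_async_v2([int(b) for b in bufs], stream.handle)
end.record(stream)                       # timestamp after the inference
end.synchronize()                        # wait until the GPU reaches 'end'

# Elapsed GPU time between the two events, analogous to trtexec's GPU Compute.
print('GPU compute: %.3f ms' % end.time_since(start))
```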
My own measurements are attached below.
Device | Model | Precision | Batch | GPU compute time / ms |
---|---|---|---|---|
3080pc | ddrnet23_ocr | fp16 | 1 | 1.70512 |
3080pc | ddrnet23_ocr | fp16 | 2 | 2.70069 |
3080pc | ddrnet23_ocr | fp16 | 4 | 4.78706 |
3080pc | ddrnet23_ocr | fp16 | 8 | 9.03271 |
3080pc | ddrnet23_ocr | fp16 | 16 | 16.1414 |
As the table shows, the GPU compute time grows roughly linearly with batch size: TensorRT's batched inference gives little per-image speedup for this model.