Accelerating YOLOv5 Inference with TensorRT

花花少年

Last modified: 2024-10-31 23:37:04

Category: Deep Learning. Tags: python, yolov5, tensorRT

First published: 2021-09-19 22:33:44

Article link: https://blog.csdn.net/m0_37605642/article/details/120241385

I. References

tensorrt_inference
Converting a YOLOv5 PyTorch model to TensorRT
YOLOv5 pruning, distillation, and compression

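II. Important notes

Serializing yolov5s.engine is slow: roughly 6 to 8 minutes for the FP16 and INT8 builds (the FP32 build took 31 s; see the table below).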

time ./yolov5 -s yolov5s.wts yolov5s.engine s

Output:
real	7m29.211s
user	5m10.066s
sys	0m42.794s

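yolov5s.trt and yolov5s.engine are the same file; only the extension differs.
C++ and Python API inference on yolov5s run at nearly the same speed, but GPU memory usage differs substantially.

Performance comparison of no TensorRT, TensorRT FP32, TensorRT FP16, and TensorRT INT8. The test dataset is COCO.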

	no TensorRT	TensorRT FP32	TensorRT FP16	TensorRT INT8
Engine size	~	38.3MB	21.5MB	10.8MB
Speed	12 ms/image, 83 fps	11 ms/image, 90 fps	7 ms/image, 142 fps	5 ms/image, 200 fps
Engine build time	~	31s	7m12s	7m27s
C++ API GPU memory	~	752MB	544MB	526MB
Python API GPU memory	1133MB	2285MB	2075MB	2057MB
Accuracy	~	~	~	no boxes detected
mAP	~	~	~	~
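
The fps figures in the table follow directly from the per-image latency. A minimal sketch of that arithmetic is shown here; the helper name `fps_from_latency` is hypothetical, and flooring is an assumption chosen because it reproduces the table's values (for example 142 rather than 143 for 7 ms).

```python
def fps_from_latency(ms_per_image: float) -> int:
    """Convert per-image latency in milliseconds to whole frames per second."""
    # Floor the result; this is an assumption that matches the rounding in the table.
    return int(1000 / ms_per_image)


for mode, ms in [("no TensorRT", 12), ("FP32", 11), ("FP16", 7), ("INT8", 5)]:
    print(f"{mode}: {ms} ms/image -> {fps_from_latency(ms)} fps")
```

Relative to the 12 ms baseline, the same numbers give a speedup of roughly 1.1x for FP32, 1.7x for FP16, and 2.4x for INT8.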

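The tensorrtx yolov5 code uses USE_FP16 by default. In a CNN, going from USE_FP32 to USE_FP16 essentially just drops a few decimal places of precision. Only USE_INT8 requires a calibration dataset for quantization.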
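
To get a feel for how little precision FP16 gives up, the sketch below rounds a few values to IEEE 754 half precision using only the Python standard library. The helper `to_fp16` and the sample values are illustrative assumptions, not part of tensorrtx. Python floats are double precision rather than FP32, but the rounding error shown is dominated by the 10-bit FP16 mantissa either way.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value, returned as a Python float."""
    # The "e" format code packs a 16-bit half-precision float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


for value in (0.1, 3.14159, 123.456):
    half = to_fp16(value)
    rel_err = abs(half - value) / abs(value)
    print(f"{value!r} -> {half!r} (relative error {rel_err:.2e})")
```

For values in the normal FP16 range the relative error stays below about 5e-4, which is why FP16 usually costs little accuracy. Values above 65504 do not fit in FP16 at all, so very large activations are the case to watch.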

III. Environment setup

GitHub source: tensorrtx

System environment

Environment
Operating System + Version: Ubuntu + 16.04
TensorRT Version: 7.1.3.4
GPU Type: GeForce GTX1650,4GB
Nvidia Driver Version: 470.63.01
CUDA Version: 10.2.300
CUDNN Version: 7.6.5
Python Version (if applicable): 3.7.3
Anaconda Version: 4.10.3
gcc: 7.5.0
g++: 7.5.0

tensorRT-yolov5.yaml

name: tensorRT-yolov5
channels:
  - <unknown>
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=4.5=1_gnu
  - blas=1.0=mkl
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2021.7.5=h06a4308_1
  - certifi=2021.5.30=py37h06a4308_0
  - cudatoolkit=10.2.89=hfd86e86_1
  - ffmpeg=4.2.2=h20bf706_0
  - freetype=2.10.4=h5ab3b9f_0
  - gmp=6.2.1=h2531618_2
  - gnutls=3.6.15=he1e5248_0
  - jpeg=9b=h024ee3a_2
  - lame=3.100=h7b6447c_0
  - lcms2=2.12=h3be6417_0
  - libedit=3.1.20210714=h7f8727e_0
  - libffi=3.2.1=hf484d3e_1007
  - libgcc-ng=9.3.0=h5101ec6_17
  - libgomp=9.3.0=h5101ec6_17
  - libidn2=2.3.2=h7f8727e_0
  - libopus=1.3.1=h7b6447c_0
  - libpng=1.6.37=hbc83047_0
  - libstdcxx-ng=9.3.0=hd4cf53a_17
  - libtasn1=4.16.0=h27cfd23_0
  - libtiff=4.2.0=h85742a9_0
  - libunistring=0.9.10=h27cfd23_0
  - libuv=1.40.0=h7b6447c_0
  - libvpx=1.7.0=h439df22_0
  - libwebp-base=1.2.0=h27cfd23_0
  - lz4-c=1.9.3=h295c915_1
  - mkl_fft=1.3.0=py37h42c9631_2
  - mkl_random=1.2.2=py37h51133e4_0
  - ncurses=6.2=he6710b0_1
  - nettle=3.7.3=hbbd107a_1
  - ninja=1.10.2=hff7bd54_1
  - numpy-base=1.20.3=py37h74d4b33_0
  - openh264=2.1.0=hd408876_0
  - openjpeg=2.4.0=h3ad879b_0
  - openssl=1.1.1l=h7f8727e_0
  - pip=21.2.2=py37h06a4308_0
  - python=3.7.3=h0371630_0
  - pytorch=1.8.0=py3.7_cuda10.2_cudnn7.6.5_0
  - readline=7.0=h7b6447c_5
  - setuptools=52.0.0=py37h06a4308_0
  - six=1.16.0=pyhd3eb1b0_0
  - sqlite=3.33.0=h62c20be_0
  - tk=8.6.10=hbc83047_0
  - torchvision=0.9.0=py37_cu102
  - typing_extensions=3.10.0.0=pyh06a4308_0
  - wheel=0.37.0=pyhd3eb1b0_0
  - x264=1!157.20191217=h7b6447c_0
  - xz=5.2.5=h7b6447c_0
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.4.9=haebb681_0
  - pip:
    - appdirs==1.4.4
    - charset-normalizer==2.0.4
    - cycler==0.10.0
    - dpcpp-cpp-rt==2021.3.0
    - flatbuffers==2.0
    - graphsurgeon==0.4.5
    - idna==3.2
    - intel-cmplr-lib-rt==2021.3.0
    - intel-cmplr-lic-rt==2021.3.0
    - intel-opencl-rt==2021.3.0
    - intel-openmp==2021.3.0
    - kiwisolver==1.3.1
    - mako==1.1.5
    - markupsafe==2.0.1
    - matplotlib==3.4.3
    - mkl==2021.3.0
    - mkl-fft==1.3.0
    - mkl-service==2.4.0
    - netron==5.1.6
    - numpy==1.21.2
    - olefile==0.46
    - onnx==1.10.1
    - onnx-simplifier==0.3.6
    - onnxoptimizer==0.2.6
    - onnxruntime==1.8.1
    - opencv-python==4.5.3.56
    - pandas==1.3.2
    - pillow==8.3.2
    - protobuf==3.17.3
    - pycuda==2021.1
    - pyparsing==2.4.7
    - python-dateutil==2.8.2
    - pytools==2021.2.8
    - pytz==2021.1
    - pyyaml==5.4.1
    - requests==2.26.0
    - scipy==1.7.1
    - seaborn==0.11.2
    - tbb==2021.3.0
    - tensorrt==7.1.3.4
    - torchsummary==1.5.1
    - tqdm==4.62.2
    - typing-extensions==3.10.0.2
    - uff==0.6.9
    - urllib3==1.26.6
prefix: /PATH/TO/miniconda3/envs/tensorRT-yolov5

requirements-gpu.txt

appdirs==1.4.4
certifi==2021.5.30
charset-normalizer==2.0.4
cycler==0.10.0
dpcpp-cpp-rt==2021.3.0
flatbuffers==2.0
graphsurgeon @ file:///PATH/TO/360Downloads/TensorRT-7.1.3.4/graphsurgeon/graphsurgeon-0.4.5-py2.py3-none-any.whl
idna==3.2
intel-cmplr-lib-rt==2021.3.0
intel-cmplr-lic-rt==2021.3.0
intel-opencl-rt==2021.3.0
intel-openmp==2021.3.0
kiwisolver==1.3.1
Mako==1.1.5
MarkupSafe==2.0.1
matplotlib==3.4.3
mkl==2021.3.0
mkl-fft==1.3.0
mkl-random @ file:///tmp/build/80754af9/mkl_random_1626179032232/work
mkl-service==2.4.0
netron==5.1.6
numpy==1.21.2
olefile==0.46
onnx==1.10.1
onnx-simplifier==0.3.6
onnxoptimizer==0.2.6
onnxruntime==1.8.1
opencv-python==4.5.3.56
pandas==1.3.2
Pillow==8.3.2
protobuf==3.17.3
pycuda==2021.1
pyparsing==2.4.7
python-dateutil==2.8.2
pytools==2021.2.8
pytz==2021.1
PyYAML==5.4.1
requests==2.26.0
scipy==1.7.1
seaborn==0.11.2
six @ file:///tmp/build/80754af9/six_1623709665295/work
tbb==2021.3.0
tensorrt @ file:///PATH/TO/360Downloads/TensorRT-7.1.3.4/python/tensorrt-7.1.3.4-cp37-none-linux_x86_64.whl
torch==1.8.0
torchsummary==1.5.1
torchvision==0.9.0
tqdm==4.62.2
typing-extensions==3.10.0.2
uff @ file:///PATH/TO/360Downloads/TensorRT-7.1.3.4/uff/uff-0.6.9-py2.py3-none-any.whl
urllib3==1.26.6
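
Before building anything, it can help to confirm that the key Python packages from the lists above actually import in the active environment. The following is a minimal sketch; the helper name `module_version` and the choice of which modules to check are assumptions made for illustration.

```python
import importlib


def module_version(name: str):
    """Return the module's __version__, "unknown" if it has none, or None if the import fails."""
    try:
        module = importlib.import_module(name)
    except (ImportError, OSError):
        # OSError covers the case where a shared library the module needs cannot be loaded.
        return None
    return getattr(module, "__version__", "unknown")


for name in ("tensorrt", "pycuda", "torch", "onnx", "onnxruntime", "cv2"):
    print(f"{name}: {module_version(name) or 'NOT INSTALLED'}")
```

A missing entry here usually means the wheel from the TensorRT download directory was not installed into this conda environment.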

IV. Key steps

1. Download the model

Download the pretrained YOLOv5 model: download link

2. Check and validate the model

import onnx

# Replace with the path to your ONNX model
filename = "yourONNXmodel.onnx"
model = onnx.load(filename)
onnx.checker.check_model(model)  # raises an exception if the model is invalid

import onnx
import numpy as np
import onnxruntime as rt
import cv2

model_path = '/home/oldpan/code/models/Resnet34_3inputs_448x448_20200609.onnx'

# Validate the model
onnx_model = onnx.load(model_path)
onnx.checker.check_model(onnx_model)

# Read the image and reshape it to the model's input dimensions (1, 3, 448, 448)
image = cv2.imread("data/images/person.png")
if image is None:
    raise FileNotFoundError("could not read data/images/person.png")
image = cv2.resize(image, (448, 448))
image = image.transpose(2, 0, 1)  # HWC -> CHW
image = image[np.newaxis, :, :, :].astype(np.float32)

# Create the inference session and look up the input names (this example model has three inputs)
sess = rt.InferenceSession(model_path)
input_name1 = sess.get_inputs()[0].name
input_name2 = sess.get_inputs()[1].name
input_name3 = sess.get_inputs()[2].name
output = sess.run(None, {input_name1: image, input_name2: image, input_name3: image})
print(output)

3. Edit CMakeLists.txt

File path: /PATH/TO/tensorrtx/yolov5/CMakeLists.txt

cmake_minimum_required(VERSION 2.6)

project(yolov5)

add_definitions(-std=c++11)
add_definitions(-DAPI_EXPORTS)
option(CUDA_USE_STATIC_CUDA_RUNTIME OFF)
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_BUILD_TYPE Debug)

find_package(CUDA REQUIRED)

if(WIN32)
enable_language(CUDA)
endif(WIN32)

include_directories(${PROJECT_SOURCE_DIR}/include)
# include and link dirs of cuda and tensorrt, you need adapt them if yours are different
# cuda
# adjust this directory for your system
include_directories(/usr/local/cuda/include)
link_directories(/usr/local/cuda/lib64)
# tensorrt
# adjust this directory for your system
include_directories(/PATH/TO/360Downloads/TensorRT-7.1.3.4/include/)
link_directories(/PATH/TO/360Downloads/TensorRT-7.1.3.4/lib/)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED")

cuda_add_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/yololayer.cu)
target_link_libraries(myplugins nvinfer cudart)

find_package(OpenCV)
include_directories(${OpenCV_INCLUDE_DIRS})

add_executable(yolov5 ${PROJECT_SOURCE_DIR}/calibrator.cpp ${PROJECT_SOURCE_DIR}/yolov5.cpp)
target_link_libraries(yolov5 nvinfer)
target_link_libraries(yolov5 cudart)
target_link_libraries(yolov5 myplugins)
target_link_libraries(yolov5 ${OpenCV_LIBS})

if(UNIX)
add_definitions(-O2 -pthread)
endif(UNIX)

4. Configure with cmake

(tensorRT-yolov5) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5/build$ time cmake ..
CMake Deprecation Warning at CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 2.8.12 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.


-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found CUDA: /usr/local/cuda (found version "10.2") 
-- Found OpenCV: /usr/local/opencv3.3.0 (found version "3.3.0") 
-- Configuring done
-- Generating done
-- Build files have been written to: /PATH/TO/tensorrtx/yolov5/build

real	0m0.241s
user	0m0.201s
sys	0m0.042s

5. Build with make

(tensorRT-yolov5) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5/build$ time make -j6
[ 20%] Building NVCC (Device) object 
...
[100%] Linking CXX executable yolov5
[100%] Built target yolov5

real	0m4.723s
user	0m5.887s
sys	0m0.421s

V. TensorRT FP32 inference

1. Set USE_FP32

Edit /PATH/TO/tensorrtx/yolov5/yolov5.cpp.

#define USE_FP32  // set USE_INT8 or USE_FP16 or USE_FP32

2. Configure with cmake

cd /PATH/TO/tensorrtx/yolov5
mkdir build && cd build
cp {ultralytics}/yolov5/yolov5s.wts {tensorrtx}/yolov5/build
cmake ..

3. Build with make

(yolov5-pytorch) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5/build$ time make -j6
[ 20%] Building NVCC (Device) object 
...
[100%] Linking CXX executable yolov5
[100%] Built target yolov5

real	0m4.702s
user	0m5.841s
sys	0m0.406s

4. Convert wts to engine

Converting the .wts file to an engine is the engine serialization step.
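
For orientation, a .wts file is plain text. The sketch below assumes the layout written by the tensorrtx gen_wts.py script: the first line holds the tensor count, and each following line holds a tensor name, its element count, and one big-endian float32 hex word per element. The function name `parse_wts` and the two-tensor sample are hypothetical and not taken from a real yolov5s.wts.

```python
import struct


def parse_wts(text: str) -> dict:
    """Parse .wts text into {tensor_name: [float, ...]} (assumed tensorrtx layout)."""
    lines = text.strip().splitlines()
    count = int(lines[0])  # first line: number of tensors
    weights = {}
    for line in lines[1:count + 1]:
        name, size, *hex_words = line.split()
        # Each word is the hex form of a big-endian 32-bit float.
        values = [struct.unpack(">f", bytes.fromhex(word))[0] for word in hex_words]
        if len(values) != int(size):
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")
        weights[name] = values
    return weights


# Hypothetical two-tensor sample for illustration only.
sample = "2\nconv.weight 3 3f800000 3f000000 c0000000\nconv.bias 1 00000000\n"
weights = parse_wts(sample)
print(weights)
```

The C++ program reads these weights, rebuilds the network layer by layer through the TensorRT API, and then serializes the optimized engine to disk, which is where most of the build time goes.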

(yolov5-pytorch) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -s yolov5s.wts yolov5s.engine s
Loading weights: yolov5s.wts
Building engine, please wait for a while...
Build engine successfully!

real	0m31.284s
user	0m24.642s
sys	0m1.750s

yolov5s.engine: 38.3 MB

GPU memory usage:

Thu Sep  9 14:23:23 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   40C    P0    24W /  75W |    829MiB /  3903MiB |     24%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1658      G   /usr/lib/xorg/Xorg                206MiB |
|    0   N/A  N/A     13920      C   ./yolov5                          619MiB |
+-----------------------------------------------------------------------------+
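
When comparing many runs, reading the per-process memory off these tables by hand gets tedious. The sketch below pulls the PID, process name, and memory out of the process rows of saved nvidia-smi output. The regular expression and the function name `process_memory` are assumptions based on the table layout shown in this post, which may differ across driver versions.

```python
import re

# Matches process rows such as:
# |    0   N/A  N/A     13920      C   ./yolov5                          619MiB |
PROCESS_ROW = re.compile(
    r"^\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)MiB\s+\|$"
)


def process_memory(smi_text: str):
    """Return a list of (pid, process_name, memory_mib) tuples from nvidia-smi table text."""
    rows = []
    for line in smi_text.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match:
            rows.append((int(match.group(2)), match.group(4), int(match.group(5))))
    return rows


sample = """\
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|    0   N/A  N/A      1658      G   /usr/lib/xorg/Xorg                206MiB |
|    0   N/A  N/A     13920      C   ./yolov5                          619MiB |
"""
rows = process_memory(sample)
print(rows)
```

Only the compute process row (type C) is relevant to the comparison table; the Xorg row is the desktop session.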

5. Model inference (C++)

Output images are written to /PATH/TO/tensorrtx/yolov5/build.

(yolov5-pytorch) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -d yolov5s.engine ../samples
375ms
13ms
12ms
13ms
14ms
12ms
12ms
...
10ms
10ms
10ms
11ms

real	0m41.621s
user	0m29.085s
sys	0m3.601s

1000 images at 640x640 resolution, averaging 11 ms per image, i.e. about 90 fps.
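
The first iteration (375 ms) includes one-time start-up cost, so it should be left out of the average. A small sketch of that calculation over the lines shown above follows; the helper name `mean_latency_ms` is hypothetical, and the result covers only the excerpt printed here, not all 1000 images.

```python
import re


def mean_latency_ms(log_lines, skip_first=1):
    """Average the 'NNms' lines of the log, ignoring the first skip_first warm-up samples."""
    samples = []
    for line in log_lines:
        match = re.fullmatch(r"(\d+)ms", line.strip())
        if match:  # lines such as '...' or 'real 0m41.621s' are skipped
            samples.append(int(match.group(1)))
    steady = samples[skip_first:]
    return sum(steady) / len(steady)


excerpt = ["375ms", "13ms", "12ms", "13ms", "14ms", "12ms", "12ms", "...",
           "10ms", "10ms", "10ms", "11ms"]
average = mean_latency_ms(excerpt)
print(f"{average:.1f} ms/image")
```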

GPU memory usage:

Thu Sep  9 14:25:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   42C    P0    35W /  75W |    962MiB /  3903MiB |     43%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1658      G   /usr/lib/xorg/Xorg                206MiB |
|    0   N/A  N/A     13988      C   ./yolov5                          752MiB |
+-----------------------------------------------------------------------------+

6. Model inference (Python)

Run the engine through the Python API.

# install python-tensorrt, pycuda, etc.
# ensure the yolov5s.engine and libmyplugins.so have been built
python yolov5_trt.py

(tensorRT-yolov5) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5$ time python yolov5_trt.py 
----------- True
bingding: data (3, 640, 640)
bingding: prob (6001, 1, 1)
batch size is 1
warm_up->(640, 640, 3), time->416.93ms
warm_up->(640, 640, 3), time->11.84ms
warm_up->(640, 640, 3), time->13.25ms
warm_up->(640, 640, 3), time->12.98ms
warm_up->(640, 640, 3), time->12.79ms
warm_up->(640, 640, 3), time->12.70ms
warm_up->(640, 640, 3), time->11.82ms
warm_up->(640, 640, 3), time->11.90ms
warm_up->(640, 640, 3), time->13.13ms
warm_up->(640, 640, 3), time->11.89ms
input->['samples/COCO_train2014_000000421903.jpg'], time->10.30ms, saving into output/
input->['samples/COCO_train2014_000000145736.jpg'], time->11.23ms, saving into output/
input->['samples/COCO_train2014_000000482834.jpg'], time->11.26ms, saving into output/
...
input->['samples/COCO_train2014_000000221565.jpg'], time->10.94ms, saving into output/
input->['samples/COCO_train2014_000000366274.jpg'], time->10.30ms, saving into output/
input->['samples/COCO_train2014_000000048824.jpg'], time->10.77ms, saving into output/

real	1m14.491s
user	0m53.540s
sys	0m8.307s

1000 images at 640x640 resolution, averaging 11 ms per image, i.e. about 90 fps.
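
The `prob (6001, 1, 1)` binding in the log is the flat output buffer. The sketch below shows one way to read it, assuming the tensorrtx layout in which element 0 holds the number of detections and up to 1000 detections follow, each as 6 floats (center x, center y, width, height, confidence, class id). The layout, the function name `decode_output`, and the sample values are assumptions for illustration, and non-maximum suppression is left out.

```python
DET_SIZE = 6          # assumed floats per detection: cx, cy, w, h, confidence, class_id
MAX_DETECTIONS = 1000  # assumed cap, so the buffer length is 1 + 1000 * 6 = 6001


def decode_output(buffer, conf_thresh=0.5):
    """Turn the flat output buffer into a list of detections above conf_thresh."""
    count = min(int(buffer[0]), MAX_DETECTIONS)  # element 0 is the detection count
    detections = []
    for i in range(count):
        start = 1 + i * DET_SIZE
        cx, cy, w, h, conf, class_id = buffer[start:start + DET_SIZE]
        if conf >= conf_thresh:
            detections.append({"box": (cx, cy, w, h), "conf": conf, "class_id": int(class_id)})
    return detections


# Hypothetical buffer holding two raw detections, one of them below the threshold.
buffer = [0.0] * (1 + MAX_DETECTIONS * DET_SIZE)
buffer[0] = 2.0
buffer[1:7] = [320.0, 240.0, 100.0, 50.0, 0.9, 0.0]
buffer[7:13] = [100.0, 100.0, 20.0, 20.0, 0.3, 16.0]
detections = decode_output(buffer)
print(detections)
```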

GPU memory usage:

Thu Sep  9 14:35:54 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 28%   43C    P0    27W /  75W |   2495MiB /  3903MiB |     33%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1658      G   /usr/lib/xorg/Xorg                206MiB |
|    0   N/A  N/A     15510      C   python                           2285MiB |
+-----------------------------------------------------------------------------+

VI. TensorRT FP16 inference

1. Set USE_FP16

Edit /PATH/TO/tensorrtx/yolov5/yolov5.cpp. FP16 is the default.

#define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32

2. Configure with cmake

cd /PATH/TO/tensorrtx/yolov5
mkdir build
cd build
cp {ultralytics}/yolov5/yolov5s.wts {tensorrtx}/yolov5/build
cmake ..

3. Build with make

(tensorRT-yolov5) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5/build$ time make -j6
[ 20%] Building NVCC (Device) object 
...
[100%] Linking CXX executable yolov5
[100%] Built target yolov5

real	0m4.723s
user	0m5.887s
sys	0m0.421s

4. Convert wts to engine

Converting the .wts file to an engine is the engine serialization step.

(tensorRT-yolov5) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -s yolov5s.wts yolov5s.engine s
Loading weights: yolov5s.wts
Building engine, please wait for a while...
Build engine successfully!

real	7m11.939s
user	4m43.104s
sys	0m39.300s

yolov5s.engine: 21.5 MB

GPU memory usage:

Thu Sep  9 15:20:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 29%   44C    P0    24W /  75W |    843MiB /  3903MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1658      G   /usr/lib/xorg/Xorg                216MiB |
|    0   N/A  N/A     17616      C   ./yolov5                          623MiB |
+-----------------------------------------------------------------------------+

5. Model inference (C++)

Output images are written to /PATH/TO/tensorrtx/yolov5/build.

(tensorRT-yolov5) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -d yolov5s.engine ../samples
7ms
8ms
7ms
7ms
7ms
7ms
8ms
8ms
7ms
7ms
7ms
...
7ms

real	0m37.748s
user	0m27.568s
sys	0m2.609s

1000 images at 640x640 resolution, averaging 7 ms per image, i.e. about 142 fps.

GPU memory usage:

Wed Sep  8 14:35:53 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   39C    P0    18W /  75W |    790MiB /  3903MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                242MiB |
|    0   N/A  N/A      8440      C   ./yolov5                          544MiB |
+-----------------------------------------------------------------------------+

6. Model inference (Python)

Run the engine through the Python API.

# install python-tensorrt, pycuda, etc.
# ensure the yolov5s.engine and libmyplugins.so have been built
python yolov5_trt.py

(tensorRT-yolov5) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5$ time python yolov5_trt.py 
----------- True
bingding: data (3, 640, 640)
bingding: prob (6001, 1, 1)
batch size is 1
warm_up->(640, 640, 3), time->7.03ms
warm_up->(640, 640, 3), time->6.38ms
warm_up->(640, 640, 3), time->6.99ms
warm_up->(640, 640, 3), time->6.42ms
warm_up->(640, 640, 3), time->6.42ms
warm_up->(640, 640, 3), time->6.42ms
warm_up->(640, 640, 3), time->6.99ms
warm_up->(640, 640, 3), time->7.30ms
warm_up->(640, 640, 3), time->6.98ms
warm_up->(640, 640, 3), time->7.28ms
input->['samples/COCO_train2014_000000421903.jpg'], time->7.25ms, saving into output/
input->['samples/COCO_train2014_000000145736.jpg'], time->6.71ms, saving into output/
input->['samples/COCO_train2014_000000482834.jpg'], time->6.70ms, saving into output/
input->['samples/COCO_train2014_000000393241.jpg'], time->6.79ms, saving into output/
...
input->['samples/COCO_train2014_000000221565.jpg'], time->6.70ms, saving into output/
input->['samples/COCO_train2014_000000366274.jpg'], time->6.61ms, saving into output/
input->['samples/COCO_train2014_000000048824.jpg'], time->6.70ms, saving into output/

real	0m51.729s
user	0m44.069s
sys	0m5.622s

1000 images at 640x640 resolution, averaging 7 ms per image, i.e. about 142 fps.

GPU memory usage:

Wed Sep  8 15:52:45 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   39C    P0    22W /  75W |   2321MiB /  3903MiB |     29%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                242MiB |
|    0   N/A  N/A     11220      C   python                           2075MiB |
+-----------------------------------------------------------------------------+

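VII. TensorRT INT8 quantized inference

1. Download the calibration dataset coco_calib

Baidu Netdisk, extraction code: a9wh
Google Drive

Extract the calibration dataset to /PATH/TO/tensorrtx/yolov5/build/coco_calib

Create a symbolic link: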

ln -s /PATH/TO/tensorrtx/yolov5/build/coco_calib /PATH/TO/tensorrtx/yolov5/samples

2. Set USE_INT8

Edit /PATH/TO/tensorrtx/yolov5/yolov5.cpp.

#define USE_INT8  // set USE_INT8 or USE_FP16 or USE_FP32
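
INT8 is the only mode that needs calibration images, because the builder has to pick a scale that maps each tensor's floating-point range onto 8-bit integers. The sketch below illustrates the idea with plain max-abs calibration. This is a deliberate simplification: tensorrtx relies on a TensorRT entropy calibrator, which selects thresholds differently. The function names and sample activations are hypothetical.

```python
def calibrate_scale(calibration_values):
    """Max-abs calibration: map the largest observed magnitude to the INT8 limit 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(x, scale):
    """Quantize a float to a signed 8-bit integer, clipping to [-127, 127]."""
    return max(-127, min(127, round(x / scale)))


def dequantize(q, scale):
    """Map the integer back to an approximate float."""
    return q * scale


activations = [-2.54, -0.3, 0.0, 0.7, 1.0]  # hypothetical values seen during calibration
scale = calibrate_scale(activations)
for x in (0.7, 1.0, 10.0):
    q = quantize(x, scale)
    print(f"{x} -> int8 {q} -> {dequantize(q, scale):.4f}")
```

The value 10.0 lies outside the calibrated range and is clipped to 127, which shows why calibration images should resemble the data used at inference time.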

3. Configure with cmake

cd /PATH/TO/tensorrtx/yolov5
mkdir build
cd build
cp {ultralytics}/yolov5/yolov5s.wts {tensorrtx}/yolov5/build
cmake ..

4. Build with make

(tensorRT-yolov5) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5/build$ time make -j6
[ 20%] Building NVCC (Device) object 
...
[100%] Linking CXX executable yolov5
[100%] Built target yolov5

real	0m4.709s
user	0m5.902s
sys	0m0.373s

5. Convert wts to engine

Converting the .wts file to an engine is the engine serialization step.

sudo ./yolov5 -s [.wts] [.engine] [s/m/l/x/s6/m6/l6/x6 or c/c6 gd gw]  # serialize model to plan file

# For example, yolov5s
sudo ./yolov5 -s yolov5s.wts yolov5s.engine s

(tensorRT-yolov5) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -s yolov5s.wts yolov5s.engine s
Loading weights: yolov5s.wts
Your platform support int8: true
Building engine, please wait for a while...
reading calib cache: int8calib.table
COCO_train2014_000000421903.jpg  0
COCO_train2014_000000145736.jpg  1
...
COCO_train2014_000000048824.jpg  999
reading calib cache: int8calib.table
writing calib cache: int8calib.table size: 13506
Build engine successfully!

real	7m27.392s
user	6m58.768s
sys	0m38.621s

yolov5s.engine: 10.8 MB

GPU memory usage:

Wed Sep  8 15:20:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 33%   46C    P0    18W /  75W |    920MiB /  3903MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                242MiB |
|    0   N/A  N/A      9326      C   ./yolov5                          674MiB |
+-----------------------------------------------------------------------------+

6. Model inference (C++)

Output images are written to /PATH/TO/tensorrtx/yolov5/build.

(tensorRT-yolov5) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -d yolov5s.engine ../samples
5ms
6ms
5ms
5ms
...
5ms
6ms
5ms

real	0m24.968s
user	0m23.439s
sys	0m1.660s

1000 images at 640x640 resolution, averaging 5 ms per image, i.e. about 200 fps.

GPU memory usage:

Wed Sep  8 15:23:56 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   43C    P0    24W /  75W |    772MiB /  3903MiB |     38%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                242MiB |
|    0   N/A  N/A      9573      C   ./yolov5                          526MiB |
+-----------------------------------------------------------------------------+

7. Model inference (Python)

Run the INT8-quantized engine through the Python API.

# install python-tensorrt, pycuda, etc.
# ensure the yolov5s.engine and libmyplugins.so have been built
python yolov5_trt.py

(tensorRT-yolov5) yoyo@yoyo:~/MyDocuments/tensorrtx/yolov5$ python yolov5_trt.py 
----------- True
bingding: data (3, 640, 640)
bingding: prob (6001, 1, 1)
batch size is 1
warm_up->(640, 640, 3), time->7.82ms
warm_up->(640, 640, 3), time->4.51ms
warm_up->(640, 640, 3), time->4.55ms
warm_up->(640, 640, 3), time->4.61ms
warm_up->(640, 640, 3), time->5.11ms
warm_up->(640, 640, 3), time->4.81ms
warm_up->(640, 640, 3), time->4.56ms
warm_up->(640, 640, 3), time->4.75ms
warm_up->(640, 640, 3), time->4.52ms
warm_up->(640, 640, 3), time->4.91ms
input->['samples/COCO_train2014_000000421903.jpg'], time->4.57ms, saving into output/
input->['samples/COCO_train2014_000000145736.jpg'], time->5.38ms, saving into output/
input->['samples/COCO_train2014_000000482834.jpg'], time->4.66ms, saving into output/
input->['samples/COCO_train2014_000000393241.jpg'], time->5.01ms, saving into output/
input->['samples/COCO_train2014_000000548377.jpg'], time->5.28ms, saving into output/
input->['samples/COCO_train2014_000000329954.jpg'], time->4.93ms, saving into output/
...
input->['samples/COCO_train2014_000000141181.jpg'], time->5.19ms, saving into output/
input->['samples/COCO_train2014_000000221565.jpg'], time->5.15ms, saving into output/
input->['samples/COCO_train2014_000000366274.jpg'], time->4.76ms, saving into output/
input->['samples/COCO_train2014_000000048824.jpg'], time->4.80ms, saving into output/

1000 images at 640x640 resolution, averaging 5 ms per image, i.e. about 200 fps.

GPU memory usage:

Wed Sep  8 15:39:14 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   40C    P0    20W /  75W |   2303MiB /  3903MiB |     23%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                242MiB |
|    0   N/A  N/A      9979      C   python                           2057MiB |
+-----------------------------------------------------------------------------+