TensorRT & C++ & TensorFlow & DenseNet & CIFAR10 Model Deployment

半步鸠

Article link: https://blog.csdn.net/weixin_44285683/article/details/126533082

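A bit of background: this post comes out of a real project in which I was responsible for deploying various models inside a Qt application. I went through model deployment with Libtorch, PyTorch, and TensorFlow in turn. My understanding is not deep, but everything runs end to end.

Overview: using a TensorFlow-trained DenseNet121 that classifies CIFAR10 as the running example, this post covers TensorRT-accelerated deployment of the model in a C++ environment.

0. Environment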

Name	Version
TensorRT	TensorRT-7.2.3.4.Windows10.x86_64.cuda-11.1.cudnn8.1
tensorflow-gpu	2.9.1
C++ Compiler	MSVC/14.29.30133
CUDA	11.1
cuDNN	8.4.1
libtorch	libtorch-1.8.2+cu111
pytorch	torch1.12.0+cu113
tf2onnx	1.11.1
OpenCV	opencv-3.4.13
keras	2.9.0
h5py	3.9.0
Windows	Windows 10 Home (Chinese edition) 19044.1889

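The overall deployment workflow, as followed in the sections below: train and save the Keras model, freeze it into a .pb file, convert it to ONNX, build a TensorRT engine, then load the engine and run inference from C++.

[Figure: onnx-workflow.png, deployment workflow (image unavailable)]

See also: Speeding Up Deep Learning Inference Using TensorFlow, ONNX, and TensorRT

1. Model Training and Saving

Following this blog post, we train a CIFAR10 model based on the DenseNet network from keras.applications and save it in .hdf5 format.

We add a resize layer in front of the standard DenseNet121 network so that it can accept the 32x32x3 images from the CIFAR10 dataset. The code is as follows: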

import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import keras as K
from keras import datasets, layers, models

def preprocess_data(X, Y):
    """Preprocess images for DenseNet and one-hot encode the labels."""
    X_p = K.applications.densenet.preprocess_input(X)
    # one-hot encode target values (10 classes)
    Y_p = K.utils.to_categorical(Y, 10)
    return X_p, Y_p

"""load dataset"""
(trainX, trainy), (testX, testy) = K.datasets.cifar10.load_data()
x_train, y_train = preprocess_data(trainX, trainy)
x_test, y_test = preprocess_data(testX, testy)

""" USE DenseNet121"""
OldModel = K.applications.DenseNet121(include_top=False,input_tensor=None,weights='imagenet')
for layer in OldModel.layers[:149]:
    layer.trainable = False
for layer in OldModel.layers[149:]:
    layer.trainable = True

model = K.models.Sequential()

"""a lambda layer that scales up the data to the correct size"""
model.add(K.layers.Lambda(lambda x:K.backend.resize_images(x,height_factor=7,width_factor=7,data_format='channels_last')))

model.add(OldModel)
model.add(K.layers.Flatten())
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(256, activation='relu'))
model.add(K.layers.Dropout(0.7))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(128, activation='relu'))
model.add(K.layers.Dropout(0.5))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(64, activation='relu'))
model.add(K.layers.Dropout(0.3))
model.add(K.layers.Dense(10, activation='softmax'))
"""callbacks"""
# cbacks =  K.callbacks.CallbackList()
# cbacks.append(K.callbacks.ModelCheckpoint(filepath='cifar10.h5',monitor='val_accuracy',save_best_only=True))
# cbacks.append(K.callbacks.EarlyStopping(monitor='val_accuracy',patience=2))

model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
"""train"""
model.fit(x=x_train,y=y_train,batch_size=128,epochs=5,validation_data=(x_test, y_test))
model.summary()

model.save('cifar10.h5')

In practice, if the cifar10.h5 model trained this way is used for the conversions below, building the trt engine file fails with:

[07/28/2022-12:54:39] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
ERROR: builtin_op_importers.cpp:2593 In function importResize:
[8] Assertion failed: (mode != "nearest" || nearest_mode == "floor") && "This version of TensorRT only supports floor nearest_mode!"
[07/28/2022-12:54:39] [E] Failed to parse onnx file
[07/28/2022-12:54:39] [E] Parsing model failed
[07/28/2022-12:54:39] [E] Engine creation failed
[07/28/2022-12:54:39] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec

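This is caused by a limitation in the current TensorRT version (see issue #974 (comment)): the Resize op produced by resize_images is not supported. Another unsupported op is NonZero ("op is not supported in TRT yet").

The keras.backend.resize_images call used in the training code above is exported as a Resize in nearest mode + half_pixel + round_prefer_ceil.

An identical issue has been reported.

Solution: change the Lambda layer to model.add(K.layers.Lambda(lambda x:tf.image.resize(x,[224,224]))).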
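To see why the rounding mode matters, the sketch below compares which source index each output pixel reads under the two nearest modes named in the error message, using the ONNX Resize half_pixel coordinate transform. The function names and the small 4-to-8 upscale are illustrative assumptions; this is not TensorRT or tf2onnx code.

```python
import math

def half_pixel_source_coord(out_index, scale):
    """ONNX Resize 'half_pixel' transform: output index -> source coordinate."""
    return (out_index + 0.5) / scale - 0.5

def nearest_indices(in_size, scale, nearest_mode):
    """Return the source index read by every output pixel along one axis."""
    out_size = int(in_size * scale)
    indices = []
    for i in range(out_size):
        x = half_pixel_source_coord(i, scale)
        if nearest_mode == "floor":
            src = math.floor(x)
        elif nearest_mode == "round_prefer_ceil":
            src = math.floor(x + 0.5)  # round to nearest, ties go up
        else:
            raise ValueError(f"unsupported nearest_mode: {nearest_mode}")
        # Clamp to the valid source range.
        indices.append(min(max(src, 0), in_size - 1))
    return indices

floor_idx = nearest_indices(4, 2, "floor")
ceil_idx = nearest_indices(4, 2, "round_prefer_ceil")
print(floor_idx)  # [0, 0, 0, 1, 1, 2, 2, 3]
print(ceil_idx)   # [0, 0, 1, 1, 2, 2, 3, 3]
```

The two modes select different source pixels, so one cannot be silently substituted for the other without changing the result.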

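OK. With the Keras Sequential model, assembling your own network is quick, and saving it is convenient too.

2. Freezing the Model

The hdf5 model can still be trained further, since its weights are stored as variables. We now freeze it into a .pb file, with the variables converted to constants, for forward inference only.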

import tensorflow as tf
import keras as K
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

def convert_h5to_pb():
    model = tf.keras.models.load_model("E:/cifar10.h5",compile=False)
    model.summary()
    full_model = tf.function(lambda Input: model(Input))
    full_model = full_model.get_concrete_function(tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

    # Get frozen ConcreteFunction
    frozen_func = convert_variables_to_constants_v2(full_model)
    frozen_func.graph.as_graph_def()

    layers = [op.name for op in frozen_func.graph.get_operations()]
    print("-" * 50)
    print("Frozen model layers: ")
    for layer in layers:
        print(layer)

    print("-" * 50)
    print("Frozen model inputs: ")
    print(frozen_func.inputs)
    print("Frozen model outputs: ")
    print(frozen_func.outputs)

    # Save frozen graph from frozen ConcreteFunction to hard drive
    tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                      logdir="E:/",
                      name="cifar10.pb",
                      as_text=False)
convert_h5to_pb()

#output
--------------------------------------------------
Frozen model inputs: 
[<tf.Tensor 'Input:0' shape=(None, 32, 32, 3) dtype=float32>]
Frozen model outputs: 
[<tf.Tensor 'Identity:0' shape=(None, 10) dtype=float32>]

3. Converting to ONNX

Use the tf2onnx.convert command to convert the .pb file into an .onnx file:

python -m tf2onnx.convert  --input E:/cifar10.pb --inputs Input:0 --outputs Identity:0 --output E:/cifar10.onnx --opset 11

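--inputs: the name of the model's input tensor. --outputs: the name of the model's output tensor.
Both names are printed by the freezing script above.

The generated ONNX file can be visualized with Netron to inspect the network structure.
In Netron, the input tensor of the ONNX model is shown as **float32[unk__1220,224,224,3]**, in TensorFlow's NHWC layout.

4. Building the Optimized Engine File

(References: trtexec usage; "TensorRT: parameter guide for the bundled trtexec tool"; the official documentation; a test blog post.)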

trtexec --onnx=cifar10.onnx --saveEngine=cifar10.trt --workspace=4096 --minShapes=Input:0:1x32x32x3 --optShapes=Input:0:1x32x32x3 --maxShapes=Input:0:50x32x32x3 --fp16

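onnx: the input ONNX model.
saveEngine: where to save the serialized TensorRT engine.
workspace: the amount of GPU memory TensorRT may use as workspace, in MB. It is sometimes too small and has to be increased manually.
minShapes: the minimum shape for dynamic dimensions, written as <input node name>:<dims>. The dims must follow the layout of the ONNX input, which is NHWC for this TensorFlow-exported model (not NCHW).
optShapes: the shape the engine is tuned for. trtexec also runs its inference test with this input shape.
maxShapes: the maximum shape for dynamic dimensions. Here only the batch dimension is dynamic; the other dimensions are fixed.
fp16: enable float16 inference.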
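Since the three shape flags are easy to mistype, here is a minimal sketch that assembles the same command from its parts. The helper names shape_flag and build_trtexec_args are hypothetical; only flags that already appear in the command above are used.

```python
def shape_flag(flag, input_name, dims):
    """Format one trtexec shape flag, e.g. --minShapes=Input:0:1x32x32x3."""
    return f"--{flag}={input_name}:{'x'.join(str(d) for d in dims)}"

def build_trtexec_args(onnx_path, engine_path, input_name, hwc,
                       min_batch, opt_batch, max_batch):
    """Assemble trtexec arguments for an NHWC input with a dynamic batch."""
    if not (1 <= min_batch <= opt_batch <= max_batch):
        raise ValueError("batch sizes must satisfy 1 <= min <= opt <= max")
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        "--workspace=4096",
        shape_flag("minShapes", input_name, (min_batch, *hwc)),
        shape_flag("optShapes", input_name, (opt_batch, *hwc)),
        shape_flag("maxShapes", input_name, (max_batch, *hwc)),
        "--fp16",
    ]

args = build_trtexec_args("cifar10.onnx", "cifar10.trt", "Input:0",
                          (32, 32, 3), min_batch=1, opt_batch=1, max_batch=50)
print(" ".join(args))
```

The resulting list can be handed to subprocess.run(args) on a machine where trtexec is on the PATH.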

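5. Data Preprocessing

Our end goal is to run forward inference on data with the engine. By the end of section 4 we have the final "model", namely the serialized engine file. What follows is data preprocessing, that is, loading the data. (I directly used CIFAR10 code that another developer adapted from the official MNIST dataset-handling code; see the github link.)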
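The training script passed the images through K.applications.densenet.preprocess_input, so the inference side has to apply the same normalization before filling the input buffer. The sketch below assumes that this function uses Keras's "torch" mode (scale to [0, 1], then normalize each channel with the ImageNet mean and standard deviation); that assumption, and the constants, are worth checking against the installed Keras version. It is written in plain Python for clarity; the C++ code would do the same arithmetic.

```python
# Assumed constants: ImageNet per-channel mean/std of Keras's "torch" mode.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def preprocess_pixel(rgb):
    """Normalize one (R, G, B) pixel whose values are in 0..255."""
    return tuple((c / 255.0 - m) / s
                 for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

def preprocess_image(pixels):
    """Normalize an image given as rows of (R, G, B) tuples, keeping HWC order."""
    return [[preprocess_pixel(p) for p in row] for row in pixels]

# A 1x2 toy image: one black pixel and one white pixel.
image = [[(0, 0, 0), (255, 255, 255)]]
normalized = preprocess_image(image)
print(normalized[0][0])
print(normalized[0][1])
```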

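To feed data in dynamic batches, we can use Libtorch's DataLoader. To define a custom dataset class, we only need to override the get and size methods of torch::data::Dataset.

This article is enough to teach yourself how to load custom data types: Custom Data Loading using PyTorch C++ API

Assuming the CustomDataset class has already been written, the code that feeds data batch by batch can look roughly like this: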

// Build the dataset and stack samples into batch tensors
auto test_dataset = CustomDataset(dataset_path, ".txt", class2label)
    .map(torch::data::transforms::Stack<>());
// const size_t test_dataset_size = test_dataset.size().value();  // query before the move below
// Build the DataLoader (the dataset is moved into the loader)
auto test_data_loader = torch::data::make_data_loader(
    std::move(test_dataset), INFERENCE_BATCH);
for (const auto& batch : *test_data_loader){
    torch::Tensor inputs_tensor = batch.data;
    torch::Tensor labels_tensor = batch.target;
    ...
}
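One detail to watch with a dynamic batch dimension: unless the loader is configured to drop it, the last batch can be smaller than INFERENCE_BATCH, and every batch size has to stay within the range given to trtexec. The sketch below illustrates this in plain Python; iter_batches and the value 32 for INFERENCE_BATCH are hypothetical, while 50 comes from --maxShapes and 10000 is the CIFAR10 test-set size.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

MAX_PROFILE_BATCH = 50   # upper bound from --maxShapes
INFERENCE_BATCH = 32     # hypothetical value for illustration

if INFERENCE_BATCH > MAX_PROFILE_BATCH:
    raise ValueError("batch size exceeds the engine's optimization profile")

samples = list(range(10000))  # stand-in for the 10000 CIFAR10 test images
batch_sizes = [len(b) for b in iter_batches(samples, INFERENCE_BATCH)]
print(len(batch_sizes), batch_sizes[0], batch_sizes[-1])  # 313 32 16
```

Because the final batch holds 16 samples rather than 32, the binding shape would need to be set per batch rather than once.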

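6. Loading the Engine File

Steps:

Read the .trt file into a buffer.
Create a runtime object with nvinfer1::createInferRuntime.
Call the runtime's deserializeCudaEngine method to deserialize the .trt data into an engine object.
Call IExecutionContext* context = engine->createExecutionContext(); to obtain the execution context.

Inference is then performed through the context's enqueueV2 method. The first three steps can be wrapped in a single function, called readTRTfile, that returns an engine object.

The function returns the engine rather than the context because we need the engine's methods to inspect the model's input and output dimensions.

[Key point] The models produced earlier (whether .pb or .pt files) all have a dynamic batch size, so the exported ONNX has a dynamic input. When converting to .trt we specified the range of input shapes allowed at inference time, but only the range. After the .trt file is deserialized into an engine, the concrete dimensions must be set before the engine is used. If they are not set, or are wrong, the following error is raised: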

[E] [TRT] Parameter check failed at: engine.cpp::nvinfer1::rt::ShapeMachineContext::resolveSlots::1318, condition: allInputDimensionsSpecified(routine)

Solution:

// Print the engine's input and output dimensions
for (int i = 0; i < engine->getNbBindings(); i++){
    nvinfer1::Dims dims = engine->getBindingDimensions(i);
    printf("index %d, dims: (",i);
    for (int d = 0; d < dims.nbDims; d++){
        if (d < dims.nbDims - 1)	printf("%d,", dims.d[d]);
        else	printf("%d", dims.d[d]);
    }	printf(")\n");
}

Taking the DenseNet121 .trt file as an example, the program above prints:

index 0, dims: (-1,224,224,3)
index 1, dims: (-1,100)

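So we have to pin down the dynamic input dimension. In Python, it is enough to call context.set_binding_shape(0, (BATCH, 3, INPUT_H, INPUT_W)) before running inference (that example is in NCHW order; for the NHWC engine here the order would be (BATCH, INPUT_H, INPUT_W, 3)). In C++, call the setBindingDimensions(int bindingIndex, Dims dimensions) method on the IExecutionContext instance.

// Fix the dynamic dimension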
nvinfer1::Dims dims4;
dims4.d[0] = 1;    // replace dynamic batch size with 1
dims4.d[1] = 224;
dims4.d[2] = 224;
dims4.d[3] = 3;
dims4.nbDims = 4;
context->setBindingDimensions(0, dims4);

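After that, inference can be run.

The overall approach: given an engine file whose dimensions are unknown, first read the file contents and deserialize them to obtain the engine.
Then call getBindingDimensions() to inspect the engine's input and output dimensions (skip this if you already know them).
Before calling context->executeV2(), replace each dynamic dimension (value -1) with a concrete value and pass the result to context->setBindingDimensions(). Once the data has been written to the input buffer, call context->executeV2() to run inference.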
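The "replace -1 with a concrete value" step can be sketched as a small pure-Python helper. The name resolve_binding_shape is hypothetical, and the sketch only handles a dynamic batch dimension, which is the case in this post; the batch range 1 to 50 is the one passed to trtexec above.

```python
def resolve_binding_shape(engine_dims, batch_size, min_batch, max_batch):
    """Replace a dynamic (-1) batch dimension with a concrete batch size."""
    if any(d == -1 for d in engine_dims[1:]):
        raise ValueError("this sketch only handles a dynamic batch dimension")
    if not (min_batch <= batch_size <= max_batch):
        raise ValueError(
            f"batch size {batch_size} is outside the profile range "
            f"[{min_batch}, {max_batch}]")
    if engine_dims[0] == -1:
        return (batch_size, *engine_dims[1:])
    if engine_dims[0] != batch_size:
        raise ValueError("engine has a fixed batch size that does not match")
    return tuple(engine_dims)

# Dimensions as printed by getBindingDimensions() for the input binding.
shape = resolve_binding_shape((-1, 224, 224, 3), batch_size=1,
                              min_batch=1, max_batch=50)
print(shape)  # (1, 224, 224, 3)
```

The resulting tuple corresponds to the values written into nvinfer1::Dims before setBindingDimensions is called.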

Why V2, and how do V1 and V2 differ:

execute/enqueue are for implicit batch networks, and executeV2/enqueueV2 are for explicit batch networks. The V2 versions don’t take a batch_size argument since it’s taken from the explicit batch dimension of the network / or from the optimization profile if used.

In TensorRT 7, the ONNX parser requires that you create an explicit batch network, so you’ll have to use V2 methods.

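At this point we have obtained the engine object through the readTRTfile function, created the context object from the engine, and fixed the dynamic dimension of the context's input.

7. Running Inference

Write a doinference method that takes the input and output data arrays. Each batch produced by the DataLoader written earlier is a torch::Tensor. The method does the following:

Allocate GPU memory with cudaMalloc.
Copy the batch data to the GPU with cudaMemcpyAsync.
Call context.enqueueV2 to run inference.
Copy the results back to the CPU with cudaMemcpyAsync.

Those are roughly the four steps.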
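The cudaMalloc and cudaMemcpyAsync calls need byte counts, which follow from the resolved binding shapes. A minimal sketch of that arithmetic, assuming float32 bindings (the dtype printed by the freezing script) and a hypothetical helper name buffer_bytes:

```python
import math

FLOAT32_BYTES = 4  # assumed element size of the input/output bindings

def buffer_bytes(shape, bytes_per_element=FLOAT32_BYTES):
    """Return the byte size of a tensor with a fully resolved shape."""
    if any(d < 1 for d in shape):
        raise ValueError("resolve dynamic (-1) dimensions before sizing buffers")
    return math.prod(shape) * bytes_per_element

# Largest batch allowed by the optimization profile (--maxShapes).
input_bytes = buffer_bytes((50, 32, 32, 3))
output_bytes = buffer_bytes((50, 10))
print(input_bytes, output_bytes)  # 614400 2000
```

Sizing the device buffers for the maximum batch once, and copying only the bytes of the current batch, avoids reallocating for every batch.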

Program output:

(TrtInfer::testAllSample) test_dataset_size0
loading filename from:E:/cifar10fix.trt
length:47512416
load engine done
deserializing
[08/25/2022-20:37:10] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[08/25/2022-20:37:11] [W] [TRT] TensorRT was linked against cuDNN 8.1.0 but loaded cuDNN 8.0.5
[08/25/2022-20:37:11] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
deserialize done
The engine in TensorRT.cpp is not nullptr
tensorRT engine created successfully.
[08/25/2022-20:37:12] [W] [TRT] TensorRT was linked against cuDNN 8.1.0 but loaded cuDNN 8.0.5
[08/25/2022-20:37:12] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
index 4, dims: (-1,32,32,3)
index 2, dims: (-1,10)
num_running_corrects_NUMS=====2132
num_running_NUMS=====10000
 Eval Loss: 2.23657 Eval Acc: 0.2132
test_dataset_size:()
HAPYY ENDING!!!~~~~~ヾ(≧▽≦*)oヾ(≧▽≦*)oヾ(≧▽≦*)o
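The two counters in the log give the accuracy directly: 2132 / 10000 = 0.2132. The C++ evaluation code is not shown in this post, so the following is only a sketch of how such counters could be accumulated from the engine's output rows; the function names and the toy data are made up for illustration.

```python
def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def count_correct(batch_outputs, batch_labels):
    """Count rows whose highest-scoring class equals the label."""
    return sum(1 for scores, label in zip(batch_outputs, batch_labels)
               if argmax(scores) == label)

# Toy batch: three samples, three classes (not real model output).
outputs = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
labels = [1, 2, 2]

correct = count_correct(outputs, labels)
accuracy = correct / len(labels)
print(correct, accuracy)
```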

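I will post the code later… these notes were put off for a long time, and I will keep updating them.

Errors you may run into:

ONNX to TRT conversion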

[W] Dynamic dimensions required for input: input_1:0, but no shapes were provided. Automatically overriding shape to: 1x224x224x3
# This happens because the input node name in the shape arguments is wrong: it should be input_1:0, not input_1. Just match the node name shown in Netron.

[E] [TRT] input_1:0: for dimension number 1 in profile 0 does not match network definition (got min=3, opt=3, max=3), expected min=opt=max=224).
# Change the shape arguments from 1x3x224x224 to 1x224x224x3.

ERROR: builtin_op_importers.cpp:2593 In function importResize:
[8] Assertion failed: (mode != "nearest" || nearest_mode == "floor") && "This version of TensorRT only supports floor nearest_mode!"
[07/28/2022-12:54:39] [E] Failed to parse onnx file
# The Resize operator in the model (nearest mode with ceil rounding) is not supported.

[E] [TRT] C:\source\rtSafe\cuda\cudaConvolutionRunner.cpp (483) - Cudnn Error in nvinfer1::rt::cuda::CudnnConvolutionRunner::executeConv: 2 (CUDNN_STATUS_ALLOC_FAILED)
# The --workspace value is too large; reduce it.

[Could not load library cudnn_cnn_infer64_8.dll. Error code 1455.Please make sure cudnn_cnn_infer64_8.dll is in your library path!]
or [context null]
Cause: insufficient memory. Restarting Visual Studio or the computer fixes it. (Alternatively, see this Q&A.)