Preface
During inference, TensorRT-based applications can run up to 40x faster than CPU-only platforms. With TensorRT you can optimize neural network models trained in all major frameworks, calibrate for lower precision while preserving accuracy, and finally deploy the model to hyperscale data centers, embedded systems, or automotive product platforms.
TensorRT is built on CUDA, NVIDIA's parallel programming model, and lets you leverage the libraries, development tools, and technologies in CUDA-X to optimize inference for every deep learning framework, across AI, autonomous machines, high-performance computing, and graphics.
TensorRT provides INT8 and FP16 optimizations for production deployment of many deep learning inference applications, such as video streaming, speech recognition, recommendation, and natural language processing. Reduced inference precision significantly cuts application latency, which is exactly what many real-time services and autonomous or embedded applications require.
The Problem
Most tutorials online convert a .pb model to ONNX or UFF for forward inference, but some models run into all kinds of problems during conversion, such as unsupported ops, which is frustrating. Building the network directly with the TensorRT API and running inference that way avoids many of these problems.
Implementation
First, many thanks to this author: https://github.com/wang-xinyu/tensorrtx . Seriously impressive work, with many models implemented (googlenet, resnet, shufflenetv2, yolov3, yolov4, yolov5, and more). Interested readers should go take a look. The code below is adapted from that repository. Those models are mostly PyTorch-based while mine is TF-based, but the idea is largely the same, so I dove right in.
Face landmark model: https://github.com/610265158/Peppa_Pig_Face_Engine
1. First, inspect the .pb graph structure, roughly as shown in the figure below.
2. Export the .pb weights, adapted from the repository's genwts.py.
** One thing to watch out for here: TF stores weights as NHWC while TRT expects NCHW, so we must transpose them; otherwise the inference results downstream will all be wrong.
import struct

import tensorflow as tf
import torch
from tensorflow.python.framework import tensor_util
from tensorflow.python.platform import gfile

# path to your .pb file
GRAPH_PB_PATH = './model/landmark.pb'
GRAPH_WTS_PATH = './model/landmark.wts'

with tf.Session() as sess:
    print("load graph")
    with gfile.FastGFile(GRAPH_PB_PATH, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        sess.graph.as_default()
        tf.import_graph_def(graph_def, name='')
        graph_nodes = [n for n in graph_def.node]
        wts = [n for n in graph_nodes if n.op == 'Const']

weights = {}
for n in wts:
    v = n.attr['value']
    print(n.name)
    ar = tensor_util.MakeNdarray(v.tensor)
    weights[n.name] = torch.Tensor(ar)

f = open(GRAPH_WTS_PATH, 'w')
f.write("{}\n".format(len(weights.keys())))
for k, v in weights.items():
    print('key: ', k)
    print('value: ', v.shape)
    if v.ndim == 4:  # tf: NHWC -> trt: NCHW
        v = v.transpose(3, 0).transpose(2, 1).transpose(3, 2)
    vr = v.reshape(-1).cpu().numpy()
    f.write("{} {}".format(k, len(vr)))
    for vv in vr:
        f.write(" ")
        f.write(struct.pack(">f", float(vv)).hex())
    f.write("\n")
f.close()
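To sanity-check the chained transposes above, here is a small numpy sketch (shapes taken from the 3 × 3 × 3 × 24 conv mentioned below; the variable names are just illustrative) showing that the three pairwise swaps are equivalent to a single HWIO → OIHW permutation, which is the layout TensorRT expects:

```python
import numpy as np

# a dummy TF conv kernel in HWIO layout: 3x3 kernel, 3 input channels, 24 filters
w_hwio = np.arange(3 * 3 * 3 * 24, dtype=np.float32).reshape(3, 3, 3, 24)

# the chained pairwise swaps used in the export script above
# (numpy's swapaxes behaves like torch's two-argument transpose)
w_chained = w_hwio.swapaxes(3, 0).swapaxes(2, 1).swapaxes(3, 2)

# ...are the same as one permutation to OIHW
w_oihw = w_hwio.transpose(3, 2, 0, 1)

assert w_chained.shape == (24, 3, 3, 3)
assert np.array_equal(w_chained, w_oihw)
```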
3. Check the export against the model to make sure the parameter counts match. For example, this conv: 3 * 3 * 3 * 24 = 648.
4. Build the network with the TensorRT API. You can refer to the models implemented in the repository above; if one matches, reuse it directly.
Another recommendation: https://github.com/wdhao/tensorRT_Wheels collects reusable helpers built on the commonly used TRT APIs.
Below are the APIs used in this model.
BN
IScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {
    float *gamma = (float*)weightMap[lname + "/BatchNorm/gamma"].values;
    float *beta = (float*)weightMap[lname + "/BatchNorm/beta"].values;
    float *mean = (float*)weightMap[lname + "/BatchNorm/moving_mean"].values;
    float *var = (float*)weightMap[lname + "/BatchNorm/moving_variance"].values;
    int len = weightMap[lname + "/BatchNorm/moving_variance"].count;

    // fold BN into an IScaleLayer: out = (in * scale + shift) ^ power
    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));
    for (int i = 0; i < len; i++) {
        scval[i] = gamma[i] / sqrt(var[i] + eps);
    }
    Weights scale{ DataType::kFLOAT, scval, len };

    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));
    for (int i = 0; i < len; i++) {
        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);
    }
    Weights shift{ DataType::kFLOAT, shval, len };

    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));
    for (int i = 0; i < len; i++) {
        pval[i] = 1.0;
    }
    Weights power{ DataType::kFLOAT, pval, len };

    // keep the buffers alive in the weight map for the duration of the build
    weightMap[lname + ".scale"] = scale;
    weightMap[lname + ".shift"] = shift;
    weightMap[lname + ".power"] = power;

    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);
    assert(scale_1);
    return scale_1;
}
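The folding above is the standard inference-time identity: gamma * (x - mean) / sqrt(var + eps) + beta collapses to scale * x + shift with scale = gamma / sqrt(var + eps) and shift = beta - mean * scale, which is exactly what IScaleLayer computes per channel. A quick numpy sketch (random values, purely illustrative) to convince yourself the two forms agree:

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
gamma, beta = rng.standard_normal(8), rng.standard_normal(8)
mean, var = rng.standard_normal(8), rng.random(8) + 0.1

# direct batch-norm (inference mode, per channel)
bn = gamma * (x - mean) / np.sqrt(var + eps) + beta

# folded into IScaleLayer form: out = (x * scale + shift) ** power
scale = gamma / np.sqrt(var + eps)
shift = beta - mean * scale
power = np.ones(8)
folded = (x * scale + shift) ** power

assert np.allclose(bn, folded)
```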
conv+bn+relu
ILayer* convBnReLU(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, std::string weightsName,
                   int nbOutputMaps, int kernelSize, int strideSize, int groupSize) {
    Dims d = input.getDimensions();
    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };  // no conv bias: BN supplies the shift
    IConvolutionLayer* conv = network->addConvolutionNd(input, nbOutputMaps, DimsHW{ kernelSize, kernelSize }, weightMap[lname + weightsName], emptywts);
    assert(conv);
    conv->setStrideNd(DimsHW{ strideSize, strideSize });

    // reproduce TF "SAME" padding: asymmetric, larger on the bottom/right
    int padSize = paddingSize(d.d[1], kernelSize, strideSize);
    int postPadding = ceil(padSize / 2.0);
    int prePadding = padSize - postPadding;
    if (prePadding > 0)
        conv->setPrePadding(DimsHW{ prePadding, prePadding });
    if (postPadding > 0)
        conv->setPostPadding(DimsHW{ postPadding, postPadding });

    IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname, 1e-5);
    IActivationLayer* relu = network->addActivation(*bn->getOutput(0), ActivationType::kRELU);
    assert(relu);
    return relu;
}
depthwiseConv2d + conv + bn + relu
ILayer* depthwiseConvolutionNd(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname,
                               int nbOutputMaps, int kernelSize, int strideSize, bool isPointwise) {
    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };
    Dims d = input.getDimensions();
    int size = d.d[0];  // channel count (CHW layout)
    IConvolutionLayer* conv = network->addConvolutionNd(input, size, DimsHW{ kernelSize, kernelSize }, weightMap[lname + "/depthwise_weights"], emptywts);
    conv->setStrideNd(DimsHW{ strideSize, strideSize });
    int padSize = paddingSize(d.d[1], kernelSize, strideSize);
    int postPadding = ceil(padSize / 2.0);
    int prePadding = padSize - postPadding;
    if (prePadding > 0)
        conv->setPrePadding(DimsHW{ prePadding, prePadding });
    if (postPadding > 0)
        conv->setPostPadding(DimsHW{ postPadding, postPadding });
    conv->setNbGroups(size);  // one group per channel: each channel is convolved separately
    ILayer* layer2 = convBnReLU(network, weightMap, *conv->getOutput(0), lname, "/pointwise_weights", nbOutputMaps, 1, 1, 1);
    return layer2;
}
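Setting the group count equal to the channel count is what makes the convolution depthwise: each channel sees only its own kernel. A minimal numpy sketch of that idea (valid padding, stride 1; a naive loop, not how TRT computes it):

```python
import numpy as np

def depthwise_conv(x, k):
    """x: (C, H, W), k: (C, kh, kw) -- each channel uses its own kernel."""
    C, H, W = x.shape
    _, kh, kw = k.shape
    out = np.zeros((C, H - kh + 1, W - kw + 1))
    for c in range(C):                      # one "group" per channel
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = np.sum(x[c, i:i+kh, j:j+kw] * k[c])
    return out

x = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
k = np.ones((2, 2, 2))                      # 2x2 all-ones kernel per channel
y = depthwise_conv(x, k)
# each output is the sum of a 2x2 window taken within the same channel
assert y.shape == (2, 2, 2)
assert y[0, 0, 0] == x[0, 0, 0] + x[0, 0, 1] + x[0, 1, 0] + x[0, 1, 1]
```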
concat + shuffle + split
std::vector<ILayer*> concat_shuffle_split(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ILayer* layer1, ILayer* layer2) {
    // merge the two branches
    std::vector<ILayer*> vec;

    // concatenate along the channel axis
    ITensor* concatTensors[] = { layer1->getOutput(0), layer2->getOutput(0) };
    IConcatenationLayer* concatLayer = network->addConcatenation(concatTensors, 2);
    assert(concatLayer);

    IShuffleLayer *shuffleLayer = network->addShuffle(*concatLayer->getOutput(0));
    assert(shuffleLayer);
    // data layout: tf is HWC, trt is CHW
    shuffleLayer->setFirstTranspose(Permutation{ 1, 2, 0 });  // 116 * 20 * 20 -> 20 * 20 * 116
    // tf.stack(xx, axis=3) adds a dimension
    Dims shuffleLayerDims = shuffleLayer->getOutput(0)->getDimensions();
    Dims shuffleLayerReshapeDims = Dims4(shuffleLayerDims.d[0], shuffleLayerDims.d[1], 2, shuffleLayerDims.d[2] / 2);
    shuffleLayer->setReshapeDimensions(shuffleLayerReshapeDims);  // 20 * 20 * 2 * 58
    shuffleLayer->setSecondTranspose(Permutation{ 0, 1, 3, 2 });  // 20 * 20 * 58 * 2
    Dims shuffleLayerTransposeDims = shuffleLayer->getOutput(0)->getDimensions();

    IShuffleLayer *shuffleLayer2 = network->addShuffle(*shuffleLayer->getOutput(0));  // 20 * 20 * 58 * 2
    shuffleLayer2->setReshapeDimensions(Dims3(shuffleLayerTransposeDims.d[0], shuffleLayerTransposeDims.d[1], shuffleLayerTransposeDims.d[2] * shuffleLayerTransposeDims.d[3]));  // 20 * 20 * 116
    shuffleLayer2->setSecondTranspose(Permutation{ 2, 0, 1 });  // back to 116 * 20 * 20

    // split into two tensors along the channel axis
    Dims mergeSpliteDims = shuffleLayer2->getOutput(0)->getDimensions();
    ISliceLayer *mergeS1 = network->addSlice(*shuffleLayer2->getOutput(0), Dims3{ 0, 0, 0 }, Dims3{ mergeSpliteDims.d[0] / 2, mergeSpliteDims.d[1], mergeSpliteDims.d[2] }, Dims3{ 1, 1, 1 });
    vec.push_back(mergeS1);
    ISliceLayer *mergeS2 = network->addSlice(*shuffleLayer2->getOutput(0), Dims3{ mergeSpliteDims.d[0] / 2, 0, 0 }, Dims3{ mergeSpliteDims.d[0] / 2, mergeSpliteDims.d[1], mergeSpliteDims.d[2] }, Dims3{ 1, 1, 1 });
    vec.push_back(mergeS2);
    return vec;
}
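The two shuffle layers implement ShuffleNetV2's channel shuffle (the tf.stack/transpose trick) as a reshape, transpose, reshape. Here is the same operation on a CHW tensor in numpy, using the 116 × 20 × 20 shape from the comments above; the function name is just illustrative:

```python
import numpy as np

def channel_shuffle(x, groups=2):
    """x: (C, H, W) -> shuffled (C, H, W), interleaving channels across groups."""
    C, H, W = x.shape
    x = x.reshape(groups, C // groups, H, W)  # split the channels into groups
    x = x.transpose(1, 0, 2, 3)               # interleave the groups
    return x.reshape(C, H, W)

x = np.arange(116 * 20 * 20).reshape(116, 20, 20)
y = channel_shuffle(x)
# channel 0 stays put; channel 1 now comes from the second half (old channel 58)
assert np.array_equal(y[0], x[0])
assert np.array_equal(y[1], x[58])
assert np.array_equal(y[2], x[1])
```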
Sub + Mul
IConstantLayer *constMean = network->addConstant(DimsCHW{ 3, 1, 1 }, weightMap["tower_0/image_preprocess/Const"]);
IConstantLayer *constStd = network->addConstant(DimsCHW{ 3, 1, 1 }, weightMap["tower_0/image_preprocess/Const_1"]);
auto meanSub = network->addElementWise(*data, *constMean->getOutput(0), ElementWiseOperation::kSUB);  // subtract per-channel mean
auto meanMul = network->addElementWise(*meanSub->getOutput(0), *constStd->getOutput(0), ElementWiseOperation::kPROD);  // multiply by per-channel scale
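The kSUB/kPROD pair implements (x - mean) * scale preprocessing, with the 3 × 1 × 1 constants broadcast over the spatial dimensions. In numpy terms (the mean/scale values here are illustrative placeholders, not the model's actual constants):

```python
import numpy as np

x = np.random.default_rng(0).random((3, 160, 160))  # CHW input image in [0, 1)
mean = np.full((3, 1, 1), 0.5)                      # illustrative per-channel mean
scale = np.full((3, 1, 1), 2.0)                     # illustrative per-channel scale

# kSUB then kPROD: the (3,1,1) constants broadcast over (3,160,160)
y = (x - mean) * scale

assert y.shape == (3, 160, 160)
```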
Computing the padding size (careful here: with padding_mode = SAME, TF pads asymmetrically).
// TF "SAME" padding is asymmetric: smaller on the left/top, larger on the right/bottom
int paddingSize(int inSize, int kernelSize, int strideSize) {
    int mod = inSize % strideSize;
    if (mod == 0)
        return std::max(kernelSize - strideSize, 0);  // max(kernelSize - strideSize, 0)
    return std::max(kernelSize - mod, 0);             // max(kernelSize - (inSize mod strideSize), 0)
}
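The same formula in a quick Python sketch, including the ceil-based split used in the conv helpers above, where the post (bottom/right) side gets the extra pixel and the pre (top/left) side gets the remainder:

```python
import math

def padding_size(in_size, kernel_size, stride):
    """Total TF 'SAME' padding along one spatial dimension."""
    if in_size % stride == 0:
        return max(kernel_size - stride, 0)
    return max(kernel_size - in_size % stride, 0)

def split_padding(total):
    post = math.ceil(total / 2)  # bottom/right gets the extra pixel
    pre = total - post           # top/left gets the rest
    return pre, post

# 160x160 input, 3x3 kernel, stride 2: total padding 1, split as (pre=0, post=1)
assert padding_size(160, 3, 2) == 1
assert split_padding(padding_size(160, 3, 2)) == (0, 1)
# an odd input size: 7, 3x3 kernel, stride 2: total 2, split symmetrically (1, 1)
assert padding_size(7, 3, 2) == 2
assert split_padding(2) == (1, 1)
```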
pooling
ILayer* pooling(INetworkDefinition* network, ITensor& input) {
    Dims dims = input.getDimensions();
    // global average pooling: the window covers the full H x W
    IPoolingLayer* pool = network->addPoolingNd(input, PoolingType::kAVERAGE, DimsHW{ dims.d[1], dims.d[2] });
    return pool;
}
Problems encountered
- The data layouts of TF and TRT differ, so the weights must be transposed when exporting the weight file; otherwise every inference result downstream will be wrong.
- The input data needs transposing too: we would normally feed 160 * 160 * 3, but here it must become 3 * 160 * 160.
- In TF, "SAME" padding is asymmetric (smaller on the left/top, larger on the right/bottom), while PyTorch pads symmetrically, so pay close attention to how the padding is applied, or the inference results will also be wrong:
- conv->setPrePadding(DimsHW{ prePadding, prePadding }); pads the top/left
- conv->setPostPadding(DimsHW{ postPadding, postPadding }); pads the bottom/right
- To be continued...
END:
This post mainly records the various problems I ran into during my own implementation. If anything is incorrect, corrections are welcome. Thanks.