LPRNet is a highly efficient license-plate recognition model: it is compact, robust across a wide range of scenarios, and very well suited for deployment on embedded devices. The open-source code is available at GitHub - xuexingyu24/License_Plate_Detection_Pytorch: A two stage lightweight and high performance license plate recognition in MTCNN and LPRNet. I recently decided to convert LPRNet to TensorRT to speed up inference, which kicked off a journey through several pitfalls.
The first pitfall is PyTorch's MaxPool3d operator. The article 《LPRNet maxpool3d性能优化 - 寒武纪软件开发平台 - 开发者论坛》 explains the issue very clearly: replace the MaxPool3d with two MaxPool2d operations, transposing the channel dimension into one of the two lower (spatial) dimensions to achieve an equivalent transform. Following that guide, I wrote the conversion code below:
# Class that replaces a MaxPool3d with two MaxPool2d operations
class maxpool_3d(nn.Module):
    def __init__(self, kernel_size3d, stride3d):
        super(maxpool_3d, self).__init__()
        assert len(kernel_size3d) == 3 and len(stride3d) == 3
        # First 2D pool covers the (H, W) part of the 3D kernel/stride
        kernel_size2d1 = kernel_size3d[-2:]
        stride2d1 = stride3d[-2:]
        # Second 2D pool covers the depth (channel) part after a transpose
        kernel_size2d2 = (kernel_size3d[0], kernel_size3d[0])
        stride2d2 = (kernel_size3d[0], stride3d[0])
        self.maxpool1 = nn.MaxPool2d(kernel_size=kernel_size2d1, stride=stride2d1)
        self.maxpool2 = nn.MaxPool2d(kernel_size=kernel_size2d2, stride=stride2d2)

    def forward(self, x):
        x = self.maxpool1(x)
        x = x.transpose(1, 3)  # move channels into a poolable 2D position
        x = self.maxpool2(x)
        x = x.transpose(1, 3)  # move channels back
        return x
Testing confirmed that the output is identical to the original model's.
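As a sanity check, here is a minimal standalone comparison (the input size is a hypothetical one chosen for illustration) between nn.MaxPool3d and the two-MaxPool2d equivalent, for one of LPRNet's pooling configurations: kernel (1, 3, 3) with stride (2, 1, 2):

```python
import torch
import torch.nn.functional as F

def maxpool3d_as_2x2d(x, k3d, s3d):
    # Pool over (H, W) with the last two kernel/stride dims
    x = F.max_pool2d(x, kernel_size=k3d[-2:], stride=s3d[-2:])
    # Swap the channel dim with W, pool the channel axis, swap back
    x = x.transpose(1, 3)
    x = F.max_pool2d(x, kernel_size=(k3d[0], k3d[0]), stride=(k3d[0], s3d[0]))
    return x.transpose(1, 3)

x = torch.randn(1, 64, 20, 40)
ref = torch.nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(2, 1, 2))(x)
out = maxpool3d_as_2x2d(x, (1, 3, 3), (2, 1, 2))
print(torch.equal(ref, out))  # True
```

Because the depth kernel is 1, the second pool with a (1, 1) window and stride (1, s0) simply subsamples channels, so the match is bit-exact, not just approximate.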
The second pitfall came when converting the PyTorch model to ONNX. This step should have been simple, but I stumbled after reading the blog post 《模型部署翻车记:pytorch转onnx踩坑实录》. Its author suggests modifying the forward function by appending .item() to f_mean = torch.mean(f_pow), turning the tensor into a scalar. The statement becomes:
f_mean = torch.mean(f_pow).item()
I followed suit, then happily ran torch.onnx.export:
def convert2onnx():
    lprnet = LPRNet_MOD(class_num=len(CHARS), dropout_rate=0)
    lprnet.load_state_dict(torch.load('weights/Final_LPRNet_model.pth', map_location=lambda storage, loc: storage))
    lprnet.eval()
    dummy_input = torch.randn(1, 3, 24, 94)
    # Mark the batch dim as dynamic so trtexec can later build with min/opt/max shapes
    torch.onnx.export(lprnet, dummy_input, "model/LPRNET.onnx", verbose=True, opset_version=11,
                      input_names=["input.1"], dynamic_axes={"input.1": {0: "batch"}})
    print("Successfully converted the pytorch model to onnx. ONNX file: model/LPRNET.onnx")
    # Simplify the ONNX model
    model_onnx = onnx.load("model/LPRNET.onnx")
    model_simp, check = simplify(model_onnx)
    assert check
    onnx.save(model_simp, "model/LPRNet_Simplified.onnx")
    print("Simplified the onnx model to model/LPRNet_Simplified.onnx")
The ONNX model exported without complaint. I then wrote a quick script to check whether onnxruntime's inference results match PyTorch's:
# Load the PyTorch model
import torch
import numpy as np

lprnet_torch = LPRNet_MOD(class_num=len(CHARS), dropout_rate=0)
lprnet_torch.load_state_dict(torch.load('weights/Final_LPRNet_model.pth', map_location=lambda storage, loc: storage))
lprnet_torch.eval()

# Load the onnx model
import onnxruntime as ort
lprnet_onnx = ort.InferenceSession("model/LPRNet_Simplified.onnx")

# PyTorch inference
dummy_torch = torch.randn(1, 3, 24, 94)
with torch.no_grad():
    torch_res = lprnet_torch(dummy_torch).numpy()

# onnx inference; "127" is the exported output-layer name and may differ for your model
# (lprnet_onnx.get_outputs()[0].name avoids hardcoding it)
dummy_np = dummy_torch.numpy()
onnx_res = lprnet_onnx.run(["127"], {"input.1": dummy_np})[0]

# Compare the results
try:
    np.testing.assert_almost_equal(torch_res, onnx_res, decimal=4)
except AssertionError:
    print("The torch and onnx results are not equal at decimal=4")
else:
    print("The torch and onnx results are equal at decimal=4")
To my surprise, the two models produced different outputs, and the difference was large: even the integer parts disagreed, never mind agreement to four decimal places. My first suspicion was that splitting MaxPool3d into two MaxPool2d ops had introduced an error in the transposes during export, but testing ruled that out. After a lot of thrashing about, I still had not suspected that .item(). Eventually, inspecting the exported ONNX model closely, I noticed that in the output portion of the graph the divisors of the Div operations were all fixed constants:
These are the four Div operators in question: each one divides by a constant.
In the PyTorch source model, this should be a division by the mean of each feature map's squared values:
for i, f in enumerate(keep_features):
    if i in [0, 1]:
        f = nn.AvgPool2d(kernel_size=5, stride=5)(f)
    if i in [2]:
        f = nn.AvgPool2d(kernel_size=(4, 10), stride=(4, 2))(f)
    f_pow = torch.pow(f, 2)
    f_mean = torch.mean(f_pow).item()  # this is where it goes wrong
    f = torch.div(f, f_mean)
    global_context.append(f)
How did this mean become a constant? Was the added .item() to blame? I removed the .item(), re-exported the ONNX model, compared again, and sure enough the results matched.
In hindsight, just as that blog's author said, .item() turns a tensor into a scalar. That scalar should be recomputed from the input tensor on every inference, but torch.onnx.export computes it once from dummy_input during export and bakes it into the model as a constant weight rather than a tensor, which caused the discrepancy. In the tail of a correct ONNX model, each Div consumes a mean computed at runtime instead of a constant.
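The effect is easy to reproduce with torch.jit.trace, which torch.onnx.export also relies on under the hood. The toy module below is my own illustration (not LPRNet itself): it divides the input by mean(x²), optionally via .item(). Once .item() is applied, tracing freezes the divisor computed from the export-time input, so the traced graph no longer matches eager execution on a new input:

```python
import torch
import torch.nn as nn

class Normalize(nn.Module):
    def __init__(self, use_item):
        super().__init__()
        self.use_item = use_item

    def forward(self, x):
        f_mean = torch.mean(torch.pow(x, 2))
        if self.use_item:
            f_mean = f_mean.item()  # now a Python float: tracing records it as a constant
        return torch.div(x, f_mean)

x_export = torch.randn(1, 8)      # stands in for dummy_input at export time
x_infer = torch.randn(1, 8) * 10  # a different input at inference time

traced_item = torch.jit.trace(Normalize(True), x_export)
traced_tensor = torch.jit.trace(Normalize(False), x_export)

# With .item(): the divisor was frozen from x_export, so results diverge
print(torch.allclose(traced_item(x_infer), Normalize(True)(x_infer)))
# Without .item(): the mean is recomputed from the actual input, so results agree
print(torch.allclose(traced_tensor(x_infer), Normalize(False)(x_infer)))
```

PyTorch even emits a TracerWarning at the .item() call ("Converting a tensor to a Python number might cause the trace to be incorrect"), which in retrospect was the clue I kept ignoring.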
One minor remaining issue: the ONNX model regenerated without item() fails to load in OpenCV, with this error:
cv2.error: OpenCV(4.5.4-dev) /tmp/pip-req-build-6qnmwb6g/opencv/modules/dnn/src/onnx/onnx_importer.cpp:739: error: (-2:Unspecified error) in function 'handleNode'
> Node [Div]:(115) parse error: OpenCV(4.5.4-dev) /tmp/pip-req-build-6qnmwb6g/opencv/modules/dnn/src/onnx/onnx_importer.cpp:1697: error: (-215:Assertion failed) axis != outShape.size() in function 'parseMul'
Not a big deal, since ONNX here is only an intermediate step on the way to TensorRT. The TensorRT conversion itself is simple: just run trtexec to produce the engine:
trtexec --onnx=LPRNet_Simplified.onnx --explicitBatch --minShapes=input.1:1x3x24x94 --optShapes=input.1:2x3x24x94 --maxShapes=input.1:4x3x24x94 --saveEngine=LPRNet_Dx3x24x94Max4.engine
While working around the pitfall above, and since LPRNet is not a very complex network, I also tried to bypass ONNX entirely and load the PyTorch weights into a TensorRT engine built layer by layer (following the construction pattern used by the tensorrtx author). Here is that code:
#### The following code builds the LPRNet TensorRT engine directly
import math
import numpy as np
import torch
import tensorrt as trt

trt_logger = trt.Logger()

def createTrtEngine(maxBatchsize, dt, class_num):
    def loadWeights(torch_model):
        weights_dict = torch_model.state_dict()
        weights = {}
        for k, v in weights_dict.items():
            if "num_batches_tracked" in k:
                continue
            weights[k] = trt.Weights(v.numpy())
        return weights

    def addMaxPool3d(network, input, kernel_size3d, stride_size3d):
        # Same trick as earlier: one MaxPool3d becomes two MaxPool2d ops around a transpose
        kernel_size2d1 = kernel_size3d[-2:]
        stride_size2d1 = stride_size3d[-2:]
        kernel_size2d2 = (kernel_size3d[0], kernel_size3d[0])
        stride_size2d2 = (kernel_size3d[0], stride_size3d[0])
        mp2d_1 = network.add_pooling(input, trt.PoolingType.MAX, kernel_size2d1)
        mp2d_1.stride = stride_size2d1
        shuffle_1 = network.add_shuffle(mp2d_1.get_output(0))
        shuffle_1.first_transpose = (0, 3, 2, 1)  # full permutation swapping dims 1 and 3
        mp2d_2 = network.add_pooling(shuffle_1.get_output(0), trt.PoolingType.MAX, kernel_size2d2)
        mp2d_2.stride = stride_size2d2
        shuffle_2 = network.add_shuffle(mp2d_2.get_output(0))
        shuffle_2.first_transpose = (0, 3, 2, 1)  # swap back
        return shuffle_2

    def addBatchNorm2d(network, input, weights, lname, eps=1e-5):  # PyTorch BatchNorm2d default eps
        # Fold BN into a per-channel scale layer: y = scval * x + shval
        gamma = weights[lname + ".weight"].numpy()
        beta = weights[lname + ".bias"].numpy()
        mean = weights[lname + ".running_mean"].numpy()
        var = weights[lname + ".running_var"].numpy()
        scval = gamma / np.sqrt(var + eps)
        shval = beta - mean * gamma / np.sqrt(var + eps)
        pval = np.ones_like(var)
        scale1 = network.add_scale(input, trt.ScaleMode.CHANNEL, trt.Weights(shval), trt.Weights(scval), trt.Weights(pval))
        return scale1

    def small_basic_block_trt(network, input, weights, lname, in_ch, out_ch):
        sbb_conv2d_1 = network.add_convolution_nd(input, num_output_maps=out_ch // 4, kernel_shape=(1, 1), kernel=weights[lname + ".0.weight"], bias=weights[lname + ".0.bias"])
        sbb_conv2d_1.stride = (1, 1)
        sbb_conv2d_1.padding = (0, 0)
        sbb_actvation_1 = network.add_activation(sbb_conv2d_1.get_output(0), trt.ActivationType.RELU)
        sbb_conv2d_2 = network.add_convolution(sbb_actvation_1.get_output(0), num_output_maps=out_ch // 4, kernel_shape=(3, 1), kernel=weights[lname + ".2.weight"], bias=weights[lname + ".2.bias"])
        sbb_conv2d_2.stride = (1, 1)
        sbb_conv2d_2.padding = (1, 0)
        sbb_actvation_2 = network.add_activation(sbb_conv2d_2.get_output(0), trt.ActivationType.RELU)
        sbb_conv2d_3 = network.add_convolution(sbb_actvation_2.get_output(0), num_output_maps=out_ch // 4, kernel_shape=(1, 3), kernel=weights[lname + ".4.weight"], bias=weights[lname + ".4.bias"])
        sbb_conv2d_3.stride = (1, 1)
        sbb_conv2d_3.padding = (0, 1)
        sbb_actvation_3 = network.add_activation(sbb_conv2d_3.get_output(0), trt.ActivationType.RELU)
        sbb_conv2d_4 = network.add_convolution(sbb_actvation_3.get_output(0), num_output_maps=out_ch, kernel_shape=(1, 1), kernel=weights[lname + ".6.weight"], bias=weights[lname + ".6.bias"])
        sbb_conv2d_4.stride = (1, 1)
        sbb_conv2d_4.padding = (0, 0)
        return sbb_conv2d_4

    # Build the TensorRT engine directly from LPRNet's network structure
    lprnet = LPRNet(class_num=len(CHARS), dropout_rate=0)
    lprnet.load_state_dict(torch.load('weights/Final_LPRNet_model.pth', map_location=lambda storage, loc: storage))
    weights = loadWeights(lprnet)
    with trt.Builder(trt_logger) as builder:
        network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
        input_tensor = network.add_input(name="input", dtype=dt, shape=(-1, 3, 24, 94))
        conv2d_1 = network.add_convolution_nd(input=input_tensor, num_output_maps=64, kernel_shape=(3, 3), kernel=weights["backbone.0.weight"], bias=weights["backbone.0.bias"])  # 0
        bn_1 = addBatchNorm2d(network, conv2d_1.get_output(0), weights, "backbone.1")  # 1
        act_1 = network.add_activation(bn_1.get_output(0), trt.ActivationType.RELU)  # 2
        mp3d_1 = network.add_pooling(act_1.get_output(0), trt.PoolingType.MAX, (3, 3))  # 3
        mp3d_1.stride = (1, 1)  # the torch layer uses stride (1, 1, 1)
        sb_1 = small_basic_block_trt(network, mp3d_1.get_output(0), weights, "backbone.4.block", 64, 128)  # 4
        bn_2 = addBatchNorm2d(network, sb_1.get_output(0), weights, "backbone.5")  # 5
        act_2 = network.add_activation(bn_2.get_output(0), trt.ActivationType.RELU)  # 6
        mp3d_2 = network.add_pooling_nd(act_2.get_output(0), trt.PoolingType.MAX, (1, 3, 3))  # 7
        mp3d_2.stride_nd = (2, 1, 2)
        sb_2 = small_basic_block_trt(network, mp3d_2.get_output(0), weights, "backbone.8.block", 64, 256)  # 8
        bn_3 = addBatchNorm2d(network, sb_2.get_output(0), weights, "backbone.9")  # 9
        act_3 = network.add_activation(bn_3.get_output(0), trt.ActivationType.RELU)  # 10
        sb_3 = small_basic_block_trt(network, act_3.get_output(0), weights, "backbone.11.block", 256, 256)  # 11
        bn_4 = addBatchNorm2d(network, sb_3.get_output(0), weights, "backbone.12")  # 12
        act_4 = network.add_activation(bn_4.get_output(0), trt.ActivationType.RELU)  # 13
        mp3d_3 = network.add_pooling_nd(act_4.get_output(0), trt.PoolingType.MAX, (1, 3, 3))  # 14
        mp3d_3.stride_nd = (4, 1, 2)
        # Layer 15 of the torch model is a dropout; with dropout_rate=0 it can be skipped
        conv2d_2 = network.add_convolution_nd(input=mp3d_3.get_output(0), num_output_maps=256, kernel_shape=(1, 4), kernel=weights["backbone.16.weight"], bias=weights["backbone.16.bias"])  # 16
        bn_5 = addBatchNorm2d(network, conv2d_2.get_output(0), weights, "backbone.17")  # 17
        act_5 = network.add_activation(bn_5.get_output(0), trt.ActivationType.RELU)  # 18
        # Layer 19 of the torch model is a dropout; with dropout_rate=0 it can be skipped
        conv2d_3 = network.add_convolution_nd(input=act_5.get_output(0), num_output_maps=class_num, kernel_shape=(13, 1), kernel=weights["backbone.20.weight"], bias=weights["backbone.20.bias"])  # 20
        bn_6 = addBatchNorm2d(network, conv2d_3.get_output(0), weights, "backbone.21")  # 21
        act_6 = network.add_activation(bn_6.get_output(0), trt.ActivationType.RELU)  # 22
        # Global context: average-pool the kept feature maps, then divide each by the mean of its squares
        avgpool_1 = network.add_pooling(act_1.get_output(0), trt.PoolingType.AVERAGE, (5, 5))
        avgpool_1.stride = (5, 5)
        avgpool_2 = network.add_pooling(act_2.get_output(0), trt.PoolingType.AVERAGE, (5, 5))
        avgpool_2.stride = (5, 5)
        avgpool_3 = network.add_pooling(act_4.get_output(0), trt.PoolingType.AVERAGE, (4, 10))
        avgpool_3.stride = (4, 2)
        expNum = network.add_constant((1,), trt.Weights(np.array([2], dtype=np.float32)))
        power_1 = network.add_elementwise(avgpool_1.get_output(0), expNum.get_output(0), trt.ElementWiseOperation.POW)
        # Average over all axes; each 1 bit in the mask selects an axis to reduce (strictly the mask here should be 127).
        # keep_dims=True keeps the rank at 4 so the following elementwise DIV can broadcast.
        mean_1 = network.add_reduce(power_1.get_output(0), trt.ReduceOperation.AVG, 511, True)
        div_1 = network.add_elementwise(avgpool_1.get_output(0), mean_1.get_output(0), trt.ElementWiseOperation.DIV)
        power_2 = network.add_elementwise(avgpool_2.get_output(0), expNum.get_output(0), trt.ElementWiseOperation.POW)
        mean_2 = network.add_reduce(power_2.get_output(0), trt.ReduceOperation.AVG, 511, True)  # strictly this should be 255
        div_2 = network.add_elementwise(avgpool_2.get_output(0), mean_2.get_output(0), trt.ElementWiseOperation.DIV)
        power_3 = network.add_elementwise(avgpool_3.get_output(0), expNum.get_output(0), trt.ElementWiseOperation.POW)
        mean_3 = network.add_reduce(power_3.get_output(0), trt.ReduceOperation.AVG, 511, True)  # strictly this should be 511
        div_3 = network.add_elementwise(avgpool_3.get_output(0), mean_3.get_output(0), trt.ElementWiseOperation.DIV)
        power_4 = network.add_elementwise(act_6.get_output(0), expNum.get_output(0), trt.ElementWiseOperation.POW)
        mean_4 = network.add_reduce(power_4.get_output(0), trt.ReduceOperation.AVG, 511, True)  # strictly this should be 511
        div_4 = network.add_elementwise(act_6.get_output(0), mean_4.get_output(0), trt.ElementWiseOperation.DIV)
        cat_1 = network.add_concatenation([div_1.get_output(0), div_2.get_output(0), div_3.get_output(0), div_4.get_output(0)])
        conv_container = network.add_convolution_nd(cat_1.get_output(0), num_output_maps=class_num, kernel_shape=(1, 1), kernel=weights["container.0.weight"], bias=weights["container.0.bias"])
        logits = network.add_reduce(conv_container.get_output(0), trt.ReduceOperation.AVG, 1 << 2, False)  # matches torch.mean(x, dim=2)
        logits.get_output(0).name = "output"
        network.mark_output(logits.get_output(0))
        builder.max_batch_size = maxBatchsize
        conf = builder.create_builder_config()
        conf.max_workspace_size = 8 * (1 << 20)
        optProfile = builder.create_optimization_profile()
        optBatchSize = math.ceil(maxBatchsize / 2)
        optProfile.set_shape("input", (1, 3, 24, 94), (optBatchSize, 3, 24, 94), (maxBatchsize, 3, 24, 94))
        conf.add_optimization_profile(optProfile)
        engine = builder.build_engine(network, conf)
        if engine is None:
            print("Failed to build the engine.")
            return None
        print("Successfully built the engine.")
        return engine

if __name__ == "__main__":
    createTrtEngine(4, trt.DataType.FLOAT, len(CHARS))
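The BatchNorm folding inside addBatchNorm2d (collapse BN into a per-channel scale, y = scval * x + shval) can be verified against PyTorch without any TensorRT at all. The sketch below, using arbitrary made-up shapes and statistics, checks that the numpy folding matches an nn.BatchNorm2d in eval mode:

```python
import numpy as np
import torch
import torch.nn as nn

# An eval-mode BN with non-trivial learned parameters and running statistics
bn = nn.BatchNorm2d(3).eval()
with torch.no_grad():
    bn.weight.copy_(torch.rand(3) + 0.5)    # gamma
    bn.bias.copy_(torch.randn(3))           # beta
    bn.running_mean.copy_(torch.randn(3))
    bn.running_var.copy_(torch.rand(3) + 0.5)

x = torch.randn(2, 3, 4, 4)
ref = bn(x).detach().numpy()

gamma = bn.weight.detach().numpy()
beta = bn.bias.detach().numpy()
mean = bn.running_mean.numpy()
var = bn.running_var.numpy()

# Same folding as addBatchNorm2d: per-channel scale and shift
scval = gamma / np.sqrt(var + bn.eps)
shval = beta - mean * gamma / np.sqrt(var + bn.eps)
out = x.numpy() * scval.reshape(1, 3, 1, 1) + shval.reshape(1, 3, 1, 1)

print(np.allclose(out, ref, atol=1e-5))
```

Note that the folding must use the same eps as the torch layer (1e-5 by default for BatchNorm2d); a mismatched eps shows up as a small systematic error across every channel.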