Deploying PyTorch/MXNet models with TensorRT

This post records the full workflow of deploying PyTorch/MXNet models with TensorRT, along with the pitfalls hit along the way.
TensorRT supports accelerated inference for TensorFlow UFF models, ONNX models, and custom networks. For PyTorch there is the third-party torch2trt project, but it requires the model to be defined and loaded up front, so the model cannot be decoupled from TensorRT:

import torch
from torch2trt import torch2trt
from torchvision.models.alexnet import alexnet

# create some regular pytorch model...
model = alexnet(pretrained=True).eval().cuda()

# create example data
x = torch.ones((1, 3, 224, 224)).cuda()

# convert to TensorRT feeding sample data as input
model_trt = torch2trt(model, [x])

Since deployment would still depend on a PyTorch environment, I did not try this approach.

MXNet officially provides an interface that converts directly to TensorRT:

import mxnet as mx

# Load the checkpoint and merge the parameters, then bind a TensorRT executor
# (the checkpoint prefix/epoch and batch_shape depend on your model)
sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
arg_params.update(aux_params)
all_params = dict([(k, v.as_in_context(mx.gpu(0))) for k, v in arg_params.items()])
executor = mx.contrib.tensorrt.tensorrt_bind(sym, ctx=mx.gpu(0), all_params=all_params,
                                             data=batch_shape, grad_req='null', force_rebind=True)
y_gen = executor.forward(is_train=False, data=input)
y_gen[0].wait_to_read()

I did not try this either; the main goal was to keep deployment decoupled so that only the TensorRT environment is needed, without installing the whole deep-learning stack.

Both PyTorch and MXNet have official interfaces and documentation for converting models to ONNX, and they are straightforward to use.

# MXNet to ONNX
import numpy as np
from mxnet.contrib import onnx as onnx_mxnet

sym = './resnet-50-symbol.json'
params = './resnet-50-0000.params'
input_shape = (1, 3, 224, 224)
onnx_file = './resnet-50.onnx'
converted_model_path = onnx_mxnet.export_model(sym, params, [input_shape], np.float32, onnx_file)

# PyTorch to ONNX
import torch
import torchvision

dummy_input = torch.randn(10, 3, 224, 224, device='cuda')
model = torchvision.models.alexnet(pretrained=True).cuda()

# Providing input and output names sets the display names for values
# within the model's graph. Setting these does not change the semantics
# of the graph; it is only for readability.
#
# The inputs to the network consist of the flat list of inputs (i.e.
# the values you would pass to the forward() method) followed by the
# flat list of parameters. You can partially specify names, i.e. provide
# a list here shorter than the number of inputs to the model, and we will
# only set that subset of names, starting from the beginning.
input_names = [ "actual_input_1" ] + [ "learned_%d" % i for i in range(16) ]
output_names = [ "output1" ]

torch.onnx.export(model, dummy_input, "alexnet.onnx", verbose=True, input_names=input_names, output_names=output_names)

ONNX conversion issue log

  1. Custom layer SegmentConsensus is not recognized
    A custom layer needs custom export code for the PyTorch-to-ONNX step and then a custom plugin again for the ONNX-to-TensorRT step. For such layers it is better to understand the underlying computation and re-implement it with basic operations. This layer is simple, and I re-implemented it with mean and index_select, as sketched below.
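A minimal sketch of such a re-implementation (an assumption based on TSN/TSM-style consensus, where the layer either averages over the segment dimension or selects a single segment):

import torch

def segment_consensus(x, consensus_type='avg'):
    # x: (batch, num_segments, num_classes)
    if consensus_type == 'avg':
        # average over the segment dimension, keeping it for shape stability
        return x.mean(dim=1, keepdim=True)
    # 'identity' consensus: pick one segment with index_select
    return x.index_select(1, torch.tensor([0]))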

  2. TracerWarning: There are 2 live references to the data region being modified when tracing in-place operator copy_ (possibly due to an assignment). This might cause the trace to be incorrect, because all other views that also reference this data will not reflect this change in the trace! On the other hand, if all other views use the same memory chunk, but are disjoint (e.g. are outputs of torch.split), this might still be safe
    This warning means the data being modified has two live references, so the in-place copy cannot be traced correctly. The offending code:

out[:, :-1, :fold] = x[:, 1:, :fold] # shift left
out[:, 1:, fold: 2 * fold] = x[:, :-1, fold: 2 * fold]  # shift right
out[:, :, 2 * fold:] = x[:, :, 2 * fold:] # not shift

From what I found, the left-hand-side slice assignment holds one reference and the right-hand-side view holds another, and the two references cannot be traced. The fix is to rewrite the in-place slice assignments out-of-place, concatenating the shifted slices with zero padding:

left_side = torch.cat((x[:, 1:, :fold], torch.zeros(1, 1, fold, h, w)), dim=1)
middle_side = torch.cat((torch.zeros(1, 1, fold, h, w), x[:, :n_segment - 1, fold: 2 * fold]), dim=1)
out = torch.cat((left_side, middle_side, x[:, :, 2 * fold:]), dim=2)
  3. Converting only part of a model to ONNX
    The saved model may be a pretrained one of which only some layers are actually used. For MXNet you can specify the output layer directly on the symbol and then convert (get_output_sym is a small helper; a possible implementation follows the snippet):
sym, arg_params, aux_params = mx.model.load_checkpoint(pretrained, epoch)
sym = get_output_sym(sym, 'fc1_output')
arg_params.update(aux_params)
onnx_mx.export_model(sym, arg_params, input_shape, onnx_file_path=onnx_file_path, verbose=True)
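get_output_sym above is the author's helper; a possible implementation (an assumption) simply truncates the symbol graph at the named layer via get_internals:

def get_output_sym(sym, output_name):
    # every intermediate output is exposed by get_internals(); pick one as the new head
    return sym.get_internals()[output_name]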

For PyTorch you can subclass torch.nn.Module, pass the model in, and customize the forward pass yourself:

class ExtractFeature(torch.nn.Module):
    def __init__(self, cnn, frames=16):
        super().__init__()
        self.model = cnn
        self.num_segments = frames
        self.pool = torch.nn.MaxPool2d(3, 2)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def forward(self, data):
        n = self.model
        with torch.no_grad():
            # fold the segment dimension into the batch: (B, T, 3, H, W) -> (B*T, 3, H, W)
            x = data.view((-1, 3) + data.size()[-2:]).to(self.device)
            # run the ResNet-style backbone stage by stage
            x = n.conv1(x)
            x = n.bn1(x)
            x = n.relu(x)
            x = n.maxpool(x)
            x = n.layer1(x)
            x = n.layer2(x)
            x = n.layer3(x)
            x = n.layer4(x)
            x = self.pool(x)
            x = x.flatten(start_dim=1)
        # restore the (batch, num_segments, features) layout
        return x.view((-1, self.num_segments) + x.size()[1:])
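A usage sketch for exporting this wrapper (the resnet50 backbone and input shape are assumptions; a GPU is assumed to match self.device):

import torch
import torchvision

cnn = torchvision.models.resnet50(pretrained=True).eval()
extractor = ExtractFeature(cnn, frames=16).eval().cuda()
dummy = torch.randn(1, 16, 3, 224, 224)
torch.onnx.export(extractor, dummy, "extractor.onnx", opset_version=11)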
  4. Calling the model through something other than the default forward
    torch.nn.Module defines __call__, which makes the module callable like a function; __call__ eventually dispatches to forward. So how do you export to ONNX when the code never uses forward, as in:
    out = net.forward_features(x)
    which calls forward_features explicitly? My first thought was to subclass and wire forward through to forward_features, but it is simpler to repoint the method directly, as in the sketch below:
    OCR.forward = OCR.forward_ocr
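A sketch of this trick in an export flow (OCR and forward_ocr are the author's names; the input shape is an assumption):

OCR.forward = OCR.forward_ocr  # torch.onnx.export traces forward, which now points at forward_ocr
dummy_input = torch.randn(1, 3, 512, 512, device='cuda')
torch.onnx.export(OCR, dummy_input, "ocr.onnx", opset_version=11)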

  5. Exporting the operator GatherElements to ONNX opset version 9 is not supported
    Opset 9 does not support this op. Raise the opset version; 12 is currently the highest, and higher versions support more ops. opset_version defaults to 9:

torch.onnx.export(model, dummy_input, "alexnet.onnx", verbose=True, input_names=input_names, output_names=output_names, opset_version=11)
  6. Dynamic input
    Dynamic input covers the batch size as well as variable H and W; use the dynamic_axes parameter to mark which dimensions may vary:
torch.onnx.export(OCR, dummy_input, onnx_ocr_forword_ocr_path,
                  input_names=['input'],
                  output_names=['segm_pred', 'segm_pred2', 'rbox', 'rbox2', 'angle', 'angle2', 'x'],
                  opset_version=11,
                  dynamic_axes={'input': {0: 'batch', 2: 'h', 3: 'w'}})
  7. Testing with onnxruntime
    After conversion, check that the ONNX model and the original model produce the same outputs. I first ran the model with onnxruntime, which failed at load time with "can't load culib 10.1" (the CUDA library could not be found). The code and the official docs explicitly require CUDA 10.1; if your version differs, install the matching one. A minimal consistency-check sketch follows.
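A sketch of the consistency check (assuming the alexnet.onnx exported earlier, with its input name actual_input_1 and the CUDA model object still in scope):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("alexnet.onnx")
x = torch.randn(10, 3, 224, 224)
with torch.no_grad():
    torch_out = model(x.cuda()).cpu().numpy()
ort_out = sess.run(None, {"actual_input_1": x.numpy()})[0]
# the two outputs should agree up to numerical tolerance
np.testing.assert_allclose(torch_out, ort_out, rtol=1e-3, atol=1e-5)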

TensorRT issue log

  1. Downloaded the latest TensorRT 7.1 from the official site; after installing and configuring the environment variables, the package turned out to contain only shared libraries (.so) and some C files, and import tensorrt failed. The official documentation states that TensorRT's Python API is not supported on Windows, so the Python interface cannot be used there.

  2. [TensorRT] ERROR: …/rtSafe/cuda/caskConvolutionRunner.cpp (290) - Cask Error in checkCaskExecError: 7 (Cask Convolution execution)
    [TensorRT] ERROR: FAILED_EXECUTION: std::exception
    This happens because the engine was created in one thread and executed in another under multithreading; put engine creation and execution in the same thread, as sketched below.
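A minimal sketch of keeping creation and execution in one thread (assuming a serialized engine; buffer allocation and data transfer are elided):

import threading
import tensorrt as trt

def worker(serialized_engine):
    # deserialize and execute in the same thread to avoid the Cask error
    runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
    engine = runtime.deserialize_cuda_engine(serialized_engine)
    context = engine.create_execution_context()
    # ... allocate buffers and call context.execute_v2(bindings) here ...

threading.Thread(target=worker, args=(serialized_engine,)).start()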

  3. [TensorRT] ERROR: …/rtSafe/cuda/cudaConvolutionRunner.cpp (303) - Cudnn Error in execute: 7 (CUDNN_STATUS_MAPPING_ERROR)
    [TensorRT] ERROR: FAILED_EXECUTION: std::exception
    After the engine is created, do not call to(device) or .cuda(); both the PyTorch and MXNet code paths were moving the model and data onto the GPU, and those calls must be removed.

  4. [TensorRT] WARNING: Explicit batch network detected and batch size specified, use execute without batch size instead.
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::resolveSlots::1024, condition: allInputDimensionsSpecified(routine)
    With a dynamic batch size TensorRT cannot build the engine directly; you must configure an optimization profile:

profile = builder.create_optimization_profile()
profile.set_shape(ModelData.INPUT_NAME,
                  ModelData.MIN_INPUT_SHAPE,
                  ModelData.OPT_INPUT_SHAPE,
                  ModelData.MAX_INPUT_SHAPE)

config.add_optimization_profile(profile)
engine = builder.build_engine(network, config)
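At inference time, the concrete input shape must also be set on the execution context before executing; otherwise TensorRT raises the allInputDimensionsSpecified parameter check quoted above. A sketch (the shape is an assumption):

context = engine.create_execution_context()
context.set_binding_shape(0, (batch_size, 3, 224, 224))  # concrete shape for binding 0
assert context.all_binding_shapes_specified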
  5. [TensorRT] ERROR: …/rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
    This looks like the GPU running out of memory, but watching usage with nvidia-smi -l showed plenty of free memory. Debugging revealed that the buffer size being allocated was negative: with a dynamic batch size the first dimension is -1, so the computed volume is negative and allocation fails. Flip the negative size to positive, then multiply by the batch size when allocating the buffer:
size = trt.volume(engine.get_binding_shape(binding)) * batch_size
if size < 0:
    # dynamic dimensions are -1, which makes the computed volume negative
    size *= -1
dtype = trt.nptype(engine.get_binding_dtype(binding))
# Allocate host and device buffers
host_mem = cuda.pagelocked_empty(size, dtype)
device_mem = cuda.mem_alloc(host_mem.nbytes)
  6. Dynamic input: for object detection the input width and height are not fixed. As with the batch-size problem in item 4, the minimum/optimal/maximum H and W must be set in the profile, and the actual w/h must also be passed into allocate_buffers; if they are not passed in, the dynamic dimensions stay at -1 and the computed size comes out far too small. The bindings are the model's inputs and outputs; compute a size and allocate a buffer for each, passing the h_ and w_ appropriate to the input and to the output respectively:
for binding in engine:
    if binding == 'input':
        # dynamic H/W are -1 in the binding shape, so multiply in the real h_/w_
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size * h_ * w_
    else:
        # this model's output maps have spatial size h/4 x w/4 (model-specific stride)
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size * math.ceil(h_ / 4) * math.ceil(w_ / 4)
    if size < 0:
        size *= -1

    dtype = trt.nptype(engine.get_binding_dtype(binding))
    # Allocate host and device buffers
    host_mem = cuda.pagelocked_empty(size, dtype)
  7. [TensorRT] ERROR: instance normalization doesn't support dynamic input
    Instance normalization does not support dynamic input, and it is used inside the detection model, so it cannot be worked around. This can also be filed under unsupported ops: you can either define a custom op in ONNX plus a matching plugin in TensorRT, or rewrite the op. Here I took the rewrite route: in torch/onnx/symbolic_opset9.py, the original instance_norm function is rewritten as the following decomposition into basic ops:
@parse_args('v', 'v', 'v', 'v', 'v', 'i', 'f', 'f', 'i')
def instance_norm(g, input, weight, bias, running_mean, running_var, use_input_stats,
                  momentum, eps, cudnn_enabled):
    # reduce over the spatial axes (-2, -1), i.e. H and W
    axes = [-i for i in range(2, 0, -1)]

    two_cst = g.op("Constant", value_t=torch.tensor(2.))
    eps_cst = g.op("Constant", value_t=torch.tensor(eps))

    mean = g.op("ReduceMean", input, axes_i=axes)
    numerator = sub(g, input, mean)
    # variance = e((x - e(x))^2), and (x - e(x)) is the numerator in the layer_norm formula
    variance = g.op("ReduceMean", pow(g, numerator, two_cst), axes_i=axes)
    denominator = sqrt(g, add(g, variance, eps_cst))

    inst_norm = div(g, numerator, denominator)
    if not (weight is None or weight.node().mustBeNone()):
        inst_norm = mul(g, inst_norm, weight)
    if not (bias is None or bias.node().mustBeNone()):
        inst_norm = add(g, inst_norm, bias)

    return inst_norm
  8. After rewriting instance_norm: mul elementwise dimension mismatch [1,256,4,8] and [1,1,1,256]
    The multiply dimensions do not match; mul fails when the normalized result is multiplied by the γ (weight) parameter after the mean and variance have been computed. Under broadcasting, tensors of different ranks are aligned from the trailing dimension, so the dimensions added to the 1-D weight are all prepended, giving [1,1,1,256]; its last dimension is not 1 and it cannot broadcast against [1,256,4,8]. Simply Unsqueeze two trailing dimensions onto weight and bias so the channel lands in the second dimension, and broadcasting then lines up:
@parse_args('v', 'v', 'v', 'v', 'v', 'i', 'f', 'f', 'i')
def instance_norm(g, input, weight, bias, running_mean, running_var, use_input_stats,
                  momentum, eps, cudnn_enabled):
    axes = [-i for i in range(2, 0, -1)]

    two_cst = g.op("Constant", value_t=torch.tensor(2.))
    eps_cst = g.op("Constant", value_t=torch.tensor(eps))

    mean = g.op("ReduceMean", input, axes_i=axes)
    numerator = sub(g, input, mean)
    # variance = e((x - e(x))^2), and (x - e(x)) is the numerator in the layer_norm formula
    variance = g.op("ReduceMean", pow(g, numerator, two_cst), axes_i=axes)
    denominator = sqrt(g, add(g, variance, eps_cst))

    inst_norm = div(g, numerator, denominator)
    if not (weight is None or weight.node().mustBeNone()):
        # [C] -> [C, 1, 1]: the channel axis now broadcasts against [N, C, H, W]
        weight = g.op("Unsqueeze", weight, axes_i=[-1])
        weight = g.op("Unsqueeze", weight, axes_i=[-1])
        inst_norm = mul(g, inst_norm, weight)
    if not (bias is None or bias.node().mustBeNone()):
        bias = g.op("Unsqueeze", bias, axes_i=[-1])
        bias = g.op("Unsqueeze", bias, axes_i=[-1])
        inst_norm = add(g, inst_norm, bias)

    return inst_norm
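Rather than editing the installed torch/onnx/symbolic_opset9.py in place, the same rewrite can be monkey-patched at export time (a sketch, assuming the function above is defined locally as instance_norm):

import torch.onnx.symbolic_opset9 as opset9
opset9.instance_norm = instance_norm  # takes effect for subsequent torch.onnx.export calls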