Problem description:
While deploying the backend of an AI cloud service that uses TensorRT for model inference, we noticed that GPU memory usage kept growing as clients continued to send requests. Once the usage accumulated past a certain point, the service failed with errors because it could no longer allocate GPU memory.
Analysis traced the problem to the TensorRT forward-inference step. The original code looked like this:
trt_engine_path = './model/resnet50.trt'
trt_runtime = trt.Runtime(TRT_LOGGER)
engine = load_engine(trt_runtime, trt_engine_path)
context = engine.create_execution_context()
inputs, outputs, bindings, stream = allocate_buffers(engine)
# engine and context are never released, so GPU memory leaks across requests
trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
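(load_engine is not shown in the snippet above; a minimal sketch, assuming it simply deserializes a serialized engine file from disk, might look like this:)

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(trt_runtime, engine_path):
    # Read the serialized engine from disk and deserialize it into an ICudaEngine
    with open(engine_path, 'rb') as f:
        engine_data = f.read()
    return trt_runtime.deserialize_cuda_engine(engine_data)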
Solution:
Load the engine and the execution context with "with" statements, so that the GPU memory is released automatically when inference finishes. The code becomes:
trt_engine_path = './model/resnet50.trt'
trt_runtime = trt.Runtime(TRT_LOGGER)
with load_engine(trt_runtime, trt_engine_path) as engine:
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    with engine.create_execution_context() as context:
        trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
        ......
Beyond this, it is worth studying the samples that ship with TensorRT; they show how to write TensorRT inference code for a range of different models.
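For reference, the allocate_buffers and do_inference helpers used above follow the pattern of the common.py shipped with those samples. The sketch below is an approximation written against the pre-TensorRT-8 Python binding API together with pycuda; exact function names and signatures vary between TensorRT versions:

import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

class HostDeviceMem:
    # Pairs a page-locked host buffer with its device allocation
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

def allocate_buffers(engine):
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)   # page-locked host buffer
        device_mem = cuda.mem_alloc(host_mem.nbytes)    # matching device buffer
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Copy inputs to the GPU, run inference asynchronously, then copy results back
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    stream.synchronize()
    return [out.host for out in outputs]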