NVIDIA DeepStream: Handling Latency, Stutter, Freezes, and Crashes

NVIDIA DeepStream Tips series: how to handle latency, stutter, freezes, and crashes in DeepStream applications

When running a DeepStream application, you may encounter latency, choppy video display, freezes, or even the application shutting down. This post lists the causes of these symptoms that I have run into, together with workable fixes. The code here is DeepStream Python, but the same reasoning applies to the C/C++ API.

Environment:

  • Target device: Jetson NX
  • IDE: VSCode
  • JetPack 4.6.1 GA
  • DeepStream 6.0.1


Possibility 1: Jetson clocks not set

The final step of installing DeepStream is to boost the clocks:

sudo nvpmodel -m 0
sudo jetson_clocks

If you installed DeepStream through SDK Manager, or you run an official image directly, this is usually already taken care of.
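
To double-check, both tools can report the current state (these query flags are standard on JetPack, though the output format varies by release):

sudo nvpmodel -q          # print the active power mode
sudo jetson_clocks --show # print the current clock settings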

Possibility 2: Set the batched-push-timeout property

batched-push-timeout is a property of Gst-nvstreammux. The official documentation defines it as:

Timeout in microseconds to wait after the first buffer is available to push the batch even if a complete batch is not formed.

Gst-nvstreammux muxes the individual video streams into batches, which then flow into the next element, nvinfer. batched-push-timeout must not be set too large, otherwise the batch push itself introduces latency. A common choice is 1/max_fps. Note that the unit is microseconds, so in code we write:

streammux.set_property('batched-push-timeout', int(1000000 / CfgVidSource.fps))  # 1/max_fps, expressed in microseconds

Possibility 3: The nvinfer model itself

Check whether the model's own inference time already dooms the system to lag. For example, with two cameras connected at 25 FPS each, inference must complete within 1/50 of a second per frame. We can mitigate this with nvtracker or by multithreading pipeline elements, but ultimately the model's own performance has to be considered first; a quick sanity check of the budget is sketched below.
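
A trivial sketch of that arithmetic (the source count and frame rate are the example values from the text):

# Per-frame inference budget for N live sources feeding one nvinfer instance.
num_sources = 2          # example: two cameras
fps_per_source = 25
budget_ms = 1000.0 / (num_sources * fps_per_source)
print(f"Inference must finish within {budget_ms:.1f} ms per frame")  # 20.0 ms here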

Another possibility is the model's input image size. The width and height properties of Gst-nvstreammux control the size of the frames that nvstreammux hands to nvinfer; their default values are 1280 and 720. In my experience, adjusting this size does not change overall performance much, although an oversized input can make nvinfer spend more time. Since most models take fairly small inputs anyway, we can set these two properties to any value greater than or equal to the model's input size, without making them needlessly large. The relevant code:

streammux.set_property('width', streammux_width)
streammux.set_property('height', streammux_height)

Possibility 4: Set the live-source property

live-source is a property of Gst-nvstreammux. The official documentation defines it as:

Indicates to muxer that sources are live, e.g. live feeds like an RTSP or USB camera.

In other words, if the input is an RTSP IP camera or a USB camera (and presumably a CSI camera as well), this property should be set to 1. Note that its default value is 0, so it must be set explicitly for live inputs. The relevant code:

if is_live:  # for live video inputs, switch streammux's 'live-source' property on
    streammux.set_property('live-source', 1)

Possibility 5: Set the sink's sync property

I did not find this in the main NVIDIA plugin documentation, but it is recorded in the Troubleshooting guide, and I use it in my own projects. Every sink element, whether nvoverlaysink, nveglglessink, or fakesink, has a sync property, and it should be set to 0/False. For udpsink there is additionally an async property that must be set to False while sync is set to 1; I honestly don't know why. The relevant code:

# RTSP output via udpsink (example stream: rtsp://<server IP>:8554/ds-test)
sink_rtsp = Gst.ElementFactory.make("udpsink", "udpsink")
sink_rtsp.set_property('async', False)
sink_rtsp.set_property('sync', True)

# Local display, alternative 1: nvoverlaysink
sink_localdisplay = Gst.ElementFactory.make("nvoverlaysink", "nvvideo-renderer")
sink_localdisplay.set_property("sync", False)

# Local display, alternative 2: nveglglessink
sink_localdisplay = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")
sink_localdisplay.set_property("sync", False)

# No output at all: fakesink simply discards everything it receives
sink_localdisplay = Gst.ElementFactory.make("fakesink", "fakesink")
sink_localdisplay.set_property("sync", False)

Possibility 6: Use nvinfer together with nvtracker

In general, inference takes relatively long, while object tracking is fast.

For nvinfer we can set the interval property, whose official definition is:

Specifies the number of consecutive batches to be skipped for inference

This property can be set in the config file, e.g. dstest1_pgie_config.txt, or in code, as below:

pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
pgie.set_property("interval", interval)  # skip this many batches between inference runs
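
Equivalently, the same setting can live in the nvinfer config file's [property] group (the value 4 here is only illustrative):

[property]
# ... other nvinfer settings ...
interval=4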

Of course, nvtracker then needs to be created, added to the pipeline, and linked in after nvinfer, so that tracking covers the frames that inference skips.

tracker = Gst.ElementFactory.make("nvtracker", "tracker")
# nvtracker also needs its low-level tracker library configured, e.g.
# tracker.set_property('ll-lib-file', '/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so')
#...
pgie.link(tracker)
tracker.link(nvvidconv1)

Possibility 7: The qos setting

The official FAQ includes this question:

When deepstream-app is run in loop on Jetson AGX Xavier using “while true; do deepstream-app -c <config_file>; done;”, after a few iterations I see low FPS for certain iterations. Why is that?

The answer:

This may happen when you are running thirty 1080p streams at 30 frames/second. The issue is caused by initial load. I/O operations bog down the CPU, and with qos=1 as a default property of the [sink0] group, decodebin starts dropping frames. To avoid this, set qos=0 in the [sink0] group in the configuration file.

So in practice this problem rarely occurs, since we normally do not attach that many cameras, but it does no harm to set it anyway; for deepstream-app this is a config-file change, sketched below.
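
A minimal sketch of the relevant deepstream-app config group (only qos matters here; the other keys are illustrative):

[sink0]
enable=1
type=2   # EglSink
sync=0
qos=0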

The AV Sync in DeepStream page lists many examples of RTSP/RTMP/file input and output pipelines in which the use of qos can be seen, for example:

RTMP_IN -> RTMP_OUT:

gst-launch-1.0 uridecodebin3 uri=$input1 name=demux1 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_0 nvstreammux batch-size=2 max-latency=250000000 batched-push-timeout=33333 width=1920 height=1080 sync-inputs=1 name=mux1 ! queue ! nvmultistreamtiler width=480 height=360 ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! nvv4l2h264enc ! h264parse ! queue ! flvmux name=mux streamable=true !  rtmpsink location=$output async=0 qos=0 sync=1 uridecodebin3 uri=$input2 name=demux2 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_1 demux1. ! queue ! audioconvert ! mixer.sink_0 audiomixer name=mixer  latency=250000000 ! queue ! avenc_aac ! aacparse ! queue ! mux. demux2. ! queue ! audioconvert ! mixer.  fakesrc num-buffers=0 is-live=1 ! mixer. -e

RTMP_IN->FILE_OUT:

gst-launch-1.0 uridecodebin3 uri=$input1 name=demux1 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_0 nvstreammux batch-size=2 batched-push-timeout=33333 width=1920 height=1080 sync-inputs=1 max-latency=250000000 name=mux1 ! queue ! nvmultistreamtiler width=480 height=360 ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! nvv4l2h264enc ! h264parse ! queue ! flvmux name=mux streamable=true ! filesink location=out.flv  async=0 qos=0 sync=1 uridecodebin3 uri=$input2 name=demux2 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_1 demux1. ! queue ! audioconvert ! mixer.sink_0 audiomixer latency=250000000 name=mixer ! queue ! avenc_aac ! aacparse ! queue ! mux. demux2. ! queue ! audioconvert ! mixer.  fakesrc num-buffers=0 is-live=1 ! mixer. -e

RTSP_IN->RTSP_OUT:

gst-launch-1.0 uridecodebin3 uri=$input1 name=demux1 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_0 nvstreammux batch-size=2 batched-push-timeout=33333 width=1920 height=1080 sync-inputs=1 name=mux1 ! queue ! nvmultistreamtiler width=480 height=360 ! nvrtspoutsinkbin name=r uridecodebin3 uri=$input2 name=demux2 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_1 demux1. ! queue ! audioconvert ! mixer.sink_0 audiomixer name=mixer ! queue ! r.  demux2. ! queue ! audioconvert ! mixer. -e

As you can see, we simply need to set qos=0 on the corresponding sink. In Python:

sink_localdisplay = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")
sink_localdisplay.set_property("qos", 0)

Possibility 8: The drop-on-latency setting

The Troubleshooting page says:

For RTSP streaming input, if the input has high jitter the GStreamer rtpjitterbuffer element might drop packets which are late. Increase the latency property of rtspsrc, for deepstream-app set latency in [source*] group. Alternatively, if using RTSP type source (type=4) with deepstream-app, turn off drop-on-latency in deepstream_source_bin.c. These steps may add cumulative delay in frames reaching the renderer and memory accumulation in the rtpjitterbuffer if the pipeline is not fast enough.

Digging a little deeper, I checked the GStreamer documentation for rtspsrc, which describes drop-on-latency as:

Tells the jitterbuffer to never exceed the given latency in size

In other words, when rtspsrc is the input element, setting drop-on-latency to true tells the jitterbuffer never to grow beyond the configured latency, dropping late packets instead; turning it off avoids dropping packets at the cost of extra buffering and cumulative delay, which is the trade-off the troubleshooting advice above refers to. If you use uridecodebin instead, there is nothing to set, because it does not expose this property.
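
A minimal sketch of tuning these two properties on rtspsrc (the URL and latency value are placeholders; Gst is assumed imported as in the other snippets):

rtsp_src = Gst.ElementFactory.make("rtspsrc", "rtsp-source")
rtsp_src.set_property("location", "rtsp://<camera IP>:554/stream")  # placeholder URL
rtsp_src.set_property("latency", 200)           # jitterbuffer latency in ms
rtsp_src.set_property("drop-on-latency", True)  # bound the buffer by dropping late packets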

Possibility 9: The probe() callback

This one puzzled me the longest; I finally found a clear answer here:

Pipeline unable to perform at real time
WARNING: A lot of buffers are being dropped. (13): gstbasesink.c(2902):
gst_base_sink_is_too_late (): /GstPipeline:pipeline0/GstEglGlesSink:nvvideo-renderer:
There may be a timestamping problem, or this computer is too slow.

Answer:
This could be thrown from any GStreamer app when the pipeline through-put is low
resulting in late buffers at the sink plugin.
This could be a hardware capability limitation or expensive software calls
that hinder pipeline buffer flow.
With python, there is a possibility for such an induced delay in software.
This is with regards to the probe() callbacks an user could leverage
for metadata extraction and other use-cases as demonstrated in our test-apps.

Please NOTE:
a) probe() callbacks are synchronous and thus holds the buffer
(info.get_buffer()) from traversing the pipeline until user return.
b) loops inside probe() callback could be costly in python.

The key point: probe() callbacks are synchronous. We usually attach a probe after some element, say the OSD, and use it for post-processing or for collecting metadata to send to the cloud. Because the probe is synchronous, the longer it runs, the worse the pipeline's real-time behavior becomes. Several approaches can ease or solve this. Keep the probe on the main pipeline as small as possible, avoiding loops; branch off with a tee and a queue and attach a second probe with the complex logic to that branch; or simply spawn a worker thread for the heavy logic and let the probe exchange data with it. In short, never put time-consuming logic in a probe on the main path.
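
A minimal sketch of the worker-thread approach, assuming the standard pyds bindings. extract_summary() and handle_summary() are hypothetical helpers: extract_summary() must copy plain Python values out of the batch meta, because the meta is only valid while the probe still holds the buffer.

import queue
import threading

import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst
import pyds

work_q = queue.Queue(maxsize=30)  # bounded, so the probe never blocks

def heavy_worker():
    # All expensive logic (post-processing, cloud upload, ...) runs here,
    # off the GStreamer streaming thread.
    while True:
        summary = work_q.get()
        if summary is None:        # shutdown sentinel
            break
        handle_summary(summary)    # hypothetical: your heavy logic

threading.Thread(target=heavy_worker, daemon=True).start()

def osd_sink_pad_buffer_probe(pad, info, u_data):
    # Keep the probe cheap: copy out what we need and return immediately;
    # the buffer is blocked from traversing the pipeline until we return.
    batch_meta = pyds.gst_buffer_get_batch_meta(hash(info.get_buffer()))
    try:
        work_q.put_nowait(extract_summary(batch_meta))  # hypothetical helper
    except queue.Full:
        pass  # prefer dropping one result over stalling the pipeline
    return Gst.PadProbeReturn.OK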

Possibility 10: No threading set up in the main pipeline

The sample deepstream_test_3.py contains this code:

queue1=Gst.ElementFactory.make("queue","queue1")
queue2=Gst.ElementFactory.make("queue","queue2")
queue3=Gst.ElementFactory.make("queue","queue3")
queue4=Gst.ElementFactory.make("queue","queue4")
queue5=Gst.ElementFactory.make("queue","queue5")
pipeline.add(queue1)
pipeline.add(queue2)
pipeline.add(queue3)
pipeline.add(queue4)
pipeline.add(queue5)
print("Creating Pgie \n ")
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
if not pgie:
    sys.stderr.write(" Unable to create pgie \n")
print("Creating tiler \n ")
tiler=Gst.ElementFactory.make("nvmultistreamtiler", "nvtiler")
if not tiler:
    sys.stderr.write(" Unable to create tiler \n")
print("Creating nvvidconv \n ")
nvvidconv = Gst.ElementFactory.make("nvvideoconvert", "convertor")
if not nvvidconv:
    sys.stderr.write(" Unable to create nvvidconv \n")
print("Creating nvosd \n ")
nvosd = Gst.ElementFactory.make("nvdsosd", "onscreendisplay")
if not nvosd:
    sys.stderr.write(" Unable to create nvosd \n")
nvosd.set_property('process-mode',OSD_PROCESS_MODE)
nvosd.set_property('display-text',OSD_DISPLAY_TEXT)
if(is_aarch64()):
    print("Creating transform \n ")
    transform=Gst.ElementFactory.make("nvegltransform", "nvegl-transform")
    if not transform:
        sys.stderr.write(" Unable to create transform \n")

print("Creating EGLSink \n")
sink = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")
if not sink:
    sys.stderr.write(" Unable to create egl sink \n")

if is_live:
    print("Atleast one of the sources is live")
    streammux.set_property('live-source', 1)

streammux.set_property('width', 1920)
streammux.set_property('height', 1080)
streammux.set_property('batch-size', number_sources)
streammux.set_property('batched-push-timeout', 4000000)
pgie.set_property('config-file-path', "dstest3_pgie_config.txt")
pgie_batch_size=pgie.get_property("batch-size")
if(pgie_batch_size != number_sources):
    print("WARNING: Overriding infer-config batch-size",pgie_batch_size," with number of sources ", number_sources," \n")
    pgie.set_property("batch-size",number_sources)
tiler_rows=int(math.sqrt(number_sources))
tiler_columns=int(math.ceil((1.0*number_sources)/tiler_rows))
tiler.set_property("rows",tiler_rows)
tiler.set_property("columns",tiler_columns)
tiler.set_property("width", TILED_OUTPUT_WIDTH)
tiler.set_property("height", TILED_OUTPUT_HEIGHT)
sink.set_property("qos",0)
sink.set_property("sync", 0)

print("Adding elements to Pipeline \n")
pipeline.add(pgie)
pipeline.add(tiler)
pipeline.add(nvvidconv)
pipeline.add(nvosd)
if is_aarch64():
    pipeline.add(transform)
pipeline.add(sink)

print("Linking elements in the Pipeline \n")
streammux.link(queue1)
queue1.link(pgie)
pgie.link(queue2)
queue2.link(tiler)
tiler.link(queue3)
queue3.link(nvvidconv)
nvvidconv.link(queue4)
queue4.link(nvosd)
if is_aarch64():
    nvosd.link(queue5)
    queue5.link(transform)
    transform.link(sink)
else:
    nvosd.link(queue5)
    queue5.link(sink)   

Notice how many queue elements are created while the pipeline is assembled. Why? This is GStreamer's way of forcing the use of new threads inside a pipeline. From the GStreamer documentation:

There are several reasons to force the use of threads. However, for performance reasons, you never want to use one thread for every element out there, since that will create some overhead. Let’s now list some situations where threads can be particularly useful:

  • Data buffering, for example when dealing with network streams or when recording data from a live stream such as a video or audio card. Short hickups elsewhere in the pipeline will not cause data loss. See also Stream buffering about network buffering with queue2.
  • Synchronizing output devices, e.g. when playing a stream containing both video and audio data. By using threads for both outputs, they will run independently and their synchronization will be better.

Above, we’ve mentioned the “queue” element several times now. A queue is the thread boundary element through which you can force the use of threads. It does so by using a classic provider/consumer model as learned in threading classes at universities all around the world. By doing this, it acts both as a means to make data throughput between threads threadsafe, and it can also act as a buffer.

One more note: when a heavy probe sits on such a branch, the queue's leaky property should be set to 2. The reasoning is that since the probe cannot keep up with every frame, some frames have to be dropped, and leaky=2 ("downstream") drops the oldest buffers in the queue. A sketch follows.
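
A minimal sketch (the max-size-buffers cap is illustrative; Gst is assumed imported as above):

# Drop the oldest buffers instead of blocking upstream when this branch
# cannot keep up (leaky: 0 = no, 1 = upstream/newest, 2 = downstream/oldest).
probe_queue = Gst.ElementFactory.make("queue", "probe-queue")
probe_queue.set_property("leaky", 2)
probe_queue.set_property("max-size-buffers", 4)  # illustrative cap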

Possibility 11: batch-size not set correctly

In general, batch-size should match the number of input cameras. If it is smaller than the number of sources, some of the streams may lag.
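
A minimal sketch, mirroring the sample above (number_sources is assumed to hold the actual source count):

streammux.set_property('batch-size', number_sources)
# Keep nvinfer consistent; its batch-size can also be set in the config file.
pgie.set_property('batch-size', number_sources)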

Other possibilities

The official DeepStream documentation lists a few more possible causes of display latency, stutter, freezes, or application exit. I consider them unlikely, or do not fully understand them, but list them here as well:

  • If secondary inferencing is enabled, try to increase batch-size in the configuration file's [secondary-gie#] group in case the number of objects to be inferred is greater than the batch-size setting. (I don't fully understand this one; presumably the point is that the secondary GIE batches objects rather than frames, so many detected objects per frame can overflow a small batch.)
  • On Jetson, use Gst-nvdrmvideosink instead of Gst-nveglglessink, as nveglglessink requires GPU utilization. (I don't entirely agree. First, the cause of display latency is not necessarily the GPU; it can be the CPU, for example an overly complex probe. Second, from what I have seen, Gst-nvdrmvideosink is rarely used and does not seem very convenient.)
  • If the elements in the pipeline are getting starved for buffers (you can check if CPU/GPU utilization is low), increase the number of buffers allocated by the decoder by setting the num-extra-surfaces property of the [source#] group in the application or the num-extra-surfaces property of the Gst-nvv4l2decoder element (see the sketch after this list).
  • On Jetson, in the configuration file of gst-nvinfer, set scaling-compute-hw=1 if GPU usage is not at 100%.
  • On dGPU, set the cudadec-memtype=0 property on the Gst-nvv4l2decoder plugin to select device memory output.
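
A minimal sketch of the decoder-side settings above (values are illustrative; cudadec-memtype applies to dGPU only, and Gst is assumed imported as before):

decoder = Gst.ElementFactory.make("nvv4l2decoder", "decoder")
decoder.set_property("num-extra-surfaces", 5)  # extra decode buffers against starvation
decoder.set_property("cudadec-memtype", 0)     # dGPU only: 0 selects device memory output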
