[Learning OpenCV] Page-locked memory in the gpu module

Latest recommended article published 2023-03-27 18:46:39

Kelvin_Ngan


Categories: opencv, cuda. Tags: opencv, gpu

Article link: https://blog.csdn.net/Kelvin_Yan/article/details/41278881


Why look into this:

Page-locking a Mat's host memory with registerPageLocked before calling upload can significantly speed up the transfer to the GPU.

For example:

First run

src_regist.create(cvSize(8192,8192),CV_16UC1);
cv::gpu::registerPageLocked(src_regist);

Then run

gpusrc.upload(src_regist);

The upload takes 2 ms; with step 1 removed, it takes 400 ms!
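Numbers like these depend heavily on how they are measured. The first CUDA call in a process also pays for one-time context initialization, so a single timed call can overstate the cost of the copy itself. Whether that contributed to the 400 ms figure is an assumption here, not something the measurements above establish. A minimal sketch of a timing helper that runs a few untimed warm-up calls first and then reports the median; `medianMs` is a hypothetical name and is not part of OpenCV:

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical helper (not an OpenCV function): calls fn `warmup` times
// without timing, then `runs` times with timing, and returns the median
// wall-clock duration of the timed calls in milliseconds.
template <typename Fn>
double medianMs(Fn&& fn, int warmup, int runs) {
    if (runs <= 0) return 0.0;               // nothing to measure
    for (int i = 0; i < warmup; ++i) fn();   // absorbs one-time setup cost
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(ms.begin(), ms.end());
    return ms[ms.size() / 2];                // median (upper one if runs is even)
}
```

With the variables from the example above, the call would look like `medianMs([&] { gpusrc.upload(src_regist); }, 1, 10)`, once with and once without the registerPageLocked step.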

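The benefit: once the host buffer is page-locked, its pages are guaranteed to stay in physical RAM, so the GPU's DMA engine can read them directly and the driver no longer has to copy the data through a temporary pinned buffer on the CPU side. That is naturally much faster. It is the same idea as DMA (direct memory access) on a DSP: keep the processor out of the transfer so that the speed mismatch between bus and processor does not become the bottleneck.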

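But one thing I do not understand:

src_regist.create(cvSize(1024,1024),CV_16UC1);
cv::gpu::registerPageLocked(src_regist);

Shrinking the page-locked Mat slows the upload to 5.6 ms. My src_regist data is really only 1000*1000, so in principle anything larger than 1000*1000 should make no difference, right?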

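Reducing cvSize step by step to see what happens:

cvSize(512,512): 5.6 ms

cvSize(128,128): 5.6 ms

The result is the opposite of what I expected: the smaller the page-locked region, the more the time settles at a stable value. So if the amount of memory that can be locked is limited, I would rather lock less! One thing is certain, though: page-locking makes the upload far more efficient, 5/400 ≈ 1%!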
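To put the timings in perspective, it helps to convert them into buffer sizes and effective throughput. The sketch below assumes the uploaded data is 1000*1000 elements of CV_16UC1 (2 bytes each), as stated above; the function names are illustrative only.

```cpp
#include <cstddef>

// Bytes occupied by a rows x cols matrix with the given element size.
// CV_16UC1 is one 16-bit channel, i.e. 2 bytes per element.
constexpr std::size_t bufferBytes(std::size_t rows, std::size_t cols,
                                  std::size_t bytesPerElem) {
    return rows * cols * bytesPerElem;
}

// Effective throughput in GB/s (1 GB = 1e9 bytes) for `bytes` moved in `ms`.
constexpr double throughputGBps(std::size_t bytes, double ms) {
    return (static_cast<double>(bytes) / 1e9) / (ms / 1e3);
}

// How long the fast case takes relative to the slow case, in percent.
constexpr double ratioPercent(double fastMs, double slowMs) {
    return 100.0 * fastMs / slowMs;
}
```

Under that assumption the data is 2,000,000 bytes, so 2 ms corresponds to about 1 GB/s and 5.6 ms to about 0.36 GB/s. The 8192*8192 registration covers 134,217,728 bytes (128 MiB), and 5/400 works out to 1.25%.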

Below is the wiki explanation of page-locked memory (link):

In the framework of accelerating computational codes by Parallel Computing on Graphics Processing Unit (GPU), the data to be processed must be transferred from system RAM to the graphics-card's VRAM, and the results of the processing from VRAM back to system RAM. In a computational code accelerated by General Purpose GPU (GPGPU) computing, such transactions can occur many times and may affect the overall performance, so that the problem of carrying out those transfers in the fastest way arises.
In short: data transfers between host memory and GPU memory affect the performance of GPU computing.

To allow programmers to use a larger virtual address space than is actually available in the RAM, CPUs (or hosts, in the language of GPGPU) implement a virtual memory system (non-locked memory) in which a physical memory page can be swapped out to disk. When the host needs that page, it loads it back in from the disk. The drawback with CPU<->GPU memory transfers is that memory transactions are slower, i.e., the bandwidth of the PCI-E bus to connect CPU and GPU is not fully exploited. Non-locked memory is stored not only in memory (e.g. it can be in swap), so the driver needs to access every single page of the non-locked memory, copy it into pinned buffer and pass it to the Direct Memory Access (DMA) (synchronous, page-by-page copy). Indeed, PCI-E transfers occur only using the DMA. Accordingly, when a “normal” transfer is issued, an allocation of a block of page-locked memory is necessary, followed by a host copy from regular memory to the page-locked one, the transfer, the wait for the transfer to complete and the deletion of the page-locked memory. This consumes precious host time which is avoided when directly using page-locked memory.
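To let programmers use a virtual address space larger than the physical RAM, the CPU (host) implements virtual memory: the address space is divided into pages, and a physical page can be swapped out to disk (the swap partition) and loaded back when it is needed. Memory that can be swapped out in this way is called non-locked memory.

Because of virtual memory, CPU-GPU transfers can become slower. Non-locked memory does not necessarily reside in physical RAM; part of it may be in swap. The driver therefore has to touch every page of the non-locked memory, copy it into a pinned buffer, and hand that buffer to the DMA engine, which moves it to the GPU over PCI-E. Page-locked memory avoids this extra work: the data is pinned in physical RAM and is never swapped out to disk. The precondition, of course, is that the pinned data fits within the available physical memory.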
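The difference between the two paths can be illustrated with a toy model. This is only a sketch of the bookkeeping described in the quoted text, not the real driver logic: the "device" is an ordinary host vector, `fakeDma` is a plain memcpy standing in for a DMA transfer, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

struct CopyStats {
    std::size_t hostCopies = 0;    // extra CPU-side copies into the staging buffer
    std::size_t dmaTransfers = 0;  // simulated DMA transfers
};

// Stand-in for a DMA transfer; in this model it is just a memcpy.
inline void fakeDma(const unsigned char* src, unsigned char* dst,
                    std::size_t n, CopyStats& stats) {
    std::memcpy(dst, src, n);
    ++stats.dmaTransfers;
}

// Pageable path: each chunk is first copied into a pinned staging buffer,
// and only the staging buffer is handed to the DMA engine.
CopyStats uploadPageable(const std::vector<unsigned char>& host,
                         std::vector<unsigned char>& device,
                         std::size_t stagingBytes) {
    CopyStats stats;
    if (stagingBytes == 0) stagingBytes = 4096;  // avoid an endless loop
    device.resize(host.size());
    std::vector<unsigned char> staging(stagingBytes);
    for (std::size_t off = 0; off < host.size(); off += stagingBytes) {
        const std::size_t n = std::min(stagingBytes, host.size() - off);
        std::memcpy(staging.data(), host.data() + off, n);  // extra host copy
        ++stats.hostCopies;
        fakeDma(staging.data(), device.data() + off, n, stats);
    }
    return stats;
}

// Page-locked path: the DMA engine reads the source buffer directly.
CopyStats uploadPinned(const std::vector<unsigned char>& host,
                       std::vector<unsigned char>& device) {
    CopyStats stats;
    device.resize(host.size());
    if (!host.empty()) fakeDma(host.data(), device.data(), host.size(), stats);
    return stats;
}
```

In this model a 10,000-byte buffer with a 4,096-byte staging buffer needs three host copies and three transfers on the pageable path, against zero host copies and one transfer on the pinned path.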

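I did not fully follow this part; for reference:

Operating system memory management: partitioning, paging, and segmentation

Windows memory management
Enable the Lock Pages in Memory Option (Windows)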

However, with today’s memories, the use of virtual memory is no longer necessary for many applications which will fit within the host memory space. In all those cases, it is more convenient to use page-locked (pinned) memory which enables a DMA on the GPU to request transfers to and from the host memory without the involvement of the CPU. In other words, locked memory is stored in the physical memory (RAM), so the GPU (or device, in the language of GPGPU) can fetch it without the help of the host (synchronous copy).

GPU memory is automatically allocated as page-locked, since GPU memory does not support swapping to disk. To allocate page-locked memory on the host in CUDA language one could use cudaHostAlloc.
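cudaHostAlloc requires the CUDA toolkit, but the operating-system side of page-locking can be tried on its own. The sketch below assumes a POSIX system (Linux or macOS) and uses mlock/munlock; `LockedBuffer` is a hypothetical wrapper. Note that mlock by itself only prevents swapping. It does not register the memory with the CUDA driver, so for GPU transfers the CUDA calls (cudaHostAlloc, or cudaHostRegister for an existing buffer) are still the ones to use.

```cpp
#include <cstddef>
#include <vector>
#include <sys/mman.h>  // POSIX mlock / munlock

// Hypothetical RAII wrapper: owns a byte buffer and tries to lock its pages
// in physical RAM. Locking can fail (for example when the process exceeds
// its RLIMIT_MEMLOCK limit), so callers should check locked().
class LockedBuffer {
public:
    explicit LockedBuffer(std::size_t bytes)
        : data_(bytes),
          locked_(bytes > 0 && mlock(data_.data(), data_.size()) == 0) {}

    ~LockedBuffer() {
        if (locked_) munlock(data_.data(), data_.size());
    }

    LockedBuffer(const LockedBuffer&) = delete;
    LockedBuffer& operator=(const LockedBuffer&) = delete;

    bool locked() const { return locked_; }      // true if mlock succeeded
    unsigned char* data() { return data_.data(); }
    std::size_t size() const { return data_.size(); }

private:
    std::vector<unsigned char> data_;  // declared first: initialized before locked_
    bool locked_;
};
```

The buffer stays usable whether or not the lock succeeded; only the guarantee against swapping differs. This also matches the precondition mentioned earlier: the amount of memory a process may lock is limited, and the request fails once it exceeds that limit.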