On the host side one normally works with GpuMat; when defining the kernel's caller function in a .cu file, the parameters are declared as PtrStepSz (or PtrStep), and the kernel itself can also use PtrStepSz. This is how arguments are passed from host to device.
PtrStepSz is essentially a stripped-down GpuMat. But looking carefully at the GpuMat source (sources\modules\core\src\cuda\gpu_mat.cu), only ::data is allocated on the device, i.e. as pitched device memory via cudaMallocPitch; the other members such as rows, cols and step are stored in host memory. Now look at the conversion from GpuMat to PtrStepSz (PtrStep):
template <class T> inline
GpuMat::operator PtrStepSz<T>() const
{
return PtrStepSz<T>(rows, cols, (T*)data, step);
}
template <class T> inline
GpuMat::operator PtrStep<T>() const
{
return PtrStep<T>((T*)data, step);
}
The PtrStepSz constructor just assigns the members one by one, so they should likewise end up in host memory. How, then, can a kernel read rows, cols and step?
See the official OpenCV description: https://docs.opencv.org/2.4/modules/gpu/doc/data_structures.html
Lightweight class encapsulating pitched memory on a GPU and passed to nvcc-compiled code (CUDA kernels). Typically, it is used internally by OpenCV and by users who write device code. You can call its members from both host and device code.
Now look at the definition of PtrStepSz (PtrStep) in include\opencv2\core\cuda_types.hpp:
#ifdef __CUDACC__
#define __CV_CUDA_HOST_DEVICE__ __host__ __device__ __forceinline__
#else
#define __CV_CUDA_HOST_DEVICE__
#endif
...
template <typename T> struct PtrStep : public DevPtr<T>
{
__CV_CUDA_HOST_DEVICE__ PtrStep() : step(0) {}
__CV_CUDA_HOST_DEVICE__ PtrStep(T* data_, size_t step_) : DevPtr<T>(data_), step(step_) {}
size_t step;
__CV_CUDA_HOST_DEVICE__ T* ptr(int y = 0) { return ( T*)( ( char*)DevPtr<T>::data + y * step); }
__CV_CUDA_HOST_DEVICE__ const T* ptr(int y = 0) const { return (const T*)( (const char*)DevPtr<T>::data + y * step); }
__CV_CUDA_HOST_DEVICE__ T& operator ()(int y, int x) { return ptr(y)[x]; }
__CV_CUDA_HOST_DEVICE__ const T& operator ()(int y, int x) const { return ptr(y)[x]; }
};
template <typename T> struct PtrStepSz : public PtrStep<T>
{
__CV_CUDA_HOST_DEVICE__ PtrStepSz() : cols(0), rows(0) {}
__CV_CUDA_HOST_DEVICE__ PtrStepSz(int rows_, int cols_, T* data_, size_t step_)
: PtrStep<T>(data_, step_), cols(cols_), rows(rows_) {}
template <typename U>
explicit PtrStepSz(const PtrStepSz<U>& d) : PtrStep<T>((T*)d.data, d.step), cols(d.cols), rows(d.rows){}
int cols;
int rows;
};
Every PtrStep member function here carries the __host__ __device__ __forceinline__
prefix, meaning it can be called from both host and device code: nvcc (the device-side compiler) and the host compiler (e.g. MSVC or gcc) each compile their own copy of the function, so the same call executes different object code on the host and on the device. The member variables rows, cols and step, however, are not duplicated by the compiler; they live wherever the containing object lives. The answer to the question above is that CUDA kernel arguments are passed by value: at launch, the runtime copies the entire PtrStepSz struct, including rows, cols, step and the data pointer, into the kernel's parameter space (which resides in device constant memory). The kernel therefore reads its own device-side copies of those members, while data still points at the pitched device allocation made by cudaMallocPitch.
On __host__ __device__:
https://stackoverflow.com/questions/33218522/cuda-host-device-variables
https://forums.developer.nvidia.com/t/what-is-host-device/26709
Sometimes the same functionality is needed in both the host and the device portions of CUDA code. To avoid code duplication, CUDA allows such functions to carry both host and device attributes, which means the compiler places one copy of that function into the host compilation flow (to be compiled by the host compiler, e.g. gcc or MSVC), and a second copy into the device compilation flow (to be compiled with NVIDIA’s CUDA compiler).
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#host
The __device__ and __host__ execution space specifiers can be used together however, in which case the function is compiled for both the host and the device. The __CUDA_ARCH__ macro introduced in Application Compatibility can be used to differentiate code paths between host and device:
__host__ __device__ void func()
{
#if __CUDA_ARCH__ >= 800
// Device code path for compute capability 8.x
#elif __CUDA_ARCH__ >= 700
// Device code path for compute capability 7.x
#elif __CUDA_ARCH__ >= 600
// Device code path for compute capability 6.x
#elif __CUDA_ARCH__ >= 500
// Device code path for compute capability 5.x
#elif __CUDA_ARCH__ >= 300
// Device code path for compute capability 3.x
#elif !defined(__CUDA_ARCH__)
// Host code path
#endif
}
In code, the __CUDA_ARCH__ macro distinguishes the two compilation passes, not runtime behavior: it is defined only while nvcc compiles device code, so the #if above decides at compile time which path ends up in the host copy and in each device copy of the function.