What's the difference between CUDA shared and global memory?

1. When do we use cudaMalloc()?

To store data on the GPU that can be copied back to the host, we need allocated memory that lives until it is freed. Think of global memory as heap space: it persists until it is freed or the application exits, and it is visible to any thread in any block that holds a pointer to that memory region. Shared memory can be thought of as stack space that lives until a block of a kernel finishes; it is visible only to threads within the same block. So cudaMalloc() is used to allocate space in global memory.
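As a rough illustration of the global-memory lifecycle described above, here is a minimal sketch. The kernel `doubleElements` and the helper `doubleOnDevice` are hypothetical names made up for this example; only the `cuda*` runtime calls are real API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread doubles one element that lives in global memory.
__global__ void doubleElements(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Hypothetical helper: sends n ints to the GPU, doubles them, copies them back.
// Returns true only if every CUDA call succeeded.
bool doubleOnDevice(int *host, int n) {
    int *dev = nullptr;
    size_t bytes = n * sizeof(int);

    // cudaMalloc hands back a pointer into global memory; it stays valid
    // across kernel launches until cudaFree is called.
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        doubleElements<<<(n + 255) / 256, 256>>>(dev, n);
        // The blocking copy below also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}

int main() {
    int values[4] = {1, 2, 3, 4};
    if (doubleOnDevice(values, 4))
        printf("%d %d %d %d\n", values[0], values[1], values[2], values[3]);
    else
        printf("CUDA call failed (no usable device?)\n");
    return 0;
}
```

The same `dev` pointer could be passed to any number of later kernels, which is the "heap-like" lifetime the answer refers to.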

2. Do we get a pointer to shared or global memory?

You will get a pointer to a memory address residing in global memory.

3. Does global memory reside on the host or device?

Global memory resides on the device. However, there are ways to use host memory as "global" memory via mapped memory (see: CUDA Zero Copy memory considerations). This can be slow, though, because every access is limited by the transfer speed of the bus.
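A minimal sketch of mapped (zero-copy) memory, assuming a device that supports host-memory mapping. The kernel `addOne` and the helper `addOneZeroCopy` are hypothetical names for this example; the buffer is allocated on the host and the kernel reaches it through a device-side alias, so there is no explicit cudaMemcpy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: increments each element through the mapped pointer.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: fills a mapped host buffer with 0..n-1, lets the GPU
// add one to each element, and reports the first element afterwards.
bool addOneZeroCopy(int n, int *firstOut) {
    int *hostPtr = nullptr;
    int *devPtr = nullptr;

    // Older toolkits need this flag set before the context is created;
    // the return value is ignored here because newer ones map by default.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    if (cudaHostAlloc((void **)&hostPtr, n * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess)
        return false;
    for (int i = 0; i < n; ++i) hostPtr[i] = i;

    // Ask for the device-side address of the same host buffer.
    bool ok = cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0) == cudaSuccess;
    if (ok) {
        addOne<<<(n + 255) / 256, 256>>>(devPtr, n);
        // No copy back: synchronizing is enough to see the results on the host.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess;
    }
    if (ok) *firstOut = hostPtr[0];
    cudaFreeHost(hostPtr);
    return ok;
}

int main() {
    int first = -1;
    if (addOneZeroCopy(8, &first))
        printf("first element after kernel: %d\n", first);
    else
        printf("mapped memory not available\n");
    return 0;
}
```

Every access inside the kernel crosses the bus here, which is why the answer warns about speed.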

4. Is there a size limit to either one?

The size of global memory varies from card to card, anything from none to 8GB or more. The size of shared memory depends on the compute capability. Cards below compute capability 2.x have a maximum of 16KB of shared memory per multiprocessor (the number of multiprocessors varies from card to card). Cards with compute capability 2.x have a maximum of 48KB of shared memory per multiprocessor; newer architectures provide more per multiprocessor, although a single block is still limited to 48KB by default.

If you are using mapped memory, the only limit is how much memory the host machine has.
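Since these limits differ per card, it is usually safer to query them than to hard-code them. The sketch below uses `cudaGetDeviceProperties`; the helper name `printMemoryLimits` is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the memory limits of every visible device
// and returns how many devices were reported.
int printMemoryLimits() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;

    int reported = 0;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
        printf("  global memory:           %zu MB\n",
               prop.totalGlobalMem / (1024 * 1024));
        printf("  shared memory per block: %zu KB\n",
               prop.sharedMemPerBlock / 1024);
        printf("  shared memory per SM:    %zu KB\n",
               prop.sharedMemPerMultiprocessor / 1024);
        printf("  multiprocessors:         %d\n", prop.multiProcessorCount);
        ++reported;
    }
    return reported;
}

int main() {
    if (printMemoryLimits() == 0) printf("no CUDA device found\n");
    return 0;
}
```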

5. Which is faster to access?

In terms of raw numbers, shared memory is much faster (shared memory ~1.7TB/s, versus ~150GB/s for global memory). However, shared memory has to be filled with something before it is useful, and that data usually comes from global memory. If access to global memory is coalesced (non-random), you will get speeds of up to 150-200GB/s, depending on the card and its memory interface.

Shared memory is useful when threads within a block need to reuse data that has already been pulled or evaluated from global memory. Instead of pulling it from global memory again, you put it in shared memory for the other threads in the same block to see and reuse.
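A block-level sum is one common case of this reuse pattern, sketched here under a few assumptions: a fixed block size `BLOCK` that is a power of two, and hypothetical names `blockSum` and `sumOnDevice`. Each element is read from global memory once (with coalesced accesses), and every later step of the reduction works on the copy in shared memory.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // assumed block size; must be a power of two for this reduction

// Hypothetical kernel: each block sums its slice of `in` and writes one partial sum.
__global__ void blockSum(const int *in, int *out, int n) {
    __shared__ int tile[BLOCK];  // visible only to the threads of this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // One coalesced read from global memory per thread.
    tile[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction: threads reuse values loaded by their neighbours.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];
}

// Hypothetical helper: sums n ints (n > 0) on the GPU, adding the per-block
// partial sums on the host. Returns true only if every CUDA call succeeded.
bool sumOnDevice(const int *host, int n, int *total) {
    int blocks = (n + BLOCK - 1) / BLOCK;
    int *dIn = nullptr, *dOut = nullptr;
    std::vector<int> partial(blocks);

    bool ok = cudaMalloc((void **)&dIn, n * sizeof(int)) == cudaSuccess &&
              cudaMalloc((void **)&dOut, blocks * sizeof(int)) == cudaSuccess &&
              cudaMemcpy(dIn, host, n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        blockSum<<<blocks, BLOCK>>>(dIn, dOut, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(partial.data(), dOut, blocks * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);   // cudaFree(nullptr) is a no-op, so this is safe on failure
    cudaFree(dOut);
    if (!ok) return false;

    *total = 0;
    for (int p : partial) *total += p;
    return true;
}

int main() {
    std::vector<int> ones(1000, 1);
    int total = 0;
    if (sumOnDevice(ones.data(), 1000, &total))
        printf("sum = %d\n", total);
    return 0;
}
```

Without the shared tile, each reduction step would have to go back to global memory for its operands.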

6. Is storing a variable in shared memory the same as passing its address via the kernel?

No. If you pass the address of anything, it is always an address in global memory. The host cannot set shared memory directly. Either you pass a value as a constant and the kernel sets the shared memory to that constant, or you pass an address in global memory and the kernel pulls the data from there when needed.
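Both routes can be sketched in one kernel. In this hypothetical example (the names `scaleReversed` and `runScaleReversed` and the single-block launch are assumptions made for brevity), `scale` arrives as a by-value kernel argument and `in` is a pointer to global memory; the kernel itself copies both into shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256  // assumed upper bound on n for this single-block sketch

// Hypothetical kernel: writes the input reversed and multiplied by `scale`.
__global__ void scaleReversed(const int *in, int *out, int n, int scale) {
    __shared__ int sScale;        // filled from a by-value kernel argument
    __shared__ int sData[BLOCK];  // filled from a global-memory pointer
    int tid = threadIdx.x;

    if (tid == 0) sScale = scale;       // route 1: constant passed by the host
    if (tid < n) sData[tid] = in[tid];  // route 2: data pulled from global memory
    __syncthreads();

    if (tid < n) out[tid] = sData[n - 1 - tid] * sScale;
}

// Hypothetical helper: runs the kernel on n <= BLOCK ints.
// Returns true only if every CUDA call succeeded.
bool runScaleReversed(const int *hostIn, int *hostOut, int n, int scale) {
    if (n <= 0 || n > BLOCK) return false;
    int *dIn = nullptr, *dOut = nullptr;
    size_t bytes = n * sizeof(int);

    bool ok = cudaMalloc((void **)&dIn, bytes) == cudaSuccess &&
              cudaMalloc((void **)&dOut, bytes) == cudaSuccess &&
              cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        scaleReversed<<<1, BLOCK>>>(dIn, dOut, n, scale);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return ok;
}

int main() {
    int in[4] = {1, 2, 3, 4};
    int out[4] = {0, 0, 0, 0};
    if (runScaleReversed(in, out, 4, 10))
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

Note that the host never holds a pointer to `sScale` or `sData`; they exist only while the block is running.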