CUDA shared memory

Translated on 2012-03-23 11:24:55

The original text is from the CUDA C Programming Guide.

Shared memory is on-chip, and is therefore much faster than local memory and global memory.

 

To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.


However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.


 

To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.
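The address-to-bank mapping can be sketched as host-side C, assuming 32-bit-wide banks and 16 banks as on compute capability 1.x devices (the helper name bank_of is illustrative, not a CUDA API):

```c
#include <assert.h>

/* Bank index of a shared-memory byte address, assuming 4-byte-wide
 * banks and 16 banks (compute capability 1.x). Successive 32-bit
 * words map to successive banks, wrapping around every 16 words. */
static int bank_of(int byte_addr) {
    return (byte_addr / 4) % 16;
}
```

Two addresses conflict exactly when bank_of gives the same value for addresses in different 32-bit words.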

 

Section 1: Devices of compute capability 1.x

Shared memory has 16 banks that are organized such that successive 32-bit words are assigned to successive banks, i.e. interleaved. Each bank has a bandwidth of 32 bits per two clock cycles.

 


 

A shared memory request for a warp is split into two memory requests, one for each half-warp, that are issued independently. As a consequence, there can be no bank conflict between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp.


Note: this applies to devices of compute capability 1.x.

 

For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16.

 A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.


 

If a non-atomic instruction executed by a warp writes to the same location in shared memory for more than one of the threads of the warp, only one thread per half-warp performs a write and which thread performs the final write is undefined.

 

A common access pattern is for each thread to access a 32-bit word from an array indexed by the thread ID tid and with some stride s:

__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];

Note: s is the stride, i.e. the number of 32-bit words (and hence banks) stepped over between consecutive threads.

 


In this case, threads tid and tid+n access the same bank whenever s*n is a multiple of the number of banks (i.e. 16) or, equivalently, whenever n is a multiple of 16/d where d is the greatest common divisor of 16 and s. As a consequence, there will be no bank conflict only if half the warp size (i.e. 16) is less than or equal to 16/d, that is, only if d is equal to 1, i.e. s is odd.

Note: thread tid accesses address BaseIndex + s*tid, and thread tid+n accesses BaseIndex + s*tid + s*n. Since there are 16 banks, the two addresses fall in the same bank exactly when s*n is a multiple of 16. The smallest positive such n is 16/d, where d is the greatest common divisor of 16 and s; if 16/d is at least the half-warp size of 16, i.e. d = 1, i.e. s is odd, then no two threads of the same half-warp can share a bank.
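This reasoning can be checked with a small host-side C simulation of a half-warp (16 threads, 16 banks); conflict_degree is an illustrative helper, not a CUDA API:

```c
#include <assert.h>

/* Greatest common divisor, used to predict the conflict degree. */
static int gcd(int a, int b) { while (b) { int t = a % b; a = b; b = t; } return a; }

/* Maximum number of threads of a half-warp (16 threads) that land in
 * the same bank when thread tid reads shared[BaseIndex + s * tid]
 * from an array of 32-bit words, with 16 banks. */
static int conflict_degree(int s) {
    int counts[16] = {0};
    for (int tid = 0; tid < 16; tid++)
        counts[(s * tid) % 16]++;       /* bank hit by thread tid */
    int max = 0;
    for (int b = 0; b < 16; b++)
        if (counts[b] > max) max = counts[b];
    return max;
}
```

For every stride s the degree equals gcd(16, s), so odd strides (d = 1) are conflict-free, while s = 2 gives 2-way and s = 8 gives 8-way conflicts.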

 


F.3.3.2 32-Bit Broadcast Access

Shared memory features a broadcast mechanism whereby a 32-bit word can be read and broadcast to several threads simultaneously when servicing one memory read request. This reduces the number of bank conflicts when several threads read from an address within the same 32-bit word. More precisely, a memory read request made of several addresses is serviced in several steps over time by servicing one conflict-free subset of these addresses per step until all addresses have been serviced; at each step, the subset is built from the remaining addresses that have yet to be serviced using the following procedure:

Note: the threads need not read the identical address; any addresses that fall within the same 32-bit word are covered by the broadcast.
- Select one of the words pointed to by the remaining addresses as the broadcast word;
- Include in the subset:
  - All addresses that are within the broadcast word,
  - One address for each bank (other than the broadcasting bank) pointed to by the remaining addresses.

Which word is selected as the broadcast word and which address is picked up for each bank at each cycle are unspecified.

A common conflict-free case is when all threads of a half-warp read from an address within the same 32-bit word.
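The step-by-step procedure above can be modeled in host-side C. This is only a sketch of one possible schedule: the hardware's broadcast-word choice is unspecified, so the exact step count for conflicted patterns may differ, but conflict-free patterns always take one step.

```c
#include <assert.h>

/* Number of steps needed to service a half-warp of 16 byte addresses
 * on a 16-bank shared memory with 32-bit broadcast (cc 1.x model).
 * Per step: pick a broadcast word, serve every address in it, plus at
 * most one address per other bank. The broadcast-word choice here
 * (first pending address) is arbitrary; the hardware's is unspecified. */
static int service_steps(const int addr[16]) {
    int pending[16], n = 16, steps = 0;
    for (int i = 0; i < 16; i++) pending[i] = addr[i];
    while (n > 0) {
        steps++;
        int bword = pending[0] / 4;            /* broadcast word index */
        int bbank = bword % 16;                /* its bank */
        int bank_used[16] = {0};
        int rem[16], m = 0;
        for (int i = 0; i < n; i++) {
            int w = pending[i] / 4, b = w % 16;
            if (w == bword) continue;          /* served by the broadcast */
            if (b != bbank && !bank_used[b]) { /* one address per other bank */
                bank_used[b] = 1;
                continue;
            }
            rem[m++] = pending[i];             /* left for a later step */
        }
        for (int i = 0; i < m; i++) pending[i] = rem[i];
        n = m;
    }
    return steps;
}
```

All threads reading the same word, or all threads hitting distinct banks, finish in one step; two different words in the same bank need two.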

 

 

 

8-bit and 16-bit accesses typically generate bank conflicts. For example, there are bank conflicts if an array of char is accessed the following way:

__shared__ char shared[32];
char data = shared[BaseIndex + tid];

because shared[0], shared[1], shared[2], and shared[3], for example, belong to the same bank. There are no bank conflicts, however, if the same array is accessed the following way:

char data = shared[BaseIndex + 4 * tid];

Note: a char occupies one byte, so four consecutive chars are packed into one 32-bit bank word.
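Under the 1.x assumptions (4-byte banks, 16 banks), the byte-level mapping can be sketched in host-side C; char_bank is an illustrative helper:

```c
#include <assert.h>

/* Bank holding byte i of a __shared__ char array on a cc 1.x device:
 * bytes are packed four per 32-bit word, and words map to banks. */
static int char_bank(int i) {
    return (i / 4) % 16;
}
```

shared[BaseIndex + tid] puts threads 0..3 in the same bank (4-way conflicts), while shared[BaseIndex + 4 * tid] gives each thread of a half-warp its own bank.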

 

 

 

Section 2: Devices of compute capability 2.x

For devices of compute capability 2.x, the warp size is 32 threads and the number of banks is also 32. A shared memory request for a warp is not split as with devices of compute capability 1.x, meaning that bank conflicts can occur between threads in the first half of a warp and threads in the second half of the same warp (see Section F.4.3 of the CUDA C Programming Guide).

Note: on these devices a bank conflict can therefore arise when a thread in the first half of a warp and a thread in the second half access different 32-bit words in the same bank.

 

A bank conflict only occurs if two or more threads access any bytes within different 32-bit words belonging to the same bank. If two or more threads access any bytes within the same 32-bit word, there is no bank conflict between these threads: For read accesses, the word is broadcast to the requesting threads (unlike for devices of compute capability 1.x, multiple words can be broadcast in a single transaction); for write accesses, each byte is written by only one of the threads (which thread performs the write is undefined).


 

This means, in particular, that unlike for devices of compute capability 1.x, there are no bank conflicts if an array of char is accessed as follows, for example:

__shared__ char shared[32];

char data = shared[BaseIndex + tid];

Note: this access pattern causes bank conflicts on 1.x devices but none on 2.x devices, because on 2.x there is no conflict as long as the bytes accessed fall within the same 32-bit word of a bank.
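The 2.x rule can be checked with a host-side C sketch (32 banks of 32-bit words; the helpers are illustrative, not CUDA APIs):

```c
#include <assert.h>

/* cc 2.x model: 32 banks of 32-bit words. A conflict requires two
 * threads touching different words in the same bank; same-word
 * accesses are served by broadcast. */
static int word_of(int byte_addr) { return byte_addr / 4; }
static int bank_2x(int byte_addr) { return (byte_addr / 4) % 32; }

/* 1 if a full warp of 32 byte addresses is conflict-free, else 0. */
static int conflict_free(const int addr[32]) {
    for (int i = 0; i < 32; i++)
        for (int j = i + 1; j < 32; j++)
            if (bank_2x(addr[i]) == bank_2x(addr[j]) &&
                word_of(addr[i]) != word_of(addr[j]))
                return 0;
    return 1;
}
```

For shared[BaseIndex + tid] over a char array, the warp touches bytes 0..31, i.e. words 0..7 in eight distinct banks, so the access is conflict-free on 2.x.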

 

 

The __shared__ qualifier, optionally used together with __device__, declares a variable that:

- Resides in the shared memory space of a thread block,
- Has the lifetime of the block,
- Is only accessible from all the threads within the block.


When declaring a variable in shared memory as an external array such as

extern __shared__ float shared[];
the size of the array is determined at launch time (see Section B.17).

Note: because the shared array is declared extern, its size is determined by the execution-configuration parameter supplied when the kernel is launched.

 

All variables declared in this fashion start at the same address in memory, so the layout of the variables in the array must be explicitly managed through offsets.


For example, if one wants the equivalent of

short array0[128];
float array1[64];
int array2[256];

in dynamically allocated shared memory, one could declare and initialize the arrays the following way:

extern __shared__ float array[];

__device__ void func() // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int* array2   = (int*)&array1[64];
}
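The dynamic shared memory size to pass at launch (the third execution-configuration parameter, see Section B.17) can be computed on the host; a minimal sketch for the layout above:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of dynamic shared memory needed for the layout above:
 * short array0[128], then float array1[64], then int array2[256],
 * laid out back-to-back from the start of the extern array. */
static size_t dynamic_smem_bytes(void) {
    size_t bytes = 0;
    bytes += 128 * sizeof(short);   /* array0: 256 bytes  */
    bytes += 64 * sizeof(float);    /* array1: 256 bytes  */
    bytes += 256 * sizeof(int);     /* array2: 1024 bytes */
    return bytes;                   /* passed as the third <<<...>>> parameter */
}
```

Note that each sub-array's byte offset must respect its type's alignment; here the short block is 256 bytes, so the float and int blocks that follow remain 4-byte aligned.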

 

 

 

 

 

 

 

 

 
