CUDA shared memory

Translated on 2012-03-23 11:24:55

The original text is from the CUDA C Programming Guide.

Shared memory is on-chip, and is therefore much faster than local memory and global memory.

 

To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.


However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.


 

To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.
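The address-to-bank mapping can be sketched as host-side C, assuming 32-bit-wide banks and 16 banks as on compute capability 1.x devices (the helper name bank_of is illustrative, not a CUDA API):

```c
#include <assert.h>

/* Bank index of a shared-memory byte address, assuming 4-byte-wide
 * banks and 16 banks (compute capability 1.x). Successive 32-bit
 * words map to successive banks, wrapping around every 16 words. */
static int bank_of(int byte_addr) {
    return (byte_addr / 4) % 16;
}
```

Two addresses conflict exactly when bank_of gives the same value for addresses in different 32-bit words.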

 

Section 1: Devices of compute capability 1.x

Shared memory has 16 banks that are organized such that successive 32-bit words are assigned to successive banks, i.e. interleaved. Each bank has a bandwidth of 32 bits per two clock cycles.

 


 

A shared memory request for a warp is split into two memory requests, one for each half-warp, that are issued independently. As a consequence, there can be no bank conflict between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp.


Note: this applies to devices of compute capability 1.x.

 

For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16.

 A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.


 

If a non-atomic instruction executed by a warp writes to the same location in shared memory for more than one of the threads of the warp, only one thread per half-warp performs a write and which thread performs the final write is undefined.

 

A common access pattern is for each thread to access a 32-bit word from an array indexed by the thread ID tid and with some stride s:

__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];

Note: s is the stride, i.e. the number of 32-bit words (and hence banks) stepped over between consecutive threads.

 


In this case, threads tid and tid+n access the same bank whenever s*n is a multiple of the number of banks (i.e. 16) or, equivalently, whenever n is a multiple of 16/d where d is the greatest common divisor of 16 and s. As a consequence, there will be no bank conflict only if half the warp size (i.e. 16) is less than or equal to 16/d, that is, only if d is equal to 1, i.e. s is odd.

Note: thread tid accesses address BaseIndex + s*tid, and thread tid+n accesses BaseIndex + s*tid + s*n. Since there are 16 banks, the two addresses fall in the same bank exactly when s*n is a multiple of 16. The smallest positive such n is 16/d, where d is the greatest common divisor of 16 and s; if 16/d is at least the half-warp size of 16, i.e. d = 1, i.e. s is odd, then no two threads of the same half-warp can share a bank.
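This reasoning can be checked with a small host-side C simulation of a half-warp (16 threads, 16 banks); conflict_degree is an illustrative helper, not a CUDA API:

```c
#include <assert.h>

/* Greatest common divisor, used to predict the conflict degree. */
static int gcd(int a, int b) { while (b) { int t = a % b; a = b; b = t; } return a; }

/* Maximum number of threads of a half-warp (16 threads) that land in
 * the same bank when thread tid reads shared[BaseIndex + s * tid]
 * from an array of 32-bit words, with 16 banks. */
static int conflict_degree(int s) {
    int counts[16] = {0};
    for (int tid = 0; tid < 16; tid++)
        counts[(s * tid) % 16]++;       /* bank hit by thread tid */
    int max = 0;
    for (int b = 0; b < 16; b++)
        if (counts[b] > max) max = counts[b];
    return max;
}
```

For every stride s the degree equals gcd(16, s), so odd strides (d = 1) are conflict-free, while s = 2 gives 2-way and s = 8 gives 8-way conflicts.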

 


F.3.3.2 32-Bit Broadcast Access

Shared memory features a broadcast mechanism whereby a 32-bit word can be read and broadcast to several threads simultaneously when servicing one memory read request. This reduces the number of bank conflicts when several threads read from an address within the same 32-bit word. More precisely, a memory read request made of several addresses is serviced in several steps over time by servicing one conflict-free subset of these addresses per step until all addresses have been serviced; at each step, the subset is built from the remaining addresses that have yet to be serviced using the following procedure:

Note: the threads need not read the identical address; any addresses that fall within the same 32-bit word are covered by the broadcast.
- Select one of the words pointed to by the remaining addresses as the broadcast word;
- Include in the subset:
  - All addresses that are within the broadcast word,
  - One address for each bank (other than the broadcasting bank) pointed to by the remaining addresses.

Which word is selected as the broadcast word and which address is picked up for each bank at each cycle are unspecified.

A common conflict-free case is when all threads of a half-warp read from an address within the same 32-bit word.
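The step-by-step procedure above can be modeled in host-side C. This is only a sketch of one possible schedule: the hardware's broadcast-word choice is unspecified, so the exact step count for conflicted patterns may differ, but conflict-free patterns always take one step.

```c
#include <assert.h>

/* Number of steps needed to service a half-warp of 16 byte addresses
 * on a 16-bank shared memory with 32-bit broadcast (cc 1.x model).
 * Per step: pick a broadcast word, serve every address in it, plus at
 * most one address per other bank. The broadcast-word choice here
 * (first pending address) is arbitrary; the hardware's is unspecified. */
static int service_steps(const int addr[16]) {
    int pending[16], n = 16, steps = 0;
    for (int i = 0; i < 16; i++) pending[i] = addr[i];
    while (n > 0) {
        steps++;
        int bword = pending[0] / 4;            /* broadcast word index */
        int bbank = bword % 16;                /* its bank */
        int bank_used[16] = {0};
        int rem[16], m = 0;
        for (int i = 0; i < n; i++) {
            int w = pending[i] / 4, b = w % 16;
            if (w == bword) continue;          /* served by the broadcast */
            if (b != bbank && !bank_used[b]) { /* one address per other bank */
                bank_used[b] = 1;
                continue;
            }
            rem[m++] = pending[i];             /* left for a later step */
        }
        for (int i = 0; i < m; i++) pending[i] = rem[i];
        n = m;
    }
    return steps;
}
```

All threads reading the same word, or all threads hitting distinct banks, finish in one step; two different words in the same bank need two.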

 

 

 

8-bit and 16-bit accesses typically generate bank conflicts. For example, there are bank conflicts if an array of char is accessed the following way:

__shared__ char shared[32];
char data = shared[BaseIndex + tid];

because shared[0], shared[1], shared[2], and shared[3], for example, belong to the same bank. There are no bank conflicts, however, if the same array is accessed the following way:

char data = shared[BaseIndex + 4 * tid];

Note: a char occupies one byte, so four consecutive chars are packed into one 32-bit bank word.
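Under the 1.x assumptions (4-byte banks, 16 banks), the byte-level mapping can be sketched in host-side C; char_bank is an illustrative helper:

```c
#include <assert.h>

/* Bank holding byte i of a __shared__ char array on a cc 1.x device:
 * bytes are packed four per 32-bit word, and words map to banks. */
static int char_bank(int i) {
    return (i / 4) % 16;
}
```

shared[BaseIndex + tid] puts threads 0..3 in the same bank (4-way conflicts), while shared[BaseIndex + 4 * tid] gives each thread of a half-warp its own bank.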

 

 

 

Section 2: Devices of compute capability 2.x

For devices of compute capability 2.x, the warp size is 32 threads and the number of banks is also 32. A shared memory request for a warp is not split as with devices of compute capability 1.x, meaning that bank conflicts can occur between threads in the first half of a warp and threads in the second half of the same warp (see Section F.4.3 of the CUDA C Programming Guide).

Note: on these devices a bank conflict can therefore arise when a thread in the first half of a warp and a thread in the second half access different 32-bit words in the same bank.

 

A bank conflict only occurs if two or more threads access any bytes within different 32-bit words belonging to the same bank. If two or more threads access any bytes within the same 32-bit word, there is no bank conflict between these threads: For read accesses, the word is broadcast to the requesting threads (unlike for devices of compute capability 1.x, multiple words can be broadcast in a single transaction); for write accesses, each byte is written by only one of the threads (which thread performs the write is undefined).


 

This means, in particular, that unlike for devices of compute capability 1.x, there are no bank conflicts if an array of char is accessed as follows, for example:

__shared__ char shared[32];

char data = shared[BaseIndex + tid];

Note: this access pattern causes bank conflicts on 1.x devices but none on 2.x devices, because on 2.x there is no conflict as long as the bytes accessed fall within the same 32-bit word of a bank.
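The 2.x rule can be checked with a host-side C sketch (32 banks of 32-bit words; the helpers are illustrative, not CUDA APIs):

```c
#include <assert.h>

/* cc 2.x model: 32 banks of 32-bit words. A conflict requires two
 * threads touching different words in the same bank; same-word
 * accesses are served by broadcast. */
static int word_of(int byte_addr) { return byte_addr / 4; }
static int bank_2x(int byte_addr) { return (byte_addr / 4) % 32; }

/* 1 if a full warp of 32 byte addresses is conflict-free, else 0. */
static int conflict_free(const int addr[32]) {
    for (int i = 0; i < 32; i++)
        for (int j = i + 1; j < 32; j++)
            if (bank_2x(addr[i]) == bank_2x(addr[j]) &&
                word_of(addr[i]) != word_of(addr[j]))
                return 0;
    return 1;
}
```

For shared[BaseIndex + tid] over a char array, the warp touches bytes 0..31, i.e. words 0..7 in eight distinct banks, so the access is conflict-free on 2.x.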

 

 

The __shared__ qualifier, optionally used together with __device__, declares a variable that:

- Resides in the shared memory space of a thread block,
- Has the lifetime of the block,
- Is only accessible from all the threads within the block.


When declaring a variable in shared memory as an external array such as

extern __shared__ float shared[];
the size of the array is determined at launch time (see Section B.17).

Note: because the shared array is declared extern, its size is determined by the execution-configuration parameter supplied when the kernel is launched.

 

All variables declared in this fashion start at the same address in memory, so the layout of the variables in the array must be explicitly managed through offsets.


For example, if one wants the equivalent of

short array0[128];
float array1[64];
int array2[256];

in dynamically allocated shared memory, one could declare and initialize the arrays the following way:

extern __shared__ float array[];

__device__ void func() // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int* array2   = (int*)&array1[64];
}
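The dynamic shared memory size to pass at launch (the third execution-configuration parameter, see Section B.17) can be computed on the host; a minimal sketch for the layout above:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of dynamic shared memory needed for the layout above:
 * short array0[128], then float array1[64], then int array2[256],
 * laid out back-to-back from the start of the extern array. */
static size_t dynamic_smem_bytes(void) {
    size_t bytes = 0;
    bytes += 128 * sizeof(short);   /* array0: 256 bytes  */
    bytes += 64 * sizeof(float);    /* array1: 256 bytes  */
    bytes += 256 * sizeof(int);     /* array2: 1024 bytes */
    return bytes;                   /* passed as the third <<<...>>> parameter */
}
```

Note that each sub-array's byte offset must respect its type's alignment; here the short block is 256 bytes, so the float and int blocks that follow remain 4-byte aligned.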

 

 

 

 

 

 

 

 

 
