CUDA shared memory

Translated 2012-03-23 11:24:55

The original text is from the CUDA C Programming Guide.

Shared memory is on-chip, so it is much faster than local memory and global memory.


To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.


However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.



To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.


Section 1: Devices of compute capability 1.x

Shared memory has 16 banks that are organized such that successive 32-bit words are assigned to successive banks, i.e. interleaved. Each bank has a bandwidth of 32 bits per two clock cycles.




A shared memory request for a warp is split into two memory requests, one for each half-warp, that are issued independently. As a consequence, there can be no bank conflict between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp.




For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16.

A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.



If a non-atomic instruction executed by a warp writes to the same location in shared memory for more than one of the threads of the warp, only one thread per half-warp performs a write and which thread performs the final write is undefined.


A common access pattern is for each thread to access a 32-bit word from an array indexed by the thread ID tid and with some stride s:

__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];



In this case, threads tid and tid+n access the same bank whenever s*n is a multiple of the number of banks (i.e. 16) or, equivalently, whenever n is a multiple of 16/d, where d is the greatest common divisor of 16 and s. As a consequence, there will be no bank conflict only if half the warp size (i.e. 16) is less than or equal to 16/d, that is, only if d is equal to 1, i.e. only if s is odd.




F.3.3.2 32-Bit Broadcast Access
Shared memory features a broadcast mechanism whereby a 32-bit word can be read and broadcast to several threads simultaneously when servicing one memory read request. This reduces the number of bank conflicts when several threads read from an address within the same 32-bit word. More precisely, a memory read request made of several addresses is serviced in several steps over time by servicing one conflict-free subset of these addresses per step until all addresses have been serviced; at each step, the subset is built from the remaining addresses that have yet to be serviced using the following procedure:

This reduces the number of bank conflicts when several threads read from an address within the same 32-bit word (translator's note: presumably the same address).
- Select one of the words pointed to by the remaining addresses as the broadcast word;
- Include in the subset:
  - All addresses that are within the broadcast word,
  - One address for each bank (other than the broadcasting bank) pointed to by the remaining addresses.

Which word is selected as the broadcast word and which address is picked up for each bank at each cycle are unspecified.

A common conflict-free case is when all threads of a half-warp read from an address within the same 32-bit word.




8-bit and 16-bit accesses typically generate bank conflicts. For example, there are bank conflicts if an array of char is accessed the following way:

__shared__ char shared[32];
char data = shared[BaseIndex + tid];

because shared[0], shared[1], shared[2], and shared[3], for example, belong to the same bank. There are no bank conflicts, however, if the same array is accessed the following way:

char data = shared[BaseIndex + 4 * tid];





Section 2: Devices of compute capability 2.x

For devices of compute capability 2.x, the warp size is 32 threads and the number of banks is also 32. A shared memory request for a warp is not split as with devices of compute capability 1.x, meaning that bank conflicts can occur between threads in the first half of a warp and threads in the second half of the same warp (see Section F.4.3 of the CUDA C Programming Guide).




A bank conflict only occurs if two or more threads access any bytes within different 32-bit words belonging to the same bank. If two or more threads access any bytes within the same 32-bit word, there is no bank conflict between these threads: For read accesses, the word is broadcast to the requesting threads (unlike for devices of compute capability 1.x, multiple words can be broadcast in a single transaction); for write accesses, each byte is written by only one of the threads (which thread performs the write is undefined).



This means, in particular, that unlike for devices of compute capability 1.x, there are no bank conflicts if an array of char is accessed as follows, for example:

__shared__ char shared[32];

char data = shared[BaseIndex + tid];

This pattern causes bank conflicts on 1.x devices but none on 2.x devices, because on 2.x accesses to bytes within the same 32-bit word of a bank never conflict.



The __shared__ qualifier, optionally used together with __device__, declares a variable that:

- Resides in the shared memory space of a thread block,
- Has the lifetime of the block,
- Is only accessible from all the threads within the block.

When declaring a variable in shared memory as an external array such as

extern __shared__ float shared[];

the size of the array is determined at launch time (see Section B.17).



All variables declared in this fashion start at the same address in memory, so that the layout of the variables in the array must be explicitly managed through offsets.


For example, if one wants the equivalent of

short array0[128];
float array1[64];
int array2[256];

in dynamically allocated shared memory, one could declare and initialize the arrays the following way:

extern __shared__ float array[];

__device__ void func() // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int* array2 = (int*)&array1[64];
}











