Maximizing Shared Memory Bandwidth on NVIDIA Kepler GPUs

Shared Memory Configurations

On NVIDIA Kepler (Compute 3.x) GPUs, shared memory has 32 banks, with each bank having a bandwidth of 64 bits per clock cycle. On Fermi GPUs (Compute 2.x), shared memory also has 32 banks, but the bandwidth per bank is only 32 bits per clock cycle.

On Fermi GPUs, successive 32-bit words are assigned to successive banks. Kepler GPUs are configurable: either successive 32-bit words or successive 64-bit words are assigned to successive banks. You can specify your desired configuration by calling cudaDeviceSetSharedMemConfig() from the host prior to launching your kernel, passing one of cudaSharedMemBankSizeDefault, cudaSharedMemBankSizeFourByte, or cudaSharedMemBankSizeEightByte.
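As a concrete illustration, selecting the eight byte configuration from the host might look like the following minimal sketch (the kernel name and launch parameters here are placeholders, not from the original code):

// Request 64-bit (eight byte) shared memory banks for subsequent launches.
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

// Hypothetical launch; the configuration applies to kernels launched afterwards.
myKernel<<<numBlocks, threadsPerBlock>>>(d_in, d_out, n);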

In the eight byte configuration, bank conflicts occur when two or more threads in a warp request different 64-bit words from the same bank. In the four byte configuration, bank conflicts occur if two or more threads in a warp access 32-bit words from the same bank where those words span multiple 64-word aligned segments (Figure 1).


Figure 1 - Bank Conflicts in the Four Byte Configuration

Performance Implications

For kernels operating on double precision values, the eight byte configuration on Kepler allows loads/stores to consecutive doubles without bank conflicts. This is an improvement over Fermi GPUs, where accessing double precision values always incurred bank conflicts. If your kernels are operating on single precision values, you are only utilizing half the available shared memory bandwidth. Modifying your kernels to operate on 64-bit values in shared memory (perhaps using float2) may improve performance if shared memory throughput is a performance bottleneck for your kernel.

Shared Memory Example

I've previously discussed a 21-point 1D filter kernel. The kernel is written such that each thread computes a single output value. A thread block with N threads loads N+20 values into a shared memory array (inputS) so that all N threads can read their 21 input values without redundant reads from global memory. The filter calculation looks like the following:


filtered[gIdx] = inputS[sIdx - 10] * FilterCoeffs[ 0] +
                 inputS[sIdx -  9] * FilterCoeffs[ 1] +
                 inputS[sIdx -  8] * FilterCoeffs[ 2] +
                 // ...
                 inputS[sIdx -  1] * FilterCoeffs[ 9] +
                 inputS[sIdx     ] * FilterCoeffs[10] +
                 inputS[sIdx +  1] * FilterCoeffs[11] +
                 // ...
                 inputS[sIdx + 10] * FilterCoeffs[20];
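For context, here is a minimal sketch of how the surrounding kernel might stage the data into inputS before the filter sum; the block size, halo handling, and boundary checks are my assumptions for illustration, not the original code:

#define BLOCK_SIZE 256   // assumed thread block size
#define HALO 10          // (21 - 1) / 2

__constant__ float FilterCoeffs[21];

__global__ void filter21(const float* input, float* filtered)
{
    // N threads stage N + 20 input values in shared memory.
    __shared__ float inputS[BLOCK_SIZE + 2 * HALO];

    int gIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int sIdx = threadIdx.x + HALO;

    // Each thread loads its own element; the first HALO threads also
    // load the left and right halo regions. Boundary handling at the
    // ends of the input array is omitted for brevity.
    inputS[sIdx] = input[gIdx];
    if (threadIdx.x < HALO)
    {
        inputS[sIdx - HALO]       = input[gIdx - HALO];
        inputS[sIdx + blockDim.x] = input[gIdx + blockDim.x];
    }
    __syncthreads();

    // 21-term filter sum as shown above (written here as a loop).
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 21; i++)
        sum += inputS[sIdx - HALO + i] * FilterCoeffs[i];
    filtered[gIdx] = sum;
}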

The profiler shows that compute and memory utilization is balanced (Figure 2).



Figure 2 - Compute, Bandwidth or Latency Bound?

The compute workload is dominated by shared memory load/stores (Figure 3), and shared memory bandwidth is a bottleneck (Figure 4).



Figure 3 - Compute Resources



Figure 4 - Memory Bandwidth

Optimizing the Kernel

If we could read/write 2 floats at a time from shared memory, we could halve the number of load/store operations. We can accomplish this by rewriting our kernel so that each thread computes two filtered output elements, separated by blockDim.x*gridDim.x elements. We need half as many thread blocks as before. Each thread block needs to store 2N + 40 input elements in shared memory. To facilitate conflict-free access, we want to pack elements separated by blockDim.x*gridDim.x into the same float2 value. The filter computation then becomes:


float2 sum = {0.0f, 0.0f};
#pragma unroll
for (int i = 0; i < 21; i++)
{
    float2 temp = inputS[sIdxShift - 10 + i];
    sum.x += temp.x * FilterCoeffs[i];
    sum.y += temp.y * FilterCoeffs[i];
}
filtered[gIdx] = sum.x;
filtered[gIdx + blockDim.x*gridDim.x] = sum.y;
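The staging of the packed data into shared memory might look like the sketch below; it reuses the BLOCK_SIZE and HALO assumptions from the earlier sketch, and the names (inputS as a float2 array, sIdxShift, stride) and halo handling are illustrative assumptions based on the description above rather than the original source:

__global__ void filter21_x2(const float* input, float* filtered)
{
    // 2N + 40 input floats stored as N + 20 float2 elements.
    __shared__ float2 inputS[BLOCK_SIZE + 2 * HALO];

    int gIdx      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride    = blockDim.x * gridDim.x;   // separation between this thread's two outputs
    int sIdxShift = threadIdx.x + HALO;

    // Pack the two elements this thread filters into one float2 so every
    // shared memory access moves 64 bits. Boundary handling omitted.
    inputS[sIdxShift] = make_float2(input[gIdx], input[gIdx + stride]);
    if (threadIdx.x < HALO)
    {
        inputS[sIdxShift - HALO]       = make_float2(input[gIdx - HALO],
                                                     input[gIdx - HALO + stride]);
        inputS[sIdxShift + blockDim.x] = make_float2(input[gIdx + blockDim.x],
                                                     input[gIdx + blockDim.x + stride]);
    }
    __syncthreads();

    // ... filter computation as shown above, producing sum.x and sum.y ...
}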

Performance Results

The kernels were compiled for the Compute 3.5 architecture using the CUDA 5.5 compiler. I benchmarked the kernels using a 2^24 point input data set on a Tesla K20c GPU with the 331.20 Linux drivers. ECC was disabled for the test. The results are shown in Table 1.
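For reference, one common way to time kernels like these is with CUDA events; a minimal sketch (the kernel name and launch configuration are placeholders) is:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
filter21_x2<<<numBlocks, BLOCK_SIZE>>>(d_in, d_out);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);  // elapsed time in milliseconds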

Rewriting the kernel but keeping the shared memory configuration in four byte mode still shows some improvement over the original. There are some economies of scale in the new kernel: the index calculations are amortized over two output calculations, and we have twice as many independent global memory loads/stores per thread, which may have better latency hiding characteristics. However, most of the shared memory accesses still suffer from bank conflicts, and the reason is subtle: the accesses are to consecutive float2 elements, but those elements do not fall within a 64-word aligned segment, so the accesses cause bank conflicts. Switching to the eight byte mode eliminates the bank conflicts and further improves performance!

Table 1 - Kernel Execution Times

Kernel                                                   Execution Time
32-bit Shared Memory                                     2.1387 ms
64-bit Shared Memory (cudaSharedMemBankSizeFourByte)     1.78614 ms
64-bit Shared Memory (cudaSharedMemBankSizeEightByte)    1.33753 ms
