Maximizing Shared Memory Bandwidth on NVIDIA Kepler GPUs

Shared Memory Configurations

On NVIDIA Kepler (Compute 3.x) GPUs, shared memory has 32 banks, with each bank having a bandwidth of 64 bits per clock cycle. On Fermi GPUs (Compute 2.x), shared memory also has 32 banks, but the bandwidth per bank is only 32 bits per clock cycle.

On Fermi GPUs, successive 32-bit words are assigned to successive banks. Kepler GPUs are configurable: either successive 32-bit words or successive 64-bit words are assigned to successive banks. You can specify your desired configuration by calling cudaDeviceSetSharedMemConfig() from the host prior to launching your kernel, passing one of cudaSharedMemBankSizeDefault, cudaSharedMemBankSizeFourByte, or cudaSharedMemBankSizeEightByte.
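As a concrete illustration, selecting the eight byte configuration from the host might look like the following minimal sketch (the kernel name and launch parameters here are placeholders, not from the original code):

// Request 64-bit (eight byte) shared memory banks for subsequent launches.
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

// Hypothetical launch; the configuration applies to kernels launched afterwards.
myKernel<<<numBlocks, threadsPerBlock>>>(d_in, d_out, n);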

In the eight byte configuration, bank conflicts occur when two or more threads in a warp request different 64-bit words from the same bank. In the four byte configuration, bank conflicts occur if two or more threads in a warp access 32-bit words from the same bank where those words span multiple 64-word aligned segments (Figure 1).


Figure 1 - Bank Conflicts in the Four Byte Configuration

Performance Implications

For kernels operating on double precision values, the eight byte configuration on Kepler allows loads/stores to consecutive doubles without bank conflicts. This is an improvement over Fermi GPUs, where accessing double precision values always incurred bank conflicts. If your kernels are operating on single precision values, you are only utilizing half the available shared memory bandwidth. Modifying your kernels to operate on 64-bit values in shared memory (perhaps using float2) may improve performance if shared memory throughput is a performance bottleneck for your kernel.

Shared Memory Example

I've previously discussed a 21-point 1D filter kernel. The kernel is written such that each thread computes a single output value. A thread block with N threads loads N+20 values into a shared memory array (inputS) so that all N threads can read their 21 input values without redundant reads from global memory. The filter calculation looks like the following:


filtered[gIdx] = inputS[sIdx - 10] * FilterCoeffs[ 0] +
                 inputS[sIdx -  9] * FilterCoeffs[ 1] +
                 inputS[sIdx -  8] * FilterCoeffs[ 2] +
                 // ...
                 inputS[sIdx -  1] * FilterCoeffs[ 9] +
                 inputS[sIdx     ] * FilterCoeffs[10] +
                 inputS[sIdx +  1] * FilterCoeffs[11] +
                 // ...
                 inputS[sIdx + 10] * FilterCoeffs[20];
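For context, here is a minimal sketch of how the surrounding kernel might stage the data into inputS before the filter sum; the block size, halo handling, and boundary checks are my assumptions for illustration, not the original code:

#define BLOCK_SIZE 256   // assumed thread block size
#define HALO 10          // (21 - 1) / 2

__constant__ float FilterCoeffs[21];

__global__ void filter21(const float* input, float* filtered)
{
    // N threads stage N + 20 input values in shared memory.
    __shared__ float inputS[BLOCK_SIZE + 2 * HALO];

    int gIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int sIdx = threadIdx.x + HALO;

    // Each thread loads its own element; the first HALO threads also
    // load the left and right halo regions. Boundary handling at the
    // ends of the input array is omitted for brevity.
    inputS[sIdx] = input[gIdx];
    if (threadIdx.x < HALO)
    {
        inputS[sIdx - HALO]       = input[gIdx - HALO];
        inputS[sIdx + blockDim.x] = input[gIdx + blockDim.x];
    }
    __syncthreads();

    // 21-term filter sum as shown above (written here as a loop).
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 21; i++)
        sum += inputS[sIdx - HALO + i] * FilterCoeffs[i];
    filtered[gIdx] = sum;
}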

The profiler shows that compute and memory utilization is balanced (Figure 2).



Figure 2 - Compute, Bandwidth or Latency Bound?

The compute workload is dominated by shared memory load/stores (Figure 3), and shared memory bandwidth is a bottleneck (Figure 4).



Figure 3 - Compute Resources



Figure 4 - Memory Bandwidth

Optimizing the Kernel

If we could read/write 2 floats at a time from shared memory, we could halve the number of load/store operations. We can accomplish this by rewriting our kernel so that each thread computes two filtered output elements, separated by blockDim.x*gridDim.x elements. We need half as many thread blocks as before. Each thread block needs to store 2N + 40 input elements in shared memory. To facilitate conflict-free access, we want to pack elements separated by blockDim.x*gridDim.x into the same float2 value. The filter computation then becomes:


float2 sum = {0.0f, 0.0f};
#pragma unroll
for (int i = 0; i < 21; i++)
{
    float2 temp = inputS[sIdxShift - 10 + i];
    sum.x += temp.x * FilterCoeffs[i];
    sum.y += temp.y * FilterCoeffs[i];
}
filtered[gIdx] = sum.x;
filtered[gIdx + blockDim.x*gridDim.x] = sum.y;
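The staging of the packed data into shared memory might look like the sketch below; it reuses the BLOCK_SIZE and HALO assumptions from the earlier sketch, and the names (inputS as a float2 array, sIdxShift, stride) and halo handling are illustrative assumptions based on the description above rather than the original source:

__global__ void filter21_x2(const float* input, float* filtered)
{
    // 2N + 40 input floats stored as N + 20 float2 elements.
    __shared__ float2 inputS[BLOCK_SIZE + 2 * HALO];

    int gIdx      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride    = blockDim.x * gridDim.x;   // separation between this thread's two outputs
    int sIdxShift = threadIdx.x + HALO;

    // Pack the two elements this thread filters into one float2 so every
    // shared memory access moves 64 bits. Boundary handling omitted.
    inputS[sIdxShift] = make_float2(input[gIdx], input[gIdx + stride]);
    if (threadIdx.x < HALO)
    {
        inputS[sIdxShift - HALO]       = make_float2(input[gIdx - HALO],
                                                     input[gIdx - HALO + stride]);
        inputS[sIdxShift + blockDim.x] = make_float2(input[gIdx + blockDim.x],
                                                     input[gIdx + blockDim.x + stride]);
    }
    __syncthreads();

    // ... filter computation as shown above, producing sum.x and sum.y ...
}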

Performance Results

The kernels were compiled for the Compute 3.5 architecture using the CUDA 5.5 compiler. I benchmarked the kernels using a 2^24 point input data set on a Tesla K20c GPU with the 331.20 Linux drivers. ECC was disabled for the test. The results are shown in Table 1.
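For reference, one common way to time kernels like these is with CUDA events; a minimal sketch (the kernel name and launch configuration are placeholders) is:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
filter21_x2<<<numBlocks, BLOCK_SIZE>>>(d_in, d_out);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);  // elapsed time in milliseconds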

Rewriting the kernel but keeping the shared memory configuration in four byte mode still shows some improvement over the original. There are some economies of scale in the new kernel: the index calculations are amortized over two output calculations, and we have twice as many independent global memory loads/stores per thread, which may have better latency hiding characteristics. However, most of the shared memory accesses still suffer from bank conflicts, and the reason is subtle: the accesses are to consecutive float2 elements, but those elements do not fall within a 64-word aligned segment, so the accesses cause bank conflicts. Switching to the eight byte mode eliminates the bank conflicts and further improves performance!

Table 1 - Kernel Execution Times

Kernel                                                   Execution Time
32-bit Shared Memory                                     2.1387 ms
64-bit Shared Memory (cudaSharedMemBankSizeFourByte)     1.78614 ms
64-bit Shared Memory (cudaSharedMemBankSizeEightByte)    1.33753 ms
