CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, its important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. Higher occupancy does not always mean higher performance, but it is a useful metric for gauging the latency hiding ability of a kernel.

cuda6_5 Release Candidate Available
Become a  CUDA Registered Developer and download now!

Before CUDA 6.5, calculating occupancy was tricky. It required implementing a complex computation that took account of the present GPU and its capabilities (including register file and shared memory size), and the properties of the kernel (shared memory usage, registers per thread, threads per block). Implementating the occupancy calculation is difficult, so very few programmers take this approach, instead using the occupancy calculator spreadsheet included with the CUDA Toolkit to find good block sizes for each supported GPU architecture.

CUDA 6.5 includes several new runtime functions to aid in occupancy calculations and launch configuration. The core occupancy calculator API,cudaOccupancyMaxActiveBlocksPerMultiprocessor produces an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics. Multiplying by the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage.

CUDA 6.5 also introduces occupancy-based launch configurator APIs,cudaOccupancyMaxPotentialBlockSize andcudaOccupancyMaxPotentialBlockSizeVariableSMem, which heuristically calculate a block size that achieves the maximum multiprocessor-level occupancy. You can use the VariableSmem version for kernels where the amount of shared memory allocated depends on the number of threads per block. Note that there are also CUDA driver API equivalents of these functions. The following example demonstrates the use of these APIs. It first chooses a reasonable block size by callingcudaOccupancyMaxPotentialBlockSize, and then calculates the theoretical maximum occupancy the kernel will achieve on the present device by calling cudaGetDeviceProperties andcudaOccupancyMaxActiveBlocksPerMultiprocessor.

#include "stdio.h"

__global__ void MyKernel(int *array, int arrayCount) 
{ 
  int idx = threadIdx.x + blockIdx.x * blockDim.x; 
  if (idx < arrayCount) 
  { 
    array[idx] *= array[idx]; 
  } 
} 

void launchMyKernel(int *array, int arrayCount) 
{ 
  int blockSize;   // The launch configurator returned block size 
  int minGridSize; // The minimum grid size needed to achieve the 
                   // maximum occupancy for a full device launch 
  int gridSize;    // The actual grid size needed, based on input size 

  cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, 
                                      MyKernel, 0, arrayCount); 
  // Round up according to array size 
  gridSize = (arrayCount + blockSize - 1) / blockSize; 

  MyKernel<<< gridSize, blockSize >>>(array, arrayCount); 

  cudaDeviceSynchronize(); 

  // calculate theoretical occupancy
  int maxActiveBlocks;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor( &maxActiveBlocks, 
                                                 MyKernel, blockSize, 
                                                 0);

  int device;
  cudaDeviceProp props;
  cudaGetDevice(&device);
  cudaGetDeviceProperties(&props, device);

  float occupancy = (maxActiveBlocks * blockSize / props.warpSize) / 
                    (float)(props.maxThreadsPerMultiProcessor / 
                            props.warpSize);

  printf("Launched blocks of size %d. Theoretical occupancy: %f\n", 
         blockSize, occupancy);
}

cudaOccupancyMaxPotentialBlockSize makes it possible to compute a reasonably efficient execution configuration for a kernel without having to directly query the kernel’s attributes or the device properties, regardless of what device is present or any compilation details. This can greatly simplify the task of frameworks (such as Thrust), that must launch user-defined kernels. This is also handy for kernels that are not primary performance bottlenecks, where the programmer just wants a simple way to run the kernel with correct results, rather than hand-tuning the execution configuration.

The CUDA Toolkit version 6.5 also provides a self-documenting, standalone occupancy calculator and launch configurator implementation in <CUDA_Toolkit_Path>/include/cuda_occupancy.h for any use cases that cannot depend on the CUDA software stack. A spreadsheet version of the occupancy calculator is also included (and has been for many CUDA releases). The spreadsheet version is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread). You can find more information in the CUDA C Programming Guide and CUDA Runtime API Reference.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值