# Launch Bounds

## 1. Overview

As discussed in detail in Multiprocessor Level, the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance.

## 2. Usage

### 2.1 Synopsis

Therefore, the compiler uses heuristics to minimize register usage while keeping register spilling (see Device Memory Accesses) and instruction count to a minimum. An application can optionally aid these heuristics by providing additional information to the compiler in the form of launch bounds, specified using the __launch_bounds__() qualifier in the definition of a __global__ function:

```cuda
__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
MyKernel(...)
{
    ...
}
```

### 2.2 Parameters

• maxThreadsPerBlock specifies the maximum number of threads per block with which the application will ever launch MyKernel(); it compiles to the .maxntid PTX directive. That is, the maximum number of threads per CTA (thread block).
• minBlocksPerMultiprocessor is optional and specifies the desired minimum number of resident blocks per multiprocessor; it compiles to the .minnctapersm PTX directive. That is, the minimum number of resident CTAs per SM.
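As a concrete sketch of the two parameters (the kernel name and the bounds 256 and 4 are hypothetical illustrations, not values from the original), a kernel that will never be launched with more than 256 threads per block and should keep at least 4 blocks resident per multiprocessor could be declared as:

```cuda
// Hypothetical example: the bounds 256 threads/block and 4 resident
// blocks/SM are illustrative values, not recommendations.
__global__ void
__launch_bounds__(256, 4)
scaleKernel(float* data, float factor, int n)
{
    // Simple elementwise work; the qualifier only constrains how the
    // compiler allocates registers, not what the kernel computes.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
```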

## 3. Analysis

If launch bounds are specified, the compiler first derives from them the upper limit L on the number of registers the kernel should use to ensure that minBlocksPerMultiprocessor blocks (or a single block if minBlocksPerMultiprocessor is not specified) of maxThreadsPerBlock threads can reside on the multiprocessor (see Hardware Multithreading for the relationship between the number of registers used by a kernel and the number of registers allocated per block).

The compiler then optimizes register usage in the following way:

• If the initial register usage is higher than L, the compiler reduces it until it is less than or equal to L, usually at the expense of more local memory usage and/or a higher number of instructions.
• If the initial register usage is lower than L:
  • If maxThreadsPerBlock is specified and minBlocksPerMultiprocessor is not, the compiler uses maxThreadsPerBlock to determine the register usage thresholds for the transitions between n and n+1 resident blocks (i.e., when using one less register makes room for an additional resident block, as in the example of Multiprocessor Level) and then applies similar heuristics as when no launch bounds are specified.
  • If both minBlocksPerMultiprocessor and maxThreadsPerBlock are specified, the compiler may increase register usage as high as L to reduce the number of instructions and better hide single-thread instruction latency.

A kernel will fail to launch if it is executed with more threads per block than its launch bound maxThreadsPerBlock.
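One way to avoid such a launch failure is to query the compiled bound at runtime with cudaFuncGetAttributes before launching. A hedged sketch (kernel and variable names are placeholders from the surrounding text, and the desired configuration of 512 is made up):

```cuda
// Hypothetical guard: query the limit MyKernel was compiled with and
// clamp the launch configuration to it before launching.
cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, MyKernel);
int threadsPerBlock = 512;                      // desired configuration
if (threadsPerBlock > attr.maxThreadsPerBlock)
    threadsPerBlock = attr.maxThreadsPerBlock;  // respect the launch bound
MyKernel<<<blocksPerGrid, threadsPerBlock>>>(...);
```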

## 4. Example

Optimal launch bounds for a given kernel will usually differ across major architecture revisions. The sample code below shows how this is typically handled in device code using the __CUDA_ARCH__ macro introduced in Application Compatibility:

```cuda
#define THREADS_PER_BLOCK          256
#if __CUDA_ARCH__ >= 200
    #define MY_KERNEL_MAX_THREADS  (2 * THREADS_PER_BLOCK)
    #define MY_KERNEL_MIN_BLOCKS   3
#else
    #define MY_KERNEL_MAX_THREADS  THREADS_PER_BLOCK
    #define MY_KERNEL_MIN_BLOCKS   2
#endif

// Device code
__global__ void
__launch_bounds__(MY_KERNEL_MAX_THREADS, MY_KERNEL_MIN_BLOCKS)
MyKernel(...)
{
    ...
}
```

In the common case where MyKernel is invoked with the maximum number of threads per block (specified as the first parameter of __launch_bounds__()), it is tempting to use MY_KERNEL_MAX_THREADS as the number of threads per block in the execution configuration:

```cuda
// Host code
MyKernel<<<blocksPerGrid, MY_KERNEL_MAX_THREADS>>>(...);
```

This will not work, however, since __CUDA_ARCH__ is undefined in host code, as mentioned in Application Compatibility, so MyKernel will launch with 256 threads per block even when __CUDA_ARCH__ is greater than or equal to 200. Instead, the number of threads per block should be determined:

Either at compile time using a macro that does not depend on __CUDA_ARCH__, for example:

```cuda
// Host code
MyKernel<<<blocksPerGrid, THREADS_PER_BLOCK>>>(...);
```

Or at runtime based on the compute capability:

```cuda
// Host code
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, device);
int threadsPerBlock = (deviceProp.major >= 2 ? 2 * THREADS_PER_BLOCK : THREADS_PER_BLOCK);
MyKernel<<<blocksPerGrid, threadsPerBlock>>>(...);
```
Register usage is reported by the --ptxas-options=-v compiler option. The number of resident blocks can be derived from the occupancy reported by the CUDA profiler (see Device Memory Accesses for a definition of occupancy).

Register usage can also be controlled for all __global__ functions in a file using the maxrregcount compiler option. The value of maxrregcount is ignored for functions with launch bounds.
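The two compiler options mentioned above can be passed to nvcc as follows (the file name mykernel.cu is a placeholder):

```shell
# Print per-kernel register, spill, and shared-memory statistics:
nvcc --ptxas-options=-v -c mykernel.cu

# Cap register usage for every __global__ function in the file at 32
# (ignored for kernels declared with __launch_bounds__):
nvcc --maxrregcount=32 -c mykernel.cu
```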