Threads
A thread is a sequential stream of instructions executed alongside other threads on a processor. Multithreading is the ability of a processor to execute multiple threads concurrently.
Resources
- global memory is shared across all threads.
- shared memory is shared by the threads of a block, which all run on the same SM.
- registers are private to a single thread.
CUDA Function Declarations
__host__ int abs(int a){ }
__device__ int abs(int a){ }
__global__ void kernel(){ }
- __host__ is the default: a function without any CUDA qualifier is compiled as a host function by nvcc. Host functions can only be called from other host functions.
- __device__ functions can only execute on the device and can only be called from device code. They are compiled entirely by nvcc without being passed to a host compiler.
- __global__ functions (kernels) are the only device functions that can be called from a host function. They must have a void return type. A kernel launch specifies special configuration parameters that define how the kernel is launched. Once launched, the kernel runs on the device, and execution continues in the host function that launched it.
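The three declarations above can be sketched in a small program; the kernel and function names here are illustrative, not part of the notes:

```cuda
#include <cstdio>

// __device__: callable only from device code, compiled entirely by nvcc.
__device__ int square(int x) { return x * x; }

// __global__: a kernel, callable from the host, must return void.
__global__ void squareAll(int *data, int n) {
    int i = threadIdx.x;            // one block, so threadIdx.x is unique
    if (i < n) data[i] = square(data[i]);
}

int main() {
    const int N = 8;
    int h[N];
    for (int i = 0; i < N; ++i) h[i] = i;

    int *d;
    cudaMalloc(&d, N * sizeof(int));
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);

    // <<<blocks, threadsPerBlock>>> is the special launch configuration.
    squareAll<<<1, N>>>(d, N);
    cudaDeviceSynchronize();        // the launch is asynchronous; wait here

    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    for (int i = 0; i < N; ++i) printf("%d ", h[i]);
    return 0;
}
```

Note that host execution continues immediately after the launch; `cudaDeviceSynchronize()` is what makes the host wait for the kernel to finish.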
CUDA Thread organization
A grid is composed of blocks; a block is composed of threads.
The processor executes threads in warps.
Parameters for threads.
Each kernel has an independent set of built-in parameters assigned by CUDA at launch:
gridDim // the number of blocks in the grid
blockDim // the number of threads in a block
blockIdx // a unique identifier for the current block
threadIdx // a unique identifier for the current thread
Each parameter is a triple (x, y, z), which means grids and blocks can be specified in up to 3 dimensions, e.g. gridDim.x, threadIdx.y, …
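The most common use of these parameters is computing a unique global index for each thread; this sketch assumes a 1-D grid of 1-D blocks (the kernel name is illustrative):

```cuda
__global__ void addOne(float *a, int n) {
    // Global index: count all threads in the blocks before this one,
    // then add this thread's offset within its own block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)        // guard: the grid may contain more threads than elements
        a[idx] += 1.0f;
}
```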
Grid Constraints
There is a limited number of threads per block: currently 1024. The max size of the x and y block dimensions is also 1024 (the z dimension is limited to 64). The max size of the y and z grid dimensions is 65535; the x grid dimension can be much larger on current hardware.
Warps
GPU block architecture is SPMD (single program, multiple data).
The GPU organizes the threads within a block into warps.
- A warp is a group of threads in a block that execute the same instruction in a SIMD manner.
- The warp size is generally 32 threads.
Streaming multiprocessor (SM)
All threads in a block run on the same SM. An SM can execute multiple blocks concurrently.
maximizing throughput = maximizing occupancy
These two terms both mean maximizing the number of threads resident and running on an SM at once. The limiting factors include:
- number of registers used per thread (local variables in the kernel)
- shared memory used per block
- stalls (e.g. thread divergence within a warp)
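Thread divergence occurs when threads within one warp take different branches, forcing the warp to execute each path serially. A hypothetical sketch of the two cases (kernel names illustrative):

```cuda
__global__ void divergent(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Odd and even lanes of the same warp disagree on this branch,
    // so the warp executes both paths one after the other.
    if (i % 2 == 0)
        out[i] = i * 2;
    else
        out[i] = i * 3;
}

__global__ void uniform(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // All 32 lanes of a warp share the same value of (i / 32),
    // so each warp takes exactly one path: no divergence.
    if ((i / 32) % 2 == 0)
        out[i] = i * 2;
    else
        out[i] = i * 3;
}
```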
transparent scalability
The ability of an algorithm to operate on data of different sizes without reprogramming or manual tuning. To improve transparent scalability, maximize the number of available threads in case the data is large, which means launching each block with as many threads as possible.
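A common way to achieve this is to fix the block size at the maximum and derive the grid size from the data size, so the same code handles any input without retuning; the kernel and helper names here are illustrative:

```cuda
__global__ void addOne(float *a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] += 1.0f;   // guard against the rounded-up extra threads
}

// Round the grid size up so numBlocks * threadsPerBlock >= n.
void launchAddOne(float *d_a, int n) {
    const int threadsPerBlock = 1024;                             // current per-block maximum
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    addOne<<<numBlocks, threadsPerBlock>>>(d_a, n);
}
```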