Threads
A thread is a sequential stream of instructions executed alongside other threads on a processor. Multithreading is the ability of a processor to execute multiple threads concurrently.
Resources
- global memory is shared across all threads.
- shared memory is shared by the threads of a block, which all run on the same SM.
- registers are private to a single thread.
CUDA Function Declarations
__host__ int abs(int a){ }
__device__ int abs(int a){ }
__global__ void kernel(){ }
- __host__ is the default: a function without any CUDA qualifier is compiled as a host function by nvcc. Host functions can only be called from other host functions.
- __device__ functions can only execute on the device and can only be called from device code. They are compiled entirely by nvcc without being passed to a host compiler.
- __global__ functions (kernels) are the only device functions that can be called from a host function. They must have a void return type. A kernel launch specifies special configuration parameters that define how the kernel is launched. Once launched, the kernel runs on the device, and execution continues in the host function that launched it.
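The three declarations above can be sketched in a small program; the kernel and function names here are illustrative, not part of the notes:

```cuda
#include <cstdio>

// __device__: callable only from device code, compiled entirely by nvcc.
__device__ int square(int x) { return x * x; }

// __global__: a kernel, callable from the host, must return void.
__global__ void squareAll(int *data, int n) {
    int i = threadIdx.x;            // one block, so threadIdx.x is unique
    if (i < n) data[i] = square(data[i]);
}

int main() {
    const int N = 8;
    int h[N];
    for (int i = 0; i < N; ++i) h[i] = i;

    int *d;
    cudaMalloc(&d, N * sizeof(int));
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);

    // <<<blocks, threadsPerBlock>>> is the special launch configuration.
    squareAll<<<1, N>>>(d, N);
    cudaDeviceSynchronize();        // the launch is asynchronous; wait here

    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    for (int i = 0; i < N; ++i) printf("%d ", h[i]);
    return 0;
}
```

Note that host execution continues immediately after the launch; `cudaDeviceSynchronize()` is what makes the host wait for the kernel to finish.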
CUDA Thread organization
A grid is composed of blocks; a block is composed of threads.
The processor executes threads in warps.
Parameters for threads.
Each kernel has an independent set of built-in parameters assigned by CUDA at launch:
gridDim // the number of blocks in the grid
blockDim // the number of threads in a block
blockIdx // a unique identifier for the current block
threadIdx // a unique identifier for the current thread
Each parameter is a triple (x, y, z), which means grids and blocks can be specified in up to 3 dimensions, e.g. gridDim.x, threadIdx.y, …
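The most common use of these parameters is computing a unique global index for each thread; this sketch assumes a 1-D grid of 1-D blocks (the kernel name is illustrative):

```cuda
__global__ void addOne(float *a, int n) {
    // Global index: count all threads in the blocks before this one,
    // then add this thread's offset within its own block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)        // guard: the grid may contain more threads than elements
        a[idx] += 1.0f;
}
```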
Grid Constraints
There is a limited number of threads per block: currently 1024. The max size of the x and y block dimensions is also 1024 (the z dimension is limited to 64). The max size of the y and z grid dimensions is 65535; the x grid dimension can be much larger on current hardware.
Warps
GPU block architecture is SPMD (single program, multiple data).
The GPU organizes the threads within a block into warps.
- A warp is a group of threads in a block that execute the same instruction in a SIMD manner.
- The warp size is generally 32 threads.
Streaming multiprocessor (SM)
All threads in a block run on the same SM. An SM can execute multiple blocks concurrently.
maximizing throughput = maximizing occupancy
These two terms both mean maximizing the number of threads resident and running on an SM at once. The limiting factors include:
- number of registers used per thread (local variables in the kernel)
- shared memory used per block
- stalls (e.g. thread divergence within a warp)
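Thread divergence occurs when threads within one warp take different branches, forcing the warp to execute each path serially. A hypothetical sketch of the two cases (kernel names illustrative):

```cuda
__global__ void divergent(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Odd and even lanes of the same warp disagree on this branch,
    // so the warp executes both paths one after the other.
    if (i % 2 == 0)
        out[i] = i * 2;
    else
        out[i] = i * 3;
}

__global__ void uniform(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // All 32 lanes of a warp share the same value of (i / 32),
    // so each warp takes exactly one path: no divergence.
    if ((i / 32) % 2 == 0)
        out[i] = i * 2;
    else
        out[i] = i * 3;
}
```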
transparent scalability
The ability of an algorithm to operate on data of different sizes without reprogramming or manual tuning. To improve transparent scalability, maximize the number of available threads in case the data is large, which means launching each block with as many threads as possible.
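A common way to achieve this is to fix the block size at the maximum and derive the grid size from the data size, so the same code handles any input without retuning; the kernel and helper names here are illustrative:

```cuda
__global__ void addOne(float *a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] += 1.0f;   // guard against the rounded-up extra threads
}

// Round the grid size up so numBlocks * threadsPerBlock >= n.
void launchAddOne(float *d_a, int n) {
    const int threadsPerBlock = 1024;                             // current per-block maximum
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    addOne<<<numBlocks, threadsPerBlock>>>(d_a, n);
}
```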