CUDA programming (4) -- Resources

Threads

A thread is a sequential stream of instructions executed alongside other threads on a processor. Multithreading is the ability of a processor to execute several threads concurrently.

Resources

  • Global memory is shared across all threads.
  • Shared memory is shared across the threads of a block, which all run on a single SM.
  • Registers are private to a single thread.
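A minimal sketch of the three memory spaces inside one kernel (the kernel name `scaleKernel` and the 256-thread block size are illustrative, not from the original):

```cuda
// Each thread handles one element: reads it from global memory,
// stages it in shared memory, and writes the result back.
__global__ void scaleKernel(float *out, const float *in, int n) {
    __shared__ float tile[256];                      // shared memory: visible to every thread in this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in a register: private to this thread
    if (i < n) {
        tile[threadIdx.x] = in[i];                   // read from global memory (visible to all threads)
        __syncthreads();                             // make the shared-memory write visible block-wide
        out[i] = 2.0f * tile[threadIdx.x];           // write result back to global memory
    }
}
```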

CUDA Function Declarations

	__host__ int abs(int x) { return x < 0 ? -x : x; }
	__device__ int abs(int x) { return x < 0 ? -x : x; }
	__global__ void kernel() { }
  • __host__ is the default and is mostly ignored by nvcc when no other CUDA qualifier is present. A host function can only be called from other host functions.
  • __device__ functions can only execute on the device. They are compiled entirely by nvcc without being passed to any host compiler.
  • __global__ functions are the only device functions that can be called from a host function. A __global__ function must have a void return type. Its launch specifies special execution-configuration parameters that define how the kernel is launched. Once launched, the kernel runs on the device, while execution continues in the host function that launched it.
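The rules above can be sketched with a trivial launch (the kernel name `fill` and the 8×256 configuration are illustrative):

```cuda
#include <cstdio>

// __global__: callable from the host, runs on the device, returns void.
__global__ void fill(int *data, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = value;                       // executes on the device
}

int main() {                               // host code
    int *d_data;
    cudaMalloc(&d_data, 8 * 256 * sizeof(int));
    fill<<<8, 256>>>(d_data, 42);          // execution configuration: 8 blocks of 256 threads
    cudaDeviceSynchronize();               // the launch is asynchronous; wait for completion
    cudaFree(d_data);
    return 0;
}
```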

CUDA Thread organization

A block is composed of threads.

The processor executes threads in warps.

Parameters for threads.

Each kernel has an independent set of built-in parameters assigned by CUDA at launch.

gridDim			// the number of blocks in the grid
blockDim		// the number of threads in a block
blockIdx		// a unique identifier for the current block
threadIdx		// a unique identifier for the current thread

Each parameter is a triple (x, y, z), which means grids and blocks can be specified in up to 3 dimensions, e.g. gridDim.x, threadIdx.y, …
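These built-in variables are typically combined to compute a unique global index per thread; a sketch for a 1-D grid (the kernel name `addOne` is illustrative):

```cuda
__global__ void addOne(float *a, int n) {
    // Flatten block and thread coordinates into one global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: the last block may be only partially full
        a[i] += 1.0f;
}
```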

Grid Constraints

There is a limited number of threads per block: currently 1024. The maximum size of a block's x or y dimension is likewise 1024 (the z dimension is smaller, 64 on current hardware). The maximum size of the grid's y and z dimensions is 65535; on current hardware the x dimension can be much larger.

Warps

The GPU block architecture is SPMD (single program, multiple data).

The GPU organizes the threads within a block into warps.

  • A warp is a group of consecutive threads in a block that execute the same instruction in SIMD fashion.
  • The warp size is generally 32 threads.
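Because a warp issues one instruction at a time, a branch that splits threads within a warp forces both paths to execute one after the other (thread divergence). A hypothetical sketch of a divergent kernel:

```cuda
__global__ void diverge(int *out) {
    // Threads 0..31 form one warp. This condition splits every warp in half,
    // so each warp serially executes the if path and then the else path.
    if (threadIdx.x % 2 == 0)
        out[threadIdx.x] = 1;
    else
        out[threadIdx.x] = 2;
}
```

Branching on a per-warp quantity (e.g. `threadIdx.x / 32`) avoids this serialization, since all threads in a warp then take the same path.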

Streaming multiprocessor (SM)

All threads in a block run on the same SM. An SM can execute many threads, from multiple blocks, concurrently.

maximizing throughput = maximizing occupancy

Both mean maximizing the number of threads running simultaneously on an SM. The limiting factors include:

  • the number of registers used (variables in the kernel)
  • the amount of shared memory used
  • stalls (thread divergence)

transparent scalability

The ability of an algorithm to operate on data of different sizes without reprogramming or user tuning. To achieve better transparent scalability, we need to maximize the possible number of threads in case the data is large, which means launching each block with as many threads as possible.
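One common pattern, sketched here with illustrative names: fix the block size at the per-block maximum and derive the grid size from the data size by ceiling division, so the same code covers any n without retuning.

```cuda
__global__ void addOneKernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

// Host-side launch helper: the grid size scales with n automatically.
void launchAddOne(float *d_a, int n) {
    int threadsPerBlock = 1024;                                // current per-block maximum
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division covers all n elements
    addOneKernel<<<blocks, threadsPerBlock>>>(d_a, n);
}
```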
