DAT400 Workshop 3: CUDA

Basics

host: traditional CPU
devices: GPU processors with a massive number of arithmetic units

main source of performance: data parallelism
very fine grain parallelism enabled by hardware thread scheduling!

cuda threads take very few clock cycles to generate and schedule, due to efficient hardware support.

Single instruction, multiple thread (SIMT) -> low overhead threading
Single program, multiple data (SPMD)

GPUs execute blocks in a SIMD-like fashion: 32 operations from the same thread block (a warp) are issued in parallel. If the block size is not a multiple of 32, some functional units will be idle.

API

cudaMalloc()
cudaFree()
cudaMemcpy()
blockDim, threadIdx, blockIdx
function<<<dimGrid, dimBlock>>>
dim3

dimensions of thread blocks: multiples of 32 (due to hardware efficiency reasons)
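A minimal sketch tying these API calls together (the kernel name `add` and the sizes are illustrative, not from the workshop):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SPMD: every thread runs the same program on different data.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard threads past the end of the data
}

int main() {
    const int N = 1000;
    float h_a[N], h_b[N], h_c[N];
    for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));
    cudaMalloc(&d_c, N * sizeof(float));
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimBlock(256, 1, 1);             // block size: a multiple of 32
    dim3 dimGrid((N + 255) / 256, 1, 1);  // ceiling division: enough blocks for N
    add<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, N);

    cudaMemcpy(h_c, d_c, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);      // 10 + 20 = 30

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```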

__syncthreads() : for threads in the same block

__shared__
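`__shared__` declares per-block on-chip memory that all threads in the block can read and write; `__syncthreads()` orders accesses to it. A small sketch (the kernel and sizes are illustrative):

```cuda
#define BLOCK_SIZE 256

__global__ void reverse_in_block(const float *in, float *out) {
    // Static shared memory: one tile per block, size fixed at compile time.
    __shared__ float tile[BLOCK_SIZE];

    int g = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[g];
    __syncthreads();   // every load must finish before any thread reads tile[]

    // Each thread reads a value written by another thread in the same block.
    out[g] = tile[blockDim.x - 1 - threadIdx.x];
}
```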

workshop

dim3 NumBlock = {(M % 64 == 0) ? M / 64 : M / 64 + 1, 1, 1};  // ceil(M / 64) blocks
int mod = 2;       // reduction stride; doubles each iteration
__syncthreads();   // partial sums in sum[] must be visible to all threads first
while (mod <= blockDim.x) {
	if (threadIdx.x % mod == 0) {
		sum[threadIdx.x] += sum[threadIdx.x + mod / 2];
	}
	mod <<= 1;
	__syncthreads();
}
for (int i = 0; i < (N + BLOCK_SIZE - 1) / BLOCK_SIZE; i++) {  // integer ceil(N / BLOCK_SIZE)
	// Cooperatively load one tile of the vector into shared memory.
	for (int j = 0; j < (BLOCK_SIZE + blockDim.x - 1) / blockDim.x; j++) {
		if (i * BLOCK_SIZE + threadIdx.x + j * blockDim.x < N
		    && threadIdx.x + j * blockDim.x < BLOCK_SIZE)
			vec_shared[threadIdx.x + j * blockDim.x] = vec[i * BLOCK_SIZE + threadIdx.x + j * blockDim.x];
	}
	__syncthreads();
	if (tid < M) {
		// Partial dot product over this tile; stop at N for the last, partial tile.
		for (int j = 0; j < BLOCK_SIZE && i * BLOCK_SIZE + j < N; j++) {
			sum += mat[tid * N + i * BLOCK_SIZE + j] * vec_shared[j];
		}
	}
	__syncthreads();
}
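A hedged sketch of how the tiled matrix-vector kernel above might be launched from the host (the kernel name `matvec`, its signature, and the 64-thread block are assumptions, chosen to match the NumBlock computation earlier):

```cuda
const int THREADS = 64;
// One thread per matrix row, rounded up to whole blocks (ceiling division).
dim3 dimBlock(THREADS, 1, 1);
dim3 dimGrid((M + THREADS - 1) / THREADS, 1, 1);
matvec<<<dimGrid, dimBlock>>>(d_mat, d_vec, d_out, M, N);
cudaDeviceSynchronize();   // wait for the kernel before reading d_out
```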

CUDA streams

A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code.

The default stream is different from other streams because it is a synchronizing stream with respect to operations on the device:

  • no operation in the default stream will begin until all previously issued operations in any stream on the device have completed
  • an operation in the default stream must complete before any other operation (in any stream on the device) will begin.

The asynchronous behavior of kernel launches (from host’s perspective) makes overlapping device and host computation very simple. Whether the host function or device kernel completes first does not affect the subsequent device-to-host transfer, which will begin only after the kernel completes.

cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1);
cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
increment<<<1, N, 0, stream1>>>(d_a);
cudaStreamQuery(stream1);   // non-blocking check: has the stream's work finished?
cudaDeviceSynchronize();    // block the host until all device work completes
result = cudaStreamDestroy(stream1);
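To actually overlap transfers with computation, the work can be split into chunks issued to different streams. A sketch under assumptions not in the original: `increment` operates element-wise, N is divisible by the chunk counts, and the host buffer `a` was allocated with cudaMallocHost (pinned memory is required for async copies to overlap):

```cuda
const int nStreams = 4;
const int chunk = N / nStreams;
cudaStream_t streams[nStreams];

for (int s = 0; s < nStreams; s++) {
    cudaStreamCreate(&streams[s]);
    int offset = s * chunk;
    // Within a stream, the copy-in, kernel, and copy-out run in issue order;
    // across streams, chunks can overlap on the device.
    cudaMemcpyAsync(&d_a[offset], &a[offset], chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    increment<<<chunk / 256, 256, 0, streams[s]>>>(&d_a[offset]);
    cudaMemcpyAsync(&a[offset], &d_a[offset], chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();   // wait for all streams to drain
for (int s = 0; s < nStreams; s++) cudaStreamDestroy(streams[s]);
```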