DAT400 Workshop 3: CUDA

Basics

host: traditional CPU
devices: GPU processors with a massive number of arithmetic units

main source of performance: data parallelism
very fine grain parallelism enabled by hardware thread scheduling!

cuda threads take very few clock cycles to generate and schedule, due to efficient hardware support.

Single instruction, multiple thread (SIMT) -> low overhead threading
Single program, multiple data (SPMD)

GPUs execute blocks in a SIMD-like fashion: 32 operations from the same thread block (a warp) are issued in parallel. If the block size is not a multiple of 32, some functional units will be idle.

API

cudaMalloc()
cudaFree()
cudaMemcpy()
blockDim, threadIdx, blockIdx
function<<<dimGrid, dimBlock>>>
dim3

dimensions of thread blocks: multiples of 32 (due to hardware efficiency reasons)
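A minimal sketch tying these API calls together (the kernel name `add` and the sizes are illustrative, not from the workshop):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SPMD: every thread runs the same program on different data.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard threads past the end of the data
}

int main() {
    const int N = 1000;
    float h_a[N], h_b[N], h_c[N];
    for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));
    cudaMalloc(&d_c, N * sizeof(float));
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimBlock(256, 1, 1);             // block size: a multiple of 32
    dim3 dimGrid((N + 255) / 256, 1, 1);  // ceiling division: enough blocks for N
    add<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, N);

    cudaMemcpy(h_c, d_c, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);      // 10 + 20 = 30

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```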

__syncthreads() : for threads in the same block

__shared__
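`__shared__` declares per-block on-chip memory that all threads in the block can read and write; `__syncthreads()` orders accesses to it. A small sketch (the kernel and sizes are illustrative):

```cuda
#define BLOCK_SIZE 256

__global__ void reverse_in_block(const float *in, float *out) {
    // Static shared memory: one tile per block, size fixed at compile time.
    __shared__ float tile[BLOCK_SIZE];

    int g = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[g];
    __syncthreads();   // every load must finish before any thread reads tile[]

    // Each thread reads a value written by another thread in the same block.
    out[g] = tile[blockDim.x - 1 - threadIdx.x];
}
```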

workshop

dim3 NumBlock = {(M % 64 == 0) ? M / 64 : M / 64 + 1, 1, 1};  // ceil(M / 64) blocks
int mod = 2;       // reduction stride; doubles each iteration
__syncthreads();   // partial sums in sum[] must be visible to all threads first
while (mod <= blockDim.x) {
	if (threadIdx.x % mod == 0) {
		sum[threadIdx.x] += sum[threadIdx.x + mod / 2];
	}
	mod <<= 1;
	__syncthreads();
}
for (int i = 0; i < (N + BLOCK_SIZE - 1) / BLOCK_SIZE; i++) {  // integer ceil(N / BLOCK_SIZE)
	// Cooperatively load one tile of the vector into shared memory.
	for (int j = 0; j < (BLOCK_SIZE + blockDim.x - 1) / blockDim.x; j++) {
		if (i * BLOCK_SIZE + threadIdx.x + j * blockDim.x < N
		    && threadIdx.x + j * blockDim.x < BLOCK_SIZE)
			vec_shared[threadIdx.x + j * blockDim.x] = vec[i * BLOCK_SIZE + threadIdx.x + j * blockDim.x];
	}
	__syncthreads();
	if (tid < M) {
		// Partial dot product over this tile; stop at N for the last, partial tile.
		for (int j = 0; j < BLOCK_SIZE && i * BLOCK_SIZE + j < N; j++) {
			sum += mat[tid * N + i * BLOCK_SIZE + j] * vec_shared[j];
		}
	}
	__syncthreads();
}
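A hedged sketch of how the tiled matrix-vector kernel above might be launched from the host (the kernel name `matvec`, its signature, and the 64-thread block are assumptions, chosen to match the NumBlock computation earlier):

```cuda
const int THREADS = 64;
// One thread per matrix row, rounded up to whole blocks (ceiling division).
dim3 dimBlock(THREADS, 1, 1);
dim3 dimGrid((M + THREADS - 1) / THREADS, 1, 1);
matvec<<<dimGrid, dimBlock>>>(d_mat, d_vec, d_out, M, N);
cudaDeviceSynchronize();   // wait for the kernel before reading d_out
```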

CUDA streams

A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code.

The default stream is different from other streams because it is a synchronizing stream with respect to operations on the device:

  • no operation in the default stream will begin until all previously issued operations in any stream on the device have completed
  • an operation in the default stream must complete before any other operation (in any stream on the device) will begin.

The asynchronous behavior of kernel launches (from host’s perspective) makes overlapping device and host computation very simple. Whether the host function or device kernel completes first does not affect the subsequent device-to-host transfer, which will begin only after the kernel completes.

cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1);
cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
increment<<<1, N, 0, stream1>>>(d_a);
cudaStreamQuery(stream1);   // non-blocking check: has the stream's work finished?
cudaDeviceSynchronize();    // block the host until all device work completes
result = cudaStreamDestroy(stream1);
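To actually overlap transfers with computation, the work can be split into chunks issued to different streams. A sketch under assumptions not in the original: `increment` operates element-wise, N is divisible by the chunk counts, and the host buffer `a` was allocated with cudaMallocHost (pinned memory is required for async copies to overlap):

```cuda
const int nStreams = 4;
const int chunk = N / nStreams;
cudaStream_t streams[nStreams];

for (int s = 0; s < nStreams; s++) {
    cudaStreamCreate(&streams[s]);
    int offset = s * chunk;
    // Within a stream, the copy-in, kernel, and copy-out run in issue order;
    // across streams, chunks can overlap on the device.
    cudaMemcpyAsync(&d_a[offset], &a[offset], chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    increment<<<chunk / 256, 256, 0, streams[s]>>>(&d_a[offset]);
    cudaMemcpyAsync(&a[offset], &d_a[offset], chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();   // wait for all streams to drain
for (int s = 0; s < nStreams; s++) cudaStreamDestroy(streams[s]);
```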