CUDA(31) Parallel Reduction: How Many Threads Do We Really Need?

Abstract
This post shows how to further optimize CUDA reduction performance by exploring how many threads a kernel should actually launch.

1. How Many Threads Do We Need?
The first question is: how many threads do we actually need? Some would say as many as possible (maximizing thread-level parallelism, TLP), while others would say as few as possible (relying on instruction-level parallelism, ILP). In practice, this is a trade-off.
In the previous post, our reduction implementation was based purely on the TLP strategy. It is very likely that its performance can be further improved by combining the TLP and ILP strategies.

2. Use Fewer Threads (512 Threads vs. 1024 Threads)

	// load data into shared memory, performing one addition per thread on the way in
	__shared__ uint64_t data[512];
	if (tid < 512) {
		data[tid] = data_gpu[tid] + data_gpu[tid + 512];
	}
	__syncthreads();

	// tree reduction across the block
	if (tid < 256) {
		data[tid] += data[tid + 256];
	}
	__syncthreads();

	if (tid < 128) {
		data[tid] += data[tid + 128];
	}
	__syncthreads();

	if (tid < 64) {
		data[tid] += data[tid + 64];
	}
	__syncthreads();

	// the last 32 active threads form a single warp, so no __syncthreads()
	// is needed; however, shared memory must be accessed through a volatile
	// pointer so the compiler does not cache values in registers (on Volta
	// and newer architectures, insert __syncwarp() between the steps instead
	// of relying on implicit lock-step execution)
	if (tid < 32) {
		volatile uint64_t *vdata = data;
		vdata[tid] += vdata[tid + 32];
		vdata[tid] += vdata[tid + 16];
		vdata[tid] += vdata[tid + 8];
		vdata[tid] += vdata[tid + 4];
		vdata[tid] += vdata[tid + 2];
		vdata[tid] += vdata[tid + 1];
	}

	// write the root of the reduction tree (data[0]) back to global memory
	if (tid == 0) {
		data_gpu[0] = data[0];
	}

4. Experimental Results

The experimental results show that balancing the TLP and ILP strategies yields a new, higher level of performance.

5. More Details
The whole project has been published on GitHub, and anyone interested in CUDA is warmly welcome to help develop it further. Big thanks to all of you!
