CUDA(31) Parallel Reduction: How Many Threads Do We Really Need?

Abstract
This post shows how to further optimize CUDA reduction performance by exploring how many threads a kernel should actually launch.

1. How Many Threads Do We Need?
The first question is: how many threads do we actually need? Some would say as many as possible (maximizing thread-level parallelism, TLP), while others would say as few as possible (relying on instruction-level parallelism, ILP). In practice, this is a trade-off.
In the previous post, our reduction implementation was based purely on the TLP strategy. It is very likely that its performance can be further improved by combining the TLP and ILP strategies.

2. Use Fewer Threads (512 Threads vs. 1024 Threads)

	// load data into shared memory, performing one addition per thread on the way in
	__shared__ uint64_t data[512];
	if (tid < 512) {
		data[tid] = data_gpu[tid] + data_gpu[tid + 512];
	}
	__syncthreads();

	// tree reduction across the block
	if (tid < 256) {
		data[tid] += data[tid + 256];
	}
	__syncthreads();

	if (tid < 128) {
		data[tid] += data[tid + 128];
	}
	__syncthreads();

	if (tid < 64) {
		data[tid] += data[tid + 64];
	}
	__syncthreads();

	// the last 32 active threads form a single warp, so no __syncthreads()
	// is needed; however, shared memory must be accessed through a volatile
	// pointer so the compiler does not cache values in registers (on Volta
	// and newer architectures, insert __syncwarp() between the steps instead
	// of relying on implicit lock-step execution)
	if (tid < 32) {
		volatile uint64_t *vdata = data;
		vdata[tid] += vdata[tid + 32];
		vdata[tid] += vdata[tid + 16];
		vdata[tid] += vdata[tid + 8];
		vdata[tid] += vdata[tid + 4];
		vdata[tid] += vdata[tid + 2];
		vdata[tid] += vdata[tid + 1];
	}

	// write the root of the reduction tree (data[0]) back to global memory
	if (tid == 0) {
		data_gpu[0] = data[0];
	}

4. Experimental Results

The experimental results show that balancing the TLP and ILP strategies yields a new, higher level of performance.

5. More Details
The whole project has been published on GitHub, and anyone interested in CUDA is warmly welcome to help develop it further. Big thanks to all of you!
