数组求和的快速方法（利用cuda的共享内存）--第一部分之源码分析

最新推荐文章于 2022-03-12 17:00:51 发布

lingerlanlan

最新推荐文章于 2022-03-12 17:00:51 发布

阅读量4.7k

点赞数 3

分类专栏： cuda(GPU programming) 文章标签： cuda 数组求和共享内存

本文链接：https://blog.csdn.net/lingerlanlan/article/details/24622897

版权

cuda(GPU programming) 专栏收录该内容

20 篇文章 1 订阅

订阅专栏

代码来自于这里

https://code.google.com/p/stanford-cs193g-sp2010/source/browse/trunk/tutorials/sum_reduction.cu

貌似是斯坦福的课程。

核函数分析：

// this kernel computes, per-block, the sum
// of a block-sized portion of the input
// using a block-wide reduction
template<class DType>
__global__ void block_sum(const DType *input,
		DType *per_block_results,
						const size_t n)
{
	extern __shared__ DType sdata[];

	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

	// load input into __shared__ memory //一个线程负责把一个元素从全局内存载入到共享内存
	DType x = 0;
	if(i < n)
	{
		x = input[i];
	}
	sdata[threadIdx.x] = x;
	__syncthreads();//等待所有线程把自己负责的元素载入到共享内存

	// contiguous range pattern//块内进行合并操作，每次合并变为一半.注意threadIdx.x是块内的偏移，上面算出的i是全局的偏移。
	for(int offset = blockDim.x / 2;
			offset > 0;
			offset >>= 1)
	{
		if(threadIdx.x < offset)//控制只有某些线程才进行操作。
		{
			// add a partial sum upstream to our own
			sdata[threadIdx.x] += sdata[threadIdx.x + offset];
		}

		// wait until all threads in the block have
		// updated their partial sums
		__syncthreads();
	}

	// thread 0 writes the final result//每个块的线程0负责存放块内求和的结果
	if(threadIdx.x == 0)
	{
		per_block_results[blockIdx.x] = sdata[0];
	}
}

使用分析：

		// move input to device memory//分配内存
		double *d_input = 0;
		cudaMalloc((void**)&d_input, sizeof(double) * num_elements);
		cudaMemcpy(d_input, &h_input[0], sizeof(double) * num_elements, cudaMemcpyHostToDevice);

		const size_t block_size = 512;//线程块的大小。目前有些gpu的线程块最大为512，有些为1024.
		const size_t num_blocks = (num_elements/block_size) + ((num_elements%block_size) ? 1 : 0);

	// allocate space to hold one partial sum per block, plus one additional
	// slot to store the total sum
		double *d_partial_sums_and_total = 0;//一个线程块一个和，另外加一个元素，存放所有线程块的和。
		cudaMalloc((void**)&d_partial_sums_and_total, sizeof(double) * (num_blocks + 1));

		// launch one kernel to compute, per-block, a partial sum//把每个线程块的和求出来
		block_sum<<<num_blocks,block_size,block_size * sizeof(double)>>>(d_input, d_partial_sums_and_total, num_elements);

		// launch a single block to compute the sum of the partial sums
	        //再次用一个线程块把上一步的结果求和。
		//注意这里有个限制，上一步线程块的数量，必须不大于一个线程块线程的最大数量，因为这一步得把上一步的结果放在一个线程块操作。
		//即num_blocks不能大于线程块的最大线程数量。
		block_sum<<<1,num_blocks,num_blocks * sizeof(double)>>>(d_partial_sums_and_total, d_partial_sums_and_total + num_blocks, num_blocks);

		// copy the result back to the host
		double device_result = 0;
		cudaMemcpy(&device_result, d_partial_sums_and_total + num_blocks, sizeof(double), cudaMemcpyDeviceToHost);

		std::cout << "Device sum: " << device_result << std::endl;

		// deallocate device memory
		cudaFree(d_input);
		cudaFree(d_partial_sums_and_total);

上面提到这个程序有个限制，即num_blocks不能大于线程块的最大线程数量。本来这两者没有必然联系的，只是由于程序的设计导致的问题。

这也导致了数组的元素数量也有限制了，不能大于线程块最大线程数的平方。如果maxThreadsPerBlock=512，那数组数量必须少于512*512；

如果maxThreadsPerBlock=1024，则数组数量必须少于1024*1024。

其实是可以继续完善这一点的。只需要继续迭代，保证最后一次是可以在一个线程块计算即可。

有时间实现一下。