A Secret About cudaMalloc and cudaMallocPitch

If you repost this article, please credit the source: http://blog.csdn.net/bendanban/article/details/7646306

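Out of curiosity, I tested the two functions named in the title. To get aligned data access we often reach for cudaMallocPitch, assuming it will be more efficient. Here is a small test program that you can run on your own machine; you will find that in some cases the two functions behave the same.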

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

int main(int argc, char **argv)
{
	// device pointers.
	float *d_pitch;
	float *d_normal;

	// matrix size.
	size_t cols = 63;
	size_t rows = 16;
	
	size_t pitch = 0;
	
	// allocate the data from GPU memory.
	cudaMallocPitch((void**)&d_pitch, &pitch, cols*sizeof(float), rows);
	cudaMalloc((void**)(&d_normal), rows*cols*sizeof(float));
	
	// report the pitch and check the alignment of both base addresses.
	// the pointers are cast through size_t so 64-bit addresses are not truncated.
	fprintf(stdout, "row size(in bytes) = %.2f*128.\n", pitch/128.0f);
	fprintf(stdout, "the head address of d_pitch  mod 128 = %x.\n", (unsigned int)((size_t)d_pitch%128));
	fprintf(stdout, "the head address of d_normal mod 128 = %x.\n", (unsigned int)((size_t)d_normal%128));
	
	cudaFree(d_pitch);
	cudaFree(d_normal);

	getchar();
	return 0;
}

The program above produces the following output:

row size(in bytes) = 28.00*128.
the head address of d_pitch mod 128 = 0.
the head address of d_normal mod 128 = 0.

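I ran this experiment many times. From these results, I believe that no matter how the parameters are changed, the base addresses returned by both device memory allocation functions are always multiples of 128 or 256. My guess is that each compute unit on the GPU loads a power-of-two number of contiguous data elements from global memory at a time, and that the starting address of each such load must also be aligned to a power of two. So when a warp reads data from global memory, it should follow the alignment rules as closely as possible in order to get higher efficiency. The alignment rules themselves are described in the CUDA programming guide.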