See
CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops
I won't translate the whole article; in short, it is about writing kernels with a grid-stride loop.
A kernel without a grid-stride loop looks like this:
__global__ void kernel(int n)
{
    // one element per thread, so the grid must contain at least n threads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        ....
}
The same kernel with a grid-stride loop:
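__global__ void kernel(int n)
{
    // each thread starts at its global index and advances by the
    // total number of threads in the grid (blockDim.x * gridDim.x)
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        ...
    }
}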
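As a fuller sketch of the pattern, here is a SAXPY kernel (the example used in the NVIDIA article) with a small host-side helper. The helper name `saxpy_on_device`, the block size of 256, and the grid size of 32 blocks per SM are illustrative assumptions rather than requirements. The helper also assumes both vectors have the same length, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride SAXPY: y[i] = a * x[i] + y[i] for every i in [0, n),
// correct for any launch shape, even when the grid has fewer than n threads.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical helper (not from the article): copies the inputs to the GPU,
// runs saxpy with a grid sized from the SM count, and returns the result.
// Assumes hx.size() == hy.size().
std::vector<float> saxpy_on_device(float a,
                                   const std::vector<float> &hx,
                                   std::vector<float> hy)
{
    int n = static_cast<int>(hx.size());
    size_t bytes = n * sizeof(float);

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    // Size the grid from the hardware, not from n.
    int dev = 0, numSMs = 1;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
    saxpy<<<32 * numSMs, 256>>>(n, a, dx, dy);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return hy;
}
```

Because the launch shape no longer depends on n, the same call works whether the vectors hold a thousand elements or many millions.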
The article lists three advantages of this approach:
1. Scalability and thread reuse
2. Debugging
3. Portability and readability
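The debugging point can be made concrete with a small sketch. Since the loop covers all n elements for any launch shape, launching one block of one thread runs the iterations serially and in order, which helps when checking for race conditions or reading `printf` output. The kernel `fill_doubled` and the helper `run_fill` are hypothetical names chosen for this illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride kernel that writes out[i] = 2 * i for every i in [0, n).
__global__ void fill_doubled(int n, int *out)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        out[i] = 2 * i;
    }
}

// Hypothetical helper: runs the kernel with the given launch shape
// and copies the result back to the host.
std::vector<int> run_fill(int n, int blocks, int threads)
{
    std::vector<int> host(n);
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));

    fill_doubled<<<blocks, threads>>>(n, dev);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Under these assumptions, `run_fill(1000, 1, 1)` (serial, for debugging) and `run_fill(1000, 4, 128)` (parallel) should return the same vector; only the execution order differs.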
In my opinion, the first two are the main reasons.