Requirements for coalesced access in CUDA: CUDA 5.0 memory alignment and coalesced access

I have a 2D host array with 10 rows and 96 columns. I copy this array to my CUDA device's global memory linearly, i.e. row 1, row 2, row 3, ..., row 10.

The array is of type float. In my kernel each thread accesses one float value from the device global memory.

The BLOCK_SIZE I use is 96.

The GRID_DIM I use is 10.

From what I understood of the "CUDA C Programming Guide" section on coalesced accesses, the pattern I am using is correct: the threads of a warp access consecutive memory locations. But there is a clause about 128-byte memory alignment, which I fail to understand.

Q1) 128-byte memory alignment: does it mean that each thread in a warp should access 4 bytes within an address range that starts at, for example, 0x00 and ends at 0x80?

Q2) So in this scenario, will I be making uncoalesced accesses or not?

My understanding is: each thread should make one memory access, which should be 4 bytes, within an address range such as 0x00 to 0x80. If a thread in a warp accesses a location outside that range, it's an uncoalesced access.

Solution

Loads from global memory are usually done in chunks of 128 bytes, aligned on 128-byte boundaries. Coalesced memory access means that all accesses from your warp fall within one 128-byte chunk. (On older cards, the memory had to be accessed in order of thread ID, but newer cards no longer have this requirement.)

If the 32 threads in your warp each read a float, you will read a total of 128 bytes from global memory. If the memory is aligned correctly, all reads will come from the same 128-byte chunk. If the alignment is off, you'll need two reads. If you do something like a[32*i], then each access will come from a different 128-byte chunk of global memory, which will be very slow.
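The three cases above can be checked with a small host-side model. This is only a sketch under the assumptions stated in this answer (128-byte chunks on 128-byte boundaries, 32 threads per warp, 4-byte floats); it does not run on the GPU, and the name `chunks_touched` is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Assumed model: memory is fetched in 128-byte chunks aligned on
// 128-byte boundaries, and a warp has 32 threads.
constexpr std::size_t kChunkBytes = 128;
constexpr std::size_t kWarpSize = 32;

// Hypothetical helper: counts the distinct 128-byte chunks touched when
// lane i of one warp reads the float a[stride * i + offset], where `base`
// is the byte address of a[0].
std::size_t chunks_touched(std::uintptr_t base, std::size_t stride,
                           std::size_t offset) {
    assert(stride > 0);
    std::set<std::uintptr_t> chunks;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        std::uintptr_t addr = base + (stride * lane + offset) * sizeof(float);
        chunks.insert(addr / kChunkBytes);  // index of the chunk holding addr
    }
    return chunks.size();
}
```

With an aligned base address of 0, `chunks_touched(0, 1, 0)` models a[i] and yields one chunk, `chunks_touched(0, 1, 1)` models a read shifted by one float and yields two, and `chunks_touched(0, 32, 0)` models a[32*i] and yields 32.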

It doesn't matter which 128-byte chunk you access, as long as all threads in a warp access the same one.

If you have an array of 96 floats and each thread with index i in your warp accesses a[i], it will be a coalesced read. The same holds for a[i+32] and a[i+64].

So the answer to Q1 is that all threads in a warp need to stay within the same 128-byte chunk, aligned on a 128-byte boundary.

The answer to your Q2 is that if your arrays are aligned correctly, and your accesses are of the form a[32*x+i], with i the thread's index within its warp and x any integer that is the same for all threads in the warp, your accesses will be coalesced.
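To apply this to the layout in the question (10 blocks of 96 threads, one float per thread), here is a host-side sketch. It assumes the kernel indexes the array as `blockIdx.x * 96 + threadIdx.x`, which the question does not state explicitly, and it uses the same 128-byte chunk model as above. The function name `worst_warp_chunks` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Assumed model: 128-byte chunks on 128-byte boundaries, 32 threads per warp.
constexpr std::size_t kChunk = 128;
constexpr std::size_t kWarp = 32;

// Hypothetical helper: returns the largest number of 128-byte chunks that
// any single warp touches when thread t of block b reads the float at
// element index b * block_size + t. `base` is the byte address of element 0.
std::size_t worst_warp_chunks(std::uintptr_t base, std::size_t grid_dim,
                              std::size_t block_size) {
    assert(block_size > 0 && block_size % kWarp == 0);
    std::size_t worst = 0;
    for (std::size_t b = 0; b < grid_dim; ++b) {
        for (std::size_t w = 0; w < block_size / kWarp; ++w) {
            std::set<std::uintptr_t> chunks;
            for (std::size_t lane = 0; lane < kWarp; ++lane) {
                std::size_t idx = b * block_size + w * kWarp + lane;
                chunks.insert((base + idx * sizeof(float)) / kChunk);
            }
            if (chunks.size() > worst) worst = chunks.size();
        }
    }
    return worst;
}
```

Under these assumptions `worst_warp_chunks(0, 10, 96)` is 1: each row is 96 floats, or 384 bytes, which is exactly three 128-byte chunks, so every warp starts on a chunk boundary. Shifting the base by one float, as in `worst_warp_chunks(4, 10, 96)`, makes every warp straddle two chunks.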

According to Section 5.3.2.1.1 of the programming guide, memory is always aligned on at least 256-byte boundaries, so arrays created with cudaMalloc are always aligned correctly.
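A 256-byte-aligned base address is also 128-byte aligned, but a pointer offset into the array is only aligned when the offset is a multiple of 32 floats. The following host-side sketch illustrates that arithmetic; the helper names are made up, and the base address 256 merely stands in for a pointer returned by cudaMalloc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when the byte address is a multiple of `alignment`.
bool is_aligned(std::uintptr_t addr, std::size_t alignment) {
    assert(alignment > 0);
    return addr % alignment == 0;
}

// Byte address of element k of a float array whose first element is at `base`.
std::uintptr_t float_element_address(std::uintptr_t base, std::size_t k) {
    return base + k * sizeof(float);
}
```

For example, with a base of 256, elements 32 and 96 sit on 128-byte boundaries, while element 1 does not. A warp that starts reading at element 1 would therefore span two chunks.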
