When to use volatile with shared CUDA Memory

Under what circumstances should you use the volatile keyword with a CUDA kernel's shared memory? I understand that volatile tells the compiler never to cache any values, but my question is about the behavior with a shared array:

__shared__ float products[THREADS_PER_ACTION];

// some computation
products[threadIdx.x] = localSum;

// wait for everyone to finish their computation
__syncthreads();

// then a (basic, ugly) reduction:
if (threadIdx.x == 0) {
    float globalSum = 0.0f;
    for (int i = 0; i < THREADS_PER_ACTION; i++)
        globalSum += products[i];
}

Do I need products to be volatile in this case? Each array entry is only accessed by a single thread, except at the end, where everything is read by thread 0. Is it possible that the compiler could cache the entire array, so that I need it to be volatile, or will it only cache individual elements?

Thanks!



If you don't declare a shared array as volatile, then the compiler is free to optimize locations in shared memory by caching them in registers (whose scope is private to a single thread), for any thread, at its choosing. This is true whether you access that particular shared element from only one thread or not. Therefore, if you use shared memory as a communication vehicle between threads of a block, it's best to declare it volatile.
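For illustration, here is a minimal sketch of the classic warp-synchronous reduction tail (a pre-Volta idiom from the well-known CUDA reduction examples) where volatile is essential. The threads of a warp communicate through shared memory with no intervening barrier, so without volatile the compiler could keep sdata[tid] in a register and other threads would never observe the updates:

__device__ void warpReduce(volatile float *sdata, int tid)
{
    // No __syncthreads() between these lines: the code relies on
    // lock-step warp execution, so every read and write must actually
    // touch shared memory -- hence the volatile qualifier.
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

(On Volta and later architectures, warp-synchronous code like this should use __syncwarp() or warp shuffle intrinsics instead, but the sketch still shows why volatile matters when shared memory is the communication vehicle.)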

Obviously, if each thread only accesses its own elements of shared memory, and never those associated with another thread, then this does not matter, and the compiler optimization will not break anything.

In your case, where you have a section of code in which each thread is accessing its own elements of shared memory, and the only inter-thread access occurs at a well-understood location, you could use a memory fence function to force the compiler to evict any values that are temporarily stored in registers back out to the shared array. So you might think that __threadfence_block() would be useful, but in your case __syncthreads() already has memory-fencing functionality built in. Your __syncthreads() call is therefore sufficient both to force thread synchronization and to force any register-cached shared-memory values to be written back to the shared array.

By the way, if that reduction at the end of your code is a performance concern, you could consider using a parallel reduction method to speed it up.
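For example, a minimal sketch of a tree reduction over the same products array (assuming THREADS_PER_ACTION is a power of two) might look like this, replacing the serial loop in thread 0:

// after the first __syncthreads() in your code
for (int stride = THREADS_PER_ACTION / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride)
        products[threadIdx.x] += products[threadIdx.x + stride];
    __syncthreads();  // every thread reaches this, so the barrier is safe
}
// products[0] now holds the block-wide sum
if (threadIdx.x == 0) {
    float globalSum = products[0];
    // ... use globalSum ...
}

This computes the sum in log2(THREADS_PER_ACTION) steps instead of a serial loop, and the __syncthreads() between steps again provides both the barrier and the memory fence that the inter-thread accesses require.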


https://stackoverflow.com/questions/15331009/when-to-use-volatile-with-shared-cuda-memory
