A common trick for improving the performance of parallel-reduction-style programs in CUDA is to unroll the last warp. It appears in several of the official samples, including CUDA Radix Sort (Thrust Library), CUDA Parallel Reduction, and scan. The CUDA sample documentation notes:
The included RadixSort class can sort either key-value pairs (with float or unsigned integer keys) or keys only. The optimized code in this sample (and also in reduction and scan) uses a technique known as warp-synchronous programming, which relies on the fact that within a warp of threads running on a CUDA GPU, all threads execute instructions synchronously. The code uses this to avoid __syncthreads() when threads within a warp are sharing data via __shared__ memory. It is important to note that for this to work correctly without race conditions on all GPUs, the shared memory used in these warp-synchronous expressions must be declared volatile. If it is not declared volatile, then in the absence of __syncthreads(), the compiler is free to delay stores to __shared__ memory and keep the data in registers, which will result in incorrect execution.
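To make the pattern concrete, here is a minimal sketch of a block-level sum reduction that unrolls the last warp, in the style of the official reduction sample. The kernel and function names are illustrative, not taken from the samples. Note the volatile qualifier on the shared-memory pointer in warpReduce, exactly as the quoted passage requires. (One caveat: this classic warp-synchronous idiom assumes lockstep execution within a warp; on Volta and later GPUs with independent thread scheduling, __syncwarp() or warp shuffle intrinsics should be used instead.)

```cuda
// Sketch: tree reduction with the last warp unrolled.
// "volatile" prevents the compiler from caching sdata values in
// registers, so each store is visible to the other warp lanes.
__device__ void warpReduce(volatile float *sdata, unsigned int tid) {
    // Last 64 -> 1 elements: a single warp, no __syncthreads() needed
    // under the classic warp-synchronous assumption.
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}

__global__ void reduceSum(const float *g_in, float *g_out, unsigned int n) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;

    // Each thread loads and adds two elements (first add during load).
    sdata[tid] = (i < n ? g_in[i] : 0.0f) +
                 (i + blockDim.x < n ? g_in[i + blockDim.x] : 0.0f);
    __syncthreads();

    // Tree reduction in shared memory, synchronized per step,
    // stopping once only the last warp remains active.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Unrolled last warp: saves the loop overhead and the final
    // __syncthreads() calls.
    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```

Each block writes one partial sum; a second kernel launch (or a host-side pass) reduces the per-block results to the final value.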