Cooperative Groups: More Flexible CUDA Thread Synchronization

1. Introduction

__syncthreads() only synchronizes threads within a single block.
Cooperative Groups, in contrast, supports synchronization across an entire grid and even across multiple GPU devices.
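
As a quick illustration of that flexibility, here is a minimal sketch (the kernel name blockAndTileSync is made up) showing that the cooperative_groups header lets you name the group you synchronize, from a whole thread block, which is equivalent to __syncthreads(), down to a 32-thread tile:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void blockAndTileSync() {
    // The whole thread block as an explicit group;
    // block.sync() is equivalent to __syncthreads().
    cg::thread_block block = cg::this_thread_block();
    block.sync();

    // A statically sized 32-thread tile carved out of the block,
    // allowing synchronization at a finer granularity.
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);
    tile.sync();
}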

The corresponding launch API is cudaLaunchCooperativeKernel, whose signature and parameters are:

template <class T>
__host__ cudaError_t cudaLaunchCooperativeKernel ( const T* func, dim3 gridDim, dim3 blockDim, void** args, size_t sharedMem = 0, cudaStream_t stream = 0 ) [inline]

Launches a device function.

Parameters:
    func      - Device function symbol
    gridDim   - Grid dimensions
    blockDim  - Block dimensions
    args      - Arguments
    sharedMem - Shared memory (defaults to 0)
    stream    - Stream identifier (defaults to NULL)

cudaLaunchCooperativeKernel launches the kernel func on a grid of gridDim (gridDim.x × gridDim.y × gridDim.z) blocks, each block containing blockDim (blockDim.x × blockDim.y × blockDim.z) threads.
[Currently a block may contain at most 1024 threads. In addition, a cooperative launch requires every block of the grid to be resident on the device at the same time, so the grid size is bounded by occupancy (cudaOccupancyMaxActiveBlocksPerMultiprocessor times the number of SMs) rather than by the ordinary per-dimension grid limits.]
The device running the kernel must report a non-zero value for the device attribute cudaDevAttrCooperativeLaunch.
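
A hedged host-side sketch of such a launch is shown below (the kernel cooperativeKernel, its argument, and the block size of 256 are illustrative; error checking is omitted). It checks the device attribute, sizes the grid by occupancy so all blocks can be co-resident, and then calls cudaLaunchCooperativeKernel:

#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel; any __global__ function can be launched this way.
__global__ void cooperativeKernel(int *data) {
    if (threadIdx.x == 0 && blockIdx.x == 0) { data[0] = 1; }
}

int main() {
    const int device = 0;
    cudaSetDevice(device);

    // 1. The device must report a non-zero cudaDevAttrCooperativeLaunch.
    int supportsCoopLaunch = 0;
    cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, device);
    if (!supportsCoopLaunch) {
        printf("Cooperative launch is not supported on this device.\n");
        return 1;
    }

    // 2. Size the grid so that all blocks can be resident at the same time.
    int numBlocksPerSm = 0, numSms = 0;
    const int blockSize = 256;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, cooperativeKernel, blockSize, 0);
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, device);

    // 3. Launch through cudaLaunchCooperativeKernel instead of <<<>>>.
    int *data = nullptr;
    cudaMalloc(&data, sizeof(int));
    void *kernelArgs[] = { &data };
    cudaLaunchCooperativeKernel((void *)cooperativeKernel,
                                dim3(numBlocksPerSm * numSms), dim3(blockSize),
                                kernelArgs, /*sharedMem=*/0, /*stream=*/0);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}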

2. Multi-block thread synchronization

When the cooperative-groups approach is used, the kernel must be launched with cudaLaunchCooperativeKernel rather than the usual <<<>>> syntax.
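
For comparison with the bitfield approach below, here is a minimal device-side sketch of a grid-wide barrier using cooperative groups (the kernel name gridSyncKernel and the buffer layout are assumptions; data must hold one element per thread in the grid). It only works when launched through cudaLaunchCooperativeKernel, and grid-wide sync requires compiling with nvcc -rdc=true:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void gridSyncKernel(int *data) {
    cg::grid_group grid = cg::this_grid();
    const auto tid = grid.thread_rank();

    // Phase 1: every thread writes its own element.
    data[tid] = static_cast<int>(tid);

    // Grid-wide barrier: waits for all threads of all blocks.
    grid.sync();

    // Phase 2: it is now safe to read an element written by another block.
    const auto next = (tid + 1) % grid.size();
    const int neighbour = data[next];

    // Second barrier so every read finishes before anyone overwrites data.
    grid.sync();
    data[tid] = neighbour;
}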

Multi-block synchronization can also be implemented in a simpler way with atomics on a bitfield, for example:

#include <cstdio>

// A global variable with 64 bits can track up to 64 blocks;
// use an array if you need to track more blocks.
__device__ unsigned long long CompleteMask;

__global__ void doWork() {
    // Clear the mask (safer: set it from the host before the launch, see below,
    // since a late-starting block could clear bits already set by others).
    atomicAnd(&CompleteMask, 0ULL);
    //do lots of work

    // Make sure every thread of this block has finished its work
    // before the block reports completion.
    __syncthreads();

    // Bit mask with one bit set per launched block (gridDim.x <= 64).
    const unsigned long long SollMask =
        (gridDim.x >= 64) ? ~0ULL : (1ULL << gridDim.x) - 1;
    if (threadIdx.x == 0) {
        // Set this block's bit, then spin until every block has set its bit.
        while (atomicOr(&CompleteMask, 1ULL << blockIdx.x) != SollMask) { /*do nothing*/ }
    }
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Print a single line for the entire process\n");
    }
}
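
Since this version does not use cooperative groups, doWork can be launched with the ordinary <<<>>> syntax. A hedged host-side sketch (the grid and block sizes are illustrative) that clears the mask before the launch, avoiding the race inherent in clearing it inside the kernel:

const unsigned long long zero = 0ULL;
cudaMemcpyToSymbol(CompleteMask, &zero, sizeof(zero));

// Caveat: all 48 blocks must be resident on the GPU at the same time,
// otherwise the spin loop in doWork() never terminates.
doWork<<<48, 256>>>();
cudaDeviceSynchronize();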

The code above can synchronize at most 64 blocks. To track more blocks, keep the bits in an array of masks (or simply count completed blocks with atomicAdd); the variant below uses an array of 32-bit words plus a super mask, for example:

// An array of 32-bit words tracks one bit per block (32 blocks per word);
// enlarge the array for more than 64 blocks. The super mask tracks which
// words are already complete.
__device__ unsigned int CompleteMask[2];
__device__ unsigned int CompleteSuperMask;

__global__ void doWork() {
    // Clear the masks (safer to do from the host before the launch).
    for (auto i = 0; i < 2; i++) { atomicAnd(&CompleteMask[i], 0u); }
    atomicAnd(&CompleteSuperMask, 0u);
    //do lots of work

    // Make sure the whole block has finished before it reports completion.
    __syncthreads();

    const auto words = (gridDim.x + 31) / 32;  // number of 32-bit words in use
    const auto b = blockIdx.x / 32;            // word this block belongs to
    const auto blocksInWord = min(gridDim.x - 32 * b, 32u);

    // Expected final value of this block's word and of the super mask.
    const unsigned int SollMask      = (blocksInWord == 32) ? ~0u : (1u << blocksInWord) - 1;
    const unsigned int SollSuperMask = (words == 32)        ? ~0u : (1u << words) - 1;

    // Set this block's bit, then spin until all blocks of this word are done...
    while (atomicOr(&CompleteMask[b], 1u << (blockIdx.x % 32)) != SollMask) { /*do nothing*/ }
    // ...then mark the word complete and spin until all words are done.
    while (atomicOr(&CompleteSuperMask, 1u << b) != SollSuperMask) { /*do nothing*/ }

    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Print a single line for the entire process\n");
    }
}

