CUDA——warp分支发散问题

最新推荐文章于 2024-04-02 10:44:58 发布

UCAS_HMM

最新推荐文章于 2024-04-02 10:44:58 发布

阅读量775

点赞数 1

分类专栏：踩坑记录 CUDA

本文链接：https://blog.csdn.net/UCAS_HMM/article/details/127485991

版权

CUDA 分支效率 SIMT nvcc优化 PTX代码

关键词由CSDN通过智能技术生成

CUDA 同时被 2 个专栏收录

10 篇文章 3 订阅

订阅专栏

踩坑记录

7 篇文章 0 订阅

订阅专栏

前言

对于存在if-else分支语句的kernel来说，warp极容易在执行的时候出现发散，即同一个warp内的一部分thread执行if分支，另一部分执行else分支。但是由于SIMT的设计，同一个warp内的thread会依次执行所有的分支，因此不满足条件的部分thread会被设定为inactive，从而造成warp执行效率低。

__global__ void kernel1(int arr)
{
	int gid = threadIdx.x + blockDim.x * blockIdx.x;
	if(gid % 2 == 0)
		arr[gid] = 0;
	else
		arr[gid] = 1;
}

对于这个kernel来说，同一个warp内，ID为偶数的执行if分支，ID为奇数的执行else分支，因此根据branch efficiency的定义：

$\begin{equation} \text{branch efficiency} = \frac{\text{branches} - \text{Divergent branches}}{\text{branches}} \end{equation}$

因此，kernel1的分支效率应该为50%。

测试

[mmhe@k057 Test]$ nvcc -arch=sm_70 test.cu -o test
[mmhe@k057 Test]$ nvprof --metrics branch_efficiency ./test 
==53611== NVPROF is profiling process 53611, command: ./test
==53611== Profiling application: ./test
==53611== Profiling result:
==53611== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                         branch_efficiency                         Branch Efficiency     100.00%     100.00%     100.00%

很惊讶的发现效率是100%。所以说nvcc在默认情况下是会对这种存在明显分支的结构进行优化的。我们禁止nvcc的自动代码优化，观察测试结果：

[mmhe@k057 Test]$ nvcc -g -G -arch=sm_70 test.cu -o test
[mmhe@k057 Test]$ nvprof --metrics branch_efficiency ./test 
==53713== NVPROF is profiling process 53713, command: ./test
==53713== Profiling application: ./test
==53713== Profiling result:
==53713== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                         branch_efficiency                         Branch Efficiency      85.71%      85.71%      85.71%

可以看到，尽管禁止nvcc优化代码，但是这个效率依然是要高于50%。这说明它在内部还是执行了一定程度的优化。

PTX代码

为了更加深入的探索原因，我们查看上述代码生成的PTX文件。通过编译指令：

nvcc test.cu -o test -gencode=arch=compute_70,code=\"sm_70,compute_70\"
cuobjdump -ptx test > test.ptx

打开test.ptx代码：


Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_70
code version = [6,3]
producer = cuda
host = linux
compile_size = 64bit
compressed

.version 6.3
.target sm_70
.address_size 64

.visible .entry _Z7kernel1Pi(
.param .u64 _Z7kernel1Pi_param_0
)
{
.reg .b32 %r<6>;
.reg .b64 %rd<5>;

ld.param.u64 %rd1, [_Z7kernel1Pi_param_0];			;- 加载核函数参数到rd1
cvta.to.global.u64 %rd2, %rd1;									;- 将地址转到global
mov.u32 %r1, %tid.x;													;- 
mov.u32 %r2, %ntid.x;
mov.u32 %r3, %ctaid.x;
mad.lo.s32 %r4, %r3, %r2, %r1;
mul.wide.s32 %rd3, %r4, 4;
add.s64 %rd4, %rd2, %rd3;
and.b32 %r5, %r4, 1;
st.global.u32 [%rd4], %r5;
ret;
}

PTX这一块目前暂时不怎么理解，需要仔细研读PTX官方文档。