The kernel source for test_barrier is shown below.
"__kernel void compute_sum(__global int *a, int n, __global int *tmp_sum, __global int *sum)\n"
"{\n"
" int tid = get_local_id(0);\n"
" int lsize = get_local_size(0);\n"
" int i;\n"
"\n"
" tmp_sum[tid] = 0;\n"
" for (i=tid; i<n; i+=lsize)\n"
" tmp_sum[tid] += a[i];\n"
" \n"
" // updated to work for any workgroup size \n"
" for (i=hadd(lsize,1); lsize>1; i = hadd(i,1))\n"
" {\n"
" barrier(CLK_GLOBAL_MEM_FENCE);\n"
" if (tid + i < lsize)\n"
" tmp_sum[tid] += tmp_sum[tid + i];\n"
" lsize = i; \n"
" }\n"
"\n"
" //no barrier is required here because last person to write to tmp_sum[0] was tid 0 \n"
" if (tid == 0)\n"
" *sum = tmp_sum[0];\n"
"}\n";
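For example, let a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] and the local size be 4.
In the first for loop the stride is lsize = 4, so tmp_sum[tid] holds the sum of the elements handled by work-item tid (every fourth element starting at index tid):
tmp_sum[0] = a[0] + a[4] + a[8] = 1 + 5 + 9 = 15
tmp_sum[1] = a[1] + a[5] + a[9] = 2 + 6 + 10 = 18
tmp_sum[2] = a[2] + a[6] + a[10] = 3 + 7 + 11 = 21
tmp_sum[3] = a[3] + a[7] + a[11] = 4 + 8 + 12 = 24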
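The strided accumulation of the first loop can be checked on the host with a minimal C sketch. The helper name strided_partial_sum is hypothetical and not part of the test; it only mirrors what one work-item does in the kernel's first loop.

```c
#include <assert.h>

/* Hypothetical helper (not part of the original test): emulates the first
 * loop of compute_sum for a single work-item with local id `tid`.
 * It adds a[tid], a[tid + lsize], a[tid + 2 * lsize], ... while the index
 * stays below n. */
static int strided_partial_sum(const int *a, int n, int tid, int lsize)
{
    assert(lsize > 0);            /* a zero stride would never terminate */

    int acc = 0;
    for (int i = tid; i < n; i += lsize)
        acc += a[i];
    return acc;
}
```

With a = [1, ..., 12] and lsize = 4, calling strided_partial_sum(a, 12, tid, 4) for tid = 0..3 gives 15, 18, 21 and 24.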
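In the second loop, i starts at hadd(lsize, 1) = (4 + 1) >> 1 = 2.
barrier(CLK_GLOBAL_MEM_FENCE) makes every work-item in the work-group wait until all of them have finished the previous step and their writes to tmp_sum are visible, before the next step begins.
lsize = 4, i = 2: tmp_sum[0] += tmp_sum[2] and, in parallel, tmp_sum[1] += tmp_sum[3]
lsize = 2, i = 1: tmp_sum[0] += tmp_sum[1]
lsize = 1: the loop exits
Work-item 0 then stores *sum = tmp_sum[0], the sum of all elements of a (here 1 + 2 + ... + 12 = 78).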
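The whole kernel can be emulated sequentially on the host to check the trace above, including work-group sizes that are not powers of two. This is only a sketch under assumptions: hadd_int and emulate_compute_sum are hypothetical names, a single work-group is assumed, and each pass of the outer reduction loop stands in for one barrier-separated step.

```c
#include <assert.h>

/* Hypothetical stand-in for OpenCL's hadd(x, y), which returns (x + y) >> 1.
 * Assumes non-negative inputs whose sum does not overflow int. */
static int hadd_int(int x, int y)
{
    return (x + y) >> 1;
}

/* Hypothetical host-side emulation of compute_sum for one work-group of
 * `lsize` work-items. `tmp_sum` must have room for `lsize` ints. */
static int emulate_compute_sum(const int *a, int n, int *tmp_sum, int lsize)
{
    assert(lsize > 0);

    /* Phase 1: every work-item builds its strided partial sum. */
    for (int tid = 0; tid < lsize; ++tid) {
        tmp_sum[tid] = 0;
        for (int i = tid; i < n; i += lsize)
            tmp_sum[tid] += a[i];
    }

    /* Phase 2: fold the upper part of tmp_sum onto the lower part until one
     * value is left. Each pass corresponds to one barrier-separated step. */
    for (int i = hadd_int(lsize, 1); lsize > 1; i = hadd_int(i, 1)) {
        /* The inner loop replaces the work-items that run in parallel.
         * Running them in order is safe here: slots written have index
         * below lsize - i, slots read have index at least i, and
         * lsize - i <= i because i is lsize / 2 rounded up. */
        for (int tid = 0; tid + i < lsize; ++tid)
            tmp_sum[tid] += tmp_sum[tid + i];
        lsize = i;
    }

    return tmp_sum[0];
}
```

For a = [1, ..., 12], emulate_compute_sum(a, 12, tmp, 4) returns 78, and so does a call with lsize = 5, where the active sizes shrink as 5, 3, 2, 1.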