OpenACC2.0-routine

wcj0626

Category: OpenACC    Tags: routine

Article link: https://blog.csdn.net/wcj0626/article/details/21122457


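OpenACC 1.0 did not define function calls inside compute regions. A user-defined function could only be used by inlining it, which required adding the compiler option -Minline at build time.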

OpenACC 2.0 adds the routine directive to support calls to user-defined functions. Consider the following example:

#include <stdio.h>
#include <stdlib.h>
#pragma acc routine worker
float sum(int n,float *A)
{
        int i;
        float s=0.0f;
        #pragma acc loop vector reduction(+:s)
        for(i=0;i<n;i++){
                s=s+A[i];
        }
        return s;
}

int main()
{
        float *X,*Y;
        X=(float*)malloc(sizeof(float)*100*200);
        Y=(float*)malloc(sizeof(float)*100);
        int j,i;
        for(j=0;j<100;j++){
                for(i=0;i<200;i++){
                        X[j*200+i]=j;
                }
        }
        #pragma acc parallel copyout(Y[0:100]) copyin(X[0:100*200])
        {
                #pragma acc loop gang
                for(j=0;j<100;j++){
                        Y[j]=sum(200,(X+j*200));
                }
        }

        for(j=0;j<10;j++){
                printf("Y[%d]=%f\n",j,Y[j]);
        }
        return 0;
}

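The example is simple: X is an array of 100 rows by 200 columns, and we want the sum of each row. Each row of X is mapped to OpenACC gang-level parallelism (a block in CUDA terms, a work-group in OpenCL terms), and the summation within a row is then mapped to the worker level through the routine sum, which is declared with the worker clause.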
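Compiling the program produces the following output. The two PGC-W-0155 warnings point at the malloc calls on lines 18 and 19 of routine.c; pgcc reports them when malloc is used without the declaration from <stdlib.h>, as was the case in this build.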

[root@lucas routine]# pgcc -o routine routine.c -acc -Minfo -ta=nvidia,cc20
NOTE: your trial license will expire in 13 days, 8.59 hours.
PGC-W-0155-Pointer value created from a nonlong integral type  (routine.c: 18)
PGC-W-0155-Pointer value created from a nonlong integral type  (routine.c: 19)
sum:
      0, Generating acc routine worker
          9, #pragma acc loop  /* threadIdx.x threadIdx.y */
      0, Generating NVIDIA code
main:
     22, Memory set idiom, loop replaced by call to __c_mset4
     26, Generating copyout(Y[0:100])
         Generating copyin(X[0:20000])
         Accelerator kernel generated
         29, #pragma acc loop gang /* blockIdx.x */
     26, Generating NVIDIA code
PGC/x86-64 Linux 14.2-0: compilation completed with warnings
[root@lucas routine]#

With the environment variable PGI_ACC_TIME set to 1, running the program prints the following:

[root@lucas routine]# ./routine 
Y[0]=0.000000
Y[1]=200.000000
Y[2]=400.000000
Y[3]=600.000000
Y[4]=800.000000
Y[5]=1000.000000
Y[6]=1200.000000
Y[7]=1400.000000
Y[8]=1600.000000
Y[9]=1800.000000

Accelerator Kernel Timing /root/OpenACC/routine/routine.c
  main  NVIDIA  devicenum=0
    time(us): 450
    26: data region reached 1 time
        26: data copyin reached 1 time
             device time(us): total=33 max=33 min=33 avg=33
        34: data copyout reached 1 time
             device time(us): total=14 max=14 min=14 avg=14
    26: compute region reached 1 time
        26: kernel launched 1 time
            grid: [100]  block: [1]
             device time(us): total=403 max=403 min=403 avg=403
            elapsed time(us): total=414 max=414 min=414 avg=414
[root@lucas routine]#

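Note that the routine directive must carry a gang, worker, vector, or seq clause. The clause specifies the context the function is called from and the highest level of work sharing it is allowed to use. For example, a routine with the worker clause may contain worker, vector, and seq loops, but not gang loops. Likewise, such a routine cannot be called from inside a worker or vector loop. A routine that has none of the gang, worker, or vector clauses must have the seq clause; such a routine can be called from anywhere.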
