Guide into OpenMP: Easy multithreading programming for C++
https://bisqwit.iki.fi/story/howto/openmp/



Abstract

This document attempts to give a quick introduction to OpenMP (as of version 4.5), a simple C/C++/Fortran compiler extension that allows you to add parallelism to existing source code without significantly rewriting it.
In this document, we concentrate on the C++ language in particular, and use GCC to compile the examples.

By Joel Yliluoma, September 2007; last update in June 2016 for OpenMP 4.5

Preface: Importance of multithreading

As CPU speeds no longer improve as significantly as they did before, multicore systems are becoming more popular.
To harness that power, it is becoming important for programmers to be knowledgeable in parallel programming — making a program execute multiple things simultaneously.

This document attempts to give a quick introduction to OpenMP, a simple C/C++/Fortran compiler extension that allows you to add parallelism to existing source code without having to significantly rewrite it.

Support in different compilers

  1. GCC (GNU Compiler Collection) supports OpenMP 4.5 since version 6.1, OpenMP 4.0 since version 4.9, OpenMP 3.1 since version 4.7, OpenMP 3.0 since version 4.4, and OpenMP 2.5 since version 4.2. Add the commandline option -fopenmp to enable it. OpenMP offloading is supported for Intel MIC targets only (Intel Xeon Phi KNL + emulation) since version 5.1, and to NVidia (NVPTX) targets since version 7 or so.
  2. Clang++ supports OpenMP 4.5 since version 3.9 (without offloading), OpenMP 4.0 since version 3.8 (for some parts), and OpenMP 3.1 since version 3.7. Add the commandline option -fopenmp to enable it.
  3. Solaris Studio supports OpenMP 4.0 since version 12.4, and OpenMP 3.1 since version 12.3. Add the commandline option -xopenmp to enable it.
  4. Intel C Compiler (icc) supports OpenMP 4.5 since version 17.0, OpenMP 4.0 since version 15.0, OpenMP 3.1 since version 12.1, OpenMP 3.0 since version 11.0, and OpenMP 2.5 since version 10.1. Add the commandline option -openmp to enable it. Add the -openmp-stubs option instead to enable the library without actual parallel execution.
  5. Microsoft Visual C++ (cl) supports OpenMP 2.0 since version 2005. Add the commandline option /openmp to enable it.

Introduction to OpenMP in C++

OpenMP consists of a set of compiler #pragmas that control how the program works. The pragmas are designed so that even if the compiler does not support them, the program will still yield correct behavior, but without any parallelism.
Here are two simple example programs demonstrating OpenMP.

You can compile them like this:

  g++ tmp.cpp -fopenmp

How to use OpenMP with CMake:

cmake_minimum_required (VERSION 2.8)
project (TEST)


set(USE_OPEN_MP TRUE)
#set(USE_OPEN_MP FALSE)

set(CMAKE_BUILD_TYPE "Debug")
set(CMAKE_CXX_FLAGS_DEBUG "$ENV{CXXFLAGS} -O0 -Wall -g -ggdb -DDEBUG")
set(CMAKE_CXX_FLAGS_RELEASE "$ENV{CXXFLAGS} -O3 -Wall")

set(EXECUTABLE_OUTPUT_PATH ${PROJECT_BINARY_DIR})

if(USE_OPEN_MP)
    find_package(OpenMP REQUIRED)
    if(OPENMP_FOUND)
        message("OPENMP FOUND")
        set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
        set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${OpenMP_EXE_LINKER_FLAGS}")
    endif()
endif(USE_OPEN_MP)


aux_source_directory(${PROJECT_SOURCE_DIR} DIR_SRC)
include_directories(${PROJECT_SOURCE_DIR}/src)
#include_directories(${PROJECT_SOURCE_DIR}/src)
link_directories(${PROJECT_SOURCE_DIR}/lib)

add_executable(main ${DIR_SRC})

Example: Initializing a table in parallel (multiple threads)

This code divides the table initialization into multiple threads, which are run simultaneously. Each thread initializes a portion of the table.

 #include <cmath>
  int main()
  {
    const int size = 256;
    double sinTable[size];
    
    #pragma omp parallel for
    for(int n=0; n<size; ++n)
      sinTable[n] = std::sin(2 * M_PI * n / size);
  
    // the table is now initialized
  }

A version with timing output added: single-threaded 15 ms, with OpenMP 6 ms:

  #include <cmath>
  #include <cstdio>
  #include <sys/time.h>

  long long GetMillsTime()
  {
    // current wall-clock time in milliseconds
    struct timeval tp;
    gettimeofday(&tp, NULL);

    // cast tp.tv_sec to long long before multiplying,
    // so the result does not overflow on 32-bit machines
    return (long long)tp.tv_sec * 1000 + tp.tv_usec / 1000;
  }

  int main()
  {
    const int size = 256000;
    double sinTable[size];

    long long start_time = GetMillsTime();

    #pragma omp parallel for
    for(int n=0; n<size; ++n)
      sinTable[n] = std::sin(2 * M_PI * n / size);

    long long end_time = GetMillsTime();
    long long elapsed_time = end_time - start_time;

    // the table is now initialized
    printf("%lld\n", elapsed_time);
    return 0;
  }

Example: Initializing a table in parallel (single thread, SIMD)

This version requires compiler support for at least OpenMP 4.0, and the use of a parallel floating point library such as AMD ACML or Intel SVML (which can be used in GCC with e.g. -mveclibabi=svml).

 #include <cmath>
  int main()
  {
    const int size = 256;
    double sinTable[size];
    
    #pragma omp simd
    for(int n=0; n<size; ++n)
      sinTable[n] = std::sin(2 * M_PI * n / size);
  
    // the table is now initialized
  }

A version with timing output added: single-threaded 16 ms, with OpenMP 13 ms:

  #include <cmath>
  #include <cstdio>
  #include <sys/time.h>

  long long GetMillsTime()
  {
    // current wall-clock time in milliseconds
    struct timeval tp;
    gettimeofday(&tp, NULL);

    // cast tp.tv_sec to long long before multiplying,
    // so the result does not overflow on 32-bit machines
    return (long long)tp.tv_sec * 1000 + tp.tv_usec / 1000;
  }

  int main()
  {
    const int size = 256000;
    double sinTable[size];

    long long start_time = GetMillsTime();

    #pragma omp simd
    for(int n=0; n<size; ++n)
      sinTable[n] = std::sin(2 * M_PI * n / size);

    long long end_time = GetMillsTime();
    long long elapsed_time = end_time - start_time;

    // the table is now initialized
    printf("%lld\n", elapsed_time);
    return 0;
  }

Example: Initializing a table in parallel (multiple threads on another device)

OpenMP 4.0 added support for offloading code to different devices, such as a GPU. Therefore there can be three layers of parallelism in a single program: a single thread processing multiple data items; multiple threads running simultaneously; and multiple devices running the same program simultaneously.

#include <cmath>
  int main()
  {
    const int size = 256;
    double sinTable[size];
    
    #pragma omp target teams distribute parallel for map(from:sinTable[0:256])
    for(int n=0; n<size; ++n)
      sinTable[n] = std::sin(2 * M_PI * n / size);

    // the table is now initialized
  }
  

Example: Calculating the Mandelbrot fractal in parallel (host computer)

This program calculates the classic Mandelbrot fractal at a low resolution and renders it with ASCII characters, calculating multiple pixels in parallel.

With OpenMP: 1296 ms; single-threaded: 4401 ms.

#include <complex>
 #include <cstdio>
 
 typedef std::complex<double> complex;
 
 int MandelbrotCalculate(complex c, int maxiter)
 {
     // iterates z = z*z + c until |z| >= 2 or maxiter is reached,
     // returns the number of iterations.
     complex z = c;
     int n=0;
     for(; n<maxiter; ++n)
     {
         if( std::abs(z) >= 2.0) break;
         z = z*z + c;
     }
     return n;
 }
 int main()
 {
     const int width = 78, height = 44, num_pixels = width*height;
     
     const complex center(-.7, 0), span(2.7, -(4/3.0)*2.7*height/width);
     const complex begin = center-span/2.0;//, end = center+span/2.0;
     const int maxiter = 100000;
   
   #pragma omp parallel for ordered schedule(dynamic)
     for(int pix=0; pix<num_pixels; ++pix)
     {
         const int x = pix%width, y = pix/width;
         
         complex c = begin + complex(x * span.real() / (width +1.0),
                                     y * span.imag() / (height+1.0));
         
         int n = MandelbrotCalculate(c, maxiter);
         if(n == maxiter) n = 0;
         
       #pragma omp ordered
         {
           char c = ' ';
           if(n > 0)
           {
               static const char charset[] = ".,c8M@jawrpogOQEPGJ";
               c = charset[n % (sizeof(charset)-1)];
           }
           std::putchar(c);
           if(x+1 == width) std::puts("|");
         }
     }
 }

This program can be improved in many different ways, but it is left simple for the sake of an introductory example.

Discussion

As you can see, there is very little in the program that indicates that it runs in parallel. If you remove the #pragma lines, the result is still a valid C++ program that runs and does the expected thing.
Only when the compiler interprets those #pragma lines does it become a parallel program. It really does calculate N values simultaneously, where N is the number of threads. In GCC, libgomp determines that from the number of processors.

By C and C++ standards, if the compiler encounters a #pragma that it does not support, it will ignore it. So adding the OMP statements can be done safely[1] without breaking compatibility with legacy compilers.

There is also a runtime library that can be accessed through omp.h, but it is less often needed. If you need it, you can check the #define _OPENMP for conditional compilation in case of compilers that don’t support OpenMP.

[1]: Within the usual parallel programming issues (concurrency, mutual exclusion) of course.
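
For example, here is a minimal sketch of conditionally using the _OPENMP macro and the runtime library (omp_get_max_threads() is part of the standard OpenMP API):

  #include <stdio.h>
  #ifdef _OPENMP
   #include <omp.h>
  #endif

  int main()
  {
  #ifdef _OPENMP
    // compiled with OpenMP enabled: the runtime library is available
    printf("OpenMP enabled, up to %d threads\n", omp_get_max_threads());
  #else
    // the compiler ignored the pragmas; the program is still correct
    printf("Compiled without OpenMP support\n");
  #endif
  }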


The syntax

All OpenMP constructs in C and C++ are indicated with a #pragma omp followed by parameters, ending in a newline. The pragma usually applies only to the statement immediately following it, except for the barrier and flush commands, which do not have associated statements.

The parallel construct

The parallel construct starts a parallel block. It creates a team of N threads (where N is determined at runtime, usually from the number of CPU cores, but may be affected by a few things), all of which execute the next statement (or the next block, if the statement is a {…} -enclosure). After the statement, the threads join back into one.

 #pragma omp parallel
  {
    // Code inside this region runs in parallel.
    printf("Hello!\n");
  }

This code creates a team of threads, and each thread executes the same code. It prints the text “Hello!” followed by a newline, as many times as there are threads in the team created. For a dual-core system, it will output the text twice. (Note: It may also output something like “HeHlellolo”, depending on system, because the printing happens in parallel.) At the }, the threads are joined back into one, as in a non-threaded program.
Internally, GCC implements this by creating a magic function and moving the associated code into that function, so that all the variables declared within that block become local variables of that function (and thus, locals to each thread).
ICC, on the other hand, uses a mechanism resembling fork(), and does not create a magic function. Both implementations are, of course, valid, and semantically identical.

Variables shared from the context are handled transparently, sometimes by passing a reference and sometimes by using register variables which are flushed at the end of the parallel block (or whenever a flush is executed).
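
As a minimal sketch of this behavior (using omp_get_thread_num() from omp.h): variables declared outside the parallel block are shared by the whole team, while variables declared inside the block are private to each thread.

  #include <cstdio>
  #include <omp.h>

  int main()
  {
    int shared_value = 42;                    // declared outside: one copy, shared by all threads
    #pragma omp parallel
    {
      int thread_id = omp_get_thread_num();   // declared inside: one copy per thread
      printf("thread %d sees shared_value=%d\n", thread_id, shared_value);
    }
  }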

Parallelism conditionality clause: if

The parallelism can be made conditional by including an if clause in the parallel command, such as:

  extern int parallelism_enabled;
  #pragma omp parallel for if(parallelism_enabled)
  for(int c=0; c<n; ++c)
    handle(c);

In this case, if parallelism_enabled evaluates to a zero value, the number of threads in the team that processes the for loop will always be exactly one.

Loop construct: for (splits the for-loop across multiple threads; iteration order is not guaranteed)

The for construct splits the for-loop so that each thread in the current team handles a different portion of the loop.

#pragma omp for
 for(int n=0; n<10; ++n)
 {
   printf(" %d", n);
 }
 printf(".\n");

This loop will output each number from 0…9 once. However, it may do it in arbitrary order. It may output, for example:
0 5 6 7 1 8 2 3 4 9.

Note that if your program runs with only a single thread, the output above is the same as if OpenMP were not used.

Internally, the above loop becomes code equivalent to this:

 int this_thread = omp_get_thread_num(), num_threads = omp_get_num_threads();
  int my_start = (this_thread  ) * 10 / num_threads;
  int my_end   = (this_thread+1) * 10 / num_threads;
  for(int n=my_start; n<my_end; ++n)
    printf(" %d", n);

The example above runs in the current team of threads; at program start, the team contains only one thread, the master thread that runs the program.

So each thread gets a different section of the loop, and they execute their own sections in parallel.
Note: #pragma omp for only delegates portions of the loop for different threads in the current team. A team is the group of threads executing the program. At program start, the team consists only of a single member: the master thread that runs the program.

To create a new team of threads, you need to specify the parallel keyword. It can be specified in the surrounding context:

 #pragma omp parallel
 {
  #pragma omp for
  for(int n=0; n<10; ++n) printf(" %d", n);
 }
 printf(".\n");

Because a new team of threads now runs the loop, the output appears in scrambled order, for example:
0 1 2 8 9 3 4 5 6 7.

An equivalent shorthand is to specify it in the pragma itself, as #pragma omp parallel for:

 #pragma omp parallel for
 for(int n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");

You can explicitly specify the number of threads to be created in the team, using the num_threads attribute:

 #pragma omp parallel num_threads(3)
 {
   // This code will be executed by three threads.
   
   // Chunks of this loop will be divided amongst
   // the (three) threads of the current team.
   #pragma omp for
   for(int n=0; n<10; ++n) printf(" %d", n);
 }

Note that OpenMP also works for C. However, in C (before C99) the loop variable cannot be declared inside the for statement itself, so it is declared outside the loop and explicitly marked as private:

 int n;
 #pragma omp for private(n)
 for(n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");

See the “private and shared clauses” section for details.
In OpenMP 2.5, the iteration variable in for must be a signed integer variable type. In OpenMP 3.0, it may also be an unsigned integer variable type, a pointer type or a constant-time random access iterator type. In the latter case, std::distance() will be used to determine the number of loop iterations.
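
For example, under OpenMP 3.0 a loop over random access iterators can be parallelized directly (a sketch; the std::vector<int> named data is made up for illustration):

  #include <vector>
  #include <cstdio>

  int main()
  {
    std::vector<int> data(10, 1);

    // OpenMP 3.0+: the loop variable may be a random access iterator
    #pragma omp parallel for
    for(std::vector<int>::iterator it = data.begin(); it < data.end(); ++it)
      *it *= 2;

    for(int n = 0; n < (int)data.size(); ++n) printf(" %d", data[n]);
    printf(".\n");
  }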

What are: parallel, for and a team

The difference between parallel, parallel for and for is as follows:

  1. A team is the group of threads that execute currently.
    1.1 At the program beginning, the team consists of a single thread.
    1.2 A parallel construct splits the current thread into a new team of threads for the duration of the next block/statement, after which the team merges back into one.
  2. for divides the work of the for-loop among the threads of the current team. It does not create threads; it only divides the work amongst the threads of the currently executing team.
  3. parallel for is a shorthand for two commands at once: parallel and for. Parallel creates a new team, and for splits that team to handle different portions of the loop.

If your program never contains a parallel construct, there is never more than one thread: the master thread that starts the program and runs it, as in non-threaded programs.
To summarize:
  4. Without the parallel construct, no new team of threads is created; the work runs in the current team. If the current team has only one thread, the result is the same as if OpenMP were not used.
  5. With the parallel construct, a new team of threads is created to do the work, and you can also specify how many threads it contains.

Scheduling

The scheduling algorithm for the for-loop can be explicitly controlled.

 #pragma omp for schedule(static)
 for(int n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");

There are five scheduling types: static, dynamic, guided, auto, and (since OpenMP 4.0) runtime. In addition, there are three scheduling modifiers (since OpenMP 4.5): monotonic, nonmonotonic, and simd.
static is the default schedule, as shown above. Upon entering the loop, each thread independently decides which chunk of the loop it will process.
With the static schedule, the execution order cannot be controlled; each thread works through its chunk independently.

There is also the dynamic schedule:

#pragma omp for schedule(dynamic)
 for(int n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");

In the dynamic schedule, there is no predictable order in which the loop items are assigned to different threads. Each thread asks the OpenMP runtime library for an iteration number, then handles it, then asks for next, and so on. This is most useful when used in conjunction with the ordered clause, or when the different iterations in the loop may take different time to execute.
With the dynamic schedule, the execution order is likewise unpredictable; it is most useful together with the ordered clause, or when loop iterations take varying amounts of time.

The chunk size can also be specified to lessen the number of calls to the runtime library:

 #pragma omp for schedule(dynamic, 3)
 for(int n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");

In this example, each thread asks for an iteration number, executes 3 iterations of the loop, then asks for another, and so on. The last chunk may be smaller than 3, though.

Internally, the loop above becomes code equivalent to this (illustration only, do not write code like this):

  int a,b;
  if(GOMP_loop_dynamic_start(0,10,1, 3, &a,&b))
  {
    do {
      for(int n=a; n<b; ++n) printf(" %d", n);
    } while(GOMP_loop_dynamic_next(&a,&b));
  }

The guided schedule behaves like static, but with static's shortcomings fixed by dynamic-like traits. It is difficult to explain in words; the example program on the original page (which requires libSDL to compile) may explain it better.

The “runtime” option means the runtime library chooses one of the scheduling options at runtime at the compiler library’s discretion.

A scheduling modifier can be added to the clause, e.g.: #pragma omp for schedule(nonmonotonic:dynamic). A complete example follows after the list below.

The modifiers are:

  1. monotonic: Each thread executes chunks in an increasing iteration order.
  2. nonmonotonic: Each thread executes chunks in an unspecified order.
  3. simd: If the loop is a SIMD loop, this controls the chunk size for scheduling in a way the compiler considers optimal for the hardware's limitations. This modifier is ignored for non-SIMD loops.
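
For instance, a complete loop using a modifier could look like this (a sketch; requires an OpenMP 4.5 compiler):

  #include <cstdio>

  int main()
  {
    // chunks of 2 iterations, handed out dynamically, in no particular order
    #pragma omp parallel for schedule(nonmonotonic:dynamic, 2)
    for(int n=0; n<10; ++n) printf(" %d", n);
    printf(".\n");
  }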

The ordered clause

The order in which the loop iterations are executed is unspecified, and depends on runtime conditions.
However, it is possible to force that certain events within the loop happen in a predicted order, using the ordered clause.

 #pragma omp for ordered schedule(dynamic)
 for(int n=0; n<100; ++n)
 {
   files[n].compress();

   #pragma omp ordered
   send(files[n]);
 }

This loop “compresses” 100 files with some files being compressed in parallel, but ensures that the files are “sent” in a strictly sequential order.
If the thread assigned to compress file 7 is done but file 6 has not yet been sent, the thread will wait before sending, and before starting to compress another file. The ordered clause in the loop guarantees that there always exists one thread that is handling the lowest-numbered unhandled task.

Each file is compressed and sent exactly once, but the compression may happen in parallel.

There must be exactly one ordered block per ordered loop, no less and no more. In addition, the enclosing for construct must contain the ordered clause.

OpenMP 4.5 added some modifiers and clauses to the ordered construct.

  1. #pragma omp ordered threads means the same as #pragma omp ordered. It means the threads executing the loop execute the ordered regions sequentially in the order of loop iterations.
  2. #pragma omp ordered simd can only be used in a for simd loop.
  3. #pragma omp ordered depend(source) and #pragma omp ordered depend(sink: vector) also exist; a sketch follows after this list.
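
Here is a sketch of the depend form (OpenMP 4.5 doacross loops, assuming compiler support): ordered(1) on the loop declares one level of cross-iteration dependence, depend(sink: i-1) waits for the previous iteration, and depend(source) signals that this iteration's result is ready.

  #include <cstdio>

  int main()
  {
    const int N = 16;
    int b[N], c[N];
    b[0] = 0; c[0] = 0;

    #pragma omp parallel for ordered(1)
    for(int i=1; i<N; ++i)
    {
      b[i] = i * i;                          // independent work: may run in parallel
      #pragma omp ordered depend(sink: i-1)  // wait until iteration i-1 has passed its source point
      c[i] = c[i-1] + b[i];                  // dependent work: executed in iteration order
      #pragma omp ordered depend(source)     // announce that c[i] is now available
    }

    for(int i=0; i<N; ++i) printf(" %d", c[i]);
    printf(".\n");
  }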

The collapse clause

When you have nested loops, you can use the collapse clause to apply the threading to multiple nested iterations.
Example:

 #pragma omp parallel for collapse(2)
 for(int y=0; y<25; ++y)
   for(int x=0; x<80; ++x)
   {
     tick(x,y);
   }

The reduction clause

The reduction clause is a special directive that instructs the compiler to generate code that accumulates values from different loop iterations together in a certain manner. It is discussed in a separate chapter later in this article. Example:

int sum=0;
 #pragma omp parallel for reduction(+:sum)
 for(int n=0; n<1000; ++n) sum += table[n];

Sections

Sometimes it is handy to indicate that “this and this can run in parallel”. The sections construct is just for that.

 #pragma omp sections
 {
   { Work1(); }
   #pragma omp section
   { Work2();
     Work3(); }
   #pragma omp section
   { Work4(); }
 }

This code indicates that any of the tasks Work1, Work2 + Work3, and Work4 may run in parallel, but that Work2 and Work3 must be run in sequence. Each unit of work is done exactly once.
As usual, if the compiler ignores the pragmas, the result is still a correctly running program.

Internally, GCC implements this as a combination of the parallel for and a switch-case construct. Other compilers may implement it differently.
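
For illustration only (not actual compiler output, and assuming the Work1…Work4 functions from the example above), the mapping is roughly of this shape:

  // sections distributed on demand, much like a dynamically scheduled loop
  #pragma omp for schedule(dynamic)
  for(int section = 0; section < 3; ++section)
  {
    switch(section)
    {
      case 0: Work1(); break;
      case 1: Work2(); Work3(); break;
      case 2: Work4(); break;
    }
  }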

Note: #pragma omp sections only delegates the sections for different threads in the current team. To create a team, you need to specify the parallel keyword either in the surrounding context or in the pragma, as #pragma omp parallel sections.
Example:

 #pragma omp parallel sections // starts a new team
 {
   { Work1(); }
   #pragma omp section
   { Work2();
     Work3(); }
   #pragma omp section
   { Work4(); }
 }

or

#pragma omp parallel // starts a new team
 {
   //Work0(); // this function would be run by all threads.
   
   #pragma omp sections // divides the team into sections
   { 
     // everything herein is run only once.
     { Work1(); }
     #pragma omp section
     { Work2();
       Work3(); }
     #pragma omp section
     { Work4(); }
   }
   
   //Work5(); // this function would be run by all threads.
 }

The simd construct (OpenMP 4.0+) (similar to matrix multiplication, where different rows and columns can be computed concurrently)

OpenMP 4.0 added explicit SIMD parallelism (Single-Instruction, Multiple-Data). SIMD means that multiple calculations will be performed simultaneously by the processor, using special instructions that perform the same calculation to multiple values at once. This is often more efficient than regular instructions that operate on single data values. This is also sometimes called vector parallelism or vector operations (and is in fact the preferred term in OpenACC).
There are two use cases for the simd construct.

Firstly, #pragma omp simd can be used to declare that a loop will be utilizing SIMD.

 float a[8], b[8];
 ...
 #pragma omp simd
 for(int n=0; n<8; ++n) a[n] += b[n];

Secondly, #pragma omp declare simd can be used to indicate a function or procedure that is explicitly designed to take advantage of SIMD parallelism. The compiler may create multiple versions of the same function that use different parameter passing conventions for different CPU capabilities for SIMD processing.

  #pragma omp declare simd aligned(a,b:16)
  void add_arrays(float *__restrict__ a, float *__restrict__ b)
  {
    #pragma omp simd aligned(a,b:16)
    for(int n=0; n<8; ++n) a[n] += b[n];
  }	
  

Without the pragma, the function will use the default non-SIMD-aware ABI, even though the function itself may do calculation using SIMD.
Since compilers of today attempt to use SIMD regardless of OpenMP simd directives, the simd directive can essentially be thought of as a directive to the compiler, saying: “Try harder”.

The collapse clause

The collapse clause can be added to bind the SIMDness to multiple nested loops. The example code below will direct the compiler to attempt to generate instructions that calculate 16 values simultaneously, if at all possible.

  #pragma omp simd collapse(2)
  for(int i=0; i<4; ++i)
    for(int j=0; j<4; ++j)
      a[j*4+i] += b[i*4+j];

The reduction clause

The reduction clause can be used with SIMD just like with parallel loops.

 int sum=0;
 #pragma omp simd reduction(+:sum)
 for(int n=0; n<1000; ++n) sum += table[n];

The aligned clause

The aligned attribute hints to the compiler that each element listed is aligned to the given number of bytes. Use this attribute only if you are sure that the alignment is guaranteed; it lets the compiler generate shorter and faster code.
The attribute can be used in both the function declaration, and in the individual SIMD statements.

  #pragma omp declare simd aligned(a,b:16)
  void add_arrays(float *__restrict__ a, float *__restrict__ b)
  {
    #pragma omp simd aligned(a,b:16)
    for(int n=0; n<8; ++n) a[n] += b[n];
  }	

The safelen clause

While the restrict keyword in C tells the compiler that it can assume that two pointers will not address the same data (and thus it is safe to reorder reads and writes), the safelen clause in OpenMP provides more fine-grained control over pointer aliasing.
In the example code below, the compiler is informed that a[x] and b[y] are independent as long as the difference between x and y is smaller than 4. In reality, the clause controls the upper limit of concurrent loop iterations. It means that only 4 items can be processed concurrently at most. The actual concurrency may be smaller, and depends on the compiler implementation and hardware limits.

  #pragma omp declare simd
  void add_arrays(float* a, float* b)
  {
    #pragma omp simd aligned(a,b:16) safelen(4)
    for(int n=0; n<8; ++n) a[n] += b[n];
  }

The simdlen clause (OpenMP 4.5+)

The simdlen clause can be added to a declare simd construct to limit how many elements of an array are passed in SIMD registers instead of using the normal parameter passing convention.
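
A minimal sketch (the function name square is made up for illustration; assumes an OpenMP 4.5 compiler):

  // request SIMD variants that process 8 arguments per invocation
  #pragma omp declare simd simdlen(8)
  float square(float x)
  {
    return x * x;
  }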

The uniform clause

The uniform clause declares one or more arguments to have an invariant value for all concurrent invocations of the function in the execution of a single SIMD loop.
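
A minimal sketch (the function name scale_value is made up for illustration):

  // 'scale' has the same value in every concurrent invocation within one SIMD loop,
  // so it can be passed in a scalar register instead of a vector register
  #pragma omp declare simd uniform(scale)
  float scale_value(float x, float scale)
  {
    return x * scale;
  }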

The linear clause (OpenMP 4.5+)

The linear clause is similar to the firstprivate clause discussed later in this article.

Consider this example code:

  #include <stdio.h>

  int b = 10;
  int main()
  {
    int array[8];

    #pragma omp simd linear(b:2)
    for(int n=0; n<8; ++n) array[n] = b;

    for(int n=0; n<8; ++n) printf("%d\n", array[n]);

    return 0;
  }

Within the simd loop, b behaves as if it were incremented by the step 2 on every iteration, so the program prints 10, 12, 14, 16, 18, 20, 22, 24.

The inbranch and notinbranch clauses

The inbranch clause specifies that the function will always be called from inside a conditional statement of a SIMD loop. The notinbranch clause specifies that the function will never be called from inside a conditional statement of a SIMD loop.
The compiler may use this knowledge to optimize the code.
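
A minimal sketch (the functions clamp and limit_array are made up for illustration):

  // clamp() is only ever called from inside a condition within a simd loop
  #pragma omp declare simd inbranch
  float clamp(float x)
  {
    return x > 1.0f ? 1.0f : x;
  }

  void limit_array(float* a)
  {
    #pragma omp simd
    for(int n=0; n<8; ++n)
      if(a[n] > 1.0f) a[n] = clamp(a[n]);
  }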

The for simd construct (OpenMP 4.0+)

The for and simd constructs can be combined, to divide the execution of a loop into multiple threads, and then execute those loop slices in parallel using SIMD.

  float sum(float* table)
  {
    float result=0;
    #pragma omp parallel for simd reduction(+:result)
    for(int n=0; n<1000; ++n) result += table[n];
    return result;
  }

The task construct (OpenMP 3.0+)

When for and sections are too cumbersome, the task construct can be used. This is only supported in OpenMP 3.0 and later.
These examples are from the OpenMP 3.0 manual:

struct node { node *left, *right; };
extern void process(node* );
void traverse(node* p)
{
    if (p->left)
        #pragma omp task // p is firstprivate by default
        traverse(p->left);
    if (p->right)
        #pragma omp task // p is firstprivate by default
        traverse(p->right);
    process(p);
}

In the next example, we force a postorder traversal of the tree by adding a taskwait directive. Now, we can safely assume that the left and right sons have been executed before we process the current node.

struct node { node *left, *right; };
extern void process(node* );
void postorder_traverse(node* p)
{
    if (p->left)
        #pragma omp task // p is firstprivate by default
        postorder_traverse(p->left);
    if (p->right)
        #pragma omp task // p is firstprivate by default
        postorder_traverse(p->right);
    #pragma omp taskwait
    process(p);
}

The following example demonstrates how to use the task construct to process elements of a linked list in parallel. The pointer p is firstprivate by default on the task construct so it is not necessary to specify it in a firstprivate clause.

struct node { int data; node* next; };
extern void process(node* );
void increment_list_items(node* head)
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            for(node* p = head; p; p = p->next)
            {
            	#pragma omp task
            	process(p); // p is firstprivate by default
            }
        }
    }
}


Offloading support

Offloading means that parts of the program can be executed not only on the CPU of the computer itself, but also in other hardware attached to it, such as on the graphics card.

The declare target and end declare target directives

The declare target and end declare target directives delimit a section of the source code wherein all declarations, whether they are variables or functions/subroutines, are compiled for a device.
Example:

#pragma omp declare target
int x;
void murmur() { x+=5; }
#pragma omp end declare target

This creates one or more versions of “x” and “murmur”: one set exists on the host computer, and a separate set exists on, and can be run on, a device.
These functions and variables are separate copies, and may hold values different from each other.

Variables declared in this manner can be accessed by the device code without separate map clauses.
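
A minimal sketch of using them from a target region (assuming a device and offloading support are available):

  int main()
  {
    #pragma omp target
    {
      murmur();   // runs on the device and updates the device's copy of x
    }
  }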
