OpenMP crunching: where OpenMP threads store their private variables - CSDN Blog

Article link: https://blog.csdn.net/jacocheung/article/details/111059503

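Concepts

Threads vs. processes

A process = executable image + execution resources + context (for example the registers and the function stack frames).
Each process has its own virtual address space, that is, its own page table.
A process switch changes the page table, invalidates TLB entries, and leaves the caches cold, so a lot of context is involved and the switch is slow.
A process can contain many threads. They share every resource the process owns except the CPU core they run on; the OS actually schedules CPU time per thread. Each thread has its own private stack (allocated from the process address space, so all threads live in the same address space and therefore share memory).
As a result, a thread switch is much cheaper than a process switch.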

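OpenMP execution model

OpenMP is a shared-memory model: the scheduled entities are threads, and all threads see one address space. Besides OpenMP there are other shared-memory multithreading solutions, such as TBB (originally from Intel) and POSIX threads.

Note that each thread has its own stack, register state, and program counter, so each thread can have its own private variables. Because the address space is shared, a thread could in principle reach another thread's private variables once it had a pointer into that thread's stack, but OpenMP forbids this.
When an OpenMP program starts (at main), a single thread, the master, begins executing. Between two consecutive parallel regions only the master thread runs code. Inside a parallel region a team of threads runs concurrently, and the team size can differ from region to region.
OpenMP consists of compiler directives plus a set of runtime library routines. A compiler without OpenMP support should treat the directives as comments.
Directives include parallel, for, sections, critical, and so on. Clauses such as schedule(), num_threads(), and private() modify the behavior of the construct they are attached to. For example, schedule(static) on a for directive selects static scheduling for that loop.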
```
#pragma omp [directive [clause ...]]
     structured block
```

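An example

Array addition

double a[N], b[N], c[N]; // three shared arrays declared on the master thread's stack
// ...
#pragma omp parallel // use the default number of threads
{
  int numth = omp_get_num_threads(); // number of threads in the current team
  int tid = omp_get_thread_num();    // id of this thread
  int blen = N / numth;              // base block length
  int rem = N % numth;               // the first rem threads take one extra element
  int bstart;
  if(tid < rem){
    blen += 1;
    bstart = blen * tid;
  }
  else{
    bstart = blen * tid + rem;
  }
  int bend = bstart + blen;
  for(int i = bstart; i < bend ; i++){
    a[i] = b[i] + c[i];
  }
}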

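A subtle bug

{
  // serial region
  omp_set_num_threads(10);
  int num_thrds = omp_get_num_threads(); // returns 1 here: no parallel region is active
  #pragma omp parallel num_threads(num_thrds) // so this region runs with only 1 thread
  {

  }
}

omp_set_num_threads() pairs with omp_get_max_threads().

omp_get_num_threads() returns the number of threads in the team executing the current region.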
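The difference between the two query routines can be checked with a small sketch. The helper names `serial_team_size` and `next_region_limit` are made up for illustration, and the `#else` branch is an assumed serial fallback so the file also builds when OpenMP is not enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Assumed serial fallbacks, used only when the compiler has no OpenMP support. */
static void omp_set_num_threads(int n) { (void)n; }
static int omp_get_max_threads(void) { return 1; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Team size as seen from serial code: always 1, whatever was requested. */
int serial_team_size(int requested) {
    omp_set_num_threads(requested);
    return omp_get_num_threads();
}

/* Upper bound on the team size of the next parallel region. */
int next_region_limit(int requested) {
    omp_set_num_threads(requested);
    return omp_get_max_threads();
}
```

Calling `next_region_limit(10)` in place of `omp_get_num_threads()` in the snippet above would give the intended team size.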

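Scope

Every variable has a scope. A private variable is owned independently by each thread (on its own stack). Heap-allocated memory is shared by all threads.

All thread-private variables are stored on the owning thread's stack
- Local variables defined inside a parallel region are private to each thread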
```
void foo(){
  #pragma omp parallel
  {
    int a = 1; // private
  }
}
```

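Shared variables are stored on the heap or on the master thread's stack

void foo(){
  
  int a = 1; // on the master thread's stack
  #pragma omp parallel
  	a += 1; // every thread accesses the master thread's stack, so a is called a shared variable (the unsynchronized update is a data race)
}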

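Data-sharing clauses

When a parallel region starts, clauses can specify the data-sharing attribute of variables that were declared before the region.

shared()

States that the listed variables are shared by all threads in the parallel region.

{  
  int a = 1; // on the master thread's stack
  #pragma omp parallel shared(a)
  	a += 1; // every thread accesses the master thread's stack, so a is called a shared variable
}

Omitting shared has the same effect here, because shared is already the default.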

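private()/firstprivate()

States that every thread in the parallel region gets its own private copy of the variable.
```
{  
  int a = 1; // on the master thread's stack
  #pragma omp parallel private(a)
  	a += 1; // each thread modifies the a on its own stack (which starts out uninitialized)
}
```
private does not initialize the private a from the master thread's a. To initialize the private copies, use the firstprivate clause.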
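A minimal sketch of firstprivate follows; the function name `firstprivate_demo` and the `all_ok` flag are illustrative only. Every thread should observe its private copy starting from the value the variable had before the region.

```c
/* Returns 1 if every thread saw its private copy of a start at seed. */
int firstprivate_demo(int seed) {
    int a = seed;
    int all_ok = 1;
    #pragma omp parallel firstprivate(a) reduction(&&:all_ok)
    {
        a += 1;                              /* touches this thread's copy only */
        all_ok = all_ok && (a == seed + 1);  /* would fail if a were uninitialized */
    }
    /* With OpenMP enabled, the original a still holds seed at this point. */
    return all_ok;
}
```

Replacing firstprivate(a) with private(a) would make the read of a inside the region undefined, which is the pitfall described above.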
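default(none), default(shared)

When a variable's data-sharing attribute is not otherwise determined, it receives the default attribute, and the default clause changes that default. (A variable that appears in neither a private nor a shared clause, and is not const-qualified, takes its attribute from the default clause.)

default(none) means there is no default: the programmer must state every variable's attribute explicitly. default(shared) makes shared the default. (Fortran accepts private | firstprivate | shared | none.) Under the OpenMP 3.x rules, const-qualified variables are predetermined shared and must not be listed explicitly. OpenMP 4.0 and later dropped that rule, so newer compilers (GCC 9 and later, for example) expect such variables to be listed under default(none).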
```
const int n = 10;
const int a = 7;

#pragma omp parallel for default(none) shared(a, n)
for (int i = 0; i < n; i++)
{
    int b = a + i;
    ...
}
```
With a compiler that follows the older rule, this fails with:
```
error: "a" is predetermined "shared" for "shared"
```
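Note that a variable named in a reduction clause must be shared in the enclosing context (even though the implementation works on private copies underneath).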

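How private variables come about, from the programmer's point of view:

The private() clause makes a variable that was shared before the parallel region private
- In effect, each thread creates a copy on its own private stack
- Initialization of private variables
  
  With FIRSTPRIVATE instead of PRIVATE, the private copy is initialized from the shared instance (C++ uses the copy constructor); otherwise it is not initialized (C++ calls the default constructor)
The index variable of a work-shared loop is privatized automatically

int a;
#pragma omp parallel for private(a)
for(int i = 0;i < 10; i++){ // each thread has its own i
    // ...
    a += 10; // each thread now has its own copy of a (uninitialized under private)
}

Variables local to a routine called from a parallel region are private (this applies recursively)
Local variables declared static are shared (they actually live in the data or bss segment)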

ICVs and environment variables

Internally, OpenMP controls its behavior through internal control variables (ICVs).

The following table displays how you can set or get the value of the internal control variables as well as the initial value of the variables.

| Internal Control Variable | To Modify | To Retrieve | Initial Value |
| --- | --- | --- | --- |
| nthreads-var | OMP_NUM_THREADS, omp_set_num_threads, num_threads (note: num_threads overrides the variable only for the parallel region in which you call it) | omp_get_max_threads | Number of processors on machine |
| dyn-var | OMP_DYNAMIC, omp_set_dynamic | omp_get_dynamic | FALSE |
| nest-var | OMP_NESTED, omp_set_nested | omp_get_nested | FALSE |
| run-sched-var | OMP_SCHEDULE | none | static |
| def-sched-var | none | none | static |

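Mechanism:

Once the program starts, every ICV has a default value. Precedence: initial value < environment variable < omp_set_*() < clause

A setting made through a clause lasts only for the region covered by that directive.

ICV reference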

Common directives and clauses

parallel region

#pragma omp parallel 
{
  #pragma omp master
  printf("active threads num is %d\n",omp_get_num_threads());
  // ...
}

Semantically, this starts multiple threads that each execute the statements inside the parallel construct.

There is an implicit barrier at the end of the region.

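Synchronization directives

master
Inside a parallel region, only the master thread executes the block; the other threads skip it
- There is no implicit barrier at the end of a master region, so the other threads move straight on
critical
Only one thread at a time executes the block; the others block until it is free

This naturally brings to mind the reduction clause and the atomic directive

global_val = 0;
#pragma omp parallel reduction(+:global_val) // note: never also specify private(global_val)
	global_val+=local_foo();

reduction is a clause and only qualifies variables, whereas critical is a directive.

With reduction, each thread computes on a local variable and the partial results are combined at the end of the region.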
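To make that description concrete, here is a sketch that sums 1..n once with the reduction clause and once with a hand-written equivalent (a private partial sum followed by one critical update per thread). The function names are illustrative, and the manual version only approximates what an implementation may do internally.

```c
/* Sum of 1..n using the reduction clause. */
long sum_reduction(long n) {
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}

/* Hand-written equivalent: private partial sums, combined once per thread. */
long sum_manual(long n) {
    long total = 0;
    #pragma omp parallel
    {
        long local = 0;              /* private: declared inside the region */
        #pragma omp for nowait
        for (long i = 1; i <= n; i++) {
            local += i;
        }
        #pragma omp critical
        total += local;              /* one serialized update per thread */
    }
    return total;
}
```

Both functions return 5050 for n = 100; the reduction form is shorter and leaves the combining strategy to the runtime.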

atomic

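critical serializes execution strictly (through a lock) and is more general. atomic only requires that the update of one memory location be atomic, which leaves the compiler more room to optimize. atomic can be followed by a single statement only.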
```
#pragma omp atomic
g_qCount++;
```
```
#pragma omp critical
g_qCount++;
```
Both variants give the same result, but atomic may be more efficient.

For g_qCount += foo(), atomic does not protect the execution of foo() on the right-hand side.
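A short sketch of both points, with illustrative function names: `count_even` protects a plain increment, and `sum_squares` evaluates the right-hand side first so that only the final update is atomic.

```c
/* Counts the even numbers in [0, n) with one shared counter. */
long count_even(long n) {
    long count = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        if (i % 2 == 0) {
            #pragma omp atomic
            count++;                 /* only this update is atomic */
        }
    }
    return count;
}

static long square(long x) { return x * x; }

/* Sum of i*i for i in [0, n). */
long sum_squares(long n) {
    long total = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        long term = square(i);       /* runs unprotected and concurrently */
        #pragma omp atomic
        total += term;
    }
    return total;
}
```

If `square` had side effects on shared state, those would still need their own protection; a critical section would cover them, atomic would not.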
barrier
```
#pragma omp barrier
```
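Every thread waits until all threads of the team have reached this point
taskwait
flush

Memory consistency. For example, after a single region writes some memory, the other threads need to observe that write.
ordered

Makes the ordered block execute strictly in increasing loop-index order. Note the difference from critical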

#pragma omp parallel for ordered
for ( ... i ... ) {
  ... f(i) ...
#pragma omp ordered // only legal inside a loop that carries the ordered clause
  printf("something with %d\n",i);
} // note: each iteration may contain only one omp ordered region!

#pragma omp parallel for shared(y) ordered
for ( ... i ... ) {
  int x = f(i);
#pragma omp ordered 
  y[i] += f(x);
  z[i] = g(y[i]);
}

tid  List of     Timeline
     iterations
0    0,1,2       ==o==o==o
1    3,4,5       ==.......o==o==o
2    6,7,8       ==..............o==o==o

tid  List of     Timeline
     iterations
0    0,3,6       ==o==o==o
1    1,4,7       ==.o==o==o
2    2,5,8       ==..o==o==o
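The second timeline corresponds to a round-robin distribution of iterations. A runnable sketch of that pattern, assuming schedule(static, 1) and an illustrative function name: the squares are computed concurrently, while the writes into out[] happen in loop order.

```c
/* Stores i*i into out[] in increasing i and returns how many were written. */
int ordered_squares(int n, int *out) {
    int next = 0;                    /* shared write position */
    #pragma omp parallel for ordered schedule(static, 1)
    for (int i = 0; i < n; i++) {
        int sq = i * i;              /* the "==" part: runs concurrently */
        #pragma omp ordered
        {
            out[next] = sq;          /* the "o" part: runs in loop order */
            next++;
        }
    }
    return next;
}
```

Because the ordered blocks run one at a time and in index order, the shared counter `next` needs no further protection here.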

worksharing construct

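Worksharing means dividing the work of a region among the threads, instead of having every thread execute all of the same work. A worksharing construct ends with an implicit barrier; to drop that barrier, use the nowait clause.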

for (`DO` in Fortran)

#pragma omp for 
for(int i = 0 ; i < N;i++) // the threads of the current team execute one loop between them; different threads handle different i
{
  // worksharing construct
  printf("%d \n",i);
}
#pragma omp parallel 
for(int i = 0 ; i < N;i++) // every thread of the team executes the whole loop
{
  // parallel region
  printf("%d \n",i);
}

Note: omp for is a worksharing construct, and omp parallel for = omp parallel + omp for

sections

double y1,y2;
#pragma omp sections
{
#pragma omp section
  y1 = f(x);
#pragma omp section
  y2 = g(x);
}
y = y1+y2;

sections and for are interchangeable to a degree: sections can be seen as a fully unrolled for.
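The correspondence can be sketched by writing the same two-piece computation both ways. The bodies of `f` and `g` are placeholders invented for this example, as are the function names.

```c
static double f(double x) { return x * x; }    /* placeholder workload */
static double g(double x) { return x + 1.0; }  /* placeholder workload */

/* Two independent pieces of work expressed as sections. */
double f_plus_g_sections(double x) {
    double y1 = 0.0, y2 = 0.0;
    #pragma omp parallel sections
    {
        #pragma omp section
        y1 = f(x);
        #pragma omp section
        y2 = g(x);
    }
    return y1 + y2;
}

/* The same work "rolled up" into a two-iteration worksharing loop. */
double f_plus_g_loop(double x) {
    double y[2] = {0.0, 0.0};
    #pragma omp parallel for
    for (int k = 0; k < 2; k++) {
        y[k] = (k == 0) ? f(x) : g(x);
    }
    return y[0] + y[1];
}
```

The sections form reads better when the pieces are unrelated code; the loop form scales better when the number of pieces grows.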

single

Gives the whole block to one thread of the current team (note the difference from master)

master will be executed by the master only, while single can be executed by whichever thread reaches the region first.
single has an implicit barrier upon completion of the region, where all threads wait for synchronization, while master doesn't have any. In other words, the other threads do not wait for the master thread to finish the master block and can go straight on to the code after it.

task (OpenMP 3.0)

Not all work is loop-shaped, and when there are many pieces of work, writing them as sections is very tedious. Moreover, the following

#pragma omp sections 			 
for(int i = 0;i < N;i++)
{
#pragma omp section
		task(a[0]);
}//‘#pragma omp section’ may only be used in ‘#pragma omp sections’ construct

is illegal. Consider the next fragment

for(int i = 0;i < N; i=i+a[i])
{
		task(a[i]);
}

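A for worksharing construct cannot be used here, because the stride depends on a[i]. Using sections is practically impossible as well, since the upper bound N may well be known only at run time. This is where task comes in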

#include <omp.h>
 
void task(int p)
{
	// ...
}
 
#define N		50
int main()  
{
	int a[N];
// ...
#pragma omp parallel num_threads(2)
	{
#pragma omp single
		{
			for(int i = 0;i < N; i=i+a[i])
			{
#pragma omp task
				task(a[i]);
			}
		}
	}
 
	return 0;  
}

Here each task directive creates one task, which any thread of the current team may execute. Note that the for loop must sit inside a single construct; otherwise each of the two threads would run the loop and every task would be created twice. In effect one thread acts as the producer that generates tasks, while all threads of the team are available to be scheduled to execute them

Consider an example

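#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(int argc, char *argv[]) {
#pragma omp parallel num_threads(2)
    {   
#pragma omp single nowait // the other threads skip this block without waiting (omp master would also work)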
        {   
            printf("A single tid %d\n",omp_get_thread_num()); 
#pragma omp task
            { sleep(1);printf("race tid %d\n", omp_get_thread_num());} 

#pragma omp task
            { sleep(1); printf("car tid %d\n",omp_get_thread_num());}
        }   
#pragma omp single
        printf("B single tid %d\n",omp_get_thread_num());
    } // End of parallel region

    printf("\n");
    return(0);
}

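Output

A single tid 0
B single tid 1
race tid 0
car tid 1

Analysis:

Thread 0 enters the first single region and prints A single. Meanwhile,
thread 1 skips the first single (because of nowait), goes straight into the second single region, prints B single, and then becomes idle.
While thread 1 is printing, thread 0 submits the task { sleep(1);printf("race tid %d\n", omp_get_thread_num());} but does not run it immediately (race and car are printed at almost the same time; had thread 0 run the task at once, car would appear at least 1 second after race). Next,
thread 0 submits the second task. Then
the task queue holds two tasks: thread 0 executes task 1 and thread 1 executes task 2, and both print at almost the same moment.

With 10 threads, the output might be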

A single tid 0
B single tid 1
car tid 2
race tid 5

or

A single tid 0
B single tid 2
race tid 1
car tid 9

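The order of race and car is not deterministic.

Tasks can be submitted from within tasks, so taskwait is used to wait until all child tasks have finished.

The following form is also valid

#pragma omp parallel for num_threads(2)
for(int i = 0;i < N; i=i+1)
{
#pragma omp task // legal with or without this line
	task(a[i]);
}

To wait for tasks to finish, use

#pragma omp taskwait (waits for the children of the current task) or #pragma omp barrier (waits for all tasks of the team)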
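Nested task submission and taskwait are commonly illustrated with a recursive Fibonacci. The sketch below uses illustrative function names and no cutoff for small n, so it is meant for small inputs only.

```c
/* Each call spawns two child tasks and waits for both before combining. */
static long fib_task(int n) {
    if (n < 2) {
        return n;
    }
    long x, y;
    #pragma omp task shared(x)
    x = fib_task(n - 1);
    #pragma omp task shared(y)
    y = fib_task(n - 2);
    #pragma omp taskwait             /* x and y are valid only after this point */
    return x + y;
}

/* One thread seeds the task tree; the whole team executes the tasks. */
long fib(int n) {
    long result = 0;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib_task(n);
    }
    return result;
}
```

The shared(x) and shared(y) clauses matter: locals referenced in a task are firstprivate by default, so without them the results would be written to copies and lost.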

workshare(Fortran)

for

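do in Fortran, for in C/C++

The loop that follows for is called a worksharing loop. By default there is a barrier at the end of the loop.

Without parallel, the loop is shared among the threads of the team that is currently executing.

Some important clauses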

schedule

static:schedule(static, chunk-size)

Here are three examples of static scheduling.

schedule(static):      
****************                                                
                ****************                                
                                ****************                
                                                ****************
schedule(static, 4):   
****            ****            ****            ****            
    ****            ****            ****            ****        
        ****            ****            ****            ****    
            ****            ****            ****            ****
schedule(static, 8):   
********                        ********                        
        ********                        ********                
                ********                        ********        
                        ********                        ********

dynamic

schedule(dynamic):     // default chunk-size = 1
*   ** **  * * *  *      *  *    **   *  *  * *       *  *   *  
  *       *     *    * *     * *   *    *        * *   *    *   
 *       *    *     * *   *   *     *  *       *  *  *  *  *   *
   *  *     *    * *    *  *    *    *    ** *  *   *     *   * 
schedule(dynamic, 1):  
    *    *     *        *   *    * *  *  *         *  * *  * *  
*  *  *   * *     *  * * *    * *      *   ***  *   *         * 
 *   *  *  *  *    ** *    *      *  *  * *   *  *   *   *      
  *    *     *  **        *  * *    *          *  *    *  * *  *
schedule(dynamic, 4):  
            ****                    ****                    ****
****            ****    ****            ****        ****        
    ****            ****    ****            ****        ****    
        ****                    ****            ****            
schedule(dynamic, 8):  
                ********                                ********
                        ********        ********                
********                        ********        ********        
        ********

guided

Similar to dynamic, but the chunk size changes with the amount of remaining work (it tends to shrink)
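A sketch of a loop where a non-static schedule plausibly helps: iteration i costs roughly i inner steps, so equal-sized static blocks would leave the thread holding the last block with most of the work. The function name and the triangular workload are made up for illustration.

```c
/* Sum over i in [0, n) of (0 + 1 + ... + i). */
long triangular_work(int n) {
    long total = 0;
    #pragma omp parallel for schedule(guided) reduction(+:total)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {   /* cost grows with i */
            total += j;
        }
    }
    return total;
}
```

Swapping schedule(guided) for schedule(static) or schedule(dynamic, 4) changes only how iterations are handed out, not the result, which makes this a convenient loop for timing the alternatives.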
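auto

The compiler (or runtime) decides
runtime

Taken from OMP_SCHEDULE or omp_set_schedule()
default

If none of the above is specified, OpenMP uses the default schedule held in the ICV def-sched-var. Note that this internal control variable is defined by the OpenMP implementation and cannot be modified

nowait
reduction: each thread works on its own copy of the variable, and OpenMP combines the copies at the end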

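Nesting

When two parallel loops are nested directly, the inner parallel region is off by default; nest-var switches it on

#pragma omp parallel for
for(int i = 0 ; i < N ;i++){
 
#pragma omp parallel for
	for(int j = 0 ; j < N ;j++){
    
  }
}

If nest-var is off, the inner loop still runs with a single thread; otherwise the inner region starts additional threads to do its work

max-active-levels-var controls how many levels may be nested

Indirect nesting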


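void bar(void); // forward declaration

void foo(){

    #pragma omp parallel for
    for(int i = 0 ; i < N ; i++){
        bar(); // assumed call: the nested parallel region is reached through bar()
    }   
}
void bar(){
    #pragma omp parallel for
    for(int i = 0 ; i < N ; i++){
        
    }   
}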

This also counts as a form of nesting

Note, however, that

void bar(){
    #pragma omp parallel for
    for(int i = 0 ; i < N ; i++){
        #pragma omp for
        for(int j = 0 ; j < N ; j++){
        }
    }   
}

is not allowed, because

Improper nesting of OpenMP* constructs

orphaned directive


#include<omp.h>
#include<stdio.h>

void foo(){
		
  	printf("current threads nums is %d\n",omp_get_num_threads());
    #pragma omp for
    for(int i = 0 ; i < 4 ; i++){
        printf("id is %d\n",omp_get_thread_num());
    }   
}
int main(){
    int thread_num = 2;
    omp_set_num_threads(thread_num);
  	#pragma omp parallel num_threads(thread_num)
	  foo();
    return 0;
}

Because the omp for (worksharing loop) in foo lies outside the lexical extent of the parallel region, this for is called an orphaned directive

Output

current threads nums is 2
id is 0
id is 0
current threads nums is 2
id is 1
id is 1

Semantically, the code above is equivalent to

int main(){
    int thread_num = 2;
    omp_set_num_threads(thread_num);
  	#pragma omp parallel num_threads(thread_num)
	  {
    		printf("current threads nums is %d\n",omp_get_num_threads());
        #pragma omp for
        for(int i = 0 ; i < 4 ; i++){
            printf("id is %d\n",omp_get_thread_num());
        }   
    }
    return 0;
}

The programmer who writes main does not need to know how foo is implemented: whether foo is called from a parallel region or from a serial region, it executes correctly.
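That property can be sketched with a routine that contains only an orphaned worksharing loop and is called once from serial code and once from a parallel region. All names here are illustrative.

```c
/* Orphaned worksharing loop: binds to whatever team is running the caller. */
void fill_doubled(int *out, int n) {
    #pragma omp for
    for (int i = 0; i < n; i++) {
        out[i] = 2 * i;
    }
}

/* Serial call: the "team" is a single thread, which does every iteration. */
void fill_serial(int *out, int n) {
    fill_doubled(out, n);
}

/* Parallel call: the iterations are divided among the team's threads. */
void fill_parallel(int *out, int n) {
    #pragma omp parallel
    fill_doubled(out, n);
}
```

Both callers produce the same array; only the number of threads that share the loop differs.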

The following is wrong


void foo(){
		
  	printf("current threads nums is %d\n",omp_get_num_threads());
    #pragma omp for
    for(int i = 0 ; i < 4 ; i++){
        printf("id is %d\n",omp_get_thread_num());
    }   
}
int main(){
    int thread_num = 2;
    omp_set_num_threads(thread_num);
  	#pragma omp parallel for num_threads(thread_num)
		for(int i = 0 ; i < 2;i++)
 		  foo();
    return 0;
}

because it is equivalent to

#pragma omp parallel for
for()
{
  #pragma omp for
  foo() // error: work-sharing region may not be closely nested inside of work-sharing, 'critical', 'ordered', 'master', explicit 'task' or 'taskloop' region
}

Disallowed nesting

simd/ for simd /declare simd

Must be applied to a loop

#include <unistd.h>
#include<omp.h>
#include<stdio.h>


void foo(){
    const int N = 100;
    float C[N],a[N], b[N];
    #pragma omp simd
    for(int i = 0 ; i < N ; i++){
        C[i] = a[i] + b[i];
    }   
}

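With the pragma, SIMD code generation is already switched on at -O1. Without OpenMP it takes -O3.

If the loop body contains a function call, the function must be declared with declare simd; otherwise the compiler will not vectorize the loop (unless it can inline the call).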

#pragma omp declare simd
float add(float a,float b){
    return a + b;
}

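The compiler may generate several variants of add automatically (much like template instantiations).

The directive can be followed by clauses such as simdlen(), linear(), and uniform()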
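As a sketch of one of those clauses, uniform(scale) tells the compiler that scale has the same value in every SIMD lane, so it can stay a scalar while a and b are vectors. The function names and the scaling operation are invented for this example.

```c
/* Vectorizable helper: scale is identical for all lanes of one call. */
#pragma omp declare simd uniform(scale)
float scaled_add(float a, float b, float scale) {
    return scale * (a + b);
}

/* c[i] = scale * (a[i] + b[i]) for i in [0, n). */
void scaled_add_arrays(const float *a, const float *b, float *c,
                       int n, float scale) {
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        c[i] = scaled_add(a[i], b[i], scale);
    }
}
```

Whether the loop is really vectorized depends on the compiler and its flags; the pragmas only grant permission and supply hints.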

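Recommendations

Use fewer threads for small workloads

#pragma omp parallel for if(N>1)
for(int i = 0 ; i < N ; i++)
{
  
}

Avoid implicit synchronization

Most worksharing constructs end with a synchronization point

#pragma omp parallel 
{
  #pragma omp for nowait
  for()  
  #pragma omp critical // a thread can enter the critical section as soon as its share of the loop is done
  {
    
  }
}

Do not start threads in an inner loop

Avoid "trivial" load imbalance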

!$OMP PARALLEL DO SCHEDULE(STATIC) REDUCTION(+:res)
do l=1,M
  do k=1,N
    do j=1,N
      do i=1,N
        res=res+A(i,j,k,l)
      enddo ; enddo ; enddo ; enddo
!$OMP END PARALLEL DO

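Always parallelize the loop over whichever of M and N is larger

Avoid dynamic scheduling

Every schedule other than static has to compute the work assignment at run time

Avoid false sharing

Avoid having different threads read and write the same cache line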

#pragma omp parallel for
for(int i = 0 ; i < N ; i++){
  for(int j = 0 ; j < N; j++){
    y[i] += A[i][j] * x[j]; // neighbouring y[i] share a cache line, so threads writing adjacent rows interfere
  }
}
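One common way to reduce the traffic on y in a loop like the one above is to accumulate each row in a private scalar and write y[i] once. The sketch below assumes a row-major n x n matrix passed as a flat array; the function name is illustrative.

```c
/* y = A * x for a row-major n x n matrix A. */
void matvec(const double *A, const double *x, double *y, int n) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double sum = 0.0;                /* private accumulator, no shared writes */
        for (int j = 0; j < n; j++) {
            sum += A[i * n + j] * x[j];
        }
        y[i] = sum;                      /* a single write to the shared line per row */
    }
}
```

With schedule(static) and no chunk size, each thread also gets one contiguous block of rows, so only the elements at block boundaries can share a cache line with another thread's output.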

OpenMP deadlock

NUMA policy

first touch policy
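Under a first-touch policy (assumed here; it is the usual default on Linux), a page is placed on the NUMA node of the thread that first writes to it. A sketch of the resulting idiom, with illustrative function names: initialize the data in parallel with the same static schedule that later loops use.

```c
#include <stdlib.h>

/* Allocates n doubles and zeroes them in parallel, so each page is first
   touched by the thread that will later work on it. */
double *alloc_first_touch(size_t n) {
    double *v = malloc(n * sizeof *v);
    if (v == NULL) {
        return NULL;
    }
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        v[i] = 0.0;                      /* the first write decides page placement */
    }
    return v;
}

/* Same static schedule: each thread mostly reads pages it touched first. */
double sum_first_touch(const double *v, size_t n) {
    double total = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < (long)n; i++) {
        total += v[i];
    }
    return total;
}
```

Initializing the array serially (or with calloc followed by a serial loop) would put every page on the master thread's node, and all other threads would then pay for remote accesses.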
