An OpenMP Parallel Program for Simple Sorting

绝不收费

Published 2024-07-17 23:00:58

Article link: https://blog.csdn.net/weixin_60705585/article/details/140507844

Shenzhen University Parallel Computing, Lab 3

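I. Objectives

1. Implement an OpenMP parallel program for simple sorting;

2. Learn the for directive and data partitioning methods;

3. Carry out a simple performance analysis of the parallel program.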

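II. Environment

1. Hardware: a shared-memory parallel computing platform with a 64-core CPU and 256 GB of memory;

2. Software: Ubuntu Linux, gcc, g++ (g++ -O3 -fopenmp -o a.out a.cc);

3. Remote login: run ssh bxjs@hpc.szu.edu.cn in a local PowerShell;

4. File transfer: run scp c:\a.cpp bxjs@hpc.szu.edu.cn:/home/bxjs/ in a local PowerShell, or use ftp://hpc.szu.edu.cn.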

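III. Tasks

1. Write an OpenMP parallel program that sorts an array a of length n: first split a into p subarrays, let each thread sort one subarray, and then merge the p subarrays into a or into another array b. The merge step may be serial. To verify correctness, compare the parallel result with the result of a serial computation.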

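Implementation approach:

First, we create two one-dimensional int arrays, a and b: a holds the result of the parallel sort and b holds the result of the serial sort. Both arrays are re-initialized with the same random values before every sort, so that values left in memory (for example, the sorted output of a previous run) cannot affect the experiment and the parallel result can be checked properly. I use rand() from cstdlib, which returns a random integer between 0 and RAND_MAX, and assign that integer to both a[i] and b[i].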

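Next comes the parallel sort itself. I define two arrays of length thread_num, begin and end, and use a for loop to split the length-n array a into thread_num subarrays: begin[i] and end[i] store the first and last index of subarray i (both inclusive).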
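The partitioning step can be sketched on its own. The version below is only an illustration under my own assumptions, not the post's code: the names Blocks, make_blocks, first and last are hypothetical, and it spreads the remainder so that block sizes differ by at most one element, whereas the post's loop gives the first block one extra element and lets the last block absorb the remainder.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the original program): split the index range
// [0, n) into p contiguous blocks whose sizes differ by at most one element.
// Block i covers the closed index range [first[i], last[i]].
struct Blocks {
    std::vector<int> first;
    std::vector<int> last;
};

Blocks make_blocks(int n, int p) {
    assert(p >= 1 && n >= p);  // every block must hold at least one element
    Blocks blk{std::vector<int>(p), std::vector<int>(p)};
    int base = n / p;   // minimum block size
    int extra = n % p;  // the first `extra` blocks get one more element
    int start = 0;
    for (int i = 0; i < p; ++i) {
        int len = base + (i < extra ? 1 : 0);
        blk.first[i] = start;
        blk.last[i] = start + len - 1;
        start += len;
    }
    return blk;
}
```

For example, make_blocks(10, 3) yields the closed ranges [0, 3], [4, 6] and [7, 9].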

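Then I call the OpenMP function omp_set_num_threads(int thread_num) to set the number of threads for the following parallel regions to thread_num, and call omp_get_wtime(), which returns the elapsed wall-clock time in seconds since some fixed point in the past, storing the value in the double variable t0 as the start time of the parallel computation. The directive “#pragma omp parallel shared(a,begin,end)” opens a parallel region in which each thread obtains its thread number nthreads from omp_get_thread_num(), looks up the first index begin[nthreads] and last index end[nthreads] of the subarray it is responsible for, and calls sort(a+begin[nthreads],a+end[nthreads]+1) on it. In this way the subarrays are sorted in parallel. This relies on the runtime actually providing thread_num threads in the region; if it provides fewer, some subarrays are left unsorted.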
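As a hedged variation on this step, the sketch below distributes the blocks with a parallel for loop instead of indexing by thread number, so every block is sorted even when the runtime grants fewer threads than there are blocks. The name sort_blocks and the use of std::vector are my assumptions; the post's program works on global int arrays.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: sort each closed range [first[i], last[i]] of `a`.
// With "parallel for", loop iterations (blocks) are handed out to whatever
// threads exist, so no block is skipped if fewer threads are available.
// Compiled without -fopenmp, the pragma is ignored and the loop runs serially.
void sort_blocks(std::vector<int>& a, const std::vector<int>& first,
                 const std::vector<int>& last) {
    assert(first.size() == last.size());
    const int p = static_cast<int>(first.size());
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < p; ++i) {
        // std::sort takes a half-open range, hence the "+ 1" on the last index
        std::sort(a.begin() + first[i], a.begin() + last[i] + 1);
    }
}
```

The blocks are disjoint, so the threads never write to the same elements and no extra synchronization is needed inside the loop.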

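Once the thread_num subarrays are sorted, they have to be merged. For this I wrote the function void guibing(int begin1, int end1, int begin2, int end2,int*shuzu), which merges two adjacent sorted ranges of the array shuzu, [begin1, end1] and [begin2, end2] (both closed intervals), and is called by the merge step.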
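The same two-range merge can also be written with the standard library. This is a sketch of an alternative, not the post's guibing: merge_adjacent is a hypothetical name, and it assumes the second range starts right after the first (begin2 == end1 + 1), which is how the ranges are laid out in this program.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical alternative to guibing(): merge the adjacent sorted closed
// ranges [b1, e1] and [e1 + 1, e2] of `a` using the standard library.
void merge_adjacent(std::vector<int>& a, int b1, int e1, int e2) {
    assert(0 <= b1 && b1 <= e1 && e1 < e2 && e2 < static_cast<int>(a.size()));
    // std::inplace_merge works on the half-open iterator ranges
    // [first, middle) and [middle, last).
    std::inplace_merge(a.begin() + b1, a.begin() + e1 + 1, a.begin() + e2 + 1);
}
```

Like guibing, std::inplace_merge normally uses a temporary buffer internally, so the main gain is shorter code rather than lower memory use.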

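However, merging the thread_num subarrays one after another serially could be expensive, so the merges are parallelized as well. The key point in the parallel version is the order of the merges: if it is not controlled properly, the result will be wrong.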

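Concretely, I use two nested for loops. The outer loop controls the merge stride i (1, 2, 4, ...) and determines that the current round needs (thread_num/i+1)/2 merges. The directive “#pragma omp parallel for shared(a,begin,end,i)” then turns the inner loop into a parallel loop that performs these (thread_num/i+1)/2 merges concurrently. The index arithmetic assumes that thread_num is a power of two, as it is in this experiment; for other values it can index past the end of begin and end.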
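The round-by-round merge can be sketched as follows. This is an illustration under my own assumptions rather than the post's code: tree_merge and bounds are hypothetical names, the block boundaries are stored as p + 1 half-open offsets, std::inplace_merge stands in for guibing, and the bounds checks allow p to be any positive number.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch: `bounds` holds p + 1 offsets, so block k is the
// half-open range [bounds[k], bounds[k + 1]) and is already sorted.
// Each round merges runs that start `stride` blocks apart.
void tree_merge(std::vector<int>& a, const std::vector<int>& bounds) {
    const int p = static_cast<int>(bounds.size()) - 1;
    assert(p >= 1);
    for (int stride = 1; stride < p; stride *= 2) {
        // Merges within one round touch disjoint ranges, so they may run in
        // parallel; the end of the parallel loop acts as a barrier before
        // the next round starts.
        #pragma omp parallel for schedule(static, 1)
        for (int left = 0; left < p - stride; left += 2 * stride) {
            int mid = left + stride;                     // first block of the right run
            int right = std::min(left + 2 * stride, p);  // one past the right run
            std::inplace_merge(a.begin() + bounds[left], a.begin() + bounds[mid],
                               a.begin() + bounds[right]);
        }
    }
}
```

With 5 blocks, the rounds merge (0,1) and (2,3), then (0..1, 2..3), then (0..3, 4); a run without a partner in a round is simply carried to the next one.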

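After that, omp_get_wtime() is called again and the value is stored in the double variable t1 as the end time of the parallel computation, so t1-t0 is the running time of the parallel part. Then a single call to sort sorts array b serially.

Finally, a for loop compares a with b to confirm that the parallel sort produced the correct result.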
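The check can be expressed compactly with standard algorithms. In this sketch, matches_serial is a hypothetical helper of mine; like the post's loop, it tests both that the parallel result is in order and that it equals the serial result, although the second test already implies the first.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: true when `parallel_result` is sorted and equals the
// serially sorted copy of the same input.
bool matches_serial(const std::vector<int>& parallel_result, std::vector<int> input) {
    std::sort(input.begin(), input.end());  // serial reference result
    return std::is_sorted(parallel_result.begin(), parallel_result.end()) &&
           parallel_result == input;
}
```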

Preliminary implementation:

// Merge two adjacent sorted ranges of shuzu, [begin1, end1] and [begin2, end2] (closed intervals)
void guibing(int begin1, int end1, int begin2, int end2,int*shuzu){
	int *tempkfc=new int[end2-begin1+1];// temporary buffer for the merged result
	int cnt_temp=0,cnt1=begin1,cnt2=begin2;
	while(cnt1<=end1||cnt2<=end2){
		if(cnt1<=end1&&cnt2<=end2){
			// both ranges still have elements: take the smaller head
			if(shuzu[cnt1]<shuzu[cnt2]) tempkfc[cnt_temp++]=shuzu[cnt1++];
			else tempkfc[cnt_temp++]=shuzu[cnt2++];
		}
		else if(cnt1<=end1) tempkfc[cnt_temp++]=shuzu[cnt1++];
		else if(cnt2<=end2) tempkfc[cnt_temp++]=shuzu[cnt2++];
	}
	for(int i=0;i<cnt_temp;i++) shuzu[i+begin1]=tempkfc[i];// copy the merged result back
	delete[]tempkfc;
}
// Excerpt: N and thread_num are defined in the full listing
int a[N],b[N];// a: parallel result, b: serial result used for verification
// Initialize a and b with the same random values
srand(time(NULL));
for (int i = 0; i < N; i++)
	a[i] = b[i] = rand();// 0 ~ RAND_MAX (implementation-defined, at least 32767)
double t0,t1;
int begin[thread_num],end[thread_num],num;
// Split a into thread_num subarrays [begin[i], end[i]]
for (int i = 0; i < thread_num; i++){
	if(i == 0) begin[0] = 0;
	else begin[i] = end[i-1] + 1;
	if(i == thread_num - 1) end[i] = N - 1;
	else end[i] = (N/thread_num) * (i+1);
}
omp_set_num_threads(thread_num);
t0 = omp_get_wtime();
// Parallel sorting code
// Parallel step 1: each thread sorts the subarray that matches its thread number
#pragma omp parallel shared(a,begin,end)
{
	int nthreads = omp_get_thread_num();
	sort(a+begin[nthreads],a+end[nthreads]+1);
}
// Parallel step 2: merge the thread_num subarrays round by round (thread_num must be a power of two)
for(int i=1;i<thread_num;i*=2){
	num=thread_num/i;
	num=(num+1)/2;// number of merges in this round
	#pragma omp parallel for shared(a,begin,end,i)
	for(int j=0;j<num;j++){
		guibing(begin[j*(2*i)], end[j*(2*i)], begin[j*(2*i)+i], end[j*(2*i)+i], a);
		end[j*(2*i)]=end[j*(2*i)+i];// the left run now extends to the end of the right run
	}
}
t1 = omp_get_wtime();
// Serial sorting code
sort(b,b+N);
// Verification code
int flag=1;
for(int i = 0; i < N; i++)
	if(b[i]!=a[i]||(i>0 && a[i] < a[i-1])){
		flag=0;
		cout << "Results are not equal!" << endl;
		break;
	}
if(flag) cout << "Results are equal!" << endl;
cout << "Thread num: "<<thread_num<<", Parallel time: " << t1 - t0 <<" seconds"<< endl;

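2. Measure the execution time and speedup (relative to the execution time with 1 thread) of the parallel program for different thread counts. n is fixed at 100000000 and the thread count takes the values 1, 2, 4, 8, 16, 32 and 64. To reduce error, each experiment is run 5 times and the average is taken as the result.

To avoid editing the code repeatedly, the code from task 1 is wrapped in a function “double achieve(int thread_num)” whose parameter thread_num is the number of threads for the parallel regions and whose return value is the running time of the parallel part.

Furthermore, launching the program separately for each configuration could leave the CPU in a different state each time and increase measurement error. I therefore store the values “1, 2, 4, 8, 16, 32, 64” in an array, pass them one by one to “double achieve(int thread_num)” in a for loop, and compute the average over the 5 runs for each value, all within a single run of the program. The full code is given in the code description section below.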
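The repeat-and-average pattern can be sketched independently of the sort. The helper below is a hypothetical illustration: average_seconds is my own name, and it uses std::chrono::steady_clock in place of omp_get_wtime() so that it also builds without OpenMP.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: run `work` `reps` times inside one process and return
// the mean wall-clock time in seconds.
double average_seconds(const std::function<void()>& work, int reps) {
    assert(reps > 0);
    double total = 0.0;
    for (int r = 0; r < reps; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(t1 - t0).count();
    }
    return total / reps;
}
```

Unlike this helper, the post's achieve() times only the parallel sort and merge, keeping initialization and verification outside the measured interval, which is the better choice when setup is expensive.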

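IV. Code Description

The complete code for this lab is shown below. The necessary parameters are set through global constants, and the main function repeatedly calls double achieve(int thread_num) to obtain the average running time of the parallel part for each thread_num and prints it.

Inside double achieve(int thread_num), the parallel regions are created with the directives “#pragma omp parallel shared(a,begin,end)” and “#pragma omp parallel for shared(a,begin,end,i)”, and the result is then compared with that of a serial sort to make sure it is correct.

In the parallel sort, with n elements and p = thread_num subarrays, partitioning and sorting the subarrays in parallel has a theoretical time complexity of O((n/p)·log(n/p)), and merging the subarrays has a theoretical time complexity of O(n), because the final merge alone passes over all n elements. Compared with a serial sort, this is a constant-factor improvement.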
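The two costs mentioned in the paragraph above can be derived under some simplifying assumptions, which are mine rather than the author's: p is a power of two, every merge within a round runs on its own thread, comparing or copying one element costs O(1), and memory bandwidth limits are ignored. Round k of the merge produces runs of length 2^k · n/p, and the rounds execute one after another:

```latex
\begin{align*}
T_{\text{sort}}(n,p)  &= O\!\left(\frac{n}{p}\,\log\frac{n}{p}\right), \\
T_{\text{merge}}(n,p) &= \sum_{k=1}^{\log_2 p} O\!\left(2^{k}\,\frac{n}{p}\right)
                       = O\!\left(2n\left(1-\frac{1}{p}\right)\right) = O(n), \\
T_{\text{par}}(n,p)   &= T_{\text{sort}}(n,p) + T_{\text{merge}}(n,p)
                       = O\!\left(\frac{n}{p}\,\log\frac{n}{p} + n\right).
\end{align*}
```

Under these assumptions the merge term does not shrink as p grows, so it eventually dominates the shrinking sort term.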

#include <omp.h>
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;
const int N=100000000;// problem size (number of elements to sort)
const int ThreadNum_CanShu=7;// number of entries in set_thread_num
const int set_thread_num[]={1,2,4,8,16,32,64};// thread counts to test (powers of two)
const int XunHuan_CanShu=5;// number of repetitions per thread count
int a[N],b[N];// a: parallel result, b: serial result used for verification
// Merge two adjacent sorted ranges of shuzu, [begin1, end1] and [begin2, end2] (closed intervals)
void guibing(int begin1, int end1, int begin2, int end2,int*shuzu){
	int *tempkfc=new int[end2-begin1+1];// temporary buffer for the merged result
	int cnt_temp=0,cnt1=begin1,cnt2=begin2;
	while(cnt1<=end1||cnt2<=end2){
		if(cnt1<=end1&&cnt2<=end2){
			// both ranges still have elements: take the smaller head
			if(shuzu[cnt1]<shuzu[cnt2])
				tempkfc[cnt_temp++]=shuzu[cnt1++];
			else tempkfc[cnt_temp++]=shuzu[cnt2++];
		}
		else if(cnt1<=end1) tempkfc[cnt_temp++]=shuzu[cnt1++];
		else if(cnt2<=end2) tempkfc[cnt_temp++]=shuzu[cnt2++];
	}
	for(int i=0;i<cnt_temp;i++)
		shuzu[i+begin1]=tempkfc[i];// copy the merged result back
	delete[]tempkfc;
}
// Run one experiment with thread_num threads and return the running time of the parallel part
double achieve(int thread_num){
	// Initialize a and b with the same random values
	srand(time(NULL));
	for (int i = 0; i < N; i++) a[i] = b[i] = rand();// 0 ~ RAND_MAX (implementation-defined, at least 32767)
	double t0,t1;
	int begin[thread_num],end[thread_num],num;// variable-length arrays (a GCC extension)
	// Split a into thread_num subarrays [begin[i], end[i]]
	for (int i = 0; i < thread_num; i++){
		if(i == 0) begin[0] = 0;
		else begin[i] = end[i-1] + 1;
		if(i == thread_num - 1) end[i] = N - 1;
		else end[i] = (N/thread_num) * (i+1);
	}
	omp_set_num_threads(thread_num);
	t0 = omp_get_wtime();
	// Parallel sorting code
	// Parallel step 1: each thread sorts the subarray that matches its thread number
	// (this requires that the region really runs with thread_num threads)
	#pragma omp parallel shared(a,begin,end)
	{
		int nthreads = omp_get_thread_num();
		sort(a+begin[nthreads],a+end[nthreads]+1);
	}
	// Parallel step 2: merge the thread_num subarrays round by round (thread_num must be a power of two)
	for(int i=1;i<thread_num;i*=2){
		num=thread_num/i;
		num=(num+1)/2;// number of merges in this round
		#pragma omp parallel for shared(a,begin,end,i)
		for(int j=0;j<num;j++){
			guibing(begin[j*(2*i)], end[j*(2*i)],
				begin[j*(2*i)+i], end[j*(2*i)+i], a);
			end[j*(2*i)]=end[j*(2*i)+i];// the left run now extends to the end of the right run
		}
	}
	t1 = omp_get_wtime();
	// Serial sorting code
	sort(b,b+N);
	// Verification code
	int flag=1;
	for(int i = 0; i < N; i++)
		if(b[i]!=a[i]||(i>0 && a[i] < a[i-1])){
			flag=0;
			cout << "Results are not equal!" << endl;
			break;
		}
	if(flag) cout << "Results are equal!" << endl;
	cout << "Thread num: "<<thread_num<<", Parallel time: " << t1 - t0 <<" seconds"<< endl;
	return t1-t0;
}
int main()
{
	cout<<"CPU cores: "<<omp_get_num_procs()<<endl;
	cout<<"Thread limit: "<<omp_get_thread_limit()<<endl;
	cout<<"----------------------------------"<<endl;
	for(int i=0;i<ThreadNum_CanShu;i++){
		double pingjunzhi=0;// accumulated running time over the repetitions
		for(int j=0;j<XunHuan_CanShu;j++)
			pingjunzhi+=achieve(set_thread_num[i]);
		cout<<"omp_set_num_threads="<<set_thread_num[i];
		cout<<", average running time: "<< pingjunzhi/XunHuan_CanShu<<" seconds"<<endl;
		cout<<"----------------------------------"<<endl;
	}
	return 0;
}

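V. Results and Analysis

I opened “ftp://hpc.szu.edu.cn” in the file manager and copied the local source file c.cpp to the directory “/2021150233”, then connected to machine A1 and compiled the file into c.out with “g++ -o c.out c.cpp -fopenmp -O3”. Running “./c.out” produced the output shown in Figure 1. The results are recorded in Table 1, together with the speedup (relative to the execution time with 1 thread).

Figure 1. Output of c.out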

Table 1. Execution time (seconds) and speedup of the parallel program for different thread counts (n=100000000)

Execution time (s) \ Threads	1	2	4	8	16	32	64
Run 1	11.3753	6.5231	4.3566	3.5897	3.1774	3.3442	3.2583
Run 2	11.0783	6.5305	4.3493	3.6416	3.1178	3.0552	3.2258
Run 3	11.0455	6.5340	4.3620	3.5750	3.1662	3.1977	3.3308
Run 4	11.0799	6.5353	4.3518	3.8005	3.1833	3.1257	3.2058
Run 5	11.0661	6.4750	4.3491	3.6747	3.3242	3.5033	3.2096
Average	11.1290	6.5196	4.3538	3.6563	3.1938	3.2452	3.2461
Speedup	1.0000	1.7070	2.5562	3.0438	3.4846	3.4294	3.4284
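The Average and Speedup rows follow from the five measurements by simple arithmetic, S(p) = T(1) / T(p), where T(p) is the mean time with p threads. The small sketch below recomputes them; the helper names mean and speedup are mine.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mean of the measured times for one thread count.
double mean(const std::vector<double>& runs) {
    assert(!runs.empty());
    double sum = 0.0;
    for (double t : runs) sum += t;
    return sum / runs.size();
}

// Speedup relative to the single-thread mean: S(p) = T(1) / T(p).
double speedup(double t1, double tp) {
    assert(tp > 0.0);
    return t1 / tp;
}
```

For instance, the five 1-thread times average to about 11.1290 s and the five 16-thread times to about 3.1938 s, giving a speedup of about 3.4846, as listed in the table.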

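The averages and speedups in Table 1 are plotted in the line chart below:

Figure 2. Line chart of average time and speedup

Table 1 and Figure 2 show that, at first, the time needed to sort 100000000 elements falls and the speedup rises as the number of threads increases. As more threads are added, however, both the running time and the speedup level off, and beyond 16 threads the running time even rises slightly while the speedup falls slightly (3.4846 at 16 threads versus 3.4294 at 32 and 3.4284 at 64).

The reason is that adding threads speeds up the sorting of the subarrays, but only by a constant factor. At the same time, more threads mean more subarrays to merge and more thread-scheduling overhead, so beyond a certain thread count the gain from sorting smaller subarrays is cancelled out, and the running time and speedup flatten.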

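VI. Conclusions

From the results and analysis, the following conclusions can be drawn:

1. As the number of threads increases, the sorting time first decreases, then levels off and may even increase: initially, more threads speed up the sorting work, lowering the sorting time and raising the speedup. As the thread count keeps growing, however, thread-scheduling overhead and the cost of merging the subarrays come to dominate, so additional threads bring smaller and smaller gains, and the running time can even rise while the speedup falls.
2. There is an optimal thread count: the results show that more threads are not always better. In this experiment the best average time was obtained with 16 threads; beyond that, the extra scheduling and merging overhead weakens or even reverses the gain.
3. Verifying the correctness of the parallel program: the result of the parallel sort must match the result of the serial sort. Comparing the two is how errors in the parallel algorithm are detected.
4. Using OpenMP: the program uses OpenMP to exploit the computing resources of a multi-core CPU and improve execution efficiency. Parallel regions and directives are used to sort the subarrays and to merge them in parallel.
5. Further improvement and optimization: one could consider a more efficient sorting algorithm, such as quicksort, combined with a better merge strategy to improve performance further. The parallel implementation could also be tuned to reduce unnecessary thread synchronization and communication overhead.

In summary, this lab gave us a deeper understanding of how a parallel sorting algorithm is designed and implemented, and practice with parallel program development and performance analysis. Looking for performance bottlenecks and repeatedly reducing the running time during the lab also provided ideas and reference points for optimizing parallel programs.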
