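Code tuning is really a compiler optimization problem, and normally the compiler does this work. For study purposes, though, we apply the optimizations by hand in three small experiments.
First, some MPI background:
mpicc: a compiler wrapper similar to gcc; it compiles C source files into an executable
mpic++: a compiler wrapper similar to g++; it compiles C++ source files into an executable
mpirun: launches an executable; the number of processes (not threads) can be set, for example with -np, but the code has to call MPI functions to make use of them
Here we use mpic++ to compile the .cpp files.
MPI is a message-passing standard whose implementations ship as a library. It can be called from C, C++, and Fortran (the ones the author knows of); here we call it from C++. The library is essentially a directory of headers plus the matching link libraries, and we do not need to care where it lives, because mpic++ finds it automatically. In the .cpp file, first include the header: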
#include <mpi.h>
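The snippet below sets up the MPI environment, and every test program needs it. MPI_Init initializes MPI, MPI_Comm_rank stores the rank of the calling process in myid, and MPI_Comm_size stores the total number of processes in numprocs. The source and status variables are not needed by the experiments here.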
int numprocs, myid, source;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
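We also need two functions:
MPI_Wtime();// returns the current wall-clock time in seconds, as a double
MPI_Wtick();// returns the resolution of MPI_Wtime in seconds (the time between two clock ticks, not the clock frequency)
We measure performance by taking the difference between two MPI_Wtime() readings placed around the code under test.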
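The begin/end timing pattern can also be tried without an MPI installation. The sketch below is only a stand-in: it assumes std::chrono::steady_clock in place of MPI_Wtime(), and the names time_seconds and tick_seconds are hypothetical helpers, not part of MPI.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: run `work` once and return the elapsed wall-clock seconds.
// std::chrono::steady_clock stands in for MPI_Wtime() here.
template <typename Work>
double time_seconds(Work work) {
    const auto begin = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = end - begin;
    assert(elapsed.count() >= 0.0);  // steady_clock never goes backwards
    return elapsed.count();
}

// Tick period of the stand-in clock's time representation, in seconds.
// This plays the role that MPI_Wtick() plays for MPI_Wtime().
double tick_seconds() {
    using period = std::chrono::steady_clock::period;
    return static_cast<double>(period::num) / static_cast<double>(period::den);
}
```

Calling time_seconds with a lambda that wraps the loop under test gives the same kind of number as end - begin in the MPI programs, measured on a single process.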
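Running the experiments:
① Loop interchange
The instructor's slides give the following example:
Let us write code that follows this example.
Code before optimization: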
#include <cstdio>
#include <mpi.h>
using namespace std;
int main(int argc, char* argv[]){
    int numprocs, myid, source;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    int row=50000,col=1000;
    // One heap block per row; the trailing () zero-initializes it so the loop reads defined values
    int **x = new int*[row];
    for(int i=0;i<row;i++)x[i]=new int[col]();
    double begin = MPI_Wtime();
    // The inner loop walks down a column, so consecutive accesses land in different rows
    for(int j=0;j<col;j++){
        for(int i=0;i<row;i++){
            x[i][j]=2*x[i][j];
        }
    }
    double end = MPI_Wtime();
    double diff = end - begin;
    printf("%d process time is %9.16f\n", myid, diff);
    printf("%d process tick is %9.16f\n", myid, MPI_Wtick());
    // Release the rows, then the table of row pointers
    for(int i=0;i<row;i++)delete[] x[i];
    delete[] x;
    MPI_Finalize();
    return 0;
}
Code after optimization:
#include <cstdio>
#include <mpi.h>
using namespace std;
int main(int argc, char* argv[]){
    int numprocs, myid, source;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    int row=50000,col=1000;
    // One heap block per row; the trailing () zero-initializes it so the loop reads defined values
    int **x = new int*[row];
    for(int i=0;i<row;i++)x[i]=new int[col]();
    double begin = MPI_Wtime();
    // The loops are interchanged here: the inner loop now walks along one row
    for(int i=0;i<row;i++){
        for(int j=0;j<col;j++){
            x[i][j]=2*x[i][j];
        }
    }
    double end = MPI_Wtime();
    double diff = end - begin;
    printf("%d process time is %9.16f\n", myid, diff);
    printf("%d process tick is %9.16f\n", myid, MPI_Wtick());
    // Release the rows, then the table of row pointers
    for(int i=0;i<row;i++)delete[] x[i];
    delete[] x;
    MPI_Finalize();
    return 0;
}
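Performance comparison:
As the comparison shows, loop interchange made the program about 7 to 8 times faster. Each row x[i] is stored contiguously, so with j as the inner loop the program walks memory sequentially and uses every cache line it loads in full, whereas with i as the inner loop every access jumps to a different row.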
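To make the access pattern concrete, here is a small sketch of both loop orders. It assumes a flat std::vector<int> in row-major layout rather than the int** used in the experiment, and the function names are hypothetical. Both orders compute the same result; only the distance between consecutive inner-loop accesses differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Doubles every element with the column index j in the inner loop (sequential access).
void double_row_order(std::vector<int>& x, std::size_t row, std::size_t col) {
    assert(x.size() == row * col);
    for (std::size_t i = 0; i < row; ++i)
        for (std::size_t j = 0; j < col; ++j)
            x[i * col + j] *= 2;
}

// Doubles every element with the row index i in the inner loop (strided access).
void double_column_order(std::vector<int>& x, std::size_t row, std::size_t col) {
    assert(x.size() == row * col);
    for (std::size_t j = 0; j < col; ++j)
        for (std::size_t i = 0; i < row; ++i)
            x[i * col + j] *= 2;
}

// Distance, in elements, between two consecutive inner-loop accesses.
std::size_t inner_stride(std::size_t col, bool row_order) {
    return row_order ? 1 : col;
}
```

With col = 1000 the strided version moves 1000 elements between accesses, which is one plausible reading of why the un-interchanged loop is so much slower; the exact ratio depends on the machine and compiler flags.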
② Array merging
The instructor's slides give the following example:
Let us write code that follows this example.
Code before optimization:
#include <cstdio>
#include <mpi.h>
#define SIZE 1000000
using namespace std;
int main(int argc, char* argv[]){
    int numprocs, myid, source;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    // static keeps the two arrays (about 4 MB each) off the stack and zero-initializes them
    static int val[SIZE];
    static int key[SIZE];
    double begin = MPI_Wtime();
    // val[i] and key[i] live in two separate arrays, far apart in memory
    for(int i=0;i<SIZE;i++){
        val[i] = val[i]+key[i];
    }
    double end = MPI_Wtime();
    double diff = end - begin;
    printf("%d process time is %9.16f\n", myid, diff);
    printf("%d process tick is %9.16f\n", myid, MPI_Wtick());
    MPI_Finalize();
    return 0;
}
Code after optimization:
#include <cstdio>
#include <mpi.h>
#define SIZE 1000000
using namespace std;
int main(int argc, char* argv[]){
    int numprocs, myid, source;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    // The two arrays are merged into one array of structs
    struct merge{
        int val;
        int key;
    };
    // static keeps the array (about 8 MB) off the stack and zero-initializes it
    static struct merge merge_array[SIZE];
    double begin = MPI_Wtime();
    for(int i=0;i<SIZE;i++){
        // Index element i; merge_array->val would touch only element 0 on every iteration
        merge_array[i].val = merge_array[i].val+merge_array[i].key;
    }
    double end = MPI_Wtime();
    double diff = end - begin;
    printf("%d process time is %9.16f\n", myid, diff);
    printf("%d process tick is %9.16f\n", myid, MPI_Wtick());
    MPI_Finalize();
    return 0;
}
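Performance comparison:
In the author's runs, array merging made the program about 3 times faster. That figure is only meaningful when the loop indexes element i of the merged array; a loop that keeps touching element 0 stays in cache and overstates the gain.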
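The two layouts can be compared side by side in a short sketch. The struct name Pair and both function names are hypothetical, and std::vector is assumed in place of the fixed-size arrays of the experiment. The point is that in the merged layout val and key of the same element sit next to each other in memory, while in the split layout they are in two different arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merged layout: val and key of one element are adjacent in memory.
struct Pair {
    int val;
    int key;
};

// Split layout: each iteration reads from two separate arrays.
void add_keys_split(std::vector<int>& val, const std::vector<int>& key) {
    assert(val.size() == key.size());
    for (std::size_t i = 0; i < val.size(); ++i)
        val[i] += key[i];
}

// Merged layout: each iteration reads both fields of element i.
void add_keys_merged(std::vector<Pair>& items) {
    for (std::size_t i = 0; i < items.size(); ++i)
        items[i].val += items[i].key;
}
```

Both functions produce the same values, so any timing difference between them comes from the memory layout rather than from the arithmetic. For a purely sequential pass like this one the difference may well be small.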
③ Loop fusion
The instructor's slides give the following example:
Let us write code that follows this example.
Code before optimization:
#include <cstdio>
#include <mpi.h>
#define N 700
using namespace std;
int main(int argc, char* argv[]){
    int numprocs, myid, source;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    // static keeps the four arrays (about 7.8 MB in total) off the stack
    static int a[N][N],b[N][N],c[N][N],d[N][N];
    for(int i=0;i<N;i++){
        for(int j=0;j<N;j++){
            a[i][j]=1;
            b[i][j]=1;
            c[i][j]=1;
            d[i][j]=1;
        }
    }
    double begin = MPI_Wtime();
    // First pass writes a; 1/b[i][j] is integer division, which is 1 here because b is all ones
    for(int i=0;i<N;i++)
        for(int j=0;j<N;j++)
            a[i][j]=1/b[i][j]*c[i][j];
    // Second pass reads a and c again to compute d
    for(int i=0;i<N;i++)
        for(int j=0;j<N;j++)
            d[i][j]=a[i][j]+c[i][j];
    double end = MPI_Wtime();
    double diff = end - begin;
    printf("%d process time is %9.16f\n", myid, diff);
    printf("%d process tick is %9.16f\n", myid, MPI_Wtick());
    MPI_Finalize();
    return 0;
}
Code after optimization:
#include <cstdio>
#include <mpi.h>
#define N 700
using namespace std;
int main(int argc, char* argv[]){
    int numprocs, myid, source;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    // static keeps the four arrays (about 7.8 MB in total) off the stack
    static int a[N][N],b[N][N],c[N][N],d[N][N];
    for(int i=0;i<N;i++){
        for(int j=0;j<N;j++){
            a[i][j]=1;
            b[i][j]=1;
            c[i][j]=1;
            d[i][j]=1;
        }
    }
    double begin = MPI_Wtime();
    // The two loops are fused into one, so a and c are reused while still in cache
    for(int i=0;i<N;i++)
        for(int j=0;j<N;j++){
            a[i][j]=1/b[i][j]*c[i][j];
            d[i][j]=a[i][j]+c[i][j];
        }
    double end = MPI_Wtime();
    double diff = end - begin;
    printf("%d process time is %9.16f\n", myid, diff);
    printf("%d process tick is %9.16f\n", myid, MPI_Wtick());
    MPI_Finalize();
    return 0;
}
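Performance comparison:
As the comparison shows, loop fusion improves performance only marginally here. A likely reason is that both versions already walk the arrays sequentially, so fusion mainly saves reading a and c a second time.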
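The equivalence of the two versions can be checked with a small sketch. It is written under two assumptions that differ from the experiment: the arrays are flat std::vector<double> (aliased as Grid) instead of int[N][N], so that 1.0 / b is a real division, and the function names unfused and fused are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Grid = std::vector<double>;

// Two separate passes: the second pass has to read a and c again.
void unfused(Grid& a, const Grid& b, const Grid& c, Grid& d) {
    assert(a.size() == b.size() && b.size() == c.size() && c.size() == d.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = 1.0 / b[i] * c[i];
    for (std::size_t i = 0; i < a.size(); ++i)
        d[i] = a[i] + c[i];
}

// One fused pass: a[i] and c[i] are reused right after they are produced or loaded.
void fused(Grid& a, const Grid& b, const Grid& c, Grid& d) {
    assert(a.size() == b.size() && b.size() == c.size() && c.size() == d.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = 1.0 / b[i] * c[i];
        d[i] = a[i] + c[i];
    }
}
```

Fusion is legal here because d[i] depends only on a[i] and c[i] at the same index, so moving the second statement into the first loop does not change the order in which any value is produced and consumed.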