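Parallel matrix multiplication is usually done with Cannon's algorithm, but on a shared-memory machine I don't think Cannon's algorithm has any advantage.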
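Cannon's algorithm raises the degree of parallelism while keeping exactly one global copy of every variable: each block of the matrices lives on one processor at a time and is passed along between iterations. With shared memory there is no cost for copying variables, so a plain striped (row-band) partition can be used directly. That avoids the barrier between iterations and improves efficiency.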
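To make the contrast concrete, here is a minimal single-process sketch of the data movement in Cannon's algorithm. It assumes the simplest case of one matrix element per "processor" on a p*p grid; the name cannonMultiply is made up for this illustration and is not part of matrixOperation.h. The point to notice is that every step ends with a shift of a and b, and no processor may shift before all of them have finished multiplying.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Simulates Cannon's algorithm with one element per processor (hypothetical helper).
Matrix cannonMultiply(const Matrix& A, const Matrix& B)
{
    const int p = static_cast<int>(A.size());
    assert(p > 0 && static_cast<int>(B.size()) == p);
    Matrix a(p, std::vector<double>(p)), b(p, std::vector<double>(p));
    Matrix c(p, std::vector<double>(p, 0.0));

    // Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    for (int i = 0; i < p; ++i)
        for (int j = 0; j < p; ++j) {
            a[i][j] = A[i][(j + i) % p];
            b[i][j] = B[(i + j) % p][j];
        }

    for (int step = 0; step < p; ++step) {
        // Local multiply-accumulate on every processor.
        for (int i = 0; i < p; ++i)
            for (int j = 0; j < p; ++j)
                c[i][j] += a[i][j] * b[i][j];

        // All multiplies must be done before anything moves. On a real
        // machine this is the synchronization point of each iteration.
        Matrix na(a), nb(b);
        for (int i = 0; i < p; ++i)
            for (int j = 0; j < p; ++j) {
                na[i][j] = a[i][(j + 1) % p]; // shift A left by one
                nb[i][j] = b[(i + 1) % p][j]; // shift B up by one
            }
        a.swap(na);
        b.swap(nb);
    }
    return c;
}
```

With a row-band partition on shared memory, none of this shifting is needed, because every thread can read any element of a and b directly.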
Matrix multiplication on SMP
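#include "stdafx.h"
#include "matrixOperation.h" // readMatrix, writeMatrix, compareMatrix, matrixTransposition
#include <omp.h>
#include <iostream> // cout, cin, endl
#include <ctime>    // time
using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{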
    const int size=5000;
    // Three size*size matrices stored as arrays of row pointers (about 200 MB each).
    double **a,**b,**c;
    a=new double*[size];
    b=new double*[size];
    c=new double*[size];
    for(int i=0;i<size;++i)
    {
        a[i]=new double[size];
        b[i]=new double[size];
        c[i]=new double[size];
    }
    cout<<"mem set"<<endl;
    // Read the same input file into both a and b, so the product computed is a*a.
    cout<<readMatrix("matrix",a,size)<<endl;
    cout<<readMatrix("matrix",b,size)<<endl;
    cout<<compareMatrix(a,b,size)<<endl;
    // For more cache hits: transpose b so that the elements needed by the
    // inner loop (one column of the original b) sit contiguously in memory.
    matrixTransposition(b,size);
    cout<<"data prepared"<<endl<<"calculating"<<endl;
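    long start=time(0); // wall-clock time, 1-second resolution
    // omp_set_nested(true);
    // Row-band partition: each thread computes whole rows of c, so the only
    // synchronization is the implicit barrier at the end of the loop.
    #pragma omp parallel for num_threads(16) schedule(dynamic)
    for(int i=0;i<size;++i)
    {
        // #pragma omp parallel for firstprivate(i) num_threads(4)
        for(int j=0;j<size;++j)
        {
            double sum=0; // accumulate locally, write c[i][j] once
            for(int k=0;k<size;++k)
            {
                // b is transposed, so b[j][k] is the original b[k][j]
                sum+=a[i][k]*b[j][k];
            }
            c[i][j]=sum;
        }
        cout<<"."; // progress: one dot per finished row
    }
    long end=time(0);
    cout<<end-start<<" seconds"<<endl;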
    writeMatrix("out",c,size);
    for(int i=0;i<size;++i)
    {
        delete[] a[i];
        delete[] b[i];
        delete[] c[i];
    }
    delete[] a;
    delete[] b;
    delete[] c;
    cin>>start; // wait for input so the console window stays open
    return 0;
}
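For trying the idea without the input files or the matrixOperation header, the core of the program can be reduced to a small self-contained sketch. This is only an illustration under my own assumptions: it uses std::vector instead of raw pointers, and the names transposed and multiplyRowParallel are hypothetical stand-ins, not functions from matrixOperation.h. If the compiler is run without OpenMP enabled, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Returns the transpose of a square matrix (stand-in for matrixTransposition).
Matrix transposed(const Matrix& m)
{
    const int n = static_cast<int>(m.size());
    Matrix t(n, std::vector<double>(n));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            t[j][i] = m[i][j];
    return t;
}

// Computes c = a * b, where bt is b already transposed.
Matrix multiplyRowParallel(const Matrix& a, const Matrix& bt)
{
    const int n = static_cast<int>(a.size());
    assert(static_cast<int>(bt.size()) == n);
    Matrix c(n, std::vector<double>(n, 0.0));

    // Each iteration of i owns one full row of c, so threads never write
    // to the same element and need no barrier inside the loop.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[i][k] * bt[j][k]; // both rows are read contiguously
            c[i][j] = sum;
        }
    }
    return c;
}
```

Calling multiplyRowParallel(A, transposed(B)) gives the same result as the textbook formula c[i][j] = sum over k of A[i][k]*B[k][j]; only the memory access pattern changes.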
On an i7 2600, the parameters above worked well for multiplying two 5000*5000 matrices: the pure computation time was around 126 seconds.
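To put that time in perspective, it can be turned into a rough throughput figure. The sketch below assumes the usual count of one multiply and one add per inner-loop iteration, that is 2*n^3 floating-point operations in total; the function name estimatedGflops is made up for this example.

```cpp
#include <cassert>

// Rough throughput in GFLOP/s for an n*n by n*n product that took `seconds`.
// Assumes 2*n^3 floating-point operations (one multiply and one add per step).
double estimatedGflops(double n, double seconds)
{
    assert(seconds > 0);
    const double flops = 2.0 * n * n * n;
    return flops / seconds / 1e9;
}
```

For n = 5000 and 126 seconds this works out to 2.5e11 operations, or about 1.98 GFLOP/s.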
The matrixOperation header is in another post: http://blog.csdn.net/pouloghost/article/details/8746913