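Here is a rewritten version of your code. I split the work over the outermost loop rather than, as in your case, the innermost loop. I also explicitly allocate the local part of the result array d and extract the local part of the Hessian matrix.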
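In your code you rely on drange to split the work, and you access the distributed arrays directly to avoid extracting the local part. Admittedly, that should not cause such a large slowdown if MATLAB did everything correctly. The bottom line is that I do not know why your code is so slow. Most likely MATLAB performs some remote data access even though you distributed your matrices.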
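Anyway, the code below runs and gives a fairly good speedup on my computer using 4 labs. I generated synthetic random input data to have something to work on. Have a look at the comments. If something is unclear, I can elaborate later.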
clear all;
D = rand(512, 512, 3);
S = size(D);
[fx, fy, fz] = gradient(D);
% this part could also be parallelized - at least a bit.
tic;
DHess = zeros([3 3 S(1) S(2) S(3)]);
[DHess(1,1,:,:,:), DHess(1,2,:,:,:), DHess(1,3,:,:,:)] = gradient(fx);
[DHess(2,1,:,:,:), DHess(2,2,:,:,:), DHess(2,3,:,:,:)] = gradient(fy);
[DHess(3,1,:,:,:), DHess(3,2,:,:,:), DHess(3,3,:,:,:)] = gradient(fz);
toc
% your sequential implementation
d = zeros([3, S(1) S(2) S(3)]);
disp('sequential')
tic
for i = 1 : S(1)
    for ii = 1 : S(2)
        for iii = 1 : S(3)
            d(:,i,ii,iii) = eig(squeeze(DHess(:,:,i,ii,iii)));
        end
    end
end
toc
% my parallel implementation
disp('parallel')
tic
spmd
    % just for information
    disp(['lab ' num2str(labindex)]);
    % distribute the input data along the third dimension of DHess, i.e. S(1).
    % This is the dimension of the outermost loop, hence this is where we
    % want to parallelize!
    DHess_dist = codistributed(DHess, codistributor1d(3));
    DHess_local = getLocalPart(DHess_dist);
    % create an output data distribution -
    % note that this time we split along the second dimension, which is
    % again S(1) because d has one leading dimension fewer than DHess
    codist = codistributor1d(2, codistributor1d.unsetPartition, [3, S(1) S(2) S(3)]);
    localSize = [3 codist.Partition(labindex) S(2) S(3)];
    % allocate local part of the output array d
    d_local = zeros(localSize);
    % your ordinary loop, BUT! the outermost loop is split amongst the
    % labs explicitly, using local indexing. In the loop only local parts
    % of d and DHess are accessed
    for i = 1:size(d_local,2)
        for ii = 1 : S(2)
            for iii = 1 : S(3)
                d_local(:,i,ii,iii) = eig(squeeze(DHess_local(:,:,i,ii,iii)));
            end
        end
    end
    % assemble local results to a codistributed matrix
    d_dist = codistributed.build(d_local, codist);
end
toc
isequal(d, d_dist)
And the output:
Elapsed time is 0.364255 seconds.
sequential
Elapsed time is 33.498985 seconds.
parallel
Lab 1:
lab 1
Lab 2:
lab 2
Lab 3:
lab 3
Lab 4:
lab 4
Elapsed time is 9.445856 seconds.
ans =
1
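The parallel loop only works because the input and the output are partitioned the same way along S(1): each lab must own the same i-slices of DHess and of d. A minimal sketch of how one might check that on a small example follows. The sizes and the names A, A_dist, A_local, gIn, gOut and nLocal are made up for illustration, and it assumes a parallel pool is available.

```matlab
% Hypothetical small sizes so the check runs quickly.
S = [8 4 2];
A = rand([3 3 S]);                       % stands in for DHess
spmd
    % input: split along dimension 3, which is the S(1) axis of A
    A_dist  = codistributed(A, codistributor1d(3));
    A_local = getLocalPart(A_dist);
    % output: split along dimension 2, which is the S(1) axis of d
    codist = codistributor1d(2, codistributor1d.unsetPartition, [3 S]);
    % number of i-slices this lab owns in the input
    nLocal = size(A_local, 3);
    % global i-indices owned by this lab in the input and in the output
    gIn  = globalIndices(A_dist, 3);
    gOut = globalIndices(codist, 2);
    % both default partitions should agree on every lab
    assert(nLocal == codist.Partition(labindex));
    assert(isequal(gIn, gOut));
end
```

If either assertion fails on some lab, local index i in the loop would refer to different global slices in the input and the output, and the assembled result would be wrong.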
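Edit: I have checked the performance with the matrix reshaped to DHess = [3x3xN]. The improvement is small (about 10%), so it does not really matter. But maybe you could implement eig a bit differently? After all, those are 3x3 matrices you are dealing with.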
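One way to read "implement eig differently" is the closed-form trigonometric solution for symmetric 3x3 matrices, vectorized over all N matrices at once. This is only a sketch and was not part of the timings above. It assumes every 3x3 slice is exactly symmetric (a numerically computed Hessian may need symmetrizing first), and it can be less accurate than eig when eigenvalues are nearly repeated. The random test data and all variable names are hypothetical.

```matlab
% Hypothetical test data: N random symmetric 3x3 matrices stacked as 3x3xN.
N = 1000;
H = rand(3, 3, N);
H = (H + permute(H, [2 1 3])) / 2;       % symmetrize each slice

% The six independent entries of every slice as 1xN row vectors.
a11 = reshape(H(1,1,:), 1, []);
a22 = reshape(H(2,2,:), 1, []);
a33 = reshape(H(3,3,:), 1, []);
a12 = reshape(H(1,2,:), 1, []);
a13 = reshape(H(1,3,:), 1, []);
a23 = reshape(H(2,3,:), 1, []);

% Shift by the mean eigenvalue q and scale by p, so that B = (A - q*I)/p.
q   = (a11 + a22 + a33) / 3;
b11 = a11 - q;  b22 = a22 - q;  b33 = a33 - q;
p   = sqrt((b11.^2 + b22.^2 + b33.^2 + 2*(a12.^2 + a13.^2 + a23.^2)) / 6);

% r = det(B)/2, computed from det(A - q*I) divided by 2*p^3.
detS = b11.*(b22.*b33 - a23.^2) ...
     - a12.*(a12.*b33 - a23.*a13) ...
     + a13.*(a12.*a23 - b22.*a13);
r = detS ./ (2 * p.^3);
r(p == 0) = 0;                           % slice is a multiple of the identity
r = min(max(r, -1), 1);                  % guard acos against rounding error

phi = acos(r) / 3;
l3 = q + 2*p.*cos(phi);                  % largest eigenvalue
l1 = q + 2*p.*cos(phi + 2*pi/3);         % smallest eigenvalue
l2 = 3*q - l1 - l3;                      % the trace gives the middle one
lambda = [l1; l2; l3];                   % 3xN, ascending in each column
```

For the data in this answer, H would be reshape(DHess, 3, 3, []) and the result could be put back into the shape of d with reshape(lambda, [3 S(1) S(2) S(3)]). It is worth comparing against sort(eig(...)) on a sample of slices before trusting it.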