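The biggest overhead is the time spent starting and stopping threads. If I reduce the size of the arrays from 10000 to 10, it takes about the same amount of time.
If you keep a thread pool and give each thread its own share of the work, writing to a data set local to that thread, it is about 4x faster on my machine with 6 cores.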
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelImplementationOptimised {
    static final int numberOfThreads = Runtime.getRuntime().availableProcessors();
    // The pool is created once and reused by every call to update().
    final ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);
    private int numberOfCells;

    public ParallelImplementationOptimised(int numberOfCells) {
        this.numberOfCells = numberOfCells;
    }

    public void update() throws ExecutionException, InterruptedException {
        List<Future<?>> futures = new ArrayList<>();
        for (int thread = 0; thread < numberOfThreads; thread++) {
            final int threadId = thread;
            futures.add(exec.submit(new Runnable() {
                @Override
                public void run() {
                    // Integer division: any remainder is dropped, and num is 0
                    // when there are more threads than cells.
                    int num = numberOfCells / numberOfThreads;
                    // Each task allocates and writes only its own local arrays.
                    double[] h0 = new double[num],
                            h1 = new double[num],
                            h2 = new double[num],
                            h3 = new double[num],
                            h4 = new double[num],
                            h5 = new double[num],
                            h6 = new double[num],
                            h7 = new double[num],
                            h8 = new double[num],
                            h9 = new double[num];
                    for (int i = 0; i < num; i++) {
                        h0[i] = h0[i] + 1;
                        h1[i] = h1[i] + 1;
                        h2[i] = h2[i] + 1;
                        h3[i] = h3[i] + 1;
                        h4[i] = h4[i] + 1;
                        h5[i] = h5[i] + 1;
                        h6[i] = h6[i] + 1;
                        h7[i] = h7[i] + 1;
                        h8[i] = h8[i] + 1;
                        h9[i] = h9[i] + 1;
                    }
                }
            }));
        }
        // Wait for every task of this round to finish.
        for (Future<?> future : futures) {
            future.get();
        }
    }

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        ParallelImplementationOptimised si = new ParallelImplementationOptimised(10);
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000; i++) {
            if (i % 1000 == 0) {
                System.out.println(i);
            }
            si.update();
        }
        long stop = System.currentTimeMillis();
        System.out.println("Time: " + (stop - start));
        si.exec.shutdown();
    }
}
SequentialImplementation: 3.3 seconds.
ParallelImplementationOptimised: 0.8 seconds.
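To look at the start/stop overhead in isolation, here is a minimal sketch that runs the same trivial task either on freshly created threads each round or on a reused fixed pool. The class name `ThreadStartupCost`, its two methods, and the round and thread counts are made up for illustration and are not part of the code above. It is a crude timing, not a proper benchmark, so the numbers will vary with the machine and JVM warm-up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class ThreadStartupCost {
    /** Starts and joins fresh threads in every round; returns the number of tasks run. */
    static int withNewThreads(int rounds, int threads) throws InterruptedException {
        AtomicInteger done = new AtomicInteger();
        Runnable task = done::incrementAndGet; // trivial work, so startup cost dominates
        for (int r = 0; r < rounds; r++) {
            List<Thread> started = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                Thread th = new Thread(task);
                th.start();
                started.add(th);
            }
            for (Thread th : started) {
                th.join();
            }
        }
        return done.get();
    }

    /** Reuses one fixed pool for all rounds; returns the number of tasks run. */
    static int withPool(int rounds, int threads) throws Exception {
        AtomicInteger done = new AtomicInteger();
        Runnable task = done::incrementAndGet;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int r = 0; r < rounds; r++) {
                List<Future<?>> futures = new ArrayList<>();
                for (int t = 0; t < threads; t++) {
                    futures.add(pool.submit(task));
                }
                for (Future<?> f : futures) {
                    f.get();
                }
            }
        } finally {
            pool.shutdown();
        }
        return done.get();
    }

    public static void main(String[] args) throws Exception {
        int rounds = 10000; // assumed value, chosen to match the loop count used above
        int threads = Runtime.getRuntime().availableProcessors();

        long t0 = System.nanoTime();
        withNewThreads(rounds, threads);
        long t1 = System.nanoTime();
        withPool(rounds, threads);
        long t2 = System.nanoTime();

        System.out.println("new threads per round: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("reused pool:           " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Since the task does almost nothing, whatever difference shows up comes from creating, scheduling and tearing down threads rather than from the work itself.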
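You appear to be writing to data that sits on the same cache line. This means the data has to be passed between cores via an L3 cache miss, which takes about 20x longer than an access to L1 cache. I suggest you try completely separate data structures that are at least 128 bytes apart, to be sure you are not touching the same cache line.
Note: even if you intend to completely overwrite a whole cache line, x64 CPUs will pull in the previous values of the cache line first.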
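A minimal sketch of the spacing idea, assuming 8-byte `long` slots: each thread increments its own slot in a shared `AtomicLongArray`, once with adjacent slots and once with slots 16 longs (128 bytes) apart. The class name `CacheLineSpacing` and the iteration count are hypothetical, and atomic increments are used here only so that every write really goes to memory. It is an illustration rather than a rigorous benchmark, and how large the gap is depends on the CPU.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLongArray;

class CacheLineSpacing {
    // 16 longs * 8 bytes = 128 bytes between the slots of two threads.
    static final int PADDED_STRIDE = 16;

    /**
     * Each thread increments only its own slot; slots are `stride` longs apart.
     * Returns the final value of every thread's slot.
     */
    static long[] run(int threads, int stride, int iterations) throws Exception {
        final AtomicLongArray slots = new AtomicLongArray(threads * stride);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int index = t * stride;
                futures.add(pool.submit(() -> {
                    for (int i = 0; i < iterations; i++) {
                        slots.incrementAndGet(index);
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        long[] totals = new long[threads];
        for (int t = 0; t < threads; t++) {
            totals[t] = slots.get(t * stride);
        }
        return totals;
    }

    public static void main(String[] args) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        int iterations = 10_000_000; // assumed value, large enough to be measurable

        long t0 = System.nanoTime();
        run(threads, 1, iterations);             // adjacent slots, likely sharing cache lines
        long t1 = System.nanoTime();
        run(threads, PADDED_STRIDE, iterations); // slots 128 bytes apart
        long t2 = System.nanoTime();

        System.out.println("adjacent slots: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("128-byte apart: " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Both runs do exactly the same amount of work and no thread ever touches another thread's slot, so any timing difference comes from how the slots are laid out in memory.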
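Another question might be:
Why isn't this 20x slower?
The CPU core that has grabbed the cache line might have two threads running with hyperthreading (i.e. two threads can access the data locally), and that core might go around the loop a few times before it loses the cache line to another core that is demanding it. This means the 20x penalty is not paid on every access or every loop iteration, but it is paid often enough that you get a much slower result.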