This code is slower with OpenMP. Without OpenMP I get about 10 s; with OpenMP I get about 40 s. What is happening? Thank you very much, friends!
for (i=2; i<nnoib-2; i++){
    #pragma omp parallel for
    for (j=2; j<nnojb-2; j++) {
        C[i][j]= absi[i]*absj[j]*
            (2.0f*B[i][j] + absi[i]*absj[j]*
                (VEL[i][j]*VEL[i][j]*fat*
                    (16.0f*(B[i][j-1]+B[i][j+1]+B[i-1][j]+B[i+1][j])
                     -1.0f*(B[i][j-2]+B[i][j+2]+B[i-2][j]+B[i+2][j])
                     -60.0f*B[i][j]
                    )-A[i][j]));
        c2 = (abs(C[i][j]) > Amax[i][j]);
        if (c2) {
            Amax[i][j] = abs(C[i][j]);
            Ttra[i][j] = t;
        }
    }
}
Solution
Just because you're using OpenMP doesn't mean your program will run faster. A couple of things can be happening here:
There is a cost associated with spawning each thread, and if a thread is given only a small amount of computation, spawning it can take more time than the computation itself.
By default, OpenMP will spawn the maximum number of threads supported by your CPU. On CPUs that support 2 or more threads per core, those threads will compete for each core's resources. Using omp_get_max_threads() you can see how many threads a parallel region will use by default (omp_get_num_threads() returns 1 when called outside a parallel region). I recommend trying to run your code with half that value, set via omp_set_num_threads().
Did you confirm the results are the same with and without OpenMP? As written, the variables j and c2 are shared between threads, which is a data race. You should declare them private to each thread:
#pragma omp parallel for private(j,c2)
I wanted to add another thing: before attempting any parallelization, you should make sure that the code is already optimized.
Depending on your compiler, compiler flags and the complexity of the instruction, the compiler may or may not optimize your code:
// avoid recomputing nnoib-2 every iteration
int t_nnoib = nnoib - 2;
for (i = 2; i < t_nnoib; ++i) {
    // avoid recomputing nnojb-2 every iteration
    int t_nnojb = nnojb - 2;
    // avoid reloading absi[i] on every iteration of the j loop
    float t_absi = absi[i];
    for (j = 2; j < t_nnojb; ++j) {
        C[i][j] = t_absi * absj[j] *
            (2.0f*B[i][j] + t_absi * absj[j] *
                (VEL[i][j] * VEL[i][j] * fat *
                    (16.0f * (B[i][j-1] + B[i][j+1] + B[i-1][j] + B[i+1][j])
                     -1.0f * (B[i][j-2] + B[i][j+2] + B[i-2][j] + B[i+2][j])
                     -60.0f * B[i][j]
                    ) - A[i][j]));
        // c2 is unnecessary: test the condition directly.  Also use
        // fabsf() from <math.h>: abs() takes an int and would truncate.
        if (fabsf(C[i][j]) > Amax[i][j]) {
            Amax[i][j] = fabsf(C[i][j]);
            Ttra[i][j] = t;
        }
    }
}
It may not seem like much, but it can have a huge impact on your code. The compiler will try to place local variables in registers (which have much faster access times). Keep in mind that you can't apply this technique indefinitely, since you have a limited number of registers, and overusing it will cause your code to suffer from register spilling.
In the case of the array absi, hoisting the load means the value lives in a register instead of being re-read from cache on every iteration of the j loop. The general idea of this technique is to move out to the outer loop any array access that doesn't depend on the inner loop's variable.