1. Overview
Previous lectures
In the last lecture we started looking at shared memory parallelism (SMP):
- Relevant to multi-core CPUs.
- Separate processing units (cores) share some levels of memory cache.
- Various frameworks for programming SMP systems.
- Widely-implemented standard: OpenMP
Today’s lecture
Today we are going to look at some actual problems.
- Examples of data parallel problems, where the same operation is applied to multiple data elements.
- Also known as a map.
- Multi-threading solution employs a fork-join pattern.
- How to parallelise nested loops.
- Parallel code can be non-deterministic, even when the serial code is deterministic.
2. Multi-threaded vector addition
2.1 Vector addition in serial
Vector addition
An n-vector a can be thought of as an array of n numbers: a = (a1, a2, . . . , an).
If two vectors a and b are the same size, they can be added to generate a new n-vector c:
c = a + b = (a1 + b1, a2 + b2, . . . , an + bn).
Or: ci = ai + bi for each i = 1, . . . , n.
Serial vector addition
vectorAddition_serial.c
//
// Vector addition in serial. Compile with: gcc -Wall -o vectorAddition_serial vectorAddition_serial.c
//
// Includes. Don't need <omp.h> for this example.
#include <stdio.h>
#include <stdlib.h>
// n is the size of all three vectors. For simplicity, use statically allocated arrays
// [rather than dynamically allocated using malloc()/free()]
#define n 20
int main()
{
    float a[n], b[n], c[n];
    int i;

    // Initialise a[n] and b[n].
    // For this example, assign arbitrary numbers to vectors a and b.
    for( i=0; i<n; i++ ) a[i] = b[i] = i;

    // Vector addition in serial.
    for( i=0; i<n; i++ ) c[i] = a[i] + b[i];

    return 0;
}
2.2 Vector addition in parallel
Add #pragma omp parallel for just before the loop:
vectorAddition_parallel.c
//
// Vector addition in parallel. Compile with: gcc -fopenmp -Wall -o vectorAddition_parallel vectorAddition_parallel.c
//
// Includes. Don't need <omp.h> for this example.
#include <stdio.h>
#include <stdlib.h>
// n is the size of all three vectors. For simplicity, use statically allocated arrays
// [rather than dynamically allocated using malloc()/free()]
#define n 20
int main()
{
    float a[n], b[n], c[n];
    int i;

    // Initialise a[n] and b[n].
    // For this example, assign arbitrary numbers to vectors a and b.
    for( i=0; i<n; i++ ) a[i] = b[i] = i;

    // Vector addition in parallel.
    #pragma omp parallel for
    for( i=0; i<n; i++ ) c[i] = a[i] + b[i];

    return 0;
}
This only parallelises this one loop, not any later ones!
Fork-and-join
When the executable reaches #pragma omp parallel for, it spawns multiple threads.
- Each thread computes part of the loop.
- The extra threads are destroyed at the end of the loop.
This is known as a fork-join construct:
2.3 Thread-level description
Example: Four threads in total
Pseudocode for the main thread:
// Main thread starts in serial.
// Initialise arrays a, b; allocate c.
...
// REACHES #pragma omp parallel for
// FORK: Create three new threads.
worker1 = fork(...);
worker2 = fork(...);
worker3 = fork(...);

// Perform 1/4 of the total loop.
for( i=0; i<n/4; i++ )
    c[i] = a[i] + b[i];

// JOIN: Wait for other threads to finish.
worker1.join();
worker2.join();
worker3.join();

// Continue in serial after the loop.
Worker thread 1:
// CREATED BY MAIN ('fork')
// Perform second 1/4 of loop.
for( i=n/4; i<n/2; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')
Worker thread 2:
// CREATED BY MAIN ('fork')
// Perform third 1/4 of loop.
for( i=n/2; i<3*n/4; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')
Worker thread 3:
// CREATED BY MAIN ('fork')
// Perform final 1/4 of loop.
for( i=3*n/4; i<n; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')
Notes
The four threads are not being executed one after the other:
- Each thread runs concurrently, hopefully on separate cores, i.e. in parallel.
- Cannot be understood in terms of serial programming concepts.
Each thread performs the same operations on different data.
- Would be SIMD in Flynn’s taxonomy, except this is implemented in software on a MIMD device.
We have assumed n is divisible by the number of threads for clarity.
- Generalising to arbitrary n is not difficult, but obscures the parallel aspects.
#pragma omp parallel for
The total loop range was evenly divided between all threads.
- Happens as soon as #pragma omp parallel for is reached.
- The trip count (i.e. loop range) must be known at the start of the loop.
- The start, end and stride must be constant.
- Cannot break from the loop.
- Cannot be applied to 'while' or 'do . . . while' loops.
2.4 Data parallel and embarrassingly parallel
This is an example of a data parallel problem or a map:
- Array elements distributed evenly over the threads.
- Same operation performed on all elements.
- Suitable for the SIMD model
In fact, this example is so straightforward to parallelise that it is also sometimes referred to as an embarrassingly parallel problem.
- Easy to get working correctly in parallel.
- May still be a challenge to achieve good parallel performance.
(Embarrassingly parallel problems)
When writing a parallel program, we first decompose the problem into parts and assign each part to a different processor or thread for computation. If the problem can be decomposed into completely independent sub-computations (tasks), with almost no data dependencies and no communication between tasks, it is called an embarrassingly parallel problem. The input data is split directly into sub-blocks with no dependencies between them, each sub-block is processed in exactly the same way, and the sub-results are merged for output at the end. Such problems are naturally suited to parallel computation.
3. Nested loops in parallel
3.1 Worked example: Mandelbrot set
Mandelbrot set generator
Mandelbrot.c, makefile
A classic computationally intensive problem in two dimensions, once used as a benchmark for processor speeds.
- Loops over pixels, i.e. a two dimensional, nested double loop.
- Colour of each pixel calculated independently of all other pixels.
- Each colour calculation requires many floating point operations
Code snippet
The part of the code that interests us here is shown below:
// Change the colour arrays for the whole image.
int i, j;
for( j=0; j<numPixels_y; j++ )
    for( i=0; i<numPixels_x; i++ )
    {
        // Set the colour of pixel (i,j), i.e. modify the values
        // of red[i][j], green[i][j], and/or blue[i][j].
        setPixelColour( i, j );
    }
Note the i-loop is nested inside the j-loop.
The graphical output is performed in OpenGL/GLFW. Since the include and link steps differ between Linux and Macs, a simple makefile has been provided.
Purely for background interest, here’s how the colours are calculated:
1. Each pixel (i,j) is converted to floating point numbers cx, cy, both in the range -2 to 2.
2. Two other floats zx and zy are initialised to zero.
3. The iteration zx ← zx^2 - zy^2 + cx, zy ← 2 zx zy + cy (using the old values of zx and zy on the right-hand sides) is performed until zx^2 + zy^2 ≥ 4, or a maximum number of iterations maxIters is reached.
4. The colour is selected based on the number of iterations.
3.2 Parallelising nested loops
Parallel Mandelbrot: First attempt
Parallelise only the inner loop.
int i, j;
for( j=0; j<numPixels_y; j++ )
{
    #pragma omp parallel for
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
}
This works, but may be slower than serial (check on your system).
There are multiple possible reasons for this:
- The fork-join is inside the j-loop, so threads are created and destroyed numPixels_y times, which incurs an overhead.
- This problem suffers from poor load balancing; see later.
Parallel Mandelbrot: Second attempt
Parallelise only the outer loop, so there is only a single fork event and a single join event.
int i, j;
#pragma omp parallel for
for( j=0; j<numPixels_y; j++ )
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
This is faster . . . but wrong!
- A distorted image results.
- The distortion is different each time the program is executed.
The same variable i for the inner loop counter is being updated by all threads:
- When one thread completes a calculation, it increments i.
- Therefore other threads will skip at least one pixel.
- Threads do not calculate the full line of pixels.
Parallel Mandelbrot: Third attempt
Make the inner loop variable i private to each thread:
int j;
#pragma omp parallel for
for( j=0; j<numPixels_y; j++ )
{
    int i;
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
}
. . . or (for compilers following the C99 standard):
#pragma omp parallel for
for( int j=0; j<numPixels_y; j++ )
    for( int i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
The private clause
A third way to solve this is to use OpenMP’s private clause:
int i, j;
#pragma omp parallel for private(i)
for( j=0; j<numPixels_y; j++ )
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
- Creates a copy of i for each thread.
- Multiple variables may be listed, e.g. private(i,a,b,c)
The code now works . . . but is no faster than serial!
- The primary overhead is poor load balancing. We will look at this briefly next lecture, and in detail in Lecture 13.
3.3 The collapse clause
The collapse clause
The collapse clause replaces 2 or more nested loops with a single loop, at the expense of additional internal calculations.
#pragma omp parallel for collapse(2)
for( int j=0; j<numPixels_y; j++ )
    for( int i=0; i<numPixels_x; i++ )
        setPixelColour( i, j );
is equivalent to (but more readable than)
#pragma omp parallel for
for( int k=0; k<numPixels_x*numPixels_y; k++ )
{
    int
        i = k % numPixels_x,
        j = k / numPixels_x;
    setPixelColour( i, j );
}
This is principally intended for short loops that cannot be equally distributed across all threads.
3.4 Determinism and non-determinism
Determinism and non-determinism
Notice that the incorrect images were slightly different each time:
The pixels plotted depend on the order in which threads update the shared variable i, which depends on the thread scheduler.
- Will be influenced by factors outside our control.
- e.g. the various background tasks that every OS must run.
Our serial code was deterministic, i.e. produced the same results each time it was run.
By contrast, our (incorrect) parallel code was non-deterministic.
Often this is the result of an error, but can sometimes be useful:
- Some algorithms, often in science and engineering, do not care about non-deterministic errors as long as they are small.
- Strictly imposing determinism may result in additional overheads and performance loss.
However, for this module we will try to develop parallel algorithms whose results match that of the serial equivalent.
Summary and next lecture
Today we have looked at data parallel problems, or maps, where the same operation is applied to multiple data elements.
Distribute data evenly across threads.
Sometimes referred to as embarrassingly parallel.
In two lectures' time we will start looking at more complex problems for which the calculations on different threads are not independent.
Before then, we need to learn the vocabulary of parallel theory, which is the topic of the next lecture.