[3] XJCO3221 Parallel Computation - Data parallel problems

1. Overview

Previous lectures

In the last lecture we started looking at shared memory parallelism (SMP):

  • Relevant to multi-core CPUs.
  • Separate processing units (cores) share some levels of memory cache.
  • Various frameworks for programming SMP systems.
  • Widely-implemented standard: OpenMP

 Today’s lecture

Today we are going to look at some actual problems.

  • Examples of data parallel problems, where the same operation is applied to multiple data elements.
  • Also known as a map.
  • Multi-threading solution employs a fork-join pattern.
  • How to parallelise nested loops.
  • Parallel code can be non-deterministic, even when the serial code is deterministic.

2. Multi-threaded vector addition

2.1 Vector addition in serial

Vector addition

An n-vector a can be thought of as an array of n numbers: a = (a1, a2, ..., an).

If two vectors a and b are the same size, they can be added to generate a new n-vector c:

c = a + b = (a1 + b1, a2 + b2, ..., an + bn)

Or, element by element:

ci = ai + bi, for i = 1, ..., n

Serial vector addition

vectorAddition_serial.c

//
// Vector addition in serial. Compile with: gcc -Wall -o vectorAddition_serial vectorAddition_serial.c
//

// Includes. Don't need <omp.h> for this example.
#include <stdio.h>
#include <stdlib.h>

// n is the size of all three vectors. For simplicity, use statically allocated arrays
// [rather than dynamically allocated using malloc()/free()]
#define n 20

int main()
{
	float a[n], b[n], c[n];
	int i;
    // Initialise a[n] and b[n]
	// For this example, assign arbitrary numbers to vectors a and b.
	for( i=0; i<n; i++ ) a[i] = b[i] = i;

	// Vector addition in serial
	for( i=0; i<n; i++ ) c[i] = a[i] + b[i];
	return 0;
}

2.2 Vector addition in parallel

Add #pragma omp parallel for just before the loop:

vectorAddition_parallel.c

//
// Vector addition in parallel. Compile with: gcc -fopenmp -Wall -o vectorAddition_parallel vectorAddition_parallel.c 
//

// Includes. Don't need <omp.h> for this example.
#include <stdio.h>
#include <stdlib.h>

// n is the size of all three vectors. For simplicity, use statically allocated arrays
// [rather than dynamically allocated using malloc()/free()]
#define n 20

int main()
{
	float a[n], b[n], c[n];
	int i;
    // Initialise a[n] and b[n]
	// For this example, assign arbitrary numbers to vectors a and b.
	for( i=0; i<n; i++ ) a[i] = b[i] = i;

	// Vector addition in parallel
    #pragma omp parallel for
	for( i=0; i<n; i++ ) c[i] = a[i] + b[i];
	return 0;
}

This only parallelises this one loop, not any later ones!
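
For example, in this sketch (not from the lecture's code) a second loop placed after the parallel one still executes serially, unless it is given its own pragma:

#pragma omp parallel for
for( i=0; i<n; i++ ) c[i] = a[i] + b[i];      // Executed in parallel.

for( i=0; i<n; i++ ) c[i] *= 2.0f;            // Still serial: needs its own #pragma.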

Fork-and-join

When the executable reaches #pragma omp parallel for, it spawns multiple threads.

  • Each thread computes part of the loop.
  • The extra threads are destroyed at the end of the loop.

This is known as a fork-join construct.

 2.3 Thread-level description

Example: Four threads in total

Pseudocode for the main thread:

// Main thread starts in serial.
// Initialise arrays a, b; allocate c.
...
// REACHES #pragma omp parallel for
// FORK: Create three new threads.
worker1 = fork(...);
worker2 = fork(...);
worker3 = fork(...);

// Perform the first 1/4 of the total loop.
for( i=0; i<n/4; i++ )
    c[i] = a[i] + b[i];

// JOIN: Wait for other threads to finish.
worker1.join();
worker2.join();
worker3.join();

// Continue in serial after the loop.

Worker thread 1:

// CREATED BY MAIN ('fork')
// Perform the second 1/4 of the loop.
for( i=n/4; i<n/2; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')

Worker thread 2:

// CREATED BY MAIN ('fork')
// Perform the third 1/4 of the loop.
for( i=n/2; i<3*n/4; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')

Worker thread 3:

// CREATED BY MAIN ('fork')
// Perform the final 1/4 of the loop.
for( i=3*n/4; i<n; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')

Notes

The four threads are not being executed one after the other:

  • Each thread runs concurrently, hopefully on separate cores, i.e. in parallel.
  • Cannot be understood in terms of serial programming concepts.

Each thread performs the same operations on different data.

  • Would be SIMD in Flynn’s taxonomy, except this is implemented in software on a MIMD device.

We have assumed n is divisible by the number of threads for clarity.

  • Generalising to arbitrary n is not difficult, but obscures the parallel aspects; a sketch is given below.
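
As an illustration, here is a minimal sketch of how a range of n iterations could be divided between threads for arbitrary n; the names tid and numThreads are illustrative, not actual OpenMP internals. Integer division ensures the chunk sizes differ by at most one:

// Thread tid of numThreads computes iterations [start, end).
int start = (  tid    * n ) / numThreads;
int end   = ( (tid+1) * n ) / numThreads;
for( i=start; i<end; i++ ) c[i] = a[i] + b[i];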

 #pragma omp parallel for

The total loop range was evenly divided between all threads.

  • Happens as soon as #pragma omp parallel for is reached.
  • The trip count (i.e. loop range) must be known at the start of the loop.
  • The start, end and stride must be constant.
  • Cannot break from the loop.
  • Cannot be applied to 'while' or 'do ... while' loops (see the example below).
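
For example, the following sketch (not from the lecture code) rewrites the vector addition as a 'while' loop; although it performs the same work, it cannot take the pragma:

// Equivalent to the earlier 'for' loop, but as a 'while' loop
// it cannot be annotated with #pragma omp parallel for.
i = 0;
while( i < n )
{
    c[i] = a[i] + b[i];
    i++;
}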

2.4 Data parallel and embarrassingly parallel

This is an example of a data parallel problem or a map:

  • Array elements distributed evenly over the threads.
  • Same operation performed on all elements.
  • Suitable for the SIMD model

In fact, this example is so straightforward to parallelise that it is also sometimes referred to as an embarrassingly parallel problem.

  • Easy to get working correctly in parallel.
  • May still be a challenge to achieve good parallel performance.

(Embarrassingly parallel problems)

When writing a parallel program, the problem is first decomposed into parts, and each part is assigned to a different processor or thread. If the problem can be decomposed into completely independent sub-computations (tasks), with almost no data dependencies and no communication between them, it is called an embarrassingly parallel problem. The input data is simply split into mutually independent sub-blocks, each processed in exactly the same way, and the results are merged at the end. Such problems are naturally suited to parallel computation.

3. Nested loops in parallel

3.1 Worked example: Mandelbrot set

Mandelbrot set generator

Mandelbrot.c, makefile

A classic, computationally intensive problem in two dimensions that was once used as a benchmark for processor speeds:

  • Loops over pixels, i.e. a two dimensional, nested double loop.
  • Colour of each pixel calculated independently of all other pixels.
  • Each colour calculation requires many floating point operations

 Code snippet

The part of the code that interests us here is shown below:

// Change the colour arrays for the whole image.
int i, j;
for( j=0; j<numPixels_y; j++ )
    for( i=0; i<numPixels_x; i++ )
    {
        // Set the colour of pixel (i,j), i.e. modify the values
        // of red[i][j], green[i][j], and/or blue[i][j].
        setPixelColour( i, j );
    }

Note the i-loop is nested inside the j-loop.

The graphical output is performed in OpenGL/GLFW. Since including and linking differ between Linux and Macs, a simple makefile has been provided.

Purely for background interest, here’s how the colours are calculated:

1. Each pixel (i,j) is converted to floating point numbers cx, cy, both in the range -2 to 2.

2. Two other floats zx and zy are initialised to zero.

3. The following iteration is performed until zx^2 + zy^2 ≥ 4, or a maximum number of iterations maxIters is reached:

   zx_new = zx^2 - zy^2 + cx
   zy_new = 2*zx*zy + cy

   (i.e. z → z^2 + c, where z = zx + i*zy and c = cx + i*cy).

4. The colour is selected based on the number of iterations; a sketch of these steps is given below.
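
Purely as an illustration, these steps might look like the following sketch; the helper name iterationCount is hypothetical and not taken from Mandelbrot.c:

// Return the number of iterations before z = zx + i*zy escapes
// (zx^2 + zy^2 >= 4), up to a maximum of maxIters.
int iterationCount( float cx, float cy, int maxIters )
{
    float zx = 0.0f, zy = 0.0f;              // Step 2: initialise z to zero.
    int iter = 0;
    while( zx*zx + zy*zy < 4.0f && iter < maxIters )
    {
        float newZx = zx*zx - zy*zy + cx;    // Real part of z^2 + c.
        zy = 2.0f*zx*zy + cy;                // Imaginary part of z^2 + c.
        zx = newZx;
        iter++;
    }
    return iter;                             // Step 4: colour chosen from this count.
}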

3.2 Parallelising nested loops

Parallel Mandelbrot: First attempt

Parallelise only the inner loop.

int i, j;
for( j=0; j<numPixels_y; j++ )
{
    #pragma omp parallel for
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
}

This works, but may be slower than serial (check on your system; a timing sketch is given after the list below).

There are multiple possible reasons for this:

  • The fork-join is inside the j-loop, so threads are created and destroyed numPixels_y times, which incurs an overhead.
  • This problem suffers from poor load balancing; see later.
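
To time this on your own system, one option is OpenMP's omp_get_wtime(), which returns the wall-clock time in seconds. A sketch, assuming the loop nest above and #include <omp.h>:

double start = omp_get_wtime();

for( j=0; j<numPixels_y; j++ )
{
    #pragma omp parallel for
    for( i=0; i<numPixels_x; i++ )
        setPixelColour( i, j );
}

printf( "Inner-loop version took %g seconds.\n", omp_get_wtime() - start );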

Parallel Mandelbrot: Second attempt

Parallelise only the outer loop, so there is only a single fork event and a single join event.

int i, j;
#pragma omp parallel for
for( j=0; j<numPixels_y; j++ )
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }

This is faster ... but wrong!

  • A distorted image results.
  • The distortion is different each time the program is executed.

 The same variable i for the inner loop counter is being updated by all threads:

  • When one thread completes a calculation, it increments i.
  • Therefore other threads will skip at least one pixel.
  • Threads do not calculate the full line of pixels.

 Parallel Mandelbrot: Third attempt

Make the inner loop variable i private to each thread:

int j;
#pragma omp parallel for
for( j=0; j<numPixels_y; j++ )
{
    int i;
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
}

... or (for compilers following the C99 standard):


#pragma omp parallel for
for( int j=0; j<numPixels_y; j++ )
    for( int i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }

The private clause

 A third way to solve this is to use OpenMP’s private clause:

int i, j;
#pragma omp parallel for private(i)
for( j=0; j<numPixels_y; j++ )
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }

  • Creates a copy of i for each thread.
  • Multiple variables may be listed, e.g. private(i,a,b,c)

The code now works ... but is no faster than serial!

  • The primary overhead is poor load balancing. We will look at this briefly next lecture, and in detail in Lecture 13.

 3.3 The collapse clause

The collapse clause

The collapse clause replaces two or more nested loops with a single loop, at the expense of additional internal index calculations.

#pragma omp parallel for collapse(2)
for( int j=0; j<numPixels_y; j++ )
    for( int i=0; i<numPixels_x; i++ )
        setPixelColour( i, j );

is equivalent to (but more readable than)

#pragma omp parallel for
for( int k=0; k<numPixels_x*numPixels_y; k++ )
{
    int i = k % numPixels_x;
    int j = k / numPixels_x;
    setPixelColour( i, j );
}

This is principally intended for short loops that cannot be equally distributed across all threads. For example, with numPixels_y = 6 and 8 threads, parallelising the outer loop alone keeps at least two threads idle, whereas collapse(2) distributes all 6*numPixels_x iterations across all 8 threads.

3.4 Determinism and non-determinism

Determinism and non-determinism

Notice that the incorrect images were slightly different each time.

The pixels plotted depend on the order in which threads update the shared variable i, which in turn depends on the thread scheduler (a small demonstration is sketched after the list below).

  • Will be influenced by factors outside our control.
  • e.g. the various background tasks that every OS must run.
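
As a small demonstration (a self-contained sketch, not part of Mandelbrot.c), the order in which the lines below appear typically changes from run to run, because it depends on how the scheduler interleaves the threads:

// Compile with: gcc -fopenmp -Wall -o threadOrder threadOrder.c
#include <stdio.h>
#include <omp.h>

int main()
{
    int i;
    #pragma omp parallel for
    for( i=0; i<8; i++ )
        printf( "Iteration %d run by thread %d.\n", i, omp_get_thread_num() );
    return 0;
}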

Our serial code was deterministic, i.e. produced the same results each time it was run.

By contrast, our (incorrect) parallel code was non-deterministic.

Often this is the result of an error, but it can sometimes be useful:

  • Some algorithms, often in science and engineering, do not care about non-deterministic errors as long as they are small.
  • Strictly imposing determinism may result in additional overheads and performance loss.

However, for this module we will try to develop parallel algorithms whose results match those of the serial equivalent.

Summary and next lecture

Today we have looked at data parallel problems, or maps, where the same operation is applied to multiple data elements.

Distribute data evenly across threads.

Sometimes referred to as embarrassingly parallel.

In two lectures' time we will start looking at more complex problems, for which the calculations on different threads are not independent.

Before then, we need to learn the vocabulary of parallel theory, which is the topic of the next lecture.
