1. Overview
Previous lectures
In the last lecture we started looking at shared memory parallelism (SMP):
- Relevant to multi-core CPUs.
- Separate processing units (cores) share some levels of memory cache.
- Various frameworks for programming SMP systems.
- Widely-implemented standard: OpenMP
Today’s lecture
Today we are going to look at some actual problems.
- Examples of data parallel problems, where the same operation is applied to multiple data elements.
- Also known as a map.
- Multi-threading solution employs a fork-join pattern.
- How to parallelise nested loops.
- Parallel code can be non-deterministic, even when the serial code is deterministic.
2. Multi-threaded vector addition
2.1 Vector addition in serial
Vector addition
An n-vector a can be thought of as an array of n numbers: a = (a1, a2, . . . , an).
If two vectors a and b are the same size, they can be added to generate a new n-vector c:
c = a + b = (a1 + b1, a2 + b2, . . . , an + bn).
Or: ci = ai + bi for each i = 1, . . . , n.
Serial vector addition
vectorAddition_serial.c
//
// Vector addition in serial. Compile with: gcc -Wall -o vectorAddition_serial vectorAddition_serial.c
//
// Includes. Don't need <omp.h> for this example.
#include <stdio.h>
#include <stdlib.h>
// n is the size of all three vectors. For simplicity, use statically allocated arrays
// [rather than dynamically allocated using malloc()/free()]
#define n 20
int main()
{
    float a[n], b[n], c[n];
    int i;

    // Initialise a[n] and b[n].
    // For this example, assign arbitrary numbers to vectors a and b.
    for( i=0; i<n; i++ ) a[i] = b[i] = i;

    // Vector addition in serial.
    for( i=0; i<n; i++ ) c[i] = a[i] + b[i];

    return 0;
}
2.2 Vector addition in parallel
Add #pragma omp parallel for just before the loop:
vectorAddition_parallel.c
//
// Vector addition in parallel. Compile with: gcc -fopenmp -Wall -o vectorAddition_parallel vectorAddition_parallel.c
//
// Includes. Don't need <omp.h> for this example.
#include <stdio.h>
#include <stdlib.h>
// n is the size of all three vectors. For simplicity, use statically allocated arrays
// [rather than dynamically allocated using malloc()/free()]
#define n 20
int main()
{
    float a[n], b[n], c[n];
    int i;

    // Initialise a[n] and b[n].
    // For this example, assign arbitrary numbers to vectors a and b.
    for( i=0; i<n; i++ ) a[i] = b[i] = i;

    // Vector addition in parallel.
    #pragma omp parallel for
    for( i=0; i<n; i++ ) c[i] = a[i] + b[i];

    return 0;
}
This only parallelises this one loop, not any later ones!
Fork-and-join
When the executable reaches #pragma omp parallel for, it spawns multiple threads.
- Each thread computes part of the loop.
- The extra threads are destroyed at the end of the loop.
This is known as a fork-join construct:
2.3 Thread-level description
Example: Four threads in total
Pseudocode for the main thread:
// Main thread starts in serial.
// Initialise arrays a, b; allocate c.
...
// REACHES #pragma omp parallel for
// FORK: Create three new threads.
worker1 = fork(...);
worker2 = fork(...);
worker3 = fork(...);

// Perform 1/4 of the total loop.
for( i=0; i<n/4; i++ )
    c[i] = a[i] + b[i];

// JOIN: Wait for other threads to finish.
worker1.join();
worker2.join();
worker3.join();

// Continue in serial after the loop.
Worker thread 1:
// CREATED BY MAIN ('fork')
// Perform second 1/4 of loop.
for( i=n/4; i<n/2; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')
Worker thread 2:
// CREATED BY MAIN ('fork')
// Perform third 1/4 of loop.
for( i=n/2; i<3*n/4; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')
Worker thread 3:
// CREATED BY MAIN ('fork')
// Perform final 1/4 of loop.
for( i=3*n/4; i<n; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')
Notes
The four threads are not being executed one after the other:
- Each thread runs concurrently, hopefully on separate cores, i.e. in parallel.
- Cannot be understood in terms of serial programming concepts.
Each thread performs the same operations on different data.
- Would be SIMD in Flynn’s taxonomy, except this is implemented in software on a MIMD device.
We have assumed n is divisible by the number of threads for clarity.
- Generalising to arbitrary n is not difficult, but obscures the parallel aspects.
#pragma omp parallel for
The total loop range was evenly divided between all threads.
- Happens as soon as #pragma omp parallel for is reached.
- The trip count (i.e. loop range) must be known at the start of the loop.
- The start, end and stride must be constant.
- Cannot break from the loop.
- Cannot be applied to 'while' or 'do . . . while' loops.
2.4 Data parallel and embarrassingly parallel
This is an example of a data parallel problem or a map:
- Array elements distributed evenly over the threads.
- Same operation performed on all elements.
- Suitable for the SIMD model
In fact, this example is so straightforward to parallelise that it is also sometimes referred to as an embarrassingly parallel problem.
- Easy to get working correctly in parallel.
- May still be a challenge to achieve good parallel performance.
(Embarrassingly parallel problems)
When writing a parallel program, we first decompose the problem into parts and assign each part to a different processor or thread for computation. If the problem can be decomposed into completely independent sub-computations (tasks), with almost no data dependencies and no communication between tasks, it is called an embarrassingly parallel problem. The input data is split directly into sub-blocks with no dependencies between them, each sub-block is processed in exactly the same way, and the sub-results are merged for output at the end. Such problems are naturally suited to parallel computation.
3. Nested loops in parallel
3.1 Worked example: Mandelbrot set
Mandelbrot set generator
Mandelbrot.c, makefile
A classic computationally intensive problem in two dimensions, once used as a benchmark for processor speeds.
- Loops over pixels, i.e. a two dimensional, nested double loop.
- Colour of each pixel calculated independently of all other pixels.
- Each colour calculation requires many floating point operations
Code snippet
The part of the code that interests us here is shown below:
// Change the colour arrays for the whole image.
int i, j;
for( j=0; j<numPixels_y; j++ )
    for( i=0; i<numPixels_x; i++ )
    {
        // Set the colour of pixel (i,j), i.e. modify the values
        // of red[i][j], green[i][j], and/or blue[i][j].
        setPixelColour( i, j );
    }
Note the i-loop is nested inside the j-loop.
The graphical output is performed in OpenGL/GLFW. Since the include and link steps differ between Linux and Macs, a simple makefile has been provided.
Purely for background interest, here’s how the colours are calculated:
1. Each pixel (i,j) is converted to floating point numbers cx, cy, both in the range -2 to 2.
2. Two other floats zx and zy are initialised to zero.
3. The iteration zx ← zx^2 - zy^2 + cx, zy ← 2 zx zy + cy (using the old values of zx and zy on the right-hand sides) is performed until zx^2 + zy^2 ≥ 4, or a maximum number of iterations maxIters is reached.
4. The colour is selected based on the number of iterations.
3.2 Parallelising nested loops
Parallel Mandelbrot: First attempt
Parallelise only the inner loop.
int i, j;
for( j=0; j<numPixels_y; j++ )
{
    #pragma omp parallel for
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
}
This works, but may be slower than serial (check on your system).
There are multiple possible reasons for this:
- The fork-join is inside the j-loop, so threads are created and destroyed numPixels_y times, which incurs an overhead.
- This problem suffers from poor load balancing; see later.
Parallel Mandelbrot: Second attempt
Parallelise only the outer loop, so there is only a single fork event and a single join event.
int i, j;
#pragma omp parallel for
for( j=0; j<numPixels_y; j++ )
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
This is faster . . . but wrong!
- A distorted image results.
- The distortion is different each time the program is executed.
The same variable i for the inner loop counter is being updated by all threads:
- When one thread completes a calculation, it increments i.
- Therefore other threads will skip at least one pixel.
- Threads do not calculate the full line of pixels.
Parallel Mandelbrot: Third attempt
Make the inner loop variable i private to each thread:
int j;
#pragma omp parallel for
for( j=0; j<numPixels_y; j++ )
{
    int i;
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
}
. . . or (for compilers following the C99 standard):
#pragma omp parallel for
for( int j=0; j<numPixels_y; j++ )
    for( int i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
The private clause
A third way to solve this is to use OpenMP’s private clause:
int i, j;
#pragma omp parallel for private(i)
for( j=0; j<numPixels_y; j++ )
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
- Creates a copy of i for each thread.
- Multiple variables may be listed, e.g. private(i,a,b,c)
The code now works . . . but is no faster than serial!
- The primary overhead is poor load balancing. We will look at this briefly next lecture, and in detail in Lecture 13.
3.3 The collapse clause
The collapse clause
The collapse clause replaces 2 or more nested loops with a single loop, at the expense of additional internal calculations.
#pragma omp parallel for collapse(2)
for( int j=0; j<numPixels_y; j++ )
    for( int i=0; i<numPixels_x; i++ )
        setPixelColour( i, j );
is equivalent to (but more readable than)
#pragma omp parallel for
for( int k=0; k<numPixels_x*numPixels_y; k++ )
{
    int
        i = k % numPixels_x,
        j = k / numPixels_x;
    setPixelColour( i, j );
}
This is principally intended for short loops that cannot be equally distributed across all threads.
3.4 Determinism and non-determinism
Determinism and non-determinism
Notice that the incorrect images were slightly different each time:
The pixels plotted depend on the order in which threads update the shared variable i, which depends on the thread scheduler.
- Will be influenced by factors outside our control.
- e.g. the various background tasks that every OS must run.
Our serial code was deterministic, i.e. produced the same results each time it was run.
By contrast, our (incorrect) parallel code was non-deterministic.
Often this is the result of an error, but can sometimes be useful:
- Some algorithms, often in science and engineering, do not care about non-deterministic errors as long as they are small.
- Strictly imposing determinism may result in additional overheads and performance loss.
However, for this module we will try to develop parallel algorithms whose results match that of the serial equivalent.
Summary and next lecture
Today we have looked at data parallel problems, or maps, where the same operation is applied to multiple data elements.
Distribute data evenly across threads.
Sometimes referred to as embarrassingly parallel.
In two lectures' time we will start looking at more complex problems for which the calculations on different threads are not independent.
Before then, we need to learn the vocabulary of parallel theory, which is the topic of the next lecture.