Parallel Computing Course Summary (HDU)

For study purposes only.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1. Why does computational science need high-performance computers?
To obtain more accurate results; to run large-scale computations in a short time; to keep numerical error from growing; and to understand the results of large-scale computations.
 
2. How to measure the speed of a supercomputer?
Speed is measured with the LINPACK benchmark (HPL), which solves a dense linear system; the TOP500 list ranks supercomputers by their LINPACK performance in FLOPS.
 
3. Features of recent supercomputers.
Multicore CPUs; accelerators (e.g., GPUs); high-speed interconnect.
 
4. Why should we use BLAS?
Portability: Programs using BLAS run on any platform.
Performance: BLAS routines are well tuned.
 
5. Matrix data handling in BLAS.
Matrices are managed as one-dimensional arrays.
For element a[i][j] of an m×n matrix:
    Column-major layout: a[i + j*m]
    Row-major layout: a[i*n + j]
Leading dimension of the matrix:
    The array holding an m×n matrix is not necessarily of size m*n;
    with a leading dimension lda (lda ≥ m in column-major layout, so a[i][j] is a[i + j*lda]), we can handle submatrices of larger matrices (see the indexing sketch below).
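A minimal sketch of column-major indexing with a leading dimension, calling a Level-3 BLAS routine through the CBLAS interface (assumes a CBLAS implementation such as OpenBLAS is available; the matrix sizes and values are made up for illustration):

    #include <stdio.h>
    #include <cblas.h>   /* CBLAS interface, e.g. provided by OpenBLAS */

    int main(void) {
        /* 4x4 column-major matrices with leading dimension lda = 4.
           Element A(i,j) lives at a[i + j*lda]. */
        int n = 4, lda = 4;
        double a[16], b[16], c[16];
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++) {
                a[i + j*lda] = i + j;          /* A(i,j) */
                b[i + j*lda] = (i == j);       /* B = identity */
                c[i + j*lda] = 0.0;
            }
        /* A 2x2 submatrix starting at A(1,2) is addressed as &a[1 + 2*lda],
           still with leading dimension lda = 4 (array size != 2*2). */
        /* C = 1.0*A*B + 0.0*C using the Level-3 routine dgemm */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, lda, b, lda, 0.0, c, lda);
        printf("C(2,3) = %g\n", c[2 + 3*lda]);  /* equals A(2,3) = 5 */
        return 0;
    }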
 
6. Speed difference between naive and optimized implementations.
An optimized implementation (such as a tuned BLAS) is far faster than a naive one, even though both perform the same arithmetic.
 
7. Arithmetic intensity of each level of BLAS routines.
f: total flop count
m: total bytes transferred between CPU and memory
Arithmetic intensity: ρ = f/m
    Level 1 (vector-vector, e.g., axpy): O(n) flops on O(n) data → ρ = O(1)
    Level 2 (matrix-vector, e.g., gemv): O(n²) flops on O(n²) data → ρ = O(1)
    Level 3 (matrix-matrix, e.g., gemm): O(n³) flops on O(n²) data → ρ = O(n)
Only Level-3 routines have high enough arithmetic intensity to approach peak performance.
 
8. Memory hierarchy of recent computers and processor-DRAM gap.
Memory hierarchy of recent computers: registers → L1/L2/L3 cache → main memory (DRAM) → disk; each level is larger but slower than the one above it.
Processor-DRAM gap: processor speed has improved much faster than DRAM latency, so the gap keeps widening and memory hierarchies are getting deeper.
 
9. Loop transformation techniques and their effects.
The matrix-matrix product is written as three nested loops (over i, j, and k).
There is no loop-carried data dependency between iterations, so the loops can be interchanged; changing the loop order does not change the result.
There are six possible loop orders (ijk, ikj, jik, jki, kij, kji); they differ only in memory access pattern, and therefore in cache behaviour and speed (see the sketch below).
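A sketch of two of the six loop orders for C = C + A·B with row-major storage (chosen only for illustration): both give the same result, but the ikj order streams through B and C with stride 1 in the inner loop and is usually much more cache-friendly:

    /* C = C + A*B, all n x n, row-major: X[i][j] is x[i*n + j] */
    void matmul_ijk(int n, const double *a, const double *b, double *c) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }

    /* Same computation with the loops interchanged to ikj: the innermost loop
       now walks b and c with stride 1, improving spatial locality. */
    void matmul_ikj(int n, const double *a, const double *b, double *c) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                double aik = a[i*n + k];
                for (int j = 0; j < n; j++)
                    c[i*n + j] += aik * b[k*n + j];
            }
    }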
 
10. Process and thread.
When an application program is executed, one process is generated.
    A process can be said to be the program itself.
A thread is a sequence of instructions that can be executed in parallel.
     A process can execute multiple threads.
     A thread is a unit of execution within a process.
 
11. Shared and private variables in OpenMP programs.
Global variables (and variables declared before a parallel region) are shared by all threads.
The private clause declares that each thread gets its own separate copy of a variable in its memory (see the sketch below).
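A minimal OpenMP sketch of shared vs. private variables (variable names are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int n = 8;     /* declared outside the parallel region: shared by default */
        int tid;       /* listed in private(): each thread gets its own copy      */
        #pragma omp parallel shared(n) private(tid)
        {
            tid = omp_get_thread_num();          /* writes only the private copy */
            printf("thread %d of %d sees shared n = %d\n",
                   tid, omp_get_num_threads(), n);
        }
        return 0;
    }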
 
12. Monte Carlo method.
A Monte Carlo method estimates a quantity by repeated random sampling. Because the samples are independent, it parallelizes almost trivially; a classic example is estimating π by counting random points that fall inside the unit quarter circle (sketched below).
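A minimal OpenMP Monte Carlo sketch estimating π (the sample count and seeding scheme are arbitrary choices, not taken from the course):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void) {
        long n = 10000000, hits = 0;
        #pragma omp parallel reduction(+:hits)
        {
            unsigned int seed = 1234u + omp_get_thread_num();   /* per-thread seed */
            #pragma omp for
            for (long i = 0; i < n; i++) {
                double x = (double)rand_r(&seed) / RAND_MAX;    /* point in unit square */
                double y = (double)rand_r(&seed) / RAND_MAX;
                if (x*x + y*y <= 1.0) hits++;                   /* inside quarter circle */
            }
        }
        printf("pi is approximately %f\n", 4.0 * hits / n);
        return 0;
    }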
 
13. Amdahl's law and Gustafson's law.
The potential speedup of an algorithm on a parallel computing platform:
Amdahl's law (fixed problem size): S(p) = 1 / ((1 - f) + f/p), where f is the parallelizable fraction of the work and p is the number of processors; the serial fraction (1 - f) bounds the speedup by 1/(1 - f).
Gustafson's law (problem size scaled with p): S(p) = (1 - f) + f*p, so the scaled speedup keeps growing when the parallel part of the work grows with the machine.
 
14. Index of parallel performance.
Goal of parallel processing → reduce the execution time from T to T/p on p processors.
Speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p are the usual indices (ideally S(p) = p and E(p) = 1).
 
15. Strong scaling and Weak scaling.
Strong scaling: How the execution time varies with p for a fixed problem size.
Weak scaling: How the execution time varies with p for a fixed problem size per processor.
16. What is MPI?
MPI (Message Passing Interface)
     MPI forum standardizes its specification
Message passing
     Multiple processes exchange information as messages
Inter-process communication library
Portability
     MPI is supported on (almost) all supercomputers -> portability
     Many parallel applications are programmed with MPI (a minimal sketch follows)
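A minimal MPI program showing the basic structure (compile with mpicc, run with mpirun; this is a generic sketch, not course-provided code):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);                  /* start the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id (rank)  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();                          /* shut down MPI             */
        return 0;
    }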
 
17. Point-to-point communication and collective communication.
Point-to-point communication: a matched pair of send and receive between two ranks.
Collective communication: involves every rank in the communicator; it is convenient, but expensive because it effectively requires synchronization of all ranks.
    MPI_Bcast: broadcast a message from one rank to all ranks;
    MPI_Gather: gather messages from all ranks onto one rank;
    MPI_Scatter: the opposite operation to MPI_Gather (see the fragment below)
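A fragment contrasting a point-to-point pair with a collective broadcast (assumes the rank variable from the sketch above and at least two processes):

    int value = 0;
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* matching receive */
    }
    /* Collective: rank 0 sends the same value to every rank in the communicator */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);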
 
18. Blocking communication and deadlock.
Blocking:
     Sender: waits until the send buffer can safely be reused
     Receiver: waits until the message has arrived in the receive buffer
Non-blocking (MPI_Isend / MPI_Irecv):
    Sender: does not wait for the transmission to complete
    Receiver: does not wait for the transmission to complete
Deadlock: both sides block waiting for the other's transmission to complete, so neither makes progress (see the sketch below)
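A sketch of the classic pairwise-exchange deadlock and one standard fix (sendbuf, recvbuf, N, and partner are placeholder names):

    /* Deadlock-prone: for large messages both ranks block inside MPI_Send,
       each waiting for the other to post its receive. */
    MPI_Send(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Safe alternative: let MPI pair the send and the receive internally. */
    MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, partner, 0,
                 recvbuf, N, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);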
 
19. Data distributions.
Typical ways to distribute a matrix over processes: block, cyclic, and block-cyclic distributions, in one or two dimensions (the 2D block-cyclic distribution is the one used by PUMMA below and by ScaLAPACK).
 
20. Cannon's algorithm:
Matrix multiplication on a √p × √p process grid: the blocks of A and B are first skewed (row i of A shifted left by i, column j of B shifted up by j); then each of the √p steps multiplies the local blocks and cyclically shifts A one position left and B one position up. Only nearest-neighbour point-to-point shifts are used.
 
21. Fox's algorithm:
Fox's algorithm uses collective communication (multicast): at each step one block of A is broadcast along its process row, multiplied with the local block of B, and B is shifted upward.
Many modern computers have network hardware capable of high-speed one-to-all communication
 
22. SUMMA: Scalable Universal Matrix Multiplication Algorithm
SUMMA is an algorithm that uses only multicast for communication <- pros
SUMMA needs synchronization of each processor at each step <- cons
 
23. PUMMA: Parallel Universal Matrix Multiplication Algorithm
Extending Fox's algorithm to 2D block cyclic data distribution
 
24. What are parallel patterns and why do we need to understand them.
Parallel patterns are recurring algorithmic structures with known efficient parallel implementations (map, reduce, scan, stencil, ...). Recognizing and using them lets us organize code so that it is scalable and maintainable, and avoids common parallel bugs by design (see items 26-29).
 
25. There are limits to “automatic” improvement of scalar performance:
The Power Wall: clock frequency cannot be increased further without exceeding the limits of air cooling.
The Memory Wall: access to data is a limiting factor.
The ILP Wall: all the available instruction-level parallelism (ILP) is already being exploited.
→ Conclusion: explicit parallel mechanisms and explicit parallel programming are required for performance scaling.
 
26. Parallel SW Engineering Considerations
• Problem: Amdahl’s Law notes that scaling will be limited by the serial fraction of your program.
• Solution: scale the parallel part of your program faster than the serial part using data parallelism.
• Problem: Locking, access to data (memory and communication), and overhead will strangle scaling.
• Solution: use programming approaches with good data locality and low overhead, and avoid locks.
• Problem: Parallelism introduces new debugging challenges: deadlocks and race conditions.
• Solution: use structured programming strategies to avoid these by design, improving maintainability.
 
27. Structured Programming with Patterns
• Patterns are “best practices” for solving specific problems.
• Patterns can be used to organize your code, leading to algorithms that are more scalable and maintainable.
• A pattern supports a particular “algorithmic structure” with an efficient implementation.
• Good parallel programming models support a set of useful parallel patterns with low-overhead implementations.
 
28. Structured Serial Patterns
• Sequence • Selection • Iteration • Nesting
• Functions • Recursion • Random read • Random write
• Stack allocation • Heap allocation • Objects • Closures
 
29. Structured Parallel Patterns
• Superscalar sequence • Speculative selection • Map • Recurrence
• Scan • Reduce • Pack/expand • Fork/join • Pipeline • Partition
• Segmentation • Stencil • Search/match • Gather • Merge scatter
• Priority scatter • Permutation scatter • Atomic scatter
 
30. Semantics and Implementation
Semantics: What
– The intended meaning as seen from the “outside”
– For example, for scan: compute all partial reductions given an associative operator
Implementation: How
– How it executes in practice, as seen from the “inside”
– For example, for scan: partition, serial reduction in each partition, scan of reductions, serial scan in each partition.
– Many implementations may be possible
– Parallelization may require reordering of operations
– Patterns should not over-constrain the ordering; only the important ordering constraints are specified in the semantics
– Patterns may also specify additional constraints, e.g., associativity of operators
 
31. 3 Ways to Accelerate Applications
Libraries: Easy, High-Quality Acceleration
OpenACC Directives: Easy, Open, Powerful
Programming Languages: e.g., OpenACC, CUDA C, CUDA C++ with Thrust, PyCUDA, MATLAB
 
32. Terminology:
▪ Host: the CPU and its memory (host memory)
▪ Device: the GPU and its memory (device memory)
 
33. Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
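A minimal CUDA vector-add sketch following this three-step flow, with one thread per block as in the next items (sizes and values are arbitrary; error checking omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void add(const int *a, const int *b, int *c) {
        int i = blockIdx.x;                  /* each block handles one element */
        c[i] = a[i] + b[i];
    }

    int main(void) {
        const int N = 512, size = N * sizeof(int);
        int *a = (int*)malloc(size), *b = (int*)malloc(size), *c = (int*)malloc(size);
        int *d_a, *d_b, *d_c;
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
        cudaMalloc((void**)&d_a, size);
        cudaMalloc((void**)&d_b, size);
        cudaMalloc((void**)&d_c, size);
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);   /* 1. host -> device  */
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
        add<<<N, 1>>>(d_a, d_b, d_c);                       /* 2. run GPU program */
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);   /* 3. device -> host  */
        printf("c[10] = %d\n", c[10]);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(a); free(b); free(c);
        return 0;
    }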
 
34. Map
• Map replicates a function over every element of an index set
• The index set may be abstract or associated with the elements of an array.
• Map replaces one specific usage of iteration in serial programs: independent operations.
Examples: gamma correction and thresholding in images; color space conversions; Monte Carlo sampling; ray tracing
 
35. Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
By using blockIdx.x to index into the array, each block handles a different index
 
36.
• Difference between host and device
– Host: CPU
– Device: GPU
• Using __global__ to declare a function as device code
– Executes on the device
– Called from the host
• Passing parameters from host code to a device function

37. Terminology: a block can be split into parallel threads
 
38. Use the built-in variable blockDim.x for threads per block
int index = threadIdx.x + blockIdx.x * blockDim.x;
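A sketch of the combined block/thread indexing with a bounds check (the threads-per-block value is an arbitrary choice; d_a, d_b, d_c are assumed device arrays of length n):

    /* Launched with multiple blocks of multiple threads; the bounds check
       handles an n that is not a multiple of blockDim.x. */
    __global__ void add(const int *a, const int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

    /* Host-side launch:
         int threads = 256;
         int blocks  = (n + threads - 1) / threads;   // round up
         add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    */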
 
39. Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?
• Unlike parallel blocks, threads have mechanisms to:
– Communicate
– Synchronize
 
40. The GigaThread engine controls the assignment of blocks to SMs and of threads to cores.
 
41. Geometric Decomposition/Partition
• Geometric decomposition breaks an input collection into sub-collections
• Partition is a special case where sub-collections do not overlap
• Does not move data, it just provides an alternative “view” of its organization
 
42. Stencil
• Stencil applies a function to neighbourhoods of a collection.
• Neighbourhoods are given by set of relative offsets.
• Boundary conditions need to be considered, but majority of computation is in interior.
 
43. nD Stencil
• nD Stencil applies a function to neighbourhoods of an nD array
• Neighbourhoods are given by set of relative offsets
• Boundary conditions need to be considered
Examples: image filtering including convolution, median, anisotropic diffusion; simulation including fluid flow, electromagnetic, and financial PDE solvers, lattice QCD
 
44. 1D Stencil
Each output element is the sum of input elements within a radius
• Each thread processes one output element
– blockDim.x elements per block
• Input elements are read several times
– With radius 3, each input element is read seven times
 
45. Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory
• Extremely fast on-chip memory, user-managed
• Declare using __shared__, allocated per block
• Data is not visible to threads in other blocks
 
46. • Use __shared__ to declare a variable/array in shared memory
– Data is shared between threads in a block
– Not visible to threads in other blocks
• Use __syncthreads() as a barrier
– Use to prevent data hazards
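A sketch of the 1D stencil kernel with shared memory and __syncthreads(), in the style of the standard CUDA teaching example; it assumes blockDim.x == BLOCK_SIZE and that the input array is padded so that the halo reads at the ends are valid:

    #define RADIUS     3
    #define BLOCK_SIZE 256

    __global__ void stencil_1d(const int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        /* Load this block's elements plus the left/right halos into shared memory */
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS]     = in[gindex - RADIUS];
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }

        __syncthreads();   /* all loads must finish before any thread reads temp */

        /* Each output element is the sum of 2*RADIUS + 1 input elements */
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];
        out[gindex] = result;
    }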
 
47. Reduction
• Reduction combines every element in a collection into one element using an associative operator.
• Reordering of the operations is often needed to allow for parallelism.
• A tree reordering requires associativity.
Examples: averaging of Monte Carlo samples; convergence testing; image comparison metrics; matrix operations.
 
48. A Naïve Thread to Data Mapping
• Each thread is responsible for an even-index location of the partial sum vector (location of responsibility)
• After each step, half of the threads are no longer needed
• One of the inputs is always from the location of responsibility
• In each step, one of the inputs comes from an increasing distance away
 
49. Barrier Synchronization
– __syncthreads() is needed to ensure that all elements of each version of partial sums have been generated before we proceed to the next step
 
50. A Better Reduction Kernel
• Learn to write a better reduction kernel
– Resource efficiency analysis
– Improved thread to data mapping
– Reduced control divergence
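A sketch of an improved per-block sum reduction (assumes the kernel is launched with 256 threads per block, a power of two; per-block results are combined afterwards on the host or with a second kernel launch):

    __global__ void block_sum(const float *in, float *out, int n) {
        __shared__ float partial[256];                /* one element per thread */
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;
        partial[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        /* The stride halves each step; keeping the active threads contiguous
           ("sequential addressing") reduces control divergence compared with
           the naive even-index mapping. */
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();                          /* partial sums ready */
        }
        if (tid == 0)
            out[blockIdx.x] = partial[0];             /* one result per block */
    }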
 
51. Scan
• Scan computes all partial reductions of a collection
• Operator must be (at least) associative.
• Diagram shows one possible parallel implementation using three-phase strategy
 
52. Improving Efficiency
• Balanced Trees
– Form a balanced binary tree on the input data and sweep it to and from the root
– The tree is not an actual data structure, but a concept to determine what each thread does at each step
• For scan:
– Traverse down from the leaves to the root, building partial sums at internal nodes in the tree; the root holds the sum of all leaves
– Traverse back up the tree, building the output from the partial sums
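For contrast, a minimal single-block inclusive scan sketch in the simpler Hillis-Steele style (not the work-efficient balanced-tree sweep described above); it assumes blockDim.x == 256 and n ≤ 256:

    __global__ void scan_block(const float *in, float *out, int n) {
        __shared__ float buf[2][256];                /* double buffer avoids races */
        int tid = threadIdx.x;
        int pout = 0, pin = 1;
        buf[pout][tid] = (tid < n) ? in[tid] : 0.0f;
        __syncthreads();

        for (int offset = 1; offset < blockDim.x; offset *= 2) {
            pout = 1 - pout; pin = 1 - pin;          /* swap read/write buffers */
            if (tid >= offset)
                buf[pout][tid] = buf[pin][tid] + buf[pin][tid - offset];
            else
                buf[pout][tid] = buf[pin][tid];
            __syncthreads();
        }
        if (tid < n) out[tid] = buf[pout][tid];      /* inclusive partial sums */
    }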
 
53. Histogram
• A method for extracting notable features and patterns from large data sets
– Feature extraction for object recognition in images
– Fraud detection in credit card transactions
– Correlating heavenly object movements in astrophysics
• Basic histograms - for each element in the data set, use the value to identify a “bin counter” to increment
 
54. Key Concepts of Atomic Operations
• A read-modify-write operation performed by a single hardware instruction on a memory location address – Read the old value, calculate a new value, and write the new value to the location
• The hardware ensures that no other threads can perform another read-modify-write operation on the same location until the current atomic operation is complete
– Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
– All threads perform their atomic operations serially on the same location
 
55. Cost and Benefit of Privatization
• Cost
– Overhead for creating and initializing private copies
– Overhead for accumulating the contents of private copies into the final copy
• Benefit
– Much less contention and serialization in accessing both the private copies and the final copy
– The overall performance can often be improved more than 10x
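A sketch combining the histogram, atomic-operation, and privatization ideas above (256 bins over byte-valued data; grid and block sizes are left to the caller):

    #define NUM_BINS 256

    __global__ void histogram(const unsigned char *data, long n,
                              unsigned int *global_bins) {
        __shared__ unsigned int bins[NUM_BINS];      /* private copy for this block */
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            bins[b] = 0;                             /* initialize the private copy */
        __syncthreads();

        long i      = blockIdx.x * (long)blockDim.x + threadIdx.x;
        long stride = (long)blockDim.x * gridDim.x;
        for (; i < n; i += stride)
            atomicAdd(&bins[data[i]], 1u);           /* contention only within the block */
        __syncthreads();

        /* Accumulate the private copy into the final (global) histogram */
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            atomicAdd(&global_bins[b], bins[b]);
    }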