Parallel Computing Course Summary (HDU)

For study purposes only.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1. Why does computational science need high-performance computers?
To obtain more accurate results; to run large-scale computations in a short time; to keep numerical error from growing; and to understand the results of large-scale computations.
 
2. How to measure the speed of a supercomputer?
Speed is measured with the LINPACK benchmark (HPL), which solves a dense linear system; the TOP500 list ranks supercomputers by their LINPACK performance in FLOPS.
 
3. Features of recent supercomputers.
Multicore CPUs; accelerators (e.g., GPUs); high-speed interconnect.
 
4. Why should we use BLAS?
Portability: Programs using BLAS run on any platform.
Performance: BLAS routines are well tuned.
 
5. Matrix data handling in BLAS.
Matrices are managed as one-dimensional arrays.
For element a[i][j] of an m×n matrix:
    Column-major layout: a[i + j*m]
    Row-major layout: a[i*n + j]
Leading dimension of the matrix:
    The array holding an m×n matrix is not necessarily of size m*n;
    with a leading dimension lda (lda ≥ m in column-major layout, so a[i][j] is a[i + j*lda]), we can handle submatrices of larger matrices (see the indexing sketch below).
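A minimal sketch of column-major indexing with a leading dimension, calling a Level-3 BLAS routine through the CBLAS interface (assumes a CBLAS implementation such as OpenBLAS is available; the matrix sizes and values are made up for illustration):

    #include <stdio.h>
    #include <cblas.h>   /* CBLAS interface, e.g. provided by OpenBLAS */

    int main(void) {
        /* 4x4 column-major matrices with leading dimension lda = 4.
           Element A(i,j) lives at a[i + j*lda]. */
        int n = 4, lda = 4;
        double a[16], b[16], c[16];
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++) {
                a[i + j*lda] = i + j;          /* A(i,j) */
                b[i + j*lda] = (i == j);       /* B = identity */
                c[i + j*lda] = 0.0;
            }
        /* A 2x2 submatrix starting at A(1,2) is addressed as &a[1 + 2*lda],
           still with leading dimension lda = 4 (array size != 2*2). */
        /* C = 1.0*A*B + 0.0*C using the Level-3 routine dgemm */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, lda, b, lda, 0.0, c, lda);
        printf("C(2,3) = %g\n", c[2 + 3*lda]);  /* equals A(2,3) = 5 */
        return 0;
    }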
 
6. Speed difference between naive and optimized implementations.
An optimized implementation (such as a tuned BLAS) is far faster than a naive one, even though both perform the same arithmetic.
 
7. Arithmetic intensity of each level of BLAS routines.
f: total flop count
m: total bytes transferred between CPU and memory
Arithmetic intensity: ρ = f/m
    Level 1 (vector-vector, e.g., axpy): O(n) flops on O(n) data → ρ = O(1)
    Level 2 (matrix-vector, e.g., gemv): O(n²) flops on O(n²) data → ρ = O(1)
    Level 3 (matrix-matrix, e.g., gemm): O(n³) flops on O(n²) data → ρ = O(n)
Only Level-3 routines have high enough arithmetic intensity to approach peak performance.
 
8. Memory hierarchy of recent computers and processor-DRAM gap.
Memory hierarchy of recent computers: registers → L1/L2/L3 cache → main memory (DRAM) → disk; each level is larger but slower than the one above it.
Processor-DRAM gap: processor speed has improved much faster than DRAM latency, so the gap keeps widening and memory hierarchies are getting deeper.
 
9. Loop transformation techniques and their effects.
The matrix-matrix product is written as three nested loops (over i, j, and k).
There is no loop-carried data dependency between iterations, so the loops can be interchanged; changing the loop order does not change the result.
There are six possible loop orders (ijk, ikj, jik, jki, kij, kji); they differ only in memory access pattern, and therefore in cache behaviour and speed (see the sketch below).
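A sketch of two of the six loop orders for C = C + A·B with row-major storage (chosen only for illustration): both give the same result, but the ikj order streams through B and C with stride 1 in the inner loop and is usually much more cache-friendly:

    /* C = C + A*B, all n x n, row-major: X[i][j] is x[i*n + j] */
    void matmul_ijk(int n, const double *a, const double *b, double *c) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }

    /* Same computation with the loops interchanged to ikj: the innermost loop
       now walks b and c with stride 1, improving spatial locality. */
    void matmul_ikj(int n, const double *a, const double *b, double *c) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                double aik = a[i*n + k];
                for (int j = 0; j < n; j++)
                    c[i*n + j] += aik * b[k*n + j];
            }
    }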
 
10. Process and thread.
When an application program is executed, one process is generated.
    A process can be said to be the program itself.
A thread is a sequence of instructions that can be executed in parallel.
     A process can execute multiple threads.
     A thread is a unit of execution within a process.
 
11. Shared and private variables in OpenMP programs.
Global variables (and variables declared before a parallel region) are shared by all threads.
The private clause declares that each thread gets its own separate copy of a variable in its memory (see the sketch below).
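A minimal OpenMP sketch of shared vs. private variables (variable names are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int n = 8;     /* declared outside the parallel region: shared by default */
        int tid;       /* listed in private(): each thread gets its own copy      */
        #pragma omp parallel shared(n) private(tid)
        {
            tid = omp_get_thread_num();          /* writes only the private copy */
            printf("thread %d of %d sees shared n = %d\n",
                   tid, omp_get_num_threads(), n);
        }
        return 0;
    }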
 
12. Monte Carlo method.
A Monte Carlo method estimates a quantity by repeated random sampling. Because the samples are independent, it parallelizes almost trivially; a classic example is estimating π by counting random points that fall inside the unit quarter circle (sketched below).
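A minimal OpenMP Monte Carlo sketch estimating π (the sample count and seeding scheme are arbitrary choices, not taken from the course):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void) {
        long n = 10000000, hits = 0;
        #pragma omp parallel reduction(+:hits)
        {
            unsigned int seed = 1234u + omp_get_thread_num();   /* per-thread seed */
            #pragma omp for
            for (long i = 0; i < n; i++) {
                double x = (double)rand_r(&seed) / RAND_MAX;    /* point in unit square */
                double y = (double)rand_r(&seed) / RAND_MAX;
                if (x*x + y*y <= 1.0) hits++;                   /* inside quarter circle */
            }
        }
        printf("pi is approximately %f\n", 4.0 * hits / n);
        return 0;
    }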
 
13. Amdahl's law and Gustafson's law.
The potential speedup of an algorithm on a parallel computing platform:
Amdahl's law (fixed problem size): S(p) = 1 / ((1 - f) + f/p), where f is the parallelizable fraction of the work and p is the number of processors; the serial fraction (1 - f) bounds the speedup by 1/(1 - f).
Gustafson's law (problem size scaled with p): S(p) = (1 - f) + f*p, so the scaled speedup keeps growing when the parallel part of the work grows with the machine.
 
14. Index of parallel performance.
Goal of parallel processing → reduce the execution time from T to T/p on p processors.
Speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p are the usual indices (ideally S(p) = p and E(p) = 1).
 
15. Strong scaling and Weak scaling.
Strong scaling: How the execution time varies with p for a fixed problem size.
Weak scaling: How the execution time varies with p for a fixed problem size per processor.
16. What is MPI?
MPI (Message Passing Interface)
     MPI forum standardizes its specification
Message passing
     Multiple processes exchange information as messages
Inter-process communication library
Portability
     MPI is supported on (almost) all supercomputers -> portability
     Many parallel applications are programmed with MPI (a minimal sketch follows)
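A minimal MPI program showing the basic structure (compile with mpicc, run with mpirun; this is a generic sketch, not course-provided code):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);                  /* start the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id (rank)  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();                          /* shut down MPI             */
        return 0;
    }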
 
17. Point-to-point communication and collective communication.
Point-to-point communication: a matched pair of send and receive between two ranks.
Collective communication: involves every rank in the communicator; it is convenient, but expensive because it effectively requires synchronization of all ranks.
    MPI_Bcast: broadcast a message from one rank to all ranks;
    MPI_Gather: gather messages from all ranks onto one rank;
    MPI_Scatter: the opposite operation to MPI_Gather (see the fragment below)
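A fragment contrasting a point-to-point pair with a collective broadcast (assumes the rank variable from the sketch above and at least two processes):

    int value = 0;
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* matching receive */
    }
    /* Collective: rank 0 sends the same value to every rank in the communicator */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);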
 
18. Blocking communication and deadlock.
Blocking:
     Sender: waits until the send buffer can safely be reused
     Receiver: waits until the message has arrived in the receive buffer
Non-blocking (MPI_Isend / MPI_Irecv):
    Sender: does not wait for the transmission to complete
    Receiver: does not wait for the transmission to complete
Deadlock: both sides block waiting for the other's transmission to complete, so neither makes progress (see the sketch below)
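A sketch of the classic pairwise-exchange deadlock and one standard fix (sendbuf, recvbuf, N, and partner are placeholder names):

    /* Deadlock-prone: for large messages both ranks block inside MPI_Send,
       each waiting for the other to post its receive. */
    MPI_Send(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Safe alternative: let MPI pair the send and the receive internally. */
    MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, partner, 0,
                 recvbuf, N, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);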
 
19. Data distributions.
Typical ways to distribute a matrix over processes: block, cyclic, and block-cyclic distributions, in one or two dimensions (the 2D block-cyclic distribution is the one used by PUMMA below and by ScaLAPACK).
 
20. Cannon's algorithm:
Matrix multiplication on a √p × √p process grid: the blocks of A and B are first skewed (row i of A shifted left by i, column j of B shifted up by j); then each of the √p steps multiplies the local blocks and cyclically shifts A one position left and B one position up. Only nearest-neighbour point-to-point shifts are used.
 
21. Fox's algorithm:
Fox's algorithm uses collective communication (multicast): at each step one block of A is broadcast along its process row, multiplied with the local block of B, and B is shifted upward.
Many modern computers have network hardware capable of high-speed one-to-all communication
 
22. SUMMA: Scalable Universal Matrix Multiplication Algorithm
SUMMA is an algorithm that uses only multicast for communication <- pros
SUMMA needs synchronization of each processor at each step <- cons
 
23. PUMMA: Parallel Universal Matrix Multiplication Algorithm
Extending Fox's algorithm to 2D block cyclic data distribution
 
24. What are parallel patterns and why do we need to understand them.
Parallel patterns are recurring algorithmic structures with known efficient parallel implementations (map, reduce, scan, stencil, ...). Recognizing and using them lets us organize code so that it is scalable and maintainable, and avoids common parallel bugs by design (see items 26-29).
 
25. There are limits to “automatic” improvement of scalar performance:
The Power Wall: clock frequency cannot be increased further without exceeding the limits of air cooling.
The Memory Wall: access to data is a limiting factor.
The ILP Wall: all the available instruction-level parallelism (ILP) is already being exploited.
→ Conclusion: explicit parallel mechanisms and explicit parallel programming are required for performance scaling.
 
26. Parallel SW Engineering Considerations
• Problem: Amdahl’s Law notes that scaling will be limited by the serial fraction of your program.
• Solution: scale the parallel part of your program faster than the serial part using data parallelism.
• Problem: Locking, access to data (memory and communication), and overhead will strangle scaling.
• Solution: use programming approaches with good data locality and low overhead, and avoid locks.
• Problem: Parallelism introduces new debugging challenges: deadlocks and race conditions.
• Solution: use structured programming strategies to avoid these by design, improving maintainability.
 
27. Structured Programming with Patterns
• Patterns are “best practices” for solving specific problems.
• Patterns can be used to organize your code, leading to algorithms that are more scalable and maintainable.
• A pattern supports a particular “algorithmic structure” with an efficient implementation.
• Good parallel programming models support a set of useful parallel patterns with low-overhead implementations.
 
28. Structured Serial Patterns
• Sequence • Selection • Iteration • Nesting
• Functions • Recursion • Random read • Random write
• Stack allocation • Heap allocation • Objects • Closures
 
29. Structured Parallel Patterns
• Superscalar sequence • Speculative selection • Map • Recurrence
• Scan • Reduce • Pack/expand • Fork/join • Pipeline • Partition
• Segmentation • Stencil • Search/match • Gather • Merge scatter
• Priority scatter • Permutation scatter • Atomic scatter
 
30. Semantics and Implementation
Semantics: What
– The intended meaning as seen from the “outside”
– For example, for scan: compute all partial reductions given an associative operator
Implementation: How
– How it executes in practice, as seen from the “inside”
– For example, for scan: partition, serial reduction in each partition, scan of reductions, serial scan in each partition.
– Many implementations may be possible
– Parallelization may require reordering of operations
– Patterns should not over-constrain the ordering; only the important ordering constraints are specified in the semantics
– Patterns may also specify additional constraints, e.g., associativity of operators
 
31. 3 Ways to Accelerate Applications
Libraries: Easy, High-Quality Acceleration
OpenACC Directives: Easy, Open, Powerful
Programming Languages: e.g., OpenACC, CUDA C, CUDA C++ with Thrust, PyCUDA, MATLAB
 
32. Terminology:
▪ Host: the CPU and its memory (host memory)
▪ Device: the GPU and its memory (device memory)
 
33. Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
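A minimal CUDA vector-add sketch following this three-step flow, with one thread per block as in the next items (sizes and values are arbitrary; error checking omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void add(const int *a, const int *b, int *c) {
        int i = blockIdx.x;                  /* each block handles one element */
        c[i] = a[i] + b[i];
    }

    int main(void) {
        const int N = 512, size = N * sizeof(int);
        int *a = (int*)malloc(size), *b = (int*)malloc(size), *c = (int*)malloc(size);
        int *d_a, *d_b, *d_c;
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
        cudaMalloc((void**)&d_a, size);
        cudaMalloc((void**)&d_b, size);
        cudaMalloc((void**)&d_c, size);
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);   /* 1. host -> device  */
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
        add<<<N, 1>>>(d_a, d_b, d_c);                       /* 2. run GPU program */
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);   /* 3. device -> host  */
        printf("c[10] = %d\n", c[10]);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(a); free(b); free(c);
        return 0;
    }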
 
34. Map
• Map replicates a function over every element of an index set
• The index set may be abstract or associated with the elements of an array.
• Map replaces one specific usage of iteration in serial programs: independent operations.
Examples: gamma correction and thresholding in images; color space conversions; Monte Carlo sampling; ray tracing
 
35. Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
By using blockIdx.x to index into the array, each block handles a different index
 
36.
• Difference between host and device
– Host: CPU
– Device: GPU
• Using __global__ to declare a function as device code
– Executes on the device
– Called from the host
• Passing parameters from host code to a device function

37. Terminology: a block can be split into parallel threads
 
38. Use the built-in variable blockDim.x for threads per block
int index = threadIdx.x + blockIdx.x * blockDim.x;
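A sketch of the combined block/thread indexing with a bounds check (the threads-per-block value is an arbitrary choice; d_a, d_b, d_c are assumed device arrays of length n):

    /* Launched with multiple blocks of multiple threads; the bounds check
       handles an n that is not a multiple of blockDim.x. */
    __global__ void add(const int *a, const int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

    /* Host-side launch:
         int threads = 256;
         int blocks  = (n + threads - 1) / threads;   // round up
         add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    */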
 
39. Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?
• Unlike parallel blocks, threads have mechanisms to:
– Communicate
– Synchronize
 
40. The GigaThread engine controls the assignment of blocks to SMs and of threads to cores.
 
41. Geometric Decomposition/Partition
• Geometric decomposition breaks an input collection into sub-collections
• Partition is a special case where sub-collections do not overlap
• Does not move data, it just provides an alternative “view” of its organization
 
42. Stencil
• Stencil applies a function to neighbourhoods of a collection.
• Neighbourhoods are given by set of relative offsets.
• Boundary conditions need to be considered, but majority of computation is in interior.
 
43. nD Stencil
• nD Stencil applies a function to neighbourhoods of an nD array
• Neighbourhoods are given by set of relative offsets
• Boundary conditions need to be considered
Examples: image filtering including convolution, median, anisotropic diffusion; simulation including fluid flow, electromagnetic, and financial PDE solvers, lattice QCD
 
44. 1D Stencil
Each output element is the sum of input elements within a radius
• Each thread processes one output element
– blockDim.x elements per block
• Input elements are read several times
– With radius 3, each input element is read seven times
 
45. Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory
• Extremely fast on-chip memory, user-managed
• Declare using __shared__, allocated per block
• Data is not visible to threads in other blocks
 
46. • Use __shared__ to declare a variable/array in shared memory
– Data is shared between threads in a block
– Not visible to threads in other blocks
• Use __syncthreads() as a barrier
– Use to prevent data hazards
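A sketch of the 1D stencil kernel with shared memory and __syncthreads(), in the style of the standard CUDA teaching example; it assumes blockDim.x == BLOCK_SIZE and that the input array is padded so that the halo reads at the ends are valid:

    #define RADIUS     3
    #define BLOCK_SIZE 256

    __global__ void stencil_1d(const int *in, int *out) {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        /* Load this block's elements plus the left/right halos into shared memory */
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS]     = in[gindex - RADIUS];
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }

        __syncthreads();   /* all loads must finish before any thread reads temp */

        /* Each output element is the sum of 2*RADIUS + 1 input elements */
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];
        out[gindex] = result;
    }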
 
47. Reduction
• Reduction combines every element in a collection into one element using an associative operator.
• Reordering of the operations is often needed to allow for parallelism.
• A tree reordering requires associativity.
Examples: averaging of Monte Carlo samples; convergence testing; image comparison metrics; matrix operations.
 
48. A Naïve Thread to Data Mapping
• Each thread is responsible for an even-index location of the partial sum vector (location of responsibility)
• After each step, half of the threads are no longer needed
• One of the inputs is always from the location of responsibility
• In each step, one of the inputs comes from an increasing distance away
 
49. Barrier Synchronization
– __syncthreads() is needed to ensure that all elements of each version of partial sums have been generated before we proceed to the next step
 
50. A Better Reduction Kernel
• Learn to write a better reduction kernel
– Resource efficiency analysis
– Improved thread to data mapping
– Reduced control divergence
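A sketch of an improved per-block sum reduction (assumes the kernel is launched with 256 threads per block, a power of two; per-block results are combined afterwards on the host or with a second kernel launch):

    __global__ void block_sum(const float *in, float *out, int n) {
        __shared__ float partial[256];                /* one element per thread */
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;
        partial[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        /* The stride halves each step; keeping the active threads contiguous
           ("sequential addressing") reduces control divergence compared with
           the naive even-index mapping. */
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();                          /* partial sums ready */
        }
        if (tid == 0)
            out[blockIdx.x] = partial[0];             /* one result per block */
    }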
 
51. Scan
• Scan computes all partial reductions of a collection
• Operator must be (at least) associative.
• Diagram shows one possible parallel implementation using three-phase strategy
 
52. Improving Efficiency
• Balanced Trees
– Form a balanced binary tree on the input data and sweep it to and from the root
– The tree is not an actual data structure, but a concept to determine what each thread does at each step
• For scan:
– Traverse down from the leaves to the root, building partial sums at internal nodes in the tree; the root holds the sum of all leaves
– Traverse back up the tree, building the output from the partial sums
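For contrast, a minimal single-block inclusive scan sketch in the simpler Hillis-Steele style (not the work-efficient balanced-tree sweep described above); it assumes blockDim.x == 256 and n ≤ 256:

    __global__ void scan_block(const float *in, float *out, int n) {
        __shared__ float buf[2][256];                /* double buffer avoids races */
        int tid = threadIdx.x;
        int pout = 0, pin = 1;
        buf[pout][tid] = (tid < n) ? in[tid] : 0.0f;
        __syncthreads();

        for (int offset = 1; offset < blockDim.x; offset *= 2) {
            pout = 1 - pout; pin = 1 - pin;          /* swap read/write buffers */
            if (tid >= offset)
                buf[pout][tid] = buf[pin][tid] + buf[pin][tid - offset];
            else
                buf[pout][tid] = buf[pin][tid];
            __syncthreads();
        }
        if (tid < n) out[tid] = buf[pout][tid];      /* inclusive partial sums */
    }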
 
53. Histogram
• A method for extracting notable features and patterns from large data sets
– Feature extraction for object recognition in images
– Fraud detection in credit card transactions
– Correlating heavenly object movements in astrophysics
• Basic histograms - for each element in the data set, use the value to identify a “bin counter” to increment
 
54. Key Concepts of Atomic Operations
• A read-modify-write operation performed by a single hardware instruction on a memory location address – Read the old value, calculate a new value, and write the new value to the location
• The hardware ensures that no other threads can perform another read-modify-write operation on the same location until the current atomic operation is complete
– Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
– All threads perform their atomic operations serially on the same location
 
55. Cost and Benefit of Privatization
• Cost
– Overhead for creating and initializing private copies
– Overhead for accumulating the contents of private copies into the final copy
• Benefit
– Much less contention and serialization in accessing both the private copies and the final copy
– The overall performance can often be improved more than 10x
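A sketch combining the histogram, atomic-operation, and privatization ideas above (256 bins over byte-valued data; grid and block sizes are left to the caller):

    #define NUM_BINS 256

    __global__ void histogram(const unsigned char *data, long n,
                              unsigned int *global_bins) {
        __shared__ unsigned int bins[NUM_BINS];      /* private copy for this block */
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            bins[b] = 0;                             /* initialize the private copy */
        __syncthreads();

        long i      = blockIdx.x * (long)blockDim.x + threadIdx.x;
        long stride = (long)blockDim.x * gridDim.x;
        for (; i < n; i += stride)
            atomicAdd(&bins[data[i]], 1u);           /* contention only within the block */
        __syncthreads();

        /* Accumulate the private copy into the final (global) histogram */
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            atomicAdd(&global_bins[b], bins[b]);
    }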