Two main architecture types
- Master/Slave
- Peer to Peer
1. Architectures
1.1 Master/Slave
- All communication passes through a single master node, which controls a set of slave nodes that carry out tasks on its behalf
- Advantages/Disadvantages:
(1) Advantages:
- Usually relatively simple to implement
- A single process has access to all of the data
(2) Disadvantage - scalability problems, since communication with the master can become a bottleneck
- When to consider a Master/Slave architecture:
(1) Trivially parallel problems
- Problems whose sub-tasks can be executed completely independently
- The master distributes the problem to all of the slave nodes and waits for the answers to come back
(2) Problems that require collating data from all nodes - information is not just exchanged between nodes but must also be merged
- Might form part of an otherwise peer-to-peer architecture – see hybrid architectures
- Types of communications to be used –
(1) A Master/Slave architecture will usually rely heavily on collective communications. This spreads some of the communication load away from the master node
(2) Scatter/Gather or Scatter/Reduce are the usual communication methods
(3) Can use non-blocking point-to-point communication if you wish to respond to individual slave nodes as they complete, sending new data to a node without waiting for all nodes to finish
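The Master/Slave scatter/work/gather cycle for a trivially parallel problem can be sketched in plain Python. This is an illustration of the pattern only, not MPI: a thread pool stands in for the slave nodes, and `work` is a hypothetical independent sub-task.

```python
from concurrent.futures import ThreadPoolExecutor

def work(task):
    # Hypothetical slave-side computation: any fully independent
    # sub-task would do here.
    return task * task

def master(tasks, n_workers=4):
    # The master "scatters" tasks to the workers, waits for all of them,
    # and "gathers" the answers back in order - mirroring Scatter/Gather.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, tasks))

print(master(range(8)))  # each task is computed independently of the others
```

In a real MPI code the scatter and gather would be the collective calls themselves, with the bottleneck being the master's bandwidth rather than the pool size.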
1.2 Peer to Peer
- Every process has equal precedence and communicates directly with the other processes (or, more usually, a subset of them)
- Advantages/Disadvantages
(1) Advantages
- Very well suited to strongly coupled problems, especially when each node only needs to communicate with a subset of its neighbours
- Good scalability: if the number of communicating neighbours is independent of system size, the communication load on a given node is usually independent of (or only weakly dependent on) the number of processes used
(2) Disadvantages - usually harder to code, since no single node is in charge of the system
- Typically no node knows the whole solution
- Results from different nodes need post-processing
- Types of communications to be used –
(1) Will rely mainly on non-blocking point-to-point communications - MPI_Isend and MPI_Irecv
(2) Most appropriate when each node communicates with only a subset of the other nodes, rather than requiring fully interconnected all-to-all communication
(3) Sometimes data may need to be communicated between all nodes (e.g. agreeing a single time step when using dynamic time-stepping) - MPI_Allgather or MPI_Allreduce are most appropriate here
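The neighbour-subset communication pattern above is the classic halo exchange. The sketch below simulates it in pure Python with all "ranks" in one process, purely to show the data flow; in a real peer-to-peer MPI code each rank would post MPI_Isend/MPI_Irecv to its two ring neighbours instead of indexing a shared list.

```python
# Pure-Python simulation of a halo exchange on a 1D periodic ring.
# Each "rank" owns one segment of data and needs the edge values of
# its left and right neighbours.

def halo_exchange(segments):
    """Return, for each rank, the (left, right) halo values it would receive."""
    n = len(segments)
    halos = []
    for rank in range(n):
        left = segments[(rank - 1) % n][-1]   # value arriving from left neighbour
        right = segments[(rank + 1) % n][0]   # value arriving from right neighbour
        halos.append((left, right))
    return halos

segments = [[0, 1], [2, 3], [4, 5], [6, 7]]
print(halo_exchange(segments))
```

Because each rank talks to a fixed number of neighbours, the per-rank communication volume here is independent of the total number of ranks - the scalability property noted above.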
1.3 Hybrid Architectures
- A program does not need to follow a single communication architecture faithfully
- An example of where a hybrid architecture is appropriate might be domain decomposition with dynamic load balancing
2. Parallel Performance
2.1 Introduction
- Two metrics for evaluating the performance of parallel code
(1) Speedup ratio
How many times faster the code executes in parallel relative to the serial code
$S=\frac{T_1}{T_N}$
(2) Parallel Efficiency
How fast the code is relative to the ideal speedup
$E=\frac{T_1}{NT_N}=\frac{S}{N}$
- Parallel efficiency usually falls as the number of cores increases. Super-linear speedup (efficiency greater than one) is possible but rare, and is usually caused by the smaller per-core tasks making more effective use of cache
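The two definitions can be written directly as functions. The timings below (100 s serial, 30 s on 4 cores) are made-up numbers chosen only to illustrate the arithmetic.

```python
def speedup(t1, tn):
    """Speedup ratio S = T1 / TN."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Parallel efficiency E = T1 / (N * TN) = S / N."""
    return t1 / (n * tn)

# Hypothetical timings: 100 s serial, 30 s on 4 cores.
print(speedup(100.0, 30.0))        # S ~ 3.33: 3.33x faster than serial
print(efficiency(100.0, 30.0, 4))  # E ~ 0.83: 83% of ideal 4x speedup
```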
2.2 Amdahl’s Law
- Based on the idea that part of the solution is parallel and part is serial
(1) f is the fraction of the code that executes in parallel
(2) Assumes that the parallel portion has an efficiency of 1
$T_N=(1-f)T_1+\frac{f}{N}T_1=T_1\left(1-f+\frac{f}{N}\right)$
$S=\frac{1}{1-f+\frac{f}{N}}$
$E=\frac{1}{N(1-f)+f}$
This means that no matter how many cores are used, the speedup can never exceed $\frac{1}{1-f}$
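Amdahl's law is easy to check numerically. With a parallel fraction f = 0.9 (a value chosen just for illustration), the speedup approaches but never reaches the 1/(1-f) = 10x ceiling:

```python
def amdahl_speedup(f, n):
    """Amdahl's law: S = 1 / (1 - f + f/N) for parallel fraction f on N cores."""
    return 1.0 / (1.0 - f + f / n)

# Speedup saturates well below N as N grows, capped at 1/(1-f) = 10.
for n in (4, 16, 256, 10**6):
    print(n, amdahl_speedup(0.9, n))
```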
2.3 Communication and Parallel Efficiency
- In distributed-memory codes most of the computation is parallel. Inefficiency often comes from the time spent transferring data (or waiting for communications) relative to the time spent doing calculations
$T_{total}=T_{calculate}+T_{communicate}$
If we assume the problem has size P and can be decomposed perfectly, then
$T_{calculate}\propto\frac{P}{N}$
The communications are often associated with the "edges" of the data
$T_{communicate}\propto\left(\frac{P}{N}\right)^n$
where n is usually between zero and one and typically depends on the dimensionality of the data
- We can combine these to estimate how the speedup and parallel efficiency are expected to change as the problem size and number of cores vary. Note that these are only approximations, but they are useful for understanding the expected trends
$E\approx\frac{1}{1+kP^{n-1}N^{1-n}}$
(1) where k is problem-specific and relates to the relative cost of communication and calculation
(2) Assuming 0 < n < 1, this equation implies:
For a given problem size P, efficiency falls as N increases. It also means that a larger problem will have higher efficiency on the same number of cores
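Both trends can be read off directly from the model. The values k = 1 and n = 0.5 below are arbitrary placeholders (both are problem-specific), used only to show the direction of the trends:

```python
def model_efficiency(p, n_cores, k=1.0, n=0.5):
    """Efficiency model E ~ 1 / (1 + k * P**(n-1) * N**(1-n)).

    k and n are problem-specific; the values used here are illustrative.
    """
    return 1.0 / (1.0 + k * p ** (n - 1) * n_cores ** (1 - n))

# For fixed P, efficiency falls as N grows; a bigger P recovers it.
print(model_efficiency(1e6, 16))    # fewer cores: higher efficiency
print(model_efficiency(1e6, 256))   # more cores: lower efficiency
print(model_efficiency(1e8, 256))   # larger problem, same cores: higher efficiency
```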
2.4 Efficiency of domain decomposition
- If we assume constant resolution, then for a 3D system the computational time will be roughly proportional to the volume of a domain and the communication to the surface area of the domain (the volume of the domain raised to the power 2/3)
In a 2D system the equivalent is the computational time varying with the area of the domain and the communications with the perimeter of the domain
For domain decomposition
$n\approx\frac{d-1}{d}$
$E_{3D}\approx\frac{1}{1+k\left(\frac{V}{N}\right)^{-\frac{1}{3}}}$
$E_{2D}\approx\frac{1}{1+k\left(\frac{V}{N}\right)^{-\frac{1}{2}}}$
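Substituting n = (d-1)/d into the efficiency model gives an exponent of -1/d on the per-domain size V/N, which can be evaluated for any dimensionality. The value k = 1 and the system sizes below are illustrative only:

```python
def dd_efficiency(v, n_cores, d, k=1.0):
    """Domain-decomposition efficiency E ~ 1 / (1 + k * (V/N)**(-1/d)).

    v is the total system size (volume in 3D, area in 2D); k is illustrative.
    """
    return 1.0 / (1.0 + k * (v / n_cores) ** (-1.0 / d))

# Same per-core subdomain size (512 cells) in 2D and 3D:
print(dd_efficiency(4096, 8, d=3))  # 3D: surface/volume ratio hurts more
print(dd_efficiency(4096, 8, d=2))  # 2D: perimeter/area ratio is milder
```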
2.5 Note
- It is almost always more efficient to execute tasks in parallel than to parallelise over the data. (1) For example, if you have 10 large simulations to complete, the simulation code is parallel, and 100 cores are available, there are two options:
- Carry out each of the simulations on 100 cores, doing this for each of the 10 simulations
- Carry out all 10 simulations at the same time, each using 10 cores.
(2) The second option is usually the best choice - the parallel efficiency of an individual simulation falls as the number of cores increases
- The exception to this heuristic will be when some of the simulations to be carried out are computationally much more expensive than others
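The comparison between the two options can be made concrete with the efficiency model from section 2.3. All numbers here (P = 10^6, k = 1, n = 0.5) are illustrative assumptions, not measurements; the point is only the direction of the inequality for identical simulations:

```python
def model_efficiency(p, n_cores, k=1.0, n=0.5):
    """Same illustrative efficiency model as in section 2.3."""
    return 1.0 / (1.0 + k * p ** (n - 1) * n_cores ** (1 - n))

def total_time(p, n_cores, n_sims, concurrent):
    """Hypothetical wall-clock time (arbitrary units) for n_sims identical runs."""
    t_one = p / (n_cores * model_efficiency(p, n_cores))  # T_N = T_1 / (N * E)
    return t_one if concurrent else n_sims * t_one

P, SIMS, CORES = 1e6, 10, 100
one_after_another = total_time(P, CORES, SIMS, concurrent=False)        # 10 runs, 100 cores each
all_at_once = total_time(P, CORES // SIMS, SIMS, concurrent=True)       # 10 runs at once, 10 cores each
print(one_after_another, all_at_once)  # the concurrent option finishes sooner
```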