
  • Title: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
  • Year: 2022
  • Venue: OSDI
  • Institution: UCB

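Conventional DNN parallelization strategies: existing distributed training systems either require users to write a parallelization plan by hand, or generate a plan automatically from a limited space of model-parallelism configurations.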

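  • Data parallelism: the model is replicated on every device and the dataset is partitioned across the devices
  • Operator parallelism: a single op is partitioned along non-batch axes across multiple devices
  • Pipeline parallelism: different ops or stages are placed on different devices and executed as a pipeline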
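
As a rough illustration of the first two strategies, the sketch below runs one matrix multiplication on a single machine, with list slices standing in for devices. The shapes, the two-way split, and the helper `matmul` are assumptions made for this example only. Data parallelism splits the batch axis of the input, while operator parallelism splits a non-batch axis of the weight.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # batch of 4 samples, 2 features each
w = [[1, 0, 2, 1], [0, 1, 1, 3]]       # weight: 2 inputs -> 4 outputs
reference = matmul(x, w)

# Data parallelism: each "device" holds all of w and half of the batch.
data_parallel = matmul(x[:2], w) + matmul(x[2:], w)

# Operator parallelism: each "device" holds the full batch and half of w's columns.
left = matmul(x, [row[:2] for row in w])
right = matmul(x, [row[2:] for row in w])
op_parallel = [l + r for l, r in zip(left, right)]
```

Both variants reproduce `reference`. They differ in what has to be communicated: the data-parallel version would need to synchronize gradients of the replicated `w` during training, while the operator-parallel version has to gather the output columns.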


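Main contributions of the paper: the paper classifies the parallelism used in distributed training into inter-operator parallelism and intra-operator parallelism.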

  • We construct a two-level parallel execution plan space where plans are specified hierarchically using inter- and intra-operator parallelisms.

  • We design tractable optimization algorithms to derive near-optimal execution plans at each level.

  • We implement Alpa, a compiler system for distributed DL on GPU clusters. Alpa features:
    (1) a set of compilation passes that generate execution plans using the hierarchical optimization algorithms,
    (2) a new runtime architecture that orchestrates the inter-op parallelism between device meshes, and
    (3) a number of system optimizations that improve compilation and address cross-mesh communication.

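  • intra-operator parallelism: higher hardware utilization, but every training iteration needs communication when tensors are split and merged

  • inter-operator parallelism: communication is only needed between adjacent stages, but data dependencies can leave devices idle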


a) Scatter-gather optimization for equal mesh shapes, as in Megatron-LM
b) Send/recv for unequal mesh shapes
c) Local all-gather on unequal mesh shapes

  • $G = (V, E)$: computational graph
  • $v \in V, e \in E$: node $v$ and edge $e$ in the graph
  • $k_v$: the number of possible parallel algorithms for node $v$
  • $c_v \in \mathbb{R}^{k_v}$: communication cost vector of length $k_v$
  • $d_v \in \mathbb{R}^{k_v}$: compute cost vector
  • $s_v \in \{0, 1\}^{k_v}$: one-hot decision vector for node $v$
  • $R_{vu} \in \mathbb{R}^{k_v \times k_u}$: resharding cost from the output of node $v$ to the input of node $u$

$$\min_s \sum_{v \in V} s_v^\top (c_v + d_v) + \sum_{(v,u) \in E} s_v^\top R_{vu} s_u$$
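
To make the objective concrete, here is a minimal sketch that evaluates it on a toy two-node graph and finds the best assignment by exhaustive search. The exhaustive search is a stand-in chosen for readability, not the solver the paper uses, and all cost numbers are made up. Each one-hot vector $s_v$ is represented by the index of its single non-zero entry.

```python
from itertools import product

def plan_cost(choice, comm, comp, reshard, edges):
    """Objective value for one assignment; choice[v] is the index picked by s_v."""
    node_cost = sum(comm[v][choice[v]] + comp[v][choice[v]] for v in choice)
    edge_cost = sum(reshard[(v, u)][choice[v]][choice[u]] for v, u in edges)
    return node_cost + edge_cost

def best_plan(comm, comp, reshard, edges):
    """Exhaustive search over all assignments; only feasible for tiny graphs."""
    nodes = sorted(comm)
    index_ranges = [range(len(comm[v])) for v in nodes]
    candidates = (dict(zip(nodes, idx)) for idx in product(*index_ranges))
    return min(candidates, key=lambda c: plan_cost(c, comm, comp, reshard, edges))

# Toy graph a -> b, two parallel algorithms per node (hypothetical costs).
comm = {"a": [0.0, 2.0], "b": [1.0, 0.0]}
comp = {"a": [4.0, 1.0], "b": [2.0, 2.0]}
edges = [("a", "b")]
reshard = {("a", "b"): [[0.0, 3.0], [3.0, 0.0]]}

plan = best_plan(comm, comp, reshard, edges)
cost = plan_cost(plan, comm, comp, reshard, edges)
```

In this toy instance the resharding term is what couples the two nodes: picking the cheapest algorithm for each node in isolation is only optimal if the chosen layouts also match along the edge.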

  • $o_1, o_2, \ldots, o_K$: a sequence of operators
  • $s_1, s_2, \ldots, s_S$: the stages; each stage $s_i$ consists of operators $(o_{l_i}, \ldots, o_{r_i})$
  • $n_i \times m_i$: shape of the submesh, sliced from the compute cluster, that stage $s_i$ is assigned to
  • $N \times M$: the shape of the cluster mesh
  • $\sum_{i=1}^{S} n_i \cdot m_i = N \cdot M$: the submeshes together cover the whole cluster
  • $t_i = t_{intra}(s_i, Mesh(n_i, m_i))$: the latency of executing stage $s_i$

$$T^* = \min_{s_1, \ldots, s_S;\ (n_1, m_1), \ldots, (n_S, m_S)} \left\{ \sum_{i=1}^{S} t_i + (B-1) \cdot \max_{1 \leq j \leq S} t_j \right\}$$

$$T^* = \min_{s_1, \ldots, s_S;\ (n_1, m_1), \ldots, (n_S, m_S)} \left\{ t_{total} + (B-1) \cdot t_{max} \right\}$$

Here $B$ is the number of microbatches, $t_{total} = \sum_{i=1}^{S} t_i$ and $t_{max} = \max_{1 \leq j \leq S} t_j$.
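
A small sketch of the bracketed term may help show why the slowest stage dominates. The function name and the stage latencies are made up for illustration, and $B$ is read as the number of microbatches.

```python
def pipeline_latency(stage_latencies, num_microbatches):
    """Return sum_i t_i + (B - 1) * max_j t_j for B microbatches."""
    return sum(stage_latencies) + (num_microbatches - 1) * max(stage_latencies)

# Same total work (6.0), but different balance across the three stages.
balanced = pipeline_latency([2.0, 2.0, 2.0], num_microbatches=8)
imbalanced = pipeline_latency([1.0, 1.0, 4.0], num_microbatches=8)
```

With 8 microbatches the balanced split costs 20.0 and the imbalanced split costs 34.0, even though the first term is identical, because the second term scales the bottleneck stage by $B - 1$.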

  • $F(s, k, d; t_{max})$: the minimal total latency when slicing operators $o_k$ to $o_K$ into $s$ stages and putting them onto $d$ devices, such that the latency of each stage does not exceed $t_{max}$
  • $mem_{stage} + s \cdot mem_{act} \leq mem_{device}$: the memory constraint that a stage must satisfy on its submesh

$$T^* = \min_{s} \left\{ F(s, 0, N \cdot M; t_{max}) + (B-1) \cdot t_{max} \right\}$$
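
The dynamic program can be sketched as follows, under several simplifying assumptions that are not from the paper: devices are counted as a single integer instead of a 2-D mesh shape, the memory constraint is dropped (so the stage count $s$ no longer needs to be tracked and is folded into the minimum), and `t_intra` is a hypothetical cost model with a per-device communication overhead. Every achievable stage latency is tried as a candidate $t_{max}$.

```python
from functools import lru_cache

def best_pipeline(op_costs, num_devices, num_microbatches, submesh_sizes, comm_overhead=1.0):
    """Minimise sum(t_i) + (B - 1) * t_max over stage boundaries and submesh sizes."""
    K = len(op_costs)

    def t_intra(lo, hi, devices):
        # Hypothetical stand-in for the intra-op pass: ideal speed-up plus overhead.
        return sum(op_costs[lo:hi]) / devices + comm_overhead * (devices - 1)

    # Every achievable stage latency is a candidate value for t_max.
    candidates = sorted({t_intra(lo, hi, d)
                         for lo in range(K) for hi in range(lo + 1, K + 1)
                         for d in submesh_sizes})
    best = float("inf")
    for t_max in candidates:
        @lru_cache(maxsize=None)
        def F(k, d):
            # Minimal total latency for ops k..K-1 on exactly d devices,
            # with every stage latency bounded by the current t_max.
            if k == K:
                return 0.0 if d == 0 else float("inf")
            result = float("inf")
            for hi in range(k + 1, K + 1):
                for size in submesh_sizes:
                    if size <= d:
                        t = t_intra(k, hi, size)
                        if t <= t_max:
                            result = min(result, t + F(hi, d - size))
            return result

        best = min(best, F(0, num_devices) + (num_microbatches - 1) * t_max)
    return best

latency = best_pipeline([4.0, 4.0, 4.0, 4.0], num_devices=4,
                        num_microbatches=4, submesh_sizes=[1, 2, 4])
```

Under this toy cost model the best plan uses two stages of two operators on two devices each: a single four-device stage pays too much overhead, and four single-device stages pay too much total latency.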

  • Title: On Optimizing the Communication of Model Parallelism
  • Year: 2022
  • Venue: MLSys
  • Institution: UCB

Neither intra-op parallelism nor inter-op parallelism alone suffices to train large models. In practice, they must be combined to support large models like GPT-3.
This combined strategy is implemented in many model-parallel systems by first partitioning the computational graph using inter-op parallelism then sharding each stage using intra-op parallelism.
Specifically, the graph is first partitioned into multiple stages. Each stage is assigned to a group of devices, referred to as a device mesh, sliced from the cluster.
Operators and tensors of a stage are parallelized over that stage’s assigned mesh following a chosen intra-op parallelism plan; collective communication happens only across devices within each mesh.
At the boundary of any two adjacent stages, communication is required to exchange tensors between their meshes.
Unlike inter-op parallelism, the tensor might have been sharded with different layouts on the source and destination meshes – in which case, communication involves not only transferring the tensor, but also performing tensor layout conversion between the source and destination groups of devices.
We call this communication pattern cross-mesh resharding, which is the focus of this paper.

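The general cross-mesh resharding problem can be decomposed into multiple unit communication tasks, each responsible for sending one data slice. We formulate the original problem as a two-level optimization problem:

  • Optimizing a single unit communication task, for which broadcast is used to achieve the best performance.
  • Load balancing and scheduling of the multiple unit communication tasks within a cross-mesh resharding.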


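Each cross-mesh resharding consists of multiple unit communication tasks. These tasks may overlap on sender and receiver devices and affect each other's performance. To optimize the total completion time of a cross-mesh resharding, we therefore treat it as a load-balancing and scheduling problem. The candidate algorithms are:

  • Naive algorithm: assign each task to the first (lowest-index) device in the sender's host, and schedule all tasks in an arbitrary global order.
  • Load-balance-only algorithm: assign each task to the sender's host that currently has the lightest workload.
  • Depth-first search with pruning: for each host, assign an execution order to all send/recv tasks on that host; the search has exponential complexity.
  • Randomized greedy search: randomly order all tasks still to be scheduled, then iterate over them and select every task that does not overlap with the tasks already selected. We repeat this process several times and keep the largest candidate set for the current iteration.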
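
The randomized greedy search can be sketched as below. This is a minimal version under assumptions of this example: each task already has a fixed sender host, two tasks overlap when they share the sender or any receiver host, and the task dictionaries, `trials`, and `seed` are illustrative names rather than anything from the paper's implementation.

```python
import random

def conflicts(a, b):
    """Two tasks overlap if they share the sender host or any receiver host."""
    return a["sender"] == b["sender"] or bool(a["receivers"] & b["receivers"])

def randomized_greedy_rounds(tasks, trials=20, seed=0):
    """Group tasks into rounds; tasks inside one round never overlap."""
    rng = random.Random(seed)
    remaining = list(tasks)
    rounds = []
    while remaining:
        best = []
        for _ in range(trials):
            order = remaining[:]
            rng.shuffle(order)
            chosen = []
            for task in order:
                # Keep the task only if it is compatible with everything chosen so far.
                if all(not conflicts(task, other) for other in chosen):
                    chosen.append(task)
            if len(chosen) > len(best):
                best = chosen
        rounds.append(best)
        remaining = [t for t in remaining if all(t is not c for c in best)]
    return rounds

tasks = [
    {"name": "t0", "sender": 0, "receivers": {2}},
    {"name": "t1", "sender": 0, "receivers": {3}},
    {"name": "t2", "sender": 1, "receivers": {2}},
    {"name": "t3", "sender": 1, "receivers": {3}},
]
rounds = randomized_greedy_rounds(tasks)
```

For these four tasks the only compatible pairs are (t0, t3) and (t1, t2), so the schedule needs two rounds of two tasks each, in either order.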


  • $D$: N-dimensional tensor
  • $|Mesh_A| = m_1 \times m_2$: mesh of compute devices
  • $X_i \in \{S, R\}, 0 \leq i \leq N-1$: whether the i-th dimension of $D$ is sharded ($S$) or replicated ($R$)
  • $X_i^{d_i} = S^0$: the i-th dimension of $D$ is sharded along the first dimension of $Mesh_A$
  • $X_i^{d_i} = S^{01}$: the i-th dimension of $D$ is sharded along both dimensions of $Mesh_A$
  • $\mathcal{X} = X_0^{d_0} X_1^{d_1} \ldots X_{N-1}^{d_{N-1}}, d_i \in \{0, 1, 01\}$: N-element string describing the layout of $D$
  • $DS_i$: the i-th data slice of $D$
  • $T_i$: the duration of the i-th task
  • $S_i$: the execution starting time of the i-th task
  • $n_i \subseteq Mesh_A$: the set of hosts that can send
  • $n_{i*} \in n_i$: the host chosen to send the data
  • $m_i \subseteq Mesh_B$: the set of hosts that receive
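
To see how such a layout string determines what each device holds, the sketch below computes the per-device slice shape. The string encoding (`"R"`, `"S0"`, `"S1"`, `"S01"`), the function name, and the requirement that every dimension divides evenly are assumptions made for this example.

```python
def shard_shape(tensor_shape, spec, mesh_shape):
    """Per-device slice shape of a tensor laid out on an m1 x m2 mesh."""
    m1, m2 = mesh_shape
    # Number of pieces a tensor dimension is cut into for each per-dimension spec.
    pieces = {"R": 1, "S0": m1, "S1": m2, "S01": m1 * m2}
    shape = []
    for size, dim_spec in zip(tensor_shape, spec):
        parts = pieces[dim_spec]
        if size % parts != 0:
            raise ValueError(f"dimension of size {size} is not divisible by {parts}")
        shape.append(size // parts)
    return tuple(shape)

mesh = (2, 4)                                           # m1 = 2, m2 = 4
row_sharded = shard_shape((8, 12), ["S0", "R"], mesh)   # rows split over mesh dim 0
both_axes = shard_shape((8, 12), ["S0", "S1"], mesh)    # rows over dim 0, columns over dim 1
flattened = shard_shape((8, 12), ["S01", "R"], mesh)    # rows split over all 8 devices
```

When the source and destination meshes use different strings, the slices on the two sides no longer line up, which is exactly what makes cross-mesh resharding more than a plain tensor transfer.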

$$\min_{S, n_*} \max_i \; (S_i + T_i)$$

$$\text{s.t.} \quad n_{i*} \in n_i$$

$$(S_i, S_i + T_i) \cap (S_j, S_j + T_j) = \emptyset \quad \forall \; n_{i*} = n_{j*} \ \text{or} \ m_i \cap m_j \neq \emptyset$$

$$\min_{n_*} \max_{k \in Mesh_A} \sum_{i:\, n_{i*} = k} T_i$$
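
The last objective, the load on the busiest sender, can be evaluated directly. The sketch below compares two sender-selection rules on made-up task durations and candidate host sets. As a simplification of this example, the naive rule is modeled as always picking the lowest-index candidate host, and the balanced rule as greedily picking the currently least-loaded candidate.

```python
def max_sender_load(durations, assignment):
    """Largest total duration assigned to any single sender host."""
    load = {}
    for duration, host in zip(durations, assignment):
        load[host] = load.get(host, 0.0) + duration
    return max(load.values())

def naive_assignment(candidates):
    """Always pick the lowest-index candidate sender."""
    return [min(hosts) for hosts in candidates]

def least_loaded_assignment(durations, candidates):
    """Greedily pick the candidate sender with the lightest current load."""
    load = {}
    assignment = []
    for duration, hosts in zip(durations, candidates):
        host = min(sorted(hosts), key=lambda h: load.get(h, 0.0))
        load[host] = load.get(host, 0.0) + duration
        assignment.append(host)
    return assignment

durations = [3.0, 3.0, 2.0, 2.0]             # T_i of each task
candidates = [{0, 1}, {0, 1}, {0, 1}, {1}]   # n_i of each task

naive_load = max_sender_load(durations, naive_assignment(candidates))
balanced_load = max_sender_load(durations, least_loaded_assignment(durations, candidates))
```

Here the naive rule piles the first three tasks onto host 0 for a maximum load of 8.0, while the greedy rule spreads them and reaches 5.0. Neither rule looks at receiver overlap, which is why the full problem also needs the scheduling step.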






