- Title: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- Year: 2022
- Venue: OSDI
- Institution: UCB
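Traditional DNN parallelization strategies: existing distributed training systems either require users to create the parallelization plan by hand, or generate one automatically but only from a limited space of model-parallel configurations.
- Data parallelism: the model is replicated on every device and the dataset is partitioned across the devices.
- Operator parallelism: a single op is partitioned along non-batch axes across multiple devices.
- Pipeline parallelism: different ops or stages are placed on different devices, which execute as a pipeline.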
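A minimal sketch of the three strategies on a toy matmul, written in plain Python so it runs without any framework. The tensors `x`, `w`, `w2` and the helper `matmul` are made up for illustration, and the "devices" are just separate function calls.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 samples (batch axis) x 2 features
w = [[1, 0, 2, 1], [0, 1, 1, 3]]       # 2 x 4 weight of one matmul operator
reference = matmul(x, w)

# Data parallelism: every device holds all of w and half of the batch.
data_parallel = matmul(x[:2], w) + matmul(x[2:], w)

# Operator parallelism: every device holds all of x and half of w's columns
# (a non-batch axis); the partial outputs are concatenated column-wise.
w_left = [row[:2] for row in w]
w_right = [row[2:] for row in w]
op_parallel = [l + r for l, r in zip(matmul(x, w_left), matmul(x, w_right))]

# Pipeline parallelism: two operators live on different devices, and the
# output of the first stage is sent to the second stage.
w2 = [[1], [1], [1], [1]]
pipeline = matmul(matmul(x, w), w2)

print(data_parallel == reference, op_parallel == reference)
```

Both partitioned variants reproduce the unpartitioned result; what differs in practice is which tensors are replicated and where communication is needed.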
Main contributions: the paper classifies the parallelism used in distributed training into inter-operator parallelism and intra-operator parallelism.
- We construct a two-level parallel execution plan space where plans are specified hierarchically using inter- and intra-operator parallelisms.
- We design tractable optimization algorithms to derive near-optimal execution plans at each level.
- We implement Alpa, a compiler system for distributed DL on GPU clusters. Alpa features:
  (1) a set of compilation passes that generate execution plans using the hierarchical optimization algorithms,
  (2) a new runtime architecture that orchestrates the inter-op parallelism between device meshes, and
  (3) a number of system optimizations that improve compilation and address cross-mesh communication.

The two kinds of parallelism trade off as follows:
- Intra-operator parallelism: higher hardware utilization, but every training iteration needs communication where tensors are split and merged.
- Inter-operator parallelism: communication is needed only between adjacent stages, but data dependencies can leave devices idle.
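In the figure, red arrows denote send/recv over slow connections and green arrows denote all-gather over fast connections.
a) The scatter-gather optimization for equal mesh shapes in Megatron-LM
b) Send/recv for unequal mesh shapes
c) Local all-gather for unequal mesh shapes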
- $G = (V, E)$: computational graph
- $v \in V, e \in E$: node $v$ and edge $e$ in the graph
- $k_v$: the number of possible parallel algorithms for node $v$
- $c_v \in \mathbb{R}^{k_v}$: communication cost vector of length $k_v$
- $d_v \in \mathbb{R}^{k_v}$: compute cost vector
- $s_v \in \{0, 1\}^{k_v}$: one-hot decision vector for node $v$
- $R_{vu} \in \mathbb{R}^{k_v \times k_u}$: resharding cost from the output of node $v$ to the input of node $u$

$$\min_s \sum_{v \in V} s_v^\top (c_v + d_v) + \sum_{(v,u) \in E} s_v^\top R_{vu} s_u$$
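To make the objective concrete, the sketch below evaluates it on a hypothetical two-node graph and picks the cheapest assignment by exhaustive enumeration. Enumeration is only a stand-in that works for tiny graphs; the paper formulates the problem as an integer linear program. All cost numbers and the names `plan_cost` and `best_plan` are invented for illustration.

```python
from itertools import product

def plan_cost(choice, node_cost, reshard, edges):
    """Objective value for one plan; choice[v] is the index set to 1 in s_v."""
    total = sum(node_cost[v][choice[v]] for v in choice)      # s_v^T (c_v + d_v)
    total += sum(reshard[(v, u)][choice[v]][choice[u]]        # s_v^T R_vu s_u
                 for v, u in edges)
    return total

def best_plan(node_cost, reshard, edges):
    """Try every combination of one-hot decisions and keep the cheapest."""
    nodes = list(node_cost)
    candidates = product(*(range(len(node_cost[v])) for v in nodes))
    plans = (dict(zip(nodes, c)) for c in candidates)
    return min(plans, key=lambda p: plan_cost(p, node_cost, reshard, edges))

# Hypothetical graph a -> b; each node has k_v = 2 parallel algorithms.
node_cost = {"a": [4.0, 1.0], "b": [2.0, 3.0]}   # c_v + d_v per algorithm
edges = [("a", "b")]
reshard = {("a", "b"): [[0.0, 5.0], [6.0, 0.5]]}  # R_ab, rows index a's choice

plan = best_plan(node_cost, reshard, edges)
print(plan, plan_cost(plan, node_cost, reshard, edges))
```

Note that picking the locally cheapest algorithm for `b` (index 0) would force an expensive resharding from `a`; the edge term is what couples the per-node decisions.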
- $o_1, o_2, ..., o_K$: a sequence of operators
- $s_1, s_2, ..., s_S$: the stages; each stage $s_i$ consists of operators $(o_{l_i}, ..., o_{r_i})$
- $n_i \times m_i$: the shape of the submesh, sliced from the compute cluster, to which stage $s_i$ is assigned
- $N \times M$: the shape of the cluster mesh
- $\sum_{i=1}^{S} n_i \cdot m_i = N \cdot M$: the submeshes together cover the whole cluster
- $t_i = t_{intra}(s_i, Mesh(n_i, m_i))$: the latency of executing stage $s_i$ on its submesh
$$T^* = \min_{s_1, ..., s_S;\ (n_1, m_1), ..., (n_S, m_S)} \left\{ \sum_{i=1}^{S} t_i + (B-1) \cdot \max_{1 \leq j \leq S} \{t_j\} \right\}$$

Here $B$ is the number of microbatches fed through the pipeline.
$$T^* = \min_{s_1, ..., s_S;\ (n_1, m_1), ..., (n_S, m_S)} \left\{ t_{total} + (B-1) \cdot t_{max} \right\}$$
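The bracketed term can be evaluated directly for a fixed stage assignment. The sketch below does that for two hypothetical sets of stage latencies; the function name and the numbers are illustrative only.

```python
def pipeline_latency(stage_latencies, num_microbatches):
    """t_total + (B - 1) * t_max for one fixed assignment of stages."""
    t_total = sum(stage_latencies)   # time for the first microbatch to pass all stages
    t_max = max(stage_latencies)     # the slowest stage paces every later microbatch
    return t_total + (num_microbatches - 1) * t_max

# Same total work (8 time units), different balance across 4 stages.
balanced = pipeline_latency([2.0, 2.0, 2.0, 2.0], num_microbatches=8)
unbalanced = pipeline_latency([1.0, 1.0, 1.0, 5.0], num_microbatches=8)
print(balanced, unbalanced)
```

With many microbatches the $(B-1) \cdot t_{max}$ term dominates, which is why the optimization cares so much about the slowest stage.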
- $F(s, k, d; t_{max})$: the minimal total latency when slicing operators $o_k$ to $o_K$ into $s$ stages and putting them onto $d$ devices, such that the latency of each stage is at most $t_{max}$
- $mem_{stage} + s \cdot mem_{act} \leq mem_{device}$: the memory constraint that a stage must satisfy on its devices
$$T^* = \min_{s} \left\{ F(s, 0, N \cdot M; t_{max}) + (B-1) \cdot t_{max} \right\}$$
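A simplified sketch of this dynamic program, under loud assumptions: devices are a flat count rather than an $N \times M$ mesh, submeshes are restricted to a few hypothetical sizes, the memory constraint is ignored, and `t_intra` is a made-up cost model (linear speedup plus a per-device communication overhead) instead of the result of the intra-op pass. It enumerates candidate values of $t_{max}$ and, for each, computes $F$ with memoized recursion.

```python
from functools import lru_cache
from math import inf

def best_pipeline(op_costs, num_devices, num_microbatches,
                  submesh_sizes=(1, 2, 4), comm_overhead=0.5):
    K = len(op_costs)

    def t_intra(k, i, n):
        # Hypothetical stage latency for operators k..i on n devices.
        return sum(op_costs[k:i + 1]) / n + comm_overhead * (n - 1)

    # Only latencies that some stage can actually have are useful t_max values.
    candidates = sorted({t_intra(k, i, n)
                         for k in range(K) for i in range(k, K)
                         for n in submesh_sizes})
    best = inf
    for t_max in candidates:

        @lru_cache(maxsize=None)
        def F(s, k, d):
            """Min total latency for operators k..K-1 in s stages on d devices."""
            if s == 0:
                return 0.0 if (k == K and d == 0) else inf
            result = inf
            for i in range(k, K):            # first stage covers operators k..i
                for n in submesh_sizes:      # and runs on n devices
                    if n > d:
                        continue
                    t = t_intra(k, i, n)
                    if t <= t_max:
                        result = min(result, t + F(s - 1, i + 1, d - n))
            return result

        for s in range(1, K + 1):
            best = min(best, F(s, 0, num_devices) + (num_microbatches - 1) * t_max)
    return best

latency = best_pipeline([4.0, 4.0, 4.0, 4.0], num_devices=4, num_microbatches=8)
print(latency)
```

Under this toy cost model, the best plan uses two stages of two operators on two devices each: a single four-device stage pays too much communication overhead, and four one-device stages get no intra-op speedup.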
- Title: On Optimizing the Communication of Model Parallelism
- Year: 2022
- Venue: MLSys
- Institution: UCB
Neither intra-op parallelism nor inter-op parallelism alone suffices to train large models. In practice, they must be combined to support large models like GPT-3.
This combined strategy is implemented in many model-parallel systems by first partitioning the computational graph using inter-op parallelism, then sharding each stage using intra-op parallelism.
Specifically, the graph is first partitioned into multiple stages. Each stage is assigned to a group of devices, referred to as a device mesh, sliced from the cluster.
Operators and tensors of a stage are parallelized over that stage's assigned mesh following a chosen intra-op parallelism plan; collective communication happens only across devices within each mesh.
At the boundary of any two adjacent stages, communication is required to exchange tensors between their meshes.
Unlike in pure inter-op parallelism, the tensor might have been sharded with different layouts on the source and destination meshes, in which case communication involves not only transferring the tensor but also converting the tensor layout between the source and destination groups of devices.
We call this communication pattern cross-mesh resharding, which is the focus of this paper.
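A general cross-mesh resharding problem can be decomposed into multiple unit communication tasks, each responsible for sending one data slice. We formulate the original problem as a two-level optimization problem:
- Optimizing a single unit communication task, for which broadcast is used to achieve the best performance.
- Load balancing and scheduling of the multiple unit communication tasks within one cross-mesh resharding.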
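Cluster setting: the nodes are connected in a fully connected topology, with independent send and receive bandwidth (full duplex).
Each cross-mesh resharding consists of multiple unit communication tasks. These tasks may overlap on sender and receiver devices and affect each other's performance. To minimize the total completion time of a cross-mesh resharding, we therefore treat it as a load balancing and scheduling problem, which requires:
(1) balancing load by distributing the communication workload evenly across sender devices and inter-host communication links, to avoid congestion and stragglers;
(2) ordering the tasks assigned to each device so as to minimize the waiting caused by unavailable senders or receivers.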
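- Naive algorithm: each task is sent by the first (lowest-index) device in the sender's host, and all tasks are scheduled in an arbitrary global order.
- Load balance only: each task is assigned to the sender host with the lightest current workload.
- Depth-first search with pruning: for each host, assign an execution order to all send/recv tasks on that host; the search has exponential complexity.
- Randomized greedy search: randomly order all tasks that remain to be scheduled, then iterate over them and select every task that does not overlap with the tasks selected so far. This process is repeated several times, and the largest candidate set is chosen for the current iteration.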
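The randomized greedy search can be sketched as follows. This is an illustrative reading of the description above, not the paper's implementation: a task is modeled as a hypothetical `(sender, receivers)` pair, two tasks overlap when they share the sender or any receiver, and `trials` and `seed` are made-up parameters.

```python
import random

def conflicts(a, b):
    """Two tasks overlap if they share the sender or any receiver."""
    return a[0] == b[0] or bool(a[1] & b[1])

def randomized_greedy_rounds(tasks, trials=20, seed=0):
    """Group task indices into rounds of mutually non-overlapping tasks."""
    rng = random.Random(seed)
    remaining = list(range(len(tasks)))
    rounds = []
    while remaining:
        best = []
        for _ in range(trials):
            order = remaining[:]
            rng.shuffle(order)                 # random order of unscheduled tasks
            chosen = []
            for i in order:                    # keep tasks that fit with earlier picks
                if not any(conflicts(tasks[i], tasks[j]) for j in chosen):
                    chosen.append(i)
            if len(chosen) > len(best):        # keep the largest candidate set
                best = chosen
        rounds.append(sorted(best))
        remaining = [i for i in remaining if i not in best]
    return rounds

# Hypothetical tasks: (sender host, set of receiver hosts).
tasks = [(0, {0, 1}), (0, {2, 3}), (1, {0, 1}), (1, {2, 3})]
rounds = randomized_greedy_rounds(tasks)
print(rounds)
```

In this example only tasks 0 and 3, or 1 and 2, can run together, so any run ends with two rounds of two tasks each.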
Formal notation
- $D$: an $N$-dimensional tensor
- $|Mesh_A| = m_1 \times m_2$: a mesh of compute devices
- $X_i \in \{S, R\},\ 0 \leq i \leq N-1$: whether the $i$-th dimension of $D$ is sharded ($S$) or replicated ($R$)
- $X_i^{d_i} = S^0$: the $i$-th dimension of $D$ is sharded along the first dimension of $Mesh_A$
- $X_i^{d_i} = S^{01}$: the $i$-th dimension of $D$ is sharded along both dimensions of $Mesh_A$
- $\mathcal{X} = X_0^{d_0} X_1^{d_1} ... X_{N-1}^{d_{N-1}},\ d_i \in \{0, 1, 01\}$: the sharding spec, an $N$-element string
- $DS_i$: a data slice of $D$
- $T_i$: the duration of the $i$-th task
- $S_i$: the execution start time of the $i$-th task
- $n_i \subseteq Mesh_A$: the set of candidate sender hosts
- $n_{i*} \in n_i$: the host chosen to send the data
- $m_i \subseteq Mesh_B$: the set of receiver hosts
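To illustrate how a sharding spec maps to data, the sketch below computes the index range that one device holds along each tensor dimension. The string encoding (`"R"`, `"S0"`, `"S1"`, `"S01"`), the row-major device numbering used for `S01`, and the assumption that every dimension divides evenly are choices made for this example.

```python
def device_slice(shape, mesh_shape, spec, device):
    """Half-open index range per tensor dimension held by `device` on a 2D mesh."""
    m1, m2 = mesh_shape
    a, b = device                         # device coordinates on the mesh
    # (number of shards, shard index of this device) for each spec symbol
    parts = {
        "R": (1, 0),                      # replicated: every device holds everything
        "S0": (m1, a),                    # sharded along mesh dimension 0
        "S1": (m2, b),                    # sharded along mesh dimension 1
        "S01": (m1 * m2, a * m2 + b),     # sharded along both (row-major, assumed)
    }
    ranges = []
    for size, symbol in zip(shape, spec):
        count, index = parts[symbol]
        step = size // count              # assumes size is divisible by count
        ranges.append((index * step, (index + 1) * step))
    return ranges

# An 8 x 4 tensor on a 2 x 2 mesh under two different specs.
rows_sharded = device_slice((8, 4), (2, 2), ("S0", "R"), device=(1, 0))
fully_sharded = device_slice((8, 4), (2, 2), ("S01", "R"), device=(1, 1))
print(rows_sharded, fully_sharded)
```

When the source and destination meshes use different specs, the ranges held by a sender and a receiver no longer line up, which is where the data slices $DS_i$ of a cross-mesh resharding come from.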
$$
\begin{aligned}
\min_{S,\, n_*}\ & \max_i\ (S_i + T_i) \\
\text{s.t.}\ & n_{i*} \in n_i \\
& (S_i, S_i + T_i) \cap (S_j, S_j + T_j) = \emptyset \quad \forall\, n_{i*} = n_{j*} \ \text{or}\ m_i \cap m_j \neq \emptyset
\end{aligned}
$$
$$\min_{n_*} \max_{k \in Mesh_A} \sum_{i:\, n_{i*} = k} T_i$$
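A minimal sketch of a greedy heuristic for this load-balancing objective: each task goes to the candidate sender with the lightest current load. Processing tasks longest-first is an extra choice made here, not something stated in the notes, and the task durations and candidate sets are invented.

```python
def balance_senders(durations, candidates):
    """Pick n_{i*} for each task; return the choices and max_k sum of T_i."""
    load = {}      # host k -> total duration of tasks assigned to it
    chosen = {}    # task i -> chosen sender host n_{i*}
    for i in sorted(durations, key=durations.get, reverse=True):
        # Lightest-loaded candidate; ties are broken by the lower host index.
        k = min(candidates[i], key=lambda host: (load.get(host, 0.0), host))
        chosen[i] = k
        load[k] = load.get(k, 0.0) + durations[i]
    return chosen, max(load.values())

# Hypothetical tasks: T_i per task, and the hosts n_i able to send each one.
durations = {0: 3.0, 1: 2.0, 2: 2.0, 3: 1.0}
candidates = {0: [0, 1], 1: [0, 1], 2: [0, 1], 3: [0, 1]}

chosen, max_load = balance_senders(durations, candidates)
print(chosen, max_load)
```

For comparison, sending every task from host 0, as the naive algorithm would, gives a maximum load of 8.0 on the same input.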