Literature Reading (247): Alpa

tiaozhanzhe1900

Last modified: 2023-11-08 16:12:13

First published: 2023-03-26 21:21:51

Article link: https://blog.csdn.net/tiaozhanzhe1900/article/details/129784734

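Alpa is a compiler system developed at UCB that automates inter-operator and intra-operator parallelism for distributed deep learning, improving hardware utilization on GPU clusters. By constructing a two-level space of parallel execution plans and designing optimization algorithms for it, Alpa reduces communication overhead, with specific optimizations for the cross-mesh resharding communication pattern. The article also covers how load balancing and scheduling algorithms can further improve performance and reduce waiting time.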


Title: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
Year: 2022
Venue: OSDI
Institution: UCB

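Conventional DNN parallelization strategies: existing distributed training systems either require users to create a parallelization plan by hand, or generate one automatically from only a limited space of model-parallel configurations.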

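Data parallelism: the model is replicated on every device and the dataset is partitioned across devices.
Operator parallelism: a single op is partitioned along non-batch axes across multiple devices.
Pipeline parallelism: different ops or stages are placed on different devices and executed as a pipeline.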
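The difference between the first two strategies can be illustrated on a single matrix multiplication. The sketch below is a pure-Python toy, not how any real framework implements it: the function names are made up for this example, each loop iteration stands in for one device, and the sizes are assumed to divide evenly by the device count.

```python
def matmul(x, w):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def data_parallel(x, w, n_dev):
    """Split the batch (rows of x) across devices; every device holds all of w."""
    step = len(x) // n_dev
    out = []
    for i in range(n_dev):                      # one iteration = one device
        out.extend(matmul(x[i * step:(i + 1) * step], w))
    return out

def operator_parallel(x, w, n_dev):
    """Split w along its output (non-batch) axis; every device sees all of x."""
    step = len(w[0]) // n_dev
    parts = [matmul(x, [row[i * step:(i + 1) * step] for row in w])
             for i in range(n_dev)]
    # Concatenating the column blocks plays the role of an all-gather.
    return [sum((p[r] for p in parts), []) for r in range(len(x))]

x = [[1, 2], [3, 4], [5, 6], [7, 8]]            # batch of 4, feature size 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]                # 2 x 4 weight matrix
reference = matmul(x, w)
dp_result = data_parallel(x, w, 2)
op_result = operator_parallel(x, w, 2)
```

Both partitioned versions reproduce `reference`; they differ only in which tensor is split and therefore in what has to be communicated.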


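Main contributions of this paper: the paper classifies the parallelism used in distributed training into inter-operator parallelism and intra-operator parallelism.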

We construct a two-level parallel execution plan space where plans are specified hierarchically using inter- and intra-operator parallelisms.
We design tractable optimization algorithms to derive near-optimal execution plans at each level.
We implement Alpa, a compiler system for distributed DL on GPU clusters. Alpa features:
(1) a set of compilation passes that generate execution plans using the hierarchical optimization algorithms,
(2) a new runtime architecture that orchestrates the inter-op parallelism between device meshes, and
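(3) a number of system optimizations that improve compilation and address cross-mesh communication.

intra-operator parallelism: achieves higher hardware utilization, but every training iteration needs communication whenever tensors are split and merged
inter-operator parallelism: needs communication only between adjacent stages, but data dependencies can leave devices idle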


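Figure (cross-mesh resharding): red arrows denote send/recv over slow connections; green arrows denote all-gather over fast connections.
a) The scatter-gather optimization for equal mesh shapes, as in Megatron-LM
b) Send/recv for unequal mesh shapes
c) Local all-gather for unequal mesh shapes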

$G = (V, E)$ : computational graph
$v \in V, e \in E$ : node v and edge e in the graph
$k_v$ : the number of possible parallel algorithms for node v
$c_v \in \mathbb{R}^{k_v}$ : communication cost vector of length $k_v$
$d_v \in \mathbb{R}^{k_v}$ : compute cost vector of length $k_v$
$s_v \in \{0, 1\}^{k_v}$ : one-hot decision vector for node v
$R_{vu} \in \mathbb{R}^{k_v \times k_u}$ : resharding cost from the output of node v to the input of node u

$\min_s \sum_{v \in V} s_v^\top(c_v + d_v) + \sum_{(v,u)\in E} s_v^\top R_{vu}s_u$
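To make the objective concrete, here is a minimal brute-force sketch that evaluates it on a toy graph. It is only an illustration: the one-hot vector $s_v$ is represented by the index of the chosen algorithm, the cost numbers are invented, and exhaustive enumeration is exponential in the number of nodes, so a practical system would hand the problem to an ILP solver instead.

```python
from itertools import product

def plan_cost(choice, c, d, edges, R):
    """Objective value for one assignment: node costs plus resharding costs."""
    node = sum(c[v][choice[v]] + d[v][choice[v]] for v in c)
    edge = sum(R[(v, u)][choice[v]][choice[u]] for v, u in edges)
    return node + edge

def best_plan(c, d, edges, R):
    """Enumerate every combination of per-node algorithms (toy sizes only)."""
    nodes = sorted(c)
    best = None
    for combo in product(*(range(len(c[v])) for v in nodes)):
        choice = dict(zip(nodes, combo))
        cost = plan_cost(choice, c, d, edges, R)
        if best is None or cost < best[0]:
            best = (cost, choice)
    return best

# Hypothetical graph: node "a" feeds node "b", two candidate algorithms each.
c = {"a": [0.0, 4.0], "b": [0.0, 4.0]}        # communication cost c_v
d = {"a": [10.0, 5.0], "b": [10.0, 5.0]}      # compute cost d_v
edges = [("a", "b")]
R = {("a", "b"): [[0.0, 3.0], [3.0, 0.0]]}    # resharding cost R_vu[i][j]
cost, choice = best_plan(c, d, edges, R)      # cost 18.0 with algorithm 1 on both nodes
```

In this toy instance, mixing the two algorithms costs 22 because of the resharding term, while using algorithm 1 on both nodes costs 18, which shows why the edge term has to be optimized jointly with the node terms.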

$o_1, o_2, ..., o_K$ : a sequence of operators
$s_1, s_2, ..., s_S$ : the stages; each stage $s_i$ consists of operators ( $o_{l_i}, ...,o_{r_i}$ )
$n_i \times m_i$ : the shape of the submesh, sliced from the compute cluster, that stage $s_i$ is assigned to
$N \times M$ : the shape of the cluster mesh
$\sum^S_{i=1} n_i * m_i = N * M$ : the submeshes together cover the whole cluster
$t_i = t_{intra}(s_i, Mesh(n_i, m_i))$ : the latency of executing stage $s_i$
$B$ : the number of microbatches in the pipeline

$T^* = \min_{s_1, ..., s_S;(n_1, m_1), ...(n_S, m_S)}\{ \sum^S_{i=1} t_i + (B-1) * \max_{1\leq j \leq S} {t_j} \}$
Writing $t_{total} = \sum^S_{i=1} t_i$ and $t_{max} = \max_{1\leq j \leq S} t_j$, this becomes:
$T^* = \min_{s_1, ..., s_S;(n_1, m_1), ...(n_S, m_S)}\{ t_{total} + (B-1) * t_{max} \}$
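For a fixed assignment of stages, the quantity inside the braces is easy to evaluate. The small sketch below (function name and latencies are illustrative, not from the paper) shows why the slowest stage dominates once there are many microbatches.

```python
def pipeline_latency(stage_latencies, num_microbatches):
    """sum_i t_i + (B - 1) * max_j t_j for one fixed stage assignment."""
    t_total = sum(stage_latencies)
    t_max = max(stage_latencies)
    return t_total + (num_microbatches - 1) * t_max

# Same total work (8 time units), different balance across four stages.
balanced = pipeline_latency([2.0, 2.0, 2.0, 2.0], num_microbatches=8)    # 8 + 7 * 2 = 22.0
unbalanced = pipeline_latency([1.0, 1.0, 1.0, 5.0], num_microbatches=8)  # 8 + 7 * 5 = 43.0
```

Both pipelines have the same $t_{total}$, but the unbalanced one is almost twice as slow because every additional microbatch waits on the slowest stage.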

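$F(s, k, d; t_{max})$ : the minimum total latency when slicing operators $o_k$ to $o_K$ into s stages and putting them onto d devices, such that the latency of every stage is at most $t_{max}$
Each stage must also fit in device memory: $mem_{stage} + s * mem_{act} \leq mem_{device}$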

$T^* = \min_{s}\{F(s, 0, N*M; t_{max}) + (B-1) * t_{max} \}$
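The dynamic program can be sketched in a simplified setting. The code below is an assumption-laden toy rather than Alpa's implementation: devices are a flat count instead of a 2D mesh, the memory constraint is omitted, operators are indexed from 0, and `stage_latency` is a made-up stand-in for $t_{intra}$ (perfect compute scaling plus a fixed communication penalty per extra device). It enumerates candidate values of $t_{max}$ and, for each one, evaluates $F$ by memoized recursion.

```python
from functools import lru_cache
import math

def stage_latency(op_costs, lo, hi, n_dev):
    """Hypothetical t_intra for operators lo..hi-1 on n_dev devices."""
    return sum(op_costs[lo:hi]) / n_dev + 1.0 * (n_dev - 1)

def best_pipeline(op_costs, total_devices, num_microbatches):
    K = len(op_costs)
    # Candidate t_max values: every latency a single stage could have.
    candidates = sorted({stage_latency(op_costs, lo, hi, n)
                         for lo in range(K) for hi in range(lo + 1, K + 1)
                         for n in range(1, total_devices + 1)})
    best = math.inf
    for t_max in candidates:
        @lru_cache(maxsize=None)
        def F(s, k, d):
            """Min total latency for ops k..K-1 in s stages on exactly d devices."""
            if s == 0:
                return 0.0 if (k == K and d == 0) else math.inf
            result = math.inf
            for i in range(k + 1, K + 1):        # first stage covers ops k..i-1
                for n in range(1, d + 1):        # devices given to that stage
                    t = stage_latency(op_costs, k, i, n)
                    if t <= t_max:               # stage must respect the cap
                        result = min(result, t + F(s - 1, i, d - n))
            return result
        for s in range(1, K + 1):
            best = min(best, F(s, 0, total_devices) + (num_microbatches - 1) * t_max)
    return best

many_microbatches = best_pipeline([4.0, 4.0], total_devices=2, num_microbatches=8)  # 36.0
single_microbatch = best_pipeline([4.0, 4.0], total_devices=2, num_microbatches=1)  # 5.0
```

With 8 microbatches the best plan under this toy cost model is two single-device stages (8 + 7 * 4 = 36), while with a single microbatch it is one stage spread over both devices (latency 5), so the preferred mix of inter- and intra-operator parallelism depends on B.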

Title: On Optimizing the Communication of Model Parallelism
Year: 2022
Venue: MLSys
Institution: UCB

Neither intra-op parallelism nor inter-op parallelism alone suffices to train large models. In practice, they must be combined to support large models like GPT-3.
This combined strategy is implemented in many model-parallel systems by first partitioning the computational graph using inter-op parallelism then sharding each stage using intra-op parallelism.
Specifically, the graph is first partitioned into multiple stages. Each stage is assigned to a group of devices, referred to as a device mesh, sliced from the cluster.
Operators and tensors of a stage are parallelized over that stage’s assigned mesh following a chosen intra-op parallelism plan; collective communication happens only across devices within each mesh.
At the boundary of any two adjacent stages, communication is required to exchange tensors between their meshes.
Unlike inter-op parallelism, the tensor might have been sharded with different layouts on the source and destination meshes – in which case, communication involves not only transferring the tensor, but also performing tensor layout conversion between the source and destination groups of devices.
We call this communication pattern cross-mesh resharding, which is the focus of this paper.

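A general cross-mesh resharding problem can be decomposed into multiple unit communication tasks, each responsible for sending one data slice. The paper formulates the original problem as a two-level optimization problem:

Optimizing a single unit communication task, for which broadcast is used to achieve the best performance.
Load balancing and scheduling of the multiple unit communication tasks within one cross-mesh resharding.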

Cluster setting: nodes are connected in a fully connected topology, with independent send and receive bandwidth (full duplex).

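Each cross-mesh resharding consists of multiple unit communication tasks. These tasks may overlap on sender and receiver devices and affect each other's performance. To optimize the total completion time of a cross-mesh resharding, the problem is therefore treated as a load-balancing and scheduling problem, which requires:
(1) balancing load by distributing the communication workload evenly across sender devices and across host communication links, to avoid congestion and stragglers;
(2) scheduling the order of the different tasks assigned to one particular device, to minimize the waiting caused by unavailable senders or receivers.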

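Naive algorithm: assign each task to the first (lowest-index) device on the sender's host, and run all tasks in an arbitrary global order.
Load-balance-only algorithm: assign each task to the sender host that currently has the lightest workload.
Depth-first search with pruning: for each host, assign an execution order to all send/recv tasks on that host; this search has exponential complexity.
Randomized greedy search: put all tasks still to be scheduled in random order, iterate over them, and select each task that does not overlap with the tasks already selected. This process is repeated several times, and the largest candidate set is chosen for the current iteration.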
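The randomized greedy search can be sketched as follows. This is one reading of the description above under simplifying assumptions, not the paper's code: every task is taken to have the same duration so that the schedule is a sequence of rounds, each task already has a fixed sender, and the task dictionaries and function names are invented for the example.

```python
import random

def conflicts(a, b):
    """Two tasks cannot run together if they share a sender or any receiver."""
    return a["sender"] == b["sender"] or bool(a["receivers"] & b["receivers"])

def greedy_round(tasks, rng):
    """Visit tasks in random order, keeping each one that fits with those kept so far."""
    order = list(tasks)
    rng.shuffle(order)
    chosen = []
    for t in order:
        if all(not conflicts(t, c) for c in chosen):
            chosen.append(t)
    return chosen

def randomized_greedy_schedule(tasks, trials=20, seed=0):
    """Return a list of rounds; tasks inside one round run concurrently."""
    rng = random.Random(seed)
    remaining = list(tasks)
    rounds = []
    while remaining:
        # Several random trials; keep the largest conflict-free set found.
        best = max((greedy_round(remaining, rng) for _ in range(trials)), key=len)
        rounds.append(best)
        picked = {t["id"] for t in best}
        remaining = [t for t in remaining if t["id"] not in picked]
    return rounds

tasks = [
    {"id": 0, "sender": "A0", "receivers": frozenset({"B0"})},
    {"id": 1, "sender": "A0", "receivers": frozenset({"B1"})},
    {"id": 2, "sender": "A1", "receivers": frozenset({"B1"})},
    {"id": 3, "sender": "A1", "receivers": frozenset({"B0"})},
]
rounds = randomized_greedy_schedule(tasks)
```

For these four tasks only the pairs (0, 2) and (1, 3) can run concurrently, so the schedule takes two rounds, compared with four rounds if the tasks were executed one after another.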

Formal notation

$D$ : N-dimensional tensor
$|Mesh_A| = m_1 \times m_2$ : mesh of compute devices
$X_i \in \{S, R\}, 0 \leq i \leq N-1$ : the i-th dimension of D is sharded (S) or replicated (R)
$X_i^{d_i} = S^0$ : the i-th dimension of D is sharded along the first dimension of $Mesh_A$
$X_i^{d_i} = S^{01}$ : the i-th dimension of D is sharded along both dimensions of $Mesh_A$
$\mathcal{X} = X_0^{d_0}X_1^{d_1}...X_{N-1}^{d_{N-1}}, d_i \in \{ 0, 1, 01 \}$ : the sharding spec, an N-element string
$DS_i$ : a data slice of D
$T_i$ : the duration of the i-th task
$S_i$ : the execution start time of the i-th task
$n_i \subseteq Mesh_A$ : the set of hosts that can send the data slice
$n_{i*} \in n_i$ : the host chosen to send the data
$m_i \subseteq Mesh_B$ : the set of hosts that receive the data slice
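As a concrete reading of the sharding-spec notation, the sketch below computes the shape of the slice each device holds. The string encoding ("R", "S0", "S1", "S01") and the function name are conventions chosen for this example, and it assumes every sharded dimension divides evenly.

```python
def shard_shape(tensor_shape, mesh_shape, spec):
    """Per-device slice shape for a sharding spec such as ["S0", "R"].

    "R" replicates a tensor dimension; "S0", "S1" and "S01" shard it along
    mesh dimension 0, mesh dimension 1, or both.
    """
    out = []
    for size, x in zip(tensor_shape, spec):
        parts = 1
        if x != "R":
            for axis in x[1:]:                  # "S01" -> mesh axes "0" and "1"
                parts *= mesh_shape[int(axis)]
        if size % parts:
            raise ValueError("dimension is not evenly divisible")
        out.append(size // parts)
    return tuple(out)

mesh = (2, 4)                                   # m_1 x m_2 devices
row_sharded = shard_shape((8, 12), mesh, ["S0", "R"])    # (4, 12)
fully_sharded = shard_shape((8, 12), mesh, ["S01", "R"]) # (1, 12)
two_axis = shard_shape((8, 12), mesh, ["S0", "S1"])      # (4, 3)
```

When the source and destination meshes use different specs for the same tensor, the slices do not line up one to one, which is what makes the layout conversion in cross-mesh resharding necessary.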

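Scheduling problem:
$\min_{S, n_*} \max_i (S_i + T_i)$
subject to $n_{i*} \in n_i$ for every task i, and
$(S_i, S_i + T_i) \cap (S_j, S_j+T_j) = \emptyset$ for all $i \neq j$ with $n_{i*} = n_{j*}$ or $m_i \cap m_j \neq \emptyset$

Load-balancing objective alone (ignoring task ordering):
$\min_{n_*}\max_{k \in Mesh_A} \sum_{i:n_{i*}=k} T_i$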
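A simple greedy heuristic for this min-max objective can be sketched as follows. It is an illustrative approximation, not necessarily the algorithm the paper uses: tasks are processed longest first (an assumption of this sketch), each one goes to the least-loaded host among its candidates $n_i$, and the host names and durations are invented.

```python
def balance_senders(tasks):
    """Assign each task to the least-loaded host among its candidate senders.

    tasks: list of (duration T_i, set of candidate sender hosts n_i).
    Returns (list of chosen hosts n_i*, maximum load over hosts).
    """
    load = {}
    assignment = [None] * len(tasks)
    # Longest tasks first, a common heuristic for makespan-style objectives.
    for idx in sorted(range(len(tasks)), key=lambda i: -tasks[i][0]):
        duration, candidates = tasks[idx]
        # sorted() makes tie-breaking deterministic.
        host = min(sorted(candidates), key=lambda h: load.get(h, 0.0))
        load[host] = load.get(host, 0.0) + duration
        assignment[idx] = host
    return assignment, max(load.values())

tasks = [
    (3.0, {"h0", "h1"}),
    (2.0, {"h0", "h1"}),
    (2.0, {"h0", "h1"}),
    (1.0, {"h0"}),
]
assignment, max_load = balance_senders(tasks)   # max_load == 4.0
```

Always picking the lowest-index candidate host would put all four tasks on `h0` for a load of 8, whereas the greedy assignment splits them 4 and 4.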