14. Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines

Reposted from https://www.cnblogs.com/qdsjddm/p/16495243.html
The source blog contains many related articles that are worth reading.

Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines(2022)

Motivation
Pipeline parallelism suffers from bubbles (GPipe) or weight staleness (PipeDream).

In DAPPLE, PipeDream, and PipeDream-2BW, the first accelerator of a pipeline of depth 𝐷 has to store 𝐷 such activations, while the last accelerator only requires memory for one. (In practice activation recomputation is usually applied, but it also reduces efficiency by about 33%.)

Contributions
Chimera: fully-packed bidirectional pipelines

keeps the overall training synchronous, without relying on stale weights

achieves higher pipeline utilization (fewer bubbles) than existing approaches, and thus higher performance

has the same peak activation memory consumption as the state-of-the-art methods, with the extra benefit of more balanced memory consumption

is easily configurable for various pipelined deep neural networks as well as system architectures, guided by an accurate performance model

BACKGROUND AND RELATED WORK

Bubbles in the pipeline
GEMS is mainly designed for small mini-batch sizes B̂ and has at most two active micro-batches.
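
As a rough illustration (a minimal sketch, not the paper's exact accounting: the counts 2(𝐷−1) for GPipe/DAPPLE-style schedules and roughly 𝐷−2 for Chimera follow the paper's analysis as I recall it, and forward/backward slots are assumed to cost the same):

```python
# Rough bubble accounting per worker per training iteration (a sketch).
# D: pipeline depth, N: number of micro-batches per worker per iteration.

def bubble_fraction(num_bubbles: int, N: int) -> float:
    work_slots = 2 * N  # N forward + N backward slots per worker
    return num_bubbles / (num_bubbles + work_slots)

D, N = 4, 4
print(bubble_fraction(2 * (D - 1), N))  # GPipe/DAPPLE-style schedule: ~0.43
print(bubble_fraction(D - 2, N))        # Chimera (bidirectional): ~0.20
```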

Memory consumption
Weight parameters: this depends on how many stages each GPU computes. In GPipe and DAPPLE each GPU computes a single stage and stores the parameters of one stage; in GEMS and Chimera each GPU computes two stages and must store the parameters of two stages.
Activations: in DAPPLE, PipeDream, PipeDream-2BW, and Chimera, at most 𝐷 micro-batches have live activations at any point in time, which bounds the activation memory.
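
A minimal sketch of this accounting (the sizes and the Adam-style optimizer-state factor are assumptions for illustration, not measurements from the paper):

```python
# Per-worker memory estimate (illustrative sketch; all sizes are assumptions).
def worker_memory_bytes(stage_param_bytes: int,
                        stages_per_worker: int,              # 1 for GPipe/DAPPLE, 2 for GEMS/Chimera
                        activation_bytes_per_micro_batch: int,
                        max_inflight_micro_batches: int) -> int:  # at most D for DAPPLE/PipeDream/Chimera
    weights = stages_per_worker * stage_param_bytes
    optimizer_state = 2 * weights                            # e.g. Adam moments (assumption)
    activations = max_inflight_micro_batches * activation_bytes_per_micro_batch
    return weights + optimizer_state + activations
```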

Convergence friendliness
Although the asynchronous approaches empirically show promising convergence results, their generality lacks proof. More recent work [4, 34, 36, 37, 52] observes that asynchronous training algorithms may result in lower convergence performance.

THE SCHEME OF CHIMERA
Bidirectional Pipelines
See Figure 2.
Communication Scheme
Chimera uses p2p (point-to-point) communication to transfer the intermediate activations and gradients (with respect to the inputs) between pipeline stages in the forward pass and the backward pass, respectively. Since Chimera combines bidirectional pipelines together, collective communication (i.e., allreduce) is used to synchronize the weight gradients across stage replicas before the next training iteration.

Taking P0 and P3 in Figure 4(b) as an example: after these two workers finish the backward passes on micro-batch 3 and micro-batch 1, respectively, the weight gradients of stage 3 are complete; therefore, P0 and P3 can launch an asynchronous allreduce using nonblocking collectives [23, 25] to synchronize the gradients of stage 3 as soon as they are ready, and a wait operation is called after all the local computation to make sure the allreduce has finished. In this way, the gradient synchronization for stage 3 is overlapped with the bubbles and the following computation.
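
A minimal sketch of this communication pattern with torch.distributed (process-group setup, stage modules, and tensor shapes are assumed; this is not the paper's implementation):

```python
import torch.distributed as dist

# p2p transfer of intermediate activations (forward) and input gradients (backward).
def send_activation(activation, next_rank):
    return dist.isend(activation, dst=next_rank)

def send_input_grad(input_grad, prev_rank):
    return dist.isend(input_grad, dst=prev_rank)

# As soon as the weight gradients of a stage are complete, launch a nonblocking
# allreduce across its replicas so that the communication overlaps with bubbles
# and the remaining computation of the schedule.
def sync_stage_gradients(stage_params, replica_group):
    handles = []
    for p in stage_params:
        handles.append(dist.all_reduce(p.grad, op=dist.ReduceOp.SUM,
                                       group=replica_group, async_op=True))
    return handles

# ... remaining local computation of the schedule runs here ...
# Before the optimizer step, wait for all outstanding allreduces:
#   for h in handles: h.wait()
```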


Hybrid of Pipeline and Data Parallelism
Chimera supports a hybrid of pipeline and data parallelism. When scaling to parallel machines equipped with high-performance interconnects (such as InfiniBand [50], Cray Aries [2] or Slingshot [48], and NVLink [18]), hybrid parallelism usually achieves better performance than pure pipeline parallelism [16, 39]. This is because pure pipeline parallelism has 𝑊 · 𝐷 stages in the pipeline, while hybrid parallelism has 𝐷 stages (𝑊 times fewer), which helps to reduce the p2p communication overhead between stages and increases the computation workload of each stage (it should also reduce the bubble size). Although hybrid parallelism introduces gradient synchronization between stage replicas, this overhead can be alleviated by the aforementioned high-performance interconnects. However, as 𝑊 increases (and 𝐷 decreases), pipeline stages become coarser, until at some point the increased gradient synchronization overhead can no longer be amortized by the reduced p2p communication overhead. Therefore, it is important to find the sweet spot to achieve the best performance.

Configuration Selection Based on Performance Modelling (given the mini-batch size B̂ and the number of workers 𝑃, the configuration of 𝐵, 𝑊, and 𝐷 largely affects the training throughput)
A larger micro-batch size 𝐵 usually improves the computational efficiency of the accelerators. Since Chimera greatly alleviates the bubble problem, it greedily chooses the maximum micro-batch size that fits in the device memory. (Increasing 𝐵 reduces 𝑁 and thus increases the bubble fraction, but since Chimera has already minimized the bubbles, this is not a concern.)

To select the best configuration of 𝑊 and 𝐷, we build a performance model to predict the runtime of a single training iteration (represented by 𝑇) for each available configuration.
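
A sketch of how such a model could drive the selection (the cost terms below are illustrative placeholders, not the calibrated performance model from the paper):

```python
# Configuration selection sketch; all cost terms are illustrative placeholders.
def predict_iteration_time(W, D, B, N, stage_fwd_time, stage_bwd_time,
                           p2p_time, allreduce_time):
    compute = N * (stage_fwd_time(D, B) + stage_bwd_time(D, B))
    bubbles = (D - 2) * max(stage_fwd_time(D, B), stage_bwd_time(D, B))  # assumption
    p2p = N * p2p_time(D, B)            # inter-stage activation/gradient transfers
    grad_sync = allreduce_time(W, D)    # gradient sync across the W stage replicas
    return compute + bubbles + p2p + grad_sync

def best_config(P, B_hat, max_B, costs):
    # Enumerate all (W, D) with W * D == P and pick the fastest predicted config.
    best = None
    for D in (d for d in range(2, P + 1) if P % d == 0):
        W = P // D
        B = min(max_B(W, D), B_hat // W)    # largest micro-batch size that fits
        N = max(1, B_hat // (B * W))        # micro-batches per worker per iteration
        T = predict_iteration_time(W, D, B, N, **costs)
        if best is None or T < best[1]:
            best = ((W, D, B), T)
    return best[0]
```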

Scale to More Micro-Batches
For a large B̂, there may be more than 𝐷 micro-batches in a training iteration for each worker (i.e., 𝑁 > 𝐷), especially when the compute resources are limited. To scale to a large B̂, we first choose the maximum 𝐵 with 𝐷 micro-batches to saturate the device memory, and schedule these 𝐷 micro-batches using bidirectional pipelines as discussed previously. (B̂ = 𝑁·𝐵·𝑊: when compute resources are limited, 𝑊 is fixed and 𝐵 is bounded by memory, so 𝑁 must grow. Note that increasing 𝑁 does not increase the activation memory, because at most 𝐷 micro-batches have live activations at any time.)
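
The relation B̂ = 𝑁·𝐵·𝑊 can be made concrete with a small sketch (the numbers below are illustrative only):

```python
# B_hat = N * B * W: mini-batch size = micro-batches per worker per iteration
# * micro-batch size * number of replicated pipelines. For N > D, the schedule
# is built from ceil(N / D) basic units of D micro-batches each.
def schedule_shape(B_hat, B, W, D):
    N = B_hat // (B * W)             # micro-batches per worker per iteration
    num_basic_units = -(-N // D)     # ceil(N / D)
    return N, num_basic_units

print(schedule_shape(B_hat=512, B=4, W=8, D=4))   # -> (16, 4)
```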

Direct concatenation: the bubbles at the end of the first basic unit can be occupied by the forward passes at the beginning of the second basic unit. If the backward pass had the same workload as the forward pass, basic units could be concatenated seamlessly. However, the backward pass has about twice the workload of the forward pass, which results in intermediate bubbles.

Forward doubling and backward halving equalize the workloads of the forward and backward passes.

Forward doubling removes the intermediate bubbles, but it doubles the activation memory consumption and may therefore exceed the device memory capacity (activation recomputation can be used to mitigate this).

Forward doubling is preferred for large models in which even 𝐵=1 exceeds the device memory capacity, since in such a case activation recomputation must be used anyway.

For smaller models, which allow a larger 𝐵, we propose backward halving, which uses the same schedule as forward doubling, except that rather than executing two micro-batches in the forward pass, it halves the micro-batch size of the backward pass. Backward halving does not increase the activation memory (and thus needs no activation recomputation), but it may lower the computational efficiency because it uses a smaller-than-maximum 𝐵.
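
A small sketch of the selection rule described above (the decision inputs are assumptions for illustration, not the paper's exact criterion):

```python
# Choose how to fill the intermediate bubbles when concatenating basic units.
def choose_strategy(fits_in_memory_at_B1: bool, max_feasible_B: int) -> str:
    if not fits_in_memory_at_B1:
        # Large model: even B = 1 overflows device memory, so activation
        # recomputation is needed anyway -> forward doubling is preferred.
        return "forward_doubling"
    if max_feasible_B >= 2:
        # Smaller model with a larger feasible B: halve the backward
        # micro-batch size; no extra activation memory, no recomputation.
        return "backward_halving"
    # Otherwise fall back to direct concatenation and accept the bubbles.
    return "direct_concatenation"
```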

EXPERIMENTAL EVALUATION
Parallel Scalability
Performance Optimization Space for the Baselines
We can see that the highest throughput of both DAPPLE and GPipe (with activation recomputation) is achieved by (𝑊=8, 𝐷=4, 𝐵=4), under which they hit the sweet spot for the trade-off between p2p communication overhead and allreduce communication overhead with (𝑊=8, 𝐷=4) (which determines 𝑊), and the sweet spot for the trade-off between bubble ratio and computational efficiency with 𝐵=4 (and 𝑁=16) (which determines 𝐵). GEMS prefers a large 𝐵 for high computational efficiency since a smaller 𝐵 does not help much to reduce the bubble ratio, and therefore its best performance is achieved by (𝑊=8, 𝐷=4, 𝐵=32).

Asynchronous baselines (PipeDream-2BW and PipeDream) always prefer the maximum 𝐵 fitting in the device memory, since they have no bubble problem. Note that PipeDream conducts gradient synchronization across 𝑊 pipelines after each backward pass on a micro-batch, thus its B̂ is limited by the maximum 𝐵. Since the frequent gradient synchronization of PipeDream leads to high allreduce overhead, its best performance is achieved with a deeper pipeline than the others, namely (𝑊=4, 𝐷=8, B̂=48). PipeDream-2BW scales to large B̂ by accumulating the gradients over more than 𝐷 micro-batches (i.e., 𝑁 ≥ 𝐷), and its best performance is achieved by (𝑊=8, 𝐷=4, 𝐵=16) with activation recomputation.
