Reducing Activation Recomputation in Large Transformer Models
Background
Training a large transformer model takes enormous resources, and memory is one of the main constraints. With full activation recomputation, activations are not saved during the forward pass and are instead recomputed during back-propagation. This saves memory but adds extra computation.
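Activation recomputation can be reproduced in a few lines of PyTorch. The sketch below is only a minimal illustration, not the paper's implementation: the Block and CheckpointedStack classes and all sizes are made up for the example. It uses torch.utils.checkpoint.checkpoint so that each block discards its intermediate activations during the forward pass and re-runs its forward to recompute them during the backward pass.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Stand-in transformer block: two linear layers with a GELU in between."""
    def __init__(self, hidden):
        super().__init__()
        self.fc1 = torch.nn.Linear(hidden, 4 * hidden)
        self.fc2 = torch.nn.Linear(4 * hidden, hidden)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

class CheckpointedStack(torch.nn.Module):
    """Run every block under checkpointing: the forward pass does not keep the
    block's intermediate activations; the backward pass re-runs the block's
    forward to recompute them before computing gradients."""
    def __init__(self, num_layers, hidden):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(hidden) for _ in range(num_layers))

    def forward(self, x):
        for block in self.blocks:
            # use_reentrant=False is the mode recommended by recent PyTorch versions
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedStack(num_layers=4, hidden=256)
y = model(torch.randn(8, 256, requires_grad=True))
y.sum().backward()  # each block's forward is recomputed here
```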
Per-rank memory allocation under pipeline parallelism
GPUs used in pipeline parallel model training store the input activations of layers until they are
consumed at the gradient computation during back-propagation. As discussed in Section 4.2.3, the first
pipeline stage stores the most activations, an equivalent of storing activations for all of the transformer
layers in the model.
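To make that claim concrete, the sketch below estimates how many layers' worth of activations each pipeline stage holds under a 1F1B schedule, where the first stage keeps the most microbatches in flight. It is an illustration only: the helper functions, the 0-indexed stage convention, and the 32-layer / 4-stage configuration are assumptions for the example, not values taken from the paper.

```python
def in_flight_microbatches(stage, num_stages):
    """Upper bound on microbatches whose activations a stage holds at once
    under the 1F1B schedule: the first stage holds the most, the last holds one."""
    return num_stages - stage  # stage is 0-indexed

def stored_layer_activations(stage, num_stages, total_layers):
    """Activation memory of one stage, expressed in 'layers worth' of activations."""
    layers_per_stage = total_layers // num_stages
    return in_flight_microbatches(stage, num_stages) * layers_per_stage

L, p = 32, 4  # hypothetical model: 32 transformer layers split over 4 pipeline stages
for s in range(p):
    print(f"stage {s}: ~{stored_layer_activations(s, p, L)} layers of activations")
# stage 0 stores ~32 layers' worth -- as much as the entire model --
# which is why the first pipeline stage is the activation-memory bottleneck.
```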