Python微信订餐小程序课程视频
https://edu.csdn.net/course/detail/36074
Python实战量化交易理财系统
https://edu.csdn.net/course/detail/35475
[源码解析] 模型并行分布式训练 Megatron (4) — 如何设置各种并行
目录* [源码解析] 模型并行分布式训练 Megatron (4) — 如何设置各种并行
+ 0x00 摘要
+ 0x01 前文回顾
+ 0x02 初始化
- 2.1 全局变量
- 2.2 初始化代码
+ 0x03 切分样例
- 3.1 注释
- 3.2 切分情况
- 3.3 切分策略
- 3.4 实验
+ 0x04 起始状态
- 4.1 GPU 状况
- 4.2 符号说明
- 4.3 初始分组
+ 0x05 Tensor model-parallel
- 5.1 分组
- 5.2 使用
+ 0x06 Pipe-parallel
- 6.1 分组
- 6.2 使用
* 6.2.1 上下游rank
* 6.2.2 world size
+ 0x07 Data-parallel
- 7.1 分组
- 7.2 如何使用
+ 0x08 模型组
+ 0x09 如何把模型分到GPU
+ 0xFF 参考
0x00 摘要
NVIDIA Megatron 是一个基于 PyTorch 的分布式训练框架,用来训练超大Transformer语言模型,其通过综合应用了数据并行,Tensor并行和Pipeline并行来复现 GPT3,值得我们深入分析其背后机理。
本系列大概有 5 篇文章,通过论文和源码和大家一起学习研究。本文将看看 Megatron 如何处理设置并行。
本系列其他文章为:
[源码解析] 模型并行分布式训练Megatron (1) — 论文 & 基础
[源码解析] 模型并行分布式训练Megatron (2) — 整体架构
[源码解析] 模型并行分布式训练 Megatron (3) —模型并行实现
0x01 前文回顾
前文我们对模型并行的原理和代码进行了分析,对于给定的模型,现在还需要解决几个问题:
- 如何把模型切分给节点,比如哪个节点负责哪些层。
- 数据并行,模型并行,流水线并行这几种并行之中,每个节点分别属于哪个部分?
- 如何避免流水线带来的问题。
我们接下来就仔细分析一下。
0x02 初始化
initialize_model_parallel 方法用来设置模型并行,所以我们接下来就具体分析。
2.1 全局变量
因为前文_initialize_distributed之中调用了torch.distributed.init_process_group 初始化分布式环境,所以我们知道,每个进程都有自己的 gloabl rank 和 local rank,都有自己的全局变量。
主要变量如下(具体例子可以结合 initialize_model_parallel 之中的注释来看):
- _TENSOR_MODEL_PARALLEL_GROUP :当前 rank 所属于的Intra-layer model parallel group,就是tensor 并行进程组。
- 假如每一层分为两个tensor,则 _TENSOR_MODEL_PARALLEL_GROUP 例子为:[g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]。
- _PIPELINE_MODEL_PARALLEL_GROUP :当前 rank 所属于的Intra-layer model parallel group,就是流水线进程组。
- 假如流水线深度为4,则例子为 [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]。
- _MODEL_PARALLEL_GROUP :当前 rank 所属于的模型并行进程组,包括了以上两组。
- 针对我们例子,就是完整模型被复制了两份,其 GPU 节点具体是[0, 1, 4, 5, 8, 9, 12, 13],[2, 3, 6, 7, 10, 11, 14, 15]
- _EMBEDDING_GROUP : 嵌入对应的进程组。
- _DATA_PARALLEL_GROUP :当前 rank 所属于的Data parallel group。
- 假如数据并行度数为2,则例子为[g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]。
具体如下:
# Intra-layer model parallel group that the current rank belongs to.
_TENSOR_MODEL_PARALLEL_GROUP = None
# Inter-layer model parallel group that the current rank belongs to.
_PIPELINE_MODEL_PARALLEL_GROUP = None
# Model parallel group (both intra- and pipeline) that the current rank belongs to.
_MODEL_PARALLEL_GROUP = None
# Embedding group.
_EMBEDDING_GROUP = None
# Data parallel group that the current rank belongs to.
_DATA_PARALLEL_GROUP = None
_VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK = None
_VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = None
_PIPELINE_MODEL_PARALLEL_SPLIT_RANK = None
# These values enable us to change the mpu sizes on the fly.
_MPU_TENSOR_MODEL_PARALLEL_WORLD_SIZE = None
_MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = None
_MPU_TENSOR_MODEL_PARALLEL_RANK = None
_MPU_PIPELINE_MODEL_PARALLEL_RANK = None
# A list of ranks that have a copy of the embedding.
_EMBEDDING_GLOBAL_RANKS = None
# A list of global ranks for each pipeline group to ease calculation of the source
# rank when broadcasting from the first or last pipeline stage.
_PIPELINE_GLOBAL_RANKS = None
2.2 初始化代码
我们首先把 initialize_model_parallel 代码摘录出来。initialize_model_parallel 作用就是对模型进行分组,然后初始化进程组相关的各种全局变量。
def initialize\_model\_parallel(tensor\_model\_parallel\_size\_=1,
pipeline\_model\_parallel\_size\_=1,
virtual\_pipeline\_model\_parallel\_size\_=None,
pipeline\_model\_parallel\_split\_rank\_=None):
"""
Initialize model data parallel groups.
Arguments:
tensor\_model\_parallel\_size: number of GPUs used for tensor model parallelism.
pipeline\_model\_parallel\_size: number of GPUs used for pipeline model parallelism.
virtual\_pipeline\_model\_parallel\_size: number of virtual stages (interleaved
pipeline).
pipeline\_model\_parallel\_split\_rank: for models with both encoder and decoder,
rank in pipeline with split point.
Let's say we have a total of 16 GPUs denoted by g0 ... g15 and we
use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize
the model pipeline. The present function will
create 8 tensor model-parallel groups, 4 pipeline model-parallel groups
and 8 data-parallel groups as:
8 data\_parallel groups:
[g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]
8 tensor model-parallel groups:
[g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]
4 pipeline model-parallel groups:
[g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]
Note that for efficiency, the caller should make sure adjacent ranks
are on the same DGX box. For example if we are using 2 DGX-1 boxes
with a total of 16 GPUs, rank 0 to 7 belong to the first box and
ranks 8 to 15 belong to the second box.
"""
if torch.distributed.get_rank() == 0:
print('> initializing tensor model parallel with size {}'.format(
tensor_model_parallel_size_))
print('> initializing pipeline model parallel with size {}'.format(
pipeline_model_parallel_size_))
# Get world size and rank. Ensure some consistencies.
world_size = torch.distributed.get_world_size()
tensor_model_parallel_size = min(tensor_model_parallel_size_, world_size)
pipeline_model_parallel_size = min(pipeline_model_parallel_size_, world_size)
ensure_divisibility(world_size,
tensor_model_parallel_size * pipeline_model_parallel_size)
data_parallel_size = world_size // (tensor_model_parallel_size *
pipeline_model_parallel_size)
num_tensor_model_parallel_groups = world_size // tensor_model_parallel_size
num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size
num_data_parallel_groups = world_size // data_parallel_size
if virtual_pipeline_model_parallel_size_ is not None:
global _VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK
global _VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
_VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK = 0
_VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = virtual_pipeline_model_parallel_size_
if pipeline_model_parallel_split_rank_ is not None:
global _PIPELINE_MODEL_PARALLEL_SPLIT_RANK