DeepSpeed Code Analysis

1 References

https://blog.csdn.net/Chenql716/article/details/127537718
https://github.com/microsoft/DeepSpeed/issues/1110
https://github.com/microsoft/DeepSpeed/issues/489
https://deepspeed.readthedocs.io/en/stable/pipeline.html

2 Analysis

Launch command

#!/bin/bash
deepspeed  --include localhost:13,14 train.py --deepspeed_config=ds_config.json -p 2 --steps=200
# train.py (excerpt; get_args, train_base and train_pipe are defined elsewhere in the same script)
import os
import torch
import deepspeed

if __name__ == '__main__':
    print("os.getenv('RANK'):",os.getenv('RANK'))
    args = get_args()
    print("args ",args)
    deepspeed.init_distributed(dist_backend=args.backend)
    args.local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(args.local_rank)
    if args.pipeline_parallel_size == 0:
        train_base(args)
    else:
        train_pipe(args)

Output from the two launched ranks:
os.getenv('RANK'): 1
args  Namespace(local_rank=1, steps=200, pipeline_parallel_size=2, backend='nccl', seed=1138, deepspeed=False, deepspeed_config='ds_config.json', deepscale=False, deepscale_config=None)
[2024-07-26 06:39:50,272] [INFO] [comm.py:637:init_distributed] cdb=None
os.getenv('RANK'): 0
args  Namespace(local_rank=0, steps=200, pipeline_parallel_size=2, backend='nccl', seed=1138, deepspeed=False, deepspeed_config='ds_config.json', deepscale=False, deepscale_config=None)
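
With --include localhost:13,14 the deepspeed launcher spawns one worker per listed GPU, so two processes are created and the launcher exports RANK, LOCAL_RANK and the other torch.distributed environment variables into each of them; that is what the two blocks above show. The -p 2 flag ends up in args.pipeline_parallel_size=2 (visible in the printed Namespace), so both ranks take the train_pipe branch.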
deepspeed/__init__.py

def _add_core_arguments(parser):
    r"""Helper (internal) function to update an argument parser with an argument group of the core DeepSpeed arguments.
        The core set of DeepSpeed arguments include the following:
        1) --deepspeed: boolean flag to enable DeepSpeed
        2) --deepspeed_config <json file path>: path of a json configuration file to configure DeepSpeed runtime.

        This is a helper function to the public add_config_arguments()

    Arguments:
        parser: argument parser
    Return:
        parser: Updated Parser
    """
    group = parser.add_argument_group('DeepSpeed', 'DeepSpeed configurations')

    group.add_argument('--deepspeed',
                       default=False,
                       action='store_true',
                       help='Enable DeepSpeed (helper flag for user code, no impact on DeepSpeed backend)')
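
For context: this argument group only exists on a training script's parser if the user code asks for it, via the public deepspeed.add_config_arguments(parser), which calls the helper above (the snippet is truncated; per the docstring it also adds --deepspeed_config). A minimal sketch of a get_args in the spirit of this example (flag names and defaults here are illustrative; the real train.py may differ):

# sketch: building the example's argument parser with the DeepSpeed core flags
import argparse
import deepspeed

def get_args():
    parser = argparse.ArgumentParser(description='AlexNet pipeline-parallel training')
    parser.add_argument('--local_rank', type=int, default=-1,
                        help='set by the deepspeed launcher')
    parser.add_argument('-p', '--pipeline-parallel-size', type=int, default=2)
    parser.add_argument('--steps', type=int, default=100)
    parser.add_argument('--backend', type=str, default='nccl')
    parser.add_argument('--seed', type=int, default=1138)
    # adds --deepspeed and --deepspeed_config to the parser
    parser = deepspeed.add_config_arguments(parser)
    return parser.parse_args()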
#train.py
	net = AlexNet(num_classes=10)  # number of output classes
	net = PipelineModule(layers=join_layers(net),
			 loss_fn=torch.nn.CrossEntropyLoss(),
			 num_stages=args.pipeline_parallel_size,
			 partition_method=part,
			 activation_checkpoint_interval=0)
The constructed PipelineModule as printed on stage 0 (output truncated):
self = PipelineModule(
  (loss_fn): CrossEntropyLoss()
  (tied_modules): ModuleDict()
  (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (1): ReLU(inplace=True)
  (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (4): ReLU(inplace=True)
  (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (7): ReLU(inplace=True)
  (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (9): ReLU(inplace=True)
  (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace=True)
  (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (13): AdaptiveAvgPool2d(output_size=(6, 6))
  (15): Dropout(p=0.5, inplace=False)
)
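
Index (14) is missing from the printout because that entry is a plain Python lambda rather than an nn.Module, so _build (shown further below) keeps it in forward_funcs but never registers it with add_module. A rough sketch of what join_layers does, assuming the torchvision-style AlexNet with .features / .avgpool / .classifier attributes (the helper in the actual example may differ slightly):

# sketch: flatten AlexNet into the flat list of callables expected by PipelineModule
import torch

def join_layers(vision_model):
    return [
        *vision_model.features,          # Conv/ReLU/MaxPool layers (indices 0-12)
        vision_model.avgpool,            # AdaptiveAvgPool2d (index 13)
        lambda x: torch.flatten(x, 1),   # functional layer (index 14), not an nn.Module
        *vision_model.classifier,        # Dropout/Linear/ReLU head (indices 15-21)
    ]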
class PipelineModule(nn.Module):
    """Modules to be parallelized with pipeline parallelism.

    The key constraint that enables pipeline parallelism is the
    representation of the forward pass as a sequence of layers
    and the enforcement of a simple interface between them. The
    forward pass is implicitly defined by the module ``layers``. The key
    assumption is that the output of each layer can be directly fed as
    input to the next, like a ``torch.nn.Sequence``. The forward pass is
    implicitly:

    .. code-block:: python

        def forward(self, inputs):
            x = inputs
            for layer in self.layers:
                x = layer(x)
            return x

    .. note::
        Pipeline parallelism is not compatible with ZeRO-2 and ZeRO-3.

    Args:
        layers (Iterable): A sequence of layers defining pipeline structure. Can be a ``torch.nn.Sequential`` module.
        num_stages (int, optional): The degree of pipeline parallelism. If not specified, ``topology`` must be provided.
        topology (``deepspeed.runtime.pipe.ProcessTopology``, optional): Defines the axes of parallelism axes for training. Must be provided if ``num_stages`` is ``None``.
        loss_fn (callable, optional): Loss is computed ``loss = loss_fn(outputs, label)``
        seed_layers(bool, optional): Use a different seed for each layer. Defaults to False.
        seed_fn(type, optional): The custom seed generating function. Defaults to random seed generator.
        base_seed (int, optional): The starting seed. Defaults to 1234.
        partition_method (str, optional): The method upon which the layers are partitioned. Defaults to 'parameters'.
        activation_checkpoint_interval (int, optional): The granularity activation checkpointing in terms of number of layers. 0 disables activation checkpointing.
        activation_checkpoint_func (callable, optional): The function to use for activation checkpointing. Defaults to ``deepspeed.checkpointing.checkpoint``.
        checkpointable_layers(list, optional): Checkpointable layers may not be checkpointed. Defaults to None which does not additional filtering.
    """

    def __init__(self,
                 layers,
                 num_stages=None,
                 topology=None,
                 loss_fn=None,
                 seed_layers=False,
                 seed_fn=None,
                 base_seed=1234,
                 partition_method='parameters',
                 activation_checkpoint_interval=0,
                 activation_checkpoint_func=checkpointing.checkpoint,
                 checkpointable_layers=None):
        super().__init__()
        if num_stages is None and topology is None:
            raise RuntimeError('must provide num_stages or topology')

        self.micro_offset = 0
        self.loss_fn = loss_fn

        self.checkpointable_layers = checkpointable_layers
        if checkpointable_layers is not None:
            assert isinstance(checkpointable_layers, list), "param `checkpointable_layers` must be type of list."

        self.seed_layers = seed_layers
        self.seed_fn = seed_fn
        self.base_seed = base_seed
        if dist.get_rank() == 0:
            try:
                seed_str = self.seed_fn.__name__
            except AttributeError:
                seed_str = None
            print(f'SEED_LAYERS={self.seed_layers} BASE_SEED={self.base_seed} SEED_FN={seed_str}')

        # Setup world info
        self.world_group = dist.new_group(ranks=range(dist.get_world_size()))   # global process group covering all ranks
        self.global_rank = dist.get_rank(group=self.world_group)
        self.world_size = dist.get_world_size(group=self.world_group)  # total number of processes
        self.local_rank = int(os.environ.get("LOCAL_RANK", None))
        assert self.local_rank is not None

        if topology:
            self._topo = topology
            self.num_stages = self._topo.get_dim('pipe')
        else:
            self.num_stages = num_stages
            if topology is None:
                if self.world_size % self.num_stages != 0:
                    raise RuntimeError(
                        f'num_stages ({self.num_stages}) must divide distributed world size ({self.world_size})')
                dp = self.world_size // num_stages  # data-parallel degree
                topology = PipeDataParallelTopology(num_pp=num_stages, num_dp=dp)
                self._topo = topology

        # Construct communicators for pipeline topology
        self._grid = PipelineParallelGrid(process_group=self.world_group, topology=self._topo)

        self.stage_id = self._topo.get_coord(self.global_rank).pipe

        # Initialize partition information
        self._layer_specs = list(layers)
        self._num_layers = len(self._layer_specs)
        self._local_start = 0
        self._local_stop = None
        self._partition_layers(method=partition_method)

        self.forward_funcs = []
        self.fwd_map = {}
        self.tied_modules = nn.ModuleDict()
        self.tied_weight_attrs = {}

        # Offset the random seed by the stage ID.
        #newseed = get_accelerator().initial_seed() + self._grid.get_stage_id()
        #ds_utils.set_random_seed(newseed)

        #with torch.random.fork_rng(devices=[get_accelerator().current_device_name()]):
        self._build()
        self.to(get_accelerator().device_name(self.local_rank))

        self.tied_comms = self._index_tied_modules()
        self._synchronize_tied_weights()

        self.activation_checkpoint_interval = activation_checkpoint_interval

        self.activation_checkpoint_func = activation_checkpoint_func
        # if configuration use_reentrant = False, self.activation_checkpoint_func will be set to ``checkpointing.non_reentrant_checkpoint``
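
For this run world_size = 2 and num_stages = 2, so the else branch computes dp = 1 and builds PipeDataParallelTopology(num_pp=2, num_dp=1); every global rank then maps to a (pipe, data) coordinate, and self.stage_id is simply the pipe coordinate of this rank. A minimal sketch of that mapping, assuming the topology class is importable from deepspeed.runtime.pipe.topology as in current DeepSpeed versions:

# sketch: how the 2 global ranks map onto pipeline coordinates for num_stages=2
from deepspeed.runtime.pipe.topology import PipeDataParallelTopology

topo = PipeDataParallelTopology(num_pp=2, num_dp=1)
print(topo.get_dim('pipe'))        # 2 pipeline stages
for rank in range(2):
    coord = topo.get_coord(rank)   # named tuple with .pipe and .data fields
    print(rank, coord.pipe, coord.data)
# expected: rank 0 -> pipe 0, rank 1 -> pipe 1 (data coordinate 0 for both)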

    def _partition_layers(self, method='uniform'):
        num_stages = self._topo.get_dim('pipe')
        stage_id = self._topo.get_coord(self.global_rank).pipe

        if self.global_rank == 0:
            logger.info(f'Partitioning pipeline stages with method {method}')

        method = method.lower()

        # Each stage gets a simple uniform number of layers.
        if method == 'uniform':
            num_layers = len(self._layer_specs)
            self.parts = ds_utils.partition_uniform(num_items=num_layers, num_parts=num_stages)
        elif method == 'parameters':
            param_counts = self._count_layer_params()
            self.parts = ds_utils.partition_balanced(weights=param_counts, num_parts=num_stages)  # in this run: parts = [0, 19, 22]
        elif method.startswith('type:'):
            layertype = method.split(':')[1]
            binary_weights = [0] * len(self._layer_specs)
            for idx in self._find_layer_type(layertype):
                binary_weights[idx] = 1
            self.parts = ds_utils.partition_balanced(weights=binary_weights, num_parts=num_stages)
        elif method == 'profile':
            raise NotImplementedError(f'Partitioning method {method} not implemented.')
        else:
            raise NotImplementedError(f'Partitioning method {method} not implemented.')

        # Print some information on the partitioning.
        if self.global_rank == 0:
            for stage in range(num_stages):
                start = self.parts[stage]
                stop = self.parts[stage + 1]
                print(f'stage={stage} layers={stop - start}')
                for idx, layer in enumerate(self._layer_specs[start:stop]):
                    name = str(layer)
                    if isinstance(layer, LayerSpec):
                        name = layer.typename.__name__
                    if isinstance(layer, nn.Module):
                        name = layer.__class__.__name__
                    else:
                        try:
                            name = layer.__name__
                        except AttributeError:
                            pass
                    print(f'    {idx+start:2d}: {name}')
            if self.loss_fn:
                try:
                    print(f'  loss: {self.loss_fn.__name__}')
                except AttributeError:
                    print(f'  loss: {self.loss_fn.__class__.__name__}')

        self._set_bounds(start=self.parts[stage_id], stop=self.parts[stage_id + 1])

    def _count_layer_params(self):
        """Count the trainable parameters in individual layers.

        This routine will only build one layer at a time.

        Returns:
            A list of the number of parameters in each layer.
        """
        param_counts = [0] * len(self._layer_specs)
        for idx, layer in enumerate(self._layer_specs):
            if isinstance(layer, LayerSpec):
                l = layer.build()
                params = filter(lambda p: p.requires_grad, l.parameters())
                param_counts[idx] = sum(p.numel() for p in params)
            elif isinstance(layer, nn.Module):
                params = filter(lambda p: p.requires_grad, layer.parameters())
                param_counts[idx] = sum(p.numel() for p in params)
        return param_counts

Run output for this AlexNet example (printed on rank 0):
param_counts: [23296, 0, 0, 307392, 0, 0, 663936, 0, 884992, 0, 590080, 0, 0, 0, 0, 0, 37752832, 0, 0, 16781312, 0, 40970]
stage=0 layers=19
     0: Conv2d
     1: ReLU
     2: MaxPool2d
     3: Conv2d
     4: ReLU
     5: MaxPool2d
     6: Conv2d
     7: ReLU
     8: Conv2d
     9: ReLU
    10: Conv2d
    11: ReLU
    12: MaxPool2d
    13: AdaptiveAvgPool2d
    14: <lambda>
    15: Dropout
    16: Linear
    17: ReLU
    18: Dropout
stage=1 layers=3
    19: Linear
    20: ReLU
    21: Linear
  loss: CrossEntropyLoss
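
The split parts = [0, 19, 22] follows directly from these counts: the two large Linear layers (~37.8M and ~16.8M parameters, indices 16 and 19) dominate, so the most balanced two-way cut keeps layer 16 on stage 0 and pushes the remaining classifier layers to stage 1, giving roughly 40.2M vs 16.8M parameters per stage. A standalone sketch of that balancing decision for num_parts=2 (illustrative only, not the actual ds_utils.partition_balanced implementation):

# sketch: pick the 2-way cut that minimizes the heavier stage, using the
# param_counts printed above
param_counts = [23296, 0, 0, 307392, 0, 0, 663936, 0, 884992, 0, 590080,
                0, 0, 0, 0, 0, 37752832, 0, 0, 16781312, 0, 40970]

total = sum(param_counts)
best_cut, best_max = None, float('inf')
prefix = 0
for cut in range(1, len(param_counts)):
    prefix += param_counts[cut - 1]
    heaviest = max(prefix, total - prefix)
    if heaviest < best_max:
        best_cut, best_max = cut, heaviest

print(best_cut, best_max)  # 17 40222528; cuts 17-19 are equivalent because layers 17 and 18
                           # have no parameters, and DeepSpeed ends up with parts=[0, 19, 22]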

_build(): instantiates this rank's slice of the model and places it on the local device.
Using the per-rank self._local_start and self._local_stop set by _partition_layers, each rank maps only its own layer specs to concrete modules, filling the rank-private self.forward_funcs and self.fwd_map. After that, every GPU holds a different, stage-local part of the model structure, so when the engine is initialized later, the model being passed in is already partitioned.

self._build()
    def _build(self):
        specs = self._layer_specs
        for local_idx, layer in enumerate(specs[self._local_start:self._local_stop]):
            layer_idx = local_idx + self._local_start
            if self.seed_layers:
                if self.seed_fn:
                    self.seed_fn(self.base_seed + layer_idx)
                else:
                    ds_utils.set_random_seed(self.base_seed + layer_idx)
            # Recursively build PipelineModule objects
            if isinstance(layer, PipelineModule):
                raise NotImplementedError('RECURSIVE BUILD NOT YET IMPLEMENTED')
            # LayerSpec objects contain an nn.Module that should be allocated now.
            elif isinstance(layer, nn.Module):
                name = str(layer_idx)
                self.forward_funcs.append(layer)
                self.fwd_map.update({name: len(self.forward_funcs) - 1})
                self.add_module(name, layer)
            # TiedLayerSpec objects contain an nn.Module that should be allocated now.
            elif isinstance(layer, TiedLayerSpec):
                # Build and register the module if we haven't seen it before.
                if layer.key not in self.tied_modules:
                    self.tied_modules[layer.key] = layer.build()
                    self.tied_weight_attrs[layer.key] = layer.tied_weight_attr
                if layer.forward_fn is None:
                    # Just use forward()
                    self.forward_funcs.append(self.tied_modules[layer.key])
                else:
                    # User specified fn with args (module, input)
                    self.forward_funcs.append(partial(layer.forward_fn, self.tied_modules[layer.key]))
            # LayerSpec objects contain an nn.Module that should be allocated now.
            elif isinstance(layer, LayerSpec):
                module = layer.build()
                name = str(layer_idx)
                self.forward_funcs.append(module)
                self.fwd_map.update({name: len(self.forward_funcs) - 1})
                self.add_module(name, module)
            # Last option: layer may be a functional (e.g., lambda). We do nothing in
            # that case and just use it in forward()
            else:
                self.forward_funcs.append(layer)
        # All pipeline parameters should be considered as model parallel in the context
        # of our FP16 optimizer
        for p in self.parameters():
            p.ds_pipe_replicated = False
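
Concretely, after _build each rank has materialized only its own slice: stage 0 registers submodules named '0' through '18' (the lambda at index 14 is appended to forward_funcs but never added to fwd_map or the module tree, since the final else branch neither calls add_module nor updates fwd_map), while stage 1 registers only '19', '20' and '21'. __init__ then moves the result to the local accelerator via self.to(...), so each GPU's memory footprint already reflects its partition before deepspeed.initialize ever wraps the model in an engine.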


stage = 0

step_id | _is_even(step_id) | _is_even(stage) | micro_batch_id | is_forward
   0    | True              | True            |  0             | True
   1    | False             | True            | -1             | False
   2    | True              | True            |  1             | True
   3    | False             | True            |  0             | False
   4    | True              | True            |  2             | True
   5    | False             | True            |  1             | False
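
The parity pattern above comes from TrainSchedule: on an even stage, even step_ids execute forward passes and odd step_ids execute backward passes, and a negative micro_batch_id marks a bubble step with no valid work during pipeline warm-up/cool-down. Below is a self-contained sketch that reproduces the table for stage 0 of a 2-stage pipeline; it illustrates the mapping and is not the verbatim DeepSpeed schedule code:

# sketch: step_id -> (micro_batch_id, is_forward) for an even stage_id
def step_to_micro_batch(step_id, stage_id=0, stages=2):
    if step_id % 2 == 0:   # even step on an even stage: forward pass
        return step_id // 2 - stage_id // 2, True
    else:                  # odd step on an even stage: backward pass
        return (step_id - 1) // 2 - stages + 1 + stage_id // 2, False

for step_id in range(6):
    print(step_id, step_to_micro_batch(step_id))
# 0 (0, True)   1 (-1, False)   2 (1, True)   3 (0, False)   4 (2, True)   5 (1, False)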

3 Key Code Call Flow

engine, _, _, _ = deepspeed.initialize( )              # file: training/pipeline_parallelism/train.py
    # with a PipelineModule, initialize() constructs engine = PipelineEngine( )   file: deepspeed/__init__.py
loss = engine.train_batch()
    def train_batch(self, data_iter=None):             # file: deepspeed/runtime/pipe/engine.py
        sched = schedule.TrainSchedule(micro_batches=self.micro_batches,
                                       stages=self.num_stages,
                                       stage_id=self.stage_id)
                                                        # class TrainSchedule(PipeSchedule)   file: deepspeed/runtime/pipe/schedule.py
        self._exec_schedule(sched)
            def _exec_schedule(self, pipe_schedule):    # file: deepspeed/runtime/pipe/engine.py
                for step_cmds in pipe_schedule:
                    # For each instruction in the step
                    for cmd in step_cmds:
                        if type(cmd) not in self._INSTRUCTION_MAP:
                            raise RuntimeError(f'{self.__class__.__name__} does not understand instruction {repr(cmd)}')

                        # Equivalent to: self._exec_forward_pass(buffer_id=0)
                        self._exec_instr = MethodType(self._INSTRUCTION_MAP[type(cmd)], self)
                        self._exec_instr(**cmd.kwargs)
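
Each element yielded by the schedule is a list of instruction objects (ForwardPass, BackwardPass, SendActivation, RecvActivation, LoadMicroBatch, OptimizerStep, ...), and _exec_schedule looks each one up in _INSTRUCTION_MAP to dispatch to the matching _exec_* method. A schedule can also be inspected on its own, e.g. with this small sketch (import path per the schedule.py file referenced above):

# sketch: print the per-step instructions that stage 0 of a 2-stage pipeline would run
from deepspeed.runtime.pipe import schedule

sched = schedule.TrainSchedule(micro_batches=2, stages=2, stage_id=0)
for step_id, step_cmds in enumerate(sched):
    print(step_id, step_cmds)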
		         

PipelineEngine

def __init__          # file: deepspeed/runtime/pipe/engine.py
	self.pipe_buffers = {
	            'inputs': [],  # batch input and received activations
	            'labels': [],  # labels from batch input
	            'outputs': [],  # activations
	            'output_tensors': [],  # tensor object to preserve backward graph
	        }
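
These four lists are the engine's per-micro-batch staging area: they are later grown to one slot per in-flight micro-batch, and the buffer_id carried by each instruction (cf. the _exec_forward_pass(buffer_id=0) comment above) selects which slot a LoadMicroBatch or RecvActivation fills and which slot the matching ForwardPass/BackwardPass later reads, so activations survive between a micro-batch's forward and backward steps.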