pytorch DDP模式下, 获取数据的的preftech + stream

本文详细介绍了DDPforward在PyTorch中的实现,特别强调了如何利用device_ids进行设备分配,以及scatter和parallel_apply函数在数据并行中的作用,同时提到了stream的使用来优化CPU到GPU的数据传输。
摘要由CSDN通过智能技术生成

直接上代码
- DDP forward

if self.device_ids:
    if len(self.device_ids) == 1:
        inputs, kwargs = self.to_kwargs(inputs, kwargs, self.device_ids[0])
        output = self.module(*inputs[0], **kwargs[0])
    else:
        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
        outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
        output = self.gather(outputs, self.output_device)
else:
    output = self.module(*inputs, **kwargs)
def scatter(self, inputs, kwargs, device_ids):
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
def scatter_kwargs(inputs, kwargs, target_gpus, dim=0):
    r"""Scatter with support for kwargs dictionary"""
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
    kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []
    if len(inputs) < len(kwargs):
        inputs.extend([() for _ in range(len(kwargs) - len(inputs))])
    elif len(kwargs) < len(inputs):
        kwargs.extend([{} for _ in range(len(inputs) - len(kwargs))])
    inputs = tuple(inputs)
    kwargs = tuple(kwargs)
    return inputs, kwargs
def scatter(inputs, target_gpus, dim=0):
    r"""
    Slices tensors into approximately equal chunks and
    distributes them across given GPUs. Duplicates
    references to objects that are not tensors.
    """
    def scatter_map(obj):
        if isinstance(obj, torch.Tensor):
            return Scatter.apply(target_gpus, None, dim, obj)
        if is_namedtuple(obj):
            return [type(obj)(*args) for args in zip(*map(scatter_map, obj))]
        if isinstance(obj, tuple) and len(obj) > 0:
            return list(zip(*map(scatter_map, obj)))
        if isinstance(obj, list) and len(obj) > 0:
            return [list(i) for i in zip(*map(scatter_map, obj))]
        if isinstance(obj, dict) and len(obj) > 0:
            return [type(obj)(i) for i in zip(*map(scatter_map, obj.items()))]
        return [obj for targets in target_gpus]

    # After scatter_map is called, a scatter_map cell will exist. This cell
    # has a reference to the actual function scatter_map, which has references
    # to a closure that has a reference to the scatter_map cell (because the
    # fn is recursive). To avoid this reference cycle, we set the function to
    # None, clearing the cell
    try:
        res = scatter_map(inputs)
    finally:
        scatter_map = None
    return res

torch/nn/parallel/_functions.py,默认tensor to gpu,已经有了stream的加持。

from torch.nn.parallel._functions import Scatter, Gather

class Scatter(Function):

    @staticmethod
    def forward(ctx, target_gpus, chunk_sizes, dim, input):
        target_gpus = [_get_device_index(x, True) for x in target_gpus]
        ctx.dim = dim
        ctx.input_device = input.get_device() if input.device.type != "cpu" else -1
        streams = None
        if torch.cuda.is_available() and ctx.input_device == -1:
            # Perform CPU to GPU copies in a background stream
            streams = [_get_stream(device) for device in target_gpus]
        outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
        # Synchronize with the copy stream
        if streams is not None:
            for i, output in enumerate(outputs):
                with torch.cuda.device(target_gpus[i]):
                    main_stream = torch.cuda.current_stream()
                    main_stream.wait_stream(streams[i])
                    output.record_stream(main_stream)
        return outputs

    @staticmethod
    def backward(ctx, *grad_output):
        return None, None, None, Gather.apply(ctx.input_device, ctx.dim, *grad_output)

结论:目前DDP模式下,已经有了preftech + stream的加持。

PyTorchDDP(Distributed Data Parallel)是一种多机多卡训练方法,它通过提高batch size来增加并行度,从而加快模型训练速度。DDP使用了一种称为Ring-Reduce的数据交换方法,这种方法提高了通信效率,并且通过启动多个进程的方式减轻了Python GIL(全局解释器锁)的限制。因此,DDP通常比DP(Data Parallel)更快,能够实现略低于使用的卡数的加速比(例如,在四卡下可能会加速3倍)。因此,DDP是目前最流行的多机多卡训练方法之一。 在使用DDP时,你只需要在代码中添加一行简单的语句即可使用。具体来说,你需要将你的模型包装在DDP函数中,并指定设备ID(device_ids)和输出设备(output_device)。这样就可以启用DDP,并在多机多卡环境中运行模型训练。 如果你需要了解更多关于PyTorch DDP的详细信息,可以参考一些相关的教程和示例代码,例如《PyTorch分布式训练简明教程》和《PyTorch多机多卡分布式训练》。这些资源可以帮助你更好地理解和使用PyTorchDDP功能。<span class="em">1</span><span class="em">2</span><span class="em">3</span> #### 引用[.reference_title] - *1* *2* *3* [Pytorch中的DDP](https://blog.csdn.net/flyingluohaipeng/article/details/127900749)[target="_blank" data-report-click={"spm":"1018.2226.3001.9630","extra":{"utm_source":"vip_chatgpt_common_search_pc_result","utm_medium":"distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v93^chatsearchT3_2"}}] [.reference_item style="max-width: 100%"] [ .reference_list ]
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值