DataParallel layers (multi-GPU, distributed): torch distributed functions

Latest recommended article published on 2024-04-28 19:45:00

PilviMannis


Categories: python, pytorch. Tags: distributed, python, pytorch

Article link: https://blog.csdn.net/circleyuanquan/article/details/114024431


DataParallel layers (multi-GPU, distributed)

DataParallel

class torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)

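Implements data parallelism at the module level.

This container parallelizes the application of the given module by splitting the input across the specified devices, chunking it in the batch dimension (other objects are copied once per device). In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backward pass, gradients from each replica are summed into the original module.

The batch size should be larger than the number of GPUs used.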
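The chunking described above can be sketched in plain Python. This is an illustration only, not PyTorch code: the helper name chunk_sizes is hypothetical, and it assumes the split follows the same ceil-division rule as torch.chunk, where every chunk holds ceil(batch_size / num_devices) samples except possibly the last one.

```python
def chunk_sizes(batch_size, num_devices):
    """Sizes of the per-device chunks for a batch split along dim 0.

    Hypothetical helper: assumes ceil-division chunking, as torch.chunk does.
    """
    if batch_size <= 0 or num_devices <= 0:
        raise ValueError("batch_size and num_devices must be positive")
    per_chunk = -(-batch_size // num_devices)  # ceil(batch_size / num_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(per_chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(chunk_sizes(12, 3))  # even split: every device gets the same share
print(chunk_sizes(10, 3))  # uneven split: the last device gets fewer samples
print(chunk_sizes(2, 4))   # batch smaller than device count: some devices idle
```

Under this rule a batch of 2 on 4 devices yields only two chunks, which is one way to see why the batch size should exceed the number of GPUs.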

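See also: Using nn.DataParallel.

Arbitrary positional and keyword inputs may be passed to DataParallel, with tensors handled specially. All tensors are scattered along the specified dim (default 0). Primitive types are broadcast, but all other types are shallow copies and can be corrupted if written to in the model's forward pass.

The parallelized module must have its parameters and buffers on device_ids[0] before this DataParallel module is run.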

Warning

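In each forward, module is replicated on each device, so any updates to the running module in forward will be lost. For example, if module has a counter attribute that is incremented in each forward, it will always stay at the initial value, because the update is applied to the replicas, which are destroyed after forward. However, DataParallel guarantees that the replica on device[0] has its parameters and buffers sharing storage with the base parallelized module, so in-place updates to the parameters or buffers on device[0] will be recorded. For example, BatchNorm2d and spectral_norm() rely on this behavior to update their buffers.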
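The behavior in this warning can be imitated with a toy simulation in plain Python. Nothing here is PyTorch code: ToyModule and toy_data_parallel are hypothetical names, a shallow copy stands in for the device[0] replica that shares buffer storage, and deep copies stand in for the replicas on the other devices.

```python
import copy


class ToyModule:
    def __init__(self):
        self.counter = 0          # plain Python attribute
        self.running_sum = [0.0]  # stands in for a buffer updated in place

    def forward(self, chunk):
        self.counter += 1                  # rebinds the attribute on this object only
        self.running_sum[0] += sum(chunk)  # mutates the underlying storage in place
        return [2 * x for x in chunk]


def toy_data_parallel(module, batch, num_devices):
    size = -(-len(batch) // num_devices)  # ceil division, one chunk per device
    chunks = [batch[i:i + size] for i in range(0, len(batch), size)]
    outputs = []
    for device, chunk in enumerate(chunks):
        # Device 0: shallow copy, so the buffer storage is shared with `module`.
        # Other devices: independent copies with their own storage.
        replica = copy.copy(module) if device == 0 else copy.deepcopy(module)
        outputs.extend(replica.forward(chunk))
    return outputs  # every replica is discarded once this function returns


module = ToyModule()
out = toy_data_parallel(module, [1.0, 2.0, 3.0, 4.0], num_devices=2)
print(module.counter)      # the counter update happened on the replicas only
print(module.running_sum)  # only the in-place update from device 0 survives
```

In this simulation the counter on the original object never changes, while the in-place buffer update made by the first replica remains visible.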

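Forward and backward hooks defined on module and its submodules will be invoked len(device_ids) times, each with inputs located on a particular device. In particular, the hooks are only guaranteed to be executed in the correct order with respect to operations on the corresponding device. For example, hooks set via register_forward_pre_hook() are not guaranteed to run before all len(device_ids) forward() calls, but each such hook is guaranteed to run before the corresponding forward() call on that device.

When module returns a scalar (i.e., a 0-dimensional tensor) in forward(), this wrapper returns a vector whose length equals the number of devices used in data parallelism, containing the result from each device.

Note:

There is a subtlety in using the pack sequence -> recurrent network -> unpack sequence pattern in a module wrapped in DataParallel. See the "My recurrent network doesn't work with data parallelism" section of the FAQ for details.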

Parameters:

module (Module) – module to be parallelized
device_ids (list of int or torch.device) – CUDA devices (default: all devices)
output_device (int or torch.device) – device location of output (default: device_ids[0])

Variables: module (Module) – the module to be parallelized

Example:

net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
output = net(input_var)

DistributedDataParallel

class torch.nn.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, dim=0, broadcast_buffers=True, process_group=None, bucket_cap_mb=25, check_reduction=False)

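Implements distributed data parallelism, based on the torch.distributed package, at the module level.

This container parallelizes the application of the given module by splitting the input across the specified devices, chunking it in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backward pass, gradients from each node are averaged.

The batch size should be larger than the number of GPUs used locally. It should also be an integer multiple of the number of GPUs so that each chunk is the same size (so that each GPU processes the same number of samples).

See also: Basics and Use nn.DataParallel instead of multiprocessing. The same constraints on input as in torch.nn.DataParallel apply.

Creating this class requires torch.distributed to be already initialized, by calling torch.distributed.init_process_group().

DistributedDataParallel can be used in the following two ways:

Single-Process Multi-GPU

In this case, a single process is spawned on each host/node, and each process operates on all the GPUs of the node where it is running. To use DistributedDataParallel in this way, you can simply construct the model as follows: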

import torch.distributed
from torch.nn.parallel import DistributedDataParallel
torch.distributed.init_process_group(backend="nccl")
model = DistributedDataParallel(model)  # device_ids will include all GPU devices by default

Multi-Process Single-GPU

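This is the highly recommended way to use DistributedDataParallel: multiple processes, each operating on a single GPU. It is currently the fastest approach to data parallel training with PyTorch and applies to both single-node (multi-GPU) and multi-node data parallel training. It has proven to be significantly faster than torch.nn.DataParallel for single-node multi-GPU data parallel training.

Here is how to use it: on each host with N GPUs, you should spawn N processes, while ensuring that each process works individually on a single GPU from 0 to N-1. It is therefore your job to ensure that your training script operates on a single given GPU by calling: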

torch.cuda.set_device(i)

where i ranges from 0 to N-1. In each process, you should construct this module as follows:

torch.distributed.init_process_group(backend='nccl', world_size=4, init_method='...')
model = DistributedDataParallel(model, device_ids=[i], output_device=i)

To spawn multiple processes per node, you can use either torch.distributed.launch or torch.multiprocessing.spawn.
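The process layout described above (N processes per host, one per GPU) can be sketched as follows. This is plain Python bookkeeping only; the function name and the convention that global ranks are numbered node by node are assumptions made for illustration, not something the PyTorch API enforces.

```python
def process_layout(num_nodes, gpus_per_node):
    """Return (world_size, table), where table maps global rank -> (node, local GPU index).

    Assumes ranks are assigned node by node, one process per GPU (hypothetical convention).
    """
    world_size = num_nodes * gpus_per_node
    table = {}
    for rank in range(world_size):
        node = rank // gpus_per_node
        local_gpu = rank % gpus_per_node  # the i passed to torch.cuda.set_device(i)
        table[rank] = (node, local_gpu)
    return world_size, table


world_size, table = process_layout(num_nodes=2, gpus_per_node=4)
for rank, (node, gpu) in table.items():
    print(f"rank {rank}: node {node}, GPU {gpu}")
```

Under this convention, world_size is the value every process would pass to init_process_group, and the local GPU index is the value each process would use for device_ids and output_device.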

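Note:

The nccl backend is currently the fastest and the highly recommended backend for Multi-Process Single-GPU distributed training. This applies to both single-node and multi-node distributed training.

This module works only with the gloo and nccl backends.

The constructor, the forward method, and differentiation of the output (or of a function of the output of this module) are distributed synchronization points. Take that into account in case different processes might be executing different code.

This module assumes all parameters are registered in the model by the time it is created. No parameters should be added or removed later. The same applies to buffers.

This module assumes that the parameters registered in the model of each distributed process are in the same order. The module itself conducts gradient all-reduction in the reverse order of the model's registered parameters. In other words, it is the user's responsibility to ensure that each distributed process has exactly the same model and therefore exactly the same parameter registration order.

This module assumes all buffers and gradients are dense.

This module does not work with torch.autograd.grad() (i.e., it only works if gradients are accumulated in the .grad attributes of parameters).

If you plan to use this module with the nccl backend or the gloo backend (using Infiniband), together with a DataLoader that uses multiple workers, please change the multiprocessing start method to forkserver (Python 3 only) or spawn. Unfortunately, Gloo (using Infiniband) and NCCL2 are not fork safe, and you will likely experience deadlocks if you do not change this setting.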
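One way to follow this advice, sketched with the Python standard library only, is to request an explicit spawn context rather than rely on the platform default. How the choice reaches the data loader depends on the PyTorch version in use, so that part is left as a comment and should be read as an assumption.

```python
import multiprocessing as mp

# Ask for a "spawn" context explicitly; "forkserver" is the other fork-safe
# option on platforms that support it.
ctx = mp.get_context("spawn")
print(ctx.get_start_method())

# Alternatively, set the global default once, at program start, inside the
# `if __name__ == "__main__":` guard:
#     mp.set_start_method("spawn")
# Worker processes created afterwards (for example by a data loader with
# several workers) then start in a fresh interpreter instead of forking.
```

Using get_context keeps the choice local to the code that needs it, whereas set_start_method changes the default for the whole program and may only be called once.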

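Forward and backward hooks defined on module and its submodules will no longer be invoked, unless the hooks are initialized in the forward() method.

Never try to change your model's parameters after wrapping it with DistributedDataParallel. When you wrap your model, the DistributedDataParallel constructor registers additional gradient reduction functions on all the parameters of the model itself at construction time. Changing the model's parameters after DistributedDataParallel has been constructed is not supported, and unexpected behavior can occur, since the gradient reduction functions of some parameters might not get called.

Parameters are never broadcast between processes. The module performs an all-reduce step on gradients and assumes that the optimizer will modify them in the same way in all processes. Buffers (e.g., BatchNorm stats) are broadcast from the module in the process of rank 0 to all other replicas in the system in every iteration.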
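Why averaging gradients is enough to keep the replicas in step can be shown with a small simulation. This is plain Python, not PyTorch: the two "processes" are just lists, and all_reduce_mean and sgd_step are hypothetical helpers standing in for the all-reduce step and for an identical optimizer running in every process.

```python
def all_reduce_mean(per_process_grads):
    """Average gradients element-wise across processes; every process sees the same result."""
    num_processes = len(per_process_grads)
    return [sum(values) / num_processes for values in zip(*per_process_grads)]


def sgd_step(params, grads, lr):
    """One plain SGD update, applied identically in every process."""
    return [p - lr * g for p, g in zip(params, grads)]


# Two simulated processes start from identical parameters...
params = [[1.0, 2.0], [1.0, 2.0]]
# ...but compute different local gradients from different data shards.
local_grads = [[0.5, 1.0], [1.5, 3.0]]

averaged = all_reduce_mean(local_grads)
params = [sgd_step(p, averaged, lr=0.5) for p in params]
print(averaged)
print(params)  # both processes hold the same parameters after the step
```

Because every process starts from the same values and applies the same update to the same averaged gradient, the parameters stay identical without ever being exchanged.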

Parameters:

module (Module) – module to be parallelized
device_ids (list of int or torch.device) – CUDA devices (default: all devices)
output_device (int or torch.device) – device location of output (default: device_ids[0])
broadcast_buffers (bool) – flag that enables syncing (broadcasting) buffers of the module at beginning of the forward function. (default: True)
process_group – the process group to be used for distributed data all-reduction. If None, the default process group, which is created by torch.distributed.init_process_group, will be used. (default: None)
bucket_cap_mb – DistributedDataParallel will bucket parameters into multiple buckets so that gradient reduction of each bucket can potentially overlap with backward computation. bucket_cap_mb controls the bucket size in MegaBytes (MB) (default: 25)
check_reduction – when set to True, it enables DistributedDataParallel to automatically check, at the beginning of every iteration's forward function, whether the previous iteration's backward reductions were successfully issued. You normally don't need this option enabled unless you are observing unexpected behavior, such as different ranks getting different gradients, which should not happen if DistributedDataParallel is used correctly. (default: False)

Variables: module (Module) – the module to be parallelized
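The idea behind bucket_cap_mb can be illustrated with a simplified greedy grouping. This is a sketch under stated assumptions, not the actual DistributedDataParallel implementation: the function name, the parameter names, and the sizes are hypothetical, and the only facts taken from the text are the size cap per bucket and the reverse registration order of gradient reduction.

```python
def bucket_parameters(param_sizes_mb, bucket_cap_mb=25):
    """Greedily group parameters into buckets of at most bucket_cap_mb.

    Simplified illustration: walks parameters in reverse registration order and
    starts a new bucket when the cap would be exceeded. A parameter larger than
    the cap gets a bucket of its own.
    """
    buckets, current, current_mb = [], [], 0.0
    for name, size_mb in reversed(param_sizes_mb):
        if current and current_mb + size_mb > bucket_cap_mb:
            buckets.append(current)
            current, current_mb = [], 0.0
        current.append(name)
        current_mb += size_mb
    if current:
        buckets.append(current)
    return buckets


# Hypothetical parameter sizes in MB, listed in registration order.
sizes = [
    ("conv1.weight", 10),
    ("conv2.weight", 12),
    ("fc1.weight", 30),
    ("fc2.weight", 8),
    ("fc3.weight", 5),
]
buckets = bucket_parameters(sizes)
print(buckets)
```

Each bucket can then be reduced as one unit, so the reduction of an early bucket can overlap with the backward computation that is still producing gradients for later buckets.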

Example:

torch.distributed.init_process_group(backend='nccl', world_size=4, init_method='...')
net = torch.nn.parallel.DistributedDataParallel(model)  # uses the default process group

DistributedDataParallelCPU

class torch.nn.parallel.DistributedDataParallelCPU(module)

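Implements distributed data parallelism for CPU at the module level.

This module supports the mpi and gloo backends.

This container parallelizes the application of the given module by splitting the input across the specified devices, chunking it in the batch dimension. The module is replicated on each machine, and each such replica handles a portion of the input. During the backward pass, gradients from each node are averaged.

This module can be used together with DistributedSampler (see the class torch.utils.data.distributed.DistributedSampler), which loads a subset of the original dataset for each node with the same batch size. Strong scaling should therefore be configured like this:

n = 1, batch size = 128
n = 2, batch size = 64
n = 4, batch size = 32
n = 8, batch size = 16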
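The strong-scaling rule behind this table can be written out as a short calculation. The sketch below is plain Python and rests on two assumptions: a fixed global batch of 128, inferred from the n = 2, 4 and 8 rows, and a strided index split as a simplified stand-in for what DistributedSampler does (the real sampler also shuffles and pads). Both helper names are hypothetical.

```python
def per_node_batch_size(global_batch_size, num_nodes):
    """Strong scaling: the global batch stays fixed, each node takes an equal share."""
    if global_batch_size % num_nodes != 0:
        raise ValueError("global batch size must be divisible by the number of nodes")
    return global_batch_size // num_nodes


def shard_indices(dataset_len, num_nodes, rank):
    """Strided split of sample indices, one disjoint shard per node (no shuffling)."""
    return list(range(rank, dataset_len, num_nodes))


for n in (1, 2, 4, 8):
    print(f"n = {n}, batch size = {per_node_batch_size(128, n)}")

print(shard_indices(dataset_len=8, num_nodes=4, rank=1))
```

Halving the per-node batch each time the node count doubles keeps the number of samples per optimization step constant, so results stay comparable across cluster sizes.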

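Creating this class requires the distributed package to be already initialized in process group mode (see torch.distributed.init_process_group()).

The constructor, the forward method, and differentiation of the output (or of a function of the output of this module) are distributed synchronization points. Take that into account in case different nodes might be executing different code.

This module assumes all parameters are registered in the model by the time it is created. No parameters should be added or removed later.

This module assumes all gradients are dense.

This module does not work with torch.autograd.grad() (i.e., it only works if gradients are accumulated in the .grad attributes of parameters).

Forward and backward hooks defined on module and its submodules will no longer be invoked, unless the hooks are initialized in the forward() method.

Parameters are broadcast across nodes in the __init__() function. The module performs an all-reduce step on gradients and assumes that the optimizer will modify them in the same way on all nodes.

Parameters: module – module to be parallelized

Example: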

torch.distributed.init_process_group(world_size=4, init_method='...')
net = torch.nn.parallel.DistributedDataParallelCPU(model)
