Differences between DataParallel (DP) and DistributedDataParallel (DDP)

CV矿工

Last modified: 2024-04-01 20:30:02

First published: 2022-08-01 13:28:56

Article link: https://blog.csdn.net/ZauberC/article/details/126099418

First, DataParallel is single-process, multi-thread, and only works on a single machine, while DistributedDataParallel is multi-process and works for both single- and multi-machine training. DataParallel is usually slower than DistributedDataParallel even on a single machine due to GIL contention across threads, per-iteration model replication, and the additional overhead of scattering inputs and gathering outputs.
Recall from the prior tutorial that if your model is too large to fit on a single GPU, you must use model parallel to split it across multiple GPUs. DistributedDataParallel works with model parallel; DataParallel does not at this time. When DDP is combined with model parallel, each DDP process uses model parallel, and all processes collectively use data parallel.
If your model needs to span multiple machines or if your use case does not fit the data parallelism paradigm, please see the RPC API for more generic distributed training support.
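
To make the combination of DDP and model parallel more concrete, here is a minimal sketch of how GPUs might be assigned when each DDP process splits its model across several devices. The helpers `devices_for_rank` and `plan_layout` are hypothetical names introduced for illustration, and the contiguous assignment (rank r owns GPUs r*k through r*k + k - 1) is an assumption, not something the text above prescribes.

```python
from typing import Dict, List


def devices_for_rank(rank: int, gpus_per_process: int) -> List[int]:
    """Return the GPU indices owned by one DDP process (hypothetical helper)."""
    first = rank * gpus_per_process
    return list(range(first, first + gpus_per_process))


def plan_layout(total_gpus: int, gpus_per_process: int) -> Dict[int, List[int]]:
    """Map every DDP rank to the GPUs its model-parallel model is split across."""
    if gpus_per_process <= 0 or total_gpus % gpus_per_process != 0:
        raise ValueError("total_gpus must be a positive multiple of gpus_per_process")
    world_size = total_gpus // gpus_per_process  # number of DDP processes
    return {rank: devices_for_rank(rank, gpus_per_process) for rank in range(world_size)}


if __name__ == "__main__":
    # Assumed setup: 8 GPUs, each model replica split across 2 of them.
    for rank, devices in plan_layout(total_gpus=8, gpus_per_process=2).items():
        print(f"DDP process {rank}: model split across GPUs {devices}")
```

Within each process the layers are placed on that process's devices (model parallel), while the processes together form the data-parallel group.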

The implementation-level differences between DDP and DP are as follows:

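Optimizer:

DDP: Each process has its own optimizer and independently performs every optimization step in each iteration, just as in non-distributed training.
DP: There is only one optimizer, and it runs in the main thread. The gradients from all GPUs are summed, the parameters are updated on the main GPU, and the updated parameters are then broadcast to the other GPUs.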
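
The two optimizer flows can be sketched with a toy simulation in plain Python. This is not PyTorch code: parameters and gradients are lists of floats, `sgd_step` stands in for any optimizer, and the function names are hypothetical. It assumes that DDP averages gradients across processes, while in DP each replica's gradient is already its share of the full-batch gradient and is therefore summed.

```python
from typing import List

Vector = List[float]


def sgd_step(params: Vector, grads: Vector, lr: float) -> Vector:
    """Plain SGD update, standing in for any optimizer's step()."""
    return [p - lr * g for p, g in zip(params, grads)]


def ddp_iteration(replicas: List[Vector], local_grads: List[Vector], lr: float) -> List[Vector]:
    """DDP-style: average gradients across processes, then every process steps on its own."""
    n = len(local_grads)
    averaged = [sum(column) / n for column in zip(*local_grads)]
    # Each process holds the same averaged gradient and runs its own optimizer.
    return [sgd_step(params, averaged, lr) for params in replicas]


def dp_iteration(main_params: Vector, local_grads: List[Vector], lr: float) -> List[Vector]:
    """DP-style: sum gradients on the main GPU, step once there, broadcast the result."""
    summed = [sum(column) for column in zip(*local_grads)]
    updated = sgd_step(main_params, summed, lr)
    # Broadcast: every replica receives a copy of the main GPU's parameters.
    return [list(updated) for _ in local_grads]


if __name__ == "__main__":
    start = [1.0, 2.0]
    ddp = ddp_iteration([list(start), list(start)], [[0.5, 1.0], [1.5, 3.0]], lr=0.5)
    dp = dp_iteration(list(start), [[0.25, 0.5], [0.75, 1.5]], lr=0.5)
    print("DDP replicas:", ddp)
    print("DP replicas: ", dp)
```

In both flows the replicas end the iteration with identical parameters; the difference is where the update is computed and what has to be sent afterwards.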

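Gradients:

DDP: Each process computes the loss on its own GPU and runs the backward pass to compute gradients; the gradients are all-reduced while the backward pass is still running.
DP: DP runs one thread per GPU inside a single process rather than separate processes. Once the gradients have been computed on every GPU, they are reduced onto the main GPU. The main GPU uses them to update the model weights and then broadcasts the model to the other GPUs for the next iteration.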
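
As a small worked example of the DDP gradient flow, the sketch below uses a one-parameter model y = w * x with a mean-squared-error loss. Each simulated worker computes the gradient on its own shard, and a stand-in `all_reduce_mean` gives every worker the same average. The model, the data, and all function names are assumptions chosen for illustration; the overlap of communication with the backward pass is not modelled.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y)


def mse_grad(w: float, batch: Sequence[Sample]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(batch: Sequence[Sample], world_size: int) -> List[Sequence[Sample]]:
    """Split a batch into equal contiguous shards, one per worker."""
    if len(batch) % world_size != 0:
        raise ValueError("batch size must be divisible by world_size")
    size = len(batch) // world_size
    return [batch[i * size:(i + 1) * size] for i in range(world_size)]


def all_reduce_mean(values: List[float]) -> List[float]:
    """Stand-in for all-reduce: every worker ends up holding the same average."""
    mean = sum(values) / len(values)
    return [mean for _ in values]


if __name__ == "__main__":
    batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 9.0)]
    w = 1.0
    local = [mse_grad(w, s) for s in shard(batch, world_size=2)]
    synced = all_reduce_mean(local)
    print("local gradients:   ", local)
    print("after all-reduce:  ", synced)
    print("full-batch gradient:", mse_grad(w, batch))
    assert all(math.isclose(g, mse_grad(w, batch)) for g in synced)
```

With equally sized shards, the average of the per-worker gradients equals the gradient over the full batch, so every worker applies the same update it would have applied in single-GPU training.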

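Data transferred:

DDP: Only a small amount of data, mainly gradients, is exchanged. Because the model in every process starts from identical parameters (a single broadcast at initialization) and every update uses identical gradients, the parameters in all processes stay identical throughout training. Compared with DataParallel, torch.distributed transfers less data, so it is faster and more efficient.

DP: Every iteration involves heavy communication: the model, forward outputs, loss, gradients, and so on.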

A good article on this topic: https://juejin.cn/post/7031363573511094302
