I've recently been writing code with PyTorch's DDP (DistributedDataParallel) tool.
I ran into a problem: if the model runs two forward passes within a single training step, like this:
model = Model()   # assumed to be wrapped with DDP elsewhere; omitted here
for i, (x1, x2, y) in enumerate(trloader):
    x1 = x1.cuda()
    x2 = x2.cuda()
    y = y.cuda()
    p1 = model(x1)   # first forward pass
    p2 = model(x2)   # second forward pass
    ...
then the program throws the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
After going through the model and setting inplace=False everywhere, the program still threw the same error.
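For reference, a quick sketch (the helper name disable_inplace is my own, not from the original post) of what "set inplace=False everywhere" amounts to: flip the flag on every module that exposes it, such as nn.ReLU(inplace=True) or nn.Dropout(inplace=True).

import torch.nn as nn

def disable_inplace(model: nn.Module) -> None:
    # Walk every submodule and turn off its in-place flag if it has one.
    for m in model.modules():
        if hasattr(m, "inplace"):
            m.inplace = False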
It finally turned out to be the BatchNormalization layers: after removing all the BN layers, the problem disappeared.
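To see why BN is a natural suspect, here is a small standalone sketch (my own illustration, not from the original debugging): in training mode every forward pass rewrites running_mean and running_var in place, so calling the model twice mutates those per-channel buffers twice before backward runs, which is exactly the kind of in-place change the version-counter error complains about.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(64).train()
buf = bn.running_mean           # the [64]-sized buffer, same object on every call
print(buf.clone())              # initial running mean: all zeros
bn(torch.randn(8, 64))          # first forward updates the stats in place
bn(torch.randn(8, 64))          # second forward updates them again
print(buf)                      # same tensor, now holding new statistics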
The best fix is to convert the BN layers to SyncBatchNorm:
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
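For context, here is a sketch of where that conversion fits in a typical DDP setup. convert_sync_batchnorm has to be called before the model is wrapped in DistributedDataParallel; the process-group initialization and LOCAL_RANK handling below assume a torchrun-style launch and are not from the original post.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = Model().cuda(local_rank)
# Replace every BatchNorm layer with SyncBatchNorm *before* wrapping with DDP.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])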