RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument

chenpan0615

Category: pytorch  Tags: pytorch, multi-GPU neural network training

Article link: https://blog.csdn.net/guyejiyou64/article/details/102500675


RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_convolution)

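While training a model on multiple GPUs with PyTorch, I hit the error "RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_convolution)". Searching Baidu turned up no solution and I could not get past the firewall, so all I could do was step through with breakpoints to track down the bug...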

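The network is set up for multi-GPU parallelism in the usual way, so the parallelization itself should not be the problem, and in principle GPU ids should not get mismatched like this.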

...
## Network definition
net = mynet()
if num_gpus > 1:
    net = nn.DataParallel(net)  # GPU ids are already restricted via CUDA_VISIBLE_DEVICES
    patch_replication_callback(net)
net.cuda()
net.train()
...

Where the bug is

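After some digging, I found that the problem always came from one place in the network's forward function (marked in the code below). At the line that raises the error, the weights of that module and the input tensor are no longer on the same GPU.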

...
class mynet(nn.Module):
    def __init__(self):
        super(mynet, self).__init__()
        self.resnet = models.resnet34()
        self.bridge = nn.Sequential("some definitions")
        self.gau_block1 = nn.Sequential("some definitions")
        self.gau_block2 = nn.Sequential("some definitions")
        self.gau_block3 = nn.Sequential("some definitions")
        # Plain Python list that aliases the three blocks
        self.gau = [self.gau_block1, self.gau_block2, self.gau_block3]

    def forward(self, x):
        fm_low = self.resnet(x)  # passes
        fm_high = self.bridge(fm_low[3])  # passes
        fm_high = self.gau[0](fm_high, fm_low[0])  # ******** error raised here ********
        fm_high = self.gau[1](fm_high, fm_low[1])
        ...

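The failing line, self.gau[0](...), reaches the block through a rather unusual construct: a plain Python list. Once I stopped collecting the three blocks in a list and called them directly (self.gau_block1 and so on), multi-GPU training finally ran normally.
In practice, then, it is best not to define a plain Python list of modules inside a network class and call modules through it. On a single GPU nothing goes wrong, but when the model is distributed across several GPUs this is where things break. A likely explanation: nn.DataParallel gives each GPU its own copy of the registered submodules, while a plain list attribute still points at the original modules on the first GPU, so the replica's input and the weights it reaches through the list end up on different devices.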
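
The effect can be mimicked with a toy model in plain Python. This is not PyTorch code: Block, ToyNet and toy_replicate are hypothetical names, and the sketch assumes that replication shallow-copies a module's attributes and rebuilds only its registered children, which is a simplification of what nn.DataParallel does.

```python
import copy


class Block:
    """Stand-in for a submodule; `device` mimics where its weights live."""

    def __init__(self, device):
        self.device = device


class ToyNet:
    def __init__(self):
        # Mimics the registry of submodules that the framework knows about.
        self.children = {"gau_block1": Block(device=0)}
        # Plain Python list holding a second reference to the same block.
        self.gau = [self.children["gau_block1"]]


def toy_replicate(net, device):
    """Hypothetical, simplified replication onto `device`."""
    replica = copy.copy(net)  # shallow copy: attributes are shared by reference
    # Only the registered children are rebuilt on the target device.
    replica.children = {name: Block(device) for name in net.children}
    return replica


net = ToyNet()
replica = toy_replicate(net, device=1)
print(replica.children["gau_block1"].device)  # 1: the registered child was moved
print(replica.gau[0].device)                  # 0: the list still points at the original
```

In this toy, the replica on device 1 that indexes into self.gau reaches a block that still lives on device 0, which is the same kind of mismatch the error message reports.
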
If you really do want a list, consider PyTorch's built-in nn.ModuleList.
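
A minimal sketch of the nn.ModuleList variant follows. The original block definitions are elided in the post, so GAUBlock here is a hypothetical stand-in (one convolution over the sum of the two feature maps), and the channel count and tensor shapes are assumptions chosen only so that the example runs on CPU.

```python
import torch
import torch.nn as nn


class GAUBlock(nn.Module):
    """Hypothetical stand-in for the post's gau blocks."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, fm_high, fm_low):
        # Fuse the high-level and low-level feature maps (assumed same shape).
        return self.conv(fm_high + fm_low)


class MyNet(nn.Module):
    def __init__(self, channels=8, num_blocks=3):
        super().__init__()
        # nn.ModuleList registers every block as a submodule of MyNet,
        # unlike a plain Python list.
        self.gau = nn.ModuleList([GAUBlock(channels) for _ in range(num_blocks)])

    def forward(self, fm_high, fm_lows):
        # Indexing or iterating a ModuleList goes through registered children.
        for block, fm_low in zip(self.gau, fm_lows):
            fm_high = block(fm_high, fm_low)
        return fm_high


net = MyNet()
fm_high = torch.randn(2, 8, 16, 16)
fm_lows = [torch.randn(2, 8, 16, 16) for _ in range(3)]
out = net(fm_high, fm_lows)
print(out.shape)
print([name for name, _ in net.named_children()])
```

Because the blocks are registered children, they show up in net.named_children() and net.parameters(), and the model can be wrapped in nn.DataParallel the same way as in the network definition earlier in the post.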
