[Programming Bug] PyTorch: "data and model are not on the same device" errors when using DataParallel

Latest recommended article published on 2024-06-12 10:28:02

不写论文了吧

Tags: python, artificial intelligence, deep learning, pytorch

Article link: https://blog.csdn.net/qq_42363777/article/details/131364933

Problem description

Type 1: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Type 2: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
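Before trying any fix, it can help to check which devices the model's registered tensors actually live on and compare that with the input. The following is a minimal sketch; the helper name devices_in_use and the toy model are hypothetical and only serve as illustration.

```python
import torch
import torch.nn as nn


def devices_in_use(module):
    """Return the device strings of all registered parameters and buffers."""
    tensors = list(module.parameters()) + list(module.buffers())
    return {str(t.device) for t in tensors}


# Toy model (an assumption for illustration): one layer with parameters,
# one layer that also carries buffers (BatchNorm running statistics).
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
batch = torch.randn(2, 4)

model_devices = devices_in_use(model)
input_device = str(batch.device)

# True would mean the input sits on a device that holds none of the model's tensors.
mismatch = input_device not in model_devices
```

A tensor that the model uses but that never shows up in parameters() or buffers() is not registered at all, which is the situation described under Problem 1.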

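Solutions

Problem 1

A likely cause is that some component of the model was not properly registered as part of the model, for example self.attr1 = xx where xx is neither a Parameter nor a Module. Such attributes are not moved by model.cuda() or model.to(device), so they stay on the CPU.
Fix: this depends on the specific case. Code that may be useful: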

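self.register_buffer(name, tensor)      # non-trainable tensor that should move with the model
self.register_parameter(name, param)    # trainable nn.Parameter
self.register_module(name, module)      # submodule
self.xx = nn.ModuleList()               # list of submodules, instead of a plain Python list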
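The difference between a registered and an unregistered attribute can be checked without a GPU. In the sketch below, the class names Unregistered and Registered and the helper tracked_names are hypothetical. Casting with .double() is used as a CPU-only stand-in for .cuda(): both walk only the registered parameters and buffers.

```python
import torch
import torch.nn as nn


class Unregistered(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = torch.ones(3)            # plain tensor: not tracked by the module
        self.layers = [nn.Linear(3, 3)]       # plain list: the Linear is not tracked either


class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.ones(3))      # tracked buffer
        self.layers = nn.ModuleList([nn.Linear(3, 3)])    # tracked submodules


def tracked_names(module):
    """Names of all tensors that .to() / .cuda() / .double() will touch."""
    names = [n for n, _ in module.named_parameters()]
    names += [n for n, _ in module.named_buffers()]
    return sorted(names)


bad_names = tracked_names(Unregistered())
good_names = tracked_names(Registered())

# .double() converts only registered floating-point tensors.
bad_dtype = Unregistered().double().scale.dtype
good_dtype = Registered().double().scale.dtype
```

Here bad_names is empty, so nothing in Unregistered would follow the model to a GPU, while Registered exposes its buffer and the weights of the layer in the ModuleList.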

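Problem 2

Common cause: different modules of the model are kept separate from each other. For example:

out1 = self.module1(input)
out2 = self.module2(out1)

Here module1 and module2 are on different GPUs (for example, when using CLIP, the image encoder and the text encoder may be separate).
In that case, you can create a new nn.Module class that holds the components together, e.g.: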

class CustomCLIP(nn.Module):
    def __init__(self, cfg, train_dataset, clip_model):
        super().__init__()
        # Both encoders are registered submodules of the same wrapper.
        self.text_encoder = TextEncoder(cfg, clip_model)
        self.video_encoder = VideoEncoder(cfg, clip_model)

    def forward(self, video, pairs):
        ...  # forward logic (elided in the original)
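A runnable version of the same idea might look like the sketch below. The class TwoTowerModel, its two nn.Linear stand-in encoders, and the tensor shapes are all assumptions for illustration, not the CLIP code above. The script falls back to the CPU when no GPU is available, in which case DataParallel simply calls the wrapped module.

```python
import torch
import torch.nn as nn


class TwoTowerModel(nn.Module):
    """Hypothetical wrapper that owns both encoders as registered submodules."""

    def __init__(self, dim_in=8, dim_out=16):
        super().__init__()
        # Stand-ins for the real encoders (e.g. an image and a text encoder).
        self.image_encoder = nn.Linear(dim_in, dim_out)
        self.text_encoder = nn.Linear(dim_in, dim_out)

    def forward(self, image_feat, text_feat):
        # Both encoders belong to the same replica, so they share one device.
        img = self.image_encoder(image_feat)
        txt = self.text_encoder(text_feat)
        return (img * txt).sum(dim=-1)  # one score per sample


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Move the whole wrapper to the first device, then let DataParallel replicate it.
model = nn.DataParallel(TwoTowerModel().to(device))

images = torch.randn(4, 8, device=device)
texts = torch.randn(4, 8, device=device)
scores = model(images, texts)

top_level = {name.split(".")[0] for name, _ in model.module.named_parameters()}
```

Because DataParallel replicates the wrapper as a whole, each replica receives its own copy of both encoders on the same GPU as its slice of the batch.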

New cause: there is one more case, which I just ran into. My code looked like this:

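class MyModule(nn.Module):
    def __init__(self, dset, cfg):
        super(MyModule, self).__init__()
        ...  # layer definitions (elided in the original)
        # Problematic: stores a method bound to this particular instance.
        self.train_forward = self.my_forward_1

    def my_forward_1(self, x):
        ...  # training-time forward logic (elided in the original)

    def forward(self, x):
        if self.training:
            pred = self.train_forward(x)
        else:
            pred = ...  # evaluation-time logic (elided in the original)
        return pred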

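With this code the model effectively runs only on GPU 0, so a multi-GPU run reports that the data and the model are not on the same device. The likely reason is that self.train_forward, assigned in __init__, is a method bound to the original module instance (the one on GPU 0). DataParallel's replicas copy this attribute as-is, so every replica ends up calling the original module's method with inputs that live on its own GPU. The fix is: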

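class MyModule(nn.Module):
    def __init__(self, dset, cfg):
        super(MyModule, self).__init__()
        ...  # layer definitions (elided in the original)

    def my_forward_1(self, x):
        ...  # training-time forward logic (elided in the original)

    def forward(self, x):
        if self.training:
            # Call the method directly so it is looked up on the current replica.
            pred = self.my_forward_1(x)
        else:
            pred = ...  # evaluation-time logic (elided in the original)
        return pred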
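Why the alias in __init__ behaves differently can be reproduced without any GPU. The sketch below uses copy.copy as a stand-in for the way DataParallel builds its replicas, on the assumption that replication shallow-copies the module's attribute dictionary. The class AliasInInit is hypothetical.

```python
import copy

import torch
import torch.nn as nn


class AliasInInit(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        # Stored in the instance dict as a method bound to THIS instance.
        self.train_forward = self.my_forward_1

    def my_forward_1(self, x):
        return self.fc(x)

    def forward(self, x):
        return self.train_forward(x)


original = AliasInInit()
replica = copy.copy(original)  # shallow copy, standing in for a per-GPU replica

# The stored alias still points at the original module ...
alias_owner = replica.train_forward.__self__
# ... while a normal method lookup binds to the replica itself.
direct_owner = replica.my_forward_1.__self__
```

The alias keeps pointing at the original instance, whose parameters sit on GPU 0 in a DataParallel run, whereas looking the method up at call time binds it to the replica. This matches the fix above of calling self.my_forward_1(x) directly inside forward.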