One reason torch.nn.parallel.DistributedDataParallel hangs in the first epoch

chensi000000

Category: pytorch

Article link: https://blog.csdn.net/jiongta9473/article/details/112392671

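When a model wrapped in torch.nn.parallel.DistributedDataParallel is evaluated after calling eval(), the evaluation must run inside with torch.no_grad(). Note that eval() only switches layers such as dropout and batch normalization to inference behavior; it does not turn off autograd. Without torch.no_grad(), the GPU with rank == 0 hangs in the code that runs after eval(), while the other GPUs keep training. They do not wait for the GPU that is evaluating.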
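
To make concrete what the context manager changes, here is a minimal single-process sketch. It runs on CPU with a plain nn.Linear as a stand-in for the real network (an assumption, so that it works without GPUs or a process group), and it does not reproduce the hang itself. It only shows that eval() leaves autograd on, while a forward pass inside torch.no_grad() records no graph.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 3)      # hypothetical stand-in for the DDP-wrapped network
batch = torch.randn(2, 4)

model.eval()
out_with_grad = model(batch)  # eval() alone does not disable autograd

with torch.no_grad():
    grad_mode_inside = torch.is_grad_enabled()
    out_no_grad = model(batch)  # no autograd graph is recorded here

print(out_with_grad.requires_grad)  # True
print(grad_mode_inside)             # False
print(out_no_grad.requires_grad)    # False
```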

There is one spot that is easy to get wrong in practice. The code is as follows:

import os
import torch

# Fragment from inside the per-epoch training loop. model is the DDP-wrapped
# network; right_num0 (the best correct count so far) is initialized before the loop.
if int(os.environ.get('RANK')) == 0:
    with torch.no_grad():
        # epoch 0 satisfies this condition, so validation already runs in the first epoch
        if epoch % 10 == 0:
            model.eval()
            right_num = 0
            for idx, (data, label) in enumerate(val_dataloader):
                data = data.to(device)
                label = label.to(device)
                x0 = model(data)
                x0 = torch.nn.functional.softmax(x0, dim=1)
                # Count the predictions that match the labels in this batch
                right_num += (torch.argmax(x0, dim=1) == label).sum().item()
            # Save a checkpoint whenever the result matches or beats the best so far
            if right_num >= right_num0:
                right_num0 = right_num
                torch.save(model.state_dict(), "./best_dict_resnest50-softmax-64batch-distr.pth")
            print(right_num)
            torch.cuda.empty_cache()
# Rank 0 also saves the latest weights
if int(os.environ.get('RANK')) == 0:
    with torch.no_grad():
        model.eval()
        torch.save(model.state_dict(), "./last_dict_resnest50-softmax-64batch-distr.pth")

The with torch.no_grad() in the code above must not be omitted!
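
One way to make the context manager hard to forget is to move the validation loop into a helper that always enters torch.no_grad() itself. The sketch below is an illustration under stated assumptions, not the author's code: count_correct is a hypothetical name, and the toy usage runs on CPU with a plain nn.Linear instead of a DDP-wrapped model. It also restores the previous training mode on exit, which the fragment above leaves to the surrounding loop.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def count_correct(model, dataloader, device):
    """Return how many samples are classified correctly (hypothetical helper)."""
    was_training = model.training
    model.eval()
    correct = 0
    with torch.no_grad():  # the context manager the article says must not be omitted
        for data, label in dataloader:
            data, label = data.to(device), label.to(device)
            logits = model(data)
            # argmax over logits picks the same class as argmax over softmax
            correct += (logits.argmax(dim=1) == label).sum().item()
    model.train(was_training)  # put the model back in its previous mode
    return correct


# Toy usage on CPU with random data (all values here are made up for illustration)
torch.manual_seed(0)
features = torch.randn(8, 4)
labels = torch.randint(0, 3, (8,))
loader = DataLoader(TensorDataset(features, labels), batch_size=4)
toy_model = nn.Linear(4, 3)
toy_model.train()
num_correct = count_correct(toy_model, loader, torch.device("cpu"))
print(num_correct)
```

In the training script, the rank-0 branch would then reduce to a call such as right_num = count_correct(model, val_dataloader, device) under the same RANK and epoch % 10 checks.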
