Common PyTorch distributed-training errors (plus a reminder: after a model is wrapped for distributed training, the names returned by named_modules() gain a module. prefix)

1. --nproc_per_node= is set incorrectly. For example, if only 2 GPUs are visible but you pass --nproc_per_node=3, then this line in the code, torch.cuda.set_device(args.local_rank), raises the following error:

Traceback (most recent call last):
  File "trainDDP.py", line 32, in <module>
    torch.cuda.set_device(args.local_rank)
  File "/opt/conda/envs/torch1.10/lib/python3.8/site-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
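
The fix is simply to make --nproc_per_node match the number of visible GPUs, e.g. python -m torch.distributed.launch --nproc_per_node=2 trainDDP.py for two cards. As a minimal sketch (the assert below is my addition, not from the original script), you can also fail fast with a readable message instead of the opaque "invalid device ordinal":

# Guard before set_device: raise a clear error if the launcher spawned
# more processes than there are visible GPUs.
assert args.local_rank < torch.cuda.device_count(), (
    'local_rank {} but only {} visible GPUs; lower --nproc_per_node '
    'or adjust CUDA_VISIBLE_DEVICES'.format(
        args.local_rank, torch.cuda.device_count()))
torch.cuda.set_device(args.local_rank)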

2. When using DistributedDataParallel, the model was not moved to the GPU first and stayed on the default CPU, which raises the following error:

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [1], output_device 1, and module parameters {device(type='cpu')}.

Code. Note that args.local_rank is an argument every distributed training script needs to accept; the launcher fills it in automatically.

device = torch.device('cuda:{}'.format(args.local_rank))

# Wrong: model is still on the CPU when DDP wraps it
model = VGG16_3D.VGG163D(num_classes=num_classes)
DDPmodel = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                     output_device=args.local_rank)

# Correct: move the model to the local GPU first
model = VGG16_3D.VGG163D(num_classes=num_classes)
model = model.to(device)
DDPmodel = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                     output_device=args.local_rank)
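
For context, the snippets in this post assume the following process-group setup, which is how args.local_rank gets populated when launching with torch.distributed.launch. This is a minimal sketch; the nccl backend is an assumption (it is the usual choice for GPU training):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to every spawned process
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl')  # assumption: NCCL backend for GPUs
device = torch.device('cuda:{}'.format(args.local_rank))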

3. When building the optimizer's parameter groups, fetch the parameters from the un-wrapped model. Passing the DDP-wrapped model raises the following error:

Traceback (most recent call last):
    optimizer = optim.SGD(train_params, lr=lr, momentum=0.9, weight_decay=5e-4)
  File "/opt/conda/envs/torch1.10/lib/python3.8/site-packages/torch/optim/sgd.py", line 95, in __init__
    super(SGD, self).__init__(params, defaults)
  File "/opt/conda/envs/torch1.10/lib/python3.8/site-packages/torch/optim/optimizer.py", line 54, in __init__
    self.add_param_group(param_group)
  File "/opt/conda/envs/torch1.10/lib/python3.8/site-packages/torch/optim/optimizer.py", line 251, in add_param_group
    param_group['params'] = list(params)
  File "/workspace/ws/C3D-main/network/VGG16_3D.py", line 106, in get_1x_lr_params
    b = [model.conv1a, model.conv1b, model.conv2a, model.conv2b, model.conv3a, model.conv3b, model.conv3c, model.conv4a, model.conv4b, model.conv4c,
  File "/opt/conda/envs/torch1.10/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'conv1a'

Code:

# Wrong: the param-group helpers receive the DDP wrapper, whose layers
# live under DDPmodel.module, so model.conv1a etc. cannot be found
model = VGG16_3D.VGG163D(num_classes=num_classes)
model = model.to(device)
DDPmodel = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                     output_device=args.local_rank)
train_params = [{'params': VGG16_3D.get_1x_lr_params(DDPmodel), 'lr': lr},
                {'params': VGG16_3D.get_10x_lr_params(DDPmodel), 'lr': lr * 10}]
criterion = nn.CrossEntropyLoss()  # standard cross-entropy loss for classification
optimizer = optim.SGD(train_params, lr=lr, momentum=0.9, weight_decay=5e-4)

# Correct: build the parameter groups from the un-wrapped model
model = VGG16_3D.VGG163D(num_classes=num_classes)
model = model.to(device)
DDPmodel = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                     output_device=args.local_rank)
train_params = [{'params': VGG16_3D.get_1x_lr_params(model), 'lr': lr},
                {'params': VGG16_3D.get_10x_lr_params(model), 'lr': lr * 10}]
criterion = nn.CrossEntropyLoss()  # standard cross-entropy loss for classification
optimizer = optim.SGD(train_params, lr=lr, momentum=0.9, weight_decay=5e-4)
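
Equivalently (a sketch, not from the original post), the wrapped model's original module is reachable through the .module attribute, which is also why every name from named_modules() gains the module. prefix mentioned in the title:

# DDPmodel.module is the original VGG163D instance, so this also works:
train_params = [{'params': VGG16_3D.get_1x_lr_params(DDPmodel.module), 'lr': lr},
                {'params': VGG16_3D.get_10x_lr_params(DDPmodel.module), 'lr': lr * 10}]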

4. Because distributed training is multi-process and we usually rely on all_reduce (the tensors on all GPUs are summed and the result is written back to every GPU), be careful with tools like tensorboardX: do not write from every process. If every process wrote to the log files you would need a locking scheme, which gets complicated, and since all_reduce already leaves every process with identical values it is unnecessary anyway. So pick a single process to do the writing. The two collective operations are illustrated below:
The reduce operation: [figure omitted]
The all_reduce operation: [figure omitted]

# Wrong: every process creates a writer and writes
log_dir = os.path.join(save_dir, 'models', datetime.now().strftime('%b%d_%H-%M-%S') + '_' + socket.gethostname())
writer = SummaryWriter(log_dir=log_dir)
writer.add_scalar('data/train_loss_epoch', epoch_loss, epoch)
writer.close()

# Correct: only rank 0 creates a writer and writes
log_dir = os.path.join(save_dir, 'models', datetime.now().strftime('%b%d_%H-%M-%S') + '_' + socket.gethostname())
os.makedirs(log_dir, exist_ok=True)  # exist_ok=True: do not raise if the directory already exists
global_rank = torch.distributed.get_rank()
if global_rank == 0:
    writer = SummaryWriter(log_dir=log_dir)
else:
    writer = None

if writer is not None:
    writer.add_scalar('data/train_loss_epoch', epoch_loss, epoch)
if writer is not None:
    writer.close()
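
If you want the logged loss to reflect all ranks rather than just rank 0's shard of the data, you can all_reduce it first. A minimal sketch, assuming the default process group is initialized and epoch_loss is a Python float (the variable names are illustrative):

import torch
import torch.distributed as dist

# Sum the per-rank loss across all processes, then average over world size.
loss_tensor = torch.tensor([epoch_loss], device=device)
dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)
epoch_loss = loss_tensor.item() / dist.get_world_size()

if writer is not None:  # only rank 0 holds a writer
    writer.add_scalar('data/train_loss_epoch', epoch_loss, epoch)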

5. Again because distributed training is multi-process and all_reduce keeps every GPU's values identical, it is enough for a single process to save the model:

if torch.distributed.get_rank() == 0:  # rank 0 by convention; any single rank works
    torch.save(net, "net.pth")
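
A common variant (a sketch, not from the original post, assuming DDPmodel is the wrapped model from the earlier snippets) saves the state dict of the underlying module, so the checkpoint keys carry no module. prefix and load directly into an un-wrapped model:

if torch.distributed.get_rank() == 0:
    # DDPmodel.module is the un-wrapped model, so the saved keys
    # have no 'module.' prefix
    torch.save(DDPmodel.module.state_dict(), "net_state.pth")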

6. Module names before and after wrapping (the reminder in the title). Wrapping a model with DataParallel or DistributedDataParallel nests it under a module. attribute, so every name returned by named_modules() gains a module. prefix:

print('Compare module names before and after wrapping')
for name, layer in model.named_modules():
    print(name)  # e.g. model.0.conv
# DP mode
if cuda and RANK == -1 and torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
for name, layer in model.named_modules():
    print(name)  # e.g. module.model.0.conv
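
This matters when loading checkpoints: keys saved from a wrapped model will not match a plain model. A minimal sketch that strips the prefix (the helper name is mine, hypothetical):

def strip_module_prefix(state_dict):
    # Remove the 'module.' prefix that DataParallel/DDP adds to every key
    return {k[len('module.'):] if k.startswith('module.') else k: v
            for k, v in state_dict.items()}

state = torch.load('net_state.pth', map_location='cpu')
model.load_state_dict(strip_module_prefix(state))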