1. Environment
Ubuntu 18.04
PyTorch 1.9.1+cu111
RTX 3060 × 4
2. Error Summary
2.1. Some model parameters expected on device cuda:0 were found on other CUDA devices
Problem description:
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 33781) of binary: /home/hopth/miniconda3/envs/siamcar/bin/python
Solution:
Since this was my first time using DP, I moved the model onto the device a second time after wrapping it with torch.nn.DataParallel(model), as follows:
```python
model = ModelBuilder().cuda()
dist_model = torch.nn.DataParallel(model).cuda()
```
In fact, torch.nn.DataParallel() takes care of placing the model on the appropriate devices itself. After removing the extra .cuda() from the code above, the problem was resolved:
```python
model = ModelBuilder().cuda()
dist_model = torch.nn.DataParallel(model)
```
**Note: model = ModelBuilder().cuda() may also need to be changed to device = torch.device("cuda:{}".format(int(local_rank))); model = ModelBuilder().to(device). Some more errors came up later and I changed things around rather haphazardly; the version that finally ran used device = torch.device("cuda:{}".format(int(local_rank))); model = ModelBuilder().to(device) in place of model = ModelBuilder().cuda(). I don't think the difference matters much, though.**
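To put the note above in code form (ModelBuilder and local_rank are the same names used in the full script in section 4; this is only a sketch of the two placements, not a complete program):

```python
import torch

# Variant 1: rely on the current default CUDA device
model = ModelBuilder().cuda()
dist_model = torch.nn.DataParallel(model)   # no extra .cuda() after wrapping

# Variant 2: pin the model to this process's GPU explicitly (what I ended up using)
device = torch.device("cuda:{}".format(int(local_rank)))
model = ModelBuilder().to(device)
dist_model = torch.nn.DataParallel(model)
```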
2.2. Inter-process communication timeout
Problem description:
[E ProcessGroupNCCL.cpp:566] [Rank 0] Watchdog caught collective operation timeout:
WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802779 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 32240) of binary: /home/hopth/miniconda3/envs/siamcar/bin/python
Solution:
Some people online suggest setting the timeout argument (a datetime.timedelta) of torch.distributed.init_process_group() to a larger value. But the default timeout is already a full 30 minutes; if the processes cannot communicate within that, an even longer timeout seems pointless to me.
My solution was to replace DP with DDP. The change is simple: turn dist_model = torch.nn.DataParallel(model) into dist_model = torch.nn.parallel.DistributedDataParallel(model). The timeout problem went away, and a new error appeared instead; see below.
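For reference, a minimal sketch of both options (the 7200-second timeout below is only an illustrative number, not a value I tuned):

```python
import datetime

import torch
import torch.distributed as dist

# Option A (the suggestion from the internet): pass a larger timeout.
# The timeout parameter of init_process_group takes a datetime.timedelta;
# the default is 30 minutes.
dist.init_process_group(backend='nccl',
                        timeout=datetime.timedelta(seconds=7200))

# Option B (what I actually did): swap DataParallel for DistributedDataParallel.
# dist_model = torch.nn.DataParallel(model)                     # before
dist_model = torch.nn.parallel.DistributedDataParallel(model)   # after
```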
2.3. Some parameters (or modules) did not take part in the backward pass
Problem description:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Solution:
I simply followed the hint in the error message and changed dist_model = torch.nn.parallel.DistributedDataParallel(model) to dist_model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True), and the error went away. Exactly why, I don't understand yet.
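To illustrate when this error typically appears (a toy example, not my actual model): if some submodule is skipped during the forward pass, its parameters never receive gradients, and DDP's gradient reducer keeps waiting for them unless find_unused_parameters=True is set:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """A toy module with a branch that is sometimes skipped in forward."""
    def __init__(self):
        super().__init__()
        self.always_used = nn.Linear(8, 8)
        self.sometimes_unused = nn.Linear(8, 8)

    def forward(self, x, use_branch=False):
        out = self.always_used(x)
        if not use_branch:
            # When the branch is skipped, sometimes_unused gets no gradient,
            # which is exactly what triggers the error above under DDP.
            return out
        return out + self.sometimes_unused(x)

# Wrapping such a model with DDP would then require:
# dist_model = nn.parallel.DistributedDataParallel(
#     ToyModel().cuda(), device_ids=[local_rank], find_unused_parameters=True)
```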
3. Warning Summary
I'm not going to write these up: I ignored all the warnings it raised and the code currently runs. I'll check whether they have any effect once the results are out.
4. Single-machine multi-GPU distributed training code
The .py file:
```python
# Set up local_rank: this argument is filled in automatically when you run
#   python -m torch.distributed.launch --nproc_per_node=4 --master_port=3333 <your train.py file>
# in the terminal.
import argparse
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser(description='my args')
parser.add_argument('--local_rank', type=int, default=0,
                    help='compulsory for pytorch launcher')
args = parser.parse_args()

# The launcher also exposes the ranks as environment variables
rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])
num_gpus = torch.cuda.device_count()
torch.cuda.set_device(rank % num_gpus)
dist.init_process_group(backend='nccl')
world_size = dist.get_world_size()

# Build the model on this process's GPU
device = torch.device("cuda:{}".format(int(local_rank)))
model = ModelBuilder().to(device)

# Build the dataset / data loader and shard it across processes
train_loader = build_data_loader()
train_sampler = None
if world_size > 1:
    train_sampler = DistributedSampler(train_dataset)
    train_loader = DataLoader(train_dataset,
                              batch_size=BATCH_SIZE,
                              num_workers=NUM_WORKERS,
                              pin_memory=True,
                              sampler=train_sampler)

# Wrap the model with DDP
dist_model = nn.parallel.DistributedDataParallel(model,
                                                 device_ids=[local_rank],
                                                 output_device=local_rank,
                                                 find_unused_parameters=True)

# Create the optimizer and lr_scheduler (note: pass dist_model.module, not dist_model)
optimizer, lr_scheduler = build_opt_lr(dist_model.module, START_EPOCH)

# Training loop
train(...)
```
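One detail the snippet above does not show: when a DistributedSampler is used, it is common to call set_epoch at the beginning of every epoch so that shuffling differs between epochs (MAX_EPOCH below is just a placeholder name):

```python
for epoch in range(START_EPOCH, MAX_EPOCH):   # MAX_EPOCH is a placeholder
    if train_sampler is not None:
        train_sampler.set_epoch(epoch)        # reshuffle differently each epoch
    train(...)
```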
Note 1: logs and the tb_writer generally only need to be written from a single process.
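A common pattern for this is to guard the TensorBoard writer (and any logging) with a rank check; a minimal sketch, where log_dir, loss, and global_step are placeholders:

```python
from torch.utils.tensorboard import SummaryWriter

tb_writer = None
if rank == 0:                                    # only the rank-0 process writes
    tb_writer = SummaryWriter(log_dir='./logs')

# ... inside the training loop ...
if tb_writer is not None:
    tb_writer.add_scalar('train/loss', loss.item(), global_step)
```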
Terminal commands:
```bash
export OMP_NUM_THREADS=1
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port=3333 <your train.py file>
```
Note 1: export OMP_NUM_THREADS=1 is not required; it only limits the number of threads. Without it the code still runs, but each process may use multiple threads to speed things up.
Note 2: distributed training cannot be started by simply clicking Run on the .py file in an IDE or platform, because that only launches a single process (a single rank) by default.