Problem description:
- Multi-GPU training launched with torchrun --nnodes 1 --nproc_per_node=4; the runtime environment is unchanged and known to work.
- The only code change was increasing the channel size of a Tensor variable, which enlarged the model.
- After submitting the job to the server, it fails with the following error:
File "/home/---/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 165592 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 165593 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 165594 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 165595) of binary
...
Initial diagnosis: the failure is in CUDA memory allocation. CUBLAS_STATUS_ALLOC_FAILED means cuBLAS could not allocate GPU memory when creating its handle, i.e., the enlarged model has exhausted the available GPU memory.
Fix: reduce the batch size to offset the extra memory consumed by the wider channels, e.g., change --batch-size 300 to --batch-size 256.
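A rough back-of-envelope sketch of why shrinking the batch helps: activation memory scales linearly with both batch size and channel count, so growth in one can be offset by a reduction in the other. The shapes and channel numbers below are hypothetical placeholders for illustration, not values from the original code.

```python
def activation_bytes(batch, channels, height, width, dtype_bytes=4):
    """Memory in bytes of one float32 activation tensor of shape (N, C, H, W)."""
    return batch * channels * height * width * dtype_bytes

# Hypothetical example: channels grown from 256 to 320.
before = activation_bytes(300, 256, 64, 64)   # old channels, old batch
after  = activation_bytes(300, 320, 64, 64)   # new channels, same batch
shrunk = activation_bytes(256, 320, 64, 64)   # new channels, reduced batch

print(after / before)   # memory grew by the channel ratio, 1.25x
print(shrunk / before)  # batch 256 brings usage back near the original
```

This is only a proportionality argument for activations; weights, gradients, and optimizer state also grow with the channel count, so in practice the batch size may need to drop a bit further than the ratio suggests.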