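Problem description:
- Multi-GPU run launched with torchrun --nnodes 1 --nproc_per_node=4; the runtime environment works normally and has not changed.
- The only change was increasing the channel size of a Tensor variable in the code, which made the model larger.
- After submitting the job to the server, the code fails with the following error: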
File "/home/---/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 165592 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 165593 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 165594 clo