Error details: RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
This error occurred while I was using torch.nn.parallel.DistributedDataParallel to train models in parallel. I launched program A with python -m torch.distributed.launch --nproc_per_node=2 trainA.py
and it worked fine. Then, while A was still running, I tried to launch program B with python -m torch.distributed.launch --nproc_per_node=2 trainB.py
and ended up with the error above.
It turns out that the issue is a port conflict. As the error reports, port 29500 (the default master port of torch.distributed.launch) is already in use, in this case by program A. Giving program B a different master port therefore works. I used the command python -m torch.distributed.launch --nproc_per_node=2 --master_port='29501' trainB.py
Problem solved!!!
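If you would rather not guess a port number by hand, you can ask the operating system for one that is currently free. The snippet below is only a minimal sketch and not part of the original fix: the helper name find_free_port is made up for illustration, and it assumes both jobs run on the same machine. There is also a small race window, since another process could grab the port between the check and the launch.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port that is free right now (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 lets the OS choose any unused port.
        s.bind(("", 0))
        # getsockname() returns (host, port); we only need the port.
        return s.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    # Print the launch command for program B with the port filled in.
    print(
        "python -m torch.distributed.launch --nproc_per_node=2 "
        f"--master_port={port} trainB.py"
    )
```

Running the script prints a launch command for trainB.py whose --master_port value should not collide with program A.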