The error is as follows:
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
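The cause is that the default master port, 29500, is already in use, typically by another process on the same machine, so the server socket cannot bind to it. The fix is as follows.
Change the original command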
python -m torch.distributed.launch --nproc_per_node=2 xx.py
to the following, which specifies a different port:
python -m torch.distributed.launch --nproc_per_node=2 --master_port=29501 xx.py
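
The port 29501 is just one choice; any unused port should work. If it is unclear which ports are free, a small helper can check a port or ask the operating system for an unused one. The following is a minimal sketch using only Python's standard socket module. The function names is_port_free and find_free_port are hypothetical, not part of PyTorch, and the check is best-effort: another process could take the port between the check and the actual launch.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if binding to (host, port) succeeds (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Bind failed, e.g. errno 98 (Address already in use) on Linux.
            return False
        return True


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused port by binding to port 0 (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        # getsockname() returns (address, port); the port was chosen by the OS.
        return s.getsockname()[1]


if __name__ == "__main__":
    print("29500 free:", is_port_free(29500))
    print("suggested --master_port:", find_free_port())
```

The printed port can then be passed on the command line as --master_port=<port>.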