Traceback (most recent call last):
  File "train.py", line 159, in <module>
    train(args=args)
  File "train.py", line 50, in train
    rank = args.local_rank
  File "/home/wby/anaconda3/envs/wby/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/wby/anaconda3/envs/wby/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Address already in use
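The problem is that the TCP port used for the rendezvous is already occupied. One fix is to pass a different port to the launcher when starting the program; any free port number will do: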
--master_port 29501
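Rather than guessing a number, you can ask the operating system for a port that is free right now. The snippet below is only a minimal sketch: `find_free_port` is a hypothetical helper name, not part of PyTorch, and it assumes the training runs on a single machine so that checking the local host is enough.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Return a TCP port that is unused at the moment of the call.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 asks the kernel to pick any free port.
        sock.bind((host, 0))
        # getsockname() returns (address, port) for an IPv4 socket.
        return sock.getsockname()[1]


if __name__ == "__main__":
    # Print the port so it can be passed on as: --master_port <port>
    print(find_free_port())
```

The socket is closed before the port is returned, so another process could in principle grab the port in the short window before training starts. For a one-off launch this is usually acceptable.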
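Another approach is to find out which port is occupied (for example, by inserting a print statement in the program), look up the PID of the process holding that port with netstat -nltp (the PID is shown in the last column), and then release the port with kill -9 PID. The occupying process is often a leftover from an earlier training run that did not exit cleanly.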
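The "insert a print" step can be sketched as follows. This assumes the script is started through a launcher that exports the MASTER_PORT environment variable (as torch.distributed.launch does) and falls back to 29500 otherwise; `port_in_use` is a hypothetical helper name, not a PyTorch function.

```python
import os
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding to (host, port) fails.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use", the same condition
            # that makes the rendezvous fail.
            return True
        return False


if __name__ == "__main__":
    # Assumption: the launcher sets MASTER_PORT; 29500 is used as a fallback.
    port = int(os.environ.get("MASTER_PORT", "29500"))
    print(f"rendezvous port: {port}, in use: {port_in_use(port)}")
```

Run this before the call to init_process_group. If it reports the port as in use, search the netstat -nltp output for that port number to get the PID.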