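Problem description:
When training a deep learning model with a command of the form nohup … > log.txt &, the output log reports the error Connection to remote host was lost…
The related commands are covered in another post of mine:
How to run deep learning projects on a remote Linux server
The specific error: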
[Errno 111] Connection refused
Traceback (most recent call last):
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/site-packages/urllib3/connection.py", line 170, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/site-packages/urllib3/util/connection.py", line 96, in create_connection
raise err
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/site-packages/urllib3/util/connection.py", line 86, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/site-packages/urllib3/connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/site-packages/urllib3/connection.py", line 234, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/http/client.py", line 1287, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/http/client.py", line 1333, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/http/client.py", line 1282, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/http/client.py", line 1042, in _send_output
self.send(msg)
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/http/client.py", line 980, in send
self.connect()
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/site-packages/urllib3/connection.py", line 200, in connect
conn = self._new_conn()
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/site-packages/urllib3/connection.py", line 182, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f5bc655ce48>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/anaconda3/envs/p2p/lib/python3.6/site-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=2019): Max retries exceeded with url: /events (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5bc655ce48>: Failed to establish a new connection: [Errno 111] Connection refused',))
...... (similar errors keep repeating in a loop)
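Cause of the error:
The training script uses the visdom visualization tool to monitor training. The error is raised because the script cannot reach the visdom server (in the traceback above, the connection to localhost:2019 is refused).
Possible specific causes:
① The visdom server was not started with python -m visdom.server before training began.
② The remote connection was closed during training, which terminated the python -m visdom.server process, so visdom stopped working.
③ The configured port is already in use.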
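
To narrow down which of these causes applies, it can help to test whether anything is accepting connections on the port. The following is a minimal sketch using only the Python standard library; the helper name port_is_listening is my own and not part of visdom. If it returns False while training is running, the server is not up (cause ① or ②). If it returns True before you have started visdom yourself, some other process holds the port (cause ③).

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # the same [Errno 111] seen in the traceback) when nothing listens.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 2019 is the port from the traceback above; visdom's default is 8097.
    print(port_is_listening("localhost", 2019))
```

This only tells you that some process is listening, not that the process is visdom, so it complements rather than replaces a tool such as lsof.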
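Solution:
1) Before starting training, run lsof -i:<port>
to check whether the port you need is already in use (LISTEN state). If it is, you can either terminate the process with kill -9 <PID>
or change the port that visdom uses in your program (here I changed the port to 2019).
2) Once you have confirmed that the port is free, run nohup python -m visdom.server &
to start the visdom server in the background, so that it keeps running after you disconnect.
If you changed the port, start the server with nohup python -m visdom.server -p <port> &
instead. The corresponding URL is http://localhost:<port>/
3) Then you can happily start training~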
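
Step 2 can also be done from Python instead of the shell. The sketch below is one possible approach under stated assumptions, not the method used in this post: it assumes a POSIX system, the helper names (start_detached, wait_until_listening, start_visdom) and the log file name visdom.log are hypothetical, and it passes the port with -port, the spelling used in visdom's documentation. It starts the server in its own session so that it is not tied to the SSH terminal, then waits until the port accepts connections before training begins.

```python
import socket
import subprocess
import sys
import time


def start_detached(cmd, log_path):
    """Start cmd in its own session, roughly like `nohup cmd > log_path &`."""
    with open(log_path, "ab") as log:
        # The child inherits the log file descriptor; the parent's copy is
        # closed when the with-block exits.
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the controlling terminal (POSIX)
        )


def wait_until_listening(host, port, timeout=30.0):
    """Poll host:port until a TCP connection succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False


def start_visdom(port=2019, log_path="visdom.log"):
    """Launch `python -m visdom.server` on the given port and wait for it."""
    cmd = [sys.executable, "-m", "visdom.server", "-port", str(port)]
    proc = start_detached(cmd, log_path)
    if not wait_until_listening("localhost", port):
        raise RuntimeError("visdom did not come up; see " + log_path)
    return proc
```

Calling start_visdom(2019) once before launching the training script would then play the role of the nohup command in step 2, and a failure to start shows up immediately instead of as repeated Connection refused errors in the training log.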