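Background
I've been running experiments lately that use an LLM to synthesize data in bulk through the API of a model deployed in our lab. The server is in China and no VPN is involved, yet requests.post kept failing with Failed to establish a new connection: [WinError 10060], or the program would simply freeze partway through. The experiments were interrupted again and again, which cost a lot of time. Now that the deadline rush is over, I sat down to work out what is going on.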
Error message
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='**.**.**.**', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000022B3A071730>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
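The cause, up front
The connection attempt timed out while the connection was being established, and the requests call did not set a timeout.
The fix, up front
First set a timeout. Then catch the exception and reconnect (or handle it some other way); retrying a few times greatly reduces the chance that the program is interrupted.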
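As a rough sketch of the fix just described, retries can also be delegated to the library by mounting an HTTPAdapter configured with urllib3's Retry on a Session. The helper name build_session, the retry counts, and the placeholder names in the usage comment are my own assumptions for illustration, not something taken from the original experiment.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3, backoff=0.5):
    """Return a Session whose adapters retry failed connection attempts."""
    retry = Retry(
        total=total_retries,     # overall cap on retries per request
        connect=total_retries,   # retries for errors raised while connecting
        read=0,                  # do not resend a request the server may already be processing
        backoff_factor=backoff,  # pause between attempts grows with each retry
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


# Hypothetical usage; API_URL, payload and headers are placeholders.
# session = build_session()
# resp = session.post(API_URL, json=payload, headers=headers, timeout=(10, 120))
```

The timeout still has to be passed on every call, because requests has no session-wide timeout setting; the adapter only decides what happens after an attempt fails.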
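How I got there
I started by searching for the error message. Setting aside IP-address and network problems, many posts blame the Connection header, which defaults to keep-alive (HTTP/1.1 uses persistent connections: the connection is not torn down as soon as a post/get finishes but is kept open for reuse). The claim is that a burst of requests in a short time creates many persistent connections, exceeds the supported limit on HTTP connections, and triggers the error, and that setting Connection to close (a short-lived connection that is closed as soon as the request finishes) fixes it.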
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {api_key}',
    'Connection': 'close'  # added
}
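Having been burned by online tutorials before, and believing that practice is the real test, I wanted to check whether this is true. If too many HTTP connections are the cause, then let's see how many connections my PC currently has to the server. A search showed that running netstat -n in PowerShell lists the machine's connections, so I started the code and looked.
With 5 threads running at once there were twenty-odd TCP connections; 5 of them were ESTABLISHED and the rest were TIME_WAIT. But what do ESTABLISHED and TIME_WAIT mean? Reference: https://draveness.me/whys-the-design-tcp-time-wait/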
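ESTABLISHED: the connection is open and data can flow in both directions.
TIME_WAIT: a state entered while a TCP connection is being closed, on the side that initiates the close. Closing a connection really just tells the peer that you have no more data to send; you can still receive data from the peer. A typical close is the four-way teardown (not a handshake): the closing side sends FIN, the peer replies with ACK, the peer later sends its own FIN, and the closing side replies with ACK.
After the client initiates the close and sends that final ACK, it waits for two maximum segment lifetimes (MSL) before entering CLOSED, and during that wait it is in TIME_WAIT. With the 2-minute MSL from the TCP specification that is about 4 minutes; operating systems often configure a shorter value.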
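Counting states by eye gets tedious once several threads are running. The following is a small sketch that tallies TCP states for one remote address from netstat -n style text. The function name count_tcp_states, the sample text, and the documentation-range IP addresses are made up for illustration, and the parser assumes the usual four columns (Proto, Local Address, Foreign Address, State) with IPv4 addresses.

```python
from collections import Counter


def count_tcp_states(netstat_output, remote_ip):
    """Count TCP states for connections whose foreign address is remote_ip."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto, Local Address, Foreign Address, State
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        foreign_host = fields[2].rsplit(":", 1)[0]  # strip the ":port" suffix
        if foreign_host == remote_ip:
            states[fields[3]] += 1
    return states


# Made-up sample in the shape of `netstat -n` output.
SAMPLE = """\
  TCP    192.168.1.5:50001    203.0.113.7:443    ESTABLISHED
  TCP    192.168.1.5:50002    203.0.113.7:443    TIME_WAIT
  TCP    192.168.1.5:50003    203.0.113.7:443    TIME_WAIT
  TCP    192.168.1.5:50004    198.51.100.9:80    ESTABLISHED
"""
print(count_tcp_states(SAMPLE, "203.0.113.7"))
```

In practice the text would come from running netstat -n yourself and pasting or piping its output into the function.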
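Here is the puzzle: the 5 ESTABLISHED connections obviously match the 5 threads, so why are all the other TCP connections in a closing state? Doesn't that contradict the definition of keep-alive? To be rigorous, I ran the experiment once with each Connection setting, including close: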
import requests
import time

# Experiment against bilibili; ping the domain to find its IP address
requests.get('https://www.bilibili.com', headers={'Connection': 'keep-alive'})
# requests.get('https://www.bilibili.com', headers={'Connection': 'close'})
# requests.get('https://www.bilibili.com')
time.sleep(300)  # keep the process alive so the connection can be inspected with netstat
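keep-alive and the default: there is one connection to bilibili in the TIME_WAIT state, which means the connection was closed even though Connection was keep-alive. Why?
close: no connection entry can be found, yet printing the response shows 200, so the connection was made. According to https://zhuanlan.zhihu.com/p/648729501, "in most web server implementations, whichever side disables HTTP Keep-Alive, it is the server that actively closes the connection." That explains it: by the TCP close sequence described above, when the server closes first it is the server that enters TIME_WAIT, while the client reaches CLOSED without passing through TIME_WAIT, so no connection entry is visible on the client.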
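Looking up how keep-alive is meant to be used, I found that the official docs pair it with Session(), i.e. requests.Session().post():
Reference: http://docs.python-requests.org/en/master/user/advanced/#keep-alive
I didn't know how requests.Session().post() differs from requests.post(), so I dug into the source. api.py: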
def post(url, data=None, json=None, **kwargs):
    r"""Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) A JSON serializable Python object to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request("post", url, data=data, json=json, **kwargs)

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object: ``GET``, ``OPTIONS``, ``HEAD``, ``POST``, ``PUT``, ``PATCH``, or ``DELETE``.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) A JSON serializable Python object to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content_type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
        the server's TLS certificate, or a string, in which case it must be a path
        to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'https://httpbin.org/get')
      >>> req
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)
and sessions.py:
    def post(self, url, data=None, json=None, **kwargs):
        r"""Sends a POST request. Returns :class:`Response` object.

        :param url: URL for the new :class:`Request` object.
        :param data: (optional) Dictionary, list of tuples, bytes, or file-like
            object to send in the body of the :class:`Request`.
        :param json: (optional) json to send in the body of the :class:`Request`.
        :param \*\*kwargs: Optional arguments that ``request`` takes.
        :rtype: requests.Response
        """

        return self.request("POST", url, data=data, json=json, **kwargs)
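So requests.post() and requests.Session().post() both end up calling Session.request(); the underlying logic is identical. The difference is that the former wraps the session in a with block, and a with block closes the session automatically when the call completes (https://www.cnblogs.com/wongbingming/p/10519553.html). In other words, a direct requests.post() call closes its connection automatically even though Connection defaults to keep-alive. To actually get a persistent keep-alive connection, use the second form: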
import requests
import time

s = requests.Session()
s.post('https://www.bilibili.com')
time.sleep(300)  # keep the process alive so the connection can be inspected with netstat
The connection stays in the ESTABLISHED state: the persistent connection really was established!
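The same effect can be checked without netstat and without touching a public site. The sketch below starts a throwaway HTTP server on 127.0.0.1 that records the client's source port for every request, then compares one reused Session against a fresh Session per request (which mirrors what requests.get does internally, as the source above shows). The names PortRecorder and source_ports are invented for this illustration, and trust_env is switched off only so that proxy environment variables cannot interfere with the local test.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests


class PortRecorder(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows the server to keep connections open
    seen_ports = []                # client source port of every request received

    def do_GET(self):
        PortRecorder.seen_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass


def source_ports(reuse_session, n=3):
    """Send n GETs to a local server and return the client ports it saw."""
    PortRecorder.seen_ports = []
    server = ThreadingHTTPServer(("127.0.0.1", 0), PortRecorder)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    try:
        if reuse_session:
            with requests.Session() as s:
                s.trust_env = False
                for _ in range(n):
                    s.get(url, timeout=5)
        else:
            for _ in range(n):
                # A new Session per call, closed right after the request
                with requests.Session() as s:
                    s.trust_env = False
                    s.get(url, timeout=5)
    finally:
        server.shutdown()
        server.server_close()
    return list(PortRecorder.seen_ports)


print("reused session:", source_ports(True))
print("session per request:", source_ports(False))
```

With a reused Session the server should see the same source port every time, meaning a single TCP connection carried all the requests; with a Session per request each call arrives on a new connection.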
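Since my program was creating short-lived connections all along, is the error still caused by exceeding a connection limit? With only twenty-odd TCP connections, most of them in TIME_WAIT, that seemed unlikely, so I tested how many short-lived connections can be created: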
import requests
import tqdm

# Hitting a site repeatedly may get your IP banned; better to use your own server
num = 0
try:
    for i in tqdm.tqdm(range(1000)):
        requests.get('http://www.bilibili.com', headers={'Connection': 'close'})
        num += 1
except Exception as e:
    print(e)
print(num)  # number of requests that completed without an exception
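All 1000 iterations ran without a single exception, so too many connections is clearly not the problem…
Is it a connection timeout, then? But the error message says Caused by NewConnectionError plus WinError 10060, whereas the timeout errors people usually report are ConnectTimeoutError or Read timed out. If a timeout is the cause, why is the message different, and what is the difference?
Looking closer, the code behind those ConnectTimeoutError and Read timed out reports all sets a timeout: https://www.cnblogs.com/gl1573/p/10129382.html
My error also carries a WinError, so my guess was that, because I set no timeout, the timeout is raised by the operating system instead. Let's test: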
import requests

# requests.get('https://www.bilibili.com', timeout=0.0001)  # 1. connect and read timeouts both 0.0001 s: triggers a connect timeout
# requests.get('https://www.bilibili.com', timeout=(10, 0.0001))  # 2. connect timeout 10 s, read timeout 0.0001 s: triggers a read timeout
requests.get('https://www.google.com')  # 3. reaching google from China without a proxy and without a timeout: triggers a connect timeout
- No timeout set, connect timeout:
- No timeout set, read timeout:
- Timeout set, connect timeout:
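That's odd: it is still ConnectTimeoutError. It may depend on the system, so I repeated the test in the environment where I run my experiments:
Now it all fits! In my experiment environment, setting a timeout produces ConnectTimeoutError or Read timed out, and not setting one produces Caused by NewConnectionError plus WinError 10060, but in both cases the underlying event is a connection timeout (a network problem). The program hanging for a long time without any error is very likely also a network problem: data sent by the server is lost, and with no read timeout set the client waits for it forever.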
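To make that distinction explicit in code, here is a small sketch that maps a requests exception to the phase in which the request failed. The function names classify and probe are my own. It relies on the exception hierarchy in requests.exceptions, where ConnectTimeout derives from both ConnectionError and Timeout, and ReadTimeout derives from Timeout. An OS-level failure such as WinError 10060 surfaces as a plain ConnectionError, as in the traceback at the top of this post.

```python
import requests
from requests import exceptions as rex


def classify(exc):
    """Map a requests exception to the phase in which the request failed."""
    # ConnectTimeout inherits from both ConnectionError and Timeout,
    # so it has to be tested before the broader classes.
    if isinstance(exc, rex.ConnectTimeout):
        return "connect timeout"
    if isinstance(exc, rex.ReadTimeout):
        return "read timeout"
    if isinstance(exc, rex.ConnectionError):
        return "connection error"
    return "other"


def probe(url, timeout=None):
    """Issue one GET and report how it ended; url is supplied by the caller."""
    try:
        requests.get(url, timeout=timeout)
        return "ok"
    except rex.RequestException as exc:
        return classify(exc)
```

For example, probe(url, timeout=(10, 0.0001)) against a reachable site would be expected to report a read timeout, while calling probe(url) with no timeout against an unreachable host leaves the waiting to the operating system.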
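After an evening of painful analysis, searching, and experiments, the cause turns out to be far simpler than the road to it! I conclude that the error comes from a connection timeout (probably network-related) combined with no timeout being set. The fix is simple: set a timeout and add code that catches the exception and reconnects or handles it some other way; retrying a few times greatly reduces the chance that the program is interrupted.
Example fix: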
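import requests

s = requests.Session()
for i in range(3):  # try up to 3 times
    try:
        # The 0.01 s read timeout is tiny so that the retry path is likely to run;
        # use a realistic value in real code.
        s.get('https://www.bilibili.com', timeout=(10, 0.01))
        break
    except requests.exceptions.RequestException as e:
        print(e)
        print(f'retry {i+1}')
        if i == 2:
            print('Max retries exceeded, skipping this request')
print('program going on')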
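For a batch job, the loop above can be wrapped in a helper so that one failed request is skipped instead of stopping the run. This is only a sketch under my own assumptions: the name post_with_retry, the default timeout of (10, 120) seconds, and the doubling backoff are illustrative choices, not values from the original experiment.

```python
import time

import requests


def post_with_retry(session, url, payload, attempts=3, timeout=(10, 120), backoff=1.0):
    """POST payload as JSON; return the Response, or None after `attempts` failures."""
    for attempt in range(1, attempts + 1):
        try:
            return session.post(url, json=payload, timeout=timeout)
        except requests.exceptions.RequestException as exc:
            print(f"attempt {attempt}/{attempts} failed: {exc!r}")
            if attempt < attempts:
                # Wait backoff, then 2*backoff, then 4*backoff ... before retrying
                time.sleep(backoff * 2 ** (attempt - 1))
    return None  # caller decides whether to skip or log this item
```

One caveat: retrying a POST after a read timeout can make the server process the same request twice. For data synthesis that mostly costs extra compute, but it is worth keeping in mind for requests with side effects.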
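This is a record of how I chased down the bug. Some of it may well be wrong, but I have already spent too much time on this and have none left for deeper study. Corrections and criticism are welcome.
All experiment code: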
import requests
import time
import tqdm

if __name__ == '__main__':
    # Experiment against bilibili; ping the domain to find its IP address
    # requests.get('https://www.bilibili.com', headers={'Connection': 'keep-alive'})
    # requests.get('https://www.bilibili.com', headers={'Connection': 'close'})
    # requests.get('https://www.bilibili.com')
    # time.sleep(300)

    # s = requests.Session()
    # s.post('https://www.bilibili.com')
    # time.sleep(300)

    # Hitting a site repeatedly may get your IP banned; better to use your own server
    # num = 0
    # try:
    #     for i in tqdm.tqdm(range(1000)):
    #         requests.get('http://www.bilibili.com', headers={'Connection': 'close'})
    #         num += 1
    # except Exception as e:
    #     print(e)
    # print(num)

    # requests.get('https://www.bilibili.com', timeout=0.0001)  # 1. connect and read timeouts both 0.0001 s: triggers a connect timeout
    # requests.get('https://www.bilibili.com', timeout=(10, 0.0001))  # 2. connect timeout 10 s, read timeout 0.0001 s: triggers a read timeout
    requests.get('https://www.google.com')  # 3. reaching google from China without a proxy and without a timeout: triggers a connect timeout