I've recently been developing a project with Scrapy.
The project uses a proxy.
Running locally, everything worked and data was scraped normally.
But after deploying it to Gerapy on the server, it errored out; the log showed:
2022-10-08 17:03:24 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.xxx.com/en/product?o=100&p=100> (failed 3 times): User timeout caused connection failure: Getting https://www.xxx.com/en/product?o=100&p=100 took longer than 180.0 seconds..
2022-10-08 17:03:24 [xxx] ERROR: <twisted.python.failure.Failure twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.xxx.com/en/product?o=100&p=100 took longer than 180.0 seconds..>
2022-10-08 17:03:24 [xxx] ERROR: TimeoutError on https://www.xxx.com/en/product?o=100&p=100
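The 180-second limit in the log is Scrapy's default DOWNLOAD_TIMEOUT, and the three attempts ("failed 3 times") are the initial request plus the default two retries from RETRY_TIMES. While debugging timeouts like this, it can help to fail faster; a minimal settings.py sketch (the values below are illustrative, not from the original project):

```python
# settings.py -- illustrative values for faster feedback while debugging
DOWNLOAD_TIMEOUT = 60   # default is 180 seconds, matching the timeout in the log
RETRY_TIMES = 2         # default is 2 retries, i.e. "failed 3 times" in total
RETRY_ENABLED = True
```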
I assumed the proxy was the culprit, so I removed it; the spider still ran fine locally.
After redeploying to Gerapy, it failed again, this time with:
2022-10-09 10:39:45 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.xxx.com/en/product?o=100&p=100> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2022-10-09 10:39:56 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.xxx.com/en/product?o=100&p=100> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
Some searching turned up this Stack Overflow thread:
python - twisted.internet.error.TimeoutError: User timeout caused connection failure - Stack Overflow https://stackoverflow.com/questions/51913874/twisted-internet-error-timeouterror-user-timeout-caused-connection-failure
Following it, I added an errback to the spider:
# required imports
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError

# attach the errback when building the request
yield scrapy.Request(url=url, meta={'dont_redirect': True}, dont_filter=True,
                     callback=self.parse_list, errback=self.errback_work)

# errback definition
def errback_work(self, failure):
    self.logger.error(repr(failure))
    if failure.check(HttpError):
        # an HTTP error response was received, so a response object exists
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)
    elif failure.check(DNSLookupError):
        # DNS resolution failed; only the original request is available
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)
    elif failure.check(TimeoutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)
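To sanity-check the errback dispatch logic without running a crawl, the small slice of twisted's Failure interface that the errback touches (check(), value, request) can be faked with a test double. Everything below is a hypothetical harness of my own, not twisted's real Failure, and it uses the builtin TimeoutError as a stand-in for twisted.internet.error.TimeoutError:

```python
import logging

logging.basicConfig(level=logging.ERROR)

class FakeRequest:
    """Stand-in for scrapy.Request; only .url is needed here."""
    def __init__(self, url):
        self.url = url

class FakeFailure:
    """Mimics the slice of twisted.python.failure.Failure the errback uses:
    check() returns the matched exception class, or None."""
    def __init__(self, exc, request):
        self.value = exc
        self.request = request

    def check(self, *exc_types):
        return type(self.value) if isinstance(self.value, exc_types) else None

class SpiderStub:
    logger = logging.getLogger("spider")

    def errback_work(self, failure):
        self.logger.error(repr(failure))
        if failure.check(TimeoutError):  # builtin TimeoutError as a stand-in
            self.logger.error('TimeoutError on %s', failure.request.url)

spider = SpiderStub()
spider.errback_work(
    FakeFailure(TimeoutError(), FakeRequest("https://www.xxx.com/en/product"))
)
```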
After redeploying to Gerapy, the error still occurred.
Then someone more experienced suggested checking the Scrapy version: mine was 2.6.1 while theirs was 2.5.1, so I downgraded:
pip3 install scrapy==2.5.1
Running locally again, a new error appeared:
AttributeError: module 'OpenSSL.SSL' has no attribute 'SSLv3_METHOD'
I checked the pyOpenSSL version:
pip3 show pyopenssl
It was incompatible with Scrapy 2.5.1, so I pinned it to 22.0.0:
pip3 install pyopenssl==22.0.0
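To confirm the pins took effect, a small stdlib check can print the installed versions. The AttributeError above suggests the installed pyOpenSSL no longer exposed SSLv3_METHOD (it was removed in newer pyOpenSSL releases, 22.1.0 and later to my understanding), which older Scrapy versions still reference; hence the pin must stay below 22.1.0. The helper names below are my own:

```python
from importlib.metadata import version, PackageNotFoundError

def parse_version(v: str) -> tuple:
    """Turn a version string like '22.0.0' into (22, 0, 0) for comparison."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def pyopenssl_pin_ok(installed: str) -> bool:
    """pyOpenSSL 22.1.0 dropped SSL.SSLv3_METHOD, so stay below it."""
    return parse_version(installed) < (22, 1, 0)

# Print whatever is installed (safe to run even if a package is missing)
for pkg in ("scrapy", "pyopenssl"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```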
Running locally again: everything worked!
And after redeploying to Gerapy, it ran normally there too!