I've recently been developing a project with Scrapy.
The project uses a proxy.
Running locally, everything worked and data was scraped normally.
But after deploying it to Gerapy on the server, it errored out; the log showed:
2022-10-08 17:03:24 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.xxx.com/en/product?o=100&p=100> (failed 3 times): User timeout caused connection failure: Getting https://www.xxx.com/en/product?o=100&p=100 took longer than 180.0 seconds..
2022-10-08 17:03:24 [xxx] ERROR: <twisted.python.failure.Failure twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.xxx.com/en/product?o=100&p=100 took longer than 180.0 seconds..>
2022-10-08 17:03:24 [xxx] ERROR: TimeoutError on https://www.xxx.com/en/product?o=100&p=100
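The 180-second limit in the log is Scrapy's default DOWNLOAD_TIMEOUT, and the three attempts ("failed 3 times") are the initial request plus the default two retries from RETRY_TIMES. While debugging timeouts like this, it can help to fail faster; a minimal settings.py sketch (the values below are illustrative, not from the original project):

```python
# settings.py -- illustrative values for faster feedback while debugging
DOWNLOAD_TIMEOUT = 60   # default is 180 seconds, matching the timeout in the log
RETRY_TIMES = 2         # default is 2 retries, i.e. "failed 3 times" in total
RETRY_ENABLED = True
```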
I assumed the proxy was the culprit, so I removed it; the spider still ran fine locally.
After redeploying to Gerapy, it failed again, this time with:
2022-10-09 10:39:45 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.xxx.com/en/product?o=100&p=100> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2022-10-09 10:39:56 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.xxx.com/en/product?o=100&p=100> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
Some searching turned up this Stack Overflow thread:
python - twisted.internet.error.TimeoutError: User timeout caused connection failure - Stack Overflow https://stackoverflow.com/questions/51913874/twisted-internet-error-timeouterror-user-timeout-caused-connection-failure
Following it, I added an errback to the spider:
# required imports
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError

# attach the errback when building the request
yield scrapy.Request(url=url, meta={'dont_redirect': True}, dont_filter=True,
                     callback=self.parse_list, errback=self.errback_work)

# errback definition
def errback_work(self, failure):
    self.logger.error(repr(failure))
    if failure.check(HttpError):
        # an HTTP error response was received, so a response object exists
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)
    elif failure.check(DNSLookupError):
        # DNS resolution failed; only the original request is available
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)
    elif failure.check(TimeoutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)
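To sanity-check the errback dispatch logic without running a crawl, the small slice of twisted's Failure interface that the errback touches (check(), value, request) can be faked with a test double. Everything below is a hypothetical harness of my own, not twisted's real Failure, and it uses the builtin TimeoutError as a stand-in for twisted.internet.error.TimeoutError:

```python
import logging

logging.basicConfig(level=logging.ERROR)

class FakeRequest:
    """Stand-in for scrapy.Request; only .url is needed here."""
    def __init__(self, url):
        self.url = url

class FakeFailure:
    """Mimics the slice of twisted.python.failure.Failure the errback uses:
    check() returns the matched exception class, or None."""
    def __init__(self, exc, request):
        self.value = exc
        self.request = request

    def check(self, *exc_types):
        return type(self.value) if isinstance(self.value, exc_types) else None

class SpiderStub:
    logger = logging.getLogger("spider")

    def errback_work(self, failure):
        self.logger.error(repr(failure))
        if failure.check(TimeoutError):  # builtin TimeoutError as a stand-in
            self.logger.error('TimeoutError on %s', failure.request.url)

spider = SpiderStub()
spider.errback_work(
    FakeFailure(TimeoutError(), FakeRequest("https://www.xxx.com/en/product"))
)
```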
After redeploying to Gerapy, the error still occurred.
Then someone more experienced suggested checking the Scrapy version: mine was 2.6.1 while theirs was 2.5.1, so I downgraded:
pip3 install scrapy==2.5.1
Running locally again, a new error appeared:
AttributeError: module 'OpenSSL.SSL' has no attribute 'SSLv3_METHOD'
I checked the pyOpenSSL version:
pip3 show pyopenssl
It was incompatible with Scrapy 2.5.1, so I pinned it to 22.0.0:
pip3 install pyopenssl==22.0.0
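To confirm the pins took effect, a small stdlib check can print the installed versions. The AttributeError above suggests the installed pyOpenSSL no longer exposed SSLv3_METHOD (it was removed in newer pyOpenSSL releases, 22.1.0 and later to my understanding), which older Scrapy versions still reference; hence the pin must stay below 22.1.0. The helper names below are my own:

```python
from importlib.metadata import version, PackageNotFoundError

def parse_version(v: str) -> tuple:
    """Turn a version string like '22.0.0' into (22, 0, 0) for comparison."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def pyopenssl_pin_ok(installed: str) -> bool:
    """pyOpenSSL 22.1.0 dropped SSL.SSLv3_METHOD, so stay below it."""
    return parse_version(installed) < (22, 1, 0)

# Print whatever is installed (safe to run even if a package is missing)
for pkg in ("scrapy", "pyopenssl"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```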
Running locally again: everything worked!
And after redeploying to Gerapy, it ran normally there too!