Using asynchronous requests in Scrapy callbacks

Synchronous test

Using synchronous I/O inside Scrapy blocks all of its asynchronous I/O, because callbacks run on the Twisted reactor's single thread:

import requests
from scrapy import Request, Spider


class MySpider(Spider):
    """Test spider."""
    name = 'spider_test_aio'
    custom_settings = dict(
        LOG_LEVEL='DEBUG',
    )

    def start_requests(self):
        for i in range(10, 0, -1):
            yield Request('http://httpbin.org/delay/%s' % i, callback=self.parse)

    def parse(self, response):
        self.log(response)
        # Blocking call: nothing else runs until this returns (~10 s)
        requests.get('http://httpbin.org/delay/10')

2020-08-14 14:17:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/3> (referer: None)
2020-08-14 14:17:03 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/3>
2020-08-14 14:17:03 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): httpbin.org:80
2020-08-14 14:17:14 [urllib3.connectionpool] DEBUG: http://httpbin.org:80 "GET /delay/10 HTTP/1.1" 200 359
2020-08-14 14:17:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/10> (referer: None)
2020-08-14 14:17:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/9> (referer: None)
2020-08-14 14:17:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/7> (referer: None)
2020-08-14 14:17:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/6> (referer: None)
2020-08-14 14:17:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/2> (referer: None)
2020-08-14 14:17:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/5> (referer: None)
2020-08-14 14:17:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/4> (referer: None)
2020-08-14 14:17:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/8> (referer: None)
2020-08-14 14:17:14 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/9>
2020-08-14 14:17:14 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): httpbin.org:80
2020-08-14 14:17:26 [urllib3.connectionpool] DEBUG: http://httpbin.org:80 "GET /delay/10 HTTP/1.1" 200 359
2020-08-14 14:17:26 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/7>
2020-08-14 14:17:26 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): httpbin.org:80
2020-08-14 14:17:37 [urllib3.connectionpool] DEBUG: http://httpbin.org:80 "GET /delay/10 HTTP/1.1" 200 359
2020-08-14 14:17:37 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/10>
2020-08-14 14:17:37 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): httpbin.org:80
2020-08-14 14:17:48 [urllib3.connectionpool] DEBUG: http://httpbin.org:80 "GET /delay/10 HTTP/1.1" 200 359
2020-08-14 14:17:48 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/6>
2020-08-14 14:17:48 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): httpbin.org:80
2020-08-14 14:17:58 [urllib3.connectionpool] DEBUG: http://httpbin.org:80 "GET /delay/10 HTTP/1.1" 200 359
2020-08-14 14:17:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/1> (referer: None)
2020-08-14 14:17:58 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/8>
2020-08-14 14:17:58 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): httpbin.org:80
2020-08-14 14:18:09 [urllib3.connectionpool] DEBUG: http://httpbin.org:80 "GET /delay/10 HTTP/1.1" 200 359
2020-08-14 14:18:09 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/5>
2020-08-14 14:18:09 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): httpbin.org:80
2020-08-14 14:18:19 [urllib3.connectionpool] DEBUG: http://httpbin.org:80 "GET /delay/10 HTTP/1.1" 200 359
2020-08-14 14:18:19 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/4>
2020-08-14 14:18:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): httpbin.org:80
2020-08-14 14:18:31 [urllib3.connectionpool] DEBUG: http://httpbin.org:80 "GET /delay/10 HTTP/1.1" 200 359
2020-08-14 14:18:31 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/2>
2020-08-14 14:18:31 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): httpbin.org:80
2020-08-14 14:18:42 [urllib3.connectionpool] DEBUG: http://httpbin.org:80 "GET /delay/10 HTTP/1.1" 200 359
2020-08-14 14:18:42 [scrapy.extensions.logstats] INFO: Crawled 10 pages (at 10 pages/min), scraped 0 items (at 0 items/min)
2020-08-14 14:18:42 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/1>
2020-08-14 14:18:42 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): httpbin.org:80
2020-08-14 14:18:52 [urllib3.connectionpool] DEBUG: http://httpbin.org:80 "GET /delay/10 HTTP/1.1" 200 359

Total elapsed time is about 110 s: the longest asynchronous download (10 s) plus ten synchronous 10 s requests run back to back. Each blocking requests.get freezes the reactor, so the ten extra downloads execute strictly serially.
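The serialization is not Scrapy-specific; any blocking call inside a coroutine stalls the entire event loop. A standalone asyncio sketch (my own addition, not part of the test above) shows the same effect:

import asyncio
import time

async def blocking(n):
    time.sleep(1)               # blocking: freezes the whole loop for 1 s
    print('blocking', n)

async def nonblocking(n):
    await asyncio.sleep(1)      # yields to the loop: instances overlap
    print('nonblocking', n)

async def main():
    start = time.time()
    await asyncio.gather(*[blocking(i) for i in range(3)])
    print('blocking total: %.1fs' % (time.time() - start))      # ~3 s
    start = time.time()
    await asyncio.gather(*[nonblocking(i) for i in range(3)])
    print('nonblocking total: %.1fs' % (time.time() - start))   # ~1 s

asyncio.run(main())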

Switching to async

First, set TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor' in the project's settings.py; the setting has no effect when written in custom_settings (https://github.com/scrapy/scrapy/issues/4485).
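For reference, a minimal settings.py sketch (the comments are my own):

# settings.py
# Must be a project-level setting: placed in custom_settings it is
# read too late to swap the reactor (scrapy/scrapy#4485).
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

With the reactor set, the request inside the callback is rewritten with aiohttp: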

import aiohttp
from scrapy import Request, Spider

# UniversalItem is a project-specific Item subclass; the import path below
# is hypothetical — adjust it to your project.
from myproject.items import UniversalItem


class MySpider(Spider):
    """Test spider."""
    name = 'spider_test_aio'
    custom_settings = dict(
        LOG_LEVEL='DEBUG',
    )

    def start_requests(self):
        for i in range(10, 0, -1):
            yield Request('http://httpbin.org/delay/%s' % i, callback=self.parse)

    async def parse(self, response):
        self.log(response)
        data = await self.parse_with_asyncio()
        yield UniversalItem({'collection': 'test', 'data': {'rsp': data}})

    async def parse_with_asyncio(self):
        self.log('aiohttp begins.')
        async with aiohttp.ClientSession() as session:
            async with session.get('http://httpbin.org/delay/10') as additional_response:
                additional_data = await additional_response.text()
                self.log('aiohttp ends.')
                return additional_data

2020-08-14 14:36:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-08-14 14:36:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/4> (referer: None)
2020-08-14 14:36:48 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/4>
2020-08-14 14:36:48 [spider_test_aio] DEBUG: aiohttp begins.
2020-08-14 14:36:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/5> (referer: None)
2020-08-14 14:36:49 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/5>
2020-08-14 14:36:49 [spider_test_aio] DEBUG: aiohttp begins.
2020-08-14 14:36:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/3> (referer: None)
2020-08-14 14:36:49 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/3>
2020-08-14 14:36:49 [spider_test_aio] DEBUG: aiohttp begins.
2020-08-14 14:36:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/2> (referer: None)
2020-08-14 14:36:50 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/2>
2020-08-14 14:36:50 [spider_test_aio] DEBUG: aiohttp begins.
2020-08-14 14:36:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/6> (referer: None)
2020-08-14 14:36:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/7> (referer: None)
2020-08-14 14:36:50 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/6>
2020-08-14 14:36:50 [spider_test_aio] DEBUG: aiohttp begins.
2020-08-14 14:36:50 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/7>
2020-08-14 14:36:50 [spider_test_aio] DEBUG: aiohttp begins.
2020-08-14 14:36:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/1> (referer: None)
2020-08-14 14:36:51 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/1>
2020-08-14 14:36:51 [spider_test_aio] DEBUG: aiohttp begins.
2020-08-14 14:36:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/8> (referer: None)
2020-08-14 14:36:51 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/8>
2020-08-14 14:36:51 [spider_test_aio] DEBUG: aiohttp begins.
2020-08-14 14:36:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/9> (referer: None)
2020-08-14 14:36:53 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/9>
2020-08-14 14:36:53 [spider_test_aio] DEBUG: aiohttp begins.
2020-08-14 14:36:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/delay/10> (referer: None)
2020-08-14 14:36:54 [spider_test_aio] DEBUG: <200 http://httpbin.org/delay/10>
2020-08-14 14:36:54 [spider_test_aio] DEBUG: aiohttp begins.
2020-08-14 14:36:59 [spider_test_aio] DEBUG: aiohttp ends.
2020-08-14 14:37:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/delay/4>
None
2020-08-14 14:37:01 [spider_test_aio] DEBUG: aiohttp ends.
2020-08-14 14:37:01 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/delay/3>
None
2020-08-14 14:37:01 [spider_test_aio] DEBUG: aiohttp ends.
2020-08-14 14:37:01 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/delay/7>
None
2020-08-14 14:37:01 [spider_test_aio] DEBUG: aiohttp ends.
2020-08-14 14:37:01 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/delay/6>
None
2020-08-14 14:37:01 [spider_test_aio] DEBUG: aiohttp ends.
2020-08-14 14:37:01 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/delay/2>
None
2020-08-14 14:37:01 [spider_test_aio] DEBUG: aiohttp ends.
2020-08-14 14:37:01 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/delay/5>
None
2020-08-14 14:37:02 [spider_test_aio] DEBUG: aiohttp ends.
2020-08-14 14:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/delay/1>
None
2020-08-14 14:37:02 [spider_test_aio] DEBUG: aiohttp ends.
2020-08-14 14:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/delay/8>
None
2020-08-14 14:37:04 [spider_test_aio] DEBUG: aiohttp ends.
2020-08-14 14:37:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/delay/9>
None
2020-08-14 14:37:05 [spider_test_aio] DEBUG: aiohttp ends.
2020-08-14 14:37:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/delay/10>
None

The whole run now takes about 20 s, i.e. the sum of the two longest requests: the Scrapy delay/10 download plus the aiohttp delay/10 request issued from its callback. All the other downloads overlap with them.
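If a callback needs several auxiliary requests, the same pattern extends naturally: create them concurrently and await them together. A minimal sketch under the same spider, with a hypothetical helper fetch() and example URLs of my own choosing:

    async def parse(self, response):
        self.log(response)
        urls = ['http://httpbin.org/delay/5', 'http://httpbin.org/delay/10']  # hypothetical
        async with aiohttp.ClientSession() as session:
            # asyncio.gather runs the requests concurrently, so the extra
            # cost is max(5 s, 10 s) rather than their sum
            data = await asyncio.gather(*[self.fetch(session, url) for url in urls])
        yield UniversalItem({'collection': 'test', 'data': {'rsp': data}})

    async def fetch(self, session, url):
        async with session.get(url) as rsp:
            return await rsp.text()

(This also needs import asyncio at the top of the module.)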

Caveats

If the project is deployed to a server with scrapyd, you may hit this error:

/root/envs/scrapyd/lib/python3.6/site-packages/scrapy/utils/project.py:94: ScrapyDeprecationWarning: Use of environment variables prefixed with SCRAPY_ to override settings is deprecated. The following environment variables are currently defined: EGG_VERSION
  ScrapyDeprecationWarning
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/envs/scrapyd/lib/python3.6/site-packages/scrapyd/runner.py", line 40, in <module>
    main()
  File "/root/envs/scrapyd/lib/python3.6/site-packages/scrapyd/runner.py", line 37, in main
    execute()
  File "/root/envs/scrapyd/lib/python3.6/site-packages/scrapy/cmdline.py", line 142, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/root/envs/scrapyd/lib/python3.6/site-packages/scrapy/crawler.py", line 280, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/root/envs/scrapyd/lib/python3.6/site-packages/scrapy/crawler.py", line 156, in __init__
    self._handle_twisted_reactor()
  File "/root/envs/scrapyd/lib/python3.6/site-packages/scrapy/crawler.py", line 344, in _handle_twisted_reactor
    super()._handle_twisted_reactor()
  File "/root/envs/scrapyd/lib/python3.6/site-packages/scrapy/crawler.py", line 252, in _handle_twisted_reactor
    verify_installed_reactor(self.settings["TWISTED_REACTOR"])
  File "/root/envs/scrapyd/lib/python3.6/site-packages/scrapy/utils/reactor.py", line 78, in verify_installed_reactor
    raise Exception(msg)
Exception: The installed reactor (twisted.internet.epollreactor.EPollReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)

That is, the installed reactor does not match the one requested in settings; in this case the AsyncioSelectorReactor has to be installed manually.

First comment out the TWISTED_REACTOR setting in settings.py, then install the reactor by hand, also in settings.py:

import scrapy.utils.reactor

scrapy.utils.reactor.install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
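install_reactor fails if a different reactor has already been installed, which is why it has to run this early; here the settings module is imported soon enough. To confirm the asyncio reactor is actually active, a quick sanity check (my own addition, not part of the original fix) can go anywhere in spider code:

# e.g. at the top of start_requests; expect AsyncioSelectorReactor
from twisted.internet import reactor
self.log(str(type(reactor)))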

Official documentation: https://docs.scrapy.org/en/latest/topics/settings.html#twisted-reactor
