2021SC@SDUSC
Where ExecutionEngine and Scraper Connect
Having traced this far, we now return to ExecutionEngine's _next_request_from_scheduler method. For convenience, I've copied the code again here:
```python
# scrapy/core/engine.py
class ExecutionEngine(object):
    # ...
    def _next_request_from_scheduler(self, spider):
        slot = self.slot
        request = slot.scheduler.next_request()
        if not request:
            return
        # Download the request, then hand the result to the scraper.
        d = self._download(request, spider)
        d.addBoth(self._handle_downloader_output, request, spider)
        d.addErrback(lambda f: logger.info('Error while handling downloader output',
                                           exc_info=failure_to_exc_info(f),
                                           extra={'spider': spider}))
        # Whatever happened above, drop the request from the in-progress slot.
        d.addBoth(lambda _: slot.remove_request(request))
        d.addErrback(lambda f: logger.info('Error while removing request from slot',
                                           exc_info=failure_to_exc_info(f),
                                           extra={'spider': spider}))
        # Finally, schedule the next pass of the engine loop.
        d.addBoth(lambda _: slot.nextcall.schedule())
        d.addErrback(lambda f: logger.info('Error while scheduling new request',
                                           exc_info=failure_to_exc_info(f),
                                           extra={'spider': spider}))
        return d
```
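Note the Twisted pattern at work here: each addBoth callback runs on both success and failure, while the addErrback immediately after it fires only if that preceding step itself failed, logging the error and (by returning None) letting the rest of the chain continue. Below is a minimal, self-contained sketch of this pattern; it is not Scrapy code, and the step names are made up for illustration:

```python
from twisted.internet import defer

def step(result):
    # Stands in for _handle_downloader_output: called with either a
    # result or a Failure, since it was attached with addBoth.
    print('step got:', result)
    return result

def log_error(failure):
    # Stands in for the logger.info(...) errbacks: log and return None,
    # which converts the failure back into a success so the chain goes on.
    print('error:', failure.getErrorMessage())

d = defer.Deferred()
d.addBoth(step)          # fires on success *and* failure
d.addErrback(log_error)  # fires only if step itself raised
d.addBoth(lambda _: print('cleanup: always reached'))
d.callback('page downloaded')  # kick off the chain with a successful result
```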
Next, let's follow _handle_downloader_output:
```python
# scrapy/core/engine.py
class ExecutionEngine(object):
    # ...
    def _handle_downloader_output(self, response, request, spider):
        assert isinstance(response, (Request, Response, Failure)), response
        # downloader middleware can return requests (for example, redirects)
        if isinstance(response, Request):
            self.crawl(response, spider)
            return
        # response is a Response or Failure: hand it over to the scraper
        d = self.scraper.enqueue_scrape(response, request, spider)
        d.addErrback(lambda f: logger.error('Error while enqueuing downloader output',
                                            exc_info=failure_to_exc_info(f),
                                            extra={'spider': spider}))
        return d
```
- If the response turns out to be a Request, call crawl on it. Since the downloader output passes through the downloader middlewares, a middleware may hand back a Request instead of a Response (a redirect, for example); see the sketch after this list.
- Otherwise, the Response (or Failure) is enqueued together with its Request into the Scraper's processing queue. This is where Spider Middleware and Item Pipelines come into play. If you're interested, you can trace this part the way we traced the downloader: read it a few times and the broad picture should emerge. We won't expand on it here.
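To see where such a Request can come from, here is a hypothetical downloader middleware (an illustrative sketch, not code from the Scrapy source; Scrapy's built-in RedirectMiddleware does something similar for 3xx responses):

```python
class RetryOn503Middleware:
    """Hypothetical downloader middleware: re-issue the request on HTTP 503."""

    def process_response(self, request, response, spider):
        if response.status == 503:
            # Returning a Request instead of a Response is exactly what
            # lands us in the isinstance(response, Request) branch of
            # _handle_downloader_output, i.e. back through engine.crawl().
            return request.replace(dont_filter=True)
        return response
```

Because every download funnels through such middlewares before reaching the engine, _handle_downloader_output has to be prepared for a Request, a Response, or a Failure, which is what the assert at the top expresses.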
With this, across nine installments of source analysis we have worked through Scrapy's basic framework: starting from Twisted, we covered DeferredList, gatherResults, defer.inlineCallbacks, ExecutionEngine, Downloader, and more. After these nine installments, I feel I have a fairly clear picture of Scrapy's underlying structure.

From installment 10 onward, I plan to dig into the key parts of the Scrapy source in depth, aiming for a truly thorough understanding and feel for the framework.