2021SC@SDUSC
Returning to the engine's open_spider method: by now the slot-related initialization and startup should be roughly clear. The method then starts the scraper and crawler.stats in turn; note that the downloader was already initialized when the engine itself was created. Once all of this is ready, the engine fires the spider_opened signal.
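As a quick illustration of that signal, here is a minimal sketch of a hypothetical extension (not part of Scrapy itself) that subscribes to spider_opened through the standard crawler.signals API:

# Hypothetical extension: listen for the spider_opened signal fired above
from scrapy import signals

class SpiderOpenedLogger(object):
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # subscribe to the signal the engine sends in open_spider
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        spider.logger.info('spider_opened received for %s', spider.name)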
We have already mentioned _next_request several times, so let's follow that method down.
# scrapy/core/engine.py
class ExecutionEngine(object):
    # ...
    def _next_request(self, spider):
        slot = self.slot
        # ...
        while not self._needs_backout(spider):
            if not self._next_request_from_scheduler(spider):
                break

        if slot.start_requests and not self._needs_backout(spider):
            try:
                request = next(slot.start_requests)
            except StopIteration:
                slot.start_requests = None
            else:
                self.crawl(request, spider)

        if self.spider_is_idle(spider) and slot.close_if_idle:
            self._spider_idle(spider)
This method mainly does two things:
- fetch requests from the scheduler and download them, via _next_request_from_scheduler
- pull requests from the spider's initial requests and hand them to the scheduler, via crawl (note that start_requests is consumed lazily; see the sketch after this list)
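The second point deserves a small aside: slot.start_requests is a generator, so next(slot.start_requests) pulls exactly one request per engine tick instead of materializing them all up front. A standalone sketch of that behaviour:

# Standalone sketch: a generator is consumed lazily, one item per next(),
# which is why the engine can interleave start requests with scheduling.
def start_requests():
    for i in range(3):
        print('producing request', i)
        yield 'request-%d' % i

gen = start_requests()
print(next(gen))  # only 'producing request 0' has run at this point
print(next(gen))  # the generator resumes to produce the next one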
Let's look at the latter, crawl, first.
class ExecutionEngine(object):
    # ...
    def crawl(self, request, spider):
        assert spider in self.open_spiders, \
            "Spider %r not opened when crawling: %s" % (spider.name, request)
        self.schedule(request, spider)
        self.slot.nextcall.schedule()

    def schedule(self, request, spider):
        self.signals.send_catch_log(signal=signals.request_scheduled,
                                    request=request, spider=spider)
        if not self.slot.scheduler.enqueue_request(request):
            self.signals.send_catch_log(signal=signals.request_dropped,
                                        request=request, spider=spider)
In short, crawl calls schedule and then, through slot.nextcall.schedule(), triggers the next _next_request. Inside schedule, the request_scheduled signal is sent first and the request is enqueued in the scheduler; if enqueue_request returns False (for example, the request was filtered out as a duplicate), the request_dropped signal is sent instead.
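Here slot.nextcall is an instance of CallLaterOnce from scrapy/utils/reactor.py; a simplified sketch modeled on that class, slightly trimmed:

# Simplified sketch of CallLaterOnce, the mechanism behind slot.nextcall:
# schedule a function on the next reactor tick, coalescing multiple
# schedule() calls into a single invocation.
from twisted.internet import reactor

class CallLaterOnce(object):
    def __init__(self, func, *a, **kw):
        self._func, self._a, self._kw = func, a, kw
        self._call = None

    def schedule(self, delay=0):
        if self._call is None:  # already pending? then do nothing
            self._call = reactor.callLater(delay, self)

    def __call__(self):
        self._call = None  # allow the next schedule() to re-arm
        return self._func(*self._a, **self._kw)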
Now the former, _next_request_from_scheduler.
class ExecutionEngine(object):
    # ...
    def _next_request_from_scheduler(self, spider):
        slot = self.slot
        request = slot.scheduler.next_request()
        if not request:
            return
        d = self._download(request, spider)
        d.addBoth(self._handle_downloader_output, request, spider)
        d.addErrback(lambda f: logger.info('Error while handling downloader output',
                                           exc_info=failure_to_exc_info(f),
                                           extra={'spider': spider}))
        d.addBoth(lambda _: slot.remove_request(request))
        d.addErrback(lambda f: logger.info('Error while removing request from slot',
                                           exc_info=failure_to_exc_info(f),
                                           extra={'spider': spider}))
        d.addBoth(lambda _: slot.nextcall.schedule())
        d.addErrback(lambda f: logger.info('Error while scheduling new request',
                                           exc_info=failure_to_exc_info(f),
                                           extra={'spider': spider}))
        return d
It mainly does the following:
- get the next request from the scheduler; return if there is none (this return value is what lets the while loop in _next_request stop)
- download the request, adding it to the slot's inprogress set, via _download
- once the download finishes, hand the response to the scraper, via _handle_downloader_output
- once the download finishes, remove the request from the slot's inprogress set
- once the download finishes, trigger the next _next_request (the addBoth/addErrback chaining is sketched right after this list)
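The callback chain above is plain Twisted; a self-contained sketch of how the two calls differ:

# Self-contained sketch of the Deferred chaining pattern used above:
# addBoth runs on success and failure, addErrback only on failure.
from twisted.internet import defer

d = defer.Deferred()
d.addBoth(lambda result: print('step 1 saw:', result) or result)
d.addErrback(lambda f: print('step 1 failed:', f))  # skipped on success
d.addBoth(lambda _: print('runs last, success or failure'))
d.callback('response')  # fire the chain with a success value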
Now look at _download.
class ExecutionEngine(object):
    # ...
    def _download(self, request, spider):
        slot = self.slot
        slot.add_request(request)

        def _on_success(response):
            assert isinstance(response, (Response, Request))
            if isinstance(response, Response):
                response.request = request  # tie request to response received
                logkws = self.logformatter.crawled(request, response, spider)
                logger.log(*logformatter_adapter(logkws), extra={'spider': spider})
                self.signals.send_catch_log(signal=signals.response_received,
                                            response=response, request=request,
                                            spider=spider)
            return response

        def _on_complete(_):
            slot.nextcall.schedule()
            return _

        dwld = self.downloader.fetch(request, spider)
        dwld.addCallbacks(_on_success)
        dwld.addBoth(_on_complete)
        return dwld
- add the request to the slot's inprogress set
- call the downloader's fetch method; what it does will be analyzed later
- on a successful download, log the crawl and send the response_received signal
- whether the download succeeds or fails, trigger the next _next_request
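One practical consequence of _on_success attaching response.request: a spider callback can always get back to the request that produced the response. A hypothetical spider as illustration:

# Hypothetical spider showing the request-to-response tie set in _on_success:
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com']

    def parse(self, response):
        # response.request was attached by the engine before the
        # response_received signal was sent
        self.logger.info('fetched %s (meta: %r)',
                         response.url, response.request.meta)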