Project requirement: collect the data generated yesterday (news, in this case) while avoiding duplicates. Since the news is listed in reverse chronological order by publish time, it is enough to filter on the publish time alone: as soon as an item older than yesterday is encountered, close the spider.
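The date check itself can be sketched as a small helper (a minimal sketch, assuming `pub_time` strings in the `"%Y-%m-%d %H:%M:%S"` format shown in the log below; the function name is hypothetical):

```python
from datetime import date, datetime, timedelta

def is_older_than_yesterday(pub_time: str) -> bool:
    """Return True once a news item's publish time falls before yesterday,
    i.e. the crawl has reached data that was already collected."""
    yesterday = date.today() - timedelta(days=1)
    pub_date = datetime.strptime(pub_time, "%Y-%m-%d %H:%M:%S").date()
    return pub_date < yesterday
```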
The spider can be closed directly from the spider itself, from a pipeline, or from a downloader middleware.
Inside a spider callback, call:
self.crawler.engine.close_spider(self, 'response msg error %s, job done!' % response.text)
In a pipeline or downloader middleware:
spider.crawler.engine.close_spider(spider, 'yestoday: %s news collection and completion' % self.YES_TODAY)
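Put together, a downloader-middleware version might look like this. This is a sketch, not the original code: the class name, the `pub_time` request-meta key, and the way `YES_TODAY` is computed are all assumptions. Scrapy downloader middlewares are plain classes, so nothing needs to be imported from Scrapy here:

```python
from datetime import date, timedelta

class StopAtYesterdayMiddleware:
    """Close the spider once a response carries a publish date older
    than yesterday (pages are assumed to arrive newest-first)."""

    def __init__(self):
        # Yesterday as "YYYY-MM-DD", computed once at startup.
        self.YES_TODAY = str(date.today() - timedelta(days=1))

    def process_response(self, request, response, spider):
        # Hypothetical convention: the spider stored the item's publish
        # time in request.meta["pub_time"] as "YYYY-MM-DD HH:MM:SS".
        pub_date = response.meta.get("pub_time", "")[:10]
        if pub_date and pub_date < self.YES_TODAY:
            spider.crawler.engine.close_spider(
                spider,
                'yestoday: %s news collection and completion' % self.YES_TODAY)
        return response
```

Note that `close_spider` only asks the engine to shut down gracefully; requests already scheduled may still complete, which is why a few more "Crawled"/"Scraped" lines appear in the log after the close message.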
I put mine in a downloader middleware; here is the output from a run:
2018-07-23 11:37:34 [spiders] INFO: NewsItem pub_time is 2018-07-03 08:57:55 YES_TODAY is 2018-07-22 spider will be close
2018-07-23 11:37:34 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.spiders.net.cn/news/a9626.html> None
2018-07-23 11:37:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.spiders.net.cn/news/a9785.html> (referer: http://www.spiders.net.cn/news/c7.html)
2018-07-23 11:37:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.spiders.net.cn/news/c5_3.html> (referer: http://www.spiders.net.cn/news/c5_2.html)
2018-07-23 11:37:35 [spiders] INFO: NewsItem pub_time is 2018-07-18 15:46:34 YES_TODAY is 2018-07-22 spider will be close
2018-07-23 11:37:35 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.spiders.net.cn/news/a9785.html> None
2018-07-23 11:37:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 13249,
'downloader/request_count': 36,
'downloader/request_method_count/GET': 36,
'downloader/response_bytes': 249457,
'downloader/response_count': 36,
'downloader/response_status_count/200': 36,
'finish_reason': 'yestoday: 2018-07-22 news collection and completion',
'finish_time': datetime.datetime(2018, 7, 23, 3, 37, 35, 138463),
'item_scraped_count': 11,
'log_count/DEBUG': 49,
'log_count/INFO': 19,
'offsite/domains': 1,
'offsite/filtered': 257,
'request_depth_max': 4,
'response_received_count': 36,
'scheduler/dequeued': 35,
'scheduler/dequeued/memory': 35,
'scheduler/enqueued': 451,
'scheduler/enqueued/memory': 451,
'start_time': datetime.datetime(2018, 7, 23, 3, 37, 27, 422021)}
2018-07-23 11:37:35 [scrapy.core.engine] INFO: Spider closed (yestoday: 2018-07-22 news collection and completion)
As the last line of the log shows, the close reason (`finish_reason`) has changed to the custom message. Also note that a couple of responses already in flight when `close_spider` was called were still crawled and scraped before the engine actually shut down.