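I recently ran into a problem in a Scrapy project: when I scheduled a spider manually through CrawlerProcess a second time, it failed with the error "Scrapy - Reactor not Restartable" (Twisted's ReactorNotRestartable exception). The cause is that Twisted's reactor, once stopped, cannot be started again within the same process.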
Solution:
Start the reactor in a separate process, so that every crawl gets a fresh reactor. Example code:
import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor


# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())


# Runs in the child process: crawl once, stop the reactor, report the outcome.
# Defined at module level so it can be pickled under the "spawn" start method.
def f(q):
    try:
        runner = crawler.CrawlerRunner()
        deferred = runner.crawl(QuotesSpider)
        # Stop the reactor whether the crawl succeeds or fails.
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run()
        q.put(None)
    except Exception as e:
        q.put(e)


# the wrapper to make it run more times
def run_spider():
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()  # None on success, otherwise the exception from the child
    p.join()

    if result is not None:
        raise result


# The guard keeps child processes from re-running the crawls on import.
if __name__ == '__main__':
    print('first run:')
    run_spider()

    print('\nsecond run:')
    run_spider()
Result:
first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
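The same isolation can also be had by launching a fresh Python interpreter per run instead of using multiprocessing. This is not the approach shown above, only a sketch of a variant built on the standard library's `subprocess` module. The helper name `run_in_fresh_interpreter` and the placeholder source string are hypothetical; in practice the string would be replaced by code (or a script path) that starts the spider.

```python
import subprocess
import sys


def run_in_fresh_interpreter(source):
    """Run Python source in a new interpreter process and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
    )
    # Surface a failure in the child as an exception in the parent.
    if completed.returncode != 0:
        raise RuntimeError(completed.stderr)
    return completed.stdout


if __name__ == "__main__":
    # Each call gets its own process, so no reactor state carries over.
    for label in ("first run:", "second run:"):
        print(label)
        print(run_in_fresh_interpreter("print('crawl would start here')"))
```

Compared with the Process and Queue wrapper, this trades the ability to pass Python objects back for complete independence from the parent's imported modules.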