The following is a source-code walkthrough of SpiderKeeper.
job_add() in SpiderKeeper/app/spider/controller.py adds a job to the SQLite database:
1. if request.form['daemon'] != 'auto':
2.     spider_args = []
3.     if request.form['spider_arguments']:
4.         spider_args = request.form['spider_arguments'].split(",")
5.     spider_args.append("daemon={}".format(request.form['daemon']))
6.     job_instance.spider_arguments = ','.join(spider_args)
Line 4 shows that the spider_arguments value from the web form is split on commas; together with the daemon setting, the pieces are re-joined and stored in the job_instance.spider_arguments field, which is written to the SQLite database.
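For illustration, here is a minimal standalone sketch of that storage step, using a plain dict in place of Flask's request.form (the input values are made-up examples):

form = {'daemon': 'flag1', 'spider_arguments': 'crawl_time=2019-01-01'}  # hypothetical form input

spider_args = []
if form['spider_arguments']:
    spider_args = form['spider_arguments'].split(",")
spider_args.append("daemon={}".format(form['daemon']))

# This is the string persisted to SQLite:
print(','.join(spider_args))  # crawl_time=2019-01-01,daemon=flag1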
The start_spider() function in SpiderKeeper/app/proxy/spiderctrl.py is responsible for calling Scrapyd to start the job:
arguments = {}
if job_instance.spider_arguments:
    arguments = dict(map(lambda x: x.split("="), job_instance.spider_arguments.split(",")))
The parameters stored in the database are split on commas, then each piece is split on '=' to build a dict (kwargs). At this point they can be attached directly to the Scrapyd call as spider parameters (for the actual usage, see start_spider() in SpiderKeeper/app/proxy/contrib/scrapy.py).
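To make the round trip concrete, here is a standalone sketch of the parsing step, using the same one-liner as spiderctrl.py but with a hard-coded example string:

spider_arguments = 'crawl_time=2019-01-01,daemon=flag1'

# Split on commas, then on '=' to form key/value pairs.
arguments = dict(map(lambda x: x.split("="), spider_arguments.split(",")))
print(arguments)  # {'crawl_time': '2019-01-01', 'daemon': 'flag1'}

Note that x.split("=") produces more than two items when a value itself contains '=', which makes dict() raise a ValueError, so values containing '=' or commas cannot be passed through this mechanism.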
Usage example
Next, if you want to use custom arguments, add an __init__ method to your spider script that consumes them:
import datetime
import logging

import scrapy


class Ssrpider(scrapy.Spider):
    name = 'ssr'

    def __init__(self, *args, **kwargs):
        super(Ssrpider, self).__init__(*args, **kwargs)
        # Default to today's date when no crawl_time argument is supplied.
        kwargs.setdefault('crawl_time', str(datetime.datetime.now().date()))
        crawl_time = kwargs.get('crawl_time')
        logging.info('crawl_time passed in: {}'.format(crawl_time))
        self.crawl_time = crawl_time

    def parse(self, response):
        # Demo only: stop the crawl as soon as the first response arrives.
        self.crawler.engine.close_spider(self, 'force stop')
Now I just enter crawl_time=2019-01-01 into the web form's spider_arguments field and the spider crawls with that parameter (the input must not contain spaces, since keys and values are not stripped when parsed).
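The same kwargs path can also be exercised without SpiderKeeper, which is handy for testing. Scrapy's -a flag passes spider arguments straight into __init__ as keyword arguments, and Scrapyd's schedule.json forwards any extra POST parameters to the spider the same way (the project name below is a placeholder):

# Local test: -a feeds kwargs directly into the spider's __init__.
scrapy crawl ssr -a crawl_time=2019-01-01

# Calling Scrapyd directly: extra parameters become spider arguments.
curl http://localhost:6800/schedule.json -d project=myproject -d spider=ssr -d crawl_time=2019-01-01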