I was fiddling with Scrapy today: the next stage of the crawl needed the ids of records already saved in the database in order to fetch the detail pages, which meant the spider had to call into the pipeline.
The problem was that I had no idea how to get a reference to the pipeline object. A Baidu search only turned up the usual "store Scrapy items in MongoDB/MySQL through a pipeline" articles.
I eventually found the answer in a Stack Overflow post. Since I've only been using Scrapy for a few days, I'm recording it here; not many people seem to ask this, so presumably everyone else already knows.
The original answer is short and clear, so no translation needed:
A Scrapy Pipeline has an open_spider method that gets executed after the spider is initialized.
You can pass a reference to the database connection, the get_date() method, or the Pipeline itself, to your spider.
An example of the latter with your code is:
# This is my Pipeline
import pymongo
from scrapy.conf import settings


class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        # pymongo.Connection was removed from pymongo; MongoClient is the current API
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....

    def open_spider(self, spider):
        # Called by Scrapy after the spider is initialized; hand the spider
        # a reference to this pipeline so its callbacks can use it.
        spider.myPipeline = self
Then, in the spider:
class TestSpider(Spider):
    name = "test"

    def __init__(self, name=None, **kwargs):
        super(TestSpider, self).__init__(name, **kwargs)
        self.myPipeline = None

    def parse(self, response):
        # open_spider has already run by the time responses arrive
        self.myPipeline.get_date()
I don't think the __init__() method is necessary here, but I put it here to show that open_spider replaces it after initialization.
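The startup order is the whole trick: the engine builds the spider first, then calls each pipeline's open_spider hook, so the attribute is in place before any parse callback fires. Here is a minimal plain-Python sketch of that mechanism (no real Scrapy or MongoDB involved; the class names and the fake get_date() return value are just for illustration):

```python
import datetime


class MongoDBPipeline:
    """Stand-in for the real pipeline; get_date() is the method the spider needs."""

    def open_spider(self, spider):
        # Scrapy calls this once, after the spider is initialized.
        spider.myPipeline = self

    def get_date(self):
        # The real pipeline would query MongoDB here; we return a fixed date.
        return datetime.date(2019, 1, 1)


class TestSpider:
    """Stand-in for the Scrapy spider."""

    name = "test"

    def __init__(self):
        self.myPipeline = None  # filled in later by open_spider

    def parse(self, response):
        # By the time parse() runs, open_spider has already fired.
        return self.myPipeline.get_date()


# Simulate the engine's startup order: create the spider, then open pipelines.
spider = TestSpider()
pipeline = MongoDBPipeline()
pipeline.open_spider(spider)
print(spider.parse(response=None))  # datetime.date(2019, 1, 1)
```

The same idea works for passing just the database connection or a bound method instead of the whole pipeline, as the answer notes.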
If you schedule a Scrapy spider with scrapyd and get an "unexpected keyword argument '_job'" error, see: https://blog.csdn.net/benpaodelulu_guajian/article/details/86485148
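That error happens because scrapyd passes an extra `_job` keyword argument when it schedules the spider, and a custom `__init__` that doesn't accept `**kwargs` rejects it. A minimal sketch of the fix (class name made up; a real spider would subclass scrapy.Spider and call `super().__init__(name, **kwargs)`):

```python
class TestSpider:  # in real code: class TestSpider(scrapy.Spider)
    name = "test"

    def __init__(self, name=None, **kwargs):
        # **kwargs absorbs the extra '_job' argument scrapyd passes in,
        # so instantiation no longer raises TypeError.
        self.job_id = kwargs.get("_job")


spider = TestSpider(_job="20190101123456")
print(spider.job_id)  # 20190101123456
```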