1. settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # deduplicate requests via fingerprints shared in redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # redis-backed request scheduler
SCHEDULER_PERSIST = True  # persist the request queue and fingerprint set across runs (default: False, not persisted)
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,  # write scraped items into redis
}
REDIS_URL = "redis://127.0.0.1:6379"
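The idea behind RFPDupeFilter is to hash each request into a fixed-size fingerprint and keep all fingerprints in one set that every node can see (in redis). A simplified sketch of that idea using only hashlib and an in-memory set; the real scrapy_redis filter also canonicalizes the URL and stores the set on the redis server:

```python
import hashlib

def request_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Simplified fingerprint: sha1 over method + url + body.
    (Scrapy's real fingerprint also canonicalizes the URL first.)"""
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    h.update(body)
    return h.hexdigest()

# In scrapy_redis this set lives in redis, so all nodes share it.
seen = set()

def should_crawl(method: str, url: str) -> bool:
    fp = request_fingerprint(method, url)
    if fp in seen:
        return False  # duplicate: some node already scheduled this request
    seen.add(fp)
    return True

print(should_crawl("GET", "https://www.baidu.com"))  # first time: True
print(should_crawl("GET", "https://www.baidu.com"))  # duplicate: False
```

Because the set is shared, a request scheduled on one node is automatically skipped on every other node.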
2. spider
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider_redis'
    redis_key = 'myspider:start_urls'  # redis list the spider pops its start URLs from
    allowed_domains = ['jd.com', 'p.3.cn']

    def parse(self, response):
        yield {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
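Unlike a plain Spider, RedisSpider has no start_urls: every node polls the redis list named by redis_key and pops URLs as they arrive. The queue semantics can be sketched with a deque standing in for the redis list (a toy model only; in production every node connects to the same real redis server):

```python
from collections import deque

class FakeRedisList:
    """Toy stand-in for a redis list, just to illustrate the semantics."""
    def __init__(self):
        self._items = deque()

    def lpush(self, *values):
        # LPUSH inserts each value at the head of the list.
        for v in values:
            self._items.appendleft(v)

    def lpop(self):
        # LPOP removes and returns the head, or None when empty.
        return self._items.popleft() if self._items else None

start_urls = FakeRedisList()
start_urls.lpush("https://www.jd.com", "https://p.3.cn")

# Each idle spider node polls the shared list; a URL popped by one
# node is gone for all the others, so work is split automatically.
url_for_node_a = start_urls.lpop()
url_for_node_b = start_urls.lpop()
print(url_for_node_a, url_for_node_b)
```

A URL exists in the list exactly once, so two nodes never receive the same start URL.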
3. Notes
1) On startup, first launch the spider process on every distributed node; each will sit idle, waiting for start URLs;
2) Then push the starting URL(s) into redis:
lpush myspider:start_urls 'https://www.baidu.com'
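Once the nodes are running and a start URL has been pushed, scraped items accumulate in redis via RedisPipeline, which by default serializes each item to JSON and pushes it onto a per-spider list (the default key pattern is '%(spider)s:items', configurable via REDIS_ITEMS_KEY). A minimal sketch of that behaviour, with a plain Python list standing in for the redis list:

```python
import json

def items_key(spider_name: str) -> str:
    # Default key pattern used by scrapy_redis (assumption from its
    # defaults; override with the REDIS_ITEMS_KEY setting).
    return "%(spider)s:items" % {"spider": spider_name}

fake_redis_list = []  # stands in for the list on the redis server

def process_item(item: dict, spider_name: str) -> dict:
    """Sketch of RedisPipeline.process_item: serialize, push, return item."""
    payload = json.dumps(item, ensure_ascii=False)
    # Real pipeline: server.rpush(items_key(spider_name), payload)
    fake_redis_list.append(payload)
    return item

process_item({"name": "JD.com", "url": "https://www.jd.com"}, "myspider_redis")
print(items_key("myspider_redis"))  # myspider_redis:items
print(fake_redis_list[0])
```

A consumer process (writer to a database, deduplicator, etc.) can then pop and json-decode entries from that list independently of the spiders.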