Scrapy ships with a built-in duplicate-request filter that drops duplicate requests based on the request URL. If the default behavior does not fit, you can define your own rules for what counts as a duplicate.
Filtering duplicate requests by URL
Suppose we have already visited

http://www.abc.com/p/xyz.html?id=1234&refer=4567

and now want to filter out requests such as

http://www.abc.com/p/xyz.html?id=1234&refer=5678
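The two URLs are identical except for the refer parameter, so dropping everything from "&refer" onward collapses them to the same key. A quick sketch of that idea:

url1 = "http://www.abc.com/p/xyz.html?id=1234&refer=4567"
url2 = "http://www.abc.com/p/xyz.html?id=1234&refer=5678"
# Both URLs reduce to the same key once the refer parameter is stripped
print(url1.split("&refer")[0] == url2.split("&refer")[0])  # True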
This can be done by writing a custom dupe filter and registering it in the settings:
import os

from scrapy.dupefilters import RFPDupeFilter


class CustomFilter(RFPDupeFilter):
    """A dupe filter that ignores the refer parameter in the URL."""

    def __getid(self, url):
        # Use everything before "&refer" as the deduplication key, so URLs
        # that differ only in their refer value map to the same fingerprint.
        return url.split("&refer")[0]

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False
Then add the following to settings.py:
DUPEFILTER_CLASS = 'scraper.duplicate_filter.CustomFilter'
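With that in place, a request whose URL differs from an already-seen one only in refer is reported as a duplicate. A minimal sanity check outside a crawl (assuming the filter lives at the module path used above):

from scrapy.http import Request
from scraper.duplicate_filter import CustomFilter  # module path assumed from DUPEFILTER_CLASS

f = CustomFilter()  # RFPDupeFilter can be constructed with no arguments
print(f.request_seen(Request("http://www.abc.com/p/xyz.html?id=1234&refer=4567")))  # False: first sighting
print(f.request_seen(Request("http://www.abc.com/p/xyz.html?id=1234&refer=5678")))  # True: duplicate up to refer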
Not filtering any requests
To stop Scrapy from filtering any requests at all, define a filter whose request_seen always returns False:
from scrapy.dupefilters import RFPDupeFilter


class CloseDupefilter(RFPDupeFilter):
    def request_seen(self, request):
        # Never report a request as seen, so nothing is ever filtered out.
        return False
Then add the following to settings.py:
DUPEFILTER_CLASS = 'scraper.duplicate_filter.CloseDupefilter'
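If you only need to re-crawl a few specific URLs rather than disabling deduplication globally, scrapy.Request also accepts a per-request dont_filter flag. A minimal sketch (the spider name is hypothetical):

import scrapy

class RefreshSpider(scrapy.Spider):
    name = "refresh"  # hypothetical name, for illustration only

    def start_requests(self):
        url = "http://www.abc.com/p/xyz.html?id=1234&refer=4567"
        # dont_filter=True exempts just these requests from the dupe filter,
        # so the same URL can be scheduled twice.
        yield scrapy.Request(url, callback=self.parse, dont_filter=True)
        yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.logger.info("fetched %s", response.url)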