Previously, we solved URL deduplication by keeping a set inside the parse function.
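That earlier approach boils down to a simple membership check. A minimal sketch of the idea, stripped of Scrapy (the function name `should_crawl` is illustrative, not part of any API):

```python
# The dedup logic that used to live inside parse: remember every URL
# we have scheduled, and refuse to schedule the same URL twice.
visited_set = set()

def should_crawl(url):
    """Return True the first time a URL is seen, False on every repeat."""
    if url in visited_set:
        return False
    visited_set.add(url)
    return True
```

The drawback is that this logic is tangled into the spider; moving it into a dedicated dupefilter class, as below, lets Scrapy apply it to every request automatically.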
First, create a new duplication.py file in the project root, add `from scrapy.dupefilter import RFPDupeFilter`, then open the RFPDupeFilter source and copy the BaseDupeFilter class into the new duplication.py as a starting point:
class RepeatFilter(object):
    def __init__(self):
        self.visited_set = set()

    @classmethod
    def from_settings(cls, settings):
        # Class method used to build the filter; returns a RepeatFilter()
        return cls()

    def request_seen(self, request):
        # URL-filtering method: True means "seen before, drop the request"
        if request.url in self.visited_set:
            return True
        else:
            self.visited_set.add(request.url)
            return False

    def open(self):
        # Called when the spider starts
        print("---开始爬取---")

    def close(self, reason):
        # Called when the spider finishes
        print("---爬取结束---")

    def log(self, request, spider):
        # Log a filtered request
        pass
The URL-filtering logic is written in the request_seen method.
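To see request_seen on its own: the first call for a URL returns False (let the request through), and every later call for the same URL returns True (drop it). A self-contained sketch, using SimpleNamespace as a stand-in for Scrapy's Request since only the .url attribute is consulted:

```python
from types import SimpleNamespace

class RepeatFilter:
    def __init__(self):
        self.visited_set = set()

    def request_seen(self, request):
        # True = duplicate, drop it; False = new, schedule it
        if request.url in self.visited_set:
            return True
        self.visited_set.add(request.url)
        return False

f = RepeatFilter()
first = f.request_seen(SimpleNamespace(url="https://dig.chouti.com/"))
again = f.request_seen(SimpleNamespace(url="https://dig.chouti.com/"))
# first is False (new URL), again is True (duplicate)
```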
The execution order is:
1. from_settings
2. __init__
3. open
4. log
5. close
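The call order above can be verified with a hand-driven simulation of what the scheduler does (the driver calls at the bottom are our own simulation, not Scrapy API):

```python
# Record the order in which the filter's methods fire.
calls = []

class RepeatFilter:
    def __init__(self):
        self.visited_set = set()
        calls.append("__init__")

    @classmethod
    def from_settings(cls, settings):
        calls.append("from_settings")
        return cls()  # triggers __init__

    def request_seen(self, request):
        if request.url in self.visited_set:
            return True
        self.visited_set.add(request.url)
        return False

    def open(self):
        calls.append("open")

    def log(self, request, spider):
        calls.append("log")

    def close(self, reason):
        calls.append("close")

# Simulate the scheduler driving the filter through one run:
flt = RepeatFilter.from_settings(settings=None)
flt.open()
flt.log(request=None, spider=None)
flt.close(reason="finished")
# calls is now ["from_settings", "__init__", "open", "log", "close"]
```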
Finally, don't forget to add DUPEFILTER_CLASS = "shan.duplication.RepeatFilter" to settings.py.
The default is DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter".
(venv) D:\shan>scrapy crawl chouti --nolog
D:\shan\shan\spiders\chouti.py:9: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
from scrapy.dupefilter import RFPDupeFilter
---开始爬取---
https://dig.chouti.com/
https://dig.chouti.com/all/hot/recent/2
https://dig.chouti.com/all/hot/recent/3
https://dig.chouti.com/all/hot/recent/8
https://dig.chouti.com/all/hot/recent/5
https://dig.chouti.com/all/hot/recent/7
https://dig.chouti.com/all/hot/recent/6
https://dig.chouti.com/all/hot/recent/10
https://dig.chouti.com/all/hot/recent/9
https://dig.chouti.com/all/hot/recent/4
---爬取结束---