Before getting into Scrapy's deduplication, let's think about how we would deduplicate requests ourselves. The requests library is only a downloader; it provides no deduplication of its own, so we have to handle it ourselves. A typical approach is to keep a set of already-crawled URLs up front and check whether each URL to be fetched is in it, like this:
crawled_urls = set()  # URLs we have already crawled

def check_url(url):
    # Return True if this URL has not been crawled yet.
    if url not in crawled_urls:
        return True
    return False
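To see how this check fits into a crawl, here is a minimal sketch of a fetch loop built around check_url; the crawl function and the seed URL are placeholders of my own, not part of the original:

import requests

def crawl(url):
    # Only fetch URLs we have not seen before.
    if check_url(url):
        crawled_urls.add(url)       # remember it so we never fetch it twice
        response = requests.get(url)
        print(url, response.status_code)

crawl('http://example.com')         # fetched
crawl('http://example.com')         # skipped: already in crawled_urls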
At this point the set lives entirely in memory, and as the crawler fetches more content the set keeps growing. What can be done about that?
Keep reading and you'll find out.
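One common mitigation, shown here as my own sketch rather than anything from this post, is to store a fixed-size hash of each URL instead of the full string, so every entry costs the same small amount of memory no matter how long the URL is:

import hashlib

crawled_fingerprints = set()

def check_url_hashed(url):
    # sha1 yields a fixed 20-byte digest, so memory per entry is bounded
    # regardless of URL length (illustrative sketch, not from the post).
    digest = hashlib.sha1(url.encode('utf-8')).digest()
    if digest not in crawled_fingerprints:
        crawled_fingerprints.add(digest)
        return True
    return False

As we'll see below, Scrapy's own duplicate filter follows the same idea of hashing requests into fixed-size fingerprints.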
Scrapy's deduplication
Telling Scrapy not to deduplicate a request is simple: set dont_filter to True on the Request object, e.g.
yield scrapy.Request(url, callback=self.get_response, dont_filter=True)
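For context, here is a minimal spider sketch showing where such a request would typically be yielded; the spider name, URL, and callback are placeholder assumptions of mine:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://example.com']

    def parse(self, response):
        # dont_filter=True bypasses the duplicate filter, so this request
        # is scheduled even though the same URL was just fetched.
        yield scrapy.Request(response.url, callback=self.get_response,
                             dont_filter=True)

    def get_response(self, response):
        self.logger.info('fetched %s', response.url)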
Let's look at how the source does it. The relevant code is the request_fingerprint function (in older Scrapy versions it lives in scrapy/utils/request.py):
import hashlib
import weakref

from scrapy.utils.python import to_bytes
from w3lib.url import canonicalize_url

# Cache of computed fingerprints, keyed weakly by request so entries
# disappear once the request object is garbage-collected.
_fingerprint_cache = weakref.WeakKeyDictionary()

def request_fingerprint(request, include_headers=None):
    if include_headers:
        # Normalize header names: lowercase bytes, sorted order.
        include_headers = tuple(to_bytes(h.lower())
                                for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        # Hash the method, the canonicalized URL, and the body, so two
        # requests that differ only superficially (e.g. query-parameter
        # order) produce the same fingerprint.
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            # Optionally mix the selected headers into the fingerprint.
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]
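As a quick sanity check, a sketch of my own rather than anything from the post: two requests whose URLs differ only in query-parameter order should canonicalize to the same fingerprint, since canonicalize_url sorts the query arguments.

from scrapy import Request

r1 = Request('http://example.com/?a=1&b=2')
r2 = Request('http://example.com/?b=2&a=1')

# Both canonicalize to the same URL, so the fingerprints match.
assert request_fingerprint(r1) == request_fingerprint(r2)
print(request_fingerprint(r1))  # a 40-character sha1 hex digest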