- Background: different fields on a website point to the same file (image), which therefore needs to be downloaded repeatedly. For example, on an image-hosting site the same picture appears under several different categories.
- Prerequisites: 1. if the file URLs differ but the responses are identical, that case is skipped below; 2. when the URLs are identical, the request must be yielded with `dont_filter=True` (added in spider.py, so the request is not dropped by the scheduler's dupefilter and can still reach the pipeline; see the sketch below).
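A minimal sketch of that spider-side change, assuming a hypothetical spider and site layout (only `dont_filter=True` itself comes from the original text):

```python
import scrapy


class ImageSpider(scrapy.Spider):
    # Hypothetical spider: the same image can appear under several
    # categories, so the scheduler's dupefilter must be bypassed with
    # dont_filter=True for every occurrence to reach the item pipeline.
    name = "image_spider"
    start_urls = ["https://example.com/categories"]

    def parse(self, response):
        for href in response.css("a.category::attr(href)").getall():
            # dont_filter=True keeps duplicate URLs from being dropped
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse_category,
                dont_filter=True,
            )

    def parse_category(self, response):
        yield {
            "category": response.css("h1::text").get(),
            # file_urls is the default input field of FilesPipeline
            "file_urls": response.css("img::attr(src)").getall(),
        }
```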
- Core: reading the source of MediaPipeline, the parent class of FilesPipeline, shows that its request handling contains the following two passages, which we comment out.
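For reference, these are the two passages as they appear in the upstream `MediaPipeline._process_request` of older Scrapy releases (they reappear, commented out, in the full listing below). The first answers an already-seen fingerprint from the cache; the second attaches to a download that is already in flight:

```python
# Return cached result if request was already seen
if fp in info.downloaded:
    return defer_result(info.downloaded[fp]).addCallbacks(cb, eb)

# Check if request is downloading right now to avoid doing it twice
if fp in info.downloading:
    return wad
```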
- Full source of this part (it matches older Scrapy releases; newer versions have since changed the fingerprinting API): when overriding this method, you also need the following imports in pipeline.py.
```python
import logging

from scrapy.utils.request import request_fingerprint
from scrapy.utils.defer import mustbe_deferred
from scrapy.utils.log import failure_to_exc_info
from twisted.internet.defer import Deferred

logger = logging.getLogger(__name__)


def _process_request(self, request, info):
    fp = request_fingerprint(request)
    cb = request.callback or (lambda _: _)
    eb = request.errback
    request.callback = None
    request.errback = None

    # Return cached result if request was already seen
    # (commented out: an already-downloaded fingerprint must not be
    # answered from the cache, or the file is never fetched again)
    # if fp in info.downloaded:
    #     return defer_result(info.downloaded[fp]).addCallbacks(cb, eb)

    # Otherwise, wait for result
    wad = Deferred().addCallbacks(cb, eb)
    info.waiting[fp].append(wad)

    # Check if request is downloading right now to avoid doing it twice
    # (commented out: an in-flight fingerprint is downloaded again anyway)
    # if fp in info.downloading:
    #     return wad

    # Download request checking media_to_download hook output first
    info.downloading.add(fp)
    dfd = mustbe_deferred(self.media_to_download, request, info)
    dfd.addCallback(self._check_media_to_download, request, info)
    dfd.addBoth(self._cache_result_and_execute_waiters, fp, info)
    # upstream error logging, restored here (it needs the logging
    # imports above; the original post had it commented out)
    dfd.addErrback(lambda f: logger.error(
        f.value, exc_info=failure_to_exc_info(f), extra={'spider': info.spider})
    )
    return dfd.addBoth(lambda _: wad)  # it must return wad at last
```
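To put the override to work, it goes on a custom pipeline registered in settings.py. The class and setting values below are illustrative, not from the original post; note also that with the default `file_path` (a SHA1 hash of the URL) repeated downloads of the same URL overwrite the same file on disk, so keeping distinct copies would additionally require overriding `file_path`:

```python
# pipelines.py -- hypothetical subclass; FilesPipeline inherits
# MediaPipeline, so defining _process_request here shadows the
# deduplicating version on MediaPipeline.
from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):
    def _process_request(self, request, info):
        ...  # the patched body shown above goes here
```

```python
# settings.py (illustrative values)
ITEM_PIPELINES = {
    "myproject.pipelines.MyFilesPipeline": 300,
}
FILES_STORE = "./downloads"
```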
- Final rough structure: our custom FilePipeline → FilesPipeline → MediaPipeline (the overridden `_process_request` is defined on MediaPipeline).
- Conclusion: although this article demonstrates files, images are handled in exactly the same way, because ImagesPipeline itself inherits from FilesPipeline, just as our custom FilePipeline inherits from FilesPipeline, which in turn inherits from MediaPipeline.
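As a sketch of the images case (the class name is hypothetical), the same override simply moves onto an ImagesPipeline subclass:

```python
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):
    # ImagesPipeline -> FilesPipeline -> MediaPipeline, so the same
    # patched _process_request works unchanged for images.
    def _process_request(self, request, info):
        ...  # identical body to the files version above
```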