1. The Problem
Xiao Ming is a crawler engineer who recently ran into the following difficulties on a bidding/tender collection project:
1. Multiple kinds of requests (GET, form POST, JSON POST) need to be wrapped into a single method;
2. The crawl volume may reach hundreds of millions of records;
3. After wiring RedisCrawlSpider up to a Bloom filter for incremental crawling, only the start requests are fetched: the pagination requests are deduplicated away by the Bloom filter by default.
2. Solutions
2.1 Changing how scrapy-redis builds start requests
Source code analysis (the source is long, so only the relevant methods are quoted here):
```python
def next_requests(self):
    """Returns a request to be scheduled or none."""
    # XXX: Do we need to use a timeout here?
    found = 0
    datas = self.fetch_data(self.redis_key, self.redis_batch_size)  # fetch the start data from Redis
    for data in datas:
        # --------- key code ---------
        reqs = self.make_request_from_data(data)  # key call: `data` carries the start URL
        # --------- key code ---------
        if isinstance(reqs, Iterable):
            for req in reqs:
                yield req
                # XXX: should be here?
                found += 1
                self.logger.info(f'start req url:{req.url}')
        elif reqs:
            yield reqs
            found += 1
        else:
            self.logger.debug("Request not made from data: %r", data)
    if found:
        self.logger.debug("Read %s requests from '%s'", found, self.redis_key)

def make_request_from_data(self, data):
    """Returns a Request instance from data coming from Redis."""
    url = bytes_to_str(data, self.redis_encoding)
    # The Request is ultimately built from here, so overriding
    # make_requests_from_url is enough to change the request type.
    return self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    """ This method is deprecated. """
    warnings.warn(
        "Spider.make_requests_from_url method is deprecated: "
        "it will be removed and not be called by the default "
        "Spider.start_requests method in future Scrapy releases. "
        "Please override Spider.start_requests method instead."
    )
    return Request(url, dont_filter=True)
```
Code walkthrough:
The first method, next_requests, pulls the next batch of raw data from the Redis list (you can push any serialized payload you like) and hands each item to make_request_from_data, which decodes the URL and passes it to make_requests_from_url to build the actual Request. Overriding that last method is therefore enough to support multiple request types from the same start queue.
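To make the decode-then-dispatch step concrete, here is a minimal sketch of the parsing an overridden make_request_from_data must perform. The JSON payload shape (`url`/`method` fields) and the helper name are assumptions for illustration, not part of scrapy-redis:

```python
import json

def parse_start_data(data, encoding='utf-8'):
    """Decode one raw item fetched from the Redis start list and
    extract the URL plus a request method (hypothetical payload shape)."""
    text = data.decode(encoding) if isinstance(data, bytes) else data
    try:
        payload = json.loads(text)
        return payload.get('url'), payload.get('method', 'GET')
    except json.JSONDecodeError:
        # A plain URL string: the default scrapy-redis behaviour
        return text, 'GET'
```

Pushing JSON instead of bare URLs into the start list is what lets one queue drive several request types.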
- Source override:
The full code after rewriting make_requests_from_url looks like this (example).
Note: this method builds the very first request from start_urls, so dont_filter=True is required; otherwise the start request itself would be filtered out.
```python
def make_requests_from_url(self, url):
    # Read this spider's crawl configuration from a Redis hash
    config = json.loads(self.redis_conn.hget(f'{self.tag_name}:json', self.tag_name))
    start_type = config['start_type']  # which kind of start request to build
    if start_type == 'dynamic_formdata':
        # POST with a form-encoded body
        return FormRequest(url=url, formdata=config['form_data'], dont_filter=True)
    elif start_type == 'dynamic_jsondata':
        # POST with a JSON body
        return JsonRequest(url=url, data=config['form_data'], dont_filter=True)
    # Default: a plain GET request
    return scrapy.Request(url, dont_filter=True)
```
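For reference, the configuration blob the method above reads back with hget() has to be seeded into Redis beforehand. A minimal sketch of building that blob (the helper name and the `Bidding` tag are assumptions):

```python
import json

def build_start_config(start_type, form_data=None):
    """Build the JSON stored at the hash f'{tag_name}:json'
    (field = tag_name) that make_requests_from_url() reads back."""
    config = {'start_type': start_type}
    if form_data is not None:
        config['form_data'] = form_data
    return json.dumps(config)

# Seeding Redis would then look like (requires a live server, not run here):
#   redis_conn.hset('Bidding:json', 'Bidding',
#                   build_start_config('dynamic_formdata', {'pageNo': '1'}))
```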
2.2 Changing the dedup mechanism
Problem summary:
Because the crawl volume is so large, the default scrapy-redis dedup mechanism is unsuitable: it stores every request fingerprint in a Redis set, which at this scale consumes far too much Redis memory.
Two options are available:
1. Query MongoDB for already-seen records (query speed would need optimizing; not covered here);
2. Use the third-party library scrapy-redis-bloomfilter for dedup. Since this is a multi-site crawl, and analysis of the scrapy-redis-bloomfilter source shows that it creates a separate Bloom filter key for each spider, the relevant method must be overridden so that every spider under the spiders directory shares a single Bloom filter key.
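To see why a Bloom filter wins at this scale, the standard sizing formulas can be evaluated for 1e8 items. This is a back-of-the-envelope sketch using the textbook formulas (m = -n·ln p / (ln 2)², k = (m/n)·ln 2), not code from either library:

```python
import math

def bloom_parameters(n_items, fp_rate):
    """Return (bits, hash_count) for a Bloom filter holding n_items
    with the given false-positive rate."""
    m = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    k = (m / n_items) * math.log(2)
    return math.ceil(m), round(k)

bits, hashes = bloom_parameters(100_000_000, 0.0001)
# ~1.92e9 bits ≈ 229 MB of Redis memory for 100M URLs at a 0.01%
# false-positive rate, versus 40-byte SHA1 fingerprints (several GB
# including set overhead) with scrapy-redis's default dedup.
```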
- Source override:
The full code after overriding the Bloom filter's RFPDupeFilter:
```python
import logging
import time

from scrapy_redis_bloomfilter.dupefilter import RFPDupeFilter
from scrapy_redis_bloomfilter.defaults import (
    BLOOMFILTER_HASH_NUMBER, BLOOMFILTER_BIT, DUPEFILTER_DEBUG,
)
from scrapy_redis_bloomfilter import defaults
from scrapy_redis.connection import get_redis_from_settings

logger = logging.getLogger(__name__)


class CCRDupeFilter(RFPDupeFilter):
    logger = logger

    @classmethod
    def from_settings(cls, settings):
        server = get_redis_from_settings(settings)
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG', DUPEFILTER_DEBUG)
        bit = settings.getint('BLOOMFILTER_BIT', BLOOMFILTER_BIT)
        hash_number = settings.getint('BLOOMFILTER_HASH_NUMBER', BLOOMFILTER_HASH_NUMBER)
        return cls(server, key=key, debug=debug, bit=bit, hash_number=hash_number)

    @classmethod
    def from_crawler(cls, crawler):
        return cls.from_settings(crawler.settings)

    @classmethod
    def from_spider(cls, spider):
        settings = spider.settings
        server = get_redis_from_settings(settings)
        # Key override: read one fixed key from settings instead of a
        # per-spider key, so every spider shares a single Bloom filter.
        key = settings.get('SCHEDULER_DUPEFILTER_KEY', defaults.SCHEDULER_DUPEFILTER_KEY)
        debug = settings.getbool('DUPEFILTER_DEBUG', DUPEFILTER_DEBUG)
        bit = settings.getint('BLOOMFILTER_BIT', BLOOMFILTER_BIT)
        hash_number = settings.getint('BLOOMFILTER_HASH_NUMBER', BLOOMFILTER_HASH_NUMBER)
        logger.debug('bloomfilter key=%s bit=%s hash_number=%s', key, bit, hash_number)
        return cls(server, key=key, debug=debug, bit=bit, hash_number=hash_number)
```
The single line doing the real work in this whole override is:
dupefilter_key = settings.get('SCHEDULER_DUPEFILTER_KEY', defaults.SCHEDULER_DUPEFILTER_KEY)
Pair it with the corresponding entry in the settings script: SCHEDULER_DUPEFILTER_KEY = 'Bidding:bloomfilter'
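For completeness, a sketch of the related settings.py entries. The DUPEFILTER_CLASS module path is an assumption (place CCRDupeFilter wherever it lives in your project); the scheduler path and the BLOOMFILTER_* names come from scrapy-redis-bloomfilter:

```python
# settings.py fragment (values are tuning assumptions, not prescriptions)
DUPEFILTER_CLASS = 'myproject.dupefilters.CCRDupeFilter'   # the override above
SCHEDULER = 'scrapy_redis_bloomfilter.scheduler.Scheduler'
SCHEDULER_PERSIST = True                           # keep queue/filter between runs
SCHEDULER_DUPEFILTER_KEY = 'Bidding:bloomfilter'   # one shared key for all spiders
BLOOMFILTER_HASH_NUMBER = 6
BLOOMFILTER_BIT = 30                               # 2^30 bits ≈ 128 MB in Redis
```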
2.3 Incremental pagination with RedisCrawlSpider
Problem summary:
1. The CrawlSpider rule that follows pagination links needs to filter the extracted URLs;
2. Pagination requests produced by the rule are deduplicated by default, so they must be marked as not to be filtered.
Reading the source reveals the purpose of these two Rule parameters:
process_links=None: a callback that intercepts every list of extracted URLs;
process_request=identity: a callback that intercepts each built Request object.
- Method override:
Overriding these two callbacks is enough to solve both problems:
```python
Rule(
    LinkExtractor(**rules_config_dict['next_page']),
    process_links='proc_links',
    process_request='process_request',
)

def proc_links(self, links):
    # On a dynamic site or a full historical run, follow every page;
    # on an incremental run only the first few pages are needed.
    if self.start_type == 'dynamic' or self.is_run_history:
        return links
    return links[:3]

def process_request(self, request, response):
    start_urls = json.loads(self.redis_conn.hget(f'{self.tag_name}:json', self.tag_name))['start_urls']
    # Pagination requests must bypass the dedup filter, otherwise an
    # incremental run stops right after the start request.
    if request.url not in start_urls:
        request.dont_filter = True
    return request
```
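The core of the process_request override is a one-line rule: any request whose URL is not a start URL gets dont_filter=True. A dependency-free sketch of just that rule, with a stand-in request class (FakeRequest and mark_pagination are illustration-only names):

```python
class FakeRequest:
    """Stand-in for scrapy.Request carrying only the fields we touch."""
    def __init__(self, url):
        self.url = url
        self.dont_filter = False

def mark_pagination(request, start_urls):
    """Let pagination requests bypass dedup; leave start URLs alone."""
    if request.url not in start_urls:
        request.dont_filter = True
    return request
```

Start requests keep dont_filter=False here because the rewritten make_requests_from_url already sets it when they are first built.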
Conclusion
Some parts of this post state conclusions without a full source-code walkthrough, and the overridden methods may still contain some redundancy. If you spot a mistake while reading, please point it out so we can all learn from it; questions are welcome in the comments.
Appendix: the code targets Python 3.7 with the following library versions:
Scrapy == 2.5.1
scrapy-redis == 0.7.2
scrapy-redis-bloomfilter == 0.8.1