Hello everyone! In the section before last we introduced the idea of distributed crawling; in this section we take a closer look at how Redis can be used to make Scrapy distributed.
1. Getting the Source Code
You can clone the source code by running the following command:
git clone https://github.com/rmax/scrapy-redis.git
The core source code lives in the scrapy-redis/src/scrapy_redis directory.
2. The Crawl Queue
Let's start with the crawl queue and look at how it is implemented. The source file is queue.py, which contains three queue implementations. They all share a parent class, Base, that provides some basic methods and attributes (one of the concrete subclasses is shown right after it for reference):
# queue.py (excerpt); the imports are added here for context — exact paths may differ slightly across scrapy-redis/Scrapy versions
from scrapy.utils.reqser import request_to_dict, request_from_dict

from . import picklecompat


class Base(object):
    """Per-spider base queue class."""

    def __init__(self, server, spider, key, serializer=None):
        if serializer is None:
            serializer = picklecompat
        if not hasattr(serializer, 'loads'):
            raise TypeError("serializer does not implement 'loads' function: %r"
                            % serializer)
        if not hasattr(serializer, 'dumps'):
            raise TypeError("serializer does not implement 'dumps' function: %r"
                            % serializer)
        self.server = server
        self.spider = spider
        self.key = key % {'spider': spider.name}
        self.serializer = serializer

    def _encode_request(self, request):
        """Encode a request object."""
        obj = request_to_dict(request, self.spider)
        return self.serializer.dumps(obj)

    def _decode_request(self, encoded_request):
        """Decode a request previously encoded."""
        obj = self.serializer.loads(encoded_request)
        return request_from_dict(obj, self.spider)

    def __len__(self):
        raise NotImplementedError

    def push(self, request):
        raise NotImplementedError

    def pop(self, timeout=0):
        raise NotImplementedError

    def clear(self):
        """Clear queue/stack."""
        self.server.delete(self.key)
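The three concrete queues (a FIFO queue, a LIFO queue, and a priority queue) then only need to fill in push, pop, and __len__. For orientation before we walk through Base in detail, the FIFO variant in the same queue.py looks roughly like this (paraphrased from the scrapy-redis source): serialized requests are kept in a Redis list, pushed with lpush and popped from the opposite end with rpop, or with a blocking brpop when a timeout is given.

class FifoQueue(Base):
    """Per-spider FIFO queue backed by a Redis list."""

    def __len__(self):
        # The queue length is just the length of the Redis list.
        return self.server.llen(self.key)

    def push(self, request):
        # Serialize the Request and push it onto the head of the list.
        self.server.lpush(self.key, self._encode_request(request))

    def pop(self, timeout=0):
        # Pop from the tail, blocking for up to `timeout` seconds if requested.
        if timeout > 0:
            data = self.server.brpop(self.key, timeout)
            if isinstance(data, tuple):
                data = data[1]
        else:
            data = self.server.rpop(self.key)
        if data:
            return self._decode_request(data)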
Let's look first at the _encode_request and _decode_request methods. We need to store Request objects in the database, but the database cannot store objects directly, so the Request has to be serialized into a string before it can be stored.
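As a concrete illustration of that round trip, here is a minimal sketch (not part of scrapy-redis) that encodes and decodes a Request the same way Base does. It assumes an older Scrapy release where request_to_dict and request_from_dict live in scrapy.utils.reqser (newer releases moved these helpers), and DemoSpider is just a placeholder spider made up for the demonstration:

from scrapy import Request, Spider
from scrapy.utils.reqser import request_to_dict, request_from_dict
import pickle  # picklecompat is essentially a thin wrapper around pickle


class DemoSpider(Spider):
    # Placeholder spider; only its name matters for this demonstration.
    name = 'demo'


spider = DemoSpider()
req = Request('http://www.example.com', meta={'depth': 1})

# Encode: Request -> plain dict -> pickled bytes, which Redis can store.
encoded = pickle.dumps(request_to_dict(req, spider), protocol=-1)

# Decode: reverse both steps to rebuild an equivalent Request.
restored = request_from_dict(pickle.loads(encoded), spider)
print(restored.url, restored.meta)  # http://www.example.com {'depth': 1}

The picklecompat module in scrapy-redis simply exposes loads and dumps on top of pickle, which is why any object providing those two functions can be plugged in as a custom serializer.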