1. The Spider Component
The Spider class defines how one or more pages are crawled and how structured data is extracted from them. For a Spider, the whole process is the following loop:
(1) Generate the initial Requests from the first URLs and attach callback functions; once those Requests have been downloaded and turned into Responses, the callbacks are invoked.
(2) Inside the callback, parse the returned Response and return Item objects, Request objects, or an iterable containing both. Returned Requests are handled by Scrapy, handed to the Downloader for download, and their own callbacks invoked in turn.
(3) Inside the callback, use your favorite parsing tool to parse the page and generate Items.
(4) The generated Items pass through the Item Pipelines, which save them to a database or apply whatever processing you define.
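The loop above can be sketched in dependency-free Python. This is a simplified stand-in for illustration only, not Scrapy's real API: the `Request`, `Response`, `fetch`, and `crawl` names here are hypothetical, and `fetch` fakes the Downloader instead of performing network I/O.

```python
from collections import deque

class Request:
    """A URL to download plus the callback to run on its response."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

class Response:
    """The downloaded page handed back to the callback."""
    def __init__(self, url, body):
        self.url = url
        self.body = body

def fetch(request):
    # Stand-in for the Downloader: pretend every page downloads instantly.
    return Response(request.url, body="<html>%s</html>" % request.url)

def crawl(start_urls, parse):
    """Drive the loop: initial Requests -> download -> callback ->
    more Requests or Items -> 'pipeline' (here just a list)."""
    items = []
    queue = deque(Request(u, parse) for u in start_urls)  # step (1)
    while queue:
        request = queue.popleft()
        response = fetch(request)
        for result in request.callback(response):         # step (2)
            if isinstance(result, Request):
                queue.append(result)   # new Requests get scheduled
            else:
                items.append(result)   # step (4): Items go to the pipeline
    return items

def parse(response):
    # Step (3): extract data; follow one extra page from the first response.
    yield {"url": response.url}
    if response.url == "http://example.com/page1":
        yield Request("http://example.com/page2", callback=parse)

items = crawl(["http://example.com/page1"], parse)
```

After the loop drains, `items` holds one dict per visited page, in crawl order.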
Let's look at the source of the Spider class:
# scrapy.spiders (excerpt)
import logging

from scrapy.utils.trackref import object_ref


class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None
    custom_settings = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    @property
    def logger(self):
        logger = logging.getLogger(self.name)
        return logging.LoggerAdapter(logger, {'spider': self})

    def log(self, message, level=logging.DEBUG, **kw):
        """Log the given message at the given log level

        This helper wraps a log call to the logger within the spider, but you
        can use it directly (e.g. Spider.logger.info('msg')) or use any other
        Python logger too.
        """
        self.logger.log(level, message, **kw)
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(*args, **kwargs)
        spider._set_crawler(crawler)
        return spider
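The constructor above enforces that every spider has a `name` (either as a class attribute or a constructor argument), turns extra keyword arguments into instance attributes, and defaults `start_urls` to an empty list. A small sketch demonstrates this behavior; the `Spider` class here is a trimmed re-implementation of the `__init__` shown above (without the `object_ref` base class), and `QuotesSpider` is a hypothetical subclass, not part of Scrapy.

```python
class Spider:
    # Trimmed copy of the constructor logic from the source above,
    # for illustration only.
    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)   # extra kwargs become attributes
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

class QuotesSpider(Spider):
    name = 'quotes'   # a class-level name satisfies the check

spider = QuotesSpider(category='humor')  # kwargs become attributes
assert spider.name == 'quotes'
assert spider.category == 'humor'
assert spider.start_urls == []

# Without a name anywhere, instantiation fails fast:
try:
    Spider()
except ValueError as e:
    print(e)   # prints "Spider must have a name"
```

This fail-fast check is why forgetting `name` in a spider subclass raises an error at startup rather than producing a silently unnamed crawler.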