Scrapy Source Code (2): Where the Spider Begins

Running Scrapy Commands

Generally speaking, these are the usual ways to run a Scrapy project (running Scrapy from a script is not considered here):

Usage examples:

$ scrapy crawl myspider
[ ... myspider starts crawling ... ]

$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]

A better approach, however, is to create a new Python file like the one below, since it makes debugging easier:

from scrapy import cmdline

cmdline.execute("scrapy crawl myspider".split())

It is easy to see that the entry point for running Scrapy is the execute() function in cmdline.py, so let's study what this function does.

Analyzing the Source

def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    if settings is None and 'scrapy.conf' in sys.modules:
        from scrapy import conf
        if hasattr(conf, 'settings'):
            settings = conf.settings
    # ------------------------------------------------------------------

argv defaults to sys.argv; the block above is a backwards-compatibility shim that looks for the legacy scrapy.conf settings singleton and reuses its settings if that module has already been imported.

    if settings is None:
        settings = get_project_settings()
        # set EDITOR from environment if available
        try:
            editor = os.environ['EDITOR']
        except KeyError:
            pass
        else:
            settings['EDITOR'] = editor
    check_deprecated_settings(settings)

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    import warnings
    from scrapy.exceptions import ScrapyDeprecationWarning
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ScrapyDeprecationWarning)
        from scrapy import conf
        conf.settings = settings
    # ------------------------------------------------------------------

If no settings were found, it reads the project settings and imports the project by calling get_project_settings(), which lives in project.py under the utils folder:

def get_project_settings():
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')
        init_env(project)
project.py

The init_env() function is as follows:

def init_env(project='default', set_syspath=True):
    """Initialize environment to use command-line tool from inside a project
    dir. This sets the Scrapy settings module and modifies the Python path to
    be able to locate the project module.
    """
    cfg = get_config()
    if cfg.has_option('settings', project):
        os.environ['SCRAPY_SETTINGS_MODULE'] = cfg.get('settings', project)
    closest = closest_scrapy_cfg()
    if closest:
        projdir = os.path.dirname(closest)
        if set_syspath and projdir not in sys.path:
            sys.path.append(projdir)
conf.py

As the docstring says, this initializes the environment: closest_scrapy_cfg() walks up through parent directories until it finds the project's scrapy.cfg, the settings module named in that file is recorded in the SCRAPY_SETTINGS_MODULE environment variable, and the Python path is modified so the project module can be located. (Note that the file being searched for is scrapy.cfg, not settings.py itself; settings.py is only reached indirectly, through the module path declared in scrapy.cfg.)
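For reference, a minimal scrapy.cfg (of the kind scrapy startproject generates; 'myproject' is a placeholder) contains exactly the [settings] section that cfg.get('settings', project) reads, with 'default' as the project key:

[settings]
default = myproject.settings

Continuing with get_project_settings():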

    settings = Settings()
    settings_module_path = os.environ.get(ENVVAR)
    if settings_module_path:
        settings.setmodule(settings_module_path, priority='project')

    # XXX: remove this hack
    pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
    if pickled_settings:
        settings.setdict(pickle.loads(pickled_settings), priority='project')

    # XXX: deprecate and remove this functionality
    env_overrides = {k[7:]: v for k, v in os.environ.items() if
                     k.startswith('SCRAPY_')}
    if env_overrides:
        settings.setdict(env_overrides, priority='project')

    return settings
project.py

That concludes get_project_settings(); as its name suggests, it returns the project settings. Note the two extra layers applied at 'project' priority: pickled overrides from SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE, and any SCRAPY_-prefixed environment variable with the prefix stripped.
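As a quick check, get_project_settings() can also be called directly from a script. This is a minimal sketch, assuming it runs inside a project generated by scrapy startproject (so BOT_NAME exists) and a Scrapy version, like the one shown above, that still honors SCRAPY_-prefixed overrides:

import os
from scrapy.utils.project import get_project_settings

# Any SCRAPY_-prefixed variable overrides the matching setting
# (the prefix is stripped by the env_overrides block above).
os.environ['SCRAPY_LOG_LEVEL'] = 'WARNING'

settings = get_project_settings()
print(settings['BOT_NAME'])    # from myproject/settings.py
print(settings['LOG_LEVEL'])   # 'WARNING', taken from the environment

Back in execute(), the function continues: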

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
        conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

inside_project() checks whether the command is being executed inside a project, mainly by testing whether the SCRAPY_SETTINGS_MODULE module can be imported or a scrapy.cfg file exists nearby.

_get_commands_dict() reads the commands folder and converts every command class into a {cmd_name: cmd_instance} dict, roughly as sketched below.
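Here is a simplified sketch of how that dict gets built; the real logic lives in _get_commands_from_module() and _iter_command_classes() in cmdline.py, and this version omits the inproject filtering and the project's COMMANDS_MODULE:

import inspect

from scrapy.commands import ScrapyCommand
from scrapy.utils.misc import walk_modules

def commands_dict(module_path='scrapy.commands'):
    """Simplified sketch: map command name -> command instance."""
    cmds = {}
    for module in walk_modules(module_path):
        for obj in vars(module).values():
            if (inspect.isclass(obj)
                    and issubclass(obj, ScrapyCommand)
                    and obj.__module__ == module.__name__
                    and obj is not ScrapyCommand):
                # crawl.py -> 'crawl', shell.py -> 'shell', and so on
                cmds[obj.__module__.split('.')[-1]] = obj()
    return cmds

print(sorted(commands_dict()))   # ['bench', 'check', 'crawl', ...]

execute() then proceeds: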

    cmd = cmds[cmdname]
    parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

It looks up the command instance by name, merges the command's default_settings at 'command' priority, attaches the settings to the command, registers the command's option-parsing rules, parses the remaining command-line arguments, and hands them to the command instance to process.
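To make the parsing step concrete, here is an illustrative mirror of that flow; the -o option stands in for whatever a command registers in add_options(), and note that _pop_command_name() has already removed 'crawl' from argv before parse_args() sees it:

import optparse

parser = optparse.OptionParser(
    formatter=optparse.TitledHelpFormatter(),
    conflict_handler='resolve',
)
# a command's add_options(parser) registers entries like this one
parser.add_option('-o', '--output', metavar='FILE',
                  help='dump scraped items into FILE')

# argv was ['scrapy', 'crawl', 'myspider', '-o', 'items.json'];
# after popping 'crawl', argv[1:] is what parse_args() receives
opts, args = parser.parse_args(['myspider', '-o', 'items.json'])
print(args)          # ['myspider']
print(opts.output)   # 'items.json'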

Finally, let's look at the following code.

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)

It initializes a CrawlerProcess instance and executes the corresponding command, which here is crawl:

def _run_command(cmd, args, opts):
    if opts.profile:
        _run_command_profiled(cmd, args, opts)
    else:
        cmd.run(args, opts)
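(opts.profile comes from the global --profile=FILE option that the ScrapyCommand base class registers; when set, _run_command_profiled() runs the command under cProfile and writes the stats to FILE.)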

Seeing this brings to mind the section Run Scrapy from a script in the documentation:

# Here’s an example showing how to run a single spider with it.

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

So every Scrapy crawl ends up using CrawlerProcess; to understand it in depth, read the source in scrapy/scrapy/crawler.py. Its docstring summarizes it well:

    """
   A class to run multiple scrapy crawlers in a process simultaneously.

   This class extends :class:`~scrapy.crawler.CrawlerRunner` by adding support
   for starting a Twisted `reactor`_ and handling shutdown signals, like the
   keyboard interrupt command Ctrl-C. It also configures top-level logging.

   This utility should be a better fit than
   :class:`~scrapy.crawler.CrawlerRunner` if you aren't running another
   Twisted `reactor`_ within your application.

   The CrawlerProcess object must be instantiated with a
   :class:`~scrapy.settings.Settings` object.

   :param install_root_handler: whether to install root logging handler
       (default: True)

   This class shouldn't be needed (since Scrapy is responsible of using it
   accordingly) unless writing scripts that manually handle the crawling
   process. See :ref:`run-from-script` for an example.
   """

Finally, the original post attached a diagram of Scrapy's call path (image not reproduced here).

Summary

In short, it boils down to a few steps:

  1. Read the configuration and apply it to the crawler

  2. Convert all command classes into a dict of names to instances

  3. Initialize a CrawlerProcess instance and run the spider

(Reading all this is headache-inducing; there are so many function names I can't keep them straight.)

Previously

Scrapy Source Code (1): An Overview of the Crawl Flow
