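Purpose
I have been writing Scrapy spiders recently and needed to start a crawl from outside the Scrapy project. Because the scrapy crawl command only works in the directory that contains scrapy.cfg (or in one of its subdirectories), I became curious about how Scrapy executes its commands.
Scrapy commands are divided into global commands and project-only commands:
- Global commands can be used anywhere, for example startproject
- Project-only commands can only be used inside a Scrapy project
Questions to answer: how does Scrapy execute a command, and how does it decide whether the command is being run inside a project?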
Official documentation
Command line tool — Scrapy 2.11.1 documentation states:
- "The directory where the scrapy.cfg file resides is known as the project root directory."
- "You can also add your custom project commands by using the COMMANDS_MODULE setting."
Call analysis
Create a new Scrapy project and run scrapy crawl xx outside the project. The output is:
Scrapy 2.11.1 - no active project
Unknown command: crawl
Use "scrapy" to see available commands
Running the same command inside the project prints:
2024-03-26 16:44:33 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: test_scrapy)
....
Traceback (most recent call last):
File "D:\Project\Python\TestProject\.venv\lib\site-packages\scrapy\spiderloader.py", line 87, in load
return self._spiders[spider_name]
KeyError: 'xx'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\ProgramAddress\python\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\ProgramAddress\python\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\Project\Python\TestProject\.venv\Scripts\scrapy.exe\__main__.py", line 7, in <module>
File "D:\Project\Python\TestProject\.venv\lib\site-packages\scrapy\cmdline.py", line 161, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "D:\Project\Python\TestProject\.venv\lib\site-packages\scrapy\cmdline.py", line 114, in _run_print_help
func(*a, **kw)
File "D:\Project\Python\TestProject\.venv\lib\site-packages\scrapy\cmdline.py", line 169, in _run_command
cmd.run(args, opts)
...
spidercls = self.spider_loader.load(spidercls)
File "D:\Project\Python\TestProject\.venv\lib\site-packages\scrapy\spiderloader.py", line 89, in load
raise KeyError(f"Spider not found: {spider_name}")
KeyError: 'Spider not found: xx'
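Analysis:
- The test_scrapy at the end of the first line is the BOT_NAME defined in settings.py, which shows that settings.py is parsed somewhere while the command runs
- __main__.py contains the code that starts execution
- spiderloader.py loads the project's spiders and looks up the requested spider
The difference between the two runs is whether scrapy.cfg can be found from the directory where the command is executed.
Move scrapy.cfg out of the project directory and run the command again: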
Traceback (most recent call last):
....
File "D:\Project\Python\TestProject\.venv\Scripts\scrapy.exe\__main__.py", line 7, in <module>
File "D:\Project\Python\TestProject\.venv\lib\site-packages\scrapy\cmdline.py", line 128, in execute
settings = get_project_settings()
File "D:\Project\Python\TestProject\.venv\lib\site-packages\scrapy\utils\project.py", line 71, in get_project_settings
settings.setmodule(settings_module_path, priority="project")
File "D:\Project\Python\TestProject\.venv\lib\site-packages\scrapy\settings\__init__.py", line 385, in setmodule
module = import_module(module)
File "D:\ProgramAddress\python\lib\importlib\__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'test_scrapy.settings'
Analysis: get_project_settings did read scrapy.cfg, but the test_scrapy.settings module named in it can no longer be imported from the new location, so the import fails.
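The failure above comes down to how Python resolves dotted module names against sys.path. The following is a minimal stand-alone sketch of that behaviour, not Scrapy code: the package name demo_project_pkg and the helpers can_import and demo are made up for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def can_import(module_name: str) -> bool:
    """Return True if module_name can be imported with the current sys.path."""
    importlib.invalidate_caches()
    try:
        importlib.import_module(module_name)
    except ModuleNotFoundError:
        return False
    return True


def demo() -> Tuple[bool, bool]:
    """Try importing a throwaway package before and after its parent is on sys.path."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_project_pkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "settings.py").write_text("BOT_NAME = 'demo'\n")

        before = can_import("demo_project_pkg.settings")
        sys.path.insert(0, tmp)
        try:
            after = can_import("demo_project_pkg.settings")
        finally:
            # Undo the changes so the demo leaves no trace behind.
            sys.path.remove(tmp)
            sys.modules.pop("demo_project_pkg.settings", None)
            sys.modules.pop("demo_project_pkg", None)
    return before, after


if __name__ == "__main__":
    print(demo())
```

The same dotted name fails or succeeds depending only on which directories are on sys.path, which is why a settings path written for one directory layout stops working after files are moved.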
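Putting this together, my guess at the overall flow is:
- Decide whether the command is being run inside a project (by checking whether scrapy.cfg exists)
- If it is, take the module path stored under default in the [settings] section of scrapy.cfg and read the configuration in settings.py (by importing settings.py as a module)
- Load the project's contents (spiders, pipelines, and other components)
- Execute the command crawl xx by looking the spider up among the loaded spiders (return self._spiders[spider_name])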
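Source code analysis
The source code of the Scrapy framework is in .../site-packages/scrapy, under the directory where pip installs packages.
It can also be browsed on GitHub: scrapy/scrapy at master · scrapy/scrapy
__main__.py
According to the call stack in the traceback, __main__.py is what scrapy.exe runs: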
from scrapy.cmdline import execute

if __name__ == "__main__":
    execute()
cmdline.py#execute()
Some unrelated code is omitted below.
# Called with argv=None, settings=None
def execute(argv=None, settings=None):
    if argv is None:
        # sys.argv holds the arguments passed by the calling program,
        # for example: ["xxx/Scrapy.exe", "crawl", "xx"]
        argv = sys.argv
    if settings is None:
        # Read the project's settings; the call chain is covered below
        settings = get_project_settings()
    ...
    # Decide whether the current directory is inside a project
    # (tries to import the settings module, otherwise looks for scrapy.cfg)
    inproject = inside_project()
    # cmds holds the commands from scrapy.commands and from COMMANDS_MODULE;
    # if inproject is False (not in a project), cmds contains only global commands
    cmds = _get_commands_dict(settings, inproject)
    # Get the command name
    cmdname = _pop_command_name(argv)
    if not cmdname:
        # No command name given: print the available commands
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        # Unknown command
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)
    # Get the matching command
    cmd = cmds[cmdname]
    ...
    cmd.settings = settings
    cmd.add_options(parser)
    # opts: extra options given after the command (e.g. --logfile=xxx)
    # args: positional arguments the command needs (e.g. the xx in scrapy crawl xx)
    opts, args = parser.parse_known_args(args=argv[1:])
    # _run_print_help calls the given func, catches UsageError and reports it via parser
    _run_print_help(parser, cmd.process_options, args, opts)
    # Create the CrawlerProcess with the loaded settings
    cmd.crawler_process = CrawlerProcess(settings)
    # This runs the command; it ends up calling the command's run method
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)
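The dispatch logic above can be modelled with a small stand-alone sketch. This is not Scrapy's code: ToyCommand, ALL_COMMANDS, available_commands, and dispatch are hypothetical names, and the registry holds just two example commands. It only illustrates how filtering on requires_project turns crawl into an unknown command outside a project.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyCommand:
    """Hypothetical stand-in for a Scrapy command class."""
    name: str
    requires_project: bool

    def run(self, args: List[str]) -> str:
        return f"ran {self.name} with {args}"


# Hypothetical registry; Scrapy builds its own from scrapy.commands and COMMANDS_MODULE.
ALL_COMMANDS = [
    ToyCommand("startproject", requires_project=False),
    ToyCommand("crawl", requires_project=True),
]


def available_commands(inproject: bool) -> Dict[str, ToyCommand]:
    """Keep project-only commands only when we are inside a project."""
    return {
        cmd.name: cmd
        for cmd in ALL_COMMANDS
        if inproject or not cmd.requires_project
    }


def dispatch(argv: List[str], inproject: bool) -> str:
    """Look up argv[0] among the available commands and run it."""
    cmds = available_commands(inproject)
    if not argv:
        return "Available commands: " + ", ".join(sorted(cmds))
    name, *args = argv
    if name not in cmds:
        return f"Unknown command: {name}"
    return cmds[name].run(args)


if __name__ == "__main__":
    print(dispatch(["crawl", "xx"], inproject=False))
    print(dispatch(["crawl", "xx"], inproject=True))
```

With inproject=False the toy registry no longer contains crawl, which mirrors the "Unknown command: crawl" message seen earlier.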
Parsing the project's settings.py
project.py#get_project_settings(): locates settings.py and reads its configuration
def get_project_settings() -> Settings:
    # Module-level constant: ENVVAR = "SCRAPY_SETTINGS_MODULE"
    if ENVVAR not in os.environ:
        # Marker a: the environment variable is not set, so scrapy.cfg must be read
        # to find the settings module, which is then stored in the environment
        project = os.environ.get("SCRAPY_PROJECT", "default")
        init_env(project)
    settings = Settings()
    # Read the module path of settings.py
    settings_module_path = os.environ.get(ENVVAR)
    # Load every setting from the settings module into the Settings instance
    if settings_module_path:
        settings.setmodule(settings_module_path, priority="project")
    ....
    return settings
Marker a: the official Scrapy documentation states:
- "When you use Scrapy, you have to tell it which settings you're using. You can do this by using an environment variable, SCRAPY_SETTINGS_MODULE."
The SCRAPY_PROJECT environment variable selects which entry of the [settings] section in scrapy.cfg is used.
scrapy.cfg content:
[settings]
default = test_scrapy.settings
test = xxxx.settings
Setting the SCRAPY_PROJECT environment variable:
import os

if __name__ == '__main__':
    os.environ.setdefault("SCRAPY_PROJECT", "test")
    os.system("scrapy crawl xx")  # the settings path registered under "test" is now used
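The lookup can be reproduced with the standard library alone. The sketch below is an illustration, not Scrapy's implementation: settings_module_for is a hypothetical helper, and the environment is passed in as a plain dict so the example does not touch the real os.environ.

```python
from configparser import ConfigParser
from typing import Mapping, Optional

# Same content as the scrapy.cfg example above.
CFG_TEXT = """
[settings]
default = test_scrapy.settings
test = xxxx.settings
"""


def settings_module_for(cfg_text: str, environ: Mapping[str, str]) -> Optional[str]:
    """Return the settings module selected by SCRAPY_PROJECT, or None if absent."""
    cfg = ConfigParser()
    cfg.read_string(cfg_text)
    # Without SCRAPY_PROJECT the "default" entry is used.
    project = environ.get("SCRAPY_PROJECT", "default")
    if cfg.has_option("settings", project):
        return cfg.get("settings", project)
    return None


if __name__ == "__main__":
    print(settings_module_for(CFG_TEXT, {}))
    print(settings_module_for(CFG_TEXT, {"SCRAPY_PROJECT": "test"}))
```

An entry name that is missing from [settings] yields None here, which corresponds to the case where no settings module gets registered at all.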
Locating the project
In conf.py:
- closest_scrapy_cfg(): recursively searches for scrapy.cfg, moving up one directory level at a time

    def closest_scrapy_cfg(
        path: Union[str, os.PathLike] = ".",
        prevpath: Optional[Union[str, os.PathLike]] = None,
    ) -> str:
        """Return the path to the closest scrapy.cfg file by traversing the current
        directory and its parents
        """
        if prevpath is not None and str(path) == str(prevpath):
            return ""
        path = Path(path).resolve()
        cfgfile = path / "scrapy.cfg"
        if cfgfile.exists():
            return str(cfgfile)
        return closest_scrapy_cfg(path.parent, path)

- get_sources(): returns every path where a scrapy.cfg may exist (use_closest controls whether the one closest to the current directory is included)

    def get_sources(use_closest: bool = True) -> List[str]:
        xdg_config_home = (
            os.environ.get("XDG_CONFIG_HOME") or Path("~/.config").expanduser()
        )
        sources = [
            "/etc/scrapy.cfg",
            r"c:\scrapy\scrapy.cfg",
            str(Path(xdg_config_home) / "scrapy.cfg"),
            str(Path("~/.scrapy.cfg").expanduser()),
        ]
        if use_closest:
            sources.append(closest_scrapy_cfg())
        return sources

- get_config(): parses the contents of sources (that is, the scrapy.cfg files)

    def get_config(use_closest: bool = True) -> ConfigParser:
        """Get Scrapy config file as a ConfigParser"""
        sources = get_sources(use_closest)
        cfg = ConfigParser()
        cfg.read(sources)
        return cfg

- init_env(): once the cfg file is parsed, reads the [settings] entry (the module path of settings.py), stores it in the environment (os.environ), and adds the project directory to the module search path (sys.path)

    def init_env(project: str = "default", set_syspath: bool = True) -> None:
        cfg = get_config()
        # Store the settings module path in the environment
        if cfg.has_option("settings", project):
            os.environ["SCRAPY_SETTINGS_MODULE"] = cfg.get("settings", project)
        closest = closest_scrapy_cfg()
        if closest:
            projdir = str(Path(closest).parent)
            # Add the project directory to sys.path
            if set_syspath and projdir not in sys.path:
                sys.path.append(projdir)
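To see the upward search in isolation, here is a stand-alone sketch that does the same walk with a loop instead of recursion. find_scrapy_cfg is a hypothetical name, not part of Scrapy, and the directory layout is created in a temporary folder purely for the demonstration.

```python
import tempfile
from pathlib import Path


def find_scrapy_cfg(start: Path) -> str:
    """Walk from start up to the filesystem root; return the first scrapy.cfg found, or ""."""
    current = Path(start).resolve()
    for directory in (current, *current.parents):
        candidate = directory / "scrapy.cfg"
        if candidate.exists():
            return str(candidate)
    return ""


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Throwaway layout: <tmp>/test_scrapy/scrapy.cfg and a nested spiders directory.
        root = Path(tmp).resolve() / "test_scrapy"
        nested = root / "test_scrapy" / "spiders"
        nested.mkdir(parents=True)
        (root / "scrapy.cfg").write_text("[settings]\ndefault = test_scrapy.settings\n")
        # Starting two levels below the project root still finds the file.
        print(find_scrapy_cfg(nested))
```

Because the search climbs through parent directories, a command started anywhere below the project root still counts as being inside the project.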
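Summary
Overall call flow of execute:
- get_project_settings: locates the project (where scrapy.cfg lives) and parses the project's settings.py
- inside_project: determines whether the command is being run inside a project
- _get_commands_dict: collects the framework's commands and the project's custom commands (if any)
- cmd = cmds[cmdname]: gets the command instance that matches the command name
- Finally, its cmd.run method is called to execute the command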
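scrapy.commands.ScrapyCommand
scrapy/commands/__init__.py contains ScrapyCommand, the base class that commands inherit from, and its subclass BaseRunSpiderCommand.
ScrapyCommand defines the following attributes and methods:
- requires_project: whether the command must be run inside a project
- crawler_process: the CrawlerProcess the command runs with
- exitcode: the exit status of the command
- syntax(): returns a string describing the command's usage (without the command name)
- short_desc(): returns a short description of the command
- long_desc(): returns a detailed description of the command
- help(): returns the help text of the command; the help command calls this method
- add_options(): registers the extra options the command accepts
- process_options(args, opts): processes the command's extra options
- run(): the main code that executes the command
BaseRunSpiderCommand extends add_options and process_options with some additional handling.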
Example: scrapy.commands.crawl
This is the most commonly used command, and it also shows how Scrapy starts a crawl.
from scrapy.commands import BaseRunSpiderCommand
from scrapy.exceptions import UsageError


class Command(BaseRunSpiderCommand):
    requires_project = True

    def syntax(self):
        return "[options] <spider>"

    def short_desc(self):
        return "Run a spider"

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError(
                "running 'scrapy crawl' with more than one spider is not supported"
            )
        # The first argument is the name of the spider to run
        spname = args[0]
        # crawler_process schedules the spider here
        crawl_defer = self.crawler_process.crawl(spname, **opts.spargs)
        if getattr(crawl_defer, "result", None) is not None and issubclass(
            crawl_defer.result.type, Exception
        ):
            self.exitcode = 1
        else:
            # Start crawling; this call blocks until the crawl has finished
            self.crawler_process.start()
            if (
                self.crawler_process.bootstrap_failed
                or hasattr(self.crawler_process, "has_exception")
                and self.crawler_process.has_exception
            ):
                self.exitcode = 1
In short, the crawl command uses CrawlerProcess to start the spider.
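Application
Starting Scrapy from outside the project
From the analysis above and from the official documentation (Run Scrapy from a script), it follows that a spider can be started from outside the project as long as we implement something similar to get_project_settings that loads the project's settings.py.
Current directory layout:
test
└─test_scrapy
    ├─scrapy.cfg
    └─test_scrapy
        ├─spiders
        ├─...
        └─settings.py
The goal is to start the project's spiders from the test directory.
My first attempt was the following code (it has a problem, do not copy it as is):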
import importlib
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings


def get_project_settings(module_path: str) -> Settings:
    # Import settings.py as a module and copy every upper-case name into Settings
    settings_module = importlib.import_module(module_path)
    settings = Settings()
    for key in dir(settings_module):
        if key.isupper():
            settings.set(key, getattr(settings_module, key), priority="project")
    return settings


def run_spider(spider_name: str, settings: Settings):
    crawler_process = CrawlerProcess(settings)
    crawl_defer = crawler_process.crawl(spider_name)
    if getattr(crawl_defer, "result", None) is not None and issubclass(
        crawl_defer.result.type, Exception
    ):
        exitcode = 1
    else:
        # Blocks until the crawl has finished
        crawler_process.start()
        if (
            crawler_process.bootstrap_failed
            or hasattr(crawler_process, "has_exception")
            and crawler_process.has_exception
        ):
            exitcode = 1
        else:
            exitcode = 0
    sys.exit(exitcode)


if __name__ == '__main__':
    settings = get_project_settings("test_scrapy.test_scrapy.settings")
    run_spider("test", settings)
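This fails with: ModuleNotFoundError: No module named 'test_scrapy.spiders'
The call stack shows that the spider loader still reads SPIDER_MODULES from settings.py when loading spiders. The place we call from has changed, but the paths in the settings file have not, so the module cannot be found.
I therefore decided to rewrite every project path while reading settings.py:
- The original project paths look like test_scrapy.xxx
- The paths as seen from the new location look like test_scrapy.test_scrapy.xxx
- The trailing test_scrapy.xxx stays the same, so it is enough to add a prefix in front
Complete code: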
import importlib
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings


def get_project_settings(module_path: str) -> Settings:
    split_path = module_path.split(".")
    new_prefix = '.'.join(split_path[:-2])  # prefix to add in front
    old_prefix = '.'.join(split_path[-2:-1])  # original project package name

    def modify(value):
        # Only rewrite dotted paths inside the project package, so plain
        # strings such as BOT_NAME are left untouched
        if isinstance(value, str) and value.startswith(old_prefix + "."):
            return '.'.join([new_prefix, value])
        elif isinstance(value, list):
            return [modify(item) for item in value]
        elif isinstance(value, dict):
            # For dict settings the paths are the keys (e.g. ITEM_PIPELINES)
            return {
                modify(k): v
                for k, v in value.items()
            }
        else:
            return value

    settings_module = importlib.import_module(module_path)
    settings = Settings()
    for key in dir(settings_module):
        if key.isupper():
            settings.set(key, modify(getattr(settings_module, key)), priority="project")
    return settings


def run_spider(spider_name: str, settings: Settings):
    crawler_process = CrawlerProcess(settings)
    crawl_defer = crawler_process.crawl(spider_name)
    if getattr(crawl_defer, "result", None) is not None and issubclass(
        crawl_defer.result.type, Exception
    ):
        exitcode = 1
    else:
        # Blocks until the crawl has finished
        crawler_process.start()
        if (
            crawler_process.bootstrap_failed
            or hasattr(crawler_process, "has_exception")
            and crawler_process.has_exception
        ):
            exitcode = 1
        else:
            exitcode = 0
    sys.exit(exitcode)


if __name__ == '__main__':
    settings = get_project_settings("test_scrapy.test_scrapy.settings")
    run_spider("test", settings)
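A possible alternative, sketched here under assumptions rather than taken from the code above, is to imitate what init_env does: put the project root on sys.path and set SCRAPY_SETTINGS_MODULE, then let Scrapy's own get_project_settings load the settings unchanged. prepare_project_env and crawl_from_outside are hypothetical helper names; the sketch inserts at the front of sys.path, whereas init_env appends.

```python
import os
import sys
from pathlib import Path
from typing import Union


def prepare_project_env(
    project_root: Union[str, os.PathLike], settings_module: str
) -> None:
    """Make the project importable and tell Scrapy which settings module to use."""
    root = str(Path(project_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    os.environ["SCRAPY_SETTINGS_MODULE"] = settings_module


def crawl_from_outside(
    spider_name: str, project_root: Union[str, os.PathLike], settings_module: str
) -> None:
    """Hypothetical entry point: prepare the environment, then run one spider."""
    prepare_project_env(project_root, settings_module)
    # Imported late so that the environment is already prepared.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_name)
    process.start()  # blocks until the crawl has finished
```

With the directory layout shown earlier, a script in the test directory would call crawl_from_outside("test", "test_scrapy", "test_scrapy.settings"). Since the project root is then on sys.path, the paths inside settings.py should not need rewriting, although this sketch has not been run against the project described here.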
This post is my own work. If you spot any mistakes, please point them out. Thank you.