Configuration settings
Scrapy will look for configuration parameters in ini-style scrapy.cfg files in these standard locations:

/etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-wide),
~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) for global (user-wide) settings, and
scrapy.cfg in the root of a Scrapy project (see the next section).

Settings from these files are merged in the listed order of preference: user-defined values take precedence over system-wide defaults, and project-wide settings, when defined, override all the others.

Scrapy also understands, and can be configured through, a number of environment variables. Currently these are:
SCRAPY_SETTINGS_MODULE (see Designating the settings)
SCRAPY_PROJECT
SCRAPY_PYTHON_SHELL (see Scrapy shell)
Default structure of Scrapy projects
Before delving into the command-line tool and its sub-commands, let's first look at the directory structure of a Scrapy project.
scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
The directory where the scrapy.cfg file resides is known as the project root directory. That file contains the name of the Python module that defines the project's settings. Here is an example:
[settings]
default = myproject.settings
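Locating the project root can be sketched as a walk up the directory tree until a scrapy.cfg is found (Scrapy ships a similar helper, scrapy.utils.conf.closest_scrapy_cfg). The directory layout below is a hypothetical stand-in built in a temporary folder:

```python
import tempfile
from configparser import ConfigParser
from pathlib import Path

def find_project_root(start):
    """Walk upward from `start` until a directory containing scrapy.cfg
    is found; return None if the filesystem root is reached first."""
    for candidate in [start, *start.parents]:
        if (candidate / "scrapy.cfg").is_file():
            return candidate
    return None

# Demo on a temporary tree: <root>/scrapy.cfg and <root>/myproject/spiders/
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "scrapy.cfg").write_text("[settings]\ndefault = myproject.settings\n")
    nested = root / "myproject" / "spiders"
    nested.mkdir(parents=True)

    # Even from a deeply nested directory, the root is found.
    found = find_project_root(nested)
    print(found == root)  # True

    parser = ConfigParser()
    parser.read(found / "scrapy.cfg")
    default_module = parser.get("settings", "default")
    print(default_module)  # myproject.settings
```

This is why project-only commands work from any subdirectory of the project: the nearest scrapy.cfg above the current directory marks the root.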
Using the scrapy tool
Running scrapy on the command line with no arguments prints some usage help, listing the available commands, options, and arguments.
The first thing you typically do is create a project, here named myproject, under the directory project_dir:

scrapy startproject myproject [project_dir]
Next, you go inside the new project directory:
cd project_dir
scrapy genspider mydomain mydomain.com
Some Scrapy commands (like crawl) must be run from inside a Scrapy project.
Keep in mind that you can always get more information about each command by running:

scrapy <command> -h

For example: scrapy view -h
And you can see all available commands with:
scrapy -h
Global commands: these work without an existing project, so no project needs to be created or specified:
startproject
scrapy startproject <project_name> [project_dir]

genspider
scrapy genspider [-t template] <name> <domain>
Usage example:

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
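Under the hood, genspider renders a new spider module from one of those templates. A rough sketch with string.Template, using a hypothetical template string (the real templates ship inside the Scrapy package and are not reproduced here):

```python
from string import Template

# A hypothetical spider template, loosely modeled on what
# `scrapy genspider -t basic` produces; this is an illustration,
# not Scrapy's actual template file.
BASIC_TEMPLATE = Template('''\
import scrapy


class ${classname}(scrapy.Spider):
    name = "${name}"
    allowed_domains = ["${domain}"]
    start_urls = ["https://${domain}/"]

    def parse(self, response):
        pass
''')

def render_spider(name, domain):
    # Derive a class name from the spider name, roughly as the tool does.
    classname = name.capitalize() + "Spider"
    return BASIC_TEMPLATE.substitute(classname=classname, name=name, domain=domain)

print(render_spider("example", "example.com"))
```

The rendered text would then be written to spiders/<name>.py inside the project.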
settings
scrapy settings [options]

$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
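The way settings --get resolves a value from layered sources can be sketched with collections.ChainMap. The dicts here are hypothetical; Scrapy's real Settings class tracks an explicit priority per key rather than a simple chain:

```python
from collections import ChainMap

# Hypothetical setting sources, highest priority first.
project_settings = {"BOT_NAME": "scrapybot"}
default_settings = {"BOT_NAME": "defaultbot", "DOWNLOAD_DELAY": 0}

# Lookups consult maps left to right and return the first hit.
settings = ChainMap(project_settings, default_settings)
print(settings["BOT_NAME"])        # scrapybot (project value wins)
print(settings["DOWNLOAD_DELAY"])  # 0 (falls through to the defaults)
```

This mirrors the transcript above: BOT_NAME reflects the project's value, while DOWNLOAD_DELAY comes from Scrapy's built-in defaults.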
runspider
scrapy runspider <spider_file.py>

$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
shell
scrapy shell [url]

$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]

$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')

# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')

# you can disable that with --no-redirect
# (only for the URL passed as a command-line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')

fetch
scrapy fetch <url>

$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263'],
 'Connection': ['close'],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}

view
scrapy view <url>

$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]
version
scrapy version [-v]
Project-only commands: these require an existing project and must be run from inside it:

crawl
scrapy crawl <spider>
Runs a spider.
Usage examples:
$ scrapy crawl myspider
[ ... myspider starts crawling ... ]
check
scrapy check [-l] <spider>

$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
list
scrapy list
Usage example:
$ scrapy list
spider1
spider2
edit
scrapy edit <spider>
parse
scrapy parse <url> [options]

$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': u'Example item',
 'category': u'Furniture',
 'length': u'12 cm'}]

# Requests  -----------------------------------------------------------------
[]
bench
scrapy bench
Runs a quick benchmark test.