Run scrapy -h to list all available commands, and scrapy <command> -h to view help for a specific command.
Global commands
startproject
Create a new Scrapy project.
scrapy startproject myproject
genspider
Create a new spider from one of Scrapy's built-in templates.
$ scrapy genspider -l # list all available templates
Available templates:
basic
crawl
csvfeed
xmlfeed
$ scrapy genspider example example.com # use the default basic template
Created spider 'example' using template 'basic'
$ scrapy genspider -t crawl scrapyorg scrapy.org # use the crawl template
Created spider 'scrapyorg' using template 'crawl'
settings
Prints the value of a Scrapy setting. Inside a project it returns the project's value; otherwise it returns the Scrapy default.
# outside any project
$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
# inside a project
$ scrapy settings --get BOT_NAME
jd_comment
$ scrapy settings --get DOWNLOAD_DELAY
3
runspider
Runs a spider contained in a Python file without creating a project (inside a project, such files normally live in the spiders directory). Because there is no project, the spider runs with Scrapy's default settings; per-spider overrides still apply if the spider defines a custom_settings attribute.
scrapy runspider <spider_file.py>
shell
Starts the Scrapy shell for the given URL (if one is given, the page is fetched first); useful for testing XPath and CSS expressions interactively.
Options:
--spider=SPIDER: use the specified spider for the fetch
-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow 3xx redirects; this only affects the URL passed on the command line, pages fetched again inside the shell still follow redirects by default
$ scrapy shell http://www.taobao.com
...
2017-09-28 11:27:10 [scrapy.core.engine] INFO: Spider opened
2017-09-28 11:27:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.taobao.com/> from <GET http://www.taobao.com>
2017-09-28 11:27:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.taobao.com/> (referer: None)
2017-09-28 11:27:11 [traitlets] DEBUG: Using default logger
2017-09-28 11:27:11 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fb84e533748>
[s] item {}
[s] request <GET http://www.taobao.com>
[s] response <200 https://www.taobao.com/>
[s] settings <scrapy.settings.Settings object at 0x7fb84c9bb9e8>
[s] spider <DefaultSpider 'default' at 0x7fb84c59c898>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
# shell follows HTTP redirects by default
$ scrapy shell --nolog http://www.taobao.com -c '(response.status, response.url)'
(200, 'https://www.taobao.com/')
# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --nolog http://www.taobao.com -c '(response.status, response.url)' --no-redirect
(302, 'http://www.taobao.com')
fetch
Downloads the given URL and writes the page to standard output. Inside a project it uses the project's settings; otherwise it uses the defaults.
Options:
--spider=SPIDER: use the specified spider for the fetch
--headers: print the response's headers instead of the body
--no-redirect: do not follow redirects
$ scrapy fetch http://www.baidu.com --nolog
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a> 京ICP证030173号 <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
$ scrapy fetch http://www.baidu.com --nolog --headers
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Server: bfe/1.0.8.18
< Date: Thu, 28 Sep 2017 03:38:23 GMT
< Content-Type: text/html
< Last-Modified: Mon, 23 Jan 2017 13:27:29 GMT
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Pragma: no-cache
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
view
Fetches the given URL and opens it in a browser as Scrapy sees it; useful for checking that the spider receives the content you expect.
Options:
--spider=SPIDER: use the specified spider for the fetch
--no-redirect: do not follow redirects
$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]
bench
Runs a quick benchmark: starts a local HTTP server and crawls it at the maximum possible speed, useful for gauging Scrapy's throughput on your hardware.
$ scrapy bench
2017-09-28 10:42:43 [scrapy.core.engine] INFO: Spider opened
2017-09-28 10:42:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:45 [scrapy.extensions.logstats] INFO: Crawled 53 pages (at 3180 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:46 [scrapy.extensions.logstats] INFO: Crawled 101 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:47 [scrapy.extensions.logstats] INFO: Crawled 149 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:48 [scrapy.extensions.logstats] INFO: Crawled 197 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:49 [scrapy.extensions.logstats] INFO: Crawled 237 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:50 [scrapy.extensions.logstats] INFO: Crawled 277 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:51 [scrapy.extensions.logstats] INFO: Crawled 309 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:52 [scrapy.extensions.logstats] INFO: Crawled 349 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:53 [scrapy.extensions.logstats] INFO: Crawled 381 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:54 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2017-09-28 10:42:54 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 174487,
'downloader/request_count': 429,
'downloader/request_method_count/GET': 429,
'downloader/response_bytes': 1155395,
'downloader/response_count': 429,
'downloader/response_status_count/200': 429,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2017, 9, 28, 2, 42, 55, 61436),
'log_count/INFO': 17,
'memusage/max': 44789760,
'memusage/startup': 44789760,
'request_depth_max': 15,
'response_received_count': 429,
'scheduler/dequeued': 429,
'scheduler/dequeued/memory': 429,
'scheduler/enqueued': 8580,
'scheduler/enqueued/memory': 8580,
'start_time': datetime.datetime(2017, 9, 28, 2, 42, 44, 119249)}
2017-09-28 10:42:55 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)
version
As the name implies, prints the Scrapy version.
$ scrapy version
Scrapy 1.4.0
The -v flag additionally prints the versions of Python, Twisted, lxml and other components, plus platform information.
$ scrapy version -v
Scrapy : 1.4.0
lxml : 3.7.3.0
libxml2 : 2.9.4A
cssselect : 1.0.1
parsel : 1.2.0
w3lib : 1.18.0
Twisted : 17.5.0
Python : 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) - [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
pyOpenSSL : 17.0.0 (OpenSSL 1.0.2l 25 May 2017)
Platform : Linux-3.10.0-327.el7.x86_64-x86_64-with-centos-7.0.1406-Core
Project commands
crawl
Runs a spider, referenced by its name, inside a project.
scrapy crawl myspider
check
Runs contract checks on the project's spiders; -l lists each spider and the callbacks that define contracts.
$ scrapy check -l
first_spider
* parse
* parse_item
second_spider
* parse
* parse_item
$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing
[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
list
Lists all available spiders in the project, one per line.
$ scrapy list
spider1
spider2
edit
Edits the given spider with the configured editor. Usually you will edit spiders in an IDE, but this is handy for quick fixes on a server with vim.
scrapy edit spider1
parse
Fetches the given URL and parses it with the spider that handles it, using the callback passed with --callback (the default parse method if none is given); useful for testing parsing code.
Options:
--spider=SPIDER: use the specified spider for the URL
-a NAME=VALUE: set a spider argument (non-default arguments are usually handled in the spider's __init__)
--callback or -c: the spider callback to use for parsing
--pipelines: process the scraped items through the project's pipelines
--rules or -r: for a CrawlSpider, use its rules to find the callback for new links
--noitems: do not show the scraped items
--nolinks: do not show the extracted links
--nocolour: do not use pygments to colourize the output
--depth or -d: depth to which requests are followed recursively (default 1)
--verbose or -v: display information for each depth level
$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items ------------------------------------------------------------
[{'name': u'Example item',
'category': u'Furniture',
'length': u'12 cm'}]
# Requests -----------------------------------------------------------------
[]