The Scrapy Command-Line Tool

Run scrapy -h to list all available commands, and scrapy <command> -h for help on a specific command.

Global commands

startproject

Creates a new Scrapy project.

scrapy startproject myproject
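
For reference, a freshly created project looks roughly like this (the exact layout may differ slightly between Scrapy versions):

myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py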

genspider

Creates a new spider from one of Scrapy's built-in templates.

$ scrapy genspider -l    # list all available templates

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider example example.com    # use the default basic template
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org    # use the crawl template
Created spider 'scrapyorg' using template 'crawl'
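
The generated file is only a starting point. With the basic template, spiders/example.py looks roughly like this (template contents can differ slightly between Scrapy versions):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass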

settings

Prints the value of a Scrapy setting. Inside a project it returns the project's value; otherwise it returns Scrapy's default.

# outside any project

$ scrapy settings --get BOT_NAME
scrapybot

$ scrapy settings --get DOWNLOAD_DELAY
0

# inside a project

$ scrapy settings --get BOT_NAME
jd_comment

$ scrapy settings --get DOWNLOAD_DELAY
3
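
The project values above simply reflect what the project declares in its settings.py; for example, a hypothetical excerpt from the jd_comment project's settings:

# settings.py (hypothetical excerpt)
BOT_NAME = 'jd_comment'
DOWNLOAD_DELAY = 3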

runspider

Runs a self-contained spider script directly, without creating a Scrapy project (inside a project, spiders normally live in the spiders/ directory). Because there is no project, only Scrapy's default settings are available, so a spider that depends on project-specific settings may not behave as expected.

scrapy runspider <spider_file.py>
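
A minimal sketch of a self-contained spider that runspider can execute, assuming the quotes.toscrape.com demo site; the file name, selectors and settings are illustrative:

# quotes_spider.py  (run with: scrapy runspider quotes_spider.py -o quotes.json)
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    # per-spider overrides still work outside a project
    custom_settings = {'DOWNLOAD_DELAY': 1}

    def parse(self, response):
        # extract quote text and author from each quote block
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }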

shell

Starts the Scrapy shell (fetching the page first if a URL is given). It is very handy for testing XPath and CSS extraction expressions.

Options:

  • --spider=SPIDER: use the specified spider to fetch the URL
  • -c code: evaluate the given code in the shell, print the result, and exit the shell
  • --no-redirect: do not follow 3xx redirects; this only affects the URL passed on the command line, requests made later from inside the shell still follow redirects by default

$ scrapy shell http://www.taobao.com

...
2017-09-28 11:27:10 [scrapy.core.engine] INFO: Spider opened
2017-09-28 11:27:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.taobao.com/> from <GET http://www.taobao.com>
2017-09-28 11:27:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.taobao.com/> (referer: None)
2017-09-28 11:27:11 [traitlets] DEBUG: Using default logger
2017-09-28 11:27:11 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]  scrapy    scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]  crawler    <scrapy.crawler.Crawler object at 0x7fb84e533748>
[s]  item      {}
[s]  request    <GET http://www.taobao.com>
[s]  response  <200 https://www.taobao.com/>
[s]  settings  <scrapy.settings.Settings object at 0x7fb84c9bb9e8>
[s]  spider    <DefaultSpider 'default' at 0x7fb84c59c898>
[s] Useful shortcuts:
[s]  fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]  fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]  shelp()          Shell help (print this help)
[s]  view(response)    View response in a browser
In [1]:

# shell follows HTTP redirects by default
$ scrapy shell --nolog http://www.taobao.com -c '(response.status, response.url)'
(200, 'https://www.taobao.com/')


# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --nolog http://www.taobao.com -c '(response.status, response.url)' --no-redirect
(302, 'http://www.taobao.com')
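
Once the shell is open, the objects and shortcuts listed above can be used interactively; a hypothetical session might look like this (the expressions are illustrative):

In [1]: response.xpath('//title/text()').extract_first()   # test an XPath expression on the fetched page
In [2]: response.css('a::attr(href)').extract()[:5]        # or a CSS selector
In [3]: fetch('https://www.taobao.com/robots.txt')          # fetch another URL without leaving the shell
In [4]: view(response)                                      # open the current response in a browser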

fetch

Downloads the page at the given URL and writes it to standard output. Inside a project it uses the project's Scrapy settings; otherwise it uses the defaults.

Options:

  • --spider=SPIDER: use the specified spider to fetch the URL
  • --headers: print the response's HTTP headers instead of its body
  • --no-redirect: do not follow redirects

$ scrapy fetch http://www.baidu.com --nolog

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

$ scrapy fetch http://www.baidu.com --nolog --headers

> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Server: bfe/1.0.8.18
< Date: Thu, 28 Sep 2017 03:38:23 GMT
< Content-Type: text/html
< Last-Modified: Mon, 23 Jan 2017 13:27:29 GMT
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Pragma: no-cache
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/

view

Fetches the given URL and opens the downloaded page in a browser, so you can check whether what the spider "sees" is what you expect.

Options:

  • --spider=SPIDER: use the specified spider to fetch the URL
  • --no-redirect: do not follow redirects

$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]

bench

Runs a benchmark: Scrapy starts a local HTTP server and crawls it at the maximum possible speed, giving a rough measure of how fast crawling can go on your hardware.

$ scrapy bench

2017-09-28 10:42:43 [scrapy.core.engine] INFO: Spider opened
2017-09-28 10:42:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:45 [scrapy.extensions.logstats] INFO: Crawled 53 pages (at 3180 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:46 [scrapy.extensions.logstats] INFO: Crawled 101 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:47 [scrapy.extensions.logstats] INFO: Crawled 149 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:48 [scrapy.extensions.logstats] INFO: Crawled 197 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:49 [scrapy.extensions.logstats] INFO: Crawled 237 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:50 [scrapy.extensions.logstats] INFO: Crawled 277 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:51 [scrapy.extensions.logstats] INFO: Crawled 309 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:52 [scrapy.extensions.logstats] INFO: Crawled 349 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:53 [scrapy.extensions.logstats] INFO: Crawled 381 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:54 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2017-09-28 10:42:54 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2017-09-28 10:42:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 174487,
'downloader/request_count': 429,
'downloader/request_method_count/GET': 429,
'downloader/response_bytes': 1155395,
'downloader/response_count': 429,
'downloader/response_status_count/200': 429,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2017, 9, 28, 2, 42, 55, 61436),
'log_count/INFO': 17,
'memusage/max': 44789760,
'memusage/startup': 44789760,
'request_depth_max': 15,
'response_received_count': 429,
'scheduler/dequeued': 429,
'scheduler/dequeued/memory': 429,
'scheduler/enqueued': 8580,
'scheduler/enqueued/memory': 8580,
'start_time': datetime.datetime(2017, 9, 28, 2, 42, 44, 119249)}
2017-09-28 10:42:55 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

version

As the name suggests, prints the Scrapy version.

$ scrapy version

Scrapy 1.4.0

Adding the -v flag also prints the versions of Python, Twisted, lxml and other dependencies, along with platform information.

$ scrapy version -v

Scrapy    : 1.4.0
lxml      : 3.7.3.0
libxml2  : 2.9.4A
cssselect : 1.0.1
parsel    : 1.2.0
w3lib    : 1.18.0
Twisted  : 17.5.0
Python    : 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) - [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
pyOpenSSL : 17.0.0 (OpenSSL 1.0.2l  25 May 2017)
Platform  : Linux-3.10.0-327.el7.x86_64-x86_64-with-centos-7.0.1406-Core

Project commands

crawl

Runs a spider (must be executed inside a project).

scrapy crawl myspider
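
In practice crawl is usually combined with spider arguments and feed export; for example (the spider name, argument and output file are placeholders):

$ scrapy crawl myspider -a category=books -o items.json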

check

Runs contract checks on the project's spiders (the contracts are declared in callback docstrings); -l lists the spiders together with the callbacks that define contracts.

$ scrapy check -l

first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
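
The checks come from contracts declared in the callbacks' docstrings. A hypothetical callback annotated for scrapy check might look like this (URL, counts and field names are illustrative):

def parse_item(self, response):
    """This docstring declares the contracts that scrapy check verifies.

    @url http://www.example.com/some/page.html
    @returns items 1 16
    @returns requests 0 0
    @scrapes name price description
    """
    ...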

list

Lists all available spiders in the project, one per line.

$ scrapy list

spider1
spider2

edit

Opens a spider file for editing. You will usually edit spiders in an IDE, but sometimes you need to make a quick change on a server with vim or a similar editor.

scrapy edit spider1

parse

Fetches the given URL and parses it with the spider's parse method, or with the callback specified via --callback, which makes it useful for testing your parsing code.

Options:

  • --spider=SPIDER: force the given spider to be used for the URL
  • -a NAME=VALUE: set a spider argument (non-default arguments are usually handled in the spider's __init__)
  • --callback, -c: the spider callback to use for parsing the response
  • --pipelines: process the scraped items through the item pipelines
  • --rules, -r: used with CrawlSpider, applies its rules to discover the callback and extract new links
  • --noitems: do not show the scraped items
  • --nolinks: do not show the extracted links
  • --nocolour: avoid using pygments to colourize the output
  • --depth, -d: depth level to which requests should be followed recursively (default: 1)
  • --verbose, -v: display information for each depth level

$ scrapy parse http://www.example.com/ -c parse_item

[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': u'Example item',
'category': u'Furniture',
'length': u'12 cm'}]

# Requests  -----------------------------------------------------------------
[]