I. Basic operations
1. Create a crawler project named tutorial:
scrapy startproject tutorial
2. Using the Scrapy shell
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. Run:
scrapy shell 'http://quotes.toscrape.com/page/1/'
Then extract data with CSS or XPath selectors, for example:
response.xpath('//title')
response.css('title::text').getall()
response.css('title::text').get()
You can also run arbitrary Python code in this terminal.
Data can also be extracted programmatically inside a spider.
II. Command-line tool
Common commands:
scrapy <command> [options] [args]
scrapy <command> -h
scrapy -h # show help
scrapy genspider [-t template] <name> <domain>
startproject
scrapy startproject <project_name> [project_dir]
example:
scrapy startproject myproject
genspider
scrapy genspider [-t template] <name> <domain>
example:
scrapy genspider -l
scrapy genspider example example.com
scrapy genspider -t crawl scrapyorg scrapy.org
crawl
scrapy crawl <spider>
example:
scrapy crawl myspider
# run without log output
scrapy crawl spider_name --nolog
check
scrapy check [-l] <spider>
example:
scrapy check -l
scrapy check
list
scrapy list
edit
scrapy edit <spider>
example:
scrapy edit spider1
fetch
scrapy fetch <url>
Supported options:
• --spider=SPIDER: bypass spider autodetection and force use of specific spider
• --headers: print the response’s HTTP headers instead of the response’s body
• --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
example:
scrapy fetch --nolog http://www.example.com/some/page.html
scrapy fetch --nolog --headers http://www.example.com/
view
scrapy view <url>
Supported options:
• --spider=SPIDER: bypass spider autodetection and force use of specific spider
• --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
example:
scrapy view http://www.example.com/some/page.html
shell
scrapy shell [url]
Supported options:
• --spider=SPIDER
• -c code
• --no-redirect
example:
scrapy shell http://www.example.com/some/page.html
The shell follows HTTP redirects by default.
parse
scrapy parse <url> [options]
Supported options:
• --spider=SPIDER: bypass spider autodetection and force use of specific spider
• -a NAME=VALUE: set spider argument (may be repeated)
• --callback or -c: spider method to use as callback for parsing the response
• --meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo": "bar"}'
• --cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid JSON string. Example: --cbkwargs='{"foo": "bar"}'
• --pipelines: process items through pipelines
• --rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the
response
• --noitems: don’t show scraped items
• --nolinks: don’t show extracted links
• --nocolour: avoid using pygments to colorize the output
• --depth or -d: depth level for which the requests should be followed recursively (default: 1)
• --verbose or -v: display information for each depth level
• --output or -o: dump scraped items to a file
example:
scrapy parse http://www.example.com/ -c parse_item
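Since --meta and --cbkwargs must be valid JSON, it can help to sanity-check the string with Python's stdlib json module before putting it on the command line:

```python
import json  # stdlib

# The value you would pass as --meta='{"foo": "bar"}'
meta_arg = '{"foo": "bar"}'
meta = json.loads(meta_arg)  # raises json.JSONDecodeError if malformed
print(meta["foo"])           # -> bar
```

Note that JSON requires double quotes around keys and string values; a single-quoted string like `{'foo': 'bar'}` would be rejected.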
settings
scrapy settings [options]
example:
scrapy settings --get BOT_NAME
scrapy settings --get DOWNLOAD_DELAY
runspider
scrapy runspider <spider_file.py>
example:
scrapy runspider myspider.py
version
scrapy version [-v]
bench
scrapy bench