scrapy

I. Basic operations

1. Create a crawler project named tutorial:

scrapy startproject tutorial

2. Using the Scrapy shell
The best way to learn how to extract data with Scrapy is trying selectors using the Scrapy shell. Run:

scrapy shell 'http://quotes.toscrape.com/page/1/'

Then extract data with CSS or XPath selectors, for example:

response.xpath('//title')
response.css('title::text').getall()
response.css('title::text').get()

You can also write Python interactively in this shell.
The same extraction can be written programmatically inside a spider.

II. Command line tool

Common commands:

scrapy <command> [options] [args]
scrapy <command> -h
scrapy -h # show help
scrapy genspider [-t template] <name> <domain>

startproject

scrapy startproject <project_name> [project_dir]
example:
scrapy startproject myproject

genspider

scrapy genspider [-t template] <name> <domain>
example:
scrapy genspider -l
scrapy genspider example example.com
scrapy genspider -t crawl scrapyorg scrapy.org

crawl

scrapy crawl <spider>
example:
scrapy crawl myspider
# run without log output
scrapy crawl spider_name --nolog

check

scrapy check [-l] <spider>
example:
scrapy check -l
scrapy check

list

scrapy list

edit

scrapy edit <spider>
example:
scrapy edit spider1

fetch

scrapy fetch <url>
Supported options:
• --spider=SPIDER: bypass spider autodetection and force use of specific spider
• --headers: print the response’s HTTP headers instead of the response’s body
• --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
example:
scrapy fetch --nolog http://www.example.com/some/page.html
scrapy fetch --nolog --headers http://www.example.com/

view

scrapy view <url>
Supported options:
• --spider=SPIDER: bypass spider autodetection and force use of specific spider
• --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
example:
scrapy view http://www.example.com/some/page.html

shell

scrapy shell [url]
Supported options:
• --spider=SPIDER
• -c code
• --no-redirect
example:
scrapy shell http://www.example.com/some/page.html
shell follows HTTP redirects by default

parse

scrapy parse <url> [options]
Supported options:
• --spider=SPIDER: bypass spider autodetection and force use of specific spider
• -a NAME=VALUE: set spider argument (may be repeated)
• --callback or -c: spider method to use as callback for parsing the response
• --meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON
string. Example: --meta='{"foo": "bar"}'
• --cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid JSON
string. Example: --cbkwargs='{"foo": "bar"}'
• --pipelines: process items through pipelines
• --rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the
response
• --noitems: don’t show scraped items
• --nolinks: don’t show extracted links
• --nocolour: avoid using pygments to colorize the output
• --depth or -d: depth level for which the requests should be followed recursively (default: 1)
• --verbose or -v: display information for each depth level
• --output or -o: dump scraped items to a file
example:
scrapy parse http://www.example.com/ -c parse_item

settings

scrapy settings [options]
example:
scrapy settings --get BOT_NAME
scrapy settings --get DOWNLOAD_DELAY

runspider

scrapy runspider <spider_file.py>
example:
scrapy runspider myspider.py

version

scrapy version [-v]

bench

scrapy bench