I. Basic operations
1. Create a crawler project named tutorial:
scrapy startproject tutorial
2. Using the Scrapy shell
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. Run:
scrapy shell 'http://quotes.toscrape.com/page/1/'
Then extract data with CSS or XPath selectors, for example:
response.xpath('//title')
response.css('title::text').getall()
response.css('title::text').get()
You can also run arbitrary Python code in this terminal.
Data can also be extracted programmatically inside a spider.
II. Command-line tool
Common commands:
scrapy <command> [options] [args]
scrapy <command> -h
scrapy -h # show help
scrapy genspider [-t template] <name> <domain>
startproject
scrapy startproject <project_name> [project_dir]
example:
scrapy startproject myproject
genspider
scrapy genspider [-t template] <name> <domain>
example:
scrapy genspider -l
scrapy genspider example example.com
scrapy genspider -t crawl scrapyorg scrapy.org
crawl
scrapy crawl <spider>
example:
scrapy crawl myspider
# run without log output
scrapy crawl spider_name --nolog
check
scrapy check [-l] <spider>
example:
scrapy check -l
scrapy check
list
scrapy list
edit
scrapy edit <spider>
example:
scrapy edit spider1
fetch
scrapy fetch <url>
Supported options:
• --spider=SPIDER: bypass spider autodetection and force use of specific spider
• --headers: print the response’s HTTP headers instead of the response’s body
• --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
example:
scrapy fetch --nolog http://www.example.com/some/page.html
scrapy fetch --nolog --headers http://www.example.com/
view
scrapy view <url>
Supported options:
• --spider=SPIDER: bypass spider autodetection and force use of specific spider
• --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
example:
scrapy view http://www.example.com/some/page.html
shell
scrapy shell [url]
Supported options:
• --spider=SPIDER
• -c code
• --no-redirect
example:
scrapy shell http://www.example.com/some/page.html
The shell follows HTTP redirects by default.
parse
scrapy parse <url> [options]
Supported options:
• --spider=SPIDER: bypass spider autodetection and force use of specific spider
• -a NAME=VALUE: set spider argument (may be repeated)
• --callback or -c: spider method to use as callback for parsing the response
• --meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo": "bar"}'
• --cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid JSON string. Example: --cbkwargs='{"foo": "bar"}'
• --pipelines: process items through pipelines
• --rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the
response
• --noitems: don’t show scraped items
• --nolinks: don’t show extracted links
• --nocolour: avoid using pygments to colorize the output
• --depth or -d: depth level for which the requests should be followed recursively (default: 1)
• --verbose or -v: display information for each depth level
• --output or -o: dump scraped items to a file
example:
scrapy parse http://www.example.com/ -c parse_item
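Since --meta and --cbkwargs must be valid JSON, it can help to sanity-check the string with Python's stdlib json module before putting it on the command line:

```python
import json  # stdlib

# The value you would pass as --meta='{"foo": "bar"}'
meta_arg = '{"foo": "bar"}'
meta = json.loads(meta_arg)  # raises json.JSONDecodeError if malformed
print(meta["foo"])           # -> bar
```

Note that JSON requires double quotes around keys and string values; a single-quoted string like `{'foo': 'bar'}` would be rejected.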
settings
scrapy settings [options]
example:
scrapy settings --get BOT_NAME
scrapy settings --get DOWNLOAD_DELAY
runspider
scrapy runspider <spider_file.py>
example:
scrapy runspider myspider.py
version
scrapy version [-v]
bench
scrapy bench