Basic Scrapy Usage

gs_every

Category: Web crawling

Article link: https://blog.csdn.net/s1h2e3n4g5/article/details/77412664

Global commands:

startproject

genspider

settings

runspider

shell

fetch

view

version

Project-only commands:

crawl

check

list

edit

parse

bench

startproject

scrapy startproject <project_name> [project_dir]
# Example:
scrapy startproject myproject

Requires project: no
Creates a new Scrapy project named project_name under the project_dir directory. If project_dir is not specified, it defaults to project_name.
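
The defaulting rule for project_dir can be sketched in a few lines of Python. This only illustrates the behaviour described above and is not Scrapy's own code; the function name `resolve_project_dir` and the identifier check on the project name are assumptions made for the example.

```python
import re
from pathlib import Path
from typing import Optional


def resolve_project_dir(project_name: str, project_dir: Optional[str] = None) -> Path:
    """Return the directory a new project would be created in.

    Hypothetical helper: it mirrors the rule "project_dir defaults to
    project_name" and never touches the file system.
    """
    # Assumption: the project name must be a valid Python identifier,
    # because the project becomes an importable package.
    if not re.fullmatch(r"[A-Za-z_]\w*", project_name):
        raise ValueError(f"invalid project name: {project_name!r}")
    return Path(project_dir) if project_dir else Path(project_name)


print(resolve_project_dir("myproject"))          # myproject
print(resolve_project_dir("myproject", "work"))  # work
```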

genspider

scrapy genspider [-t template] <name> <domain>
# Example:
$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
# -t selects the template; without it, 'basic' is used
$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

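Requires project: no
Creates a new spider in the current folder, or in the current project's spiders folder if called from inside a project. The <name> parameter is set as the spider's name, while <domain> is used to generate the spider's allowed_domains and start_urls attributes, i.e. the pages it starts from.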
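
To make the role of <name> and <domain> concrete, here is a minimal sketch that fills a spider skeleton with `string.Template`. The template text is a simplified stand-in for the 'basic' template, and the class-naming rule and the `http://` scheme are assumptions; the file Scrapy actually generates may differ between versions.

```python
from string import Template

# Simplified stand-in for the 'basic' template; the real template
# shipped with Scrapy may differ between versions.
BASIC_TEMPLATE = Template('''\
import scrapy


class ${classname}(scrapy.Spider):
    name = "${name}"
    allowed_domains = ["${domain}"]
    start_urls = ["http://${domain}/"]

    def parse(self, response):
        pass
''')


def render_spider(name: str, domain: str) -> str:
    """Fill the skeleton from the <name> and <domain> arguments."""
    # Assumed naming rule: 'my_shop' -> 'MyShopSpider'.
    classname = "".join(part.capitalize() for part in name.split("_")) + "Spider"
    return BASIC_TEMPLATE.substitute(classname=classname, name=name, domain=domain)


print(render_spider("example", "example.com"))
```

The printed source shows where each argument ends up: <name> becomes the `name` attribute, and <domain> appears in both `allowed_domains` and `start_urls`.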

crawl

scrapy crawl <spider>
# Example:
scrapy crawl myspider

Requires project: yes
Starts crawling with the given spider.

check

scrapy check [-l] <spider>
# Example:
scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4

Requires project: yes
Runs contract checks on the spider's callbacks.
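
Contracts are written as `@name args` lines in a callback's docstring. The sketch below uses a plain function (no Scrapy import) and a hypothetical `extract_contracts` helper to show how such lines can be picked out. The callback, URL and field name are made up to match the sample output above, and Scrapy's real contract machinery does considerably more than this.

```python
import re
from typing import List, Tuple


def parse(response):
    """Hypothetical callback documented with contracts.

    @url http://www.example.com/some/page.html
    @returns requests 0 4
    @scrapes RetailPricex
    """


def extract_contracts(func) -> List[Tuple[str, List[str]]]:
    """Collect (name, args) pairs from '@name arg ...' docstring lines."""
    contracts = []
    for line in (func.__doc__ or "").splitlines():
        match = re.match(r"@(\w+)\s*(.*)", line.strip())
        if match:
            name, args = match.groups()
            contracts.append((name, args.split()))
    return contracts


for name, args in extract_contracts(parse):
    print(name, args)
```

Read against the sample output, `@returns requests 0 4` is the kind of contract that fails with "Returned 92 requests, expected 0..4", and `@scrapes RetailPricex` is the kind that fails when that field is missing from a scraped item.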

list

scrapy list
# Example:
scrapy list
spider1
spider2

Requires project: yes
Lists all available spiders in the current project, one spider per line.

fetch

scrapy fetch <url>
# Example:
$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263   '],
 'Connection': ['close     '],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}

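Requires project: no
Downloads the given URL using the Scrapy downloader and writes the contents to standard output.

The interesting thing about this command is that it fetches the page the way the spider would download it. For example, if the spider has a USER_AGENT attribute that overrides the user agent, fetch uses it. The command can therefore be used to "see" how your spider would fetch a certain page.

If used outside a project, no spider-specific behaviour is applied and only the default Scrapy downloader settings are used.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--headers: print the response's HTTP headers instead of the response body
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them)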
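
When fetch is driven from a script, the options above can be assembled into an argument list. The following is a minimal sketch; `build_fetch_command` is a hypothetical helper, not part of Scrapy, and it only builds the command without running it.

```python
import shlex
from typing import List, Optional


def build_fetch_command(url: str, spider: Optional[str] = None,
                        headers: bool = False, no_redirect: bool = False,
                        nolog: bool = True) -> List[str]:
    """Assemble an argv list for `scrapy fetch` (hypothetical helper)."""
    cmd = ["scrapy", "fetch"]
    if nolog:
        cmd.append("--nolog")
    if spider:
        cmd.append(f"--spider={spider}")
    if headers:
        cmd.append("--headers")
    if no_redirect:
        cmd.append("--no-redirect")
    cmd.append(url)
    return cmd


cmd = build_fetch_command("http://www.example.com/", headers=True)
print(shlex.join(cmd))  # scrapy fetch --nolog --headers http://www.example.com/
# To actually run it (requires Scrapy to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a single string avoids shell quoting problems with URLs that contain `?` or `&`.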

view

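scrapy view <url>

Requires project: no
Opens the given URL in a browser, as your Scrapy spider would "see" it. Spiders sometimes see pages differently from regular users, so this can be used to check what the spider "sees" and confirm it is what you expect.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them)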

$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]

shell

scrapy shell [url]
# Example:
$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]

$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')

# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')

# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')

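Requires project: no
Starts the Scrapy shell for the given URL, or an empty shell if no URL is given. UNIX-style local file paths are also supported, either relative with a ./ or ../ prefix or absolute. See the Scrapy shell documentation for more information.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them); this only affects the URL passed as a command-line argument; once inside the shell, fetch(url) still follows HTTP redirects by default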

parse

scrapy parse <url> [options]

Requires project: yes
Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if none is given.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-a NAME=VALUE: set spider argument (may be repeated)
--callback or -c: spider method to use as callback for parsing the response
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level

settings

scrapy settings [options]
# Example:
$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0

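Requires project: no
Gets the value of a Scrapy setting.

If used inside a project, it shows the project's setting value; otherwise it shows Scrapy's default value for that setting.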
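
The lookup order described above (project value first, Scrapy default otherwise) can be sketched with a `ChainMap`. The defaults are the values shown in the sample output; `PROJECT_SETTINGS` and `get_setting` are hypothetical names for illustration and do not reflect how Scrapy stores settings internally.

```python
from collections import ChainMap

# Default values taken from the sample output above.
DEFAULTS = {"BOT_NAME": "scrapybot", "DOWNLOAD_DELAY": 0}

# Hypothetical overrides, as a project's settings.py might define them.
PROJECT_SETTINGS = {"BOT_NAME": "myproject"}


def get_setting(name, inside_project):
    """Project values win inside a project; defaults apply otherwise."""
    layers = [PROJECT_SETTINGS, DEFAULTS] if inside_project else [DEFAULTS]
    return ChainMap(*layers)[name]


print(get_setting("BOT_NAME", inside_project=True))        # myproject
print(get_setting("BOT_NAME", inside_project=False))       # scrapybot
print(get_setting("DOWNLOAD_DELAY", inside_project=True))  # 0
```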

runspider

scrapy runspider <spider_file.py>
# Example:
$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]

Requires project: no
Runs a spider that is self-contained in a single Python file, without having to create a project.

version

scrapy version [-v]

Requires project: no
Prints the Scrapy version. With -v it also prints Python, Twisted and platform information, which is useful for bug reports.

bench

New in version 0.17.

scrapy bench

Requires project: no
Runs a quick benchmark test.
