Scrapy Basics: Command-Line Tool

♂愤怒的it男♂

Last modified: 2023-08-08 23:01:27

First published: 2022-11-20 12:05:56

Article link: https://blog.csdn.net/xuyuanfan/article/details/127947164

I. Command-line usage when building a project

1. Directory structure with multiple projects

scrapy.cfg
firstproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
secondproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

2. Multi-project configuration (scrapy.cfg)

[settings]
default = firstproject.settings
project2 = secondproject.settings

3. Switching between projects

>>> scrapy settings --get BOT_NAME
firstproject
# "set" is Windows cmd syntax; on Linux/macOS use: export SCRAPY_PROJECT=project2
>>> set SCRAPY_PROJECT=project2
>>> scrapy settings --get BOT_NAME
secondproject
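
The switch above works because the SCRAPY_PROJECT environment variable names one of the aliases in scrapy.cfg. Here is a minimal sketch of that lookup using only the standard library. It is a simplified illustration, not Scrapy's actual code, and resolve_settings_module is a hypothetical helper name.

```python
import configparser

# Contents of the shared scrapy.cfg shown above. The [settings] section
# maps project aliases to settings modules.
SCRAPY_CFG = """
[settings]
default = firstproject.settings
project2 = secondproject.settings
"""


def resolve_settings_module(cfg_text, environ):
    """Return the settings module selected by SCRAPY_PROJECT.

    Falls back to the "default" alias when the variable is unset.
    Hypothetical helper for illustration only, not a Scrapy API.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    alias = environ.get("SCRAPY_PROJECT", "default")
    return parser.get("settings", alias)


if __name__ == "__main__":
    # Without the variable, the "default" alias is used.
    print(resolve_settings_module(SCRAPY_CFG, {}))
    # With SCRAPY_PROJECT=project2, the second project's settings are used.
    print(resolve_settings_module(SCRAPY_CFG, {"SCRAPY_PROJECT": "project2"}))
```

Passing the environment in as a plain dict keeps the sketch easy to try without changing the real shell environment.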

4. Creating a project

scrapy startproject <project_name> [project_dir]

5. Entering the project directory

cd <project_dir>

6. Creating a spider

scrapy genspider [-t template] <name> <domain>

7. Getting help for scrapy and its commands

scrapy -h
scrapy <command> -h
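
Steps 4 to 6 can also be driven from a script. The sketch below assembles the same commands as argument lists and, optionally, runs them with subprocess. The helper names setup_commands and run_setup are hypothetical, and running them assumes the scrapy executable is on PATH.

```python
import subprocess


def setup_commands(project_name, spider_name, domain, template=None):
    """Return (argv, cwd) pairs for startproject followed by genspider."""
    genspider = ["scrapy", "genspider"]
    if template is not None:
        genspider += ["-t", template]
    genspider += [spider_name, domain]
    return [
        (["scrapy", "startproject", project_name], None),
        # cwd stands in for the "cd <project_dir>" step; when project_dir is
        # omitted, startproject creates a directory named after the project.
        (genspider, project_name),
    ]


def run_setup(project_name, spider_name, domain, template=None):
    """Run the commands in order; requires scrapy to be installed."""
    for argv, cwd in setup_commands(project_name, spider_name, domain, template):
        subprocess.run(argv, cwd=cwd, check=True)
```

For example, run_setup("myproject", "example", "example.com") would create the project and then generate the spider inside it.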

II. Global commands

1. startproject

Syntax: scrapy startproject <project_name> [project_dir]
Usage: creates a new Scrapy project
Requires project: no
Example: scrapy startproject myproject

2. genspider

Syntax: scrapy genspider [-t template] <name> <domain>
Usage: creates a new spider from a template
Requires project: no
Example:
>>> scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
>>> scrapy genspider example example.com
Created spider 'example' using template 'basic'
>>> scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

3. settings

Syntax: scrapy settings [options]
Usage: gets the value of a Scrapy setting
Requires project: no
Example:
>>> scrapy settings --get BOT_NAME
scrapybot
>>> scrapy settings --get DOWNLOAD_DELAY
0

4. runspider

Syntax: scrapy runspider <spider_file.py>
Usage: runs a spider contained in a single Python file, without having to create a project
Requires project: no
Example:
>>> scrapy runspider myspider.py
[ ... spider starts crawling ... ]

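5. shell

Syntax: scrapy shell [url]
Requires project: no
Usage: starts the Scrapy shell for the given URL

Options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-c code: evaluate the code in the shell, print the result, and exit
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them)
Example: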
>>> scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]
>>> scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')
# shell follows HTTP redirects by default
>>> scrapy shell --nolog http://httpbin.org/redirect-to?url=http://example.com/ -c '(response.status, response.url)'
(200, 'http://example.com/')
# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
>>> scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http://example.com/ -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http://example.com/')
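
The redirect URL above contains a ? character, which POSIX shells treat as a glob character, so quoting matters when the command is built by a script. Here is a small sketch that assembles the same call as an argument list and prints a safely quoted command line. The helper name shell_command is hypothetical.

```python
import shlex


def shell_command(url, code, follow_redirects=True):
    """Build the argv for a one-shot "scrapy shell -c" invocation."""
    argv = ["scrapy", "shell", "--nolog"]
    if not follow_redirects:
        argv.append("--no-redirect")
    argv += [url, "-c", code]
    return argv


argv = shell_command(
    "http://httpbin.org/redirect-to?url=http://example.com/",
    "(response.status, response.url)",
    follow_redirects=False,
)
# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

Passing argv directly to subprocess.run avoids the shell entirely, in which case no quoting is needed at all.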

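6. fetch

Syntax: scrapy fetch <url>
Requires project: no
Usage: fetches the given URL using the spider's settings and writes the response to standard output; when run outside a project, the Scrapy downloader's default settings are used
Options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--headers: print the response's HTTP headers instead of the response body
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them)
Example: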
>>> scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]
>>> scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263   '],
 'Connection': ['close     '],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}

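7. view

Syntax: scrapy view <url>
Requires project: no
Usage: fetches the given URL using the spider's settings, downloads the page, and opens it in a browser
Options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them)
Example: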
>>> scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]

8. version

Syntax: scrapy version [-v]
Requires project: no
Usage: prints the Scrapy version. With -v, it also prints Python, Twisted, and platform information

III. Project commands

1. crawl

Syntax: scrapy crawl <spider>
Usage: starts crawling with the given spider
Requires project: yes
Example:
>>> scrapy crawl myspider
[ ... myspider starts crawling ... ]

2. check

Syntax: scrapy check [-l] <spider>
Usage: runs contract checks
Requires project: yes
Example:
>>> scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item
>>> scrapy check
[FAILED] first_spider:parse_item
'RetailPricex' field is missing
[FAILED] first_spider:parse
Returned 92 requests, expected 0..4

3. list

Syntax: scrapy list
Usage: lists all available spiders in the current project
Requires project: yes
Example:
>>> scrapy list
spider1
spider2

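4. edit

Syntax: scrapy edit <spider>
Usage: opens the given spider for editing in the editor defined by the EDITOR environment variable or, if that is unset, the EDITOR setting
Requires project: yes
Example: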
>>> scrapy edit spider1

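5. parse

Syntax: scrapy parse [options] <url>
Usage: tests parsing by fetching the given URL and parsing it with the spider that handles it
Requires project: yes
Options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as the callback for parsing the response
--meta or -m: additional request meta passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo" : "bar"}'
--cbkwargs: additional keyword arguments passed to the callback. This must be a valid JSON string. Example: --cbkwargs='{"foo" : "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: do not show scraped items
--nolinks: do not show extracted links
--nocolour: avoid using Pygments to colorize the output
--depth or -d: depth level to which requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
--output or -o: dump scraped items to a file
Example: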
>>> scrapy parse --spider=toscrape-css -c parse -d 2 http://quotes.toscrape.com/
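
Because --meta and --cbkwargs must be valid JSON, it is easy to get the quoting wrong by hand. The sketch below builds the argument list with json.dumps so the values are always well-formed. It assumes the values are meant to be JSON objects, as in the examples above; parse_command and is_json_object are hypothetical helper names.

```python
import json


def is_json_object(text):
    """Return True if text is valid JSON that decodes to a dict."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def parse_command(url, spider, callback, meta=None, cbkwargs=None):
    """Build the argv for "scrapy parse" with JSON-encoded options."""
    argv = ["scrapy", "parse", f"--spider={spider}", "-c", callback]
    if meta is not None:
        argv.append("--meta=" + json.dumps(meta))
    if cbkwargs is not None:
        argv.append("--cbkwargs=" + json.dumps(cbkwargs))
    argv.append(url)
    return argv


if __name__ == "__main__":
    print(parse_command(
        "http://quotes.toscrape.com/", "toscrape-css", "parse",
        meta={"foo": "bar"},
    ))
    # A bare key/value pair without braces is not a JSON object.
    print(is_json_object('"foo": "bar"'))
```

The resulting list can be handed to subprocess.run inside a project directory, which sidesteps shell quoting of the braces and double quotes.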

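6. bench

Syntax: scrapy bench
Usage: runs a quick benchmark test
Requires project: no (bench is a global command and also runs outside a project)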

For more on web crawling and example source code, follow the WeChat official account: angry_it_man
