Using the Scrapy shell command-line tool

pyfreyr

Article link: https://blog.csdn.net/chenfeidi1/article/details/80890406

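The shell is a powerful interactive tool built into Scrapy. It is a convenient place to debug crawling and to verify parsing logic.

Configuring the console used by the shell

If IPython is installed on the system, the Scrapy shell uses it in place of the default Python console. The console can also be set explicitly in the project's scrapy.cfg file, for example: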

[settings]
shell = bpython
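
To illustrate the lookup described above, the sketch below reads the shell option from the text of a scrapy.cfg with the standard-library configparser. resolve_shell is a hypothetical helper, not part of Scrapy. The SCRAPY_PYTHON_SHELL environment variable is a documented alternative to the config entry, but giving it precedence over scrapy.cfg here is an assumption of this sketch.

```python
import configparser


def resolve_shell(cfg_text, environ):
    """Return the explicitly configured console name, or None if unset.

    Hypothetical helper for illustration only; this is not Scrapy's code.
    """
    # Assumption: the environment variable wins over scrapy.cfg.
    env_value = environ.get("SCRAPY_PYTHON_SHELL")
    if env_value:
        return env_value.strip().lower()

    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if parser.has_option("settings", "shell"):
        return parser.get("settings", "shell").strip().lower()

    # Nothing configured: the choice is left to Scrapy's own default.
    return None


cfg = "[settings]\nshell = bpython\n"
print(resolve_shell(cfg, {}))
print(resolve_shell(cfg, {"SCRAPY_PYTHON_SHELL": "ipython"}))
```

Passing the environment as a plain dict keeps the helper easy to try out; in real use os.environ could be passed instead.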

Launching the shell

Use the command:

scrapy shell <url>

to enter the shell. The url argument also accepts local files:

# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html
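
According to the Scrapy documentation, a bare relative name such as index.html is treated as a domain name rather than a file, so the ./ prefix matters. When in doubt, a path can be turned into an explicit file URI first. The sketch below does this with pathlib; to_file_uri is a hypothetical helper name, and the path is the placeholder from the listing above.

```python
from pathlib import Path


def to_file_uri(path):
    """Convert a relative or absolute local path to a file:// URI."""
    # resolve() makes the path absolute; as_uri() requires an absolute path.
    return Path(path).resolve().as_uri()


print(to_file_uri("/absolute/path/to/file.html"))
print(to_file_uri("./path/to/file.html"))
```

The resulting string can then be passed to scrapy shell in place of the raw path.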

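Using the shell

Available methods

shelp(): print the available objects and methods
fetch(url[, redirect=True]): fetch a new URL and update the related objects
fetch(request): fetch the given request and update the related objects
view(response): open the fetched page in the local browser

Available objects

crawler: the Crawler object
spider: the spider used for the crawl
request: the request
response: the response
settings: the settings

Example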

$ scrapy shell 'http://scrapy.org' --nolog

[s] Available Scrapy objects:
[s]  scrapy    scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]  crawler    <scrapy.crawler.Crawler object at 0x7f4e2fb915f8>
[s]  item      {}
[s]  request    <GET http://scrapy.org>
[s]  response  <200 https://scrapy.org/>
[s]  settings  <scrapy.settings.Settings object at 0x7f4e2e0179e8>
[s]  spider    <DefaultSpider 'default' at 0x7f4e2dbf98d0>
[s] Useful shortcuts:
[s]  fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]  fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]  shelp()          Shell help (print this help)
[s]  view(response)    View response in a browser

In [1]: response.xpath('//title/text()').extract_first()
Out[1]: 'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

In [2]: fetch("http://reddit.com")

In [3]: response.xpath('//title/text()').extract()
Out[3]: ['reddit: the front page of the internet']

In [4]: request = request.replace(method="POST")

In [5]: fetch(request)

In [6]: response.status
Out[6]: 404

Invoking the shell from a spider

Use the scrapy.shell.inspect_response function:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.

Start the spider. When execution reaches inspect_response, the shell opens; when you are done, press Ctrl-D to exit the shell and the spider resumes crawling.
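
The substring test ".org" in response.url in the spider above also matches URLs that merely contain ".org" somewhere in the path or query string. A slightly stricter condition, sketched here with the standard-library urllib.parse, checks only the hostname. should_inspect is a hypothetical helper name, not a Scrapy function.

```python
from urllib.parse import urlparse


def should_inspect(url, suffix=".org"):
    """Return True when the URL's hostname ends with the given suffix."""
    # hostname is None for URLs without a network location.
    hostname = urlparse(url).hostname or ""
    return hostname.endswith(suffix)


for url in ("http://example.org", "http://example.com/page.org.html"):
    print(url, should_inspect(url))
```

In parse, the condition could then read if should_inspect(response.url): before the call to inspect_response.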
