Python Crawler Notes (10): Reading the Official Scrapy Documentation: Scrapy shell

Latest recommended article published 2024-06-22 16:33:22

菜到怀疑人生

Categories: crawler, python爬虫 (Python crawlers)

Article link: https://blog.csdn.net/dhaiuda/article/details/81529697

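The Scrapy shell is used to test XPath and CSS expressions and to see what data they extract. Scrapy can run the shell on top of IPython, bpython, or the standard Python shell. You choose which one by setting the SCRAPY_PYTHON_SHELL environment variable, or by defining it in scrapy.cfg: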

[settings]
shell = bpython
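
The environment-variable route can also be scripted. The following is only a minimal sketch using the standard library: the helper names `scrapy_shell_command` and `launch_scrapy_shell` are hypothetical (they are not part of Scrapy), and it assumes the `scrapy` command is on your PATH.

```python
import os
import subprocess


def scrapy_shell_command(url, python_shell="bpython"):
    """Build the command line and environment for `scrapy shell <url>`.

    Hypothetical helper: it only prepares the call, it does not run it.
    """
    env = dict(os.environ)                     # copy, so the parent env is untouched
    env["SCRAPY_PYTHON_SHELL"] = python_shell  # e.g. "ipython", "bpython" or "python"
    return ["scrapy", "shell", url], env


def launch_scrapy_shell(url, python_shell="bpython"):
    """Start the interactive shell; returns the process exit code."""
    cmd, env = scrapy_shell_command(url, python_shell)
    return subprocess.call(cmd, env=env)


cmd, env = scrapy_shell_command("https://example.com", "ipython")
print(cmd, env["SCRAPY_PYTHON_SHELL"])
```

Setting the variable per call like this leaves scrapy.cfg untouched, which may be convenient when different projects prefer different shells.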

Launching the Scrapy shell

The command to launch the Scrapy shell is:

scrapy shell <url>

Here url is the URL of the page you want to scrape. The shell also works with local files:

# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html

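When using a relative path, you must prefix it explicitly with ./ (or ../ where appropriate). Running scrapy shell index.html will not work as you might expect: the shell favors HTTP URLs over file paths, so index.html is treated as a domain name and a DNS lookup is attempted for it.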
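
One way to avoid the ambiguity altogether is to pass a file URI. A minimal sketch, using only the standard library, that turns a local path into such a URI (the helper name `to_file_uri` is hypothetical, not part of Scrapy):

```python
import os
from pathlib import Path


def to_file_uri(path):
    """Return an absolute file:// URI for a local path.

    The file does not need to exist; the path is only made absolute
    and converted.
    """
    return Path(os.path.abspath(path)).as_uri()


# The printed value can be passed to `scrapy shell` in place of the bare name.
print(to_file_uri("index.html"))
```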

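Available shell shortcuts

shelp(): prints help listing the available objects and shortcuts

fetch(url[, redirect=True]): sends a request to url, fetches the response, and updates all related objects (for example the response object). If you do not want redirects to be followed, pass redirect=False

fetch(request): fetches a response for the given request and updates all related objects

view(response): opens the response in your local web browser; the response is saved to a file for this purpose

Press Ctrl-D (Ctrl-Z on Windows) to exit the shell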

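Available Scrapy objects

crawler: the current Crawler object. API reference: https://doc.scrapy.org/en/latest/topics/api.html#scrapy.crawler.Crawler

spider: the current Spider object. The Spider class defines how a given site is crawled, including the crawling behavior and how structured data is extracted from its pages

request: the Request object of the last fetched page. You can modify it with the replace() method

response: the Response object for the last requested URL

settings: the current Scrapy settings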

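Invoking the shell from a spider to inspect a response

If we want to use the shell to inspect a response that our own spider is processing, we can call the inspect_response() function in the code: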

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.

This is like setting a breakpoint in the code. From that point on, it works the same way as the scrapy shell:

2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...

>>> response.url
'http://example.org'

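At this point the Scrapy engine is blocked by the shell, so the fetch shortcut cannot be used. Once you exit the shell, the spider continues crawling from where it stopped.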