Python: A Tutorial on Using the Scrapy Shell

曾是土木人

Article link: https://blog.csdn.net/php_fly/article/details/19555969

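The Scrapy shell is an interactive shell. Once you are used to it, you will find it a very handy tool for testing scraping code while developing spiders.
Installing IPython before using the Scrapy shell is recommended but not strictly required: when IPython is available the Scrapy shell uses it, otherwise it falls back to the standard Python console. (Builds of IPython for your Python version can be found at http://www.lfd.uci.edu/~gohlke/pythonlibs/.)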
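
A quick way to see whether IPython is importable in the current environment is sketched below. The helper name has_ipython is made up for this illustration; it only uses the standard library and does not touch Scrapy itself.

```python
import importlib.util


def has_ipython():
    """Return True if the IPython package can be imported in this environment."""
    # find_spec returns None when no top-level package of that name is installed.
    return importlib.util.find_spec("IPython") is not None


print("IPython available:", has_ipython())
```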

Launching the shell
Start the shell with the following command:

scrapy shell <url>

where <url> is the URL of the page you want to scrape.
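
URLs that contain characters such as ? or & usually need quoting on a POSIX command line, otherwise the shell interprets them before Scrapy sees them. The sketch below builds a copy-pasteable command string with the standard library; build_shell_command is a hypothetical helper written for this example, and the --nolog flag is the one used in the example later in this article.

```python
import shlex


def build_shell_command(url, nolog=False):
    """Return a `scrapy shell` command line with the URL quoted for a POSIX shell."""
    args = ["scrapy", "shell", url]
    if nolog:
        args.append("--nolog")
    # shlex.quote leaves safe arguments alone and single-quotes the rest.
    return " ".join(shlex.quote(arg) for arg in args)


print(build_shell_command("http://example.com/list?page=2&sort=new", nolog=True))
# scrapy shell 'http://example.com/list?page=2&sort=new' --nolog
```

On Windows cmd the quoting rules differ (double quotes are the usual choice), so treat the output as POSIX-only.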

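Using the shell

The Scrapy shell can be thought of as a regular Python console with a few useful helper functions built in.

Helper functions

shelp() - prints a list of the available objects and functions
fetch(request_or_url) - fetches a new response from the given URL or existing request object and updates the related objects accordingly
view(response) - opens the given response (in other words, the HTML page) in your web browser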
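
Conceptually, viewing a response in a browser comes down to saving the page body to a local file and opening that file. The sketch below shows that idea with the standard library only; it is not Scrapy's implementation, and both function names are made up for this illustration.

```python
import tempfile
import webbrowser
from pathlib import Path


def save_html_for_browser(body, suffix=".html"):
    """Write the page body (bytes) to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        handle.write(body)
        return handle.name


def open_in_default_browser(path):
    """Open a saved page in the system's default browser."""
    return webbrowser.open(Path(path).as_uri())


path = save_html_for_browser(b"<html><body><h1>Hello</h1></body></html>")
print("Saved to", path)
# open_in_default_browser(path)  # uncomment to actually launch a browser
```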

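Scrapy objects

When the Scrapy shell downloads the given page, it creates several ready-to-use objects, such as the Response object and the Selector object (which works for both HTML and XML).
These objects are:

crawler - the current Crawler object
spider - the Spider known to handle the URL, or a default Spider object if no spider is found for the current URL
request - the Request object of the last fetched page
response - the Response object containing the last fetched page
sel - a Selector object built from the most recently downloaded page (this shortcut belongs to the Scrapy version used in this article; later releases offer the same functionality through response.selector, response.xpath() and response.css())
settings - the current Scrapy settings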

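Scrapy shell example

I will use my own blog as the test page: http://blog.csdn.net/php_fly
First, start the shell:

scrapy shell http://blog.csdn.net/php_fly --nolog

After this command runs, the Scrapy downloader fetches the page at the given URL and the shell prints the list of available objects and functions: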

    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x0000000002AEF7B8>
    [s]   item       {}
    [s]   request    <GET http://blog.csdn.net/php_fly>
    [s]   response   <200 http://blog.csdn.net/php_fly>
    [s]   sel        <Selector xpath=None data=u'<html xmlns="http://www.w3.org/1999/xhtm'>
    [s]   settings   <CrawlerSettings module=None>
    [s]   spider     <Spider 'default' at 0x4cdb940>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser

Get the links to the articles listed on the 曾是土木人 blog:

    In [9]: sel.xpath("//span[@class='link_title']/a/@href").extract()
    Out[9]:
    [u'/php_fly/article/details/19364913',
     u'/php_fly/article/details/18155421',
     u'/php_fly/article/details/17629021',
     u'/php_fly/article/details/17619689',
     u'/php_fly/article/details/17386163',
     u'/php_fly/article/details/17266889',
     u'/php_fly/article/details/17172381',
     u'/php_fly/article/details/17171985',
     u'/php_fly/article/details/17145295',
     u'/php_fly/article/details/17122961',
     u'/php_fly/article/details/17117891',
     u'/php_fly/article/details/14533681',
     u'/php_fly/article/details/13162011',
     u'/php_fly/article/details/12658277',
     u'/php_fly/article/details/12528391',
     u'/php_fly/article/details/12421473',
     u'/php_fly/article/details/12319943',
     u'/php_fly/article/details/12293587',
     u'/php_fly/article/details/12293381',
     u'/php_fly/article/details/12289803']
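
The extracted links are relative to the site root, so they normally have to be joined with the page URL before they can be fetched. The sketch below reproduces the same kind of query offline with the standard library and then builds absolute URLs. The sample markup is made up to mimic the structure targeted above (only the two href values come from the output), and extract_links is a hypothetical helper. ElementTree supports only a subset of XPath and needs well-formed XML, so this is an illustration of the idea rather than a replacement for Scrapy selectors.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Made-up, well-formed sample markup with the same span/a structure.
SAMPLE = """
<div>
  <span class="link_title"><a href="/php_fly/article/details/19364913">First</a></span>
  <span class="link_title"><a href="/php_fly/article/details/18155421">Second</a></span>
  <span class="other"><a href="/ignored">Ignored</a></span>
</div>
"""

BASE_URL = "http://blog.csdn.net/php_fly"


def extract_links(markup, base_url):
    """Return absolute URLs for every link inside a span with class link_title."""
    root = ET.fromstring(markup.strip())
    anchors = root.findall(".//span[@class='link_title']/a")
    # urljoin resolves each root-relative href against the page URL.
    return [urljoin(base_url, anchor.get("href")) for anchor in anchors]


for link in extract_links(SAMPLE, BASE_URL):
    print(link)
```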

Change the request method used by the Scrapy shell:

    >>> request = request.replace(method="POST")
    >>> fetch(request)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
    ...
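
The session reassigns request because replace() returns a new request with the given attributes changed instead of modifying the existing one. The toy model below illustrates that copy-on-replace behaviour with a frozen dataclass; ToyRequest is a stand-in invented for this example and is not Scrapy's Request class.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyRequest:
    """A stand-in for a request object; NOT Scrapy's Request class."""
    url: str
    method: str = "GET"


original = ToyRequest(url="http://blog.csdn.net/php_fly")
# replace() builds a new object; the original keeps its method.
updated = replace(original, method="POST")

print(original.method)              # GET
print(updated.method)               # POST
print(updated.url == original.url)  # True
```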

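Invoking the Scrapy shell from a spider

While a spider is running, you sometimes need to check whether a particular response is what you expect.
This can be done with the scrapy.shell.inspect_response function.
Here is an example of how to call the Scrapy shell from a spider: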

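# Older releases used: from scrapy.spider import Spider
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            # Recent Scrapy releases expect the spider as the second argument;
            # the release this article was written against took only the response.
            inspect_response(response, self)

        # Rest of parsing code.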
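
Note that the check ".org" in response.url matches the substring anywhere in the URL, including the path. If only the host name should trigger the inspection, a stricter test can be sketched as follows; should_inspect is a hypothetical helper, not part of Scrapy.

```python
from urllib.parse import urlparse


def should_inspect(url, domain_suffix=".org"):
    """Return True only when the URL's host name ends with the given suffix."""
    host = urlparse(url).hostname or ""
    return host.endswith(domain_suffix)


print(should_inspect("http://example.org"))                 # True
print(should_inspect("http://example.com/about.org.html"))  # False
```

Inside parse, the condition would then read if should_inspect(response.url): before the call to inspect_response.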

When you run the spider, the console prints something like the following:

    2014-02-20 17:48:31-0400 [myspider] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
    2014-02-20 17:48:31-0400 [myspider] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
    ...
    >>> response.url
    'http://example.org'

Note: while the Scrapy engine is blocked by the Scrapy shell, the shell's fetch function cannot be used. Once you exit the shell, however, the spider resumes crawling from where it stopped.

Author: 曾是土木人 (http://blog.csdn.net/php_fly)

Original article: http://blog.csdn.net/php_fly/article/details/19555969

Reference: Scrapy shell