Scrapy SpiderMiddleware and DownloaderMiddleware

Spider Middleware

Spider middleware is a framework of hooks into Scrapy's spider processing mechanism, where you can add custom code to process the responses sent to Spiders and the items and requests generated by spiders.

Official documentation

Execution order of Scrapy components

With the execution order of the components in mind, let's first look at the explanations in the official documentation:

process_spider_input(response, spider)

    Parameters:
        response (Response object) – the response being processed
        spider (Spider object) – the spider this response is intended for


This method is called for each response that goes through the spider middleware and into the spider, for processing.
process_spider_input() should return None or raise an exception.
If it returns None, Scrapy will continue processing this response, executing all other middlewares until the response is finally handed to the spider for processing.
If it raises an exception, Scrapy won't call the process_spider_input() of any other spider middleware and will call the request's errback instead. The output of the errback is chained back in the other direction through the middleware chain, processed with the process_spider_output() methods, or with process_spider_exception() if it raised an exception.
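
As a concrete illustration, here is a minimal sketch of a spider middleware that rejects error responses by raising in process_spider_input(); the class name and the status threshold are assumptions for illustration, not part of the original project:

from scrapy.exceptions import IgnoreRequest


class HttpStatusCheckMiddleware(object):
    # Hypothetical middleware: reject error responses early.

    def process_spider_input(self, response, spider):
        # Raising here makes Scrapy skip the remaining
        # process_spider_input() calls and route processing to the
        # request's errback / process_spider_exception() chain.
        if response.status >= 400:
            raise IgnoreRequest('bad status: %d' % response.status)
        # Returning None lets Scrapy continue down the middleware chain.
        return None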

process_spider_output(response, result, spider)


This method is called with the results returned from the Spider, after it has processed the response.
process_spider_output() must return an iterable of Request or Item objects.

    Parameters:
        response (Response object) – the response whose output generated these results
        result (an iterable of Request or Item objects) – the result returned by the spider
        spider (Spider object) – the spider whose result is being processed
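
For example, here is a sketch of a process_spider_output() that passes items through but drops requests pointing away from the target site; the class name and the domain check are illustrative assumptions:

class DomainFilterMiddleware(object):
    # Hypothetical middleware: filter the spider's output iterable.

    def process_spider_output(self, response, result, spider):
        # result is the iterable returned by the spider; this method
        # must in turn produce an iterable of Request/Item objects.
        for element in result:
            # Among the yielded objects, only Requests carry a .url.
            if hasattr(element, 'url') and 'dytt8.net' not in element.url:
                spider.logger.debug('dropping off-site request: %s' % element.url)
                continue
            yield element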

Let's use a demo to see what the data actually looks like at each step.

First we use Scrapy to crawl a site and extract two fields:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class MovieHeavenItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    image_urls = scrapy.Field()
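
The spider itself was not included in the original post. A minimal sketch consistent with the log output further below (spider name dyttspider, start URL http://dytt8.net/, project module movie_heaven) might look like this; the XPath selectors are assumptions:

# -*- coding: utf-8 -*-
import scrapy

from movie_heaven.items import MovieHeavenItem


class DyttSpider(scrapy.Spider):
    name = 'dyttspider'
    start_urls = ['http://dytt8.net/']

    def parse(self, response):
        # Follow each movie detail link found on the index page.
        # (The selectors actually used by the author are unknown.)
        for href in response.xpath('//a[contains(@href, "/html/gndy/")]/@href').getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        item = MovieHeavenItem()
        item['title'] = response.xpath('//title/text()').get()
        item['image_urls'] = response.xpath('//img/@src').getall()
        yield item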

Then we add print markers in the middleware for testing:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

import scrapy
from scrapy import signals
from scrapy.exceptions import NotConfigured


class SpiderOpenCloseLogging(object):

    def __init__(self):
        self.items_scraped = 0
        self.items_dropped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Read the settings to check whether the extension is enabled;
        # if not, raise NotConfigured so the extension stays disabled.
        if not crawler.settings.getbool('MY_EXTENSION'):
            raise NotConfigured
        # Instantiate the extension object
        ext = cls()
        # Register the signal handlers
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.item_dropped, signal=signals.item_dropped)

        return ext

    # The four custom signal handlers
    def spider_opened(self, spider):
        spider.log(">>> opened spider %s" % spider.name)

    def spider_closed(self, spider):
        print('+++'*5)
        spider.log('&&&&&&')
        spider.log(">>> closed spider %s" % spider.name)
        spider.log(">>>scraped %d items" % self.items_scraped)
        spider.log(">>>dropped %d items" % self.items_dropped)
        print('+++' * 5)
        # Dump the stats collector's data
        print('*' * 15)
        print(spider.crawler.stats.get_stats())
        print('*' * 15)

    def item_scraped(self, item, response, spider):
        self.items_scraped += 1

    def item_dropped(self, item, response, exception, spider):
        self.items_dropped += 1


class MovieHeavenSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        print('---spider  input -----  中间件--response-----')
        print(response.headers)
        print(response.url)
        print(response)
        print(spider.log('spider'))
        print(spider.name)
        print('---spider  input -----  中间件--response-----')
        # return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        print('---spider  output -----  中间件--result-----')
        print(result)
        print('---spider  output -----  中间件--result-----')
        for i in result:
            print(i)
            print('---spider  output -----  中间件--iiiiiiii-----')
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            print('---spider  start_request -----  中间件--rrrrr-----')
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
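
For the middleware and the extension to run, they have to be enabled in settings.py. A sketch assuming the project module is named movie_heaven (the middleware path matches the log output below; the order values are typical defaults, not taken from the original project):

# settings.py (excerpt)

SPIDER_MIDDLEWARES = {
    # For spider middleware, lower orders sit closer to the engine,
    # higher orders closer to the spider.
    'movie_heaven.middlewares.MovieHeavenSpiderMiddleware': 543,
}

EXTENSIONS = {
    'movie_heaven.middlewares.SpiderOpenCloseLogging': 500,
}

# Custom flag checked in SpiderOpenCloseLogging.from_crawler();
# without it the extension raises NotConfigured and stays disabled.
MY_EXTENSION = True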

Run the spider and look at the results:

When the spider starts, the middleware method
process_start_requests(self, start_requests, spider)
is executed first, for the first request(s) of the spider (# Called with the start requests of the spider):

 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-05-07 16:34:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'movie_heaven.middlewares.MovieHeavenSpiderMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-05-07 16:34:35 [scrapy.middleware] INFO: Enabled item pipelines:
['movie_heaven.pipelines.NewsPipeline']
2019-05-07 16:34:35 [scrapy.core.engine] INFO: Spider opened
2019-05-07 16:34:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-07 16:34:35 [dyttspider] DEBUG: >>> opened spider dyttspider
2019-05-07 16:34:35 [dyttspider] INFO: Spider opened: dyttspider
---spider  start_request -----  中间件--rrrrr-----
2019-05-07 16:34:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://dytt8.net/> (referer: None)

Then the middleware method
process_spider_output(self, response, result, spider)
is executed. We can see that after the first request was sent, many new requests extracted from the returned response were pushed onto the queue:



---spider  start_request -----  中间件--rrrrr-----
2019-05-07 16:34:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://dytt8.net/> (referer: None)
---spider  input -----  中间件--response-----
{b'Content-Type': [b'text/html'], b'Content-Location': [b'http://dytt8.net/index.htm'], b'Last-Modified': [b'Tue, 07 May 2019 06:19:14 GMT'], b'Accept-Ranges': [b'bytes'], b'Etag': [b'"09dbbca9c4d51:320"'], b'Vary': [b'Accept-Encoding'], b'Server': [b'Microsoft-IIS/6.0'], b'Date': [b'Tue, 07 May 2019 08:31:23 GMT'], b'X-Via': [b'1.1 SN201275 (random:706742 Fikker/Webcache/3.7.8)']}
http://dytt8.net/
<200 http://dytt8.net/>
2019-05-07 16:34:35 [dyttspider] DEBUG: spider
None
dyttspider
---spider  input -----  中间件--response-----
---spider  output -----  中间件--result-----
<generator object RefererMiddleware.process_spider_output.<locals>.<genexpr> at 0x1119a30a0>
---spider  output -----  中间件--result-----
<GET https://www.dytt8.net/html/gndy/jddy/20160320/50523.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190507/58577.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/dyzz/20190506/58576.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/dyzz/20190506/58567.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190506/58566.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/dyzz/20190505/58556.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/dyzz/20180629/57052.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190505/58554.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190505/58555.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190504/58550.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190504/58549.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/dyzz/20190503/58539.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190503/58538.html>
---spider  output -----  中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190503/58537.html>
---spider  output -----  中间件--iiiiiiii-----

Next, we can see that process_spider_input() is called every time a response object comes back to the spider, with access to all of the response's information:
    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        print('---spider  input -----  中间件--response-----')
        print(response.headers)
        print(response.url)
        print(response)
        print(spider.log('spider'))
        print(spider.name)
        print(response.body)
        print('---spider  input -----  中间件--response-----')
        # return None

And this next method is called when the spider's results go back to the engine: items are passed on through the item pipeline, and new requests are pushed back onto the scheduler queue:

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        print('---spider  output -----  中间件--result-----')
        print(result)
        print('---spider  output -----  中间件--result-----')
        for i in result:
            print(i)
            print('---spider  output -----  中间件--iiiiiiii-----')
            yield i
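
The NewsPipeline that the log shows as enabled is also not included in the post; only its import path movie_heaven.pipelines.NewsPipeline appears. A minimal hypothetical version that just persists the scraped fields could look like this:

import json


class NewsPipeline(object):
    # Hypothetical stand-in for movie_heaven.pipelines.NewsPipeline.

    def open_spider(self, spider):
        self.file = open('movies.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Write each scraped item as one JSON line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item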




---spider  input -----  中间件--response-----
2019-05-07 16:09:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dytt8.net/html/gndy/dyzz/20190326/58384.html> (referer: http://dytt8.net/)
---spider  input -----  中间件--response-----
{b'Content-Type': [b'text/html'], b'Last-Modified': [b'Sat, 27 Apr 2019 17:32:26 GMT'], b'Accept-Ranges': [b'bytes'], b'Etag': [b'"0311c2e1ffdd41:320"'], b'Vary': [b'Accept-Encoding'], b'Server': [b'Microsoft-IIS/6.0'], b'Date': [b'Tue, 07 May 2019 03:33:49 GMT'], b'X-Via': [b'1.1 st1385 (random:685532 Fikker/Webcache/3.7.8)']}
https://www.dytt8.net/html/gndy/dyzz/20190326/58384.html
<200 https://www.dytt8.net/html/gndy/dyzz/20190326/58384.html>
2019-05-07 16:09:25 [dyttspider] DEBUG: spider
None
dyttspider
---spider  input -----  中间件--response-----
---spider  output -----  中间件--result-----
<generator object RefererMiddleware.process_spider_output.<locals>.<genexpr> at 0x1125ac048>
---spider  output -----  中间件--result-----
{'image_urls': ['https://extraimage.net/images/2019/03/24/2b17a4a657287477ef03ee1bce1130b2.jpg',
                'https://lookimg.com/images/2019/03/25/l2xoE.jpg'],
 'title': '2018年剧情《三次元女友/3D女友》BD日语中字'}
---spider  output -----  中间件--iiiiiiii-----
2019-05-07 16:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.dytt8.net/html/gndy/jddy/20190326/58383.html>
{'image_urls': ['https://extraimage.net/images/2019/03/24/2b17a4a657287477ef03ee1bce1130b2.jpg',
                'https://lookimg.com/images/2019/03/25/l2xoE.jpg'],
 'title': '2018年剧情《三次元女友/3D女友》BD日语中字'}
---spider  output -----  中间件--result-----
<generator object RefererMiddleware.process_spider_output.<locals>.<genexpr> at 0x11268ee60>
---spider  output -----  中间件--result-----
{'image_urls': ['https://extraimage.net/images/2019/03/25/573abfc1844e0514d614000849d2168a.jpg',
                'https://lookimg.com/images/2019/03/25/l2zwd.jpg'],
 'title': '2018年获奖剧情喜剧《副总统》BD中英双字幕'}
---spider  output -----  中间件--iiiiiiii-----
2019-05-07 16:09:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.dytt8.net/html/gndy/dyzz/20190326/58384.html>
{'image_urls': ['https://extraimage.net/images/2019/03/25/573abfc1844e0514d614000849d2168a.jpg',
                'https://lookimg.com/images/2019/03/25/l2zwd.jpg'],
 'title': '2018年获奖剧情喜剧《副总统》BD中英双字幕'}
^C2019-05-07 16:09:26 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force 
2019-05-07 16:09:26 [scrapy.core.engine] INFO: Closing spider (shutdown)
2019-05-07 16:09:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dytt8.net/html/gndy/dyzz/20190326/58386.html> (referer: http://dytt8.net/)
---spider  input -----  中间件--response-----
{b'Content-Type': [b'text/html'], b'Last-Modifi

Downloader Middleware

The downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering Scrapy's requests and responses.

From the official documentation:

process_request(request, spider)

This method is called for each request that goes through the downloader middleware.

process_request() must return one of the following: None, a Response object, a Request object, or raise IgnoreRequest.

If it returns None, Scrapy will continue processing this request, executing the corresponding methods of the other middlewares, until the appropriate download handler is called and the request is performed (its response downloaded).

If it returns a Response object, Scrapy won't bother calling any other process_request() or process_exception() methods, or the appropriate download function; it will return that response. The process_response() methods of the installed middleware are still called on every response.

If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.

If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middleware will be called. If none of them handle the exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

    Parameters:
        request (Request object) – the request being processed
        spider (Spider object) – the spider this request is intended for
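
As an illustration of the "return None" path, here is a sketch of a downloader middleware that rotates the User-Agent header in process_request(); the class name and header values are assumptions:

import random


class RandomUserAgentMiddleware(object):
    # Hypothetical middleware: rotate the User-Agent header.

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14)',
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        # Returning None lets the request continue down the chain
        # to the remaining middlewares and the download handler.
        return None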
process_response(request, response, spider)

This method is called with the response returned from the downloader.

process_response() must return one of the following: a Response object, a Request object, or raise an IgnoreRequest exception.

If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() methods of the other middlewares in the chain.

If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded. This behaves the same as when a request is returned from process_request().

If it raises an IgnoreRequest exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

    Parameters:
        request (Request object) – the request that originated the response
        response (Response object) – the response being processed
        spider (Spider object) – the spider this response is intended for
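
For example, here is a sketch of a process_response() that re-schedules requests when the server returns certain status codes; the class name and retry statuses are illustrative (note that Scrapy already ships a more complete RetryMiddleware):

class SimpleRetryMiddleware(object):
    # Hypothetical middleware: returning a Request from
    # process_response() re-schedules the download.

    RETRY_STATUSES = {500, 502, 503}

    def process_response(self, request, response, spider):
        if response.status in self.RETRY_STATUSES:
            spider.logger.debug('retrying %s (status %d)'
                                % (request.url, response.status))
            # dont_filter=True so the duplicate filter lets the
            # rescheduled copy of the request through again.
            return request.replace(dont_filter=True)
        # Otherwise pass the response on to the next middleware.
        return response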

For reference, below is the default downloader middleware skeleton that Scrapy generates for a new project:

class MovieHeavenDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        return None

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
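
As with the spider middleware, the downloader middleware must be enabled in settings.py; a sketch assuming the same movie_heaven project layout (the order value is a typical default, not taken from the original project):

# settings.py (excerpt)

DOWNLOADER_MIDDLEWARES = {
    # For downloader middleware, lower orders run process_request()
    # earlier and process_response() later.
    'movie_heaven.middlewares.MovieHeavenDownloaderMiddleware': 543,
}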

 
