Spider Middleware
Spider middleware is a framework of hooks into Scrapy's spider processing mechanism: you can plug in custom code to process the responses sent to spiders, and the items and requests that spiders produce.
With the execution order of the components in mind, let's start with the explanation from the official documentation:
process_spider_input(response, spider)
Parameters:
response (Response object) – the response being processed
spider (Spider object) – the spider this response is intended for
This method is called for each response that passes through the spider middleware on its way into the spider.
process_spider_input() should return None or raise an exception.
If it returns None, Scrapy continues processing the response, executing all the other middlewares until the response is finally handed to the spider.
If it raises an exception, Scrapy stops calling any other middleware's process_spider_input() and calls the request's errback instead. The output of the errback is chained back in the other direction, processed by process_spider_output(), or by process_spider_exception() if it raised an exception.
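This return-or-raise contract can be sketched without Scrapy at all. The middleware below is a hypothetical illustration (the `MIN_BODY_LEN` threshold and the `StubResponse` stand-in are assumptions, not Scrapy API):

```python
class BodyGuardMiddleware:
    """Hypothetical spider middleware: raise to divert a response to errback."""
    MIN_BODY_LEN = 10  # assumed threshold, for illustration only

    def process_spider_input(self, response, spider):
        # Contract: return None to let the next middleware (and finally
        # the spider) process the response; raise to skip the remaining
        # process_spider_input() calls and trigger the request's errback.
        if len(response.body) < self.MIN_BODY_LEN:
            raise ValueError("suspiciously short body: %s" % response.url)
        return None


class StubResponse:
    """Stand-in for scrapy.http.Response, just for this demo."""
    def __init__(self, url, body):
        self.url, self.body = url, body


mw = BodyGuardMiddleware()
result = mw.process_spider_input(StubResponse("http://dytt8.net/", b"x" * 100), None)
print(result)  # None -> Scrapy would continue down the middleware chain
```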
process_spider_output(response, result, spider)
This method is called with the results that the spider returns after processing a response.
process_spider_output() must return an iterable of Request or Item objects.
Parameters:
response (Response object) – the response that generated this output
result (an iterable of Request or Item objects) – the result returned by the spider
spider (Spider object) – the spider whose result is being processed
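Because the return value only has to be an iterable, a generator is the natural shape for this hook. A minimal sketch, using plain dicts as stand-ins for items (not actual Scrapy API):

```python
class DropUntitledMiddleware:
    """Hypothetical spider middleware: filter the spider's output stream."""

    def process_spider_output(self, response, result, spider):
        # Contract: must return an iterable of Request/Item objects.
        # Yielding one by one lets us inspect, rewrite, or drop each.
        for obj in result:
            if isinstance(obj, dict) and not obj.get("title"):
                continue  # drop items that are missing a title
            yield obj


mw = DropUntitledMiddleware()
scraped = [{"title": "三次元女友"}, {"title": ""}, {"title": "副总统"}]
kept = list(mw.process_spider_output(None, iter(scraped), None))
print(len(kept))  # 2
```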
Let's use a demo to see what the data actually looks like. First we point Scrapy at a site and extract two fields:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class MovieHeavenItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    image_urls = scrapy.Field()
Then we add markers in the middleware so we can trace the calls:
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
import scrapy
from scrapy import signals
from scrapy.exceptions import NotConfigured


class SpiderOpenCloseLogging(object):

    def __init__(self):
        self.items_scraped = 0
        self.items_dropped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Read the settings to check whether the extension is enabled;
        # if not, raise NotConfigured so the extension is disabled.
        if not crawler.settings.getbool('MY_EXTENSION'):
            raise NotConfigured
        # Instantiate the extension object
        ext = cls()
        # Register the signal handlers
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.item_dropped, signal=signals.item_dropped)
        return ext

    # The four custom signal handlers
    def spider_opened(self, spider):
        spider.log(">>> opened spider %s" % spider.name)

    def spider_closed(self, spider):
        print('+++' * 5)
        spider.log(">>> closed spider %s" % spider.name)
        spider.log(">>> scraped %d items" % self.items_scraped)
        spider.log(">>> dropped %d items" % self.items_dropped)
        print('+++' * 5)
        # Dump the stats collector
        print('*' * 15)
        print(spider.crawler.stats.get_stats())
        print('*' * 15)

    def item_scraped(self, item, response, spider):
        self.items_scraped += 1

    def item_dropped(self, item, response, exception, spider):
        self.items_dropped += 1


class MovieHeavenSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        print('---spider input ----- 中间件--response-----')
        print(response.headers)
        print(response.url)
        print(response)
        print(spider.log('spider'))
        print(spider.name)
        print('---spider input ----- 中间件--response-----')
        # return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        print('---spider output ----- 中间件--result-----')
        print(result)
        print('---spider output ----- 中间件--result-----')
        for i in result:
            print(i)
            print('---spider output ----- 中间件--iiiiiiii-----')
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            print('---spider start_request ----- 中间件--rrrrr-----')
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
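Neither the extension nor the middleware does anything until it is enabled in settings.py. A plausible configuration for this project might look like the following sketch (the order values 500 and 543 are arbitrary assumptions):

```python
# settings.py (sketch) -- paths assume the movie_heaven project layout
SPIDER_MIDDLEWARES = {
    'movie_heaven.middlewares.MovieHeavenSpiderMiddleware': 543,
}
EXTENSIONS = {
    'movie_heaven.middlewares.SpiderOpenCloseLogging': 500,
}
# Checked in from_crawler(); without it, NotConfigured disables the extension
MY_EXTENSION = True
```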
Run the spider and look at the results.
When the spider starts ("Called with the start requests of the spider"), the first thing executed is the middleware's process_start_requests(self, start_requests, spider):
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-05-07 16:34:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'movie_heaven.middlewares.MovieHeavenSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-05-07 16:34:35 [scrapy.middleware] INFO: Enabled item pipelines:
['movie_heaven.pipelines.NewsPipeline']
2019-05-07 16:34:35 [scrapy.core.engine] INFO: Spider opened
2019-05-07 16:34:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-07 16:34:35 [dyttspider] DEBUG: >>> opened spider dyttspider
2019-05-07 16:34:35 [dyttspider] INFO: Spider opened: dyttspider
---spider start_request ----- 中间件--rrrrr-----
2019-05-07 16:34:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://dytt8.net/> (referer: None)
Then the middleware's process_spider_output(self, response, result, spider) was executed: once the first request had been sent, the many new requests extracted from the returned response were pushed onto the queue:
---spider input ----- 中间件--response-----
{b'Content-Type': [b'text/html'], b'Content-Location': [b'http://dytt8.net/index.htm'], b'Last-Modified': [b'Tue, 07 May 2019 06:19:14 GMT'], b'Accept-Ranges': [b'bytes'], b'Etag': [b'"09dbbca9c4d51:320"'], b'Vary': [b'Accept-Encoding'], b'Server': [b'Microsoft-IIS/6.0'], b'Date': [b'Tue, 07 May 2019 08:31:23 GMT'], b'X-Via': [b'1.1 SN201275 (random:706742 Fikker/Webcache/3.7.8)']}
http://dytt8.net/
<200 http://dytt8.net/>
2019-05-07 16:34:35 [dyttspider] DEBUG: spider
None
dyttspider
---spider input ----- 中间件--response-----
---spider output ----- 中间件--result-----
<generator object RefererMiddleware.process_spider_output.<locals>.<genexpr> at 0x1119a30a0>
---spider output ----- 中间件--result-----
<GET https://www.dytt8.net/html/gndy/jddy/20160320/50523.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190507/58577.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/dyzz/20190506/58576.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/dyzz/20190506/58567.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190506/58566.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/dyzz/20190505/58556.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/dyzz/20180629/57052.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190505/58554.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190505/58555.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190504/58550.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190504/58549.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/dyzz/20190503/58539.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190503/58538.html>
---spider output ----- 中间件--iiiiiiii-----
<GET https://www.dytt8.net/html/gndy/jddy/20190503/58537.html>
---spider output ----- 中间件--iiiiiiii-----
Next, we can see that process_spider_input() is called each time a response object comes back to the spider, carrying all of the response's information:
    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        print('---spider input ----- 中间件--response-----')
        print(response.headers)
        print(response.url)
        print(response)
        print(spider.log('spider'))
        print(spider.name)
        print(response.body)
        print('---spider input ----- 中间件--response-----')
        # return None
And process_spider_output() is called when the spider's results head back to the engine: items are passed on through the item pipeline, while new requests are pushed back onto the scheduling queue:
    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        print('---spider output ----- 中间件--result-----')
        print(result)
        print('---spider output ----- 中间件--result-----')
        for i in result:
            print(i)
            print('---spider output ----- 中间件--iiiiiiii-----')
            yield i
---spider input ----- 中间件--response-----
2019-05-07 16:09:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dytt8.net/html/gndy/dyzz/20190326/58384.html> (referer: http://dytt8.net/)
---spider input ----- 中间件--response-----
{b'Content-Type': [b'text/html'], b'Last-Modified': [b'Sat, 27 Apr 2019 17:32:26 GMT'], b'Accept-Ranges': [b'bytes'], b'Etag': [b'"0311c2e1ffdd41:320"'], b'Vary': [b'Accept-Encoding'], b'Server': [b'Microsoft-IIS/6.0'], b'Date': [b'Tue, 07 May 2019 03:33:49 GMT'], b'X-Via': [b'1.1 st1385 (random:685532 Fikker/Webcache/3.7.8)']}
https://www.dytt8.net/html/gndy/dyzz/20190326/58384.html
<200 https://www.dytt8.net/html/gndy/dyzz/20190326/58384.html>
2019-05-07 16:09:25 [dyttspider] DEBUG: spider
None
dyttspider
---spider input ----- 中间件--response-----
---spider output ----- 中间件--result-----
<generator object RefererMiddleware.process_spider_output.<locals>.<genexpr> at 0x1125ac048>
---spider output ----- 中间件--result-----
{'image_urls': ['https://extraimage.net/images/2019/03/24/2b17a4a657287477ef03ee1bce1130b2.jpg',
'https://lookimg.com/images/2019/03/25/l2xoE.jpg'],
'title': '2018年剧情《三次元女友/3D女友》BD日语中字'}
---spider output ----- 中间件--iiiiiiii-----
2019-05-07 16:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.dytt8.net/html/gndy/jddy/20190326/58383.html>
{'image_urls': ['https://extraimage.net/images/2019/03/24/2b17a4a657287477ef03ee1bce1130b2.jpg',
'https://lookimg.com/images/2019/03/25/l2xoE.jpg'],
'title': '2018年剧情《三次元女友/3D女友》BD日语中字'}
---spider output ----- 中间件--result-----
<generator object RefererMiddleware.process_spider_output.<locals>.<genexpr> at 0x11268ee60>
---spider output ----- 中间件--result-----
{'image_urls': ['https://extraimage.net/images/2019/03/25/573abfc1844e0514d614000849d2168a.jpg',
'https://lookimg.com/images/2019/03/25/l2zwd.jpg'],
'title': '2018年获奖剧情喜剧《副总统》BD中英双字幕'}
---spider output ----- 中间件--iiiiiiii-----
2019-05-07 16:09:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.dytt8.net/html/gndy/dyzz/20190326/58384.html>
{'image_urls': ['https://extraimage.net/images/2019/03/25/573abfc1844e0514d614000849d2168a.jpg',
'https://lookimg.com/images/2019/03/25/l2zwd.jpg'],
'title': '2018年获奖剧情喜剧《副总统》BD中英双字幕'}
^C2019-05-07 16:09:26 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2019-05-07 16:09:26 [scrapy.core.engine] INFO: Closing spider (shutdown)
2019-05-07 16:09:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dytt8.net/html/gndy/dyzz/20190326/58386.html> (referer: http://dytt8.net/)
---spider input ----- 中间件--response-----
{b'Content-Type': [b'text/html'], b'Last-Modifi
Downloader Middleware
Downloader middleware is a framework of hooks into Scrapy's request/response processing: a light, low-level system for globally altering Scrapy's requests and responses.
The explanation from the official documentation:
process_request(request, spider)
This method is called for each request that goes through the downloader middleware.
process_request() must return one of: None, a Response object, a Request object, or raise IgnoreRequest.
If it returns None, Scrapy continues processing the request, executing all the other middlewares until the appropriate download handler is called and the request is performed (its response downloaded).
If it returns a Response object, Scrapy won't call any other process_request() or process_exception() methods, or the download function; it returns that response instead. The process_response() methods of the installed middlewares are still called on every response.
If it returns a Request object, Scrapy stops calling process_request() methods and reschedules the returned request. Once the newly returned request is performed, the middleware chain is called on its downloaded response as usual.
If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middlewares are called. If none of them handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
Parameters:
request (Request object) – the request being processed
spider (Spider object) – the spider this request is intended for
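The "return a Response to skip the download" branch is what caching and mocking middlewares rely on. A sketch with stand-in classes (the stubs and the in-memory cache are illustration-only assumptions, not Scrapy's real Request/Response):

```python
class LocalCacheMiddleware:
    """Hypothetical downloader middleware: answer some requests from memory."""

    def __init__(self):
        self.cache = {}  # url -> body; a real cache would live on disk

    def process_request(self, request, spider):
        # Contract: None -> let the download proceed; a Response -> skip
        # the download entirely (process_response() still runs on it).
        body = self.cache.get(request.url)
        if body is not None:
            return StubResponse(request.url, body)
        return None


class StubRequest:
    def __init__(self, url):
        self.url = url


class StubResponse:
    def __init__(self, url, body):
        self.url, self.body = url, body


mw = LocalCacheMiddleware()
mw.cache["http://dytt8.net/"] = b"<html>cached</html>"
hit = mw.process_request(StubRequest("http://dytt8.net/"), None)
miss = mw.process_request(StubRequest("http://example.org/"), None)
print(hit is not None, miss)  # True None
```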
process_response(request, response, spider)
process_response() must return one of: a Response object, a Request object, or raise an IgnoreRequest exception.
If it returns a Response (it could be the same response passed in, or a brand-new one), that response is processed by the process_response() methods of the other middlewares in the chain.
If it returns a Request object, the middleware chain is halted and the returned request is rescheduled for download, just as when process_request() returns a request.
If it raises an IgnoreRequest exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
Parameters:
request (Request object) – the request that originated the response
response (Response object) – the response being processed
spider (Spider object) – the spider this response is intended for
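Returning the original request from process_response() is how a simple retry can be expressed. A sketch with stand-in objects (this is an illustration, not Scrapy's built-in RetryMiddleware):

```python
class RetryServerErrorMiddleware:
    """Hypothetical downloader middleware: reschedule requests on 5xx."""

    def process_response(self, request, response, spider):
        # Contract: a Response passes on to the next middleware in the
        # chain; a Request halts the chain and goes back to the scheduler.
        if 500 <= response.status < 600:
            return request  # reschedule the download
        return response


class StubRequest:
    def __init__(self, url):
        self.url = url


class StubResponse:
    def __init__(self, url, status):
        self.url, self.status = url, status


mw = RetryServerErrorMiddleware()
req = StubRequest("http://dytt8.net/")
ok = mw.process_response(req, StubResponse(req.url, 200), None)
retry = mw.process_response(req, StubResponse(req.url, 503), None)
print(ok.status, retry is req)  # 200 True
```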
from scrapy import signals


class MovieHeavenDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        return None

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
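As with the spider middleware, this class is inert until registered in settings.py. An assumed entry might look like this (543 is an arbitrary order value; lower numbers sit closer to the engine):

```python
# settings.py (sketch) -- path assumes the movie_heaven project layout
DOWNLOADER_MIDDLEWARES = {
    'movie_heaven.middlewares.MovieHeavenDownloaderMiddleware': 543,
}
```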