A Close Reading of Downloader Middleware (1)

星空永恒&&卡利达
Article link: https://blog.csdn.net/qq_24683561/article/details/53980304
Downloader middleware affects both requests and responses; things like proxy IPs are handled here.
	The downloader middleware is a framework of hooks into Scrapy’s request/response processing. It’s a light, low-level system for globally altering Scrapy’s requests and responses.


Activating a downloader middleware
	To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting, which is a dict whose keys are the middleware class paths and their values are the middleware orders.

Here’s an example:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
myproject is the project name.
middlewares is a .py file in the project directory.
CustomDownloaderMiddleware is a custom class defined in that .py file.
543 is the order of the middleware.
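
To make the path in the setting concrete, here is a minimal sketch of what myproject/middlewares.py might contain. The class name matches the example above; the method bodies are placeholders that simply pass everything through, which is an assumption for illustration rather than anything the Scrapy documentation prescribes.

```python
# myproject/middlewares.py (hypothetical contents, for illustration only)


class CustomDownloaderMiddleware:
    """A do-nothing downloader middleware showing the three hook methods."""

    def process_request(self, request, spider):
        # Returning None lets Scrapy keep processing the request normally.
        return None

    def process_response(self, request, response, spider):
        # Returning the response unchanged passes it on to the next middleware.
        return response

    def process_exception(self, request, exception, spider):
        # Returning None lets other middlewares handle the exception.
        return None
```

A middleware does not have to define all three methods; only the hooks it defines are called.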

DOWNLOADER_MIDDLEWARES (the downloader middlewares you define yourself) and DOWNLOADER_MIDDLEWARES_BASE (the ones that ship with the framework) are merged together.
	The DOWNLOADER_MIDDLEWARES setting is merged with the DOWNLOADER_MIDDLEWARES_BASE setting defined in Scrapy (and not meant to be overridden) and then sorted by order to get the final sorted list of enabled middlewares: the first middleware is the one closer to the engine and the last is the one closer to the downloader. In other words, the process_request() method of each middleware will be invoked in increasing middleware order (100, 200, 300, ...) and the process_response() method of each middleware will be invoked in decreasing order.
process_request() is where you can add your proxy IP.

The order of the middlewares matters, because some of them may depend on others.
	To decide which order to assign to your middleware see the DOWNLOADER_MIDDLEWARES_BASE setting and pick a value according to where you want to insert the middleware. The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied.
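
The merge-and-sort behaviour described above can be imitated in plain Python. This is only a sketch of the idea, not Scrapy's own code: the helper name enabled_middlewares is made up, and BASE is a small assumed subset of the built-in orders (the real DOWNLOADER_MIDDLEWARES_BASE has more entries and its values may differ between Scrapy versions).

```python
# Assumed subset of the built-in middleware orders, for illustration only.
BASE = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}

# The project setting from the example above.
CUSTOM = {
    "myproject.middlewares.CustomDownloaderMiddleware": 543,
}


def enabled_middlewares(base, custom):
    """Merge both dicts and return the class paths sorted by order (hypothetical helper)."""
    merged = {**base, **custom}
    return sorted(merged, key=merged.get)


# process_request() runs in increasing order (engine side first) ...
request_order = enabled_middlewares(BASE, CUSTOM)
# ... and process_response() runs in decreasing order (downloader side first).
response_order = list(reversed(request_order))

for path in request_order:
    print("process_request :", path)
for path in response_order:
    print("process_response:", path)
```

With these assumed numbers, the custom middleware at 543 sees each request after the user-agent middleware and before the retry and proxy middlewares, and sees each response in the opposite position.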

class scrapy.downloadermiddlewares.DownloaderMiddleware

Your proxy is added in this method.
process_request(request, spider)
	Possible return values of this method:
	process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest

	If the return value is None, Scrapy goes on to call the appropriate downloader handler for the request. I ran into a problem here: if the proxy IP is unstable, the spider runs into trouble in this case.
	If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request is performed (and its response downloaded).

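	So the more reliable option is to return a Response object. Anyone who has read the Architecture overview section knows that the response eventually goes back to the spiders for processing. For example, you can call requests.get(url, timeout, proxies) from the requests library to fetch the page yourself; note that its result then has to be wrapped in a Scrapy Response object before it is returned.
	If it returns a Response object, Scrapy won’t bother calling any other process_request() or process_exception() methods, or the appropriate download function; it’ll return that response. The process_response() methods of installed middleware are always called on every response.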
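
	The idea of fetching the page with the requests library and handing a Response back to Scrapy could look roughly like this. It is a sketch under several assumptions: the class name, the helper functions, the proxy address and the timeout are all made up, and falling back to None on any error is just one possible policy. The requests and scrapy imports sit inside the helpers so that the fetching and wrapping steps can be swapped out.

```python
def fetch_with_requests(url, proxies, timeout):
    """Download url with the third-party requests library (hypothetical helper)."""
    import requests

    reply = requests.get(url, proxies=proxies, timeout=timeout)
    return reply.status_code, reply.content


def build_scrapy_response(url, status, body, request):
    """Wrap the raw bytes in a Scrapy response object (hypothetical helper)."""
    from scrapy.http import HtmlResponse

    return HtmlResponse(url=url, status=status, body=body, request=request)


class RequestsFetchMiddleware:
    # Placeholder proxy address and timeout; replace with real values.
    PROXIES = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}
    TIMEOUT = 10

    def __init__(self, fetch=fetch_with_requests, build=build_scrapy_response):
        self.fetch = fetch
        self.build = build

    def process_request(self, request, spider):
        try:
            status, body = self.fetch(request.url, self.PROXIES, self.TIMEOUT)
        except Exception:
            # Assumed policy: on any failure, let Scrapy download the page itself.
            return None
        # Returning a Response skips the remaining process_request() calls
        # and the downloader; process_response() methods still run on it.
        return self.build(request.url, status, body, request)
```

	One caveat with this design: requests.get() is a blocking call, so Scrapy cannot work on other requests while it waits. That may be acceptable for a small crawl but it gives up much of Scrapy's concurrency.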

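	When the return value is a Request object, I could not at first see how this differs from returning None. According to the documentation, the difference is that the remaining process_request() methods are skipped and the returned request is rescheduled, instead of the current request continuing down the chain.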
	If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.

	If it raises an IgnoreRequest exception, the process_exception() methods of installed downloader middleware will be called. If none of them handle the exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

Parameters:	

    request (Request object) – the request being processed
    spider (Spider object) – the spider for which this request is intended
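
As a sketch of adding a proxy inside process_request(), the middleware below writes a proxy address into request.meta under the "proxy" key, which is the key Scrapy's built-in HttpProxyMiddleware documents, and then returns None so the request continues normally. The class name and the proxy addresses are placeholders, not values from the Scrapy documentation.

```python
import random


class RandomProxyMiddleware:
    # Placeholder proxy addresses; replace them with proxies you actually control.
    PROXIES = [
        "http://127.0.0.1:8080",
        "http://127.0.0.1:8081",
    ]

    def process_request(self, request, spider):
        # Keep a proxy that was already chosen for this request, if any.
        if "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self.PROXIES)
        # None means: carry on with the other middlewares and the download.
        return None
```

To try it, the class would be registered in DOWNLOADER_MIDDLEWARES in the same way as CustomDownloaderMiddleware above.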

I have not studied this one closely yet.
process_response(request, response, spider)

    process_response() should either: return a Response object, return a Request object or raise an IgnoreRequest exception.

    If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.

    If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().

    If it raises an IgnoreRequest exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
    Parameters:	

        request (Request object) – the request that originated the response
        response (Response object) – the response being processed
        spider (Spider object) – the spider for which this response is intended (the response is handed back to this spider)
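
The two normal return values of process_response() can be illustrated with a small retry sketch: a good response is passed along unchanged, while a response with a suspicious status code causes a copy of the request to be returned and therefore rescheduled. The class name, the status codes, the retry limit and the meta key are all assumptions chosen for the example. Scrapy already ships its own RetryMiddleware, so this is meant only to show the return values, not to replace it.

```python
class RetryOnBadStatusMiddleware:
    # Assumed values for illustration only.
    RETRY_STATUSES = {403, 429, 503}
    MAX_RETRIES = 3
    META_KEY = "bad_status_retries"  # hypothetical counter stored in request.meta

    def process_response(self, request, response, spider):
        if response.status not in self.RETRY_STATUSES:
            # A Response continues through the remaining process_response() calls.
            return response

        retries = request.meta.get(self.META_KEY, 0)
        if retries >= self.MAX_RETRIES:
            # Give up and let the spider see the bad response.
            return response

        # A Request halts the chain and is rescheduled for download.
        # dont_filter=True keeps the duplicate filter from dropping the retry.
        retry_request = request.replace(dont_filter=True)
        retry_request.meta[self.META_KEY] = retries + 1
        return retry_request
```

The counter in request.meta is there so that a page which always answers with a bad status does not get retried forever.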

I have not studied this one closely either.
process_exception(request, exception, spider)

    Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception).

    process_exception() should return: either None, a Response object, or a Request object.

    If it returns None, Scrapy will continue processing this exception, executing any other process_exception() methods of installed middleware, until no middleware is left and the default exception handling kicks in.

    If it returns a Response object, the process_response() method chain of installed middleware is started, and Scrapy won’t bother calling any other process_exception() methods of middleware.

    If it returns a Request object, the returned request is rescheduled to be downloaded in the future. This stops the execution of process_exception() methods of the middleware the same as returning a response would.
    Parameters:	

        request (Request object) – the request that generated the exception
        exception (an Exception object) – the raised exception
        spider (Spider object) – the spider for which this request is intended
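
For the unstable-proxy problem mentioned earlier, process_exception() is one place where a failed download could be retried. The sketch below returns None when no proxy was involved, so other middlewares handle the exception, and otherwise returns a copy of the request with the proxy removed so that it is rescheduled. The class name and the policy of dropping the proxy after a single failure are assumptions for illustration.

```python
class ProxyFallbackMiddleware:
    def process_exception(self, request, exception, spider):
        if "proxy" not in request.meta:
            # Not a proxied request: let other middlewares deal with the exception.
            return None

        # Returning a Request stops further process_exception() calls
        # and reschedules the download.
        retry_request = request.replace(dont_filter=True)
        # Assumed policy: retry once without the proxy that just failed.
        retry_request.meta.pop("proxy", None)
        return retry_request
```

A middleware that picks a new proxy for requests without one would give the retried request a different proxy on its next pass through process_request().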