Downloader middleware study notes (1)

	The downloader middleware is a framework of hooks into Scrapy’s request/response processing. It’s a light, low-level system for globally altering Scrapy’s requests and responses.

Activating a downloader middleware
	To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting, which is a dict whose keys are the middleware class paths and their values are the middleware orders.

Here’s an example:
    DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.CustomDownloaderMiddleware': 543}
Here myproject is the project name.

	The DOWNLOADER_MIDDLEWARES setting is merged with the DOWNLOADER_MIDDLEWARES_BASE setting defined in Scrapy (and not meant to be overridden) and then sorted by order to get the final sorted list of enabled middlewares: the first middleware is the one closer to the engine and the last is the one closer to the downloader. In other words, the process_request() method of each middleware will be invoked in increasing middleware order (100, 200, 300, ...) and the process_response() method of each middleware will be invoked in decreasing order.
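
	The merge-and-sort rule can be sketched in plain Python. This is only an illustration of the ordering, not Scrapy's own code: the two base entries and their order values are made-up placeholders, and the custom entry is the example path used earlier in these notes.

```python
# Illustration only: these base entries and numbers are placeholders,
# not the real contents of DOWNLOADER_MIDDLEWARES_BASE.
BASE = {
    "base.MiddlewareA": 100,
    "base.MiddlewareB": 900,
}
CUSTOM = {
    "myproject.middlewares.CustomDownloaderMiddleware": 543,
}


def enabled_in_order(base, custom):
    """Merge the two dicts and return class paths sorted by order value."""
    merged = {**base, **custom}
    return [path for path, _ in sorted(merged.items(), key=lambda kv: kv[1])]


chain = enabled_in_order(BASE, CUSTOM)
request_order = chain                   # process_request(): engine -> downloader
response_order = list(reversed(chain))  # process_response(): downloader -> engine
print(request_order)
print(response_order)
```

	With these placeholder numbers the custom middleware sits between the two base entries in both directions.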

	To decide which order to assign to your middleware see the DOWNLOADER_MIDDLEWARES_BASE setting and pick a value according to where you want to insert the middleware. The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied.

class scrapy.downloadermiddlewares.DownloaderMiddleware

process_request(request, spider)
	process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.

	If the return value is None, Scrapy calls the relevant downloader handler to process the request. I ran into a problem here: when the proxy IP is unstable, the download itself fails and the crawl breaks.
	If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request is performed (and its response downloaded).
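
	As a minimal sketch of the None case, the middleware below only annotates the request and then lets Scrapy carry on. The class name and the proxy addresses are hypothetical; the "proxy" key in request.meta is the one Scrapy's HTTP proxy support reads.

```python
import random


class RandomProxyMiddleware:
    """Hypothetical middleware: attach a proxy, then let Scrapy carry on."""

    # Placeholder addresses; replace with proxies you actually control.
    PROXIES = [
        "http://127.0.0.1:8001",
        "http://127.0.0.1:8002",
    ]

    def process_request(self, request, spider):
        # Keep a proxy that was already chosen; otherwise pick one at random.
        request.meta.setdefault("proxy", random.choice(self.PROXIES))
        # Returning None keeps the request moving through the remaining
        # middlewares and on to the download handler.
        return None
```

	Nothing here protects against a dead proxy: if the chosen address is down, the download handler raises and the failure surfaces through process_exception().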

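	So the more dependable option is to return a Response object. Anyone who has read the Architecture overview section knows that the response is eventually handed back to the spider for processing. For example, the middleware can fetch the page itself with requests.get(url, timeout=timeout, proxies=proxies) from the requests library and return a response built from the result. Note that requests.get() returns a requests.Response, which has to be converted into a Scrapy Response (for example scrapy.http.HtmlResponse) before it is returned, and that the call blocks Scrapy while it waits.
	If it returns a Response object, Scrapy won’t bother calling any other process_request() or process_exception() methods, or the appropriate download function; it’ll return that response. The process_response() methods of installed middleware are always called on every response.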
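
	To illustrate the short-circuit, here is a toy in-memory cache, assuming the responses it stores are ordinary Scrapy Response objects that came from earlier downloads. The class name is hypothetical and the sketch is not a substitute for Scrapy's built-in HTTP cache middleware; it only shows where the Response return value fits.

```python
class InMemoryCacheMiddleware:
    """Hypothetical middleware: serve repeated URLs from a dict."""

    def __init__(self):
        self._cache = {}  # url -> response seen earlier

    def process_request(self, request, spider):
        # A cached Response short-circuits the download: no later
        # process_request() runs and the download handler is skipped.
        # On a cache miss .get() yields None, so the request proceeds.
        return self._cache.get(request.url)

    def process_response(self, request, response, spider):
        # Remember the response and pass it along unchanged.
        self._cache[request.url] = response
        return response
```

	Because process_response() still runs on the cached object, any middleware that post-processes responses sees cache hits as well.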

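	At first I could not see how returning a Request object differs from returning None. The documentation's wording below gives the answer: None keeps the current request moving toward the downloader, while a returned Request is rescheduled in its place.
	If it returns a Request object, Scrapy will stop calling process_request() methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.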
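
	The contrast with None is easier to see in code. The sketch below rewrites http:// URLs to https:// by returning a new request built with Request.replace(); the class name is hypothetical and the rewrite rule is chosen only for illustration.

```python
class ForceHttpsMiddleware:
    """Hypothetical middleware: reschedule http:// requests as https://."""

    def process_request(self, request, spider):
        if request.url.startswith("http://"):
            https_url = "https://" + request.url[len("http://"):]
            # Returning a Request drops the current one; the new request
            # goes back to the scheduler and later re-enters the chain.
            return request.replace(url=https_url)
        # Already https: return None so this request continues downward.
        return None
```

	The startswith() check matters: a middleware that returned a Request unconditionally would reschedule forever and never download anything.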

	If it raises an IgnoreRequest exception, the process_exception() methods of installed downloader middleware will be called. If none of them handle the exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).


    request (Request object) – the request being processed
    spider (Spider object) – the spider for which this request is intended

process_response(request, response, spider)

    process_response() should either: return a Response object, return a Request object or raise an IgnoreRequest exception.

    If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.

    If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().

    If it raises an IgnoreRequest exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

        request (is a Request object) – the request that originated the response
        response (Response object) – the response being processed
        spider (Spider object) – the spider for which this response is intended (the response is returned to the corresponding spider in spiders)
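
    A minimal sketch of the two return values of process_response(): pass the response on, or swap it for a rescheduled request. The class name, the status codes, the retry cap, and the "block_retries" meta key are all assumptions made for this example; Scrapy already ships its own retry middleware, so treat this purely as an illustration.

```python
class RetryOnBlockMiddleware:
    """Hypothetical middleware: re-request pages that come back as 403/503."""

    RETRY_STATUSES = {403, 503}
    MAX_RETRIES = 2  # arbitrary cap chosen for this sketch

    def process_response(self, request, response, spider):
        retries = request.meta.get("block_retries", 0)  # made-up meta key
        if response.status in self.RETRY_STATUSES and retries < self.MAX_RETRIES:
            meta = dict(request.meta, block_retries=retries + 1)
            # dont_filter=True keeps the duplicate filter from dropping
            # a request whose URL has already been seen.
            return request.replace(meta=meta, dont_filter=True)
        # Anything else is handed to the next process_response() in the chain.
        return response
```

    Once the cap is reached the bad response is passed through unchanged, so the spider still gets to decide what to do with it.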

process_exception(request, exception, spider)

    Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception).

    process_exception() should return: either None, a Response object, or a Request object.

    If it returns None, Scrapy will continue processing this exception, executing any other process_exception() methods of installed middleware, until no middleware is left and the default exception handling kicks in.

    If it returns a Response object, the process_response() method chain of installed middleware is started, and Scrapy won’t bother calling any other process_exception() methods of middleware.

    If it returns a Request object, the returned request is rescheduled to be downloaded in the future. This stops the execution of process_exception() methods of the middleware the same as returning a response would.

        request (is a Request object) – the request that generated the exception
        exception (an Exception object) – the raised exception
        spider (Spider object) – the spider for which this request is intended
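
    For the unstable-proxy problem mentioned earlier in these notes, process_exception() is one place to recover. The sketch below returns a new request that uses a different proxy, and returns None once a cap is reached. The class name, the proxy addresses, the cap, and the "proxy_switches" meta key are all hypothetical.

```python
import random


class SwitchProxyOnErrorMiddleware:
    """Hypothetical middleware: retry a failed download via another proxy."""

    # Placeholder addresses; replace with proxies you actually control.
    PROXIES = [
        "http://127.0.0.1:8001",
        "http://127.0.0.1:8002",
        "http://127.0.0.1:8003",
    ]
    MAX_SWITCHES = 3  # arbitrary cap chosen for this sketch

    def process_exception(self, request, exception, spider):
        # A real middleware would probably inspect `exception` first;
        # this sketch treats every failure alike.
        switches = request.meta.get("proxy_switches", 0)  # made-up meta key
        if switches >= self.MAX_SWITCHES:
            # Give up: None lets other middlewares, then the default
            # exception handling, deal with the error.
            return None
        failed = request.meta.get("proxy")
        candidates = [p for p in self.PROXIES if p != failed]
        meta = dict(
            request.meta,
            proxy=random.choice(candidates),
            proxy_switches=switches + 1,
        )
        # Returning a Request stops further process_exception() calls and
        # reschedules the download with the new proxy.
        return request.replace(meta=meta, dont_filter=True)
```

    Returning a Request here keeps the retry inside Scrapy's own scheduling, which avoids fetching the page with a separate blocking HTTP client.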





