Scrapy Middleware: Setting the User-Agent and Proxies

This article walks through how Scrapy's downloader middleware works, including the execution order of process_request() and process_response(), how to set a static or randomized User-Agent, how to use free and paid proxy IPs, and how to drive Selenium from a middleware.


This article focuses on Scrapy middleware and on understanding how the middleware pipeline processes requests and responses.

Downloader middleware

Downloader middleware sits between the engine and the downloader. It is the place to set the User-Agent, cookies, and proxies, and it is also where Selenium can be plugged in.
To use a downloader middleware, first enable it in settings.py.
As with item pipelines, a smaller priority value runs earlier.

DOWNLOADER_MIDDLEWARES = {
    "Mid.middlewares.MidDownloaderMiddleware": 543,
}

process_request(): where you set the User-Agent, cookies, and proxies. Across a stack of middlewares, this method runs in ascending priority order (smaller values first).

'''Called automatically before the engine hands the request to the downloader.
        :param request: the current request
        :param spider: the spider that issued this request
        :return: constrained; process_request() may only return one of:
                1. None (or no return at all): no interception; the request continues through the remaining middlewares. (There is a whole stack of middlewares between the engine and the downloader; if nothing intercepts, the request passes through them in priority order and only then reaches the downloader.)
                2. A Request: the remaining middlewares are skipped; the request goes back to the engine, which hands it to the scheduler again. The downloader never sees the URL.
                3. A Response: the remaining middlewares are skipped; the response goes to the engine, which passes it to the spider for parsing. (That is, the request never reaches the downloader; the middleware that returned the response answers it directly, and the spider then processes that response.)
                '''
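As a quick illustration of case 3, here is a minimal sketch (the in-memory cache is hypothetical, not part of this project) of a middleware that answers a request itself by returning a ready-made response:

from scrapy.http import HtmlResponse


class FakeCacheMiddleware:
    # hypothetical in-memory cache: url -> html text
    CACHE = {}

    def process_request(self, request, spider):
        cached = self.CACHE.get(request.url)
        if cached is not None:
            # case 3: the engine hands this response straight to the spider;
            # later middlewares and the downloader never see the request
            return HtmlResponse(url=request.url, body=cached, encoding='utf-8')
        return None  # case 1: let the request continue downstream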

process_response(): the hook where responses travel from the downloader back to the engine. Across a stack of middlewares, this method runs in descending priority order (larger values first).

return response: no interception; the response keeps moving forward toward the engine.
return request: the response is intercepted; the returned request goes back to the scheduler (via the engine), and the remaining process_response() methods never receive the response.
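For example, a sketch (not from the original code) of the "return request" case: a naive retry middleware that discards server-error responses and reschedules the request; dont_filter=True keeps the scheduler's duplicate filter from dropping the retry:

class NaiveRetryMiddleware:
    def process_response(self, request, response, spider):
        if response.status >= 500:
            # intercept: drop this response and reschedule the request
            return request.replace(dont_filter=True)
        return response  # normal case: pass the response along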
The demo middleware (with the import it needs to run):

from scrapy import signals


class MidDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):  # set the User-Agent, cookies, etc. here
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        
        print('ware: process_request')
        return None

    def process_response(self, request, response, spider):  # the hook on the way back from the downloader to the engine
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        '''
        :param request:
        :param response:
        :param spider:
        :return:
               return response: pass the response on, via the engine, to the next component / the next process_response(); no interception.
               return request: the response is intercepted; the returned request goes back to the scheduler (via the engine), and the remaining process_response() methods never receive the response.'''
        print('ware: process_response')
        return response

    def process_exception(self, request, exception, spider):  # called automatically when an error occurs while handling the request
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):  # runs once, when the spider opens
        print('ware: spider_opened')
        #spider.logger.info("Spider opened: %s" % spider.name)


Output:

ware: spider_opened
ware: process_request
ware: process_response
百度一下,你就知道

What is the execution order when the file contains multiple downloader middlewares?

As before, enable them in settings.py:

DOWNLOADER_MIDDLEWARES = {
    "Mid.middlewares.MidDownloaderMiddleware1": 543,
"Mid.middlewares.MidDownloaderMiddleware2": 544,
}

The two downloader middlewares:

# Downloader middlewares sit between the downloader and the engine: set the User-Agent, cookies, etc.
from scrapy import signals


class MidDownloaderMiddleware1:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):  # set the User-Agent, cookies, etc. here
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        
        print('ware1: process_request')
        return None

    def process_response(self, request, response, spider):  # the hook on the way back from the downloader to the engine
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
       
        print('ware1: process_response')
        return response

    def process_exception(self, request, exception, spider):  # called automatically when an error occurs while handling the request
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):  # runs once, when the spider opens
        print('ware1: spider_opened')
        #spider.logger.info("Spider opened: %s" % spider.name)


class MidDownloaderMiddleware2:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):  # set the User-Agent, cookies, etc. here
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
      
        print('ware2: process_request')
        return None

    def process_response(self, request, response, spider):  # the hook on the way back from the downloader to the engine
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
      
        print('ware2: process_response')
        return response

    def process_exception(self, request, exception, spider):  # called automatically when an error occurs while handling the request
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):  # runs once, when the spider opens
        print('ware2: spider_opened')
        # spider.logger.info("Spider opened: %s" % spider.name)

Output:

ware1: spider_opened
ware2: spider_opened
ware1: process_request
ware2: process_request
ware2: process_response
ware1: process_response
百度一下,你就知道

Summary: process_request() runs in ascending priority order (smaller values first); process_response() runs in descending priority order (larger values first).

Spider middleware sits between the spider and the engine. (Not covered in detail here.)


class MidSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)

Setting the User-Agent

There are two approaches.

First, a fixed User-Agent: set USER_AGENT once in settings.py.

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"

Second, a randomized User-Agent: add a list of User-Agent strings to settings.py and pick one per request in a downloader middleware.

settings.py

USER_AGENT_list=['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36', # 2021.10
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36', # 2021.11
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36', # 2021.12
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36', # 2022.01
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.81 Safari/537.36', # 2022.02
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36', # 2022.03
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36', # 2022.04
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36', # 2022.05
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36', # 2022.06
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.66 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36', # 2022.07
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36', # 2022.08
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.54 Safari/537.36', # 2022.09
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.127 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.5249.91 Safari/537.36', # 2022.10
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.5249.103 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.5249.119 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.63 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.88 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.106 Safari/537.36', # 2022.11
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.107 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.122 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.72 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.95 Safari/537.36', # 2022.12
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.99 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.125 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5414.75 Safari/537.36', # 2023.01
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5414.120 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.78 Safari/537.36', # 2023.02
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.104 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.105 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.178 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.180 Safari/537.36', # 2023.03
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.64 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.65 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.111 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.112 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.147 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.50 Safari/537.36', # 2023.04
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.121 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.138 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.5672.64 Safari/537.36', # 2023.05
]

The chosen User-Agent has to be injected into each request's headers, so this belongs in a downloader middleware.
Enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
   "douban.middlewares.DoubanDownloaderMiddleware": 543,
}

In the downloader middleware, only process_request() is needed:


from random import choice

from scrapy import signals

from douban.settings import USER_AGENT_list  # the list defined above


class DoubanDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # pick a random User-Agent
        ua = choice(USER_AGENT_list)
        # put it into the request headers
        request.headers['User-Agent'] = ua
        return None  # return None so the request keeps going; returning a Request or Response here would intercept it

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)

Proxy IPs

Free proxy IPs

Free proxies have poor availability and are slow, so they are not recommended, but they are fine for experiments.
A proxy listing site: https://www.kuaidaili.com

settings.py

PROXY_IP_LIST = [
    # your list of proxy IPs
]
DOWNLOADER_MIDDLEWARES = {
   "douban.middlewares.DoubanDownloaderMiddleware": 543,
    "douban.middlewares.ProxyDoubanDownloaderMiddleware": 544,
}

The downloader middleware. Free proxies generally have a low success rate.

class ProxyDoubanDownloaderMiddleware:
    # assumes PROXY_IP_LIST is imported from settings, like USER_AGENT_list above

    def process_request(self, request, spider):
        # pick a random proxy IP
        ip = choice(PROXY_IP_LIST)
        # attach it to the request's meta
        request.meta['proxy'] = 'https://' + ip
        return None  # return None so the request keeps going to the downloader
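Because free proxies fail so often, one option (a sketch, not from the original code) is to also implement process_exception() in the same middleware and retry the request through a different proxy when the download fails:

    def process_exception(self, request, exception, spider):
        # returning a Request stops the process_exception() chain and sends
        # the re-proxied request back to the scheduler via the engine
        request.meta['proxy'] = 'https://' + choice(PROXY_IP_LIST)
        return request.replace(dont_filter=True)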

Paid proxy IPs: a proxy service.

https://www.kuaidaili.com/tps is a tunnel proxy service. Its site provides documentation and code samples. From your account page you need the tunnel host, the port, a username, and a password; the vendor's sample code can then go straight into a downloader middleware.

class MoneyProxyDoubanDownloaderMiddleware:
    _proxy = ('XXX.XXX.com', '15818')

    def process_request(self, request, spider):
        # username/password authentication
        username = "username"
        password = "password"
        request.meta['proxy'] = "http://%(user)s:%(pwd)s@%(proxy)s/" % {
            "user": username,
            "pwd": password,
            "proxy": ':'.join(MoneyProxyDoubanDownloaderMiddleware._proxy),
        }

        # IP-whitelist authentication instead:
        # request.meta['proxy'] = "http://%(proxy)s/" % {"proxy": proxy}

        request.headers["Connection"] = "close"
        return None

Using Selenium in a middleware

Since Selenium is going to replace the downloader here, the built-in downloader middlewares no longer matter for these requests. Scrapy's built-in downloader middlewares start at priority 100, so the Selenium middleware must be registered below 100 to run before all of them:

DOWNLOADER_MIDDLEWARES = {
    "boss.middlewares.BossSeleniumDownloaderMiddleware": 99,
}

Design steps (note: the spiders folder may hold several spider files, so the middleware must be able to tell which requests should go through Selenium; that means supporting two kinds of requests):

1. Create a request.py module in the project and define a SeleniumRequest class that subclasses scrapy.Request, so SeleniumRequest behaves exactly like Request and only serves as a marker.
2. Override start_requests(self) in the spider to yield SeleniumRequest objects.
3. Check the request type in the middleware's process_request().
4. Start Selenium when the program starts, i.e. in spider_opened().
5. Perform the actual fetch inside step 3.
6. Wrap the page source into a response object.

A sketch of steps 1 and 2 follows; the middleware below then implements steps 3 through 6.
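A minimal sketch of steps 1 and 2 (the module path and the placeholder URL are illustrative assumptions, not taken from the original project):

# request.py -- step 1: a marker subclass, functionally identical to Request
from scrapy import Request


class SeleniumRequest(Request):
    pass


# in the spider -- step 2: emit SeleniumRequest so the middleware can
# distinguish Selenium-bound requests from ordinary ones
import scrapy

from boss.request import SeleniumRequest


class BossSpider(scrapy.Spider):
    name = 'boss'

    def start_requests(self):
        yield SeleniumRequest(url='https://example.com/', callback=self.parse)  # placeholder URL

    def parse(self, response):
        pass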
import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

from boss.request import SeleniumRequest  # the marker class from step 1


class BossSeleniumDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def process_request(self, request, spider):
        # every request passes through here; Selenium-bound requests get
        # fetched by the browser and answered with a response built from
        # the rendered page source (steps 3, 5 and 6 above)
        if isinstance(request, SeleniumRequest):  # isinstance() checks whether this is our marker type
            # process_request() may only return None, a Request, or a Response,
            # so the Selenium result has to be wrapped in a response object
            self.browser.get(request.url)
            time.sleep(2)  # crude wait for the page to render before grabbing the source
            page_source = self.browser.page_source

            return HtmlResponse(url=request.url, status=200, body=page_source, request=request, encoding='utf-8')
        else:
            return None

    def spider_opened(self, spider):
        # step 4: start the browser once, when the spider opens
        self.options = webdriver.ChromeOptions()
        self.browser = webdriver.Chrome(options=self.options)  # chrome_options= is deprecated in newer Selenium

    def spider_closed(self, spider):
        self.browser.quit()  # quit() shuts the driver down fully; close() would only close the current window

How should the returned HtmlResponse be constructed? Check the source. HtmlResponse itself defines almost nothing:

"""
This module implements the HtmlResponse class which adds encoding
discovering through HTML encoding declarations to the TextResponse class.

See documentation in docs/topics/request-response.rst
"""

from scrapy.http.response.text import TextResponse


class HtmlResponse(TextResponse):
    pass

Looking at TextResponse:

class TextResponse(Response):
    def __init__(self, *args, **kwargs):
        self._encoding = kwargs.pop("encoding", None)
        self._cached_benc = None
        self._cached_ubody = None
        self._cached_selector = None
        super().__init__(*args, **kwargs)

Apart from the encoding handling there is not much here either. Looking one level up, at TextResponse's base class Response:

    def __init__(
        self,
        url: str,
        status=200,
        headers=None,
        body=b"",
        flags=None,
        request=None,
        certificate=None,
        ip_address=None,
        protocol=None,
    ):
        self.headers = Headers(headers or {})
        self.status = int(status)
        self._set_body(body)
        self._set_url(url)
        self.request = request
        self.flags = [] if flags is None else list(flags)
        self.certificate = certificate
        self.ip_address = ip_address
        self.protocol = protocol

From this signature it is clear how to build the HtmlResponse:

HtmlResponse(url=request.url, status=200, body=page_source, request=request, encoding='utf-8')