Using Scrapy Middleware

Downloader middleware (MiddleproDownloaderMiddleware)

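  • Position: between the engine and the downloader
  • Role: intercepts every request and response in the whole project
  • Intercepting requests:
    • User-Agent (UA) spoofing
    • IP proxying
  • Intercepting responses:
    • Modifying the response data, or replacing the response object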
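A downloader middleware only intercepts traffic once it is enabled in the project's settings.py. The fragment below is a minimal sketch: the package name middlePro is an assumption inferred from the class name, so substitute your own project's dotted path.

```python
# settings.py (fragment)
# The dotted path assumes a project package called "middlePro"; adjust it
# to match your own project. 543 is the order value that Scrapy's generated
# project template uses for this entry.
DOWNLOADER_MIDDLEWARES = {
    "middlePro.middlewares.MiddleproDownloaderMiddleware": 543,
}
```

The integer controls where this middleware sits relative to the others: lower values are closer to the engine, higher values are closer to the downloader.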
[middlewares.py] The MiddleproDownloaderMiddleware class has three important methods:
import random
from fake_useragent import UserAgent

class MiddleproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    USER_AGENT_LIST = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055'
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508'
    ]
  • process_request() intercepts requests

    1. Use a UA pool (not recommended)

          def process_request(self, request, spider):
              # Called for each request that goes through the downloader
              # middleware.
      
              # Must either:
              # - return None: continue processing this request
              # - or return a Response object
              # - or return a Request object
              # - or raise IgnoreRequest: process_exception() methods of
              #   installed downloader middleware will be called
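              """
              Intercept each outgoing request.
              :param request: the request being processed
              :param spider: the spider that issued the request
              :return: None, so the request continues through the chain
              """
              # UA spoofing: pick a random entry from the pool
              request.headers['User-Agent'] = random.choice(self.USER_AGENT_LIST)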
      
              return None
      
    2. Use the fake-useragent module (recommended)

      Install the module: pip install fake-useragent

          def process_request(self, request, spider):
              # Called for each request that goes through the downloader
              # middleware.

              # Must either:
              # - return None: continue processing this request
              # - or return a Response object
              # - or return a Request object
              # - or raise IgnoreRequest: process_exception() methods of
              #   installed downloader middleware will be called
              """
              Intercept each outgoing request.
              :param request: the request being processed
              :param spider: the spider that issued the request
              :return: None, so the request continues through the chain
              """
              # UA spoofing: let fake-useragent supply a random User-Agent
              request.headers['User-Agent'] = UserAgent().random

              return None
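The two approaches above can also be combined. The following is a sketch rather than the tutorial's own code: RandomUserAgentMiddleware and FALLBACK_USER_AGENTS are hypothetical names, and the design assumes you would rather build the UserAgent object once and fall back to a fixed pool if fake-useragent is missing or fails.

```python
import random

try:
    # Third-party package: pip install fake-useragent
    from fake_useragent import UserAgent
except ImportError:
    UserAgent = None

# Hypothetical, shortened fallback pool; in the tutorial this role is
# played by USER_AGENT_LIST.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
]


class RandomUserAgentMiddleware:
    """Hypothetical middleware: fake-useragent first, fixed pool as backup."""

    def __init__(self):
        # Build the UserAgent object once instead of once per request.
        self._ua = None
        if UserAgent is not None:
            try:
                self._ua = UserAgent()
            except Exception:
                self._ua = None

    def _pick_user_agent(self):
        if self._ua is not None:
            try:
                return self._ua.random
            except Exception:
                pass  # fall through to the fixed pool
        return random.choice(FALLBACK_USER_AGENTS)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self._pick_user_agent()
        # None tells Scrapy to keep processing this request as usual.
        return None
```

Creating UserAgent() inside process_request, as in the snippet above, works but rebuilds the object for every request; holding one instance on the middleware avoids that cost.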
      
  • process_response() intercepts all responses
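No code is given for process_response() here, so the following is only an illustrative sketch of "modifying the response data". BomStrippingMiddleware is a hypothetical class, and stripping a UTF-8 byte-order mark is just a stand-in for whatever change your project needs. It relies on response.body being bytes and on Response.replace() returning a copy with the given fields changed.

```python
UTF8_BOM = b"\xef\xbb\xbf"


class BomStrippingMiddleware:
    """Hypothetical middleware that edits the response body."""

    def process_response(self, request, response, spider):
        # Must either return a Response object, return a Request object,
        # or raise IgnoreRequest. Returning None is not allowed here.
        if response.body.startswith(UTF8_BOM):
            # replace() builds a new response; the original is not mutated.
            return response.replace(body=response.body[len(UTF8_BOM):])
        # Pass every other response through unchanged.
        return response
```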

  • process_exception() intercepts requests that raised an exception

    • Proxy IP

      PROXY_http = [
          '153.180.102.104:80',
          '195.208.131.189:56055'
      ]
      PROXY_https = [
          '120.83.49.90:9000',
          '95.189.112.214:35508'
      ]
        
          def process_exception(self, request, exception, spider):
              # Called when a download handler or a process_request()
              # (from other downloader middleware) raises an exception.
      
              # Must either:
              # - return None: continue processing this exception
              # - return a Response object: stops process_exception() chain
              # - return a Request object: stops process_exception() chain
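              """
              Intercept a request whose download raised an exception.
              :param request: the request that failed
              :param exception: the exception that was raised
              :param spider: the spider that issued the request
              :return: the corrected request, which is sent again
              """
              # Proxy IP: choose a proxy that matches the URL scheme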
              if request.url.split(':')[0] == 'http':
                  request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
              else:
                  request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)
      
              # Return the corrected request object so that it is sent again
              return request
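One thing to watch with this pattern is that a request which keeps failing is re-sent indefinitely, and a re-sent request can be dropped by Scrapy's duplicate filter unless dont_filter is set. The sketch below adds both safeguards. ProxyRetryMiddleware, MAX_PROXY_RETRIES and the proxy_retry_times meta key are hypothetical names chosen for this illustration, and the proxy addresses are simply the sample ones from above.

```python
import random
from urllib.parse import urlparse

PROXY_http = ['153.180.102.104:80', '195.208.131.189:56055']
PROXY_https = ['120.83.49.90:9000', '95.189.112.214:35508']

# Hypothetical cap on how many times one request may be re-sent.
MAX_PROXY_RETRIES = 3


class ProxyRetryMiddleware:
    """Hypothetical variant of process_exception() with a retry limit."""

    def process_exception(self, request, exception, spider):
        retries = request.meta.get("proxy_retry_times", 0)
        if retries >= MAX_PROXY_RETRIES:
            # Give up: None lets other middleware handle the exception.
            return None
        request.meta["proxy_retry_times"] = retries + 1

        # Same rule as above: plain http gets an http proxy, anything
        # else gets an https proxy.
        if urlparse(request.url).scheme == "http":
            request.meta["proxy"] = "http://" + random.choice(PROXY_http)
        else:
            request.meta["proxy"] = "https://" + random.choice(PROXY_https)

        # Let the re-sent request bypass the duplicate filter.
        request.dont_filter = True
        return request
```

Using urlparse() instead of splitting on ':' is a small robustness choice; the selection logic is otherwise unchanged.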
      