Scrapy Crawlers: Downloader Middleware (Anti-Scraping: Random Request Headers, IP Proxy Pool)

1. Configuring a Random User-Agent in the Downloader Middleware

Downloader middleware

A downloader middleware implements two hook methods: process_request and process_response.
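These hook names and their return-value rules are Scrapy's actual API; the skeleton class below is only an illustrative sketch:

class SketchDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called for every request on its way to the downloader.
        # Return None to continue the chain, a Response to skip the
        # download entirely, or a Request to reschedule.
        return None

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the spider.
        # Must return a Response (or a Request to re-download).
        return response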

A site that echoes the User-Agent of the current request: http://httpbin.org/user-agent
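A request to that endpoint returns a small JSON document, for example (value shortened here):

{
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."
}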

A catalog of User-Agent strings from browsers worldwide: http://www.useragentstring.com/pages/useragentstring.php?typ=Browser

Modify the generated UseragentrandomDownloaderMiddleware in middlewares.py:

import random


class UseragentrandomDownloaderMiddleware:
    # Pool of User-Agent strings to rotate through
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2919.83 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2866.71 Safari/537.36',
        'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2820.59 Safari/537.36'
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
        return None  # continue normal processing

Enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'userAgentRandom.middlewares.UseragentrandomDownloaderMiddleware': 543,
}
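Note that Scrapy's built-in UserAgentMiddleware sits at priority 400 and sets the header from the USER_AGENT setting; since process_request hooks run in ascending priority order, our middleware at 543 runs later and overwrites it. If you prefer, the built-in one can also be disabled outright; this variant is my own suggestion, not from the original post:

DOWNLOADER_MIDDLEWARES = {
   'userAgentRandom.middlewares.UseragentrandomDownloaderMiddleware': 543,
   # Optional: silence the built-in middleware so only the random one
   # ever touches the User-Agent header
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}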

In the spider, issue requests in a loop and print the User-Agent reported by each response:

import json

import scrapy


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # httpbin echoes the request's User-Agent as JSON
        user_agent = json.loads(response.text)['user-agent']
        print(user_agent)
        # Request the same URL again; dont_filter=True bypasses
        # the duplicate-request filter
        yield scrapy.Request(self.start_urls[0], dont_filter=True)

Run it: each printed User-Agent is drawn at random from USER_AGENTS and changes between requests.

2. Configuring an IP Proxy Pool in the Downloader Middleware

import random


class IPProxyrandomDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    PROXIES = [
        "183.166.138.181:4216",
        "183.166.162.157:4216",
        "112.194.112.175:8118"
    ]
    # Scrapy expects a full URL with a scheme; a bare host:port raises an error
    PROXIES = ['https://' + p for p in PROXIES]

    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy
        proxy = random.choice(self.PROXIES)
        request.meta['proxy'] = proxy
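Paid proxies usually require authentication. Scrapy's built-in HttpProxyMiddleware reads credentials embedded in the proxy URL and adds the Proxy-Authorization header for you; the host and credentials below are placeholders, not working values:

def process_request(self, request, spider):
    # user:password@host:port — HttpProxyMiddleware extracts the
    # credentials and sets the Proxy-Authorization header
    request.meta['proxy'] = 'http://user:password@1.2.3.4:8080'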

In the spider, query an IP-lookup site and print the current outbound IP:

import json  # needed only for the httpbin.org/ip variant below

import scrapy


class IpproxySpider(scrapy.Spider):
    name = 'ipproxy'
    allowed_domains = ['ip.cn', 'httpbin.org']
    # start_urls = ['http://httpbin.org/ip']
    start_urls = ['https://www.ip.cn/']

    def parse(self, response):
        print('running parse() in ipproxy.py')
        # ip.cn shows the visitor's IP inside a <code> element
        ip = response.xpath('//div[@class="well"]/p[1]/code/text()').get()
        # For httpbin.org/ip instead: ip = json.loads(response.text)['origin']
        print(ip)
        # Loop: request the same page again, bypassing the dupe filter
        yield scrapy.Request(self.start_urls[0], dont_filter=True)

Run it! The quality of free proxy IPs, however, leaves much to be desired...
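One way to cope is to drop proxies that throw download errors and retry through another one. The process_exception hook is Scrapy's real API; the pool-pruning logic and the class name below are my own sketch, not from the original middleware:

import random


class PruningProxyDownloaderMiddleware:
    # Same pool shape as above; a real pool would be refreshed regularly
    PROXIES = [
        'https://183.166.138.181:4216',
        'https://112.194.112.175:8118',
    ]

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.PROXIES)

    def process_exception(self, request, exception, spider):
        # Drop the proxy that just failed (but keep at least one entry)
        bad = request.meta.get('proxy')
        if bad in self.PROXIES and len(self.PROXIES) > 1:
            self.PROXIES.remove(bad)
            spider.logger.warning('dropped dead proxy %s', bad)
        # Returning a Request reschedules it; on the next pass
        # process_request picks a different proxy
        return request.replace(dont_filter=True)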
