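Crawler Notes in Practice: Building and Using an IP Pool with requests and Scrapy (Including a Walkthrough of Scrapy's Proxy Middleware and How to Override It)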


PeryeLee


Article link: https://blog.csdn.net/weixin_43930363/article/details/104608643


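Obtaining free proxy IPs

In this part, I want to collect free IPs from a few mainstream proxy sites for personal use. Free IPs are less reliable than private proxies, so I want to check each proxy's availability after fetching it and save the usable ones locally. I also want to be able to refresh the IP list.

Required modules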

import requests
from lxml import etree

import time
import datetime
import random

import os
from pathlib import Path

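IP address detection

Below is the IPDetector class. Its methods save IP addresses to local files, and each file name records the site the IPs came from and the date they were fetched. (IPValidator is the validity-checking class, shown in the next section.)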

class IPDetector:
    """
    IP address detection class

    """

    @staticmethod
    def detector_of_xicidaili():

        # URL of the Xicidaili IP list page
        url = 'https://www.xicidaili.com/nn/'

        # Open the output file (the IP_pool_cache directory must already exist)
        fp = open(os.path.dirname(__file__) + '/IP_pool_cache/IP-xicidaili-' + str(datetime.date.today()) +
                  '.txt', 'w', encoding='utf-8')

        # Fetch the first 9 pages of IP addresses
        for i in range(1, 10):

            # Send the request
            with requests.get(url + str(i)) as response:

                # If the request failed, skip to the next page
                if response.status_code != 200:
                    continue

                # Parse the page into an element tree
                html = etree.HTML(response.content)

                # Iterate starting from the second tr element (the first is the header row)
                j = 2
                while True:

                    # Stop when no more rows can be found
                    if not html.xpath('//*[@id="ip_list"]/tr[%d]/td[2]' % j):
                        break

                    ip = html.xpath('//*[@id="ip_list"]/tr[%d]/td[2]/text()' % j)[0]

                    port = html.xpath('//*[@id="ip_list"]/tr[%d]/td[3]/text()' % j)[0]

                    # Check that the IP is usable before saving it
                    if IPValidator.validate(ip, port):
                        fp.write(ip + ':' + port)
                        fp.write('\n')

                    j += 1

        # Close the output file
        fp.close()

    @staticmethod
    def detector_of_kuaidaili():

        # URL of the Kuaidaili IP list page
        url = 'https://www.kuaidaili.com/free/inha/'

        # Open the output file (the IP_pool_cache directory must already exist)
        fp = open(os.path.dirname(__file__) + '/IP_pool_cache/IP-kuaidaili-' + str(datetime.date.today()) + '.txt', 'w',
                  encoding='utf-8')

        # Fetch the first 4 pages of IP addresses
        for i in range(1, 5):

            # Send the request
            with requests.get(url + str(i)) as response:

                # If the request failed, skip to the next page
                if response.status_code != 200:
                    continue

                html = etree.HTML(response.content)

                j = 1
                while True:

                    # Stop when row j does not exist
                    if not html.xpath('//div[@id="list"]//tbody/tr[%d]/td[1]' % j):
                        break

                    ip = html.xpath('//div[@id="list"]//tbody/tr[%d]/td[1]/text()' % j)[0]

                    port = html.xpath('//div[@id="list"]//tbody/tr[%d]/td[2]/text()' % j)[0]

                    if IPValidator.validate(ip, port):
                        fp.write(ip + ':' + port)
                        fp.write('\n')
                    j += 1

            # Pause to get past Kuaidaili's request-interval check
            time.sleep(random.randint(1, 5))

        # Close the output file
        fp.close()

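This code is fairly easy to follow. Beginners should note two things. First, when you copy an XPath from the browser, you may need to delete tbody from the path, because browsers often insert tbody into tables even when the page source has none. Second, Kuaidaili blocks requests that arrive too close together; a short sleep between requests is enough.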
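
The tbody point can be checked offline. The sketch below uses the standard library's xml.etree.ElementTree on a small hand-written table instead of lxml and a live page, so the markup and the function names are illustrative assumptions. It shows that a path containing tbody matches nothing when the source has no tbody element, while the path without it finds the rows.

```python
import xml.etree.ElementTree as ET

# Hand-written sample markup (an assumption, not a real proxy page)
SAMPLE = """
<html><body>
<table id="ip_list">
  <tr><th>Flag</th><th>IP</th><th>Port</th></tr>
  <tr><td></td><td>10.0.0.1</td><td>8080</td></tr>
  <tr><td></td><td>10.0.0.2</td><td>3128</td></tr>
</table>
</body></html>
""".strip()


def extract_proxies(markup):
    """Return (ip, port) pairs from the rows below the header row."""
    root = ET.fromstring(markup)
    rows = root.findall(".//table[@id='ip_list']/tr")
    proxies = []
    for row in rows[1:]:  # skip the header row
        cells = row.findall("td")
        proxies.append((cells[1].text, cells[2].text))
    return proxies


def count_rows_with_tbody(markup):
    """Count rows using a browser-style path that includes tbody."""
    root = ET.fromstring(markup)
    return len(root.findall(".//table[@id='ip_list']/tbody/tr"))
```

Iterating over the matched rows also avoids running a separate XPath query for every row index.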

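IP validity check

In this section I write the IPValidator class, which checks whether a proxy IP is usable. The principle is simple: request Baidu (or a URL of your choice) through the proxy and see whether the response has status code 200.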

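class IPValidator:
    """
    IP address validity-checking class

    """

    @staticmethod
    def validate(ip, port, domain='https://www.baidu.com'):
        """
        Takes the IP address and the port number.
        Pass domain to test against a specific URL; the default is Baidu.

        """

        ip_and_port = str(ip) + ":" + str(port)
        # requests picks the proxy by the scheme of the target URL, so both
        # schemes are mapped; with only an 'http' key, the https default URL
        # would be fetched without going through the proxy
        proxies = {'http': 'http://' + ip_and_port, 'https': 'http://' + ip_and_port}

        try:
            response = requests.get(domain, proxies=proxies, timeout=3)
            if response.status_code == 200:
                return True

        # Any request failure (timeout, refused connection, ...) marks the proxy as unusable
        except requests.RequestException:
            return False

        return False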

You can now call IPDetector.detector_of_xicidaili() to fetch the IPs that are usable today and save them locally.
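
One detail worth checking in a validator like this: requests selects a proxy by the scheme of the target URL, so a proxies dictionary that only has an 'http' key is not consulted for an https:// URL, and the request goes out directly. The sketch below is a simplified model of that lookup, not requests' actual code; build_proxies and pick_proxy are hypothetical helper names.

```python
from urllib.parse import urlparse


def build_proxies(ip, port):
    """Map both URL schemes to the same HTTP proxy address."""
    address = 'http://%s:%s' % (ip, port)
    return {'http': address, 'https': address}


def pick_proxy(url, proxies):
    """Simplified model: look the proxy up by the scheme of the target URL."""
    return proxies.get(urlparse(url).scheme)
```

With only an 'http' entry, pick_proxy('https://www.baidu.com', ...) finds no proxy, which is why a check against an https URL can pass even when the proxy itself is dead.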

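Reading from the local IP list

This part needs little knowledge of crawling; it is mostly file reading and writing. The IPGetter class provides four methods, which return an IP either as a 'host:port' string or in dictionary form.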

class IPGetter:

    @staticmethod
    def get_ip_list(agent_domain='xicidaili'):
        # agent_domain names the source site used in the cache file names;
        # the default 'xicidaili' is an assumption matching IPDetector above
        cache_dir = Path(os.path.dirname(__file__)) / 'IP_pool_cache'
        today = datetime.date.today()
        yesterday = today - datetime.timedelta(days=1)

        # Read from today's IP list if it exists
        try:
            fp = open(cache_dir / ('IP-' + str(agent_domain) + '-' + str(today) + '.txt'), 'r', encoding='utf-8')

        # Otherwise read from yesterday's IP list
        except IOError:
            fp = open(cache_dir / ('IP-' + str(agent_domain) + '-' + str(yesterday) + '.txt'), 'r',
                      encoding='utf-8')

        # Read the file into a list, dropping newlines and blank lines
        with fp:
            ip_list = [line.strip() for line in fp if line.strip()]

        # An empty list is unusable, so read from the alternate list instead
        if len(ip_list) == 0:
            with open(cache_dir / 'IP-alternate.txt', 'r', encoding='utf-8') as fp:
                ip_list = [line.strip() for line in fp if line.strip()]

        # Return the list of 'host:port' strings
        return ip_list

    @staticmethod
    def get_an_ip(agent_domain='xicidaili'):
        # Return one random 'host:port' string
        return random.choice(IPGetter.get_ip_list(agent_domain))

    @staticmethod
    def get_a_proxy(agent_domain='xicidaili'):
        # Map both schemes so the proxy is also used for https:// URLs
        address = 'http://' + IPGetter.get_an_ip(agent_domain)
        return {'http': address, 'https': address}

    @staticmethod
    def get_proxy_list(agent_domain='xicidaili'):
        return [{'http': 'http://' + i, 'https': 'http://' + i} for i in IPGetter.get_ip_list(agent_domain)]

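Because I write crawlers on different operating systems, this code uses the pathlib library, mainly to handle the differences in path formats between systems.
Now you only need to import this class and call its methods to use a proxy IP.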
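
The fallback between cache files can be tried out without any real proxy data. The following is a minimal sketch, assuming the file naming used above; pick_cache_file and demo are hypothetical names, and the order is slightly stricter than in IPGetter because an empty file for today is skipped in favour of yesterday's file before the alternate list is used.

```python
import datetime
import tempfile
from pathlib import Path


def pick_cache_file(cache_dir, agent_domain, today):
    """Return the first existing, non-empty list: today's, yesterday's, then the alternate."""
    yesterday = today - datetime.timedelta(days=1)
    candidates = [
        cache_dir / ('IP-%s-%s.txt' % (agent_domain, today)),
        cache_dir / ('IP-%s-%s.txt' % (agent_domain, yesterday)),
        cache_dir / 'IP-alternate.txt',
    ]
    for path in candidates:
        if path.is_file() and path.stat().st_size > 0:
            return path
    return None


def demo():
    """Build a throwaway cache that only has yesterday's list and the alternate list."""
    with tempfile.TemporaryDirectory() as tmp:
        cache_dir = Path(tmp)
        today = datetime.date(2020, 3, 2)
        (cache_dir / 'IP-xicidaili-2020-03-01.txt').write_text('10.0.0.1:8080\n', encoding='utf-8')
        (cache_dir / 'IP-alternate.txt').write_text('10.0.0.9:3128\n', encoding='utf-8')
        return pick_cache_file(cache_dir, 'xicidaili', today).name
```

Passing today as a parameter keeps the function deterministic, which makes the date fallback easy to test.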

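Usage example with requests

import requests

# ip_pool is a placeholder: use the name of the file that contains the classes above
from ip_pool import IPGetter

# The URL to request; Baidu is used here only as an example
domain = 'https://www.baidu.com'

response = requests.get(domain, proxies=IPGetter.get_a_proxy())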
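
Because free proxies fail often, a single request through one random proxy may not be enough. The sketch below shows one possible way to try several proxies in turn. It is not part of the original code: fetch_with_rotation is a hypothetical helper, and the fetch callable is injected so that the logic can be run without network access.

```python
def fetch_with_rotation(url, proxy_list, fetch, max_tries=3):
    """Try up to max_tries proxies in order and return the first successful result."""
    last_error = None
    for proxy in proxy_list[:max_tries]:
        try:
            return fetch(url, proxy)
        except Exception as error:  # a real caller would catch requests.RequestException
            last_error = error
    raise RuntimeError('all proxies failed') from last_error
```

With requests, the callable could be lambda url, proxy: requests.get(url, proxies=proxy, timeout=3), and the list could come from IPGetter.get_proxy_list().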

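Usage guide for Scrapy

In Scrapy, we handle proxies with middleware. First, let's look at Scrapy's proxy middleware, HttpProxyMiddleware.
Scrapy's built-in HttpProxyMiddleware supports using a proxy IP by setting the three environment variables http_proxy, https_proxy and no_proxy. But when disguising a crawler we want a different IP for every request, which is hard to do this way.
So we use the method described in the last paragraph of the HttpProxyMiddleware documentation and set the proxy meta key on the spider's requests.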

# The proxy meta value must be a URL string, not a dictionary
yield Request(url=page, callback=self.parse, meta={"proxy": 'http://' + IPGetter.get_an_ip()})

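The IPGetter here is the one written above.
With this approach, however, every place that yields a request has to be changed in the same way. So we write a custom proxy middleware instead: open middlewares.py and create it.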

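# IPGetter must be imported at the top of middlewares.py
class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # Give every outgoing request a randomly chosen proxy
        request.meta['proxy'] = 'http://' + IPGetter.get_an_ip()

    def spider_opened(self, spider):
        # Only runs if it is connected to the spider_opened signal (e.g. in from_crawler)
        spider.logger.info('Spider opened: %s' % spider.name)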

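Open settings.py, enable the custom proxy middleware, disable the built-in one, and set the custom middleware's priority to that of the built-in proxy middleware. The priorities of the built-in middlewares are listed in the Scrapy documentation.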

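DOWNLOADER_MIDDLEWARES = {
    # Disable the default proxy middleware and replace it with our own
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    # Replace project_name with your Scrapy project's package name;
    # 750 is HttpProxyMiddleware's order in DOWNLOADER_MIDDLEWARES_BASE
    'project_name.middlewares.ProxyMiddleware': 750,
}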

Now every request in Scrapy can use a different IP.
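
Reading the cache file on every request is simple but repeats the file I/O each time. As a variation, the sketch below loads the proxy list once when the middleware is created. It is only an illustration under stated assumptions: RandomProxyMiddleware is a hypothetical name, PROXY_LIST is a hypothetical custom setting rather than a built-in Scrapy one, and demo uses a stand-in object in place of a real Scrapy request.

```python
import random
from types import SimpleNamespace


class RandomProxyMiddleware:
    """Downloader middleware sketch: assign a random proxy to each request."""

    def __init__(self, proxies):
        # proxies is a list of 'host:port' strings
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting defined in settings.py
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = 'http://' + random.choice(self.proxies)
        # Returning None lets Scrapy continue processing the request
        return None


def demo():
    # Stand-in request object with only the attribute this sketch touches
    request = SimpleNamespace(meta={})
    RandomProxyMiddleware(['10.0.0.1:8080']).process_request(request, spider=None)
    return request.meta['proxy']
```

The trade-off is freshness: a list loaded at start-up does not pick up IPs written to the cache later in the crawl.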

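Additional notes on Scrapy's proxy middleware

The above is the approach I personally find most convenient. But having written this much, we may as well dig a little further: if you are a perfectionist and want to use Scrapy's built-in proxy middleware, how would you do it? Along the way we can read through the source code together. If you would rather skip the source analysis, jump straight to the conclusion at the end.

Let's look at the scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware class.
First, the constructor.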

    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = {}
        for type_, url in getproxies().items():
            self.proxies[type_] = self._get_proxy(url, type_)

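Note the proxies attribute: it is a dictionary, initialized by the loop that follows. The loop calls getproxies(), which comes from the urllib.request module and reads the environment through getproxies_environment(), shown below.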

# Proxy handling
def getproxies_environment():
    """Return a dictionary of scheme -> proxy server URL mappings.

    Scan the environment for variables named <scheme>_proxy;
    this seems to be the standard convention.  If you need a
    different way, you can pass a proxies dictionary to the
    [Fancy]URLopener constructor.

    """
    proxies = {}
    # in order to prefer lowercase variables, process environment in
    # two passes: first matches any, second pass matches lowercase only
    for name, value in os.environ.items():
        name = name.lower()
        if value and name[-6:] == '_proxy':
            proxies[name[:-6]] = value
    # CVE-2016-1000110 - If we are running as CGI script, forget HTTP_PROXY
    # (non-all-lowercase) as it may be set from the web server by a "Proxy:"
    # header from the client
    # If "proxy" is lowercase, it will still be used thanks to the next block
    if 'REQUEST_METHOD' in os.environ:
        proxies.pop('http', None)
    for name, value in os.environ.items():
        if name[-6:] == '_proxy':
            name = name.lower()
            if value:
                proxies[name[:-6]] = value
            else:
                proxies.pop(name[:-6], None)
    return proxies

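As you can see, this function scans the environment variables for names whose last six characters are _proxy (case-insensitive, with lowercase names taking precedence when both exist) and whose value is non-empty. For each match, the name without its last six characters becomes the dictionary key and the variable's value becomes the dictionary value; the dictionary is then returned.

For example, suppose the environment contains the following key-value pairs:
http_proxy:0.0.0.0:0000, https_proxy:1.1.1.1:1111, aa:2.2.2.2:2222.

getproxies_environment() then returns the following dictionary:
{'http': '0.0.0.0:0000', 'https': '1.1.1.1:1111'}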
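
This example can be reproduced directly with the standard library. The sketch below temporarily replaces the process environment with the three variables from the example and calls urllib.request.getproxies_environment(); demo_getproxies is a hypothetical helper name.

```python
import os
from unittest import mock
from urllib.request import getproxies_environment


def demo_getproxies():
    """Run getproxies_environment() against a temporary, fully replaced environment."""
    fake_env = {
        'http_proxy': '0.0.0.0:0000',
        'https_proxy': '1.1.1.1:1111',
        'aa': '2.2.2.2:2222',  # ignored: the name does not end in _proxy
    }
    # clear=True hides the real environment; it is restored when the block exits
    with mock.patch.dict(os.environ, fake_env, clear=True):
        return getproxies_environment()
```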

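Back in the constructor of scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware, we now know that the loop parses each environment variable it read with _get_proxy(url, type_) into a proxy type and address pair and stores the result in self.proxies.
Next comes process_request(). It first checks whether the request's meta contains a proxy key; our custom proxy middleware relies on exactly this meta key. If meta contains proxy, the request is sent through that proxy. If not, the proxy found in the environment variables is used.
Now look at the _set_proxy() method.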

    def _set_proxy(self, request, scheme):
        creds, proxy = self.proxies[scheme]
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds

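As it turns out, Scrapy itself ends up setting the proxy IP through the meta key, haha.

So to handle proxies with the built-in middleware, set http_proxy in the environment. Note that the constructor shown above reads the environment only once, when the middleware is created, so the variable has to be set before the crawl starts, and changing it between requests does not switch the proxy of a middleware that is already running.

import os

# Placeholder value: use a real proxy address such as 'http://host:port'
os.environ['http_proxy'] = 'proxy IP address'

...Doesn't it feel like you might as well just set the meta key on the request directly?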
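
One caveat, going by the constructor shown earlier: the environment is read when the middleware is created, so a value changed afterwards is not picked up. The sketch below illustrates this with a simplified stand-in class, not Scrapy's own code; SnapshotProxies and demo_snapshot are hypothetical names.

```python
import os
from unittest import mock
from urllib.request import getproxies_environment


class SnapshotProxies:
    """Simplified stand-in for a middleware that reads proxies once, in __init__."""

    def __init__(self):
        self.proxies = dict(getproxies_environment())


def demo_snapshot():
    """Return (proxy seen by the snapshot, proxy currently in the environment)."""
    with mock.patch.dict(os.environ, {'http_proxy': 'http://0.0.0.0:1000'}, clear=True):
        snapshot = SnapshotProxies()
        # Changing the variable later does not affect the stored copy
        os.environ['http_proxy'] = 'http://1.1.1.1:1111'
        return snapshot.proxies['http'], getproxies_environment()['http']
```

This is the practical reason per-request rotation is done through the meta key rather than through environment variables.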
