Scrapy Crawlers: Downloader Middleware (Anti-Scraping: Random Request Headers, IP Proxy Pool)

1. Configuring a Random User-Agent in the Downloader Middleware

Downloader middleware

A downloader middleware implements two hook methods: process_request and process_response.
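These hook names and their return-value rules are Scrapy's actual API; the skeleton class below is only an illustrative sketch:

class SketchDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called for every request on its way to the downloader.
        # Return None to continue the chain, a Response to skip the
        # download entirely, or a Request to reschedule.
        return None

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the spider.
        # Must return a Response (or a Request to re-download).
        return response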

A site that echoes the User-Agent of the current request: http://httpbin.org/user-agent
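A request to that endpoint returns a small JSON document, for example (value shortened here):

{
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."
}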

A catalog of User-Agent strings from browsers worldwide: http://www.useragentstring.com/pages/useragentstring.php?typ=Browser

Modify the generated UseragentrandomDownloaderMiddleware in middlewares.py:

import random


class UseragentrandomDownloaderMiddleware:
    # Pool of User-Agent strings to rotate through
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2919.83 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2866.71 Safari/537.36',
        'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2820.59 Safari/537.36'
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
        return None  # continue normal processing

Enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'userAgentRandom.middlewares.UseragentrandomDownloaderMiddleware': 543,
}
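Note that Scrapy's built-in UserAgentMiddleware sits at priority 400 and sets the header from the USER_AGENT setting; since process_request hooks run in ascending priority order, our middleware at 543 runs later and overwrites it. If you prefer, the built-in one can also be disabled outright; this variant is my own suggestion, not from the original post:

DOWNLOADER_MIDDLEWARES = {
   'userAgentRandom.middlewares.UseragentrandomDownloaderMiddleware': 543,
   # Optional: silence the built-in middleware so only the random one
   # ever touches the User-Agent header
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}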

In the spider, issue requests in a loop and print the User-Agent reported by each response:

import json

import scrapy


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # httpbin echoes the request's User-Agent as JSON
        user_agent = json.loads(response.text)['user-agent']
        print(user_agent)
        # Request the same URL again; dont_filter=True bypasses
        # the duplicate-request filter
        yield scrapy.Request(self.start_urls[0], dont_filter=True)

Run it: each printed User-Agent is drawn at random from USER_AGENTS and changes between requests.

2. Configuring an IP Proxy Pool in the Downloader Middleware

import random


class IPProxyrandomDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    PROXIES = [
        "183.166.138.181:4216",
        "183.166.162.157:4216",
        "112.194.112.175:8118"
    ]
    # Scrapy expects a full URL with a scheme; a bare host:port raises an error
    PROXIES = ['https://' + p for p in PROXIES]

    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy
        proxy = random.choice(self.PROXIES)
        request.meta['proxy'] = proxy
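Paid proxies usually require authentication. Scrapy's built-in HttpProxyMiddleware reads credentials embedded in the proxy URL and adds the Proxy-Authorization header for you; the host and credentials below are placeholders, not working values:

def process_request(self, request, spider):
    # user:password@host:port — HttpProxyMiddleware extracts the
    # credentials and sets the Proxy-Authorization header
    request.meta['proxy'] = 'http://user:password@1.2.3.4:8080'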

In the spider, query an IP-lookup site and print the current outbound IP:

import json  # needed only for the httpbin.org/ip variant below

import scrapy


class IpproxySpider(scrapy.Spider):
    name = 'ipproxy'
    allowed_domains = ['ip.cn', 'httpbin.org']
    # start_urls = ['http://httpbin.org/ip']
    start_urls = ['https://www.ip.cn/']

    def parse(self, response):
        print('running parse() in ipproxy.py')
        # ip.cn shows the visitor's IP inside a <code> element
        ip = response.xpath('//div[@class="well"]/p[1]/code/text()').get()
        # For httpbin.org/ip instead: ip = json.loads(response.text)['origin']
        print(ip)
        # Loop: request the same page again, bypassing the dupe filter
        yield scrapy.Request(self.start_urls[0], dont_filter=True)

Run it! The quality of free proxy IPs, however, leaves much to be desired...
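One way to cope is to drop proxies that throw download errors and retry through another one. The process_exception hook is Scrapy's real API; the pool-pruning logic and the class name below are my own sketch, not from the original middleware:

import random


class PruningProxyDownloaderMiddleware:
    # Same pool shape as above; a real pool would be refreshed regularly
    PROXIES = [
        'https://183.166.138.181:4216',
        'https://112.194.112.175:8118',
    ]

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.PROXIES)

    def process_exception(self, request, exception, spider):
        # Drop the proxy that just failed (but keep at least one entry)
        bad = request.meta.get('proxy')
        if bad in self.PROXIES and len(self.PROXIES) > 1:
            self.PROXIES.remove(bad)
            spider.logger.warning('dropped dead proxy %s', bad)
        # Returning a Request reschedules it; on the next pass
        # process_request picks a different proxy
        return request.replace(dont_filter=True)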
