Using Scrapy to Crawl Free Proxy IPs and Store Them in MongoDB

Latest recommended article published 2020-04-30 16:16:34

不认输的小蜗牛


Article link: https://blog.csdn.net/CosCXY/article/details/85265579


Acknowledgment: 刘硕 (Liu Shuo)

Part of the code comes from Liu Shuo's book 《精通scrapy网络爬虫》, which is hereby credited.

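When crawling larger websites, you will usually run into one very annoying thing: their anti-crawling mechanisms. Crawl a little too fast and you get blocked; crawl too slowly and the wait becomes tedious. That is why many people crawl through proxies. A paid proxy server is relatively expensive, however, so for those of us on a budget, scraping free proxy IPs is a better fit. Of course, if you have the funds, you can simply buy a proxy service. Creating the Scrapy project and starting the spider are skipped here, since I trust everyone already knows how to do that.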

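What this spider can do (it is also beginner-friendly):

1. You can control how many proxy IPs are crawled (by choosing how many pages of the proxy site to fetch)

2. It discards proxy IPs that do not work

3. The storage backend is your choice (you have to write it yourself; here I store the results in MongoDB)

Without further ado: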

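1. First, we need to pick a site that offers free proxy IPs. Here we choose http://www.xicidaili.com.

Open the developer tools with F12 and locate the IP information in each table row; work through this part yourself:

All of the content sits under the XPath //table[@id="ip_list"]/tr[position()>1]; CSS selectors then pick out each column:

IP address: td:nth-child(2)::text

Port: td:nth-child(3)::text

Type (request scheme): td:nth-child(6)::text

The extraction code is as follows: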

        for sel in response.xpath('//table[@id="ip_list"]/tr[position()>1]'):
            # Extract the proxy IP, the port, and the scheme (http or https)
            ip = sel.css('td:nth-child(2)::text').extract_first()
            port = sel.css('td:nth-child(3)::text').extract_first()
            # default='' avoids calling .lower() on None when the cell is missing
            scheme = sel.css('td:nth-child(6)::text').extract_first(default='').lower()

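Explanation: select every row matching //table[@id="ip_list"]/tr[position()>1] (position()>1 skips the header row), then iterate over the rows to pull out the useful fields for the next step.

Of course, before extracting anything we need to build the URLs. The code is below; changing the numbers passed to range changes how many pages are crawled (which indirectly sets the number of IPs):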
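To make the column choice concrete, here is a minimal sketch that uses the standard-library ElementTree instead of Scrapy selectors, so it runs without a crawl. The sample table is a hypothetical, simplified stand-in for the real page, and `extract_rows` is a name introduced only for this illustration. Skipping the first `tr` plays the role of `position()>1`, and `cells[1]`, `cells[2]`, `cells[5]` correspond to `td:nth-child(2)`, `(3)`, and `(6)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real page has more columns and rows.
SAMPLE = """
<table id="ip_list">
  <tr><th>Flag</th><th>IP</th><th>Port</th><th>Location</th><th>Anonymity</th><th>Type</th></tr>
  <tr><td>cn</td><td>10.0.0.1</td><td>8080</td><td>Beijing</td><td>High</td><td>HTTP</td></tr>
  <tr><td>cn</td><td>10.0.0.2</td><td>3128</td><td>Shanghai</td><td>High</td><td>HTTPS</td></tr>
</table>
"""


def extract_rows(html):
    """Return (ip, port, scheme) tuples for every data row of the table."""
    table = ET.fromstring(html.strip())
    rows = []
    # [1:] skips the header row, like tr[position()>1] in the XPath.
    for tr in table.findall("tr")[1:]:
        cells = tr.findall("td")
        if len(cells) < 6:
            continue  # malformed row, nothing to extract
        ip = (cells[1].text or "").strip()              # td:nth-child(2)
        port = (cells[2].text or "").strip()            # td:nth-child(3)
        scheme = (cells[5].text or "").strip().lower()  # td:nth-child(6)
        rows.append((ip, port, scheme))
    return rows


print(extract_rows(SAMPLE))
```

In the spider itself the Scrapy selectors do the same job directly on the response; this version only shows which cells they end up reading.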

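    # Override the method that builds the start requests
    def start_requests(self):
        # range(1, 2) yields only page 1; raise the upper bound to crawl more pages
        for i in range(1, 2):
            yield scrapy.Request('http://www.xicidaili.com/nn/%s' % i)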
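Because the upper bound of `range` is exclusive, it is easy to be off by one page. A small sketch of a helper that takes inclusive page numbers might look like this; `build_page_urls` and `BASE_URL` are hypothetical names for illustration, not part of the original spider.

```python
BASE_URL = 'http://www.xicidaili.com/nn/%s'


def build_page_urls(first_page=1, last_page=1):
    """Return the listing URLs for pages first_page..last_page (inclusive)."""
    if first_page < 1 or last_page < first_page:
        raise ValueError('expected 1 <= first_page <= last_page')
    # range() excludes its upper bound, hence last_page + 1.
    return [BASE_URL % page for page in range(first_page, last_page + 1)]


for url in build_page_urls(1, 3):
    print(url)
```

Inside `start_requests`, each URL from such a helper would simply be wrapped in `scrapy.Request(url)`.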

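2. Verify that each scraped IP actually works, and at the same time check whether it is a high-anonymity proxy (one that hides your own IP):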

    def parse(self, response):
        for sel in response.xpath('//table[@id="ip_list"]/tr[position()>1]'):
            # Extract the proxy IP, the port, and the scheme (http or https)
            ip = sel.css('td:nth-child(2)::text').extract_first()
            port = sel.css('td:nth-child(3)::text').extract_first()
            scheme = sel.css('td:nth-child(6)::text').extract_first(default='').lower()

            # Skip rows where any of the three fields is missing
            if not (ip and port and scheme):
                continue

            # Send a request through the scraped proxy to check that it works
            url = '%s://httpbin.org/ip' % scheme
            proxy = '%s://%s:%s' % (scheme, ip, port)

            meta = {
                'proxy': proxy,
                'dont_retry': True,
                'download_timeout': 10,

                # The next two fields are passed on to check_available for the check
                '_proxy_scheme': scheme,
                '_proxy_ip': ip,
            }

            # Yield one verification request per proxy
            yield scrapy.Request(url, callback=self.check_available, meta=meta, dont_filter=True)

    def check_available(self, response):
        proxy_ip = response.meta['_proxy_ip']
        # httpbin echoes the address it saw; a match means our real IP stayed hidden
        if proxy_ip == json.loads(response.text)["origin"]:
            ip = IpProxyItem()
            ip["scheme"] = response.meta["_proxy_scheme"]
            ip["proxy"] = response.meta["proxy"]
            yield ip

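Explanation: check_available decides whether the proxy is a high-anonymity proxy by looking at the "origin" field of the response: if it equals the proxy's IP, the target site saw only the proxy's address. Reaching this callback at all means the proxy works, so we store it in an Item, ready to be written to MongoDB. The complete spider code (proxy.py):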

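# -*- coding: utf-8 -*-
import scrapy
import json
from ..items import IpProxyItem


class ProxySpider(scrapy.Spider):
    # Collects free proxy IPs
    name = 'proxy'
    allowed_domains = ['www.xicidaili.com']
    # Not used in practice, because start_requests is overridden below
    start_urls = ['http://www.xicidaili.com/']

    # Override the method that builds the start requests
    def start_requests(self):
        # range(1, 2) yields only page 1; raise the upper bound to crawl more pages
        for i in range(1, 2):
            yield scrapy.Request('http://www.xicidaili.com/nn/%s' % i)

    # Extract the data
    def parse(self, response):
        for sel in response.xpath('//table[@id="ip_list"]/tr[position()>1]'):
            # Extract the proxy IP, the port, and the scheme (http or https)
            ip = sel.css('td:nth-child(2)::text').extract_first()
            port = sel.css('td:nth-child(3)::text').extract_first()
            scheme = sel.css('td:nth-child(6)::text').extract_first(default='').lower()

            # Skip rows where any of the three fields is missing
            if not (ip and port and scheme):
                continue

            # Send a request through the scraped proxy to check that it works
            url = '%s://httpbin.org/ip' % scheme
            proxy = '%s://%s:%s' % (scheme, ip, port)

            meta = {
                'proxy': proxy,
                'dont_retry': True,
                'download_timeout': 10,

                # The next two fields are passed on to check_available for the check
                '_proxy_scheme': scheme,
                '_proxy_ip': ip,
            }

            # Yield one verification request per proxy; dont_filter=True also lets
            # it past the offsite filter, since httpbin.org is not in allowed_domains
            yield scrapy.Request(url, callback=self.check_available, meta=meta, dont_filter=True)

    def check_available(self, response):
        proxy_ip = response.meta['_proxy_ip']
        # httpbin echoes the address it saw; a match means our real IP stayed hidden
        if proxy_ip == json.loads(response.text)["origin"]:
            ip = IpProxyItem()
            ip["scheme"] = response.meta["_proxy_scheme"]
            ip["proxy"] = response.meta["proxy"]
            yield ip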
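The comparison in check_available is a plain string equality on "origin". As a hedged sketch, the same check can be pulled into a small pure function that is easier to test and a little more tolerant: it assumes the echo service may sometimes report several comma-separated addresses in "origin" and may occasionally return a body that is not valid JSON. `is_high_anonymity` is a hypothetical helper name, not part of the original spider.

```python
import json


def is_high_anonymity(proxy_ip, body_text):
    """Return True only if every address echoed in "origin" is the proxy's IP."""
    try:
        origin = json.loads(body_text).get("origin", "")
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: treat the proxy as unusable.
        return False
    if not isinstance(origin, str):
        return False
    # Assumption: "origin" may hold several comma-separated addresses.
    seen = [part.strip() for part in origin.split(",") if part.strip()]
    return bool(seen) and all(addr == proxy_ip for addr in seen)


print(is_high_anonymity("1.2.3.4", '{"origin": "1.2.3.4"}'))
print(is_high_anonymity("1.2.3.4", '{"origin": "9.9.9.9, 1.2.3.4"}'))
```

In the spider, the condition would then read `if is_high_anonymity(proxy_ip, response.text):`. Whether the stricter "all addresses must match" rule is what you want depends on how the echo service reports forwarded addresses.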

3. Store the results in MongoDB. The comments are fairly detailed, so there is no separate explanation; the code (pipelines.py):

# -*- coding: utf-8 -*-
from scrapy import Item
import pymongo
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class IpProxyPipeline(object):
    DB_URL = 'mongodb://localhost:27017/'
    DB_NAME = 'proxy'

    # Runs when the spider starts: connect to the MongoDB database
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.DB_URL)
        self.db = self.client[self.DB_NAME]

    # Runs after the spider finishes: close the connection
    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Use the spider's name as the collection name
        con = self.db[spider.name]
        # insert_one needs a dict (an Item cannot be inserted directly),
        # so convert the item if it is not already a dict
        post = dict(item) if isinstance(item, Item) else item
        # Insert the data into MongoDB
        con.insert_one(post)
        return item
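The pipeline above hard-codes the database URL and name as class attributes. One possible variation, shown here only as a sketch, reads them from the project settings through Scrapy's `from_crawler` hook. The setting names `MONGO_DB_URI` and `MONGO_DB_NAME` and the class name `ConfigurableMongoPipeline` are assumptions made for this example; `pymongo` is imported inside `open_spider` so the class can be constructed without a database driver loaded.

```python
class ConfigurableMongoPipeline:
    def __init__(self, db_url, db_name):
        self.db_url = db_url
        self.db_name = db_name

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when it builds the pipeline; fall back to
        # the values used in the original pipeline if the settings are absent.
        return cls(
            db_url=crawler.settings.get('MONGO_DB_URI', 'mongodb://localhost:27017/'),
            db_name=crawler.settings.get('MONGO_DB_NAME', 'proxy'),
        )

    def open_spider(self, spider):
        import pymongo  # imported lazily; only needed once a spider runs
        self.client = pymongo.MongoClient(self.db_url)
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One collection per spider, as in the original pipeline.
        self.db[spider.name].insert_one(dict(item))
        return item
```

With this variant, the two hypothetical settings would go into settings.py next to ITEM_PIPELINES, and the ITEM_PIPELINES entry would point at this class instead.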

4. Add the following fields in items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class IpProxyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    scheme = scrapy.Field()
    proxy = scrapy.Field()

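5. In settings.py, enable the pipeline (that section is commented out by default and must be uncommented), and also set the User-Agent request header and the robots.txt rule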

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'ip_proxy.pipelines.IpProxyPipeline': 300,
}


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

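With that in place, run the spider and the data is written to the database automatically.

If anything is unclear, feel free to leave a comment.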
