When Scrapy Meets MongoDB

Heard that MongoDB is a better match for crawlers?

Let's make it happen.

This time we'll scrape proxy IPs and their locations from the Xiaoerdaili proxy site: http://www.xiaoerdaili.com/https/page1/

Just like in the previous post, we follow the five-step Scrapy routine to write the code.

 

Write the items.py file

import scrapy


class IpscrapyItem(scrapy.Item):
    # the proxy IP address
    ip = scrapy.Field()
    # the location (city) the IP belongs to
    city = scrapy.Field()

 

Write the spider.py file; here we only crawl the first 200 pages of data

import scrapy
from ..items import IpscrapyItem


class XiaoerdailiSpider(scrapy.Spider):
    name = 'xiaoerdaili'
    allowed_domains = ['xiaoerdaili.com']
    # crawl the first 200 pages (range(1, 201) yields pages 1 through 200)
    start_urls = ["http://www.xiaoerdaili.com/https/page" + str(x) for x in range(1, 201)]

    def parse(self, response):
        divs = response.xpath("//div[@id='list']/div")
        for div in divs:
            item = IpscrapyItem()
            item['ip'] = div.xpath("./div[1]/text()").get()
            item['city'] = div.xpath("./div[2]/text()").get()

            print(item)
            yield item

Write the middlewares.py file

from scrapy import signals


class IpscrapyDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # set each request's own URL as its Referer header
        referer = request.url
        if referer:
            request.headers['referer'] = referer

        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

 

Write the pipelines.py file

import pymongo


class IpscrapyPipeline:
    def __init__(self):
        # connect to the local MongoDB server
        client = pymongo.MongoClient('127.0.0.1', 27017)
        # select the database; 'scrapy' is the database name
        db = client['scrapy']
        # select the collection (what we usually call a table); 'ips' is its name
        self.ips = db['ips']

    def process_item(self, item, spider):
        data = dict(item)  # convert the item into a plain dict
        self.ips.insert_one(data)  # insert one document into the collection
        return item  # lets the item still show up in the console output; optional if this is the only pipeline
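
If you'd rather not hard-code the connection details, the same pipeline logic can read them from settings.py and open/close the connection together with the spider. Below is a minimal sketch of that variant; the MONGO_URI and MONGO_DATABASE setting names (and their defaults) are my own assumptions, not something defined in this project:

import pymongo


class MongoSettingsPipeline:
    # hypothetical variant of IpscrapyPipeline, not part of the original project

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection info from settings.py (assumed setting names)
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://127.0.0.1:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'scrapy'),
        )

    def open_spider(self, spider):
        # open the connection once when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.ips = self.client[self.mongo_db]['ips']

    def close_spider(self, spider):
        # close the connection when the crawl finishes
        self.client.close()

    def process_item(self, item, spider):
        self.ips.insert_one(dict(item))
        return item

To use it, you would register this class in ITEM_PIPELINES instead of IpscrapyPipeline.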

 

Write the settings.py file


BOT_NAME = 'ipScrapy'

SPIDER_MODULES = ['ipScrapy.spiders']
NEWSPIDER_MODULE = 'ipScrapy.spiders'

FEED_EXPORT_ENCODING = 'utf-8'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Keep cookies enabled (this is the default)
COOKIES_ENABLED = True

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'ipScrapy.middlewares.IpscrapyDownloaderMiddleware': 543,
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'ipScrapy.pipelines.IpscrapyPipeline': 300,
}


 

Now strike up your five-step symphony:

scrapy crawl xiaoerdaili

 

The data is already being scraped.

 

 

Now let's take a look at MongoDB.
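
A quick way to check the inserted documents from Python, using the same local server, database name 'scrapy', and collection name 'ips' as in the pipeline above (a minimal verification sketch, not part of the project code):

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
ips = client['scrapy']['ips']

# count how many records made it into the collection
print(ips.count_documents({}))

# peek at the first few documents
for doc in ips.find().limit(5):
    print(doc.get('ip'), doc.get('city'))

The equivalent check in the mongo shell is use scrapy followed by db.ips.find().limit(5).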

And that's it. No drama at all.
