When Scrapy Meets MongoDB

Heard that MongoDB is a better match for crawlers?

Let's make it happen.

This time we'll scrape proxy IPs and their locations from the Xiaoerdaili proxy site: http://www.xiaoerdaili.com/https/page1/

Just like in the previous post, we follow the five-step Scrapy routine to write the code.

 

Write the items.py file

import scrapy


class IpscrapyItem(scrapy.Item):
    # the proxy IP address
    ip = scrapy.Field()
    # the location (city) the IP belongs to
    city = scrapy.Field()

 

Write the spider.py file; here we only crawl the first 200 pages of data

import scrapy
from ..items import IpscrapyItem


class XiaoerdailiSpider(scrapy.Spider):
    name = 'xiaoerdaili'
    allowed_domains = ['xiaoerdaili.com']
    # crawl the first 200 pages (range(1, 201) yields pages 1 through 200)
    start_urls = ["http://www.xiaoerdaili.com/https/page" + str(x) for x in range(1, 201)]

    def parse(self, response):
        divs = response.xpath("//div[@id='list']/div")
        for div in divs:
            item = IpscrapyItem()
            item['ip'] = div.xpath("./div[1]/text()").get()
            item['city'] = div.xpath("./div[2]/text()").get()

            print(item)
            yield item

Write the middlewares.py file

from scrapy import signals


class IpscrapyDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # set each request's own URL as its Referer header
        referer = request.url
        if referer:
            request.headers['referer'] = referer

        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

 

Write the pipelines.py file

import pymongo


class IpscrapyPipeline:
    def __init__(self):
        # connect to the local MongoDB server
        client = pymongo.MongoClient('127.0.0.1', 27017)
        # select the database; 'scrapy' is the database name
        db = client['scrapy']
        # select the collection (what we usually call a table); 'ips' is its name
        self.ips = db['ips']

    def process_item(self, item, spider):
        data = dict(item)  # convert the item into a plain dict
        self.ips.insert_one(data)  # insert one document into the collection
        return item  # lets the item still show up in the console output; optional if this is the only pipeline
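
If you'd rather not hard-code the connection details, the same pipeline logic can read them from settings.py and open/close the connection together with the spider. Below is a minimal sketch of that variant; the MONGO_URI and MONGO_DATABASE setting names (and their defaults) are my own assumptions, not something defined in this project:

import pymongo


class MongoSettingsPipeline:
    # hypothetical variant of IpscrapyPipeline, not part of the original project

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection info from settings.py (assumed setting names)
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://127.0.0.1:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'scrapy'),
        )

    def open_spider(self, spider):
        # open the connection once when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.ips = self.client[self.mongo_db]['ips']

    def close_spider(self, spider):
        # close the connection when the crawl finishes
        self.client.close()

    def process_item(self, item, spider):
        self.ips.insert_one(dict(item))
        return item

To use it, you would register this class in ITEM_PIPELINES instead of IpscrapyPipeline.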

 

Write the settings.py file


BOT_NAME = 'ipScrapy'

SPIDER_MODULES = ['ipScrapy.spiders']
NEWSPIDER_MODULE = 'ipScrapy.spiders'

FEED_EXPORT_ENCODING = 'utf-8'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Keep cookies enabled (this is the default)
COOKIES_ENABLED = True

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'ipScrapy.middlewares.IpscrapyDownloaderMiddleware': 543,
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'ipScrapy.pipelines.IpscrapyPipeline': 300,
}


 

Now strike up your five-step symphony:

scrapy crawl xiaoerdaili

 

The data is already being scraped.

 

 

Now let's take a look at MongoDB.
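
A quick way to check the inserted documents from Python, using the same local server, database name 'scrapy', and collection name 'ips' as in the pipeline above (a minimal verification sketch, not part of the project code):

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
ips = client['scrapy']['ips']

# count how many records made it into the collection
print(ips.count_documents({}))

# peek at the first few documents
for doc in ips.find().limit(5):
    print(doc.get('ip'), doc.get('city'))

The equivalent check in the mongo shell is use scrapy followed by db.ips.find().limit(5).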

And that's it. No drama at all.
