Distributed crawling with Scrapy: scrapy-redis

What scrapy_redis does

scrapy_redis builds on top of Scrapy and adds more powerful features, in particular:

By persisting the request queue and the set of request fingerprints in Redis, it provides:

  • Resumable crawls (stop the spider and pick up where it left off)
  • Fast, distributed crawling across multiple machines
    Other conceptual background is easy to look up elsewhere; here we only cover how to turn an ordinary spider into a distributed one (the default Redis key names involved are listed just below).
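    With scrapy_redis's default settings, that persisted state lives in a few Redis keys named after the spider. Roughly (the exact names can differ if you override the defaults):

    <spider name>:requests    # the scheduler's pending request queue
    <spider name>:dupefilter  # the set of request fingerprints already processed
    <spider name>:items       # scraped items, if RedisPipeline is enabled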

    Step 1: import the distributed spider class (copy it from the official example)
    Step 2: inherit from that class (copy it if you can't remember)
    Step 3: comment out start_urls and allowed_domains
    Step 4: set redis_key (any name works; the official docs show the usual convention)
    Step 5: add the __init__ method (copy the official example)

    Depending on the pages we crawled before, we mainly wrote CrawlSpider-based and plain Spider-based crawlers; below both kinds are rewritten as distributed crawlers.

    First, grab the official template from GitHub:
    git clone https://github.com/rolando/scrapy-redis.git
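    The spiders worth copying live in the example project inside that repository; at the time of writing the layout is roughly as follows (paths may shift between versions):

    example-project/example/spiders/myspider_redis.py     # RedisSpider example
    example-project/example/spiders/mycrawler_redis.py    # RedisCrawlSpider example
    example-project/example/settings.py                    # settings to crib from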

Rewriting a CrawlSpider

Goal: crawl member listings from youyuan.com (有缘网).
The rewritten spider looks like this:

from scrapy.spiders import CrawlSpider, Rule
from youyuanwang.items import YouyuanwangItem
from scrapy.linkextractors import LinkExtractor
# Step 1: import the distributed spider class
from scrapy_redis.spiders import RedisCrawlSpider


# Step 2: inherit from RedisCrawlSpider instead of CrawlSpider
class MarrigeSpider(RedisCrawlSpider):
    name = 'marrige'
    # Step 3: comment out start_urls and allowed_domains
    # allowed_domains = ['youyuan.com']
    # start_urls = ['http://www.youyuan.com/find/xian/mm18-0/advance-0-0-0-0-0-0-0/p1/']

    # Step 4: set redis_key (the Redis list this spider pops start URLs from)
    redis_key = 'guoshaosong'

    # Link-extraction rules work exactly as in a normal CrawlSpider
    rules = (
        Rule(LinkExtractor(allow=r'^.*youyuan.*xian.*'), callback='parse_item', follow=True),
    )

    # Step 5: set the allowed domains dynamically in __init__
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        # list() so allowed_domains survives repeated iteration on Python 3
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(MarrigeSpider, self).__init__(*args, **kwargs)  # note: use the current class name here

    def parse_item(self, response):
        item = YouyuanwangItem()
        student_list = response.xpath('//div[@class="student"]/ul/li')
        for li in student_list:
            item['name'] = li.xpath('./dl/dd/a[1]/strong/text()').extract_first()
            item['desc'] = li.xpath('./dl/dd/font//text()').extract()
            item['img'] = li.xpath('./dl/dt/a/img/@src').extract_first()
            yield item
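The spider above fills three fields on YouyuanwangItem. The original post doesn't show items.py, so here is a minimal sketch of what it presumably contains, declaring exactly the fields used in parse_item (an assumption, not code from the original project):

import scrapy


class YouyuanwangItem(scrapy.Item):
    # fields referenced in parse_item above
    name = scrapy.Field()  # profile name
    desc = scrapy.Field()  # short description text
    img = scrapy.Field()   # avatar image URL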

Next comes settings.py, again based on the official example:

# Scrapy settings for youyuanwang project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#
SPIDER_MODULES = ['youyuanwang.spiders']
NEWSPIDER_MODULE = 'youyuanwang.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True  # keep the Redis queue and dupefilter on close so the crawl can resume
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

ITEM_PIPELINES = {
    # 'youyuanwang.pipelines.YouyuanwangPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

LOG_LEVEL = 'DEBUG'

REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# REDIS_PASS = 'root'
SPIDER_MIDDLEWARES = { 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None, }
# Introduce an artificial delay to make use of parallelism
# (lower or remove it to speed up the crawl).
DOWNLOAD_DELAY = 1
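A note on authentication: REDIS_PASS above is not a standard scrapy_redis setting; a password is usually supplied through REDIS_URL, which scrapy_redis reads as an alternative to REDIS_HOST/REDIS_PORT. A sketch with placeholder credentials:

REDIS_URL = 'redis://:yourpassword@127.0.0.1:6379/0'   # standard redis:// URL, password optional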


To run it, first push a start URL into Redis. The list name must match the spider's redis_key (here 'guoshaosong'):
lpush guoshaosong http://www.youyuan.com/find/xian/mm18-0/advance-0-0-0-0-0-0-0/p1/
Then start the spider from PyCharm or a terminal:
scrapy runspider <spider_file>.py    # or: scrapy crawl marrige

You can stop the spider partway through and restart it to confirm that it really does resume where it left off; the sketch below shows one way to check from Python.
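If you'd rather seed and inspect the crawl from Python instead of redis-cli, a small redis-py sketch along these lines works (it assumes redis-py is installed, Redis runs on localhost, and the default scrapy_redis key names are in effect):

import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=0)

# seed the start URL -- the list name must equal the spider's redis_key
r.lpush('guoshaosong',
        'http://www.youyuan.com/find/xian/mm18-0/advance-0-0-0-0-0-0-0/p1/')

# after pausing the spider, peek at the persisted state to confirm it can resume
print('pending requests :', r.zcard('marrige:requests'))    # scheduler queue (a sorted set by default)
print('seen fingerprints:', r.scard('marrige:dupefilter'))  # dupefilter set
print('scraped items    :', r.llen('marrige:items'))        # RedisPipeline output list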

Rewriting a plain Spider

The process is exactly the same; copy the official example if you get stuck. This one is mainly here as a second worked example.
Goal: crawl the Sina News category pages.

import scrapy
from news.items import NewsItem
# Step 1: import the required class
from scrapy_redis.spiders import RedisSpider


# Step 2: change the base class to RedisSpider
class SinaNewsSpider(RedisSpider):
    name = 'sina_news'

    # Step 3: comment out allowed_domains and start_urls
    # allowed_domains = ['sina.com.cn']
    # start_urls = ['http://news.sina.com.cn/guide/']
    # Step 4: set redis_key
    redis_key = 'myspider:start_urls'

    # Step 5: add the __init__ from the official example
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        # list() so allowed_domains survives repeated iteration on Python 3
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(SinaNewsSpider, self).__init__(*args, **kwargs)


    # Step 6: update the settings file (shown after the spider)
    def parse(self, response):
        # URLs and titles of all top-level categories
        parentUrls = response.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()
        parentTitle = response.xpath('//div[@id="tab01"]/div/h3/a/text()').extract()

        # URLs and titles of all sub-categories
        subUrls = response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()
        subTitle = response.xpath('//div[@id="tab01"]/div/ul/li/a/text()').extract()

        # Pair each sub-category with its parent category
        for i in range(0, len(parentUrls)):
            for j in range(0, len(subUrls)):
                item = NewsItem()
                item['headline'] = parentTitle[i]
                item['headline_url'] = parentUrls[i]
                # a sub-category belongs to this parent if its URL starts with the parent's URL
                if subUrls[j].startswith(item['headline_url']):
                    item['subtitle'] = subTitle[j]
                    item['subtitle_url'] = subUrls[j]
                    yield scrapy.Request(url=item['subtitle_url'], callback=self.subtitle_parse, meta={'meta_1': item})

    def subtitle_parse(self, response):
        item = NewsItem()
        meta_1 = response.meta['meta_1']
        item['content'] = response.xpath('//title/text()').extract_first()
        item['headline'] = meta_1['headline']
        item['headline_url'] = meta_1['headline_url']
        item['subtitle'] = meta_1['subtitle']
        item['subtitle_url'] = meta_1['subtitle_url']
        yield item
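As with the first example, items.py is not shown in the original post; a minimal sketch declaring the fields this spider uses would look roughly like this (illustrative only):

import scrapy


class NewsItem(scrapy.Item):
    # fields filled in parse() and subtitle_parse() above
    headline = scrapy.Field()      # top-level category title
    headline_url = scrapy.Field()  # top-level category URL
    subtitle = scrapy.Field()      # sub-category title
    subtitle_url = scrapy.Field()  # sub-category URL
    content = scrapy.Field()       # page title of the sub-category page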

settings

# Scrapy settings for news project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#
SPIDER_MODULES = ['news.spiders']
NEWSPIDER_MODULE = 'news.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

ITEM_PIPELINES = {
    # 'news.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Redis connection settings
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379

LOG_LEVEL = 'DEBUG'
SPIDER_MIDDLEWARES = { 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None, }
# Introduce an artificial delay to make use of parallelism
# (lower or remove it to speed up the crawl).
DOWNLOAD_DELAY = 1
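Kicking this one off mirrors the first example: push a seed URL under the spider's redis_key ('myspider:start_urls'); the commented-out start URL is the natural choice. The file name below and the -a domain argument (which feeds the __init__ shown earlier) are illustrative rather than taken from the original post:

lpush myspider:start_urls http://news.sina.com.cn/guide/
scrapy runspider sina_news.py -a domain=sina.com.cn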


That's basically it. I also looked into deploying and managing distributed crawlers (scrapyd, Gerapy, and so on), but since I'm on Windows 10 Education and the Docker setup kept failing, I stopped tinkering there. There are plenty of blog posts on CSDN covering those tools if you want to dig further.
