Scrapy-Redis Distributed Components

xiaoming0018

Category: Web crawlers

Article link: https://blog.csdn.net/xiaoming0018/article/details/80387712

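Distributed: a single service is split into several sub-services that are deployed on different servers. A cluster describes the physical layout; distributed describes how the work is organized.

scrapy-redis architecture

Scrapy is a fairly easy-to-use Python crawling framework: you only need to write a few components to scrape data from web pages. But when the number of pages to crawl is very large, a single host can no longer keep up, whether in processing speed or in the number of concurrent network requests, and this is where a distributed crawler pays off.

Scrapy-Redis is a Redis-based distributed component for Scrapy. It uses Redis to store and schedule the requests to be crawled, and to store the items the crawl produces for later processing. scrapy-redis overrides some key parts of Scrapy, turning it into a distributed crawler that can run on multiple hosts at the same time.

The four components provided by scrapy-redis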

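On top of the Scrapy architecture, scrapy-redis adds Redis and provides the following four components (which means these four Scrapy modules are each replaced or modified accordingly):
1) Scheduler
2) Duplication Filter (deduplicates requests)
3) Item Pipeline (stores items in Redis for distributed processing)
4) Base Spider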
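To make the roles of the scheduler and the duplication filter concrete, here is a minimal in-memory sketch of the idea. It does not use Redis or scrapy-redis at all: the SharedStore class, the fingerprint helper and the example.com URLs are hypothetical stand-ins, and the real duplicate filter fingerprints more than just the URL.

```python
from collections import deque
import hashlib


class SharedStore:
    """In-memory stand-in for the Redis server (hypothetical, for illustration)."""

    def __init__(self):
        self.queue = deque()  # plays the role of the shared request queue
        self.seen = set()     # plays the role of the shared fingerprint set


def fingerprint(url):
    # Simplified fingerprint: only the URL is hashed in this sketch.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def schedule(store, url):
    """Enqueue url unless some worker has already scheduled it."""
    fp = fingerprint(url)
    if fp in store.seen:
        return False
    store.seen.add(fp)
    store.queue.append(url)
    return True


def next_request(store):
    """Hand the next URL to whichever worker asks first."""
    return store.queue.popleft() if store.queue else None


store = SharedStore()
# Two workers discover overlapping links; the duplicate is dropped.
results = [
    schedule(store, url)
    for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
]
print(results)              # [True, True, False]
print(next_request(store))  # http://example.com/a
```

Because every worker talks to the same store, no page is scheduled twice and any idle worker can pick up the next request. In scrapy-redis, Redis plays the part of that shared store.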

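Installation covers both the server and the client. On Ubuntu, Redis can be installed with the following commands:

sudo apt-get update

sudo apt-get install redis-server

Start the Redis service: sudo service redis start (on some Ubuntu releases the service may be named redis-server)

Start the Redis server directly: redis-server

Stop the Redis service on Linux: sudo kill -9 <Redis process ID>, or: sudo service redis stop

Restart the Redis service on Linux: sudo service redis restart

Use the Redis client to check that the server is running: redis-cli

Check the server version: redis-server --version or redis-server -v

Check the client version: redis-cli --version or redis-cli -v

Check whether Redis processes are running: ps ajx | grep redis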
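The same liveness check can be scripted. The sketch below uses only the Python standard library and speaks the Redis wire protocol (RESP) directly; the helper names encode_command and redis_is_up are made up for this example, and it assumes a server without password authentication.

```python
import socket


def encode_command(*parts):
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(parts)]
    for part in parts:
        data = part.encode("utf-8")
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def redis_is_up(host="127.0.0.1", port=6379, timeout=1.0):
    """Return True if the server answers PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(encode_command("PING"))
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        # Connection refused, unreachable host, or timeout.
        return False


print(encode_command("PING"))  # b'*1\r\n$4\r\nPING\r\n'
```

Calling redis_is_up("192.168.31.114") from a Slave would be a quick way to confirm that the Master's Redis is reachable before starting any spiders. A server that requires a password replies with an error instead of +PONG, so this sketch reports it as down.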

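Slave side

On a Slave, connect with: redis-cli -h 192.168.31.114, where the -h option gives the host of the Redis database to connect to.

Installing Redis on Windows. Download link for Windows: https://github.com/MicrosoftArchive/redis/releases

On Windows, go to the C:\Program Files\Redis directory, hold Shift and right-click, then choose "Open command window here" from the menu.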

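Master side

In this test, the Master is a Linux host with IP address 192.168.31.114. The Master only needs the Redis database; neither scrapy nor scrapy-redis has to be installed there. Start redis-server on the Master with a specific configuration file, for example:
Start the server on Linux: sudo redis-server /etc/redis/redis.conf

Connect locally on the Master: redis-cli

Note: Slaves do not need to run redis-server; only the Master does. As long as a Slave can read the Master's Redis database, the connection works and the distributed crawl can proceed. For that, the Master's redis.conf must allow connections from other hosts (for example through its bind setting).

In redis-cli on the Master, run a push command so that the Slave-side spiders receive a request and start crawling. The key must be the same as the redis_key of the spider that is waiting.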

$redis > lpush myspider:start_urls http://www.chinadmoz.org/
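The push can also be done from Python instead of redis-cli. The sketch below keeps the logic in a small function, seed_start_urls, that works with any client object offering an lpush method. RecordingClient is a hypothetical in-memory stand-in so the example runs without a server.

```python
class RecordingClient:
    """Hypothetical stand-in exposing the one method used here, lpush."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, *values):
        bucket = self.lists.setdefault(key, [])
        for value in values:
            bucket.insert(0, value)  # LPUSH adds at the head of the list
        return len(bucket)


def seed_start_urls(client, redis_key, urls):
    """Push each URL onto the list that the waiting spiders read from."""
    pushed = 0
    for url in urls:
        client.lpush(redis_key, url)
        pushed += 1
    return pushed


client = RecordingClient()
count = seed_start_urls(client, "myspider:start_urls", ["http://www.chinadmoz.org/"])
print(count, client.lists)
```

With the redis-py package installed, a real client such as redis.Redis(host="192.168.31.114", port=6379) could be passed in place of RecordingClient, since it provides lpush with the same call shape.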

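Installing scrapy-redis

Install scrapy-redis for Python 3: sudo pip3 install scrapy-redis

If pip3 is not installed: sudo apt-get install python3-pip

Download the scrapy-redis source code: https://codeload.github.com/rmax/scrapy-redis/zip/master

Or clone it with git: git clone https://github.com/rolando/scrapy-redis.git

If git is not installed, install it with: sudo apt install git

Run command: scrapy runspider myspider_redis.py

scrapy-redis source code

The settings.py file shows that items pass through our own ExamplePipeline first (300) and only then through the RedisPipeline shipped with scrapy-redis (400), because pipelines with lower numbers run earlier.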

SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'

USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'
# Use the scrapy-redis duplicate filter instead of Scrapy's default one
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy-redis scheduler instead of Scrapy's default one
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Persist the scheduler state: keep the Redis queues when the spider stops,
# so that a crawl can be paused and resumed
SCHEDULER_PERSIST = True
# Default: a priority queue that follows Scrapy's request priorities (backed by a Redis sorted set).
# All three queue classes lead to the same crawl results; set one explicitly so its key is easy to find in Redis.
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# Queue: first in, first out
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# Stack: last in, first out
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    # This pipeline is normally enabled; it stores items in the Redis database
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Log level
# LOG_LEVEL = 'DEBUG'
# Introduce an artificial delay to make use of parallelism and speed up the
# crawl.
# Download delay, in seconds
DOWNLOAD_DELAY = 17
# Redis connection details; they must point at the Master's Redis server
REDIS_HOST = "192.168.31.117"
REDIS_PORT = 6379
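The three SCHEDULER_QUEUE_CLASS options in the settings above differ only in the order in which pending requests are handed out. The sketch below imitates the three disciplines with standard-library containers rather than Redis. The request names and priority values are invented for illustration.

```python
from collections import deque
import heapq

requests = [("page-1", 0), ("page-2", 10), ("page-3", 5)]  # (name, priority)

# Queue (FIFO): first in, first out.
fifo = deque(name for name, _ in requests)
fifo_order = [fifo.popleft() for _ in range(len(requests))]

# Stack (LIFO): last in, first out.
lifo = [name for name, _ in requests]
lifo_order = [lifo.pop() for _ in range(len(requests))]

# Priority queue: a higher priority leaves first, so negate it for a min-heap.
heap = [(-priority, name) for name, priority in requests]
heapq.heapify(heap)
priority_order = [heapq.heappop(heap)[1] for _ in range(len(requests))]

print(fifo_order)      # ['page-1', 'page-2', 'page-3']
print(lifo_order)      # ['page-3', 'page-2', 'page-1']
print(priority_order)  # ['page-2', 'page-3', 'page-1']
```

All three visit the same pages in the end; what changes is whether the crawl proceeds breadth-first, depth-first, or by request priority.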

pipelines.py configuration

from datetime import datetime

class ExamplePipeline(object):
    def process_item(self, item, spider):
        # Current UTC time at which the item was crawled
        item["crawled"] = datetime.utcnow()
        # Name of the spider, since several spiders crawl at once in a distributed run
        item["spider"] = spider.name
        return item
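To see what this pipeline does to an item and why it runs before RedisPipeline, the following sketch calls it by hand. It makes two substitutions: a SimpleNamespace stands in for a real spider object, and datetime.now(timezone.utc) replaces utcnow(), which recent Python versions deprecate. The item contents are made up.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

ITEM_PIPELINES = {
    "example.pipelines.ExamplePipeline": 300,
    "scrapy_redis.pipelines.RedisPipeline": 400,
}

# Scrapy runs item pipelines in ascending order of their number.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)


class ExamplePipeline(object):
    def process_item(self, item, spider):
        # Timezone-aware replacement for utcnow(), used only in this sketch.
        item["crawled"] = datetime.now(timezone.utc)
        item["spider"] = spider.name
        return item


# SimpleNamespace stands in for a spider; only its .name is needed here.
fake_spider = SimpleNamespace(name="dmoz")
item = ExamplePipeline().process_item({"url": "http://example.com/"}, fake_spider)

print(run_order)
print(item["spider"], type(item["crawled"]).__name__)
```

Since ExamplePipeline comes first in run_order, the crawled and spider fields are already present by the time the item reaches the pipeline that stores it in Redis.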

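The three scrapy-redis example spiders

1. dmoz (class DmozSpider(CrawlSpider))

This spider inherits from CrawlSpider and is used to demonstrate persistence in Redis: if we run the dmoz spider, stop it with Ctrl + C, and then run it again, the earlier crawl records are still kept in Redis. In effect it is a CrawlSpider running on scrapy-redis: it needs Rule definitions, and its callback must not be named parse().

Run with: scrapy crawl dmoz (the argument is the spider's name attribute; the modified example below sets name = 'atguigu', so it is run as scrapy crawl atguigu)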

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DmozSpider(CrawlSpider):     # Scrapy rule-based spider
    """Follow categories and extract links."""
    name = 'atguigu'
    allowed_domains = ['atguigu.com']
    start_urls = ['http://www.atguigu.com/teacher.shtml']
    rules = [
        # Rule(LinkExtractor(restrict_css=('.top-cat', '.sub-cat', '.cat-item')
        # ), callback='parse_directory', follow=True),
        # Follow every link whose URL matches "download" and parse it
        Rule(LinkExtractor(allow=r"download"), callback='parse_directory', follow=True),
    ]
    def parse_directory(self, response):
        # for div in response.css('.title-and-desc'):
        #     yield {
        #         'name': div.css('.site-title::text').extract_first(),
        #         'description': div.css('.site-descr::text').extract_first().strip(),
        #         'link': div.css('a::attr(href)').extract_first(),
        #     }

        yield {
            'name': response.xpath('//title/text()').extract_first(),
            # default='' avoids calling strip() on None when the meta tag is missing
            'description': response.xpath('//meta[@name="keywords"]/@content').extract_first(default='').strip(),
            'link': response.url,
        }
        # for div in response.xpath('//div[@class="item"]'):
        #     yield {
        #         'name': div.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first(),
        #         'description': div.xpath('.//div[@class="bd"]/p/text()').extract_first().strip(),
        #         'link': response.url,
        #     }
2. myspider_redis (class MySpider(RedisSpider))

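This spider inherits from RedisSpider, which supports distributed crawling. It is a basic spider, so it needs a parse() method.

It also no longer has start_urls; redis_key takes its place. scrapy-redis pops URLs from the Redis key named by redis_key and turns them into requests.

Run with: scrapy runspider myspider_redis.py

Running the spider's .py file with runspider (it can be started several times to get several workers) leaves the spider(s) waiting for start URLs: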

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider_redis:start_urls)."""
    # Spider name
    name = 'myspider_redis'
    # Redis key the spider reads from; the suggested form is <spider name>:start_urls, which keeps it unique
    redis_key = 'myspider_redis:start_urls'
    # A fixed domain list can be set here, or built dynamically as in __init__ below
    #  allowed_domains = ['sina.com.cn']
    # Build the allowed domains at run time; a fixed equivalent would be allowed_domains = ['chinadmoz.org']
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        # list() is needed on Python 3, where filter() returns a one-shot iterator
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
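The handling of the domain argument in __init__ can be tried in isolation. The helper name parse_domains below is made up for this sketch. It wraps filter() in list() because on Python 3 filter() returns an iterator that is empty after its first use, which would leave allowed_domains empty once something has read it.

```python
def parse_domains(domain):
    """Split a comma-separated string and drop empty entries."""
    return list(filter(None, domain.split(",")))


print(parse_domains("sina.com.cn,chinadmoz.org"))  # ['sina.com.cn', 'chinadmoz.org']
print(parse_domains(""))                           # []

# Without list(), Python 3 yields a one-shot iterator:
lazy = filter(None, "sina.com.cn".split(","))
first_pass = list(lazy)
second_pass = list(lazy)
print(first_pass, second_pass)                     # ['sina.com.cn'] []
```

The value would typically come from a spider argument, for example scrapy runspider myspider_redis.py -a domain=sina.com.cn. When no domain is passed, the list is empty and no domain restriction applies.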

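Notes on RedisSpider:

A RedisSpider does not need start_urls, and allowed_domains is optional:
the allowed domains can be defined dynamically in the constructor __init__(), or allowed_domains can be written directly as a class attribute.
redis_key must be set; it names the Redis key from which the spider takes its start URLs. Reference format: redis_key = 'myspider:start_urls'
Following that format, the start URLs are pushed into Redis with lpush from redis-cli on the Master, and the RedisSpider fetches them from the database.

3. mycrawler_redis (class MyCrawler(RedisCrawlSpider))

This spider inherits from RedisCrawlSpider, which supports distributed crawling. Because it is a CrawlSpider, it must define Rule objects, and its callback must not be named parse().
It likewise has no start_urls; redis_key takes its place, and scrapy-redis pops URLs from that Redis key and turns them into requests.

Run with: scrapy runspider mycrawler_redis.py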

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider

class MyCrawler(RedisCrawlSpider):
    """Spider that reads urls from redis queue (mycrawler:start_urls)."""
    name = 'mycrawler_redis'
    redis_key = 'mycrawler:start_urls'
    rules = (
        # Follow all links and call parse_page on each one, crawling in depth.
        # Note: parse() must not be overridden.
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )
    # __init__ must follow this pattern; when reusing it, only the class name passed to super() changes
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        # list() is needed on Python 3, where filter() returns a one-shot iterator
        self.allowed_domains = list(filter(None, domain.split(',')))
        # Change the class name here to the current class
        super(MyCrawler, self).__init__(*args, **kwargs)
    def parse_page(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }

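Notes on RedisCrawlSpider:

Likewise, a RedisCrawlSpider does not need start_urls, and allowed_domains is optional: the allowed domains can be defined dynamically in the constructor __init__(), or allowed_domains can be written directly as a class attribute. redis_key must be set; it names the Redis key from which the spider takes its start URLs. Reference format: redis_key = 'myspider:start_urls'. Following that format, the start URLs are pushed into Redis with lpush from redis-cli on the Master, and the RedisCrawlSpider fetches them from the database. Running the spider's .py file with runspider (it can be started several times to get several workers) leaves the spider(s) waiting for start URLs.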
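When checking a running crawl in redis-cli, it helps to know which keys to look for. The sketch below builds the key names from a spider name using templates that are assumptions based on the usual scrapy-redis defaults. The library itself uses slightly different placeholder names in its settings, and a spider may override redis_key as MyCrawler does, so verify them against the installed version.

```python
# Assumed default key patterns; check the defaults of your scrapy-redis version.
DEFAULT_KEY_TEMPLATES = {
    "start_urls": "%(name)s:start_urls",  # list the spider waits on
    "requests": "%(name)s:requests",      # scheduler queue
    "dupefilter": "%(name)s:dupefilter",  # set of seen request fingerprints
    "items": "%(name)s:items",            # items written by RedisPipeline
}


def keys_for(spider_name, start_urls_key=None):
    """Return the Redis keys a spider is expected to use (hypothetical helper)."""
    keys = {
        role: template % {"name": spider_name}
        for role, template in DEFAULT_KEY_TEMPLATES.items()
    }
    if start_urls_key is not None:
        # A spider may set redis_key explicitly instead of using the pattern.
        keys["start_urls"] = start_urls_key
    return keys


print(keys_for("mycrawler_redis", start_urls_key="mycrawler:start_urls"))
```

With SCHEDULER_PERSIST = True, the requests and dupefilter keys are the ones expected to remain in Redis after a spider is stopped, which is what allows the crawl to resume.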