Scrapy与redis的结合（Scrapy 分布式）

最新推荐文章于 2024-06-23 16:01:36 发布

无痕的雨

最新推荐文章于 2024-06-23 16:01:36 发布

阅读量1.2k

点赞数

分类专栏：爬虫

本文链接：https://blog.csdn.net/qq_45451647/article/details/114636229

版权

爬虫专栏收录该内容

24 篇文章 0 订阅

订阅专栏

一，Scrapy-分布式

（1）什么是scrapy_redis

scrapy_redis:Redis-based components for scrapy

github地址:https://github.com/rmax/scrapy-redis

（2）Scrapy和Scrapy-redis 有什么区别？

1.Scrapy是爬虫的一个框架爬取效率非常高具有高度的可定制性不支持分布式

2.Scrapy-redis 它是基于redis数据库运行在scrapy框架之上的一个组件可以让scrapy支持分布式策略支持主从同步

二，回顾scrapy⼯作流程

------------------------------\ .scrapy工作流程.\------------------

三，scrapy_redis⼯作流程

在这里插入图片描述 ----------------------------------\ .scrapy工作流程.\----------------------

四、scrapy_redis下载

（一）下载源码文件，以此为参照

clone github scrapy_redis源码⽂件
git clone https://github.com/rolando/scrapy-redis.git

（二）scrapy_redis中的settings⽂件

# Scrapy settings for example project 
2 # 
3 # For simplicity, this file contains only the most important sett ings by 
4 # default. All the other settings are documented here: 
5 # 
6 # http://doc.scrapy.org/topics/settings.html
7 # 
8 SPIDER_MODULES = ['example.spiders'] 
9 NEWSPIDER_MODULE = 'example.spiders' 
10
11 USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-re dis)' 
12
13 DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" # 指定那个去重⽅法给request对象去重 
14 SCHEDULER = "scrapy_redis.scheduler.Scheduler" # 指定Scheduler 队列
15 SCHEDULER_PERSIST = True # 队列中的内容是否持久保存,为false的时 候在关闭Redis的时候,清空Redis 
16 #SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue" 
17 #SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue" 
18 #SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack" 
19
20 ITEM_PIPELINES = { 
21 'example.pipelines.ExamplePipeline': 300, 22 'scrapy_redis.pipelines.RedisPipeline': 400, # scrapy_redi s实现的items保存到redis的pipline 
23 } 
24
25 LOG_LEVEL = 'DEBUG' 2627 # Introduce an artifical delay to make use of parallelism. to spe ed up the 
28 # crawl. 
29 DOWNLOAD_DELAY = 1

五、scrapy_redis运⾏

（一）爬取网页起始设置（运行github中下载的py实例文件）

1 allowed_domains = [‘dmoztools.net’]
2 start_urls = [‘http://www.dmoztools.net/’]
3
4 scrapy crawl dmoz

（二）运⾏结束后redis中多了三个键

dmoz:requests #存放的是待爬取的requests对象 
dmoz:item #爬取到的信息 
dmoz:dupefilter #爬取的requests的指纹 ```

这里是引用

六、如何改写分布式爬虫

（一）在原有的scrapy基础上进行更改

如何把普通的爬虫文件改写成分布式爬虫文件
普通的爬虫
1. 创建Scrapy项目/爬虫项目
2. 明确爬取的目标
3. 保存数据

改写成分布式
1. 改写爬虫文件
1.1 导入模块
1.2 继承类
1.3 把start_urs --> redis_key
2. 改写配置文件

#将以下内容添加到setting
#
#--------
#  
USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'
# 去重过滤
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 指定Scheduler队列
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True

ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

#-------

七、案例演示

（一）需求：

1.抓取当当图书的大标题，中标题，以及小标题。
2.通过跳转小标题链接进行抓取图书的图片，以及链接

(二) 代码实现

1.前期准备

创建scrapy 项目
。scrapy startproject xxx
。cd xxx
。scrapy genspider dangdang dangdang.com
(进入爬虫文件后先去改 start_urls 起始url)

2.分析页面

a.大分类

在 div class="con flq_body"下面 class=“level_one” dl dt span文本
注意 span标签没有 (在源码中没有) 有的大标题是多个通过//找到所有用extract()

b.中分类

中分类的内容都在div class=“inner_dl” dt 中分类注意中分类 dt标签下面也有一个a标签所有这里也用//找文本数据

c. 小分类

dd a 注意 span标签没有 (在源码中没有)

d.图片，图书名称
小分类的页面结构图书的名字以及图片的src 所有的数据都是在ul标签里面。
下面的li 图片的src a img @src 图书的名字 p a text() 注意在网页的源码当
中 src =‘images/model/guan/url_none.png’ 我们要找的是 data-original 注意网页源码和elements选项当中的数据

总结：
1.先爬取所有大标签，通过遍历，在从中爬取中标签，以及小标签
2.通过一层层遍历进行反复爬取
3.然后通过小标签进行跳转到另一个函数，将爬取图片，与数名的任务交给另一个函数

注意
在网页的源码当中 src = ‘images/model/guan/url_none.png’ 我们要找的是 data-original
注意网页源码和 elements选项当中的数据

3.实现逻辑

看代码

4.改写程序

改写爬虫文件
1.1 导入模块
1.2 继承类
1.3 把start_urs --> redis_key
改写配置文件

5.代码（改写后）

爬虫文件

import scrapy
from copy import deepcopy
from scrapy_redis.spiders import RedisSpider
import redis


'''
改写成分布式
1. 改写爬虫文件
1.1 导入模块
1.2 继承类
1.3 把start_urs --> redis_key
2. 改写配置文件
'''

# https://pypi.douban.com/simple
class DangdangSpider(RedisSpider):
    name = 'dangdang'
    allowed_domains = ['dangdang.com']
    #start_urls = ['http://book.dangdang.com/']
    redis_key = 'dangdang'

    def parse(self, response):
        div_list=response.xpath('//div[@class="con flq_body"]/div')
        for div in div_list:
            item={}
            item['b_cate']=div.xpath('./dl/dt//text()').extract()
            item['b_cate']=[i.strip() for i in item['b_cate'] if len(i.strip())>0]

            # if  item['b_cate']:
            #     item['b_cate'] = [i.strip() for i in item['b_cate'] if len(i.strip()) > 0]


            dl_list=div.xpath('.//dl[@class="inner_dl"]')
            for dl in dl_list:
                #获取中分类
                item['m_cate']=dl.xpath('./dt//text()').extract()
                item['m_cate'] = [i.strip() for i in item['m_cate'] if len(i.strip()) > 0]


                # 获取小分类
                a_list=dl.xpath('./dd/a')
                for a in a_list:
                    item['s_cate']=a.xpath('./text()').extract_first()
                    #获取列表页面url
                    item['s_href'] = a.xpath('./@href').extract_first()
                    if item['s_href'] is not None:
                        yield scrapy.Request(
                            url=item['s_href'],
                            callback=self.parse_book_list,
                            meta={"item":deepcopy(item)}
                        )
                    print(item)

    def parse_book_list(self,response):
        item=response.meta.get('item')
        li_list=response.xpath('//ul[@class="list_aa "]/li')
        for li in li_list:
            #获取图片的url
            item['book_img'] = li.xpath('./a[@class="img"]/img/@src').extract_first()
            if item['book_img']=="images/model/guan/url_none.png": #当前页源码与网页源码不同，需要以网页源码为准
                item['book_img']=li.xpath('./a[@class="img"]/img/@data-original').extract_first()
            #获取图片的名字

            item['book_name']=li.xpath('./p[@class="name"]/a/@title').extract_first()
            #print(item)
            yield item

setting页面

# -*- coding: utf-8 -*-

# Scrapy settings for book project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'book'

SPIDER_MODULES = ['book.spiders']
NEWSPIDER_MODULE = 'book.spiders'
# LOG_LEVEL = 'WARNING'


# 去重过滤
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 指定Scheduler队列
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'book (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'book.middlewares.BookSpiderMiddleware': 543,
#}

REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'book.middlewares.BookDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'book.pipelines.BookPipeline': 300,
   'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

执行页面

from scrapy import cmdline
cmdline.execute(['scrapy','crawl','dangdang'])

6.后续内容

1.此时爬虫程序进入一个在阻塞状态

2.我们需要在redis中 PUSH redis_key start_url，这时阻塞状态解除。

无痕的雨

关注

0
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
Scrapy与redis的结合（Scrapy 分布式）

一，Scrapy-分布式（1）什么是scrapy_redisscrapy_redis:Redis-based components for scrapygithub地址:https://github.com/rmax/scrapy-redis（2）Scrapy和Scrapy-redis 有什么区别？1.Scrapy是爬虫的一个框架爬取效率非常高具有高度的可定制性不支持分布式2.Scrapy-redis 它是基于redis数据库运行在scrapy框架之上的一个组件
复制链接

扫一扫