Crawler (27): A scrapy_redis example

Chapter 24: A scrapy_redis example

1. Analyzing the settings file

The project's configuration file differs in a few places from the plain-Scrapy settings files we have used before.

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

This setting swaps in scrapy-redis's duplicate filter; its deduplication logic lives in dupefilter.py. In PyCharm you can find the file under External Libraries > site-packages > scrapy_redis > dupefilter.py.
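Under the hood, RFPDupeFilter reduces every request to a fingerprint (a SHA1 hash over its method, URL and body) and stores the fingerprints in a Redis set that all crawler processes share. The sketch below only illustrates that idea and is not the library's actual code; the key name dmoz:dupefilter follows scrapy-redis's <spider>:dupefilter naming convention, and request_fingerprint is Scrapy's stock fingerprint helper (still available in Scrapy 2.4, the version used below).

import redis
from scrapy import Request
from scrapy.utils.request import request_fingerprint

# Simplified illustration of RFPDupeFilter: SADD the request fingerprint
# into a shared Redis set. SADD returns 0 when the member already exists,
# which is exactly how a duplicate request is detected.
server = redis.Redis(host="localhost", port=6379)

def seen_before(request):
    fp = request_fingerprint(request)            # SHA1 over method, URL, body
    return server.sadd("dmoz:dupefilter", fp) == 0

r = Request("http://www.dmoz-odp.org/")
print(seen_before(r))   # False: first time this request is seen
print(seen_before(r))   # True: the scheduler would drop it as a duplicate

Because the set lives in Redis rather than in process memory, every worker pointed at the same server deduplicates against the same fingerprints, which is what makes the crawl distributable.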

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Use the scrapy-redis scheduler: the request queue lives in Redis
SCHEDULER_PERSIST = True
# Keep the Redis queues (requests and dupefilter) when the spider closes


These two settings do the real work: the first replaces Scrapy's in-memory scheduler with the scrapy-redis one, so the request queue is kept in Redis and can be shared by several crawler processes; the second makes that data persist after the spider closes instead of being cleared (note that scrapy-redis itself defaults SCHEDULER_PERSIST to False, so the example project enables it explicitly).
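For reference, here is a minimal settings.py sketch for a scrapy-redis project. REDIS_URL is the library's standard way of locating the server; the exact values are assumptions for a local setup, not taken from the example project.

# settings.py -- minimal scrapy-redis configuration (assumed local values)
BOT_NAME = "example"
SPIDER_MODULES = ["example.spiders"]
NEWSPIDER_MODULE = "example.spiders"

# Route deduplication and scheduling through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True       # keep the queues after the spider closes

# Where the Redis server lives (localhost:6379 is also the default)
REDIS_URL = "redis://localhost:6379/0"

ITEM_PIPELINES = {
    "example.pipelines.ExamplePipeline": 300,
    "scrapy_redis.pipelines.RedisPipeline": 400,
}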

ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}



Both pipelines run, in ascending priority order (300 first, then 400); it is the second one, scrapy_redis.pipelines.RedisPipeline, that actually writes each scraped item into Redis, as sketched below.
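Conceptually, RedisPipeline just serializes each item and pushes it onto a per-spider Redis list, keyed <spider>:items by default. The following sketch illustrates that behavior under those assumptions; the real pipeline uses the project's configured serializer and connection settings.

import json
import redis

# Simplified illustration of scrapy_redis.pipelines.RedisPipeline:
# serialize the item and append it to the "<spider>:items" list.
class RedisPipelineSketch:
    def __init__(self):
        self.server = redis.Redis(host="localhost", port=6379)

    def process_item(self, item, spider):
        key = "%s:items" % spider.name          # e.g. "dmoz:items"
        self.server.rpush(key, json.dumps(dict(item)))
        return item                             # pass the item along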
Now let's run the example. cd into the project directory:


D:\work>cd D:\work\爬虫\Day26\example-project

D:\work\爬虫\Day26\example-project>scrapy crawl dmoz


Press Enter and you get the following output:

2021-04-12 10:12:35 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-04-12 10:12:35 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6rc1 (tags/v3.8.6rc1:08bd63d, Sep  7 2020, 23:10:23) [MSC v.1927 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.1, Platform Windows-10-10.0.18362-SP0
2021-04-12 10:12:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-04-12 10:12:35 [scrapy.crawler] INFO: Overridden settings:
{'DOWNLOAD_DELAY': 1,
 'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
 'NEWSPIDER_MODULE': 'example.spiders',
 'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
 'SPIDER_MODULES': ['example.spiders'],
 'USER_AGENT': 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'}
2021-04-12 10:12:35 [scrapy.extensions.telnet] INFO: Telnet Password: cfd6099a1c0f124e
2021-04-12 10:12:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-04-12 10:12:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-04-12 10:12:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-04-12 10:12:35 [scrapy.middleware] INFO: Enabled item pipelines:
['example.pipelines.ExamplePipeline', 'scrapy_redis.pipelines.RedisPipeline']
2021-04-12 10:12:35 [scrapy.core.engine] INFO: Spider opened
Unhandled error in Deferred:
2021-04-12 10:12:39 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
  File "d:\python38\lib\site-packages\scrapy\crawler.py", line 192, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "d:\python38\lib\site-packages\scrapy\crawler.py", line 196, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "d:\python38\lib\site-packages\twisted\internet\defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "d:\python38\lib\site-packages\twisted\internet\defer.py", line 1529, in _cancellableInlineCallback
s
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "d:\python38\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "d:\python38\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
redis.exceptions.ConnectionError: Error 10061 connecting to localhost:6379. No connection could be made because the target machine actively refused it.

2021-04-12 10:12:39 [twisted] CRITICAL:
Traceback (most recent call last):
  File "d:\python38\lib\site-packages\redis\connection.py", line 559, in connect
    sock = self._connect()
  File "d:\python38\lib\site-packages\redis\connection.py", line 615, in _connect
    raise err
  File "d:\python38\lib\site-packages\redis\connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:\python38\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "d:\python38\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
redis.exceptions.ConnectionError: Error 10061 connecting to localhost:6379. No connection could be made because the target machine actively refused it.



We get a connection error because the Redis server is not running yet, so let's start it:

C:\Users\MI>d:

D:\>cd D:\Download\redis-latest\redis-latest

D:\Download\redis-latest\redis-latest>redis-server
[8300] 12 Apr 10:16:55.177 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
                _._
           _.-``__ ''-._
      _.-``    `.  `_.  ''-._           Redis 3.0.503 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 8300
  `-._    `-._  `-./  _.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |           http://redis.io
  `-._    `-._`-.__.-'_.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |
  `-._    `-._`-.__.-'_.-'    _.-'
      `-._    `-.__.-'    _.-'
          `-._        _.-'
              `-.__.-'

[8300] 12 Apr 10:16:55.177 # Server started, Redis version 3.0.503
[8300] 12 Apr 10:16:55.177 * DB loaded from disk: 0.000 seconds
[8300] 12 Apr 10:16:55.177 * The server is now ready to accept connections on port 6379


Next, open a client in another terminal:

D:\>cd D:\Download\redis-latest\redis-latest

D:\Download\redis-latest\redis-latest>redis-cli
127.0.0.1:6379>

Now let's rerun the crawl command from before:
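This time the spider should be able to connect and start crawling. While it runs (or afterwards, since SCHEDULER_PERSIST is enabled), you can peek at the state scrapy-redis keeps in Redis. The key names below follow the library's default patterns (<spider>:dupefilter, <spider>:requests, <spider>:items) for a spider named dmoz:

import redis

# Inspect what scrapy-redis stores for the "dmoz" spider.
server = redis.Redis(host="localhost", port=6379)
print(server.keys("dmoz:*"))            # e.g. b'dmoz:dupefilter', b'dmoz:items'
print(server.scard("dmoz:dupefilter"))  # fingerprints of requests seen so far
print(server.llen("dmoz:items"))        # items pushed by RedisPipeline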
