Learning Scrapy, Part 2: Architecture and a Simple Example

1. Scrapy Architecture

Reference: https://docs.scrapy.org/en/latest/topics/architecture.html#data-flow

The architecture diagram is shown below:

[Figure: Scrapy architecture]

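As the diagram shows, Scrapy is built from components. Each component implements one specific function, and the components are independent and loosely coupled, so in large-scale applications it should be possible to deploy them in a distributed fashion.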

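The red arrows represent the data flow; everything else is a component. For now, the goal is a rough picture of which components Scrapy contains and how data moves between them. Later study will deepen this understanding.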

2. A Simple Scrapy Example

Reference: https://docs.scrapy.org/en/latest/intro/overview.html#walk-through-of-an-example-spider

The code is as follows:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

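Let's walk through this code. The parent class of QuotesSpider is scrapy.Spider. Every subclass of scrapy.Spider is treated as a spider, so this code corresponds to "SPIDERS" in the architecture diagram above.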

"SPIDERS" is plural because there can be more than one of them; the name member of QuotesSpider is the unique identifier of this particular spider.

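start_urls holds the starting URLs. The scrapy.Spider class has a method with a default implementation (start_requests in Scrapy 1.7, the version used here) that builds a request for each entry in start_urls. By default, the response produced by each of these requests is handled by the class's parse method; in other words, parse is the callback.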

Execution then proceeds through the steps shown in the architecture diagram above.

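When the ENGINE finds that the SCHEDULER queue has no pending requests left and every response has been processed by the spider's parse method, so that no new request can enter the queue, the ENGINE notifies the spider that the job is finished and the whole run ends.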
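The loop described above can be sketched in plain Python. This is a toy model and not Scrapy's actual implementation: FAKE_SITE, run, and the tuple-based "requests" are all made up for illustration, and the dictionary lookup merely stands in for the DOWNLOADER.

```python
from collections import deque

# Made-up pages standing in for real HTTP responses (illustration only).
FAKE_SITE = {
    "/tag/humor/": {"quotes": ["q1", "q2"], "next": "/tag/humor/page/2/"},
    "/tag/humor/page/2/": {"quotes": ["q3"], "next": None},
}


def parse(response):
    """Callback: yield items (dicts) and follow-up requests as (url, callback)."""
    for text in response["quotes"]:
        yield {"text": text}
    if response["next"] is not None:
        yield (response["next"], parse)


def run(start_urls):
    """Toy engine loop: drain the scheduler queue, then stop."""
    # Seed the queue from start_urls, with parse as the default callback.
    scheduler = deque((url, parse) for url in start_urls)
    items = []
    # An empty queue means no pending requests and no unprocessed responses.
    while scheduler:
        url, callback = scheduler.popleft()
        response = FAKE_SITE[url]  # stands in for the DOWNLOADER
        for result in callback(response):
            if isinstance(result, dict):
                items.append(result)      # an item: collect it
            else:
                scheduler.append(result)  # a new request: back into the queue
    return items


if __name__ == "__main__":
    print(run(["/tag/humor/"]))
```

The real engine is asynchronous and handles many requests concurrently, but the stopping condition is the same idea: nothing left in the queue and nothing still being processed.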

Testing the code above

Start the Scrapy Docker image built earlier with the following command:

docker run -it --name scrapy-test scrapy-clear /bin/sh

Here scrapy-clear is the name of the Scrapy image I built.

Once the container is running, create a temporary test directory such as scrapy-test, change into it, create a new file named quotes_spider.py, and copy the code above into that file.

Finally, run the following command:

scrapy runspider quotes_spider.py -o quotes.json

It prints the following log:

/scrapy-test # scrapy runspider quotes_spider.py -o quotes.json
2019-07-24 07:21:40 [scrapy.utils.log] INFO: Scrapy 1.7.1 started (bot: scrapybot)
2019-07-24 07:21:40 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (default, May  3 2019, 11:24:39) - [GCC 8.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Linux-4.4.0-116-generic-x86_64-with
2019-07-24 07:21:40 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'quotes.json', 'SPIDER_LOADER_WARN_ONLY': True}
2019-07-24 07:21:40 [scrapy.extensions.telnet] INFO: Telnet Password: ee6cc9f3a24c449d
2019-07-24 07:21:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-07-24 07:21:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-07-24 07:21:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-07-24 07:21:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-07-24 07:21:40 [scrapy.core.engine] INFO: Spider opened
2019-07-24 07:21:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-24 07:21:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-24 07:21:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/humor/> (referer: None)
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”', 'author': 'Garrison Keillor'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”', 'author': 'Jim Henson'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': "“All you need is love. But a little chocolate now and then doesn't hurt.”", 'author': 'Charles M. Schulz'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': "“Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”", 'author': 'Suzanne Collins'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“Some people never go crazy. What truly horrible lives they must lead.”', 'author': 'Charles Bukowski'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”', 'author': 'Terry Pratchett'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”', 'author': 'Dr. Seuss'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“The reason I talk to myself is because I’m the only one whose answers I accept.”', 'author': 'George Carlin'}
2019-07-24 07:21:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/humor/page/2/> (referer: http://quotes.toscrape.com/tag/humor/)
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/page/2/>
{'text': '“I am free of all prejudice. I hate everyone equally. ”', 'author': 'W.C. Fields'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/page/2/>
{'text': "“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”", 'author': 'Jane Austen'}
2019-07-24 07:21:41 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-24 07:21:42 [scrapy.extensions.feedexport] INFO: Stored json feed (12 items) in: quotes.json
2019-07-24 07:21:42 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 511,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 3725,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 1.322739,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 7, 24, 7, 21, 42, 3560),
 'item_scraped_count': 12,
 'log_count/DEBUG': 14,
 'log_count/INFO': 11,
 'memusage/max': 46931968,
 'memusage/startup': 46931968,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 7, 24, 7, 21, 40, 680821)}
2019-07-24 07:21:42 [scrapy.core.engine] INFO: Spider closed (finished)
/scrapy-test #

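This log should be very valuable while learning Scrapy. Every record contains a tag such as [scrapy.core.engine], which should correspond to the components in the architecture diagram above. From the log one can roughly see how the components interact, how data flows, and what each component did.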
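One way to get an overview of which part of Scrapy wrote each record is to count the bracketed tags. The sketch below is a small helper of my own (LOG_LINE and count_loggers are not part of Scrapy), and it assumes the default log format seen above. The sample lines are copied from the log.

```python
import re
from collections import Counter

# Matches "<date> <time> [<logger name>] <LEVEL>: ..." as seen in the log above.
LOG_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[([\w.]+)\] (\w+): ")


def count_loggers(lines):
    """Count log records per logger name, skipping continuation lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "2019-07-24 07:21:40 [scrapy.core.engine] INFO: Spider opened",
    "2019-07-24 07:21:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/humor/> (referer: None)",
    "2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>",
    "{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin'}",
]

if __name__ == "__main__":
    for logger_name, count in count_loggers(sample).items():
        print(logger_name, count)
```

The tag is the name of the Python logger, which normally follows the module path, so it maps onto the boxes in the diagram only approximately.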

The contents of quotes.json are as follows:

/scrapy-test # cat quotes.json
[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]/scrapy-test #
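The \u201c and \u201d sequences in the file are JSON escapes for the curly quotation marks, not corrupted data. As a minimal sketch, the standard json module decodes them back; the two records below are copied from the output above, and the variable names are my own.

```python
import json
from collections import Counter

# Two records copied from quotes.json; the raw string keeps the \u escapes as written.
raw = r'''[
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"}
]'''

quotes = json.loads(raw)                    # escapes become real characters
first_text = quotes[0]["text"]
authors = Counter(q["author"] for q in quotes)

# ensure_ascii=False writes the curly quotes literally instead of re-escaping them.
readable = json.dumps(quotes, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    print(first_text)
    print(readable)
```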

The scrapy runspider command simply looks for a subclass of scrapy.Spider in the given file and runs it.

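The architecture diagram above contains many components; it is a complex system. This example did not touch on how those components are configured, so presumably everything here uses the default configuration, with all components running on a single host.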

A complex project will inevitably involve many settings that define how all of the components work.
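As a rough idea of what such configuration looks like, a Scrapy project normally keeps its settings as module-level names in a settings.py file. The fragment below is only a sketch: the setting names are real Scrapy settings, but the values, the project name quotesbot, and the pipeline class path are hypothetical.

```python
# settings.py (sketch): Scrapy reads these module-level names as settings.

BOT_NAME = "quotesbot"           # hypothetical project name

ROBOTSTXT_OBEY = True            # respect robots.txt before crawling
CONCURRENT_REQUESTS = 8          # upper bound on parallel requests
DOWNLOAD_DELAY = 0.5             # seconds to wait between requests to a site
FEED_EXPORT_ENCODING = "utf-8"   # write feeds as UTF-8 rather than \u escapes

# The log above shows "Enabled item pipelines: []"; a project would list
# its pipelines here. The class path is hypothetical; the number sets the order.
ITEM_PIPELINES = {
    "quotesbot.pipelines.CleanQuotePipeline": 300,
}
```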
