1. Introduction
Looking back at the last failed attempt with Apache Nutch, that path turned out to be a dead end. An older Nutch release might have worked, but sticking to an outdated version felt awkward, so I dropped it.
The Elasticsearch and Kibana parts of that setup can still be reused, though; only Nutch needs to be replaced with Scrapy + scrapy_redis.
2. A basic Scrapy spider
Let's start from the official Scrapy example; this project is forked from scrapy/quotesbot.
My own code is at: https://github.com/gfzheng/quotesbot.git
git clone https://github.com/gfzheng/quotesbot.git
cd quotesbot
scrapy list
scrapy crawl toscrape-xpath
The official example works as expected.
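For reference, the stock toscrape-xpath spider in quotesbot is a plain Scrapy spider roughly like the one below (reproduced from memory, so treat it as a sketch rather than the exact upstream file):
# -*- coding: utf-8 -*-
import scrapy

class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Extract each quote block on the page.
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract(),
            }
        # Follow the "Next" pagination link, if present.
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))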
3. Adding distributed crawling with scrapy-redis
Installing Redis is skipped here.
Start Redis with: redis-server
Follow the example project from https://github.com/rmax/scrapy-redis (https://github.com/rmax/scrapy-redis/tree/master/example-project).
Configure scrapy-redis by adding the following to settings.py:
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
'scrapy_redis.pipelines.RedisPipeline': 300
}
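Two more scrapy-redis settings are worth knowing about. Both are optional here, since the defaults already target a local Redis; the values below are just an assumption for a single-machine setup:
# Keep the request queue and dupefilter in Redis between runs instead of clearing them on close.
SCHEDULER_PERSIST = True
# Point scrapy-redis at a specific Redis instance (the default assumes localhost:6379).
REDIS_URL = 'redis://localhost:6379'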
(Optional) Modify pipelines.py to stamp every item with crawl metadata:
from datetime import datetime

class QuotesbotPipeline(object):
    def process_item(self, item, spider):
        item["crawled"] = datetime.utcnow()
        item["spider"] = spider.name
        return item
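If you enable this optional pipeline, it also needs its own entry in ITEM_PIPELINES so that it runs before RedisPipeline stores the item. A sketch, assuming the default quotesbot package layout (the exact priority numbers are an assumption; only their relative order matters):
ITEM_PIPELINES = {
    'quotesbot.pipelines.QuotesbotPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}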
Modify the spider as follows:
# -*- coding: utf-8 -*-
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider
#from scrapy_redis.spiders import RedisSpider

class ToScrapeSpiderCrawl(RedisCrawlSpider):
    """Spider that reads urls from a redis queue (toscrape:start_urls)."""
    name = 'toscrape-crawl'
    redis_key = 'toscrape:start_urls'

    rules = (
        # follow all links
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list, e.g.:
        # scrapy runspider -a domain=quotes.toscrape.com ./spiders/toscrape-crawl.py
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(ToScrapeSpiderCrawl, self).__init__(*args, **kwargs)

    def parse_page(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract(),
                'url': response.url,
            }
4. Running the distributed spider
With the Redis server running, connect to it with redis-cli; the keys * command shows the data stored in Redis:
$ redis-cli
127.0.0.1:6379> keys *
1) "toscrape-xpath:items"
127.0.0.1:6379> exit
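The toscrape-xpath:items key is an ordinary Redis list written by RedisPipeline, so back in redis-cli the standard list commands let you peek at what has been stored, for example:
127.0.0.1:6379> llen toscrape-xpath:items
127.0.0.1:6379> lrange toscrape-xpath:items 0 0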
Use the lpush command to push the start URL onto the spider's start-URL queue:
>lpush toscrape:start_urls http://quotes.toscrape.com
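The same seeding step can also be scripted. A minimal sketch using the redis-py client (assuming Redis on its default localhost:6379):
import redis

# Connect to the local Redis instance (defaults: localhost:6379, db 0).
r = redis.Redis()
# Push the seed URL onto the list the spider reads its start URLs from.
r.lpush('toscrape:start_urls', 'http://quotes.toscrape.com')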
Now run the spider:
scrapy runspider -a domain=quotes.toscrape.com ./spiders/toscrape-crawl.py
The -a argument restricts the crawl to the given domain(s).
At this point, the distributed spider is up and running.
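Because the request queue, duplicate filter, and scraped items all live in Redis, scaling out is simply a matter of running the same command on additional machines, provided each worker's settings.py points at the shared Redis (e.g. via REDIS_URL):
# On each additional worker (hypothetical setup, each pointing at the shared Redis):
scrapy runspider -a domain=quotes.toscrape.com ./spiders/toscrape-crawl.py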
5. Full-text search with Elasticsearch
Installing and running Elasticsearch is skipped here.
To store crawled items directly in Elasticsearch, use https://github.com/knockrentals/scrapy-elasticsearch:
pip install ScrapyElasticSearch
Usage (configure settings.py). Note that the ElasticSearchPipeline entry should be merged into the existing ITEM_PIPELINES dict rather than replacing the RedisPipeline entry:
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500,
}
ELASTICSEARCH_SERVERS = ['localhost']
ELASTICSEARCH_INDEX = 'scrapy'
ELASTICSEARCH_INDEX_DATE_FORMAT = '%Y-%m'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'  # Custom unique key
After making these changes, re-run the spider.
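Before moving on to Kibana, you can confirm that items actually reached Elasticsearch with a quick query (assuming ES on its default localhost:9200; author is one of the fields from the quote items above):
curl 'http://localhost:9200/scrapy*/_search?q=author:Einstein&pretty'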
In Kibana, add an index pattern: scrapy*
Now you can search the crawled items!
6. Summary
In short, the whole crawl-and-search system consists of Scrapy, Redis, Elasticsearch, and Kibana.
The basic workflow: start all four services, lpush the start URLs into Redis, and search the crawled articles in Kibana.