Python Incremental Crawlers - 我用python写Bug - 博客园

weixin_39679061

Tags: Python incremental crawler

Article link: https://blog.csdn.net/weixin_39679061/article/details/111439627

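I. Introduction

1. Motivation

Suppose you crawl a novel website: on the first day you download every novel on the site and store it. A month later the site has published a few new novels and you crawl it again. Without an incremental crawler, your program downloads every novel on the site a second time, when all you actually need is the newly added ones. Fetching only what is new is the job of an incremental crawler.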

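2. Incremental crawlers

1) Concept: the crawler monitors a website for updates so that it can fetch only the data the site has newly published.

2) How to crawl incrementally: check for duplicates at one or more of these three points:

Before sending a request, check whether the URL has been crawled before.

After parsing the response, check whether the content has been crawled before.

When writing to the storage medium, check whether the content already exists there.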
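The three checkpoints can be sketched without any crawling framework. This is only an illustration under my own assumptions: `crawl_incrementally`, the in-memory sets, and the `pages` dictionary standing in for a website are all made up for the example, and `fetch` replaces the real HTTP request and parsing.

```python
from typing import Callable, Iterable, List, Set


def crawl_incrementally(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    seen_urls: Set[str],
    seen_content: Set[str],
    storage: List[str],
) -> List[str]:
    """Run one crawl pass and return the records stored during this pass."""
    newly_stored = []
    for url in urls:
        # Checkpoint 1: skip URLs that were already requested in an earlier pass.
        if url in seen_urls:
            continue
        seen_urls.add(url)

        content = fetch(url)  # stands in for the HTTP request plus parsing

        # Checkpoint 2: skip content that was already parsed from another page.
        if content in seen_content:
            continue
        seen_content.add(content)

        # Checkpoint 3: last check before writing to the storage medium.
        if content in storage:
            continue
        storage.append(content)
        newly_stored.append(content)
    return newly_stored


# Hypothetical site: two of the URLs serve the same text.
pages = {"/a": "chapter 1", "/b": "chapter 2", "/c": "chapter 1"}
seen_urls, seen_content, storage = set(), set(), []

first_pass = crawl_incrementally(pages, pages.__getitem__, seen_urls, seen_content, storage)
pages["/d"] = "chapter 3"  # the site publishes a new page
second_pass = crawl_incrementally(pages, pages.__getitem__, seen_urls, seen_content, storage)

print(first_pass)   # ['chapter 1', 'chapter 2']
print(second_pass)  # ['chapter 3']
```

The second pass requests only `/d`, because the other three URLs are stopped at the first checkpoint.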

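3) Analysis

The core of incremental crawling is deduplication. Each step at which deduplication can happen has its own trade-offs.

In my view, you should pick one of the first two approaches according to the situation (or use both).

The first approach suits sites where new pages keep appearing, such as new chapters of a novel or each day's latest news.

The second approach suits sites where the content of existing pages gets updated.

The third approach acts as a last line of defense; applying it as well removes as many duplicates as possible.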
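For the second approach, where a known URL may serve changed content, one possible sketch is to remember a hash of each page body and compare it on the next crawl. The function name `page_status`, the `fingerprints` dictionary and the sample URL are assumptions made for illustration; a real crawler would keep the mapping in a persistent store.

```python
import hashlib
from typing import Dict


def page_status(url: str, body: str, fingerprints: Dict[str, str]) -> str:
    """Classify a fetched page as 'new', 'updated' or 'unchanged'.

    `fingerprints` maps each URL to the SHA-256 digest of the body seen on the
    previous crawl and is updated in place.
    """
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    previous = fingerprints.get(url)
    fingerprints[url] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "updated"


fingerprints: Dict[str, str] = {}
status_first = page_status("/news/today", "headline v1", fingerprints)
status_same = page_status("/news/today", "headline v1", fingerprints)
status_changed = page_status("/news/today", "headline v2", fingerprints)
print(status_first, status_same, status_changed)  # new unchanged updated
```

A pure URL check would have skipped the third fetch entirely, which is why URL-level deduplication alone is not enough for pages that change in place.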

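4) Deduplication methods

(a) Store the URLs produced during a crawl in a Redis set. On the next crawl, before sending a request, check whether its URL is already in that set: if it is, skip the request; otherwise, send it.

(b) Derive a unique identifier for each piece of crawled page content and store that identifier in a Redis set. On the next crawl, before persisting a piece of data, check whether its identifier is already in the set, and decide on that basis whether to persist it.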
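Both methods come down to one Redis command: `SADD` returns the number of members that were actually added, so a return value of 1 for a single value means "not seen before". The sketch below shows that check. `FakeRedisSet` is a hypothetical in-memory stand-in I wrote so the example runs without a Redis server, and `is_new` is an illustrative helper name; with redis-py you would pass a `redis.Redis` connection instead.

```python
class FakeRedisSet:
    """In-memory stand-in for the one Redis command used here (hypothetical)."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        # Like Redis SADD: return how many of the values were not already members.
        members = self._sets.setdefault(key, set())
        added = 0
        for value in values:
            if value not in members:
                members.add(value)
                added += 1
        return added


def is_new(conn, key, value):
    """Record `value` in the set `key`; return True only the first time it is seen."""
    return conn.sadd(key, value) == 1


# For a real server, swap in: conn = redis.Redis(host='127.0.0.1', port=6379)
conn = FakeRedisSet()
results = [is_new(conn, "urls", url) for url in ["/movie/1", "/movie/2", "/movie/1"]]
print(results)  # [True, True, False]
```

Because the membership test and the insertion happen in a single command, there is no gap between "check" and "add" in which another worker could slip in the same value.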

II. Project examples

1. Crawl the title and release year of every comedy film on the 4567tv site

1. Spider file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from moviePro.items import MovieproItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/id/6/page/23.html']

    rules = (
        Rule(LinkExtractor(allow=r'/index.php/vod/show/id/6/page/\d+.html'), callback='parse_item', follow=True),
    )

    # Create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        li_list = response.xpath('//li[@class="col-md-6 col-sm-4 col-xs-3"]')
        for li in li_list:
            # Get the URL of the detail page
            detail_url = 'http://www.4567tv.tv' + li.xpath('./div/a/@href').extract_first()
            # Add the detail-page URL to the Redis set 'urls'.
            # sadd returns 1 if the URL was newly added, 0 if it was already in the set.
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                print('This URL has not been crawled yet; crawling it')
                yield scrapy.Request(url=detail_url, callback=self.parst_detail)
            else:
                print('No update yet: no new data to crawl')

    # Parse the film title and year from the detail page for persistent storage
    def parst_detail(self, response):
        item = MovieproItem()
        item['title'] = response.xpath('//div[@class="stui-content__detail"]/h3[@class="title"]/text()').extract_first()
        item['year'] = response.xpath('//div[@class="stui-content__detail"]/p[1]/a[2]/@href').extract_first()
        yield item

2. items.py

import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    year = scrapy.Field()

3. pipelines.py

import json

from redis import Redis


class MovieproPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'title': item['title'],
            'year': item['year'],
        }
        print(dic)
        # A Redis list entry must be a string or number, so serialize the dict to JSON
        self.conn.lpush('movieData', json.dumps(dic, ensure_ascii=False))
        return item
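One detail worth spelling out about the pipeline: entries in a Redis list are strings, so a dict has to be serialized before `lpush` (as far as I know, redis-py 3.0 and later raise a `DataError` when handed a dict directly). Below is a minimal sketch of a JSON round trip. The helper names `encode_record` and `decode_record` and the sample record are made up for illustration.

```python
import json


def encode_record(record: dict) -> str:
    """Serialize a record to a JSON string suitable for a Redis list entry."""
    # ensure_ascii=False keeps non-ASCII text (e.g. Chinese titles) readable in Redis.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def decode_record(raw) -> dict:
    """Parse an entry read back from Redis, which returns bytes by default."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)


record = {"title": "示例电影", "year": "2019"}  # made-up sample data
encoded = encode_record(record)
print(encoded)  # {"title": "示例电影", "year": "2019"}

# Simulate what a read such as lrange would hand back: raw UTF-8 bytes.
restored = decode_record(encoded.encode("utf-8"))
print(restored == record)  # True
```

JSON is only one choice here; the point is that whatever goes into the list should be something you can parse back unambiguously later.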

2. Crawl the jokes and their authors from 糗事百科 (Qiushibaike)

1. Spider file

# -*- coding: utf-8 -*-
import hashlib

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from qiubaiZ.items import QiubaizItem
from redis import Redis


class QiubaiSpider(CrawlSpider):
    name = 'qiubaiz'
    # allowed_domains = ['https://www.qiushibaike.com/text/']
    start_urls = ['https://www.qiushibaike.com/text/']

    rules = (
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'/text/$'), callback='parse_item', follow=True),
    )

    # Create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            item = QiubaizItem()
            # Extract the author; fall back to a default for anonymous posts
            author = div.xpath('.//div[@class="author clearfix"]/a/h2/text()')
            if author:
                author = author[0].extract()
            else:
                author = "匿名用户"  # "anonymous user"

            # Extract the text of this user's joke.
            # Each <br> line break yields a separate Selector object.
            contents = div.xpath('.//div[@class="content"]/span/text()')
            content = ''.join([selector.extract().strip() for selector in contents])

            item['author'] = author
            item['content'] = content

            # Concatenate the fields that identify this record
            source = item['author'] + item['content']
            # Hash the parsed data into a unique identifier for storage in Redis
            source_id = hashlib.sha256(source.encode()).hexdigest()
            # Add the identifier to the Redis set 'data_id'
            ex = self.conn.sadd('data_id', source_id)
            if ex == 1:
                print('This record has not been crawled yet; crawling it')
                yield item
            else:
                print('This record has already been crawled; skipping it')

2. items.py

import scrapy


class QiubaizItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()
    content = scrapy.Field()

3. pipelines.py

import json

from redis import Redis


class QiubaizPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content'],
        }
        # A Redis list entry must be a string or number, so serialize the dict to JSON
        self.conn.lpush('qiubaizData', json.dumps(dic, ensure_ascii=False))
        return item
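A small caveat about the identifier in this spider: it hashes the plain concatenation of author and content, so two different records can produce the same input string when the boundary between the fields shifts. A possible refinement, sketched below with a hypothetical helper named `fingerprint`, is to hash each field together with its length so that the field boundaries become part of the digest. This is my own variation, not something the original project does.

```python
import hashlib


def fingerprint(*fields: str) -> str:
    """Hash several text fields into one hex digest, keeping field boundaries."""
    h = hashlib.sha256()
    for field in fields:
        data = field.encode("utf-8")
        # Prefix each field with its byte length so ("ab", "c") != ("a", "bc").
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


naive_a = hashlib.sha256(("ab" + "c").encode()).hexdigest()
naive_b = hashlib.sha256(("a" + "bc").encode()).hexdigest()
print(naive_a == naive_b)  # True: plain concatenation cannot tell these apart

print(fingerprint("ab", "c") == fingerprint("a", "bc"))  # False
print(fingerprint("ab", "c") == fingerprint("ab", "c"))  # True
```

For short jokes with real author names the collision is unlikely in practice, but the length prefix costs almost nothing and removes the ambiguity.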
