Incremental crawling based on data fingerprints (scraping Qiushibaike articles)
Detailed steps:
- Create the spider project
- cd into the newly created qbArticle folder
- scrapy startproject maomao (maomao is the project name)
- cd maomao
- scrapy genspider qb www.baidu.com
After the project is created, write the spider logic in qb.py:

```python
# hashlib is part of the standard library, so import it at the top
import hashlib

import scrapy
from redis import Redis

from ..items import MaomaoItem


class QSpider(scrapy.Spider):
    name = 'qb'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.qiushibaike.com/text/']
    conn = Redis(host='127.0.0.1', port=6379)

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            # Build the data fingerprint with hashlib's md5
            # (the 32-character digest takes far less memory than the raw content)
            fp = hashlib.md5(content.encode('utf-8')).hexdigest()
            # sadd returns 1 if the fingerprint is new, 0 if it is already in the set
            ret = self.conn.sadd('fp', fp)
            if ret:
                item = MaomaoItem()
                item['content'] = content
                yield item
                print('New data found....')
            else:
                print('No new data!!!!')
```
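The dedup logic above hinges on Redis's `sadd`: it returns 1 when the fingerprint is new to the set and 0 when it already exists, so only unseen content is yielded as an item. The same idea can be sketched without a running Redis server by standing in a plain Python set for the `fp` key (the helper name and sample strings here are illustrative, not part of the project):

```python
import hashlib

seen = set()  # stands in for the Redis set stored under the key 'fp'

def add_fingerprint(content: str) -> int:
    """Mimic conn.sadd('fp', fp): return 1 if new, 0 if already seen."""
    fp = hashlib.md5(content.encode('utf-8')).hexdigest()
    if fp in seen:
        return 0
    seen.add(fp)
    return 1

print(add_fingerprint('funny joke #1'))  # 1: first time seen, the spider would yield an item
print(add_fingerprint('funny joke #1'))  # 0: duplicate, skipped
print(add_fingerprint('funny joke #2'))  # 1: new content
```

Because the md5 digest is always 32 hex characters regardless of article length, the fingerprint set stays small even as the crawled content grows.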
items.py:

```python
import scrapy


class MaomaoItem(scrapy.Item):
    # define the fields for your item here like:
    content = scrapy.Field()
```
settings.py:

```python
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'maomao.pipelines.MaomaoPipeline': 300,
}
```
Run it by cd-ing into the maomao project directory:

```shell
(env_workspace001) bogon:maomao edward-h$ scrapy crawl qb
```
If the data needs to be persisted to MongoDB, write the corresponding code in pipelines.py; that is omitted here.
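For reference, a minimal sketch of what such a pipeline could look like. The database name `maomao` and collection name `articles` are assumptions, and pymongo is imported inside `open_spider` so the module loads even where pymongo is not installed:

```python
class MaomaoPipeline:
    """Sketch: persist each item's content into MongoDB (names assumed)."""

    def open_spider(self, spider):
        # Imported lazily; requires a MongoDB server and the pymongo package
        from pymongo import MongoClient
        self.client = MongoClient('127.0.0.1', 27017)
        # 'maomao' database and 'articles' collection are illustrative names
        self.collection = self.client['maomao']['articles']

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        self.client.close()
```

Registering it under `ITEM_PIPELINES` (as shown in settings.py above) is enough for Scrapy to call these hooks automatically.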