Preface
2021.08.30 — Chapter 9 is fairly short
Chapter 9
Incremental Crawlers
1. Key points
- Incremental crawler
    - Concept: monitor a website for data updates and crawl only the data the site has newly published.
    - Approach:
        - Specify a start url
        - Use CrawlSpider to extract the links to the other page numbers
        - Request those page links through a Rule
        - From the source of each page, parse out the detail-page url of every movie
        - Core: check whether a detail-page url has been requested before
            - Store the urls of the detail pages that have already been crawled
                - Store them in a Redis set
        - Request each detail-page url and parse out the movie's name and synopsis
        - Persist the data
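The core check above relies on the return value of Redis `SADD`: redis-py's `conn.sadd('urls', url)` returns 1 when the url was newly added to the set and 0 when it was already there. A minimal sketch of that dedup logic, using an in-memory Python set as a stand-in for Redis so it runs without a server (`should_crawl` is a hypothetical helper, not part of the notes' code):

```python
# A plain Python set stands in for the Redis set 'urls'; the return values
# mirror redis-py's sadd: 1 = newly added (crawl it), 0 = already seen (skip).
seen = set()

def should_crawl(url):
    """Return 1 the first time a url is seen, 0 on every later call."""
    if url in seen:
        return 0  # already crawled -> no new data
    seen.add(url)
    return 1  # new url -> crawl it

first = should_crawl('https://www.4567tv.tv/frim/index1.html')
second = should_crawl('https://www.4567tv.tv/frim/index1.html')
print(first, second)
```

In the real spider the set lives in Redis rather than in process memory, so the "already crawled" state survives restarts — that persistence is what makes the crawler incremental.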
2. Code
The code is as follows:
- movie.py
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from ..items import MovieproItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']

    rules = (
        Rule(LinkExtractor(allow=r'/frim/index1-\d+\.html'), callback='parse_item', follow=True),
    )

    # Create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    # Parse the detail-page url of every movie on each page
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            # Get the detail-page url
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div/a/@href').extract_first()
            # Add it to the Redis set: sadd returns 1 if the url is new, 0 if already stored
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                print('This url has not been crawled yet; crawling it now')
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)
            else:
                print('No new data; nothing to crawl')

    # Parse the movie name and synopsis from the detail page, then persist them
    def parse_detail(self, response):
        item = MovieproItem()
        item['name'] = response.xpath('/html/body/div[1]/div/div/div/div[2]/h1/text()').extract_first()
        item['desc'] = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]//text()').extract()
        item['desc'] = ''.join(item['desc'])
        yield item
```
- pipelines.py
```python
import json


class MovieproPipeline:
    conn = None

    def open_spider(self, spider):
        # Reuse the Redis connection created on the spider
        self.conn = spider.conn

    def process_item(self, item, spider):
        dic = {
            'name': item['name'],
            'desc': item['desc']
        }
        print(dic)
        # lpush only accepts bytes/str/numbers, so serialize the dict first
        self.conn.lpush('movieData', json.dumps(dic, ensure_ascii=False))
        return item
```
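One detail worth noting about the persistence step: redis-py's `lpush` accepts only bytes, strings, and numbers, not dicts, so each item has to be serialized (e.g. with `json.dumps`) before it is pushed, and deserialized when read back. A minimal sketch of that round trip, using a plain Python list in place of the Redis `movieData` list so it runs without a server (`lpush_item`/`read_items` are hypothetical helpers):

```python
import json

stored = []  # stands in for the Redis list 'movieData'

def lpush_item(item):
    # Serialize the dict the way the pipeline must before calling lpush
    stored.insert(0, json.dumps(item, ensure_ascii=False))

def read_items():
    # Reverse of the write path: deserialize every stored entry
    return [json.loads(s) for s in stored]

lpush_item({'name': 'Example Movie', 'desc': 'A short synopsis.'})
print(read_items())
```

Against a real server, `conn.lrange('movieData', 0, -1)` followed by `json.loads` on each element recovers the original dicts the same way.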
- items.py
```python
import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    desc = scrapy.Field()
```
- settings.py
The usual setup.
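The notes do not spell out what the "usual setup" is. In this style of Scrapy exercise it typically means the following `settings.py` entries; every value below is an assumption, not taken from the original notes:

```python
# settings.py -- a hedged sketch of the "usual setup"; the bot name,
# user-agent string, and pipeline path are assumptions for illustration.
BOT_NAME = 'moviePro'

# Present a regular browser user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Ignore robots.txt for the exercise
ROBOTSTXT_OBEY = False

# Show only errors so the print() output in the spider stays readable
LOG_LEVEL = 'ERROR'

# Enable the pipeline defined in pipelines.py
ITEM_PIPELINES = {
    'moviePro.pipelines.MovieproPipeline': 300,
}
```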