2021SC@SDUSC
In the previous nine installments of this series we surveyed the overall framework of the Scrapy source code. Now that the basic structure of a Scrapy crawler project is clear, I plan to work through the official documentation and analyze the code of a few prominent parts of Scrapy in more depth.
First up is Scrapy's Item Pipeline.
The official documentation defines an item pipeline component as follows:
Each item pipeline component (sometimes referred as just “Item Pipeline”) is a Python class that implements a simple method. They receive an item and perform an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.
Typical uses of item pipelines are:
cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database
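As a concrete illustration of the "checking for duplicates" use above, here is a minimal sketch in the spirit of the documentation's DuplicatesPipeline example; it assumes every item carries a unique 'id' field:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()  # ids of all items seen so far

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            # an item with this id was already processed: discard the repeat
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
            return item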
Source code analysis and usage examples:
Each item pipeline component is a Python class that must implement the following method:
process_item(self, item, spider)
This method is called for every item pipeline component.
item is an item object, see Supporting All Item Types.
process_item() must either: return an item object, return a Deferred or raise a DropItem exception.
Dropped items are no longer processed by further pipeline components.
Parameters
item (item object) – the scraped item
spider (Spider object) – the spider which scraped the item
Additionally, they may also implement the following methods:
open_spider(self, spider)
This method is called when the spider is opened.
Parameters
spider (Spider object) – the spider which was opened
close_spider(self, spider)
This method is called when the spider is closed.
Parameters
spider (Spider object) – the spider which was closed
from_crawler(cls, crawler)
If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. Crawler object provides access to all Scrapy core components like settings and signals; it is a way for pipeline to access them and hook its functionality into Scrapy.
Parameters
crawler (Crawler object) – crawler that uses this pipeline
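To make the open_spider/close_spider hooks concrete, here is a minimal sketch modeled on the documentation's JsonWriterPipeline: it opens a JSON-lines file when the spider starts and closes it when the spider finishes. The file name items.jsonl is an arbitrary choice for this example.

import json
from itemadapter import ItemAdapter

class JsonWriterPipeline:
    def open_spider(self, spider):
        # acquire the resource once, when the spider is opened
        self.file = open('items.jsonl', 'w')

    def close_spider(self, spider):
        # release it when the spider is closed
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item  # pass the item on to later pipeline components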
From the documentation quoted above we can see:
process_item(self, item, spider) is called by every item pipeline component, for every item. It must either return an item object, return a Deferred, or raise a DropItem exception; a dropped item is not processed by any further pipeline components. Its parameters are the scraped item and the spider that scraped it.
In addition, a pipeline component may implement open_spider(self, spider), which is called when the spider is opened, and close_spider(self, spider), which is called when the spider is closed. Both receive the spider in question, which makes them natural places to acquire and release per-spider resources such as file handles or database connections.
Finally, if from_crawler(cls, crawler) is present, this classmethod is called to create the pipeline instance from a Crawler, and it must return a new instance of the pipeline. Since the Crawler object provides access to all Scrapy core components, such as settings and signals, this is how a pipeline reaches them and hooks its own functionality into Scrapy.
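Here is a sketch of from_crawler in action, modeled on the documentation's MongoPipeline example: the classmethod reads connection parameters from the crawler's settings (MONGO_URI and MONGO_DATABASE are setting names assumed to be defined in the project's settings.py) and uses open_spider/close_spider to manage the database connection:

import pymongo
from itemadapter import ItemAdapter

class MongoPipeline:
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # build the pipeline instance from the crawler's settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item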
Following the technical documentation, I also wrote an example of my own:
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class PricePipeline:
    vat_factor = 1.15  # multiplier used to add VAT to net prices

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get('price'):
            # price was scraped without VAT: add it in
            if adapter.get('price_excludes_vat'):
                adapter['price'] = adapter['price'] * self.vat_factor
            return item
        else:
            raise DropItem(f"Missing price in {item}")  # no price: drop the item
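For the pipeline to run at all, it must be enabled in the project settings via the ITEM_PIPELINES setting; the integer value (conventionally 0-1000) determines the order in which components run, lower values first. The module path myproject.pipelines below is a hypothetical one for this example:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,  # lower numbers run earlier
}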