2021SC@SDUSC
In the previous nine installments of this series we surveyed the overall framework of the Scrapy source code. Now that the basic structure of a Scrapy crawler project is clear, I plan to work through the official documentation and analyze the code of a few prominent parts of Scrapy in more depth.
First up is Scrapy's Item Pipeline.
The official documentation defines an item pipeline component as follows:
Each item pipeline component (sometimes referred as just “Item Pipeline”) is a Python class that implements a simple method. They receive an item and perform an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.
Typical uses of item pipelines are:
cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database
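As a concrete illustration of the "checking for duplicates" use above, here is a minimal sketch in the spirit of the documentation's DuplicatesPipeline example; it assumes every item carries a unique 'id' field:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()  # ids of all items seen so far

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            # an item with this id was already processed: discard the repeat
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
            return item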
Source code analysis and usage examples:
Each item pipeline component is a Python class that must implement the following method:
process_item(self, item, spider)
This method is called for every item pipeline component.
item is an item object, see Supporting All Item Types.
process_item() must either: return an item object, return a Deferred or raise a DropItem exception.
Dropped items are no longer processed by further pipeline components.
Parameters
item (item object) – the scraped item
spider (Spider object) – the spider which scraped the item
Additionally, they may also implement the following methods:
open_spider(self, spider)
This method is called when the spider is opened.
Parameters
spider (Spider object) – the spider which was opened
close_spider(self, spider)
This method is called when the spider is closed.
Parameters
spider (Spider object) – the spider which was closed
from_crawler(cls, crawler)
If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. Crawler object provides access to all Scrapy core components like settings and signals; it is a way for pipeline to access them and hook its functionality into Scrapy.
Parameters
crawler (Crawler object) – crawler that uses this pipeline
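To make the open_spider/close_spider hooks concrete, here is a minimal sketch modeled on the documentation's JsonWriterPipeline: it opens a JSON-lines file when the spider starts and closes it when the spider finishes. The file name items.jsonl is an arbitrary choice for this example.

import json
from itemadapter import ItemAdapter

class JsonWriterPipeline:
    def open_spider(self, spider):
        # acquire the resource once, when the spider is opened
        self.file = open('items.jsonl', 'w')

    def close_spider(self, spider):
        # release it when the spider is closed
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item  # pass the item on to later pipeline components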
From the documentation quoted above we can see:
process_item(self, item, spider) is called by every item pipeline component, for every item. It must either return an item object, return a Deferred, or raise a DropItem exception; a dropped item is not processed by any further pipeline components. Its parameters are the scraped item and the spider that scraped it.
In addition, a pipeline component may implement open_spider(self, spider), which is called when the spider is opened, and close_spider(self, spider), which is called when the spider is closed. Both receive the spider in question, which makes them natural places to acquire and release per-spider resources such as file handles or database connections.
Finally, if from_crawler(cls, crawler) is present, this classmethod is called to create the pipeline instance from a Crawler, and it must return a new instance of the pipeline. Since the Crawler object provides access to all Scrapy core components, such as settings and signals, this is how a pipeline reaches them and hooks its own functionality into Scrapy.
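Here is a sketch of from_crawler in action, modeled on the documentation's MongoPipeline example: the classmethod reads connection parameters from the crawler's settings (MONGO_URI and MONGO_DATABASE are setting names assumed to be defined in the project's settings.py) and uses open_spider/close_spider to manage the database connection:

import pymongo
from itemadapter import ItemAdapter

class MongoPipeline:
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # build the pipeline instance from the crawler's settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item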
Following the technical documentation, I also wrote an example of my own:
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class PricePipeline:
    vat_factor = 1.15  # multiplier used to add VAT to net prices

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get('price'):
            # price was scraped without VAT: add it in
            if adapter.get('price_excludes_vat'):
                adapter['price'] = adapter['price'] * self.vat_factor
            return item
        else:
            raise DropItem(f"Missing price in {item}")  # no price: drop the item
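For the pipeline to run at all, it must be enabled in the project settings via the ITEM_PIPELINES setting; the integer value (conventionally 0-1000) determines the order in which components run, lower values first. The module path myproject.pipelines below is a hypothetical one for this example:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,  # lower numbers run earlier
}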