Scrapy Source Code Analysis 10: Item Pipeline

2021SC@SDUSC

In parts 1-9 of this series, we surveyed the overall framework of the Scrapy source code. Now that the basic structure of a Scrapy crawler project is clear, I plan to combine the official documentation with a closer code analysis of several of the more prominent parts of a Scrapy crawler.

First up is Scrapy's Item Pipeline:

The official documentation defines the item pipeline as follows:

Each item pipeline component (sometimes referred as just “Item Pipeline”) is a Python class that implements a simple method. They receive an item and perform an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.

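In other words: each pipeline component is just a plain Python class built around one method that receives an item, acts on it, and decides whether the item continues through the pipeline or gets dropped. As a minimal sketch (the class name MyPipeline is my own, not from the docs), the simplest possible pipeline is:

    class MyPipeline:
        def process_item(self, item, spider):
            # inspect or modify the item here; returning it hands
            # it on to the next pipeline component
            return item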

Typical uses of item pipelines are:

cleansing HTML data

validating scraped data (checking that the items contain certain fields)

checking for duplicates (and dropping them)

storing the scraped item in a database

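The duplicate-checking use case is a good first illustration of how dropping works. Below is a minimal sketch, modeled on the DuplicatesPipeline example in the official docs; it assumes every item carries a unique id field:

    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem

    class DuplicatesPipeline:
        def __init__(self):
            # ids seen so far, kept for the duration of the crawl
            self.ids_seen = set()

        def process_item(self, item, spider):
            adapter = ItemAdapter(item)
            if adapter['id'] in self.ids_seen:
                # raising DropItem ends this item's journey here
                raise DropItem(f"Duplicate item found: {item!r}")
            self.ids_seen.add(adapter['id'])
            return item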

Source analysis and usage examples:

Each item pipeline component is a Python class that must implement the following method:

process_item(self, item, spider)
This method is called for every item pipeline component.

item is an item object, see Supporting All Item Types.

process_item() must either: return an item object, return a Deferred or raise a DropItem exception.

Dropped items are no longer processed by further pipeline components.

Parameters
item (item object) – the scraped item

spider (Spider object) – the spider which scraped the item

Additionally, they may also implement the following methods:

open_spider(self, spider)
This method is called when the spider is opened.

Parameters
spider (Spider object) – the spider which was opened

close_spider(self, spider)
This method is called when the spider is closed.

Parameters
spider (Spider object) – the spider which was closed
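These two hooks are typically used to acquire and release per-crawl resources. A minimal sketch, modeled on the JsonWriterPipeline example in the official docs (the items.jsonl filename is arbitrary):

    import json

    from itemadapter import ItemAdapter

    class JsonWriterPipeline:
        def open_spider(self, spider):
            # acquire the resource once, when the crawl starts
            self.file = open('items.jsonl', 'w')

        def close_spider(self, spider):
            # release it once, when the crawl ends
            self.file.close()

        def process_item(self, item, spider):
            # write one JSON object per line (JSON Lines format)
            line = json.dumps(ItemAdapter(item).asdict()) + '\n'
            self.file.write(line)
            return item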

from_crawler(cls, crawler)
If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. Crawler object provides access to all Scrapy core components like settings and signals; it is a way for pipeline to access them and hook its functionality into Scrapy.

Parameters
crawler (Crawler object) – crawler that uses this pipeline
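A common use of from_crawler() is reading configuration from the project settings. Here is a sketch along the lines of the MongoPipeline example in the official docs; it assumes pymongo is installed and that MONGO_URI and MONGO_DATABASE are defined in settings.py:

    import pymongo

    from itemadapter import ItemAdapter

    class MongoPipeline:
        collection_name = 'scrapy_items'

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # pull configuration out of the crawler's settings
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # persist every item; returning it keeps the pipeline going
            self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
            return item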

From the above we can see:

process_item(self, item, spider) is called for every item, by every pipeline component. It must either return an item object, return a Deferred, or raise a DropItem exception; a dropped item is not processed by any further pipeline components. Its parameters are the scraped item and the spider that scraped it.

In addition, a pipeline may optionally implement open_spider(self, spider), called when the spider is opened, and close_spider(self, spider), called when the spider is closed; each receives the spider in question.

Finally, if from_crawler(cls, crawler) is present, this classmethod is called to create a pipeline instance from a Crawler and must return a new instance. Since the Crawler object provides access to all Scrapy core components, such as settings and signals, this is the way a pipeline reaches them and hooks its functionality into Scrapy.

Referring to the technical documentation, here is a worked example: a price-validation pipeline that adjusts prices for VAT and drops any item that has no price.

    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem

    class PricePipeline:
        vat_factor = 1.15

        def process_item(self, item, spider):
            adapter = ItemAdapter(item)
            if adapter.get('price'):
                # add VAT where the scraped price excludes it
                if adapter.get('price_excludes_vat'):
                    adapter['price'] = adapter['price'] * self.vat_factor
                return item
            else:
                raise DropItem(f"Missing price in {item}")
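Note that none of these classes run until the pipeline is activated in the project settings. A sketch, assuming the classes live in a hypothetical module myproject.pipelines; the integer values (conventionally 0-1000) determine the order in which components run, lower numbers first:

    # settings.py
    ITEM_PIPELINES = {
        'myproject.pipelines.PricePipeline': 300,
        'myproject.pipelines.JsonWriterPipeline': 800,
    }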
