Using Item Pipelines in Scrapy

weixin_30920513

Original link: http://www.cnblogs.com/mljj/p/9969425.html

After an Item has been scraped by a Spider, it is passed to the Item Pipeline, where the pipeline components process it in the order in which they are defined.

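The main uses of an item pipeline are:

Cleaning HTML data
Validating scraped data
Checking for duplicates and dropping them
Saving the scraped results to a database or a file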

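Each item pipeline component is a standalone Python class that must implement the process_item(self, item, spider) method. The method should return the item so that the next component can process it, or raise DropItem to discard it.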
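
As a minimal sketch of the validation and de-duplication uses listed above, the pipeline below drops items that lack an identifier or repeat one already seen. The field name 'id' and the class name DuplicatesPipeline are assumptions made for illustration, and the fallback DropItem class exists only so the sketch runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch still runs where Scrapy is not installed.
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class DuplicatesPipeline:
    """Drop items with a missing or already-seen 'id' field (field name is an assumption)."""

    def __init__(self):
        # Identifiers of every item that has passed through so far
        self.seen_ids = set()

    def process_item(self, item, spider):
        item_id = dict(item).get("id")
        if item_id is None:
            # Validation: an item without an identifier is discarded
            raise DropItem("missing id")
        if item_id in self.seen_ids:
            # De-duplication: the same identifier was already processed
            raise DropItem(f"duplicate id: {item_id}")
        self.seen_ids.add(item_id)
        # Returning the item hands it to the next pipeline component
        return item
```

Keeping the seen identifiers in an in-memory set is the simplest choice; it only de-duplicates within a single crawl.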

The following is a typical application of an item pipeline, writing items to MongoDB:

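import pymongo


class MongoPipeline(object):

    # Name of the MongoDB collection that receives the items
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Build the pipeline from the MONGO_URI / MONGO_DATABASE settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        # Called when the spider is opened: connect to MongoDB
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # Called when the spider is closed: release the connection
        self.client.close()

    def process_item(self, item, spider):
        # insert_one replaces Collection.insert, which was deprecated in
        # PyMongo 3 and removed in PyMongo 4
        self.db[self.collection_name].insert_one(dict(item))
        return item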
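
The same open_spider / close_spider / process_item hooks also cover the "save to a file" use case mentioned earlier. The sketch below writes one JSON object per line; the class name JsonWriterPipeline and the default file name items.jl are assumptions made for illustration, not part of the original example.

```python
import json


class JsonWriterPipeline:
    """Append each item to a JSON Lines file (path is a hypothetical default)."""

    def __init__(self, path="items.jl"):
        self.path = path

    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Opening the file in open_spider rather than in process_item avoids reopening it for every item.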

To enable an Item Pipeline component, add its class to the ITEM_PIPELINES setting in settings.py, as in the following example:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'mySpider.pipelines.SomePipeline': 300,
    "mySpider.pipelines.MongoPipeline": 300,
}

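The integer assigned to each class determines the order in which the components run: items pass through the pipelines from the lowest value to the highest. These values are conventionally kept in the 0-1000 range (any value in that range works; the lower the value, the higher the component's priority).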
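
To make the ordering rule concrete, the sketch below sorts a settings dictionary by its integer values, which gives the order in which items would visit the components. ValidationPipeline and ExportPipeline are hypothetical names used only for illustration, and the sorting merely mimics the rule rather than calling Scrapy itself.

```python
# Hypothetical pipeline paths; only the integer values matter here.
ITEM_PIPELINES = {
    "mySpider.pipelines.MongoPipeline": 300,
    "mySpider.pipelines.ValidationPipeline": 100,
    "mySpider.pipelines.ExportPipeline": 800,
}

# Lower values run first, so sort the class paths by their assigned integer.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With these values, validation runs before the MongoDB write, so dropped items never reach the database.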

Reposted from: https://www.cnblogs.com/mljj/p/9969425.html