Scrapy: three ways to save a JSON file with Pipelines

Latest recommended article published 2022-11-07 20:42:54

pylemon

Article link: https://blog.csdn.net/qq_27648991/article/details/81514941

Recommended: JsonLinesItemExporter

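With this exporter, every call to export_item writes the item out to the file right away.

Advantage: each item is stored as soon as it is processed, so memory use stays low and the data is relatively safe (an interrupted crawl generally still leaves the lines written so far).

Disadvantage: each item (a dict) is written on its own line, so the file as a whole is not a single valid JSON document.

Workaround: when reading the file, iterate over the lines and parse each one with json.loads().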
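
The workaround can be sketched roughly as follows. This is a minimal illustration using only the standard library; the helper name parse_json_lines is hypothetical and not part of Scrapy.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_json_lines(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed item per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, such as a trailing newline
            yield json.loads(line)
```

A text-mode file object is itself an iterable of lines, so the file written by the pipeline can be read with `with open('qsbk_2.json', encoding='utf-8') as f: items = list(parse_json_lines(f))`. Because the helper is a generator, looping over it directly instead of calling list() keeps only one item in memory at a time.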

from scrapy.exporters import JsonLinesItemExporter

# Recommended: use Scrapy's built-in JSON Lines exporter
class JsonLinesExporterPipeline(object):
    def __init__(self):
        self.file = open('qsbk_2.json', 'wb')  # the exporter writes bytes, so open in binary mode
        self.exporter = JsonLinesItemExporter(self.file, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        print(item)
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Scrapy calls close_spider (not close_item) when the spider finishes
        self.file.close()

Saving the JSON file with a custom method

import json

# Custom pipeline that serializes the JSON itself
class QsbkDemoPipeline(object):
    def __init__(self):
        self.file = open('qsbk.json', 'w', encoding='utf-8')

    def open_spider(self, spider):
        print('Spider started...')

    def process_item(self, item, spider):
        # Convert the item to a dict so that json.dumps can serialize it
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(item_json + '\n')  # one JSON object per line
        return item

    def close_spider(self, spider):
        self.file.close()
        print('Spider finished...')

Saving JSON with JsonItemExporter

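This exporter produces a single JSON array: start_exporting() writes the opening [, each export_item() call appends one item, and finish_exporting() writes the closing ]. In current Scrapy releases the items are written to the file as they arrive rather than being collected in an in-memory list first.

Advantage: once finish_exporting() has run, the stored data is a single valid JSON document.

Disadvantage: with a large amount of data this format is memory-hungry, because most JSON parsers have to load the entire array to read it back. The file is also left as an unterminated array if the crawl stops before finish_exporting() runs.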
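
To make the trade-off concrete, here is a minimal standard-library sketch of reading such a file back. The helper name parse_json_array is hypothetical; the point is that the whole array is parsed in one call, unlike the line-by-line format.

```python
import json
from typing import Any, Dict, List


def parse_json_array(text: str) -> List[Dict[str, Any]]:
    """Parse text holding one top-level JSON array (hypothetical helper)."""
    items = json.loads(text)  # parses the entire document in one call
    if not isinstance(items, list):
        raise ValueError("expected a top-level JSON array")
    return items
```

For a file written by this exporter, `parse_json_array(open('qsbk_1.json', encoding='utf-8').read())` returns every item at once, which is convenient for small crawls but means the full data set sits in memory. Passing line-per-item text to the same helper raises a ValueError, since that format is not one JSON document.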

from scrapy.exporters import JsonItemExporter

# Use Scrapy's built-in JSON exporter
class JsonExporterPipeline(object):
    def __init__(self):
        self.file = open('qsbk_1.json', 'wb')  # the exporter writes bytes, so open in binary mode
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        # Begin the export (opens the JSON array)
        self.exporter.start_exporting()

    def open_spider(self, spider):
        print('Spider started')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Finish the export (closes the JSON array), then close the file
        self.exporter.finish_exporting()
        self.file.close()
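
None of these pipelines runs until it is enabled through Scrapy's ITEM_PIPELINES setting. The fragment below is a sketch of what that might look like in settings.py; the package name qsbk_demo is an assumption made for illustration, so substitute your own project's module path.

```python
# settings.py (sketch): "qsbk_demo" is a hypothetical project package name.
# Each value is an order number, conventionally 0-1000; lower values run first.
ITEM_PIPELINES = {
    "qsbk_demo.pipelines.JsonLinesExporterPipeline": 300,
    "qsbk_demo.pipelines.QsbkDemoPipeline": 400,
    "qsbk_demo.pipelines.JsonExporterPipeline": 500,
}
```

In practice you would probably enable only the one pipeline you need. If several are enabled, each process_item has to return the item, otherwise the pipelines that come later in the order never receive it.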