Scrapy: Persistent Data Storage

Chris的算法之旅


Article link: https://blog.csdn.net/u012052168/article/details/79269969

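This article was first published on my blog: gongyanli.com
My Jianshu: https://www.jianshu.com/p/2542219f6ee0

Preface: this article covers data persistence in Scrapy, namely storing items in a database, writing them to a JSON file, and using Scrapy's built-in feed exports.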

Persistent storage: JSON

pipelines.py

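```python
import json

from scrapy.exceptions import DropItem


class myPipeline(object):
    def open_spider(self, spider):
        # Text mode with UTF-8: json.dumps returns str, so writing it
        # to a file opened with 'wb' raises TypeError on Python 3.
        self.file = open('test.json', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Close the file when the spider finishes so buffered lines are flushed.
        self.file.close()

    def process_item(self, item, spider):
        if item.get('title'):
            # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            self.file.write(line)
            return item
        # Items without a title are discarded.
        raise DropItem("Missing title in %s" % item)
```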
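The pipeline above writes one JSON object per line, so `test.json` is a JSON Lines file rather than a single JSON document, and `json.load` on the whole file would fail. As a quick way to check what was stored, here is a minimal sketch that reads the file back line by line. The helper names `read_json_lines` and `titles` are made up for this illustration and are not part of Scrapy.

```python
import json


def read_json_lines(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def titles(path):
    """Collect the 'title' field of every stored item."""
    return [record.get('title') for record in read_json_lines(path)]
```

After a crawl, `titles('test.json')` should return the titles of all items that passed the `title` check in the pipeline.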

Persistent storage: MongoDB

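```python
import pymongo

# Assumption: ChinacwaItem is defined in this project's items.py.
from .items import ChinacwaItem


class myPipeline(object):
    def __init__(self, settings):
        self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])
        self.db = self.client[settings['MONGO_DB']]
        # self.coll = self.db[settings['MONGO_COLL2']]
        self.chinacwa = self.db['chinacwa']
        self.iot = self.db['iot']
        self.ny135 = self.db['ny135']
        self.productprice = self.db['productprice']
        self.allproductprice = self.db['allproductprice']

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the pipeline, giving access to the project settings.
        return cls(crawler.settings)

    def process_item(self, item, spider):
        if isinstance(item, ChinacwaItem):
            try:
                if item['article_title']:
                    # insert_one replaces Collection.insert, which was removed in PyMongo 4.
                    self.chinacwa.insert_one(dict(item))
                    spider.logger.info("Insert succeeded")
            except Exception:
                # Log the traceback instead of failing silently.
                spider.logger.exception("Failed to insert item into MongoDB")
        # Always return the item so later pipelines still receive it.
        return item
```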
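The MongoDB pipeline reads `MONGO_HOST`, `MONGO_PORT` and `MONGO_DB` from the project settings, and it only runs if it is listed in `ITEM_PIPELINES`. The article does not show these values, so the fragment below is a sketch with placeholder assumptions: a local MongoDB on its default port, a hypothetical database name `scrapy_demo`, and a hypothetical project package named `myproject`.

```python
# settings.py (fragment)

# Connection values read by the pipeline; all three are placeholder assumptions.
MONGO_HOST = 'localhost'
MONGO_PORT = 27017          # MongoDB's default port
MONGO_DB = 'scrapy_demo'    # hypothetical database name

# Enable the pipeline. 'myproject' is a hypothetical package name.
# The number (0-1000) sets the order in which pipelines run, lower first.
ITEM_PIPELINES = {
    'myproject.pipelines.myPipeline': 300,
}
```

Replace the package name and connection values with those of your own project.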

Persistent storage: built-in feed exports

settings

1. JSON
> FEED_FORMAT: json
> Built-in exporter class: JsonItemExporter

2. JSON lines
> FEED_FORMAT: jsonlines
> Built-in exporter class: JsonLinesItemExporter

3. CSV
> FEED_FORMAT: csv
> Built-in exporter class: CsvItemExporter

4. XML
> FEED_FORMAT: xml
> Built-in exporter class: XmlItemExporter

5. Pickle
> FEED_FORMAT: pickle
> Built-in exporter class: PickleItemExporter

6. Marshal
> FEED_FORMAT: marshal
> Built-in exporter class: MarshalItemExporter

> Usage (the output format is inferred from the file extension):
$ scrapy crawl mySpider -o test.csv
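The export can also be declared in the project settings instead of on the command line. To my knowledge, Scrapy 2.1 and later use the `FEEDS` setting for this, and it supersedes the older `FEED_FORMAT` and `FEED_URI` settings. The fragment below is a minimal sketch of a configuration that should match the command above. The `encoding` value is an assumption added for illustration.

```python
# settings.py (fragment), assuming Scrapy 2.1 or later

FEEDS = {
    # Output path mapped to its export options.
    'test.csv': {
        'format': 'csv',      # one of: json, jsonlines, csv, xml, pickle, marshal
        'encoding': 'utf8',   # assumption: write the file as UTF-8
    },
}
```

With this in place, `scrapy crawl mySpider` writes `test.csv` without needing the `-o` option.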
