Storing Scraped Content with Scrapy

monkey@king

Category: Scrapy framework. Tags: scrapy, python, web crawler

Article link: https://blog.csdn.net/m0_56535661/article/details/131886331

Scrapy storage

Preface
Storage options in Scrapy
Summary

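Preface

Scraped data usually needs to be persisted for later use, so it has to be stored somewhere. This article walks through several ways of storing it. If you spot shortcomings in the code or the approach, corrections are welcome.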

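Storage options in Scrapy

The sections below store data first with a terminal command, then through item pipelines that write to a file, to MySQL, and to MongoDB.

1 Storing with a terminal command

Code
/myspider/myspider/spiders/duanzi.py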

import scrapy


class DuanziSpider(scrapy.Spider):
    name = "duanzi"
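    # Domains the spider is allowed to crawl
    allowed_domains = ["<domain to crawl>"]
    # URLs of the initial requests
    start_urls = ["<start URL to crawl>"]

    # def parse(self, response, **kwargs):
    # Response attributes worth inspecting while testing
    # Decoded page source (str)
    # print(response.text)
    # Raw body (bytes)
    # print(response.body)
    # URL of the current response
    # print(response.url)
    # URL of the request that produced this response
    # print(response.request.url)
    # Response headers
    # print(response.headers)
    # Headers of the request that produced this response
    # print(response.request.headers)
    # Response status code
    # print(response.status)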

    def parse(self, response, **kwargs):
        # Select the <article> element of each joke
        article_list = response.xpath('//article[@class="excerpt"]')
        # article_list = response.xpath('//article[@class="excerpt"]//text()')
        # print(article_list)
        # print(article_list.extract())
        for article in article_list:
            # .extract() parses every match and returns a list of strings
            # title = article.xpath('./header/h2/a/text()').extract()
            # [0].extract() parses the first match and returns a string
            # title = article.xpath('./header/h2/a/text()')[0].extract()
            dic_data = {}
            # Title; extract_first() returns None when nothing matches
            title = article.xpath('./header/h2/a/text()').extract_first()
            # Content
            con = article.xpath('./p[@class="note"]/text()').extract_first()
            # print(con)
            dic_data['title'] = title
            dic_data['con'] = con
            yield dic_data

Terminal command
scrapy crawl <spider name> -o <file name>.csv
scrapy crawl duanzi -o duanzi.csv
This writes the scraped items to duanzi.csv.
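
The same export can also be declared in settings.py instead of on the command line. A minimal sketch, assuming a Scrapy version that supports the FEEDS setting; the file name duanzi.csv comes from the command above, while the utf-8-sig encoding is an assumption added here (the byte-order mark helps spreadsheet programs recognize UTF-8 text):

```python
# settings.py (fragment)
# Each key is an output path; each value holds the options for that feed.
FEEDS = {
    "duanzi.csv": {
        "format": "csv",          # same exporter that "-o duanzi.csv" selects
        "encoding": "utf-8-sig",  # assumption: BOM for spreadsheet programs
    },
}
```

With this in place, running scrapy crawl duanzi without -o should produce the same file.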

2 Storing to a file

Implemented in duanzi.py

import scrapy
from duanzifile.items import DuanzifileItem  # import the item class

class DuanziSpider(scrapy.Spider):
    name = "duanzi"
    allowed_domains = ["<domain to crawl>"]
    start_urls = ["<start URL to crawl>"]

    def parse(self, response, **kwargs):
        # Select the <article> element of each joke
        article_list = response.xpath('//article[@class="excerpt"]')
        for article in article_list:
            # Create a fresh item per article so yielded items do not share state
            item = DuanzifileItem()
            # Title
            title = article.xpath('./header/h2/a/text()').extract_first()
            # Content
            con = article.xpath('./p[@class="note"]/text()').extract_first()
            # print(con)
            item['title'] = title
            item['con'] = con
            yield item

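Question: why use yield?

It turns the whole function into a generator. What is the benefit?
When the caller iterates over the function's results, items are produced one at a time, so memory use does not spike.
range in Python 3 and xrange in Python 2 work on the same principle.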
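
The difference is easy to see outside Scrapy. The following is a minimal sketch in plain Python; the function names parse_eagerly and parse_lazily are made up for this illustration and are not part of Scrapy:

```python
def parse_eagerly(rows):
    """Build the full result list before returning anything."""
    results = []
    for row in rows:
        results.append({"title": row})
    return results


def parse_lazily(rows):
    """Yield one result at a time; nothing runs until the caller iterates."""
    for row in rows:
        yield {"title": row}


rows = ["a", "b", "c"]
eager = parse_eagerly(rows)  # a list holding all three dicts at once
lazy = parse_lazily(rows)    # a generator object; no dict has been built yet
first = next(lazy)           # only now is the first dict created
remaining = list(lazy)       # the rest, produced on demand
```

A spider's parse method behaves like parse_lazily: the engine pulls one item or request at a time instead of waiting for a complete list.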

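Note: a spider callback may only yield Request objects, item objects (scrapy.Item instances or dicts; BaseItem in older Scrapy versions), or None.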

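Implemented in items.py

import scrapy
class DuanzifileItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Declares the fields to store: title and con. Only these two can be set; assigning any other key raises KeyError
    title = scrapy.Field()
    con = scrapy.Field()

Note: the field names must match the keys that the spider in duanzi.py assigns on the item; assigning an undeclared key raises KeyError.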

Enabling the pipeline

Commonly used pipeline methods:

process_item(self, item, spider): processes each item
open_spider(self, spider): runs exactly once, when the spider opens
close_spider(self, spider): runs exactly once, when the spider closes
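
The order in which these three hooks fire can be sketched without running Scrapy. In the example below, RecordingPipeline and the run function are hypothetical names for this illustration; run only imitates the call order of the engine and is not Scrapy's own code:

```python
import io


class RecordingPipeline:
    """Pipeline-shaped class that writes to an in-memory buffer, not a file."""

    def open_spider(self, spider):
        # Acquire the resource once, before any item arrives.
        self.buffer = io.StringIO()
        self.buffer.write("opened\n")

    def process_item(self, item, spider):
        # Called once per item.
        self.buffer.write(f"item: {item['title']}\n")
        return item

    def close_spider(self, spider):
        # Release the resource once, after the last item.
        self.buffer.write("closed\n")
        self.output = self.buffer.getvalue()
        self.buffer.close()


def run(pipeline, items, spider=None):
    """Hypothetical driver imitating the order in which the hooks are called."""
    pipeline.open_spider(spider)
    for item in items:
        pipeline.process_item(item, spider)
    pipeline.close_spider(spider)


pipeline = RecordingPipeline()
run(pipeline, [{"title": "first"}, {"title": "second"}])
print(pipeline.output)
```

The buffer is opened before the first item and closed after the last one, which is the same pattern the file and database pipelines in this article follow.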

In settings.py, uncomment the following setting

ITEM_PIPELINES = {
   'duanzifile.pipelines.DuanzifilePipeline': 300,
}

Implemented in pipelines.py

from itemadapter import ItemAdapter

class DuanzifilePipeline:
    # Runs once when the spider opens
    def open_spider(self, spider):
        self.f = open('duanzi.txt', 'w', encoding='utf-8')

    # Processes each item
    def process_item(self, item, spider):
        # extract_first() may return None, so fall back to an empty string
        self.f.write((item['title'] or '') + '\n')
        self.f.write((item['con'] or '') + '\n')
        return item

    # Runs once when the spider closes
    def close_spider(self, spider):
        self.f.close()

Note:

process_item must end with return item. When several pipelines are enabled and one of them does not return the item, the next pipeline cannot receive it.
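
The effect of a missing return can be reproduced with a small chain. This is a sketch in plain Python; Cleaner, ForgetfulCleaner, Recorder, and run_chain are hypothetical names, and run_chain only imitates how each pipeline's result is handed to the next one:

```python
class Cleaner:
    def process_item(self, item, spider):
        item["title"] = item["title"].strip()
        return item  # hands the item to the next pipeline


class ForgetfulCleaner:
    def process_item(self, item, spider):
        item["title"] = item["title"].strip()
        # No return statement, so Python returns None implicitly.


class Recorder:
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        # Remember whatever the previous stage passed along.
        self.seen.append(item)
        return item


def run_chain(pipelines, item, spider=None):
    """Hypothetical stand-in for the engine: each result feeds the next stage."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


good, bad = Recorder(), Recorder()
run_chain([Cleaner(), good], {"title": "  hello  "})
run_chain([ForgetfulCleaner(), bad], {"title": "  hello  "})
```

After this runs, good.seen holds the cleaned item while bad.seen holds None, which is what the second pipeline would be asked to store.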


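3 Storing to MySQL

First configure settings.py: set the request headers, enable the pipeline, and so on. The steps are the same as above, so they are not repeated here. You also need to create a MySQL database and table that suit your data.
Implemented in duanzi.py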

import scrapy
from duanzifile.items import DuanzifileItem  # import the item class

class DuanziSpider(scrapy.Spider):
    name = "duanzi"
    allowed_domains = ["<domain to crawl>"]
    start_urls = ["<start URL to crawl>"]

    def parse(self, response, **kwargs):
        # Select the <article> element of each joke
        article_list = response.xpath('//article[@class="excerpt"]')
        for article in article_list:
            # Create a fresh item per article so yielded items do not share state
            item = DuanzifileItem()
            # Title
            title = article.xpath('./header/h2/a/text()').extract_first()
            # Content
            con = article.xpath('./p[@class="note"]/text()').extract_first()
            # print(con)
            item['title'] = title
            item['con'] = con
            yield item

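Note: this code is the same as in the previous section. Only the storage backend changes, so duanzi.py needs no changes beyond importing whichever item class the project's items.py defines.

Implemented in items.py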

import scrapy

class DuanzimysqlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    con = scrapy.Field()

Implemented in pipelines.py

from itemadapter import ItemAdapter
import pymysql

class DuanzimysqlPipeline:
    # Runs once when the spider opens
    def open_spider(self, spider):
        # Connect to the database
        self.db = pymysql.connect(host='127.0.0.1', port=3306, database='duanzi', user='root', password='root', charset='utf8')
        # Create a cursor
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Parameterized SQL: the driver escapes the values, so quotes in the text are safe
        sql = 'insert into duanzi(title, con) values(%s, %s)'
        try:
            self.cursor.execute(sql, (item['title'], item['con']))
            # Commit the transaction
            self.db.commit()
        except Exception as e:
            print(e, sql)
            self.db.rollback()
        return item

    def close_spider(self, spider):
        # Close the cursor and the MySQL connection
        self.cursor.close()
        self.db.close()
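
Passing values as query parameters, rather than formatting them into the SQL string, keeps inserts working when a title contains quotes and avoids SQL injection. The sketch below shows the idea with the standard-library sqlite3 module standing in for MySQL so that it runs without a server. Note the differences: sqlite3 uses ? placeholders where pymysql uses %s, and the column types here are assumptions, since the article leaves the schema to the reader.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server.
db = sqlite3.connect(":memory:")
cursor = db.cursor()
# Table and column names follow the article; the types are assumptions.
cursor.execute(
    "CREATE TABLE duanzi (id INTEGER PRIMARY KEY, title TEXT, con TEXT)"
)

item = {"title": 'He said "hi"', "con": "quotes; and 'apostrophes'"}

# Placeholders let the driver handle escaping; values are passed separately.
cursor.execute(
    "INSERT INTO duanzi (title, con) VALUES (?, ?)",
    (item["title"], item["con"]),
)
db.commit()

stored = cursor.execute("SELECT title, con FROM duanzi").fetchall()
db.close()
```

The row comes back exactly as it went in, including both kinds of quote characters.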


4 Storing to MongoDB

Only pipelines.py needs to change here; everything else is the same as above and is not repeated.
Implemented in pipelines.py

from itemadapter import ItemAdapter
from pymongo import MongoClient

class DuanzimongoPipeline:
    def open_spider(self, spider):
        # Connect to the MongoDB server
        self.con = MongoClient(host='127.0.0.1', port=27017)
        # Select the duanzi collection in the spider database
        self.collection = self.con.spider.duanzi

    def process_item(self, item, spider):
        # Insert the item as a document
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Close the MongoDB connection
        self.con.close()


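5 Storing to a file, MySQL, and MongoDB at the same time

This only requires combining the classes in pipelines.py and registering them in settings.py. Everything else stays unchanged and is not repeated.
Implemented in pipelines.py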

from itemadapter import ItemAdapter
from pymongo import MongoClient
import pymysql


class DuanziPipeline:
    def process_item(self, item, spider):
        return item

class DuanzimongoPipeline:
    def open_spider(self, spider):
        # Connect to the MongoDB server
        self.con = MongoClient(host='127.0.0.1', port=27017)
        # Select the duanzi collection in the spider database
        self.collection = self.con.spider.duanzi

    def process_item(self, item, spider):
        # Insert the item as a document
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Close the MongoDB connection
        self.con.close()

class DuanzimysqlPipeline:
    # Runs once when the spider opens
    def open_spider(self, spider):
        # Connect to the database
        self.db = pymysql.connect(host='127.0.0.1', port=3306, database='duanzi', user='root', password='root', charset='utf8')
        # Create a cursor
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Parameterized SQL: the driver escapes the values, so quotes in the text are safe
        sql = 'insert into duanzi(title, con) values(%s, %s)'
        try:
            self.cursor.execute(sql, (item['title'], item['con']))
            # Commit the transaction
            self.db.commit()
        except Exception as e:
            print(e, sql)
            self.db.rollback()
        return item

    def close_spider(self, spider):
        # Close the cursor and the MySQL connection
        self.cursor.close()
        self.db.close()


class DuanzifilePipeline:
    # Runs once when the spider opens
    def open_spider(self, spider):
        self.f = open('duanzi.txt', 'w', encoding='utf-8')

    # Processes each item
    def process_item(self, item, spider):
        # extract_first() may return None, so fall back to an empty string
        self.f.write((item['title'] or '') + '\n')
        self.f.write((item['con'] or '') + '\n')
        return item

    # Runs once when the spider closes
    def close_spider(self, spider):
        self.f.close()

Register the pipelines in settings.py

ITEM_PIPELINES = {
   "duanzi.pipelines.DuanziPipeline": 300,  # lowest value: runs first
   "duanzi.pipelines.DuanzimongoPipeline": 400,
   "duanzi.pipelines.DuanzimysqlPipeline": 500,
   "duanzi.pipelines.DuanzifilePipeline": 600,  # highest value: runs last
}
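
The integer values decide the order in which an item visits the pipelines: smaller values come first. A small sketch of that ordering in plain Python, using the same dictionary; the sorting code only illustrates the rule and is not how Scrapy is implemented internally:

```python
ITEM_PIPELINES = {
    "duanzi.pipelines.DuanziPipeline": 300,
    "duanzi.pipelines.DuanzimongoPipeline": 400,
    "duanzi.pipelines.DuanzimysqlPipeline": 500,
    "duanzi.pipelines.DuanzifilePipeline": 600,
}

# Sort by the integer value: smaller numbers are visited first.
run_order = [
    path.rsplit(".", 1)[-1]  # keep only the class name for readability
    for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```

Swapping two values in the dictionary is enough to change which store receives an item first.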

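Question: several pipelines can be enabled in settings. Why would you enable more than one?

Different pipelines can handle data from different spiders, distinguished by the spider.name attribute.
Different pipelines can apply different processing steps to one or more spiders, for example one that cleans data and another that saves it.
A single pipeline class can also handle data from different spiders, again distinguished by spider.name.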
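
Routing on spider.name might look like the sketch below. RoutingPipeline is a hypothetical class for this illustration, and SimpleNamespace objects stand in for real spiders because only the name attribute is read; the second spider name, other, is made up:

```python
from types import SimpleNamespace


class RoutingPipeline:
    """One pipeline class that treats items differently per spider."""

    def open_spider(self, spider):
        self.saved = []

    def process_item(self, item, spider):
        if spider.name == "duanzi":
            # Only items from the duanzi spider are stored here.
            self.saved.append(dict(item))
        # Items from any other spider pass through untouched.
        return item


# Stand-ins for spiders; a real spider exposes the same .name attribute.
duanzi_spider = SimpleNamespace(name="duanzi")
other_spider = SimpleNamespace(name="other")  # hypothetical second spider

pipeline = RoutingPipeline()
pipeline.open_spider(duanzi_spider)
pipeline.process_item({"title": "kept"}, duanzi_spider)
pipeline.process_item({"title": "skipped"}, other_spider)
```

Because process_item always returns the item, later pipelines still see items from both spiders.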

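Summary

Pipelines must be enabled in settings.py before they are used.
In ITEM_PIPELINES, each key is the import path of a pipeline class (so the class can live anywhere in the project), and each value sets its order: the smaller the value, the closer the pipeline sits to the engine and the earlier items pass through it.
When there are several pipelines, process_item must return the item; otherwise the next pipeline receives None.
Every pipeline must define process_item; without it, items cannot be received or processed.
process_item takes item and spider, where spider is the spider that produced the item.
open_spider(spider): runs once when the spider opens.
close_spider(spider): runs once when the spider closes.
These two methods are typically used for database interaction: open the connection when the spider opens and close it when the spider closes.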
