Scrapy pipeline-based persistent storage
Workflow:
1. Parse the data
2. Define the corresponding fields in the item class
3. Package the parsed data into an item object
4. Submit the item object to the pipeline for persistent storage
5. In the pipeline class's process_item method, persist the data carried by each received item object
6. Enable the pipeline in the settings file
Example code:
Define the relevant fields in the item class (items.py):
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class FirstproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title_num = scrapy.Field()
    title_name = scrapy.Field()
In the spider file, store the parsed data into the item object:
import scrapy
from ..items import FirstproItem


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.shicimingju.com/book/sanguoyanyi.html']

    def parse(self, response):
        # response.xpath() returns Selector objects; extract() pulls out the string stored in each Selector's data attribute
        title_list = response.xpath("//div[@class='book-mulu']/ul/li/a/text()").extract()
        for item_list in title_list:
            title_num = item_list.split("·")[0]
            title_name = item_list.split("·")[1]
            # print(title_num + ":" + title_name)
            item = FirstproItem()  # instantiate an item object
            item['title_num'] = title_num
            item['title_name'] = title_name
            yield item  # submit the item to the pipeline
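As noted in the comment above, response.xpath() returns Selector objects and extract() pulls out the stored strings. Here is a standalone sketch of the same behavior using Scrapy's Selector on a made-up HTML snippet (the chapter title shown is a placeholder, not real site data):

from scrapy.selector import Selector

# A Selector built from an HTML string behaves like response.xpath() in the spider above.
sel = Selector(text="<div class='book-mulu'><ul><li><a>第一回·示例章节名</a></li></ul></div>")
selector_list = sel.xpath("//div[@class='book-mulu']/ul/li/a/text()")  # a SelectorList of Selector objects
print(selector_list.extract())        # list of the data strings, e.g. ['第一回·示例章节名']
print(selector_list.extract_first())  # just the first string, or None if nothing matched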
Persist the data in the pipeline (pipelines.py)
1. Store to a local file
class FirstproPipeline:
    fp = None

    # Override the parent-class method; called only once, when the spider starts
    def open_spider(self, spider):
        print("spider started")
        self.fp = open("./sanguo.txt", "w", encoding="utf-8")

    # Receives the item objects submitted by the spider; called once for every item received
    def process_item(self, item, spider):
        title_num = item['title_num']
        title_name = item['title_name']
        self.fp.write(title_num + ":" + title_name + "\n")
        return item

    def close_spider(self, spider):
        print("spider finished")
        self.fp.close()
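Note that process_item returns the item so it is passed on to the next pipeline in priority order. The same pattern works for other local formats; below is a minimal sketch of a JSON Lines variant, assuming a hypothetical output file sanguo.jl (it would also need its own entry in ITEM_PIPELINES):

import json


class JsonLinesPipeline:
    fp = None

    def open_spider(self, spider):
        # The filename is an assumption; any writable path works.
        self.fp = open("./sanguo.jl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) turns the Item into a plain dict so json.dumps can serialize it
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        self.fp.close()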
2. Store to a MySQL database
This requires the pymysql module (install it with pip install pymysql).
import pymysql


class mysqlPipeline:
    conn = None
    cursor = None

    def open_spider(self, spider):
        print("spider started")
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', passwd='zhengyunyu524', db='pythontest', charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            # Parameterized query avoids quoting problems and SQL injection
            self.cursor.execute("insert into title values (%s, %s)", (item['title_num'], item['title_name']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print("spider finished")
        if self.cursor:  # cursor only exists if at least one item was processed
            self.cursor.close()
        self.conn.close()
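The insert statement assumes a table named title already exists in the pythontest database. Here is a minimal one-off sketch for creating it with pymysql (the column names and VARCHAR lengths are assumptions; adjust them to your data):

import pymysql

# One-off helper to create the table the pipeline writes into; replace passwd with your own password.
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', passwd='your_password', db='pythontest', charset='utf8')
cursor = conn.cursor()
# Two string columns matching the two fields written by process_item
cursor.execute("create table if not exists title (title_num varchar(32), title_name varchar(255))")
conn.commit()
cursor.close()
conn.close()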
Update the settings file (settings.py):
ITEM_PIPELINES = {
    'firstPro.pipelines.FirstproPipeline': 300,
    'firstPro.pipelines.mysqlPipeline': 301,
}
The number is the priority: the smaller the number, the higher the priority, so FirstproPipeline (300) runs before mysqlPipeline (301).
You also need to enable UA spoofing, disable the robots.txt rule, and limit the log output to error messages:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Only show error-level log messages
LOG_LEVEL = 'ERROR'
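With the pipelines enabled and the settings adjusted, the spider can be started from the project's root directory (assuming the project is named firstPro, as the pipeline paths above suggest):

scrapy crawl first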
Result after running the code: the chapter numbers and titles are written to sanguo.txt and inserted into the title table.