Scrapy pipeline-based persistent storage
Workflow:
1. Parse the data
2. Define the corresponding fields in the item class
3. Package the parsed data into an item object
4. Submit the item object to the pipeline for persistent storage
5. In the pipeline class's process_item method, persist the data carried by each received item object
6. Enable the pipeline in the settings file
Example code:
Define the relevant fields in the item class (items.py):
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class FirstproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title_num = scrapy.Field()
    title_name = scrapy.Field()
In the spider file, store the parsed data into the item object:
import scrapy
from ..items import FirstproItem


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.shicimingju.com/book/sanguoyanyi.html']

    def parse(self, response):
        # response.xpath() returns Selector objects; extract() pulls out the string stored in each Selector's data attribute
        title_list = response.xpath("//div[@class='book-mulu']/ul/li/a/text()").extract()
        for item_list in title_list:
            title_num = item_list.split("·")[0]
            title_name = item_list.split("·")[1]
            # print(title_num + ":" + title_name)
            item = FirstproItem()  # instantiate an item object
            item['title_num'] = title_num
            item['title_name'] = title_name
            yield item  # submit the item to the pipeline
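As noted in the comment above, response.xpath() returns Selector objects and extract() pulls out the stored strings. Here is a standalone sketch of the same behavior using Scrapy's Selector on a made-up HTML snippet (the chapter title shown is a placeholder, not real site data):

from scrapy.selector import Selector

# A Selector built from an HTML string behaves like response.xpath() in the spider above.
sel = Selector(text="<div class='book-mulu'><ul><li><a>第一回·示例章节名</a></li></ul></div>")
selector_list = sel.xpath("//div[@class='book-mulu']/ul/li/a/text()")  # a SelectorList of Selector objects
print(selector_list.extract())        # list of the data strings, e.g. ['第一回·示例章节名']
print(selector_list.extract_first())  # just the first string, or None if nothing matched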
Persist the data in the pipeline (pipelines.py)
1. Store to a local file
class FirstproPipeline:
    fp = None

    # Override the parent-class method; called only once, when the spider starts
    def open_spider(self, spider):
        print("spider started")
        self.fp = open("./sanguo.txt", "w", encoding="utf-8")

    # Receives the item objects submitted by the spider; called once for every item received
    def process_item(self, item, spider):
        title_num = item['title_num']
        title_name = item['title_name']
        self.fp.write(title_num + ":" + title_name + "\n")
        return item

    def close_spider(self, spider):
        print("spider finished")
        self.fp.close()
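Note that process_item returns the item so it is passed on to the next pipeline in priority order. The same pattern works for other local formats; below is a minimal sketch of a JSON Lines variant, assuming a hypothetical output file sanguo.jl (it would also need its own entry in ITEM_PIPELINES):

import json


class JsonLinesPipeline:
    fp = None

    def open_spider(self, spider):
        # The filename is an assumption; any writable path works.
        self.fp = open("./sanguo.jl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) turns the Item into a plain dict so json.dumps can serialize it
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        self.fp.close()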
2. Store to a MySQL database
This requires the pymysql module (install it with pip install pymysql).
import pymysql


class mysqlPipeline:
    conn = None
    cursor = None

    def open_spider(self, spider):
        print("spider started")
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', passwd='zhengyunyu524', db='pythontest', charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            # Parameterized query avoids quoting problems and SQL injection
            self.cursor.execute("insert into title values (%s, %s)", (item['title_num'], item['title_name']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print("spider finished")
        if self.cursor:  # cursor only exists if at least one item was processed
            self.cursor.close()
        self.conn.close()
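The insert statement assumes a table named title already exists in the pythontest database. Here is a minimal one-off sketch for creating it with pymysql (the column names and VARCHAR lengths are assumptions; adjust them to your data):

import pymysql

# One-off helper to create the table the pipeline writes into; replace passwd with your own password.
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', passwd='your_password', db='pythontest', charset='utf8')
cursor = conn.cursor()
# Two string columns matching the two fields written by process_item
cursor.execute("create table if not exists title (title_num varchar(32), title_name varchar(255))")
conn.commit()
cursor.close()
conn.close()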
Update the settings file (settings.py):
ITEM_PIPELINES = {
    'firstPro.pipelines.FirstproPipeline': 300,
    'firstPro.pipelines.mysqlPipeline': 301,
}
The number is the priority: the smaller the number, the higher the priority, so FirstproPipeline (300) runs before mysqlPipeline (301).
You also need to enable UA spoofing, disable the robots.txt rule, and limit the log output to error messages:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Only show error-level log messages
LOG_LEVEL = 'ERROR'
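With the pipelines enabled and the settings adjusted, the spider can be started from the project's root directory (assuming the project is named firstPro, as the pipeline paths above suggest):

scrapy crawl first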
Result after running the code: the chapter numbers and titles are written to sanguo.txt and inserted into the title table.