Once the data has been crawled successfully, it needs to be persisted: either to a local file or to a database. Here I use MySQL for storage, which requires the pymysql library (installable via pip).
1. My approach is to create a dbapi.py file that encapsulates all database operations. The code is as follows:
import logging

import pymysql

logging.basicConfig(level=logging.INFO)


class MysqlConn:
    """
    Create a MySQL database connection; the database itself must already exist.
    """

    def __init__(self, host, port, user, password, database):
        self.conn = pymysql.connect(
            host=host, port=port, user=user, password=password, database=database)
        self.c = self.conn.cursor()

    def close(self, save=True):
        # Disconnect; `save` decides whether pending changes are committed first (default: yes)
        if save:
            self.conn.commit()
        self.conn.close()

    def commit(self):
        # Commit the statements executed so far
        self.conn.commit()


class MysqlOpt:
    """On top of a database connection, wrap operations on a single table."""

    def __init__(self, db, tbname):
        self.db = db
        self.tbname = tbname

    def __len__(self):
        # `desc` yields one row per column, so this returns the column count
        return self.db.c.execute("desc %s" % self.tbname)

    @staticmethod
    def trf_dict(kw):
        """Convert a dict into the condition part of a SQL WHERE clause."""
        term = []
        for k, v in kw.items():
            if isinstance(v, int):
                term.append(k + '=' + str(v))
            else:
                term.append(k + "='" + v + "'")
        return ' and '.join(term)

    def get_tags(self):
        """Return the table's column definitions."""
        self.db.c.execute("desc %s" % self.tbname)
        return self.db.c.fetchall()

    def select(self, *tags, **kw):
        """Return query results; tags are the columns to select, kw the query conditions."""
        if not tags:
            tags = '*'
        if not kw:
            self.db.c.execute("select {0} from {1}".format(','.join(tags), self.tbname))
        else:
            cond = self.trf_dict(kw)
            self.db.c.execute(
                "select {0} from {1} where {2}".format(','.join(tags), self.tbname, cond))
        result = self.db.c.fetchall()
        return result

    def insert(self, values, tags=tuple()):
        """Insert one record.
        tags   columns to insert into, tuple or list; when omitted, all columns are assumed
        values the data to insert, tuple or list
        * when inserting a single column, pass it as (data,) -- the trailing comma is required
        """
        if not isinstance(tags, (tuple, list)) or not isinstance(values, (tuple, list)):
            raise TypeError('Error: tags and values must be tuple or list!')
        tags = '(' + ','.join(tags) + ')' if tags else ''
        # Use %s placeholders and let pymysql handle quoting and escaping
        placeholders = '(' + ','.join(['%s'] * len(values)) + ')'
        sentence = "insert into {0} {1} values {2}".format(self.tbname, tags, placeholders)
        logging.info('SQL sentence > " %s "' % sentence)
        try:
            self.db.c.execute(sentence, tuple(values))
            self.db.commit()
        except pymysql.err.ProgrammingError as e1:
            print('Error: ', e1)
        except pymysql.err.IntegrityError as e2:
            print('Error: ', e2)
The MysqlConn class creates the database connection; it takes the usual connection parameters (host, port, user, password) plus the name of the target database, and creates a cursor. MysqlOpt takes the connection object db created by MysqlConn and the name of the table to operate on, tbname; it currently provides select and insert methods, plus a get_tags method for inspecting the table structure.
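As a standalone illustration of what these methods build, the helper below mirrors the logic of trf_dict (the column and table names are hypothetical examples, not from the actual project):

```python
def trf_dict(kw):
    # Mirror of MysqlOpt.trf_dict: ints are kept bare, everything else is quoted
    term = []
    for k, v in kw.items():
        if isinstance(v, int):
            term.append(k + '=' + str(v))
        else:
            term.append(k + "='" + v + "'")
    return ' and '.join(term)

# Dicts preserve insertion order (Python 3.7+), so the clause order is stable
cond = trf_dict({'symbol': 'SH600036', 'year': 2018})
print(cond)  # symbol='SH600036' and year=2018

# This is the clause that select() embeds into its query:
sql = "select {0} from {1} where {2}".format('*', 'info', cond)
print(sql)  # select * from info where symbol='SH600036' and year=2018
```

Note that this simple quoting breaks if a string value itself contains a single quote; the conversion is fine for trusted keyword arguments, but untrusted input should go through pymysql's parameterized queries instead.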
2. In Scrapy's pipelines.py, create a BasePipeline class. It uses MysqlConn, pulling the connection parameters from settings, to connect to the MySQL database. Pipeline classes for specific needs can then be derived from it to perform the actual storage. The code is as follows:
import logging

from Utils.mysql_api import MysqlConn, MysqlOpt

logging.basicConfig(level=logging.INFO)


class BasePipeline(object):
    """Pipeline base class: reads the settings, connects to the database,
    and provides a method for connecting to a table."""

    def __init__(self, host, port, user, password, database):
        self.conn_db = MysqlConn(
            host=host, port=port, user=user, password=password, database=database)
        logging.info((host, port, user, password, database))

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            port=crawler.settings.get('MYSQL_PORT'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            database=crawler.settings.get('MYSQL_DB')
        )

    def connect_tb(self, tbname):
        conn_tb = MysqlOpt(self.conn_db, tbname)
        return conn_tb


class StockInfoPipeline(BasePipeline):
    """Connect to the info table and store the processed items in it."""

    def open_spider(self, spider):
        self.conn_tb_info = self.connect_tb('info')

    def close_spider(self, spider):
        self.conn_db.close()

    def process_item(self, item, spider):
        self.conn_tb_info.insert(list(item.values()))
        return item
If the data needs to go into multiple tables, you can create several table connections and write to each one separately.
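For reference, here is a minimal sketch of the settings.py entries that BasePipeline.from_crawler reads, plus the pipeline registration; the host, credentials, and database name are placeholders, and the dotted pipeline path assumes a Scrapy project package named xueqiu:

```python
# settings.py -- values are placeholders, adjust them to your own environment
MYSQL_HOST = 'localhost'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'your_password'
MYSQL_DB = 'stock'          # the database must already exist

# Register the pipeline; the number controls ordering when several pipelines run
ITEM_PIPELINES = {
    'xueqiu.pipelines.StockInfoPipeline': 300,
}
```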
I'm still very much a beginner; the project lives at https://github.com/Done-1026/xueqiu, and corrections from more experienced readers are welcome.