Storing Scrapy-scraped data in a database
- First, change the pipeline path
A pipeline file is where scraped data gets processed.
Here we can clean, transform, and store the data.
To keep things manageable, I prefer to split code with different responsibilities into separate files.
Scrapy supports multiple pipeline files (in effect, you replace the default pipeline path with an explicit configuration).
In xxx/xxx/settings.py, uncomment the ITEM_PIPELINES setting, which is commented out by default.
Then add your own pipeline path to it.
- Create the folder xxx/xxx/pipelines
- Create the file xxx/xxx/pipelines/xxx_pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem


class XxxSpiderPipeline(object):

    def __init__(self):
        # Fields that must not be empty / zero
        self.param_not_n_l = [
            'price_min'
        ]

    # Main body of the pipeline, called once per scraped item
    def process_item(self, item, spider):
        # Drop items whose price field is empty (scraped as 0)
        for item_k, item_v in item.items():
            if item_k in self.param_not_n_l and item_v == 0:
                # DropItem is an exception and must be raised, not returned
                raise DropItem(
                    "{item_k} is none".format(
                        item_k=item_k
                    )
                )
        return item
```
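When DropItem is raised, Scrapy catches the exception, logs the item as dropped, and keeps it from reaching any later pipelines, so nothing downstream has to handle the bad record.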
- Add the pipeline configuration
xxx/xxx/settings.py
```python
ITEM_PIPELINES = {
    'xxx.pipelines.xxx_pipelines.XxxSpiderPipeline': 300,
}
```
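The number (300 here) is the pipeline's order: values are customarily chosen from 0 to 1000, and items pass through pipelines from lower values to higher ones, which matters once several pipelines are registered.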
- Set up the database configuration
Add the database connection info to the settings file, xxx/xxx/settings.py:
```python
# The database configuration is custom; Scrapy defines no fixed setting names for it.
# Add the following block anywhere you find appropriate in the settings file.
MYSQL_CNF = {
    'world': {
        'HOST': '127.0.0.1',
        'PORT': 3306,
        'DATABASE': 'world',
        'USER': 'world',
        'PASSWORD': '111111'
    }
}
```
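Before wiring these settings into a pipeline, it can be worth sanity-checking the credentials outside of Scrapy. A minimal sketch using pymysql, mirroring the 'world' entry defined above (this is a one-off check script, not part of the project):

```python
import pymysql

# Quick standalone check that the MYSQL_CNF values actually connect.
conn = pymysql.connect(host='127.0.0.1', port=3306, user='world',
                       password='111111', database='world', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute('SELECT 1')  # trivial query: succeeds iff the connection works
        print(cursor.fetchone())    # expected output: (1,)
finally:
    conn.close()
```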
- Create the file xxx/xxx/pipelines/mysql_pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from datetime import datetime

import pymysql


class MysqlPipeline():

    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the connection info we added to settings.py
        mysql_cnf = crawler.settings.get('MYSQL_CNF').get('world', {})
        return cls(
            host=mysql_cnf.get('HOST', ''),
            database=mysql_cnf.get('DATABASE', ''),
            user=mysql_cnf.get('USER', ''),
            password=mysql_cnf.get('PASSWORD', ''),
            port=mysql_cnf.get('PORT', 3306),
        )

    def open_spider(self, spider):
        # pymysql 1.0+ only accepts keyword arguments for connect()
        self.db = pymysql.connect(host=self.host, user=self.user,
                                  password=self.password, database=self.database,
                                  charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        data = dict(item)
        # Append create_time/update_time to the scraped fields;
        # the target table comes from the item's `table` attribute
        now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        keys = ', '.join(list(data.keys()) + ['create_time', 'update_time'])
        values = ', '.join(['%s'] * (len(data) + 2))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(list(data.values()) + [now, now]))
        self.db.commit()
        return item
```
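Note that process_item reads item.table to decide which table to insert into. Scrapy's Item class has no such attribute by default, so the project's item is assumed to declare it as a plain class attribute. A minimal sketch (the class name, table name, and fields here are illustrative, not from the original project):

```python
# xxx/xxx/items.py (sketch; names are hypothetical)
import scrapy


class XxxItem(scrapy.Item):
    # Table that MysqlPipeline writes to, read via item.table in process_item()
    table = 'xxx_product'

    price_min = scrapy.Field()  # checked by XxxSpiderPipeline, dropped when 0
    title = scrapy.Field()
```

The target table also needs create_time and update_time columns, since the pipeline appends both to every insert.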
- Add mysql_pipelines.py to the settings.py file
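Following the same pattern as before, register it in ITEM_PIPELINES, keeping the filtering pipeline at a lower number so it runs before the database write. A sketch (the module paths follow the placeholder layout used above, and 400 is an arbitrary choice greater than 300):

```python
ITEM_PIPELINES = {
    'xxx.pipelines.xxx_pipelines.XxxSpiderPipeline': 300,  # filter bad items first
    'xxx.pipelines.mysql_pipelines.MysqlPipeline': 400,    # then write to MySQL
}
```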