Storing Scrapy-scraped data in a database
- First, change the pipeline path
A pipeline file is where scraped data gets processed.
Here we can clean, transform, and store the data.
To keep things manageable, I prefer to split code with different responsibilities into separate files.
Scrapy supports multiple pipeline files (in effect, you replace the default pipeline path with an explicit configuration).
In xxx/xxx/settings.py, uncomment the ITEM_PIPELINES setting, which is commented out by default.
Then add your own pipeline path to it.
- Create the folder xxx/xxx/pipelines
- Create the file xxx/xxx/pipelines/xxx_pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem


class XxxSpiderPipeline(object):

    def __init__(self):
        # Fields that must not be empty / zero
        self.param_not_n_l = [
            'price_min'
        ]

    # Main body of the pipeline, called once per scraped item
    def process_item(self, item, spider):
        # Drop items whose price field is empty (scraped as 0)
        for item_k, item_v in item.items():
            if item_k in self.param_not_n_l and item_v == 0:
                # DropItem is an exception and must be raised, not returned
                raise DropItem(
                    "{item_k} is none".format(
                        item_k=item_k
                    )
                )
        return item
```
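When DropItem is raised, Scrapy catches the exception, logs the item as dropped, and keeps it from reaching any later pipelines, so nothing downstream has to handle the bad record.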
- Add the pipeline configuration
xxx/xxx/settings.py
```python
ITEM_PIPELINES = {
    'xxx.pipelines.xxx_pipelines.XxxSpiderPipeline': 300,
}
```
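The number (300 here) is the pipeline's order: values are customarily chosen from 0 to 1000, and items pass through pipelines from lower values to higher ones, which matters once several pipelines are registered.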
- Set up the database configuration
Add the database connection info to the settings file, xxx/xxx/settings.py:
```python
# The database configuration is custom; Scrapy defines no fixed setting names for it.
# Add the following block anywhere you find appropriate in the settings file.
MYSQL_CNF = {
    'world': {
        'HOST': '127.0.0.1',
        'PORT': 3306,
        'DATABASE': 'world',
        'USER': 'world',
        'PASSWORD': '111111'
    }
}
```
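Before wiring these settings into a pipeline, it can be worth sanity-checking the credentials outside of Scrapy. A minimal sketch using pymysql, mirroring the 'world' entry defined above (this is a one-off check script, not part of the project):

```python
import pymysql

# Quick standalone check that the MYSQL_CNF values actually connect.
conn = pymysql.connect(host='127.0.0.1', port=3306, user='world',
                       password='111111', database='world', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute('SELECT 1')  # trivial query: succeeds iff the connection works
        print(cursor.fetchone())    # expected output: (1,)
finally:
    conn.close()
```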
- Create the file xxx/xxx/pipelines/mysql_pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from datetime import datetime

import pymysql


class MysqlPipeline():

    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the connection info we added to settings.py
        mysql_cnf = crawler.settings.get('MYSQL_CNF').get('world', {})
        return cls(
            host=mysql_cnf.get('HOST', ''),
            database=mysql_cnf.get('DATABASE', ''),
            user=mysql_cnf.get('USER', ''),
            password=mysql_cnf.get('PASSWORD', ''),
            port=mysql_cnf.get('PORT', 3306),
        )

    def open_spider(self, spider):
        # pymysql 1.0+ only accepts keyword arguments for connect()
        self.db = pymysql.connect(host=self.host, user=self.user,
                                  password=self.password, database=self.database,
                                  charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        data = dict(item)
        # Append create_time/update_time to the scraped fields;
        # the target table comes from the item's `table` attribute
        now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        keys = ', '.join(list(data.keys()) + ['create_time', 'update_time'])
        values = ', '.join(['%s'] * (len(data) + 2))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(list(data.values()) + [now, now]))
        self.db.commit()
        return item
```
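Note that process_item reads item.table to decide which table to insert into. Scrapy's Item class has no such attribute by default, so the project's item is assumed to declare it as a plain class attribute. A minimal sketch (the class name, table name, and fields here are illustrative, not from the original project):

```python
# xxx/xxx/items.py (sketch; names are hypothetical)
import scrapy


class XxxItem(scrapy.Item):
    # Table that MysqlPipeline writes to, read via item.table in process_item()
    table = 'xxx_product'

    price_min = scrapy.Field()  # checked by XxxSpiderPipeline, dropped when 0
    title = scrapy.Field()
```

The target table also needs create_time and update_time columns, since the pipeline appends both to every insert.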
- Add mysql_pipelines.py to the settings.py file
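Following the same pattern as before, register it in ITEM_PIPELINES, keeping the filtering pipeline at a lower number so it runs before the database write. A sketch (the module paths follow the placeholder layout used above, and 400 is an arbitrary choice greater than 300):

```python
ITEM_PIPELINES = {
    'xxx.pipelines.xxx_pipelines.XxxSpiderPipeline': 300,  # filter bad items first
    'xxx.pipelines.mysql_pipelines.MysqlPipeline': 400,    # then write to MySQL
}
```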