Python Web Scraping Self-Study Guide: Writing Scraped Data to a MySQL Database

Latest recommended article published on 2024-08-08 20:03:46

良木66

Article link: https://blog.csdn.net/qq_44503987/article/details/105118182

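Recap
The previous section showed how to write scraped data to a JSON file; this section shows how to write it to a MySQL database. Doing so only requires changing pipelines.py: in general, any change to where the output goes is made in the pipeline file alone.
Open the pipeline file. What we wrote in the previous section looks like this: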

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json


class DemoPipeline(object):
    # Earlier version, which only printed each field:
    # def process_item(self, item, spider):
    #     print("Blog's name:" + item['name'])
    #     print("The number of the blog's reads:" + item['red_number'])
    #     print("The date of the blog's publication:", item['publish_date'])

    def __init__(self):
        # Open the output file in binary mode; each item becomes one JSON line.
        self.json_file = open("./demo.json", "wb+")

    def close_spider(self, spider):
        print('---------- closing file ----------')
        self.json_file.close()

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.json_file.write(text.encode("utf-8"))
        # Return the item so any later pipeline can keep processing it.
        return item

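This time we will load the data into a MySQL database instead, which makes it easier to store and to mine later. If you are not familiar with connecting Python to MySQL, see this blog post: python链接MySQL (Connecting Python to MySQL).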
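This time, in the class's initializer (its constructor), we connect to the database, check whether the table already exists, and create it if it does not.
Why check whether the table exists? If it is missing, the data has nowhere to go and the insert fails with a "table doesn't exist" error, so the pipeline must check for the table and create it before any item is written.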
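An alternative to checking for the table by hand is to let the database do the check with `CREATE TABLE IF NOT EXISTS`, which MySQL also supports. The sketch below is an illustration only: it uses Python's built-in `sqlite3` module as a stand-in so that it runs without a MySQL server, the helper name `ensure_blogs_table` is hypothetical, and the column types are SQLite's rather than the ones used for the MySQL table in this article.

```python
import sqlite3


def ensure_blogs_table(connection):
    """Create the blogs table only if it is missing (hypothetical helper)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS blogs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "name VARCHAR(200), "
        "red_number TEXT, "
        "publish_date VARCHAR(200))"
    )


# An in-memory SQLite database stands in for the MySQL demo database.
con = sqlite3.connect(":memory:")
ensure_blogs_table(con)
ensure_blogs_table(con)  # Calling it again is harmless: no "already exists" error.

# Confirm that exactly one blogs table was created.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'blogs'")]
print(tables)
```

With this approach the create statement can run on every start-up, at the cost of no longer knowing whether the table was there beforehand.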
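Recall from the earlier sections how Scrapy runs: the spider creates requests; the Scrapy engine passes them to the downloader; the downloader sends them out to the internet and receives the responses; the engine passes each response back to the spider, which parses it into items; and the engine hands those items to the pipeline. The pipeline then processes the data and can save it to the file system or to a database.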
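The hand-off from spider to pipeline can be modelled in a few lines of plain Python. This is a toy model, not Scrapy's actual engine: `fake_spider`, `CollectPipeline`, and `run` are hypothetical names, and the hard-coded items are made-up placeholders for data parsed from real responses.

```python
def fake_spider():
    """Stand-in for a spider's parse step: yields one item per blog post."""
    yield {"name": "post-1", "red_number": "10", "publish_date": "2020-03-26"}
    yield {"name": "post-2", "red_number": "25", "publish_date": "2020-03-27"}


class CollectPipeline:
    """Stand-in for an item pipeline; stores items in a list, not a database."""

    def __init__(self):
        self.saved = []
        self.closed = False

    def process_item(self, item, spider):
        self.saved.append(item)
        return item  # Returning the item lets a later pipeline receive it.

    def close_spider(self, spider):
        self.closed = True


def run(spider_func, pipeline):
    """Play the engine's role: pass every item to the pipeline, then close it."""
    for item in spider_func():
        pipeline.process_item(item, spider_func)
    pipeline.close_spider(spider_func)


pipeline = CollectPipeline()
run(fake_spider, pipeline)
print(len(pipeline.saved), pipeline.closed)
```

The point of the model is the calling convention: `process_item` runs once per item and `close_spider` runs once at the end, which is why the database connection is opened once and closed once rather than per item.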
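I recommend MySQL Workbench, the graphical tool that ships with MySQL, because it makes these operations easier to see. Open a new database connection and create a database named demo. The result looks like this:
[Image: the demo database in MySQL Workbench]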
The database contains no tables yet; the program creates the table and adds the data. Here is the modified pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import mysql.connector


def table_exists(cursor, table_name):
    """Return True if table_name is listed among the tables of the demo database."""
    cursor.execute("show tables from demo")
    # Each returned row is a one-element tuple holding a table name.
    for row in cursor.fetchall():
        if table_name in row:
            return True
    return False


class DemoPipeline(object):
    def __init__(self):
        # Connect to the demo database.
        self.con = mysql.connector.connect(host='localhost', port=3306, user='root', password='11131432',
                                           database='demo', use_unicode=True)
        # Commit each statement automatically; without this line, the transaction
        # must be committed manually in process_item(self, item, spider).
        self.con.autocommit = True
        # Create a cursor.
        self.cu = self.con.cursor()
        # Check whether the blogs table already exists in the demo database;
        # if it does not, create it.
        # Note: in MySQL, LONG is a synonym for MEDIUMTEXT, so red_number is stored as text.
        if not table_exists(self.cu, 'blogs'):
            self.cu.execute(
                "create table blogs(id integer primary key auto_increment,name varchar(200),red_number long,"
                "publish_date "
                "varchar(200))")

    def close_spider(self, spider):
        # Scrapy calls this hook with the spider as its argument.
        print('---------- closing database connection ----------')
        self.cu.close()
        self.con.close()

    def process_item(self, item, spider):
        self.cu.execute("INSERT INTO `demo`.`blogs`(`name`,`red_number`,`publish_date`)VALUES(%s,%s,%s)",
                        (item['name'], item['red_number'], item['publish_date']))
        # self.con.commit()  # Manual commit; needed only when autocommit is off.
        # Return the item so any later pipeline can keep processing it.
        return item
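Hard-coding the password in the pipeline file is convenient for a demo, but the connection details could also live in the project settings. The sketch below is one possible arrangement, not the method used in this article: the class name `SettingsDemoPipeline` and the setting names `MYSQL_HOST`, `MYSQL_PORT`, `MYSQL_USER`, `MYSQL_PASSWORD`, and `MYSQL_DATABASE` are assumptions that are not part of Scrapy itself. It relies on Scrapy's `from_crawler` class method hook to read the settings and opens the connection in `open_spider`.

```python
class SettingsDemoPipeline:
    """Variant of DemoPipeline that reads its MySQL credentials from settings."""

    def __init__(self, db_config):
        self.db_config = db_config
        self.con = None
        self.cu = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the pipeline. The MYSQL_* names are
        # assumptions for this sketch; they would be defined in settings.py.
        settings = crawler.settings
        return cls({
            "host": settings.get("MYSQL_HOST", "localhost"),
            "port": int(settings.get("MYSQL_PORT", 3306)),
            "user": settings.get("MYSQL_USER", "root"),
            "password": settings.get("MYSQL_PASSWORD", ""),
            "database": settings.get("MYSQL_DATABASE", "demo"),
        })

    def open_spider(self, spider):
        # Imported here so the class can be defined without the driver installed.
        import mysql.connector
        self.con = mysql.connector.connect(use_unicode=True, **self.db_config)
        self.con.autocommit = True
        self.cu = self.con.cursor()

    def process_item(self, item, spider):
        self.cu.execute(
            "INSERT INTO `blogs`(`name`,`red_number`,`publish_date`) VALUES(%s,%s,%s)",
            (item['name'], item['red_number'], item['publish_date']))
        return item

    def close_spider(self, spider):
        self.cu.close()
        self.con.close()
```

This variant leaves out the table check for brevity; it would belong in `open_spider`, right after the cursor is created.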