
Article link: https://blog.csdn.net/weixin_41998371/article/details/109760320

Chapter 5: Data Storage

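There are two ways to store data: file storage and database storage.
This chapter covers:

File storage
Relational database storage
Non-relational database storage
Storing the scraped Douban Top 250 in MongoDB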
5.1 File Storage

Common file formats: TXT, JSON, CSV, PDF, and Excel

5.1.1 TXT Files

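Simple to work with, but hard to search.

"""Take data from the queue and print it to the console"""
def startWork(self):
    # ... some code omitted ...
    while not self.dataQueue.empty():
        print(self.num)
        print(self.dataQueue.get())

def save(self, item):
    # Append one record per line; item must be a string
    with open('./data/douban.txt', 'a', encoding='utf-8') as file:
        file.write(item)
        file.write('\n')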
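The save() method above appends one record per line. A minimal, self-contained sketch of the same idea follows; the helper names append_lines and read_lines are assumptions made for illustration and are not part of the original spider.

```python
import os
import tempfile


def append_lines(path, records):
    """Append each record to a UTF-8 text file, one record per line."""
    # Mode 'a' fails if the parent directory is missing, so create it first
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'a', encoding='utf-8') as file:
        for record in records:
            # str() lets dicts or numbers be written as well as plain strings
            file.write(str(record))
            file.write('\n')


def read_lines(path):
    """Read the file back as a list of lines without trailing newlines."""
    with open(path, encoding='utf-8') as file:
        return [line.rstrip('\n') for line in file]


if __name__ == '__main__':
    # Write to a temporary directory so the demo leaves the project untouched
    demo_path = os.path.join(tempfile.mkdtemp(), 'data', 'douban.txt')
    append_lines(demo_path, ['first record', 'second record'])
    print(read_lines(demo_path))
```

Because the file is opened in append mode, running the spider twice adds the same records twice, which is one reason TXT storage is hard to search and deduplicate.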

5.1.2 JSON Files

def parsePage(self, url):
    # ... some code omitted ...
    for node in node_list:
        # Store each movie as a dict so it can be serialized to JSON later
        item = {}
        item['title'] = node.xpath(".//span[@class='title']/text()")[0]
        item['score'] = node.xpath(".//span[@class='rating_num']/text()")[0]
        self.dataQueue.put(item)
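The snippet above only builds the item dicts and puts them on the queue. A hedged sketch of how such items might then be written to a JSON file is shown below; save_json and load_json are hypothetical helper names, and the sample item is made up for illustration rather than scraped.

```python
import json
import os
import tempfile


def save_json(path, items):
    """Write a list of dicts to a file as one JSON array."""
    with open(path, 'w', encoding='utf-8') as file:
        # ensure_ascii=False keeps non-ASCII titles readable instead of \uXXXX escapes
        json.dump(items, file, ensure_ascii=False, indent=2)


def load_json(path):
    """Read the JSON array back into a list of dicts."""
    with open(path, encoding='utf-8') as file:
        return json.load(file)


if __name__ == '__main__':
    # Illustrative sample values only, not real scraped results
    sample = [{'title': '示例电影', 'score': '9.0'}]
    demo_path = os.path.join(tempfile.mkdtemp(), 'douban.json')
    save_json(demo_path, sample)
    print(load_json(demo_path))
```

Collecting all items first and writing them in a single json.dump call keeps the file a valid JSON document; appending dicts one at a time would not.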

5.1.3 CSV Files
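The original post leaves this subsection without an example. As a sketch only, items shaped like the dicts in 5.1.2 could be written with the standard csv module; save_csv, load_csv, and the field names are assumptions for illustration.

```python
import csv


def save_csv(path, items, fieldnames=('title', 'score')):
    """Write a list of dicts to a CSV file with a header row."""
    # newline='' is recommended by the csv module docs to avoid blank rows on Windows
    with open(path, 'w', encoding='utf-8', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(items)


def load_csv(path):
    """Read the CSV file back as a list of dicts; every value comes back as a string."""
    with open(path, encoding='utf-8', newline='') as file:
        return list(csv.DictReader(file))
```

Note that CSV stores no type information, so a score saved as a number is read back as a string.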

5.2 Relational Database Storage with MySQL

Common relational databases:
Oracle, MySQL, DB2, SQLite, and SQL Server

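5. MySQL database commands

Command	Description
show databases;	List all databases
create database database_name charset=utf8;	Create a database with utf8 encoding
drop database database_name;	Drop the named database
use database_name;	Switch to the named database
select database();	Show the database currently in use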

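6. MySQL table commands

Command	Description
show tables;	List all tables in the current database
create table table_name(columns and types);	Create a table
alter table table_name add|modify column_name type;	Alter the table structure (add or modify a column)
drop table table_name;	Drop a table
desc table_name;	Show the table structure
rename table old_name to new_name;	Rename a table
show create table table_name;	Show the statement that created the table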

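7. MySQL commands for inserting, deleting, updating, and querying rows

Command	Description
insert into table_name(field1, field2, ..., fieldN) values(value1, value2, ..., valueN);	Insert
delete from table_name [where condition];	Delete
update table_name set field1 = new_value1, field2 = new_value2 [where condition];	Update
select field1, field2 from table_name [where condition];	Query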

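8. Common MySQL field constraints

The keywords below follow Django model-field naming; in plain MySQL DDL the corresponding constraints are written primary key, not null, unique, and default.

Keyword	Description
primary_key	Whether the field is the primary key
null	Whether the field may be NULL
unique	Whether values must be unique; true means duplicates are not allowed
default	Default value
blank	Whether the field may be left empty when a row is added or edited in the Django admin; null is a database concept, blank is a form-validation concept
ImageField	Image field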
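To see what the default, unique, and not null constraints do in practice, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in, so it runs without a MySQL server. This is an assumption made for convenience: SQLite's DDL differs slightly from MySQL's (for example autoincrement instead of auto_increment, and ? placeholders instead of %s), and the function name demo_constraints is hypothetical.

```python
import sqlite3


def demo_constraints():
    """Return what happens for each constraint on an in-memory student table."""
    conn = sqlite3.connect(':memory:')
    try:
        conn.execute(
            "create table student("
            " id integer primary key autoincrement,"
            " name varchar(100) not null,"
            " phone char(11) unique not null,"
            " address varchar(100) default 'Zhengzhou')"
        )
        conn.execute("insert into student(name, phone) values(?, ?)", ('A', '15500000000'))
        results = {}
        # default: address was not supplied, so the default value is stored
        results['default'] = conn.execute("select address from student").fetchone()[0]
        # unique: a second row with the same phone number is rejected
        try:
            conn.execute("insert into student(name, phone) values(?, ?)", ('B', '15500000000'))
            results['unique'] = 'accepted'
        except sqlite3.IntegrityError:
            results['unique'] = 'rejected'
        # not null: a row without a name is rejected
        try:
            conn.execute("insert into student(name, phone) values(?, ?)", (None, '15500000001'))
            results['not null'] = 'accepted'
        except sqlite3.IntegrityError:
            results['not null'] = 'rejected'
        return results
    finally:
        conn.close()


if __name__ == '__main__':
    print(demo_constraints())
```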

Example 1: Using the commands

create database python charset utf8;
use python;
-- Create grade first: student has a foreign key that references grade(id)
create table grade(
    id int primary key auto_increment,
    name varchar(100) not null
);
create table student(
    id int primary key auto_increment,
    name varchar(100) not null,
    sex char(1) not null,
    phone char(11) unique not null,
    address varchar(100) default 'Zhengzhou',
    birthday date not null,
    gid int not null,
    foreign key(gid) references grade(id)
);
insert into grade(name) values('Grade 1');
insert into grade(name) values('Grade 2');
insert into student(name, sex, phone, address, birthday, gid) values('Wang Qiang', 'M', '15583678666', 'Kaifeng', '1990-02-04', 1);
insert into student(name, sex, phone, address, birthday, gid) values('Li Li', 'F', '16683678657', 'Zhengzhou', '1991-03-12', 2);

select * from grade;
select * from python.student;
delete from python.student where python.student.id=1;

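9. Using MySQL from Python

Install the pymysql library.
In PyCharm, create a connection and create a new database named python with utf8 encoding.

Example 2: Creating a MySQL connection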

import pymysql

conn = pymysql.Connect(host='localhost', port=3306, db='python', user='root', password='111111', charset='utf8')
print(conn)
print(type(conn))
conn.close()

Output

<pymysql.connections.Connection object at 0x0000021BCFC266D0>
<class 'pymysql.connections.Connection'>

Analysis
pymysql.Connect() creates a database connection object.
Close the connection object when you are finished with it to release resources.

Example 3: Querying and updating MySQL

"""Python and MySQL: query and update"""
import pymysql


def select():
    """Query"""
    conn = None
    try:
        conn = pymysql.Connect(host='localhost', port=3306, db='python', user='root', password='111111', charset='utf8')
        # https://www.jb51.net/article/177865.htm
        # Create a cursor object
        cur = conn.cursor()
        # Run the SQL statement with execute(); %s is filled in from the parameter list
        cur.execute('select * from student where id=%s', [2])
        # fetchone() returns one row as a tuple; fetchall() returns all rows as a tuple of tuples
        result = cur.fetchone()
        print(result)
    except Exception as ex:
        print(ex)
    finally:
        # Close the connection even if the query failed
        if conn is not None:
            conn.close()


def update():
    """Update"""
    conn = None
    try:
        conn = pymysql.Connect(host='localhost', port=3306, db='python', user='root', password='111111', charset='utf8')
        cur = conn.cursor()
        # execute() returns the number of affected rows
        count = cur.execute('update student set name=%s where id=%s', ['ss', 2])
        if count > 0:
            print('success')
        else:
            print('no rows changed')
        # Changes are not saved until commit() is called
        conn.commit()
    except Exception as ex:
        print(ex)
    finally:
        if conn is not None:
            conn.close()


if __name__ == '__main__':
    select()
    update()

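When the query returns multiple rows	When the query returns a single row
cursor.fetchone(): returns only the first row as a single tuple such as ('id', 'name'); call cursor.fetchone() repeatedly to get the following rows one at a time until it returns None.	cursor.fetchone(): returns that one row as a single tuple such as ('id', 'name').
cursor.fetchall(): returns all rows as a tuple of tuples, such as (('id', 'name'), ('id', 'name')).	cursor.fetchall(): also returns all rows as a tuple of tuples, such as (('id', 'name'),).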
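The difference between the two fetch methods can be tried without a MySQL server. The sketch below uses the built-in sqlite3 module as a stand-in, which is an assumption made so the example is self-contained; its cursor follows the same DB-API pattern, but its fetchall() returns a list of tuples where pymysql's default cursor returns a tuple of tuples. The function name fetch_demo and the sample rows are illustrative.

```python
import sqlite3


def fetch_demo():
    """Return (first row, remaining rows read one by one, all rows from fetchall)."""
    conn = sqlite3.connect(':memory:')
    try:
        cur = conn.cursor()
        cur.execute("create table student(id integer primary key, name text)")
        cur.executemany("insert into student(id, name) values(?, ?)",
                        [(1, 'a'), (2, 'b'), (3, 'c')])

        cur.execute("select id, name from student order by id")
        # The first call returns the first row
        first = cur.fetchone()
        # Each further call returns the next row, then None when no rows are left
        rest = []
        row = cur.fetchone()
        while row is not None:
            rest.append(row)
            row = cur.fetchone()

        # Run the query again and read everything in a single call
        cur.execute("select id, name from student order by id")
        everything = cur.fetchall()
        return first, rest, everything
    finally:
        conn.close()


if __name__ == '__main__':
    print(fetch_demo())
```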

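5.3 Non-relational Database Storage

Non-relational databases are also called NoSQL databases. Their characteristics are:
They do not support SQL syntax
Their storage structure is completely different from that of traditional relational databases; data is stored in key-value (KV) form
There is no common language; each NoSQL database has its own API and syntax, and its own best-fit use cases

5.3.1 Redis

1. Redis features:
Supports data persistence: in-memory data can be saved to disk and loaded again after a restart
Supports storing a variety of data structures
Supports data backup
2. Redis advantages
3. Installation

8. Using Redis from Python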

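Example 4: Creating a Redis connection

"""Create a connection between Python and Redis"""
import redis

client = redis.StrictRedis(host='localhost', port=6379, db=0)
print(client)

# Shorthand: the defaults are host='localhost', port=6379, db=0
# client = redis.StrictRedis()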

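Example 5: Create, read, update, and delete with the String type

"""Python and Redis: create, read, update, and delete"""

from redis import StrictRedis


def insert_update():
    """Insert or update"""
    try:
        # Create a StrictRedis object and connect to the Redis server
        sr = StrictRedis()
        # Set the value of key name: updates it if the key exists, otherwise adds it
        result = sr.set('name', 'python')
        # Prints True if the command succeeded
        print(result)
    except Exception as e:
        print(e)


def select():
    """Query"""
    try:
        sr = StrictRedis()
        result = sr.get('name')
        # Print the value of the key (as bytes); None if the key does not exist
        print(result)
    except Exception as e:
        print(e)


def delete():
    """Delete"""
    try:
        sr = StrictRedis()
        # Delete the key name
        result = sr.delete('name')
        # Prints the number of keys removed, or 0 if the key did not exist
        print(result)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    insert_update()
    select()
    delete()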
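One detail worth knowing when running Example 5: by default redis-py returns values as bytes, so get('name') prints b'python' rather than 'python'. The sketch below shows one way to normalize this. The helper get_text is a hypothetical name, and FakeClient is only an in-memory stand-in written for this demo so it runs without a Redis server; it is not part of redis-py.

```python
def get_text(client, key, encoding='utf-8'):
    """Return the value stored at key as str, or None if the key does not exist."""
    raw = client.get(key)
    if raw is None:
        return None
    # redis-py returns bytes unless the client was created with decode_responses=True
    return raw.decode(encoding) if isinstance(raw, bytes) else raw


class FakeClient:
    """In-memory stand-in with the three calls used in Example 5 (demo only)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Store strings as bytes, mirroring what a default redis-py client returns from get()
        self._data[key] = value.encode('utf-8') if isinstance(value, str) else value
        return True

    def get(self, key):
        return self._data.get(key)

    def delete(self, *keys):
        # Return how many of the given keys were actually removed
        return sum(1 for key in keys if self._data.pop(key, None) is not None)


if __name__ == '__main__':
    client = FakeClient()
    client.set('name', 'python')
    print(client.get('name'), get_text(client, 'name'))
```

With a real server, the same helper can be called as get_text(StrictRedis(), 'name'); alternatively, creating the client with StrictRedis(decode_responses=True) makes get() return str directly.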

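5.3.2 MongoDB

1. Features
Schema-free
Collection-oriented storage
Full index support
Replication and high availability
Automatic sharding to support cloud-scale scalability
2. MongoDB advantages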

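Example 6: Creating a MongoDB connection

"""Python and MongoDB: create a connection"""

import pymongo

# Create a MongoClient object (note the colon between host and port)
myclient = pymongo.MongoClient('mongodb://localhost:27017/')
# Select the database mydb
mydb = myclient['mydb']
print(mydb)
# Close the connection
myclient.close()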

Example 7: Create, read, update, and delete in MongoDB

"""Python and MongoDB: create, read, update, and delete"""
import pymongo


def is_having():
    """Check whether the database exists"""
    # Get a client object
    myclient = pymongo.MongoClient('mongodb://localhost:27017/')
    # Get the names of all databases
    dblist = myclient.list_database_names()
    # Check
    if 'mydb' in dblist:
        print('Database already exists')
    else:
        print('Database does not exist')
    # Close
    myclient.close()


def insert():
    """Insert"""
    # Get a client object
    myclient = pymongo.MongoClient('mongodb://localhost:27017/')
    # Get the database
    mydb = myclient['mydb']
    # Get the collection
    stu = mydb['stu']
    # Insert one document.
    # insert() is deprecated ("DeprecationWarning: insert is deprecated.
    # Use insert_one or insert_many instead."), so use insert_one or insert_many.
    result = stu.insert_one({
        'name': 'Sweeping Monk',
        'hometown': 'Shaolin Temple',
        'age': 66,
        'gender': True
    })
    # insert_one returns a result object; the new document's id is in inserted_id
    print(result.inserted_id)
    # Close
    myclient.close()


def select():
    """Query"""
    # Get a client object
    myclient = pymongo.MongoClient('mongodb://localhost:27017/')
    # Get the database
    mydb = myclient['mydb']
    # Get the collection
    stu = mydb['stu']
    # Query all documents
    ret = stu.find()
    # Iterate over the results
    for i in ret:
        print(i)
    # Close
    myclient.close()


def update():
    """Update"""
    # Get a client object
    myclient = pymongo.MongoClient('mongodb://localhost:27017/')
    # Get the database
    mydb = myclient['mydb']
    # Get the collection
    stu = mydb['stu']
    # Add 1 to age in every document whose age is greater than 20
    x = stu.update_many({'age': {'$gt': 20}}, {'$inc': {'age': 1}})
    print(x.modified_count, 'document(s) modified')
    # Close
    myclient.close()


def delete():
    """Delete"""
    # Get a client object
    myclient = pymongo.MongoClient('mongodb://localhost:27017/')
    # Get the database
    mydb = myclient['mydb']
    # Get the collection
    stu = mydb['stu']
    # Delete every document whose age is greater than 20
    x = stu.delete_many({'age': {'$gt': 20}})
    print(x.deleted_count, 'document(s) deleted')
    # Close
    myclient.close()


if __name__ == '__main__':
    is_having()
    insert()
    select()
    update()
    delete()
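The filter {'age': {'$gt': 20}} and the update {'$inc': {'age': 1}} used in Example 7 can be hard to read at first. The sketch below mimics their meaning on a plain Python list of dicts. It is a simplified illustration of the semantics only, not how MongoDB or pymongo implement them, and all function names are hypothetical.

```python
def matches_gt(document, field, threshold):
    """True when the document has the field and its value is greater than threshold."""
    return field in document and document[field] > threshold


def update_many_inc(documents, field, threshold, amount):
    """Mimic update_many({field: {'$gt': threshold}}, {'$inc': {field: amount}})."""
    modified = 0
    for document in documents:
        if matches_gt(document, field, threshold):
            # $inc adds the amount to the current value of the field
            document[field] += amount
            modified += 1
    return modified


def delete_many_gt(documents, field, threshold):
    """Mimic delete_many({field: {'$gt': threshold}}); returns (remaining, deleted_count)."""
    remaining = [d for d in documents if not matches_gt(d, field, threshold)]
    return remaining, len(documents) - len(remaining)


if __name__ == '__main__':
    docs = [{'name': 'a', 'age': 66}, {'name': 'b', 'age': 18}]
    print(update_many_inc(docs, 'age', 20, 1), docs)
    print(delete_many_gt(docs, 'age', 20))
```

This also explains the output of Example 7: the inserted document has age 66, so it is matched by the update and then removed by the delete.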

5.4 Project Case Study: Scraping Douban Movies

Example 8: Scraping Douban movies and saving them to MongoDB

"""Scrape Douban movie information and save it to MongoDB (order is not preserved)"""

# Import modules
import requests
import random
import time
import threading
import pymongo
from lxml import etree


class DouBanSpider:
    """Spider class"""
    def __init__(self):
        """Constructor"""
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
        self.baseURL = "https://movie.douban.com/top250"
        # Despite its name, self.client is the stu collection in the mydb database
        self.client = pymongo.MongoClient('mongodb://localhost:27017/')['mydb']['stu']


    def loadPage(self, url):
        """Send a request to the URL and return the response content"""

        # Sleep for a random 0 to 2 seconds so requests are not fast enough to get blocked
        time.sleep(random.random()*2)
        return requests.get(url, headers=self.headers).content


    def parsePage(self, url):
        """Extract the movies on a page; on the start page, also return all page URLs"""

        # Get the response content for the URL
        content = self.loadPage(url)
        # Parse it into an element object for XPath queries
        html = etree.HTML(content)

        # All movie nodes
        # In the XPath browser plugin, class="info" is written with double quotes;
        # in PyCharm, use single quotes. The same applies below.
        node_list = html.xpath("//div[@class='info']")
        # Iterate over the nodes
        for node in node_list:
            # Use a dict to hold the data
            item = {}
            # Title of each movie
            item['title'] = node.xpath(".//span[@class='title'][1]/text()")[0]
            # Rating of each movie
            item['score'] = node.xpath(".//span[@class='rating_num']/text()")[0]
            # Save the data to MongoDB
            self.client.insert_one(item)
        # Only the first page collects the list of all page URLs; other pages do not
        if url == self.baseURL:
            # This approach works when the first page shows every page number
            return [self.baseURL + link for link in html.xpath("//div[@class='paginator']/a/@href")]


    def startWork(self):
        """Start"""

        print('begin...')
        # Request the first page: returns the links to all pages and extracts the first page's movies
        link_list = self.parsePage(self.baseURL)

        thread_list = []
        # Request every remaining page and extract all movie information
        for link in link_list:
            # The loop creates 9 threads, each running one task
            thread = threading.Thread(target=self.parsePage, args=[link])
            thread.start()
            thread_list.append(thread)
        # The parent thread waits for all child threads to finish before it ends
        for thread in thread_list:
            thread.join()
        print("end...")


if __name__ == "__main__":
    # Create the spider object
    spider = DouBanSpider()
    # Start crawling
    spider.startWork()
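Example 8 discovers the other pages by reading the paginator links on the first page. An alternative is to build the page URLs directly. The sketch below assumes that the Top 250 list is split into 10 pages of 25 movies selected by a start query parameter; that URL scheme is an assumption inferred from the paginator links and should be checked against the live site. The function name build_page_urls is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def build_page_urls(base_url=BASE_URL, total=250, page_size=25):
    """Build one URL per page, assuming pages are selected with ?start=<offset>."""
    urls = []
    for start in range(0, total, page_size):
        # urlencode builds the query string, e.g. {'start': 25} -> 'start=25'
        urls.append(f"{base_url}?{urlencode({'start': start})}")
    return urls


if __name__ == '__main__':
    for page_url in build_page_urls():
        print(page_url)
```

Building the list up front removes the special case for the first page, at the cost of hard-coding the page size.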

Example 9: Scraping Douban movies and saving them to MySQL

"""Scrape Douban movie information and save it to MySQL"""

# Import modules
import requests
import random
import time
import threading
import pymysql
from lxml import etree


class MysqlHelper():
    """Helper class: wraps insert, delete, update, and query methods for easy reuse."""
    def __init__(self, host, db, user, passwd, port=3306, charset='utf8'):

        self.host = host
        self.port = port
        self.db = db
        self.user = user
        # Use the password passed in by the caller
        self.passwd = passwd
        self.charset = charset

    # Open the connection
    def connect(self):
        self.conn = pymysql.Connect(host=self.host, port=self.port, db=self.db, user=self.user, password=self.passwd, charset=self.charset)
        self.cursor = self.conn.cursor()

    # Close the connection
    def close(self):
        self.cursor.close()
        self.conn.close()

    # Query a single row
    def select_one(self, sql, params=None):
        result = None
        try:
            self.connect()
            self.cursor.execute(sql, params)
            result = self.cursor.fetchone()
            self.close()
        except Exception as e:
            print(e)
        return result

    # Query multiple rows
    def select_all(self, sql, params=None):
        result = ()
        try:
            self.connect()
            self.cursor.execute(sql, params)
            result = self.cursor.fetchall()
            self.close()
        except Exception as e:
            print(e)
        return result

    def __edit(self, sql, params):
        # Run a statement that changes data and return the number of affected rows
        count = 0
        try:
            self.connect()
            count = self.cursor.execute(sql, params)
            self.conn.commit()
            self.close()
        except Exception as e:
            print(e)
        return count
    # Insert
    def insert(self, sql, params=None):
        return self.__edit(sql, params)
    # Update
    def update(self, sql, params=None):
        return self.__edit(sql, params)
    # Delete
    def delete(self, sql, params=None):
        return self.__edit(sql, params)



class DouBanSpider:
    """Spider class"""

    def __init__(self):
        """Constructor"""

        # headers: mainly sets User-Agent so the request looks like a real browser
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"}
        # baseURL: the base URL
        self.baseURL = "https://movie.douban.com/top250"


    def loadPage(self, url):
        """Send a request to the URL and return the response content"""

        # Sleep for a random 0 to 2 seconds; crawling too fast can get the spider blocked
        time.sleep(random.random()*2)
        return requests.get(url, headers=self.headers).content

    def parsePage(self, url):
        """Extract the movies on a page; on the start page, also return all page URLs"""

        # Get the response content for the URL
        content = self.loadPage(url)
        # Parse it into an element object for XPath queries
        html = etree.HTML(content)

        # All movie nodes
        node_list = html.xpath("//div[@class='info']")

        # MySQL helper object (replace the password with your own)
        helper = MysqlHelper(host='localhost', port=3306, db='python', user='root', passwd='111111')

        # Iterate over the nodes
        for node in node_list:
            # Use a dict to hold the data
            item = {}

            # Title of each movie
            item['title'] = node.xpath(".//span[@class='title']/text()")[0]
            # Rating of each movie
            item['score'] = node.xpath(".//span[@class='rating_num']/text()")[0]

            # Save the data to MySQL
            helper.insert('insert into douban(title,score) values(%s,%s)', [item['title'], float(item['score'])])

        # Only the first page collects the list of all page URLs; other pages do not
        if url == self.baseURL:
            return [self.baseURL + link for link in html.xpath("//div[@class='paginator']/a/@href")]

    def startWork(self):
        """Start"""

        print('begin...')

        # Request the first page: returns the links to all pages and extracts the first page's movies
        link_list = self.parsePage(self.baseURL)

        thread_list = []
        # Request every remaining page and extract all movie information
        for link in link_list:
            # self.parsePage(link)
            # The loop creates 9 threads, each running one task
            thread = threading.Thread(target=self.parsePage, args=[link])
            thread.start()
            thread_list.append(thread)

        # The parent thread waits for all child threads to finish before it ends
        for thread in thread_list:
            thread.join()


        print('end...')


if __name__ == "__main__":
    # Create the spider object
    spider = DouBanSpider()
    # Start crawling
    spider.startWork()
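Both spiders start one thread per page. A possible alternative, sketched here under the assumption that a bounded pool is preferable, is the standard library's ThreadPoolExecutor. The function crawl_all and the parse argument are hypothetical names; in the spiders above, parse would be spider.parsePage.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(links, parse, max_workers=4):
    """Run parse(link) for every link on a bounded thread pool.

    Returns a dict that maps each link to its result, or to the exception it raised.
    """
    outcomes = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every page and remember which future belongs to which link
        futures = {pool.submit(parse, link): link for link in links}
        for future in as_completed(futures):
            link = futures[future]
            try:
                outcomes[link] = future.result()
            except Exception as exc:
                # Record the failure and keep processing the other pages
                outcomes[link] = exc
    return outcomes


if __name__ == '__main__':
    # A stand-in parse function so the demo runs without network access
    print(crawl_all(['page-1', 'page-2'], lambda link: link.upper(), max_workers=2))
```

Leaving the with block waits for all submitted tasks, which plays the same role as the thread.join() loop in startWork().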

The douban table used by Example 9 must exist in the python database before the spider runs:

use python;
create table douban(
    id int primary key auto_increment,
    title varchar(100),
    score float
);
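Example 9 opens and closes a connection for every inserted row. A common alternative is to collect the rows for a page and insert them in one batch with executemany() and a single commit. The sketch below uses the built-in sqlite3 module as a stand-in so it runs without a MySQL server; this substitution is an assumption made for the demo. With pymysql the placeholders would be %s instead of ?, and the column definition would use auto_increment. The names save_movies and make_demo_db are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a douban table shaped like the one above."""
    conn = sqlite3.connect(':memory:')
    conn.execute(
        "create table douban("
        " id integer primary key autoincrement,"
        " title varchar(100),"
        " score float)"
    )
    return conn


def save_movies(conn, items):
    """Insert all items with one executemany() call and one commit; return the row count."""
    rows = [(item['title'], float(item['score'])) for item in items]
    cur = conn.cursor()
    try:
        cur.executemany("insert into douban(title, score) values(?, ?)", rows)
        conn.commit()
        return len(rows)
    except Exception:
        # Undo the partial batch so the table is not left half-written
        conn.rollback()
        raise
    finally:
        cur.close()


if __name__ == '__main__':
    demo_conn = make_demo_db()
    # Illustrative sample values only, not real scraped results
    print(save_movies(demo_conn, [{'title': 'Movie A', 'score': '9.0'}]))
    demo_conn.close()
```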
