Learn Python in 15 Minutes, Day 39: Introduction to Python Web Scraping (Part 5)

Article link: https://blog.csdn.net/weixin_40780178/article/details/142711007

Day 39: Introduction to Python Web Scraping, Data Storage Overview

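When you scrape web pages, the data you collect has to be stored for later analysis and use. Common storage options include, but are not limited to: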

File storage (e.g. plain text, CSV, JSON)
Database storage (e.g. SQLite, MySQL, MongoDB)
In-memory storage (e.g. Python's built-in data structures)

Each option has its own strengths and weaknesses, and choosing a suitable one makes later data processing more efficient.

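1. File Storage

1.1 Text Files

A plain text file is the simplest way to store data and suits small data sets. Python's built-in file operations handle both writing and reading.

Example code: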

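# Write data to a text file (an explicit encoding keeps non-ASCII text portable)
data = "Hello, World!"
with open("output.txt", "w", encoding="utf-8") as file:
    file.write(data)

# Read the data back from the text file
with open("output.txt", "r", encoding="utf-8") as file:
    content = file.read()
print(content)  # Output: Hello, World!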
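
Scraped results are usually a list of items rather than a single string. As a minimal sketch, assuming a hypothetical list of page titles (`titles`) and a file in a temporary directory, the same built-in file operations can write one item per line and read the lines back:

```python
import os
import tempfile

# Hypothetical scraped values; a real scraper would produce these
titles = ["Python 教程", "Data storage basics", "Web scraping 101"]

# Use a temporary directory so the sketch does not clutter the working directory
path = os.path.join(tempfile.mkdtemp(), "titles.txt")

# One item per line; utf-8 keeps non-ASCII titles intact
with open(path, "w", encoding="utf-8") as file:
    for title in titles:
        file.write(title + "\n")

# Read the lines back and strip the trailing newline from each
with open(path, "r", encoding="utf-8") as file:
    loaded = [line.rstrip("\n") for line in file]

print(loaded == titles)  # True
```

This layout only works when an item never contains a newline itself; for such data CSV or JSON is the safer choice.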

1.2 CSV Files

CSV (Comma-Separated Values) files store tabular data and suit structured records. Python's csv module reads and writes them.

Example code:

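import csv

# Write data to a CSV file (newline="" lets the csv module control line endings)
data = [["name", "age"], ["Alice", 30], ["Bob", 25]]
with open("output.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerows(data)

# Read the data back; every field comes back as a string
with open("output.csv", "r", newline="", encoding="utf-8") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
# Output:
# ['name', 'age']
# ['Alice', '30']
# ['Bob', '25']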
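
Scraped records are often dictionaries rather than plain lists. As a hedged sketch, assuming hypothetical records with `name` and `age` keys, the csv module's `DictWriter` and `DictReader` map dictionary keys to columns. Because CSV stores everything as text, the numeric field is converted back by hand:

```python
import csv
import os
import tempfile

# Hypothetical scraped records, one dict per row
records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
path = os.path.join(tempfile.mkdtemp(), "users.csv")

# Write a header row followed by one row per record
with open(path, "w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(records)

# Read the rows back as dicts; CSV has no types, so convert age to int explicitly
with open(path, "r", newline="", encoding="utf-8") as file:
    loaded = [
        {"name": row["name"], "age": int(row["age"])}
        for row in csv.DictReader(file)
    ]

print(loaded == records)  # True
```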

1.3 JSON Files

JSON (JavaScript Object Notation) files suit nested data structures and are easy for people to read and write. Python's json module handles them.

Example code:

import json

# Write data to a JSON file
data = {
    "users": [
        {"name": "Alice", "age": 30},
        {"name": "Bob", "age": 25}
    ]
}
with open("output.json", "w", encoding="utf-8") as file:
    json.dump(data, file)

# Read the data back from the JSON file
with open("output.json", "r", encoding="utf-8") as file:
    content = json.load(file)
print(content)  # Output: {'users': [{'name': 'Alice', 'age': 30}, {'name': 'Bob', 'age': 25}]}
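
Scraped pages often contain non-ASCII text. By default the json module escapes such characters, which makes the file harder to read. The following sketch, using a made-up record, compares the default output with `ensure_ascii=False` and `indent=2`, two standard parameters of `json.dumps` and `json.dump`:

```python
import json

# Hypothetical record containing non-ASCII text
data = {"users": [{"name": "张伟", "age": 30}]}

default_text = json.dumps(data)
readable_text = json.dumps(data, ensure_ascii=False, indent=2)

print(default_text)   # non-ASCII characters appear as \uXXXX escapes
print(readable_text)  # characters are written as-is, with indentation

# Both forms decode to the same Python object
print(json.loads(default_text) == json.loads(readable_text) == data)  # True
```

When writing the readable form to a file, open the file with `encoding="utf-8"` so the characters are stored consistently across platforms.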

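2. Database Storage

For large data sets and efficient queries, a database is the better fit. Commonly used databases include SQLite, MySQL, and MongoDB.

2.1 SQLite

SQLite is a lightweight relational database that suits small applications. Python supports it out of the box through the sqlite3 module.

Example code: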

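import sqlite3

# Open a connection (the database file is created if it does not exist)
conn = sqlite3.connect('example.db')
c = conn.cursor()

# Create the table; IF NOT EXISTS lets the script run more than once
c.execute('''CREATE TABLE IF NOT EXISTS users (name text, age integer)''')

# Insert data with ? placeholders instead of building SQL strings by hand
c.execute("INSERT INTO users VALUES (?, ?)", ('Alice', 30))
c.execute("INSERT INTO users VALUES (?, ?)", ('Bob', 25))

# Commit and close the connection
conn.commit()
conn.close()

# Query the data
conn = sqlite3.connect('example.db')
c = conn.cursor()
for row in c.execute('SELECT * FROM users'):
    print(row)  # Output on a first run: ('Alice', 30), then ('Bob', 25)
conn.close()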
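
A scraper usually inserts many rows at once and later filters them. As a minimal sketch, assuming hypothetical rows and an in-memory database (`":memory:"`) so that nothing is written to disk, `executemany` inserts a batch with `?` placeholders and a parameterized `WHERE` clause queries it:

```python
import sqlite3

# Hypothetical scraped rows
rows = [("Alice", 30), ("Bob", 25), ("Carol", 41)]

# ":memory:" creates a temporary database that disappears when closed
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")

# Insert all rows in one call; values are bound to the ? placeholders
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
conn.commit()

# Parameterized query: names of users older than 28, youngest first
older = conn.execute(
    "SELECT name FROM users WHERE age > ? ORDER BY age", (28,)
).fetchall()
conn.close()

print(older)  # [('Alice',), ('Carol',)]
```

Placeholders matter for scraped data in particular, because page content can contain quotes or other characters that would break a hand-built SQL string.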

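2.2 MySQL

MySQL is a widely used relational database that suits large-scale applications. Install the mysql-connector-python package first (for example with pip install mysql-connector-python).

Example code: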

import mysql.connector

# Open a connection (replace the placeholder credentials with your own)
conn = mysql.connector.connect(
    host="localhost",
    user="yourusername",
    password="yourpassword",
    database="yourdatabase"
)
cursor = conn.cursor()

# Create the table; IF NOT EXISTS lets the script run more than once
cursor.execute("CREATE TABLE IF NOT EXISTS users (name VARCHAR(255), age INT)")

# Insert data with %s placeholders so values are escaped by the driver
cursor.execute("INSERT INTO users (name, age) VALUES (%s, %s)", ("Alice", 30))
cursor.execute("INSERT INTO users (name, age) VALUES (%s, %s)", ("Bob", 25))

# Commit the transaction and close the connection
conn.commit()
cursor.close()
conn.close()

# Query the data
conn = mysql.connector.connect(
    host="localhost",
    user="yourusername",
    password="yourpassword",
    database="yourdatabase"
)
cursor = conn.cursor()

cursor.execute("SELECT * FROM users")
for row in cursor.fetchall():
    print(row)  # Output on a first run: ('Alice', 30), then ('Bob', 25)

cursor.close()
conn.close()

2.3 MongoDB

MongoDB is a document database that suits unstructured data. The pymongo package provides the Python client.

Example code:

from pymongo import MongoClient

# Open a connection to a local MongoDB server
client = MongoClient('localhost', 27017)
db = client["testdb"]
collection = db["users"]

# Insert documents
collection.insert_one({"name": "Alice", "age": 30})
collection.insert_one({"name": "Bob", "age": 25})

# Query the documents
for user in collection.find():
    print(user)  # Output: {'_id': ..., 'name': 'Alice', 'age': 30}, then {'_id': ..., 'name': 'Bob', 'age': 25}

client.close()

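3. In-Memory Storage

In some cases the data can simply be kept in memory, which suits fast processing and temporary use. Python's built-in data structures (such as dicts and lists) are enough.

Example code: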

# Store data in a built-in Python data structure
data_storage = []

# Store records
data_storage.append({"name": "Alice", "age": 30})
data_storage.append({"name": "Bob", "age": 25})

# Read the records back
for item in data_storage:
    print(item)  # Output: {'name': 'Alice', 'age': 30}, then {'name': 'Bob', 'age': 25}
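
A list keeps every record, including duplicates. As a hedged sketch, assuming hypothetical records that carry a `url` field, a dict keyed by URL keeps one record per page and allows direct lookup:

```python
# Hypothetical scraped records; the second and third share a URL
scraped = [
    {"url": "https://example.com/a", "title": "Page A"},
    {"url": "https://example.com/b", "title": "Page B"},
    {"url": "https://example.com/b", "title": "Page B (again)"},
]

pages_by_url = {}
for record in scraped:
    # setdefault keeps the first record seen for each URL
    pages_by_url.setdefault(record["url"], record)

print(len(pages_by_url))  # 2
print(pages_by_url["https://example.com/b"]["title"])  # Page B
```

Whether the first or the latest record should win depends on the scraper; assigning `pages_by_url[record["url"]] = record` instead would keep the latest one.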

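4. Choosing a Suitable Storage Option

Consider the following points when choosing how to store data:

Data size: file storage is fine for small amounts of data; for large amounts, consider a database.
Query needs: if you need complex queries, a database is the better fit.
Data structure: for nested data, prefer JSON files or MongoDB.
Performance: in-memory storage gives the fastest reads, but the data is not persisted and is lost when the program exits.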

5. Data Storage Flowchart

The following simple flowchart shows the steps involved in storing data:

[Web scraper]
     |
     V
[Data extraction]
     |
     V
[Choose storage option]
     |
     +----- [File storage] --------+
     |                             |
     |                             |
     +----- [Database storage] ----+
     |                             |
     |                             |
     +----- [In-memory storage] ---+
     |
     V
[Store data]
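
The flow above can be sketched in code as an extraction step followed by a dispatch to the chosen storage function. This is only an illustration under stated assumptions: `extract`, `save_csv`, `save_json`, `SAVERS`, and `store` are hypothetical names, the data is made up, and only the two file backends are included to keep the sketch short.

```python
import csv
import json
import os
import tempfile


def extract():
    """Stand-in for the scraping and extraction steps (hypothetical data)."""
    return [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]


def save_csv(records, path):
    """Write records as CSV, using the keys of the first record as columns."""
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)


def save_json(records, path):
    """Write records as a JSON array."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(records, file, ensure_ascii=False)


# Map a storage choice to the function that implements it
SAVERS = {"csv": save_csv, "json": save_json}


def store(records, kind, path):
    """Store records with the chosen backend; unknown kinds raise ValueError."""
    if kind not in SAVERS:
        raise ValueError(f"unsupported storage kind: {kind}")
    SAVERS[kind](records, path)
    return path


out_dir = tempfile.mkdtemp()
records = extract()
json_path = store(records, "json", os.path.join(out_dir, "users.json"))
csv_path = store(records, "csv", os.path.join(out_dir, "users.csv"))

# Confirm the JSON file round-trips to the original records
with open(json_path, encoding="utf-8") as file:
    print(json.load(file) == records)  # True
```

A database backend could be added in the same way by registering another function in `SAVERS`, which keeps the scraping code independent of where the data ends up.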